
Chapter 4 Sample Size


Your professional performance begins with good planning and sample. A survey without proper planning and sampling is the introduction of intentional error into research analysis and a recipe for inefficient sample estimates and a disastrous policy.
-Abdi-Khalil Edriss-

SAMPLE SIZE DETERMINATIONS


The popular notion that the bigger the sample, the better the estimate is not necessarily true.
What is important is a good representative sample of the population under investigation,
and a sample size which is large enough to reflect important variations in the population.

Furthermore, two important questions in the design of any sample survey are the total
cost of the survey and the precision of the main estimates. Both are related to the size of
the sample, given the variability of the data, the type of sampling and the method of
estimation. Keep in mind that the ultimate goal is for the survey to be designed to provide
estimates with minimum sampling error (that is, with maximum precision) when the total
cost is fixed; a sample size that fulfills these conditions is called the optimal sample size.

I - Size of a Sample and a Population

4.1. What is a good representative sample of a population?


A good representative sample of a population is a sample that is drawn from the
population by using a random sampling technique. Note that a simple random sampling
method selects elements from a population by giving every element an equal chance of
being selected; hence, randomization protects against selection bias and is what makes
a sample representative.

4.2. What is a large enough sample?


A sample is conventionally regarded as large when it contains at least 30 randomly
selected elements or items, the rule of thumb associated with the law of large numbers
and the central limit theorem. With a sample of this size, the sampling distribution of
the sample mean is approximately normal; however, depending on the type of research
and other factors, the actual sample size should be determined using one of the
formulae discussed below.

4.3. How do we determine size of a sample?


The formulas for calculating sample size depend on the objective of the survey,
and they fall into two categories. The objective of the survey might be

o To measure a single proportion or quantity
o To demonstrate whether a significant difference exists between two
proportions or quantities

For sample size estimation, we must initially obtain or estimate some basic figures. If the
sample size is too small the desired precision might not be achieved; if the sample size is
too large unnecessary costs may be incurred. Here are some optimal sample size
estimation methods.

II - Sample Size with Single Proportion – Formula A


4.4. How do we determine sample size with only single proportion given?

To calculate the sample size, n, needed using an estimated population proportion, p,
evaluate

$$n = \frac{z^2\,p\,(1-p)}{e^2}$$

which, at the 95% confidence level (z = 1.96), becomes

$$n = \frac{1.96^2\,p\,(1-p)}{e^2}$$

where z is the z-value yielding the desired degree of confidence, p is an estimate of the
population proportion, and e is the absolute size of the error in estimating p that the
researcher is willing to permit.
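
Formula A can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sample_size_single_proportion` and the input values are assumptions made for this example, not part of the original text.

```python
def sample_size_single_proportion(p, e, z=1.96):
    """Formula A: n = z^2 * p * (1 - p) / e^2, before any adjustment.

    p : estimated population proportion (strictly between 0 and 1)
    e : absolute margin of error the researcher will permit
    z : z-value for the desired confidence level (1.96 for 95%)
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if e <= 0:
        raise ValueError("e must be positive")
    return z ** 2 * p * (1 - p) / e ** 2


# Illustrative inputs (hypothetical): the same margin of error, two values of p.
n_at_half = sample_size_single_proportion(p=0.5, e=0.05)
n_at_fifth = sample_size_single_proportion(p=0.2, e=0.05)
print(round(n_at_half, 1), round(n_at_fifth, 1))
```

The raw value is returned unrounded so that the caller can decide how to round and which adjustments (non-response, design effect) to apply afterwards.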

POINTS TO PONDER
If no previous estimate of p is available, using p = 0.5 in the sample size formula will
yield the maximum size n required for any given e and z. Any other value of p will yield
a smaller n. Note that the value of p(1-p) increases as p approaches 0.5 and reaches its
maximum of 0.25 there; therefore a safer (more conservative) estimate of n is obtained
with a value of p nearer to 0.5.

NUMERICAL EXAMPLE
Economics & Business: To calculate a good representative sample size for the
micro-enterprise survey, the study focused on micro-enterprises supported by five special
rural banking groups, which comprise about 92% (p=0.92) of the total micro-enterprises
supported by both rural banking groups and mainstream commercial banks in the area.
Hence, for a 95% (Z=1.96) level of confidence, within a 5% (e=0.05) margin of error, and
taking into account the proportion of micro-enterprises supported by rural banking groups
only, the sample size, n, was obtained as follows:

$$n = \frac{(1.96)^2\,(1-0.92)(0.92)}{(0.05)^2} \approx 113$$

and adding 5% for a possibility of non-respondents, the sample size is 119
(113 + 113 x 0.05 ≈ 119) businesswomen.

Suppose we plan to collect data on households that had access to rural credit and on
those that had no access to rural credit. How do we determine a representative and
adequate sample that does not introduce bias and inefficiency into the sample
characteristics or estimates?

For example, a data set may be stratified into two groups: those who benefit from a
particular project and those who do not; those who adopt a technology and those who do
not; those who have access to credit and those who do not, etc. If the intention is to
compare two groups or strata, the sample sizes of the two groups should be equal, or at
least no more unbalanced than a 60:40 ratio (beneficiaries versus non-beneficiaries).

NUMERICAL EXAMPLE
Health: Estimate the sample size required to determine the prevalence of iron deficiency
in women (Data: Courtesy of World Vision International – Malawi).

Step 1
Guess/anticipate the proportion we are about to measure. If we expect to find 45
anemic pregnant women out of every 100 pregnant women, our anticipated
proportion will be 0.45. The usual margin of error on estimates is set at ± 5%.
However, for sub-national estimates we may be satisfied with ±10%.

Calculations
p = previously known prevalence = 0.45
e = margin of error = ± 5% (0.05). For a confidence level of 95%, Z=1.96 (2-tailed test)

Therefore,

$$\text{Sample size, } n = \frac{(1.96)^2\,(1-0.45)(0.45)}{(0.05)^2} \approx 380.3$$

Step 2
Inflate by 10% to account for non-responders, that is,

380 + (380 x 0.10) = 418

When performing a household survey, it is unlikely that there will always be
someone at home. All efforts should be made to return to the vacant households,
as they may include under-represented individuals, such as single mothers.

Step 3
It is very difficult to predict what the design effect will be before carrying out the
study. After the survey is finished, the design effect can be calculated more
accurately and can be used to calculate the actual margin of error.

A design effect of 2 is usually used for most variables (unless the literature
suggests otherwise). Hence,

2 x 418 = 836

Therefore, to estimate within a ± 5% margin of error, and to be 95% sure, 836
pregnant women should be screened for the iron deficiency indicator.

Step 4
Now, how do we estimate the number of households we must visit? To estimate
the number of households that are required for the survey, we must first know the
average household size (6 persons per household is the national estimate for
Malawi) and the proportion of the target group (here, pregnant women) in the
population.

If
o The average household size is 6 persons
o 5% of the population is made up of pregnant women
o 836 women should be screened

Then, we should visit

836 / (6 x 0.05) ≈ 2787 households
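
Steps 2 to 4 above can be chained together as in the following sketch. The function names and default values are assumptions chosen for illustration; the inputs are those of the worked example (a base sample of 380, 10% non-response, a design effect of 2, 6 persons per household and a 5% target-group share).

```python
def adjust_sample_size(n_base, nonresponse=0.10, design_effect=2.0):
    """Inflate a base sample size for non-response, then apply the design effect."""
    inflated = n_base * (1 + nonresponse)   # Step 2: allow for non-responders
    return inflated * design_effect         # Step 3: multiply by the design effect


def households_to_visit(n_people, household_size, target_share):
    """Step 4: households needed to find n_people members of the target group."""
    return n_people / (household_size * target_share)


n_women = round(adjust_sample_size(380))                 # 380 -> 418 -> 836
n_households = round(households_to_visit(n_women, 6, 0.05))
print(n_women, n_households)
```

Keeping the two adjustments in separate functions makes it easy to rerun the calculation with a different non-response rate or design effect once these are known from the literature or a pilot survey.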

POINTS TO PONDER
We can reduce the sample size by lowering the level of confidence to 90%, for which the
Z-score is 1.645. This tells us that the lower the confidence level, the smaller the sample
size needed for the actual survey.

III - Sample size with Double Proportion – Formula B

4.5. How do we determine sample size when two proportions are given for a single population estimate?

The following formula deals with sample size determination when two proportions are
given on the same indicator or variable:

$$n = \frac{\left[\,v\sqrt{2p_0(1-p_0)} + u\sqrt{p_1(1-p_1) + p_2(1-p_2)}\,\right]^2}{d^2}$$

where $p_0 = \frac{p_1 + p_2}{2}$, $d = p_1 - p_2$ is the difference to be detected, u is the
one-tailed Z-value of a normal distribution corresponding to a power (see footnote 1) of
80%, and v is the one-tailed Z-value of a normal distribution corresponding to a 95%
confidence level (see footnote 2).
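
As a rough sketch, Formula B can be evaluated as follows, assuming (as in the usual form of this formula) that both variance terms p1(1-p1) and p2(1-p2) sit under the second square root. The function name and the input proportions are hypothetical and chosen only for illustration.

```python
import math


def sample_size_two_proportions(p1, p2, u=0.842, v=1.645):
    """Formula B: sample size needed to detect a change from p1 to p2.

    u : one-tailed z-value for the desired power (0.842 for 80%)
    v : one-tailed z-value for the confidence level (1.645 for 95%)
    """
    d = p1 - p2
    if d == 0:
        raise ValueError("p1 and p2 must differ")
    p0 = (p1 + p2) / 2                      # average of the two proportions
    root_null = math.sqrt(2 * p0 * (1 - p0))
    root_alt = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (v * root_null + u * root_alt) ** 2 / d ** 2


# Illustrative inputs (hypothetical): a fall in prevalence from 30% to 20%.
n_raw = sample_size_two_proportions(p1=0.30, p2=0.20)
print(math.ceil(n_raw))
```

The result is the base sample size only; non-response and design-effect adjustments are applied afterwards, exactly as for Formula A.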

NUMERICAL EXAMPLE
Economics & Health: Estimate the sample size required to determine whether the
prevalence of anemia in pregnant women has decreased within the previous 12-month
intervention period.

Step 1
Calculations
Previously, 45% of pregnant women were iron deficient. After 12 months of
intervention, the expected decrease in prevalence of anemia would be 10%
resulting in a predicted new prevalence of 35%. Calculate the sample size
required to demonstrate the difference between the proportions.

The estimated prevalence, p1 = 0.45


The expected new prevalence, p2 = 0.35

1 The probability of making a Type II error, denoted by β, is the probability of accepting a false null
hypothesis. The complement (1 - β) of the probability of Type II error measures the probability of rejecting
the false null hypothesis, and it is known as the power of a statistical test.
2 Similarly, the probability of making a Type I error, denoted by α, is the probability of rejecting a true null
hypothesis, and is referred to as the level of significance. The complement (1 - α) of the probability of Type I
error measures the probability of not rejecting a true null hypothesis, and is known as the confidence level.

The difference, d = 0.45 - 0.35 = 0.1

Accept a Power = 80%

u = 0.842 (one-tailed) Z-value of a normal distribution corresponding to a power of 80%

v = 1.645 (one-tailed) Z-value of a normal distribution corresponding to a 95%
confidence level

Now, $p_0 = \frac{p_1 + p_2}{2} = \frac{0.45 + 0.35}{2} = 0.4$, and then the sample size is

$$n = \frac{\left[\,v\sqrt{2p_0(1-p_0)} + u\sqrt{p_1(1-p_1) + p_2(1-p_2)}\,\right]^2}{d^2}$$

$$= \frac{\left[\,1.645\sqrt{2(0.4)(1-0.4)} + 0.842\sqrt{0.45(1-0.45) + 0.35(1-0.35)}\,\right]^2}{0.1^2}$$

= [1.139689 + 0.580308]^2 / 0.01 = 1.719997^2 / 0.01
= 2.958 / 0.01 ≈ 296

Step 2
Inflate by 10% to account for non-respondents, that is,

296 + (296 x 0.1) = 325.6

Step 3
Choose an appropriate design effect

Again, use a design effect of 2 for the anemia variable to adjust the sample size as
follows.

2 x 325.6 = 651.2

Therefore, to detect a true difference of 10 percentage points (i.e., a reduction from
45% to 35%) with a confidence level of 95% and a power of 80%, a survey would
require about 651 pregnant women.

Step 4
Estimate the number of households that must be visited

n = 651 / (6 x 0.05) = 2170

Note that these calculations must be repeated for each of the indicators being measured
in the survey. Otherwise, we should pick an indicator with higher sample size so that it
can take care of the other indicators.

NUMERICAL EXAMPLE
Economics & Agribusiness: Since the late 1990s, the various Malawi Social Action Fund
(MASAF) projects have been implemented in 27 districts of Malawi. Using a two-stage
sampling method, MASAF enumeration areas (EAs) within a given district, and households
within the selected MASAF EAs, will be randomly chosen to obtain a representative sample
of households for the survey.

The sample (or the number of households sampled, P) per EA will be determined using
the following formulas.

$$P_{1i} = \frac{a\,M_i}{\sum_i M_i} \times \frac{c}{a}$$

$$P_{2i} = \frac{b_i}{L_i}$$

Where
a is the number of EAs to be selected in each district,
c is the number of EAs selected in each district sample in the 2004
Malawi Demographic and Health Survey (MDHS),
M_i is the number of households in the ith EA in each district according to the 1998
population census,
ΣM_i is the total number of households in each district according to the 1998
population census,
b_i is the number of households selected in each EA, and
L_i is the total number of households listed in the selected ith EA during the 2004
MDHS listing operation.

Before the final household selection, a complete household listing operation would be
completed for each selected EA, if a listing is not readily available. However, if household
listings are available from NSO, the selected households will be checked against them and,
where matching is possible, used to create a panel of households. This will help to
effectively evaluate the impacts of MASAF 3 APL 1.

To gauge impacts, sample sizes (MASAF 3 and non-MASAF 3 areas/villages, or
beneficiary and non-beneficiary households) will be calculated using a 95% confidence
level, a power of 80%, an error margin of 10%, and a 10% expected change for
most indicators, with a design effect of 2. Equally, the counterfactual data can be created
either from MDHS 2000 (provided the data are available from the National Statistical
Office (NSO)) or from non-MASAF 3 areas for comparisons.

NUMERICAL EXAMPLE
Economics: Estimate the sample size required to determine whether 'poor households
receiving a daily transfer of US$0.3' has reduced the number of people living on less than
US$1 per day since the inception of MASAF 3, that is, from 2003 to 2007.

Step 1

Calculations
Previously, 55% of the Malawi population lived below US$1 per day. After 3-4
years of intervention, the expected decrease in the proportion of poor households
would be 10 percentage points, resulting in a predicted new poverty rate of 45%.
Calculate the sample size required to demonstrate the difference between the
proportions.

The estimated prevalence, p1 = 0.55
The expected new prevalence, p2 = 0.45
The difference, d = 0.55 - 0.45 = 0.1
Accept a Power = 80%

u = 0.842 (one-tailed) Z-value of a normal distribution corresponding to a power of 80%

v = 1.645 (one-tailed) Z-value of a normal distribution corresponding to a 95%
confidence level

Now, $p_0 = \frac{p_1 + p_2}{2} = \frac{0.55 + 0.45}{2} = 0.5$, and then the sample size is

$$n = \frac{\left[\,v\sqrt{2p_0(1-p_0)} + u\sqrt{p_1(1-p_1) + p_2(1-p_2)}\,\right]^2}{d^2}$$

$$= \frac{\left[\,1.645\sqrt{2(0.5)(1-0.5)} + 0.842\sqrt{0.55(1-0.55) + 0.45(1-0.45)}\,\right]^2}{0.1^2}$$

= [1.163191 + 0.592399]^2 / 0.01 = 3.082 / 0.01 ≈ 308.2

Step 2
Inflate by 10% to account for non-respondents (sub-national level), that is,

308 + (308 x 0.1) = 338.8

Step 3
Choose an appropriate design effect; again, a design effect of 2 is used for most
indicators/variables

2 x 338.8 = 677.6

Therefore, to detect a true difference of 10 percentage points (i.e., a reduction from
55% to 45%) with a confidence level of 95%, a survey would require, on average, 678
households per district (see footnote 3).

Step 4
Taking into account the proportion of MASAF 3 beneficiaries throughout the 27
districts, the estimated number of households that must be visited for the survey in
the 27 districts is

n = 678 x 27 x 0.57 ≈ 10,434

Note that these calculations must be repeated for each of the indicators being
measured in the survey.

Note that due to the diversity of the indicators to be studied in the survey, sample
size will vary for different indicators. Therefore, the largest sample size required
by any variable will be taken as the sample size for all other variables. Although this
will increase the number of households for other indicators, it is statistically better
to have a larger representative sample than a reduced one, and it is also easier
to administer the whole questionnaire at every selected household during the
survey.

IMPORTANT NOTE - Given the number of beneficiaries from MASAF in the 27 districts,
which was 6,841,055, this sample size appears appropriate for analysing the impacts
of the projects using panel data, as recommended by the World Bank and the MASAF
team. It is a sound sample size, as it is close to the sample size (n=14,000; see
footnote 4) used in MDHS 2000, whose households this survey intends to match in
order to create the panel data, which will enable us to obtain the real impact of MASAF.

3 Probability-proportional-to-size sampling is applied among the districts because the population size varies.
4 Refer to the MDHS 2000 sample design technique, Appendix A, page 197.

IV - Sample Size with Variance – Formula C

4.6. How do we determine sample size when only variance or standard deviation is available?

To calculate the sample size n needed using a population variance, σ², evaluate

$$n = \frac{(1.96)^2\,\sigma^2}{e^2}$$

The value of σ frequently must be estimated; if it is not known, use the sample variance.
A rule of thumb for estimating σ when no similar studies have been done is that σ is
approximately equal to 1/6 of the difference between the highest and lowest value in the
population.
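
A minimal sketch of Formula C, together with the range rule of thumb, is given below. The function names and the input figures are illustrative assumptions, not taken from the text.

```python
def sigma_from_range(lowest, highest):
    """Rule of thumb: sigma is roughly one sixth of the range."""
    return (highest - lowest) / 6


def sample_size_mean(sigma, e, z=1.96):
    """Formula C: n = z^2 * sigma^2 / e^2.

    sigma and e must be expressed in the same units.
    """
    if e <= 0:
        raise ValueError("e must be positive")
    return (z * sigma / e) ** 2


# Illustrative inputs (hypothetical): values ranging from 10 to 70 units,
# with the mean to be estimated to within 2 units at 95% confidence.
sigma_guess = sigma_from_range(10, 70)          # (70 - 10) / 6 = 10
n_raw = sample_size_mean(sigma_guess, e=2)
print(sigma_guess, round(n_raw, 2))
```

Note that the formula only makes sense when the standard deviation and the permitted error are measured on the same scale.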

NUMERICAL EXAMPLE
Economics & Business: A small agricultural credit institution wanted to estimate the
sample size to draw from its 1,200,000 clients. It has information that the standard
deviation of the sample mean is 5 Malawi Kwacha (MK), or that the maximum loan given
was MK2000 and the minimum was MK500. With an allowable error of 10% (e = 0.1),
estimate the sample size, n, that is required at the moment.

Solution

$$n = \frac{(1.96)^2\,\sigma^2}{e^2} = \frac{(1.96)^2\,(5)^2}{(0.1)^2} = 9604 \text{ clients}$$

NUMERICAL EXAMPLE
Economics: Past experience indicates that the standard deviation of the amount of maize
consumed per month by households in a certain village is 50 bags. How large a sample
must be taken for the estimate of the true mean consumption to have a 95% probability
of being within 5 bags of maize of the true mean?

Solution

$$n = \frac{(1.96)^2\,(50)^2}{(5)^2} \approx 384 \text{ households}$$

NUMERICAL EXAMPLE
Life Sciences: A nutritionist is interested in the effectiveness of fortified food in the
villages, and a pre-study record showed a monthly consumption of 2 kg among
children in a household throughout the country. How large a sample must be taken for the
estimate of the true mean consumption to have a 99% probability of being within
250 grams of fortified food of the true mean?

Solution

$$n = \frac{(1.96)^2\,\sigma^2}{e^2} = \frac{(1.96)^2\,(0.25)^2}{(0.01)^2} = 2401 \text{ households}$$

V - Sample size with proportions and population size – Formula D


So far, we have seen that a sample size was determined without considering the
population size; however, for more precision on sample size calculation, it is important to
include a population size when it is available.

4.7. When proportions and population size are known, how do you
determine a sample size?

The following formula can be employed to calculate the sample size (Kothari, 2004)
when the population size and the population proportion of major interest are available.

$$n = \frac{z^2\,p\,q\,N}{e^2\,(N-1) + z^2\,p\,q}$$

Where: n = sample size, p = proportion of the population containing the major
interest, q = 1-p, z = number of standard deviations at a given confidence level
(α = 0.1), e = acceptable error (precision) and N is the population size

NUMERICAL EXAMPLE
Economics & Agribusiness: The above formula was employed to come up with an
appropriate sample size of Cotton Smallholder farmers in Malawi (2009).

In the Malawi case, the number of cotton farmers was N = 156,023, with p = 0.5, q = 0.5,
Z = 1.65 at α = 0.1, and e = 0.08; therefore,

$$n = \frac{1.65^2\,(0.5)(0.5)\,156023}{0.08^2\,(156023 - 1) + 1.65^2\,(0.5)(0.5)} \approx 106.3$$

This results in a sample of 106 cotton farmers. An additional 10% was added to cater for
non-response and spoilt questionnaires. Thus, a total of 117 cotton farmers were
randomly sampled for the interview.
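
The cotton-farmer calculation can be reproduced with the short sketch below. The function name `sample_size_finite_population` is an assumption made for illustration; the inputs are those stated in the example above.

```python
def sample_size_finite_population(N, p, e, z):
    """Formula D: n = z^2 p q N / (e^2 (N - 1) + z^2 p q), with q = 1 - p."""
    if N < 2:
        raise ValueError("N must be at least 2")
    q = 1 - p
    return (z ** 2 * p * q * N) / (e ** 2 * (N - 1) + z ** 2 * p * q)


# Inputs from the cotton smallholder example: N = 156,023, p = 0.5, e = 0.08, z = 1.65.
n_raw = sample_size_finite_population(N=156023, p=0.5, e=0.08, z=1.65)
n_with_buffer = round(round(n_raw) * 1.10)   # add 10% for non-response
print(round(n_raw, 1), n_with_buffer)
```

For a population this large the finite-population term changes the answer very little; the correction matters mainly when N is small relative to the uncorrected sample size.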

VI - Sample size for Different Strata – Formula E
Given a total sample size n, its allocation or distribution to the different strata or groups
would be based on the following principles: (1) a specified total cost of surveying the
sample, (2) with minimum sampling variance of the strata and/or, (3) proportional
allocation (or PPS).

4.8. How do you determine sample size for a stratum or a population divided in different groups?

(1) Allocation under cost: a simple cost function in stratified sampling is

$$C = c_0 + \sum_{i=1}^{m} n_i c_i$$

Where c0 is the overhead cost, and ci is the average cost of taking a sample unit in
the ith stratum, which may vary from stratum to stratum depending on field
conditions (rough roads, distance, mountains, crossing valleys and rivers, weather,
etc.)

If the cost per unit ci is assumed to be the same, c, in all the strata, then the
previous function becomes

$$C = c_0 + nc$$

Therefore, given the total cost, the total sample size can be determined as

$$n = (C - c_0)/c$$

(2) With minimum sampling variance of the strata

For continuous data, if $s_{ss}^2$ is the desired variance of a sample estimator, y, then
the required sample size n for a specified variance becomes

$$n = \frac{\left(\sum_{i=1}^{m} N_i s_i\right)^2}{s_{ss}^2}$$

Also, the sample size for each stratum or group, at a fixed cost, is

$$n_i = \frac{n\,N_i\,\sigma_i}{\sum_i N_i\,\sigma_i}$$

This optimum allocation is known as the Neyman allocation.

Note that the population standard deviation σi will usually not be known, and
estimates such as si (sample standard deviation estimates) would have to be obtained
from a previous study or a reconnaissance/pilot survey relating to the desired variable.
However, if such information on the sample standard deviation is lacking, then the
alternative is to use the range of the variable to approximate the sample standard
deviation and use it in the formula.

(3) Proportional allocation: if $s_{ss}^2$ is the desired stratified variance of the
proportion p, then the required sample size n for the Neyman allocation is

$$n = \frac{\left(\sum_{i=1}^{m} N_i \sqrt{p_i q_i}\right)^2}{N^2\,s_{ss}^2}$$

And for proportional allocation, the sample size is given as

$$n = \frac{\sum_{i=1}^{m} N_i\,p_i q_i}{N\,s_{ss}^2}$$

Where pi is the proportion in the ith stratum, and qi = (1-pi)

Now, in the absence of any information on the stratum sample variance, $s_i^2$, for each
stratum, and if it can be assumed to be the same in all the strata, the Neyman allocation
takes the simple form

$$n_i = \frac{n\,N_i}{N}$$

Where ni is proportional to Ni; that is, the sample is allocated to the different strata in
proportion to the number of sub-population units Ni.

NUMERICAL EXAMPLE
Economics and Business: The data on the number of women who received rural credit
from a certain credit institution in certain villages of Malawi for the years 2005 and 2010
are given in Table 4.1 for 5 strata defined by the total amount of loan they had received,
along with the present number of households in the villages.

Using the data in Table 4.1, determine the allocations of the sample to the different strata
according to the following principles: (i) Neyman allocation, (ii) proportional allocation,
and (iii) allocation proportional to the total number of women in the different strata.

Table 4.1: Data on loan for 5 strata in 2005 and 2010

| Stratum i | Loan received (MK) | Total number of households in project area 2005, N_i | Average number of women who received loan, x_i | Sample standard deviation, s_i | Total number of households in same project area 2010, N_i'' |
|---|---|---|---|---|---|
| 1 | 0 – 5000 | 550 | 30.3 | 3.4 | 552 |
| 2 | 5001 – 10000 | 430 | 20.5 | 2.2 | 450 |
| 3 | 10001 – 15000 | 300 | 32.4 | 4.7 | 396 |
| 4 | 15001 – 20000 | 165 | 19.6 | 3.8 | 165 |
| 5 | Above 20000 | 92 | 11.5 | 2.3 | 95 |
| All strata combined | | N = 1537 | | | N'' = 1658 |

Now, after calculating the sample size, n=380, the allocation of the sample of 380
households to the different strata is computed as shown in Table 4.2.

Table 4.2: Computation of allocation for different strata

| Stratum i | N_i s_i | X_i = N_i x_i | Neyman allocation, n_i = 380 N_i s_i / Σ N_i s_i | Proportional allocation, n_i = 380 N_i'' / N'' | Allocation proportional to X_i, n_i = 380 X_i / Σ X_i |
|---|---|---|---|---|---|
| 1 | 1870 | 16665 | 140.3 | 126.5 | 160.4 |
| 2 | 946 | 8815 | 71.0 | 103.1 | 84.8 |
| 3 | 1410 | 9720 | 105.8 | 90.8 | 93.5 |
| 4 | 627 | 3234 | 47.0 | 37.8 | 31.1 |
| 5 | 211.6 | 1058 | 15.9 | 21.8 | 10.2 |
| All strata combined | Σ N_i s_i = 5064.6 | Σ X_i = 39492 | n = 380 | n = 380 | n = 380 |
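
The three allocations in Table 4.2 all share one pattern: the total sample is split in proportion to a stratum weight. The sketch below, in which the helper name `allocate` and the variable names are assumptions made for illustration, recomputes them from the Table 4.1 data.

```python
def allocate(n, weights):
    """Split a total sample n across strata in proportion to the given weights."""
    total = sum(weights)
    return [n * w / total for w in weights]


# Data from Table 4.1
N_2005 = [550, 430, 300, 165, 92]        # households per stratum, 2005 (N_i)
s = [3.4, 2.2, 4.7, 3.8, 2.3]            # sample standard deviations (s_i)
x_bar = [30.3, 20.5, 32.4, 19.6, 11.5]   # average number of women with loans (x_i)
N_2010 = [552, 450, 396, 165, 95]        # households per stratum, 2010 (N_i'')

n = 380
neyman = allocate(n, [Ni * si for Ni, si in zip(N_2005, s)])         # weights N_i * s_i
proportional = allocate(n, N_2010)                                   # weights N_i''
by_total = allocate(n, [Ni * xi for Ni, xi in zip(N_2005, x_bar)])   # weights X_i

for i, row in enumerate(zip(neyman, proportional, by_total), start=1):
    print(i, [round(value, 1) for value in row])
```

In practice the fractional allocations would be rounded to whole households, taking care that the rounded figures still add up to n.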

VII - Sample size and a model – Formula F
A bigger controversy arises when determining an adequate sample size to run a regression
model (refer to Chapter 7 for details on regression models). For example, if one considers
that n=40 is sufficient to run ordinary statistical tests, does it mean this sample size
is adequate to run a model that has 5 or more independent variables? The answer is NO.

4.9. How do we reconcile this difference?


We have to reconcile the sample sizes that are determined using the formulae with
the minimum sample size required to run a model with a given number of variables.
What most statisticians and econometricians suggest is to take into consideration
the number of variables in the model when determining the sample required. Thus, use

n = 10 times the number of relevant independent variables in a given model =
10K, where K is the number of relevant independent variables included in the
model, as long as 10K is greater than n=30 (the magic number in parametric
statistics!)
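
One way to apply this rule in practice is sketched below. Taking the larger of the formula-based size and the 10K requirement is an assumption about how the two are reconciled, and the function names are hypothetical.

```python
import math


def minimum_model_sample_size(k, floor=30):
    """Rule of thumb: 10 observations per independent variable, never below 30."""
    return max(10 * k, floor)


def reconciled_sample_size(n_formula, k):
    """Assumed reconciliation: keep whichever requirement is larger."""
    return max(math.ceil(n_formula), minimum_model_sample_size(k))


print(minimum_model_sample_size(5))        # 10 * 5 = 50
print(reconciled_sample_size(40, k=5))     # the model requirement dominates here
print(reconciled_sample_size(380.3, k=5))  # the formula-based size dominates here
```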

So far we have discussed sample size in the context of precision and confidence with
respect to one variable only. In research, however, the theoretical framework has several
variables of interest, and the question arises how one should come up with a sample size
when all the factors are taken into account. Krejcie and Morgan (1970) greatly simplified
the sample size decision by providing a table that ensures a good decision model. Table
4.3 provides the generalized scientific guideline for sample size decisions. The interested
student is advised to read Krejcie and Morgan (1970), as well as Cohen (1969), for
decisions on sample size (see footnote 5).

VIII - Sample size determination without a Formula – Formula G

Furthermore, Roscoe (1975) proposes the following rules of thumb for determining
sample size:
o Sample sizes larger than 30 and less than 500 are appropriate for most research.
o Where samples are to be broken into sub-samples (male-headed/female-headed
household, urban/rural area, etc.), a minimum sample size of 30 for each category is
necessary.

5 As a precaution, note that this table suggests that a specific value of sample size, n, is always appropriate
for a given population size, N, ignoring some of the statistical parameters used in model estimation. Hence,
the suggested sample sizes should be used with caution for simple surveys and statistical parameter
estimates.

Table 4.3: Sample Size (n) for a given Population Size (N)

N n N n N n
10 10 220 140 1200 291
15 14 230 144 1300 297
20 19 240 148 1400 302
25 24 250 152 1500 306
30 28 260 155 1600 310
35 32 270 159 1700 313
40 36 280 162 1800 317
45 40 290 165 1900 320
50 44 300 169 2000 322
55 48 320 175 2200 327
60 52 340 181 2400 331
65 56 360 186 2600 335
70 59 380 191 2800 338
75 63 400 196 3000 341
80 66 420 201 3500 346
85 70 440 205 4000 351
90 73 460 210 4500 354
95 76 480 214 5000 357
100 80 500 217 6000 361
110 86 550 226 7000 364
120 92 600 234 8000 367
130 97 650 242 9000 368
140 103 700 248 10000 370
150 108 750 254 15000 375
160 113 800 260 20000 377
170 118 850 265 30000 379
180 123 900 269 40000 380
190 127 950 274 50000 381
200 132 1000 278 75000 382
210 136 1100 285 100000 384

o In multivariate research (including multiple regression analysis), the sample size
should be several times (preferably 10 times or more) as large as the number of
variables in the study.
o For simple experimental research with tight experimental controls (matched pairs,
etc.), successful research is possible with samples as small as 10 to 20 in size.

POINTS TO PONDER
In sum, the sample size, n, is a function of: (1) the variability in the population, (2) the precision or
accuracy needed, (3) the confidence level desired, (4) the type of sampling plan used (for example,
simple random sampling versus stratified random sampling), and (5) the number of
independent variables in a model. Note that these are not considered in Table 4.3, and hence an
appropriate sample size should be estimated when conducting a survey using the various
formulae given previously.

IX - Determination of a Population size – Formula H

In practice, it is usually difficult to know the size of a population, say the population of a
country, a wildlife population, or the number of people at a sporting event.

4.10. How do we determine the size of a population?


It is possible to estimate a population size by using one of the four methods
available in statistics; namely, direct sampling, inverse sampling, population density
from a quadrat, or population density from a stocked population.

Method 1 – Direct Sampling


The first method is called direct sampling. This procedure entails drawing a
random sample of size t from a wildlife population of interest, tagging each
animal sampled, and returning the tagged animals to the population. At a later
date another random sample, of fixed size n, is drawn from the same
population, and the number of tagged animals in it, s, is observed. If N represents
the total population size, t represents the number of animals tagged in the
initial sample, and p represents the proportion of tagged animals in the second
sample, then N=t/p, where p = s/n. Note that N is an estimate, not the actual
population figure. Using this information, the following formulas can be applied.

Important Formulas

The proportion of tagged individuals in the sample is

$$p_{\text{estimated}} = \frac{s}{n}$$

An estimate of population size, N, is given by

$$N_{\text{estimated}} = \frac{t}{p_{\text{estimated}}} = \frac{nt}{s}$$

Estimated variance of Nestimated is given by

$$V_{\text{estimated}} = \frac{t^2\,n\,(n-s)}{s^3}$$

Bound on the error of estimation is given by

$$2\sqrt{V_{\text{estimated}}} = 2\sqrt{\frac{t^2\,n\,(n-s)}{s^3}}$$

NUMERICAL EXAMPLE
Economics & Business: Suppose an officer from Wildlife Malawi is concerned about the
apparent decline in the number of mountain antelopes in Nyka park. Estimates of the
population size are available from previous years. For determination of whether or not
there has been a decline, first a random sample of 100 antelopes is caught (t=100), tagged
and then released. A month later a second sample of 50 is taken (n=50), and twenty
antelopes are recaptured in the second sample (s=20). Estimate the population size, N.
(Assume that tagging does not affect the likelihood of recapture).

Solution
Using the equations given in Method 1, we have

$$N_{\text{estimated}} = \frac{nt}{s} = \frac{50(100)}{20} = 250$$

And a bound on the error of estimation is given by

$$2\sqrt{V_{\text{estimated}}} = 2\sqrt{\frac{t^2\,n\,(n-s)}{s^3}} = 2\sqrt{\frac{100^2\,(50)(50-20)}{20^3}} \approx 86.6$$

Thus, the officer estimates that the total number of mountain antelopes is 250, with a
bound on the error of estimation of approximately 87 mountain antelopes. Note that we
might be concerned about the high bound on the error. This could have been improved if
we had a larger sample size.
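
The direct-sampling calculation can be checked with the sketch below. The function name `direct_sampling_estimate` is an assumption made for illustration; the inputs are those of the antelope example.

```python
import math


def direct_sampling_estimate(t, n, s):
    """Direct (tag-recapture) sampling with a fixed second sample size n.

    t : number of animals tagged in the first sample
    n : size of the second sample
    s : number of tagged animals found in the second sample
    Returns the population estimate, its estimated variance and the error bound.
    """
    if s <= 0:
        raise ValueError("at least one tagged animal must be recaptured")
    n_hat = n * t / s
    variance = t ** 2 * n * (n - s) / s ** 3
    bound = 2 * math.sqrt(variance)
    return n_hat, variance, bound


# Antelope example: t = 100 tagged, second sample n = 50, s = 20 recaptured.
n_hat, variance, bound = direct_sampling_estimate(t=100, n=50, s=20)
print(n_hat, variance, round(bound, 1))
```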

Method 2 – Inverse Sampling


The second technique is inverse sampling. It is similar to direct sampling, but
the second sample size is not fixed. That is, we sample until a fixed number of
tagged animals is observed. Using this procedure, we can also obtain an
estimate of N, the total population size, using N=t/p. When a choice is available
between the direct and inverse sampling procedures, the inverse procedure
appears to provide more accurate results.

Important Formulas
Estimation of N (note that t is the initial sample, n is the second sample and s is the
number of recaptured, tagged animals within n) is

$$N_{\text{estimated}} = \frac{nt}{s}$$

Estimated variance of Nestimated is

$$V_{\text{estimated}} = \frac{t^2\,n\,(n-s)}{s^2\,(s+1)}$$

Bound on the error of estimation is

$$2\sqrt{V_{\text{estimated}}} = 2\sqrt{\frac{t^2\,n\,(n-s)}{s^2\,(s+1)}}$$

NUMERICAL EXAMPLE
Economics: Authorities in Liwonde National Park are interested in the total number of
birds of a particular species that inhabit the park. A random sample of t=200 birds is
trapped, tagged and then released. In the same month a second sample is drawn until 30
tagged birds are recaptured (s = 30). In total, 100 birds are captured in order to find the
30 tagged ones (n = 100). Estimate N, and place a bound on the error of estimation.

Solution
Using the formulas in Method 2, we estimate N by

$$N_{\text{estimated}} = \frac{nt}{s} = \frac{100(200)}{30} \approx 666.67$$

A bound on the error of estimation is found by

$$2\sqrt{V_{\text{estimated}}} = 2\sqrt{\frac{t^2\,n\,(n-s)}{s^2\,(s+1)}} = 2\sqrt{\frac{200^2\,(100)(100-30)}{30^2\,(30+1)}} \approx 200$$

Hence, we estimate that 667 birds of the particular species inhabit Liwonde National
Park. We are quite confident that our estimate is within approximately 200 birds of the
true population size.
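
A sketch of the inverse-sampling calculation follows. It assumes the variance takes the form with s²(s + 1) in the denominator, as given for inverse sampling in standard survey-sampling texts; the function name is hypothetical and the inputs are those of the Liwonde example.

```python
import math


def inverse_sampling_estimate(t, n, s):
    """Inverse sampling: the second sample continues until s tagged animals are found.

    t : number of animals tagged in the first sample
    n : total number of animals captured in the second sample
    s : fixed number of tagged animals to be recaptured
    Returns the population estimate, its estimated variance and the error bound.
    """
    if s <= 0:
        raise ValueError("s must be positive")
    n_hat = n * t / s
    variance = t ** 2 * n * (n - s) / (s ** 2 * (s + 1))  # assumed (s + 1) form
    bound = 2 * math.sqrt(variance)
    return n_hat, variance, bound


# Liwonde example: t = 200 tagged, n = 100 captured, s = 30 tagged recaptures.
n_hat, variance, bound = inverse_sampling_estimate(t=200, n=100, s=30)
print(round(n_hat, 2), round(bound, 1))
```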

Method 3 – Quadrat
The third technique involves estimating population density and size from quadrat
samples (plots, volumes or intervals of time). That is, estimation of the number of
elements in a defined area or volume can be accomplished by first estimating the
number of elements per unit area (that is, the density of the elements) and then
multiplying the estimated density by the size of the area under study.

It seems that there is nothing new here. However, it is often the case that the
elements being counted (diseased trees, bacteria colonies, traffic accidents, etc.)
are themselves randomly distributed over area, volume, or time.

Suppose a region of total area $A$ is to be sampled by randomly selecting $q$ plots,
each of area $a$. For convenience, we assume that the region divides exactly into
plots of area $a$. Each plot will be called a quadrat, a small area containing a
cluster of elements. We let $m_j$ denote the number of elements in quadrat $j$. The
total number of elements in the population having area $A$ is $M$, the sum of the
counts $m_j$ over all quadrats in the region, and the density of elements (elements
per unit area) is

$$\lambda = \frac{M}{A}$$

Knowing this, the following important formulas follow.

Important Formulas
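Thus, under the assumption of randomly dispersed elements (assuming the quadrat
counts $m_j$ have a Poisson distribution), we have the following estimators of
$\lambda$ and $M$. Here $\bar{m}$ is the average count per sampled quadrat, $a$ is the
area of each quadrat, and $q$ is the number of sampled quadrats.

Estimator of the density $\lambda$ is

$$\hat{\lambda} = \lambda_{\text{estimated}} = \frac{\bar{m}}{a}$$

Estimated variance of $\hat{\lambda}$ is

$$\hat{V}(\hat{\lambda}) = \frac{\hat{\lambda}}{aq}$$

Bound on the error of estimation of $\lambda$ is

$$2\sqrt{\hat{V}(\hat{\lambda})} = 2\sqrt{\frac{\hat{\lambda}}{aq}}$$

Estimator of the total M is

$$\hat{M} = M_{\text{estimated}} = A\hat{\lambda}$$

Estimated variance of $\hat{M}$ is

$$\hat{V}(\hat{M}) = \frac{A^2 \hat{\lambda}}{aq}$$

Bound on the error of estimation of M is

$$2\sqrt{\hat{V}(\hat{M})} = 2A\sqrt{\frac{\hat{\lambda}}{aq}}$$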

NUMERICAL EXAMPLE
Economics: The Department of Forestry is investigating the density of trees having fusiform
rust on a Northern tree plantation of 500 acres. The density is to be estimated from a
sample of q = 20 quadrats, where each quadrat has area a = 0.5 acre. The 20 sample plots
had an average of $\bar{m}$ = 2.0 infected trees per quadrat.

i. Estimate the density of infected trees, and place a bound on the error of estimation.

ii. Estimate the total number of infected trees in the 500-acre plantation, and place a
bound on the error of estimation.

Solution
(i) Using the equations in Method 3 with a = 0.5, we determine the estimated density as

$$\hat{\lambda} = \frac{\bar{m}}{a} = \frac{2}{0.5} = 4 \text{ trees per acre}$$

The error bound on the estimation is

$$2\sqrt{\frac{\hat{\lambda}}{aq}} = 2\sqrt{\frac{4}{(0.5)(20)}} \approx 1.26$$

Thus, we estimate the density as 4.00 ± 1.26, or from 2.74 to 5.26 infected trees per
acre.
(ii) The estimated total number of infected trees in the 500-acre area is

$$\hat{M} = A\hat{\lambda} = 500(4) = 2000 \text{ trees}$$

The error bound on the estimation is

$$2\sqrt{\hat{V}(\hat{M})} = 2A\sqrt{\frac{\hat{\lambda}}{aq}} = 2(500)\sqrt{\frac{4}{(0.5)(20)}} = 632.45$$

Thus, we estimate the total number of infected trees as 2000 ± 632, or about 1368 to
2632, in the 500-acre area of Northern Plantation.
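
The quadrat calculations can be checked with a short Python sketch. It assumes the Poisson-based formulas of Method 3 (density = mean count per quadrat / quadrat area, variance of the density = density / (quadrat area x number of quadrats)); the function and argument names are illustrative only and do not come from the original text.

```python
import math


def quadrat_estimates(mean_count, quadrat_area, num_quadrats, total_area):
    """Return (density, density bound, total, total bound) for quadrat sampling.

    Assumes the element counts per quadrat follow a Poisson distribution.
    """
    density = mean_count / quadrat_area
    # Bound on the error of estimation: 2 * sqrt(density / (a * q))
    density_bound = 2 * math.sqrt(density / (quadrat_area * num_quadrats))
    total = total_area * density
    # The bound for the total scales the density bound by the total area A
    total_bound = total_area * density_bound
    return density, density_bound, total, total_bound


# Fusiform rust example: mean of 2.0 trees per quadrat, a = 0.5 acre,
# q = 20 quadrats, A = 500 acres
density, density_bound, total, total_bound = quadrat_estimates(2.0, 0.5, 20, 500)
print(f"density = {density:.2f} +/- {density_bound:.2f} trees per acre")
print(f"total   = {total:.0f} +/- {total_bound:.2f} trees")
```

Evaluating 2 sqrt(4 / ((0.5)(20))) in this way gives a density bound of about 1.26 trees per acre, which is the total bound of 632.45 divided by the 500 acres.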

Method 4 – Stocked Quadrats


The fourth method is estimating population density and size from stocked
quadrats, that is, quadrats that contain the species of interest. In quadrat
sampling of plants or animals, counting the exact number of the species under
investigation is often difficult. In contrast, detecting the presence or absence
of the species of interest is often easy. Simply knowing whether or not the
species is present in each sampled quadrat can then lead to an estimate of
density and of population size.

To explain the notion of stocked quadrats, let y denote the number of sampled
quadrats that are not stocked in a sample of q quadrats, each of area $a$, from
a population of area A. Under the assumption that the elements are randomly
dispersed with density $\lambda$, the proportion of unstocked quadrats in the
population is approximately $e^{-\lambda a}$. We know from our previous discussions
that the sample proportion of unstocked quadrats is a good estimator of the
population proportion. Thus y/q is an estimator of $e^{-\lambda a}$, and this result
leads to the following estimators of $\lambda$ and M.

Important Formulas
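Estimator of the density $\lambda$ is

$$\hat{\lambda} = \lambda_{\text{estimated}} = -\frac{1}{a}\ln\left(\frac{y}{q}\right), \text{ where ln denotes the natural logarithm}$$

Estimated variance of $\hat{\lambda}$ is

$$\hat{V}(\hat{\lambda}) = \frac{1}{q a^2}\left(e^{\hat{\lambda} a} - 1\right)$$

Bound on the error of estimation is

$$2\sqrt{\hat{V}(\hat{\lambda})} = 2\sqrt{\frac{1}{q a^2}\left(e^{\hat{\lambda} a} - 1\right)}$$

Estimator of the total M is

$$\hat{M} = M_{\text{estimated}} = A\hat{\lambda}$$

Estimated variance of $\hat{M}$ is

$$\hat{V}(\hat{M}) = A^2 \hat{V}(\hat{\lambda}) = \frac{A^2}{q a^2}\left(e^{\hat{\lambda} a} - 1\right)$$

Bound on the error of estimation is

$$2\sqrt{\hat{V}(\hat{M})} = 2A\sqrt{\frac{1}{q a^2}\left(e^{\hat{\lambda} a} - 1\right)}$$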

NUMERICAL EXAMPLE
Economics: Recall the previous problem statement of the 500-acre Northern Plantation.
Now, for estimation of the density of trees infected by fusiform rust, q = 30 quadrats of
a = 0.5 acre each will be sampled, but only the presence or absence of infected trees will
be noted for each sampled quadrat, rather than counting the number of trees, which is
sometimes cumbersome. Suppose y = 6 of the 30 quadrats show no signs of fusiform rust.
Estimate the density and number of infected trees, placing bounds on the error of
estimation in both cases.

Solution
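Using the formulas in Method 4, the density is estimated by

$$\hat{\lambda} = -\frac{1}{a}\ln\left(\frac{y}{q}\right) = -\frac{1}{0.5}\ln\left(\frac{6}{30}\right) = (-2)(-1.609) \approx 3.2 \text{ trees per acre}$$

The bound on the error is

$$2\sqrt{\hat{V}(\hat{\lambda})} = 2\sqrt{\frac{1}{q a^2}\left(e^{\hat{\lambda} a} - 1\right)} = 2\sqrt{\frac{1}{30(0.5)^2}\left(e^{3.2(0.5)} - 1\right)} \approx 1.45$$

We then estimate the density as 3.2 ± 1.5, or about 1.7 to 4.7 infected trees per acre.

The estimator of the total M is

$$\hat{M} = A\hat{\lambda} = 500(3.2) = 1600$$

and the bound on the error of estimation is

$$2\sqrt{\hat{V}(\hat{M})} = 2A\sqrt{\frac{1}{q a^2}\left(e^{\hat{\lambda} a} - 1\right)} = 2(500)\sqrt{\frac{1}{30(0.5)^2}\left(e^{3.2(0.5)} - 1\right)} \approx 726$$

Now, our estimate of the total number of infected trees is 1600 ± 726, or about 874 to
2326, in the 500-acre Northern Plantation.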
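
A small Python sketch of the stocked-quadrat calculation is given below. It assumes the Method 4 estimators in the form density = -(1/a) ln(y/q) and variance = (e^(density x a) - 1) / (q a^2); the function and argument names are hypothetical. Because the sketch keeps the unrounded density (about 3.22 rather than 3.2), its results differ slightly from a hand calculation with rounded values.

```python
import math


def stocked_quadrat_estimates(unstocked, num_quadrats, quadrat_area, total_area):
    """Return (density, density bound, total, total bound) from presence/absence data.

    unstocked: number of sampled quadrats with no element present (y)
    num_quadrats: number of sampled quadrats (q)
    quadrat_area: area of each quadrat (a)
    total_area: area of the whole region (A)
    """
    if not 0 < unstocked < num_quadrats:
        raise ValueError("need 0 < unstocked < num_quadrats for the log to be usable")
    density = -math.log(unstocked / num_quadrats) / quadrat_area
    # Assumed variance estimator: (e^(density * a) - 1) / (q * a^2)
    variance = (math.exp(density * quadrat_area) - 1) / (num_quadrats * quadrat_area**2)
    density_bound = 2 * math.sqrt(variance)
    total = total_area * density
    total_bound = total_area * density_bound
    return density, density_bound, total, total_bound


# Northern Plantation example: y = 6 unstocked of q = 30 quadrats,
# a = 0.5 acre, A = 500 acres
density, density_bound, total, total_bound = stocked_quadrat_estimates(6, 30, 0.5, 500)
print(f"density = {density:.2f} +/- {density_bound:.2f} trees per acre")
print(f"total   = {total:.0f} +/- {total_bound:.0f} trees")
```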

4.11. How do we compute a population proportion?


Suppose a researcher wishes to estimate a population proportion or fraction, such
as the proportion of houses in a district with inadequate sanitation facilities, or the
proportion of children who did not receive vaccination.

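The best estimator of the population proportion p is the sample proportion
$\hat{p}$ (or $p_{\text{estimated}}$). Let $a_j$ denote the total number of elements in
cluster j that possess the characteristic of interest. Then the proportion of
elements in the sample of n clusters possessing the characteristic, that is, the
estimator of the population proportion p, is given by

$$\hat{p} = p_{\text{estimated}} = \frac{\sum_{j=1}^{n} a_j}{\sum_{j=1}^{n} m_j}$$

where $m_j$ is the number of elements in the jth cluster, j = 1, 2, 3, ..., n.

Estimated variance of $\hat{p}$ or $p_{\text{estimated}}$:

$$\hat{V}(\hat{p}) = \left(\frac{N - n}{N n \bar{m}^2}\right) \frac{\sum_{j=1}^{n} (a_j - \hat{p} m_j)^2}{n - 1}$$

where N is the number of clusters in the population and $\bar{m}$ (or $m_{\text{mean}}$)
is the average cluster size in the sample.

Bound on the error of estimation:

$$2\sqrt{\hat{V}(\hat{p})}$$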

NUMERICAL EXAMPLE
Life Sciences: From a village of N = 415 households (clusters), a sample of n = 25
households was drawn, and the residents of each were asked whether they have sanitation
facilities or not. The data set is given in Table 4.4.

Table 4.4: Data for sanitation facilities

Columns: cluster j; number of residents, m_j; number of residents with
sanitation facilities, a_j.

Cluster j    m_j    a_j        Cluster j    m_j    a_j
    1         8      4            14        10      5
    2        12      7            15         9      4
    3         4      1            16         3      1
    4         5      3            17         6      4
    5         6      3            18         5      2
    6         6      4            19         5      3
    7         7      4            20         4      1
    8         5      2            21         6      3
    9         8      3            22         8      3
   10         3      2            23         7      4
   11         2      1            24         3      0
   12         6      3            25         8      3
   13         5      2

Totals: $\sum m_j = 151$, $\sum a_j = 72$, $\sum m_j^2 = 1047$, $\sum a_j^2 = 262$, $\sum a_j m_j = 511$

Solution
The best estimate of the population proportion of residents with sanitation facilities is
$\hat{p}$ or $p_{\text{estimated}}$:

$$\hat{p} = \frac{\sum a_j}{\sum m_j} = \frac{72}{151} \approx 0.48 = 48\%$$

To estimate the variance of $\hat{p}$, we must calculate

$$\sum (a_j - \hat{p} m_j)^2 = \sum a_j^2 - 2\hat{p}\sum a_j m_j + \hat{p}^2 \sum m_j^2$$

and from Table 4.4

$$\sum (a_j - \hat{p} m_j)^2 = 262 - 2(0.477)(511) + (0.477)^2 (1047) = 12.729$$

and

$$\bar{m} = \frac{\sum m_j}{n} = \frac{151}{25} = 6.04$$

Then the variance of $\hat{p}$ is

$$\hat{V}(\hat{p}) = \left(\frac{N - n}{N n \bar{m}^2}\right) \frac{\sum (a_j - \hat{p} m_j)^2}{n - 1} = \left(\frac{415 - 25}{415(25)(6.04)^2}\right) \frac{12.729}{24} \approx 0.00055$$

The estimate of p with a bound on the error is

$$\hat{p} \pm 2\sqrt{\hat{V}(\hat{p})} = 0.48 \pm 2\sqrt{0.00055} \approx 0.48 \pm 0.05$$

Thus, the best estimate of the proportion of people who have sanitation facilities is 0.48
or 48%. The error of estimation should be less than 5 percentage points with probability
of approximately 95%.
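
The cluster-sampling calculation can be reproduced with the following Python sketch, using the 25 (m_j, a_j) pairs from Table 4.4. The function name and argument names are illustrative only, and N = 415 is taken to be the number of clusters in the population, as the variance formula requires. The sketch keeps the unrounded proportion, so the intermediate sum of squares comes out marginally different from the hand value of 12.729.

```python
import math


def cluster_proportion(sizes, counts, num_population_clusters):
    """Estimate a proportion from a sample of clusters.

    sizes: cluster sizes m_j for the sampled clusters
    counts: number of elements a_j with the characteristic in each cluster
    num_population_clusters: N, the number of clusters in the population
    Returns (p_hat, estimated variance, bound on the error of estimation).
    """
    n = len(sizes)
    if n != len(counts) or n < 2:
        raise ValueError("need at least two clusters with matching sizes and counts")
    p_hat = sum(counts) / sum(sizes)
    m_bar = sum(sizes) / n
    # Sum of squared deviations of a_j from p_hat * m_j
    ss = sum((a - p_hat * m) ** 2 for a, m in zip(counts, sizes))
    big_n = num_population_clusters
    variance = (big_n - n) / (big_n * n * m_bar**2) * ss / (n - 1)
    bound = 2 * math.sqrt(variance)
    return p_hat, variance, bound


# Data from Table 4.4 (clusters 1 to 25)
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5, 10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
a = [4, 7, 1, 3, 3, 4, 4, 2, 3, 2, 1, 3, 2, 5, 4, 1, 4, 2, 3, 1, 3, 3, 4, 0, 3]

p_hat, variance, bound = cluster_proportion(m, a, num_population_clusters=415)
print(f"p_hat = {p_hat:.3f}, variance = {variance:.5f}, bound = {bound:.3f}")
```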

4.12. How big should the sample be?


The required sample size depends on:
o The size of the population.
o The variance and covariance in the population of the
variables we want to measure.
o The desired accuracy of our variable estimate.
o Number of sub-samples to be analyzed.

===============================================================
MENTAL GYMNASTICS
CHAPTER FOUR
===============================================================

1. Distinguish between the following pairs of terms.


a) Non-random sample and Random Sample
b) Large sample and Representative sample
c) Neyman allocation and optimum sample

2. True, False or Uncertain. Support your answer.


a) The formulas for calculating sample size depend on the objective of the survey.
b) The smaller the confidence interval, the smaller the sample size needed for the
actual survey.
c) Where samples are to be broken into sub-samples (e.g., male-headed/female-
headed household, urban/rural area), a minimum sample size of 30 for all
categories is sufficient.
d) The bigger-the-sample-the-better.
e) The higher the number of subsets to be analyzed, the larger the necessary sample.
f) Larger samples cost more on a linear basis; however, the sampling error decreases
at a rate equal to the square root of the relative increase in sample size.
g) There exist sample size formulas, but in practice these are of limited value.

3. Given a total sample size n, its allocation or distribution to the different strata or
groups would be based on mainly three principles. State and explain these principles.

4. Why are a representative sample and an adequate sample size important?

5. The data on the number of food-secure households whose income ranges from zero
to above 25,000 Malawi Kwacha for the years 2007 and 2010 are given in the following
table, in 6 strata according to the total amount of on- and off-farm income, along
with the present number of households in project areas of some districts.

Columns: stratum i; income (MK); N_i = total number of households in project area, 2007;
x̄_i = average number of households; s_i = sample standard deviation;
N_i'' = total number of households in same project area, 2010.

Stratum i   Income (MK)        N_i      x̄_i     s_i     N_i''
    1       0 – 5000           450      40.1    4.4     500
    2       5001 – 10000       240      30.0    3.2     350
    3       10001 – 15000      530      22.4    4.5     590
    4       15001 – 20000      100      15.6    3.1     200
    5       20001 – 25000       50       8.5    1.3      80
    6       Above 25,000        20       3.5    0.9      35
All strata combined          N=1340                   N''=1755

Using the data in the table above, determine the allocation of the sample to the
different strata according to the following principles: (i) Neyman allocation, (ii)
proportional allocation, and (iii) allocation proportional to the total number of
households in the different strata.
