Making History Count
A Primer in Quantitative Methods for Historians
Making History Count introduces the main quantitative methods used in his-
torical research. The emphasis is on intuitive understanding and application of
the concepts, rather than formal statistics; no knowledge of mathematics
beyond simple arithmetic is required. The techniques are illustrated by appli-
cations in social, political, demographic, and economic history.
Students will learn to read and evaluate the application of the quantitative
methods used in many books and articles, and to assess the historical conclu-
sions drawn from them. They will see how quantitative techniques can open
up new aspects of an enquiry, and supplement and strengthen other methods
of research. This textbook will also encourage students to recognize the bene-
fits of using quantitative methods in their own research projects.
Making History Count is written by two senior economic historians with
considerable teaching experience on both sides of the Atlantic. It is designed
for use as the basic text for taught undergraduate and graduate courses, and is
also suitable for students working on their own. It is clearly illustrated with
numerous tables, graphs, and diagrams, leading the student through the
various key topics; the whole is supported by four specific historical data sets,
which can be downloaded from the Cambridge University Press website at
http://uk.cambridge.org/resources/0521806631.
Chapter 3 Correlation
3.1 The concept of correlation
3.1.1 Correlation is not causation
3.1.2 Scatter diagrams and correlation
3.1.3 Outliers
3.2 The correlation coefficient
3.2.1 Measuring the strength of the association
3.2.2 Derivation of the correlation coefficient, r
3.2.3 Interpreting a correlation coefficient
3.2.4 An illustration of the use of correlation coefficients
3.3 Spearman's rank correlation coefficient
3.4 Exercises for chapter 3
Chapter 4 Simple linear regression
4.1 The concept of regression
4.1.1 Explanatory and dependent variables
4.1.2 The questions addressed by regression
4.2 Fitting the regression line
4.2.1 How do we define a line?
4.2.2 What criterion should be adopted for fitting the line?
4.2.3 How do we find this best fitting line?
Chapter 11 Violating the assumptions of the classical linear regression model
11.1 The assumptions of the classical linear regression model
11.1.1 The model is correctly specified
11.1.2 The error term is appropriately specified
11.1.3 The variables are correctly measured and appropriately specified
11.2 Problems of model specification
11.2.1 Non-linear models
11.2.2 Omitted and redundant variables
11.2.3 Unstable parameter values
11.3 Problems of error specification
11.3.1 Non-zero errors
11.3.2 Heteroscedasticity
11.3.3 Autocorrelation
11.3.4 Outliers
11.4 Problems of variable specification
11.4.1 Errors in variables
11.4.2 Multicollinearity
11.4.3 Simultaneity
11.5 Exercises for chapter 11
8.4 Least squares deviation from the regression plane with two explanatory variables
(a) Three-dimensional scatter diagram for a dependent variable and two explanatory variables
(b) Three-dimensional diagram with regression plane
9.1 The population of Y for different values of X
(a) Hypothetical populations
(b) The form of the population of Y assumed in simple linear regression
9.2 The true (population) regression line and the estimated (sample) regression line
9.3 The F-distribution for selected degrees of freedom
9.4 95 per cent confidence interval for a prediction of the mean value of UNEMP
10.1 An intercept dummy variable
10.2 A slope and an intercept dummy variable
11.1 The marginal cost of production of a statistical textbook
11.2 Omitted and redundant variables
(a) Ballantine to show the effect of a missing variable
(b) Ballantine to show the effect of a redundant variable
11.3 Illustration of the use of dummy variables to allow for structural changes
(a) Introduction of an intercept dummy variable
(b) Introduction of a slope dummy variable
(c) Introduction of both an intercept and a slope dummy variable
11.4 Heteroscedastic error terms
11.5 Typical patterns of autocorrelated errors
11.6 Autocorrelation
11.7 Outliers
(a) An outlier as a leverage point
(b) A rogue outlier
12.1 A non-linear relationship between age and height
12.2 Geometric curves: Y = aX^b
12.3 Exponential curves: Y = ab^X
12.4 Reciprocal curves: Y = a + b/X
12.5 Quadratic curves: Y = a + b_1 X + b_2 X^2
(a) Rising and falling without turning points
(b) Maximum and minimum turning points
12.6 Logistic curves: Y = h/(g + ab^X)
Elementary statistical analysis
Introduction
This text has three principal objectives. The first is to provide an elementary
and very informal introduction to the fundamental concepts and techniques
of modern quantitative methods. A primer cannot be comprehensive, but
we will cover many of the procedures most widely used in research in the his-
torical and social sciences. The book is deliberately written at a very basic
level. It does not include any statistical theory or mathematics, and there is
no attempt to prove any of the statistical propositions. It has been planned
on the assumption that those reading it have no retained knowledge of sta-
tistics, and very little of mathematics beyond simple arithmetic.
It is assumed that the material in the book will be taught in conjunction
with one of the several statistical packages now available for use with com-
puters, for example, SPSS for Windows, STATA, MINITAB, or SAS. By using
the computer to perform all the relevant statistical calculations and manip-
ulations it is possible to eliminate both the need to learn numerous formu-
lae, and also the tedious work of doing laborious calculations. However, if
the computer is going to provide the results, then it is absolutely essential
that the student should be able to understand and interpret the content and
terminology of the printouts, of which figure 1.1 is a typical specimen, and
the second objective of the book is to achieve this.
This leads naturally to the third and most important objective. The
book is throughout concerned to relate the quantitative techniques studied
to examples of their use by historians and social scientists, and by doing
this to promote the understanding and use of these methods. In the follow-
ing section we introduce four specific studies that will be deployed
throughout the book, but – at appropriate points in the text – we also refer
readers to other examples of the application of quantitative methods to
historical and other issues. A student who studies this text will not be able
to claim that she is a fully qualified statistician. However, she should have the confidence to read chapters or articles which use quantitative methods, to understand what the authors have done and why they have done it, and to make her own critical evaluation of the procedures used and the historical conclusions drawn from the statistical results.

Figure 1.1 A typical specimen of a computer printout (SPSS): regression with dependent variable RELIEF

Variables Entered/Removed(b)
Model 1.  Variables Entered: FARMERS, CHILDALL, LONDON, WEALTH, GRAIN(a).  Variables Removed: (none).  Method: Enter
a. All requested variables entered.
b. Dependent Variable: RELIEF

Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .532(a)  .283       .271                6.84746
a. Predictors: (Constant), FARMERS, CHILDALL, LONDON, WEALTH, GRAIN

ANOVA(b)
Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   5638.346         5     1127.669      24.050   .000(a)
Residual     14300.769        305   46.888
Total        19939.116        310
a. Predictors: (Constant), FARMERS, CHILDALL, LONDON, WEALTH, GRAIN
b. Dependent Variable: RELIEF

Coefficients(a)
             Unstandardized Coefficients      Standardized Coefficients
Model 1      B            Std. Error          Beta        t        Sig.
(Constant)   11.352       1.718                           6.609    .000
GRAIN        .301         .089                .204        3.363    .001
CHILDALL     5.609        1.030               .271        5.445    .000
WEALTH       −.276        .179                −.081       −1.544   .124
LONDON       −4.156E-02   .010                −.234       −4.110   .000
FARMERS      5.873        2.131               .158        2.756    .006
a. Dependent Variable: RELIEF
Students should also be able to see from the case studies and other
examples how the use of quantitative methods can open up new aspects of
an enquiry and can supplement and strengthen other methods of research.
We hope that they might then appreciate how their own research projects
might benefit from the application of these methods, and take their own
first steps in this direction.
The book is designed to be used both as the basic text for taught courses
and for students working on their own without an instructor. In planning
the content and sequence of the chapters, one of our primary considera-
tions has been to keep the material in the early chapters at a very elemen-
tary level. Many of those for whom this book is intended will naturally be
somewhat wary about taking a course in quantitative methods, but if they
find that they can make substantial progress in understanding some of the
basic statistical concepts they will gain confidence in their ability to handle
the slightly more difficult material in later chapters. There is a small price to
be paid for this approach, since it means that correlation and regression are
covered twice: first without any statistical theory (in chapters 3 and 4) and
then again in greater depth (in chapter 8). However, the text has been used
for a number of years to teach a class taken almost exclusively by statistical
novices who initially approached the course with deep suspicion, and
experience has shown that this strategy is very successful.
It is, of course, also possible for instructors to follow the material in a
different sequence, or – depending on the time available and the level it is
desired to achieve – to omit certain topics altogether. For example, chapter
7 on non-parametric methods has been included because these procedures
are often appropriate for the sort of problems faced by historians; and it has
been placed at this point in the text because it makes a useful complement
to the discussion of the standard principles of hypothesis testing in chapter
6. However, a course designed primarily to provide a very basic introduc-
tion to regression analysis in 10 sessions might skip this. It might, for
example, start with chapters 2–6 and 8–9, add material on dummy vari-
ables from chapter 10 and on the basic aspects of non-linear regression
from chapter 12, and then cover the appropriate applications in the case
studies in chapters 14 and 15.
This would be sufficient to give students a good grounding in some of
the main aspects of regression methods, and should enable them to cope
with many examples of the use of these methods in the historical literature.
However, the omission of chapter 11 would mean that they would not have
acquired any knowledge of either the serious problems which can arise
when the assumptions underlying the standard regression model (dis-
cussed in chapter 9) are violated, or the associated procedures for diagnos-
ing and – where possible – correcting for these violations. This would also
be a substantial weakness for any students who wished to apply these
methods to their own research. One alternative, for students able to start
with some knowledge of very elementary statistics and wishing to aim a
little higher (while still limited to 10 sessions), would be to skip chapters
1–4 and 7, and work through chapters 5–6 and 8–15.
The normal text and tables are supplemented by material in boxes and
panels. Boxes are used to highlight the fundamental definitions and con-
cepts, and should be studied closely. Panels are used to provide explana-
tions or information at a slightly more advanced level than the rest of the
text. The panels should be helpful for some readers, but those who omit
them will not be at any disadvantage in understanding the remainder of the
text. In addition to footnotes (referred to by a letter), we have also used
endnotes (referred to by a number) where it seems desirable not to burden
the main text with lengthy annotations. Endnotes typically consist either of
lists of references to further examples of the applications of statistical
methods to historical topics, or of technical points which need not distract
all readers although they may be useful to some.
We have given a formula for all the basic concepts even though the book
is written on the assumption that the computers will provide what is
required for any particular statistical operation. It is usually possible to set
out these formulae in various ways, but we have always chosen the variant
that best explains the essential nature of the concept, rather than one that
facilitates computation (and we avoid the complication of alternative ver-
sions of the same formula). We recognize that many readers will not be
accustomed to working with symbols, and so may be uncomfortable ini-
tially with this method of setting out information. However, the multi-part
nature of many concepts means that a formula is the most concise and
effective ‘shorthand’ way of showing what is involved, and we would
strongly urge all readers to make the small initial effort required to learn to
read this language; the rewards for doing so are very substantial.
a The four studies are George Boyer, An Economic History of the English Poor Law, Cambridge
University Press, 1990, chapters 4 and 5; Timothy J. Hatton and Jeffrey G. Williamson, ‘After the
famine: emigration from Ireland, 1850–1913’, Journal of Economic History, 53, 1993, pp.
575–600; Daniel K. Benjamin and Levis A. Kochin, ‘Searching for an explanation for unemploy-
ment in interwar Britain’, Journal of Political Economy, 87, 1979, pp. 441–78; and Richard H.
Steckel, ‘The age at leaving home in the United States, 1850–1860’, Social Science History, 20,
1996, pp. 507–32.
Cases: The cases are the 37 successive years from 1877 to 1913.
Variables: There are five basic variables for each year. These are the rates per
1,000 of the population emigrating from Ireland (IRISHMIG), the foreign
and domestic employment rates (EMPFOR and EMPDOM), the foreign
wage relative to the domestic wage (IRWRATIO), and the stock of previous
emigrants (MIGSTOCK).
Values: For the main emigration variable (IRISHMIG), the values of this rate
for the first five years are: 7.284, 7.786, 8.938, 18.358, and 15.238.
Nominal measurement
This is the lowest level and conveys no information about the relations
between the values. Each value defines a distinct category but can give no
information other than the label or name (hence nominal level) of the cat-
egory. They are sometimes also referred to as categorical variables.
For example, a study of migration to urban areas might include as one of
the variables the birthplace of the migrants. These towns or villages cannot
be ranked or placed in any order in terms of their value as place of birth
(though they could be by other criteria such as size, or distance from the
migrants’ final destination).
Ordinal measurement
This applies when it is possible to order or rank all the categories according
to some criterion without being able to specify the exact size of the interval
between any two categories.
c A similar procedure is adopted by Boyer, Poor Law, p. 126, to explain the increase in relief expenditure over time on the basis of cross-section variations in expenditure across parishes.
This is a very common situation with historical data and can occur for
one of three reasons:
complete census of textile mills had been taken at the relevant date.
Normally, however, we would have only a sample.
The characteristics of the population variables are known as parame-
ters, those of the sample variables as statistics. Parameters are fixed values
at any point in time and are normally unknown. Statistics, on the other
hand, are known from the sample, but may vary with each sample taken
from the population. The extent of such variation from sample to sample
will depend on the homogeneity (uniformity) of the population from
which it is drawn.d
A crucial feature of any sample is whether or not it is random. A random
sample satisfies three basic conditions. First, every item in the population
(parishes in England and Wales, voters in an election, cards in a deck of
cards) has an equal chance of appearing in the sample. Secondly, every com-
bination of items has an equal chance of selection. Thirdly, there is indepen-
dence of selection: the fact that any one item in the population has been
selected has absolutely no influence on whether or not any other item will be
selected.
When the sample is drawn with replacement the same item can be
selected more than once; for example, after a card is drawn it is put back in
the deck. If the sample is drawn without replacement the item can be
selected only once. Almost all use of sampling in historical analysis is sam-
pling without replacement. Each parish, each voter, each cotton mill is
sampled only once when the data set is being compiled.
The proper procedures to be followed in constructing samples, whether
the sampling should be random or some other type, such as stratified or
cluster sampling, and the size of the sample required for any particular
project are complex subjects which go beyond the scope of this book and
specialized texts should be consulted.3
d Harold Wilson, Britain's only statistically informed prime minister, once remarked that he
needed to sip only one spoonful from a plate of soup to know whether it was too hot to drink.
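Readers who wish to see the difference between these two kinds of sampling in action can experiment with a short program. The following sketch is in Python rather than one of the statistical packages mentioned earlier, and the list of parishes is invented purely for illustration.

    import random

    random.seed(1)  # fix the random seed so that the illustration is reproducible

    # A hypothetical 'population' of six parishes (names invented for illustration)
    parishes = ["Ashford", "Bexley", "Cranbrook", "Dartford", "Eltham", "Farnborough"]

    # Sampling WITHOUT replacement: each parish can appear in the sample only once,
    # as when a historical data set is compiled
    print(random.sample(parishes, 3))

    # Sampling WITH replacement: the same parish can be drawn more than once,
    # like returning each card to the deck after it is drawn
    print([random.choice(parishes) for _ in range(3)])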
They can also be used in time-series analysis – for example, years in which
there was a war might be given a value of 1, while peacetime years are
assigned a value of 0. In more complex cases there could be more than two
categories. For example, a study of voting behaviour might divide the
electorate into four categories according to whether in the previous elec-
tion they had voted Democrat, Republican, or Independent, or had not
voted.
This procedure effectively transfers the information into a numerical
form (albeit a limited one) and makes it possible to apply standard statisti-
cal tools to the variable. For fuller discussion of the use of dummy variables
see §10.1 and also chapter 13.
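To make the coding concrete, here is a minimal sketch in Python; the years and voting records are invented for illustration, and it is the procedure rather than the data that matters.

    # A wartime dummy for a run of years: 1 for war years, 0 for peacetime
    war_years = {1914, 1915, 1916, 1917, 1918}   # assumed war years, for illustration
    years = [1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919]
    WAR = [1 if y in war_years else 0 for y in years]
    print(WAR)   # [0, 0, 1, 1, 1, 1, 1, 0]

    # Four voting categories expanded into four separate 0/1 dummy variables
    votes = ["Democrat", "Republican", "None", "Independent", "Democrat"]
    for category in ("Democrat", "Republican", "Independent", "None"):
        dummy = [1 if v == category else 0 for v in votes]
        print(category, dummy)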
Summation
The symbol used to refer to the total (or sum) of all the observations in the sequence is the Greek letter capital sigma, \Sigma. Thus the sum of the five values X_1 to X_5 of a variable X_i is written

\sum_{i=1}^{5} X_i

This is an instruction to add all the values in the series from i = 1 to i = 5. If it is obvious which values are included this may be simplified to

\sum X
Differences
The symbol used to refer to the difference between two series is the Greek letter capital delta, \Delta. Thus if instead of referring to the successive values of an annual variable X, it is desired to refer to the change in these values from year to year, this is done by writing

\Delta X
Y = a + bX
where X and Y are any two variables, and a and b are any two constants.
This equation tells us that each value of Y is equal to some constant, a,
e The equation on p. 581 of Hatton and Williamson, 'Emigration', refers to current and previous
years in this way.
plus the value of X multiplied by another constant, b.f The precise position
of the straight line will then change depending on the specific values given
to a and b.
This is illustrated in figure 1.2 (a) for the particular case where a = 2 and b = 3. The equation thus becomes

Y = 2 + 3X

Figure 1.2 (b) shows the corresponding case with a negative slope, where a = 10 and b = −2:

Y = 10 − 2X

Note also the convention for powers: X^1 = X; X^2 = X × X; X^3 = X × X × X, and so on.
f Multiplication can be represented in various ways. Thus the statement 'a multiplied by b' can be written as: a × b, as a · b, as a(b), or simply as ab. The order of multiplication makes no difference: ab is exactly the same as ba.
If one of the terms to be multiplied has more than one component then these must be placed in brackets; for example b(a + c) tells us to add a to c and then multiply the result by b. It is essential to complete the operation in brackets before the multiplication.
Similarly for division, the statement 'a divided by b' can be written as a ÷ b, as a/b, or as \frac{a}{b}, but here the order is critical: a/b is not the same as b/a. The term above the line is called the numerator, and the term below the line the denominator.
g If the reason for this is not immediately clear, consider the second example in paragraph (d)
below.
Figure 1.2 Graphs of the straight lines Y = 2 + 3X and Y = 10 − 2X
(a) With positive slope, Y rises by 3 units when X increases by 1 unit
(b) With negative slope, Y falls by 2 units when X increases by 1 unit
1.6.2 Logarithms
For every positive number there is a corresponding logarithm (log). It is
equal to the power to which a given base must be raised to equal that number.
Take, for example, the number 100 and the base 10. Then

the log of 100 to the base 10 = 2

because

10^2 = 100

i.e. the base 10 raised to the power 2 = 100
Logs can be calculated for any base greater than 1. Base 10 was originally
widely used, but with computers the base generally adopted is not 10 but a
constant with special mathematical properties known as e, equal to
2.71828 (to 5 decimal places).4
Logs to the base e are known as natural logs, and unless we explicitly
state that we are using logs to base 10, all references in the remainder of this
book are to natural logs. The proper abbreviation for natural logs is ‘ln’,
but ‘log’ is also frequently used and since it is easier to recognize what it
refers to, we have also adopted this term. Thus if you see a statement such as log INFTMORT, it refers to the natural log of the series in the Poor Law data set for infant mortality.

h If the term to be raised to a power is given in brackets it is again essential to complete the operation in brackets first. For example, (a − b)^2 is calculated by deducting b from a and then squaring the remainder; this is not the same as a^2 − b^2.
When logarithms are used in a calculation, the result can always be con-
verted back to ordinary numbers by obtaining the corresponding expo-
nential (of a natural log) or anti-log (of a log to base 10). As long as that is
done consistently, the final result is unaffected by the choice of base.
(i) The product of two numbers (X and Y) can be found by adding their logs, and then taking the exponential of the result

\log (XY) = \log X + \log Y

(ii) The quotient of two numbers (X divided by Y) can be found by subtracting their logs, and then taking the exponential of the result

\log \left(\frac{X}{Y}\right) = \log X - \log Y

(iii) A number can be raised to a power by multiplying the log of that number by the exponent, and then taking the exponential of the result

\log (X^n) = n \log X

Since square roots and other radicals can be expressed as exponents in the form of a fraction (see §1.6.1) they can also be calculated by multiplying the log of the number by the appropriate fraction, and then taking the exponential of the result

\log (\sqrt{X}) = \log (X^{1/2}) = \tfrac{1}{2} \log X

\log (\sqrt[4]{X}) = \log (X^{1/4}) = \tfrac{1}{4} \log X

and so on.
Panel 1.1 The log of a variable represents proportional changes in that variable
A fundamental property of logarithms is that the absolute change in the log
of any variable corresponds to the proportionate change in that variable.
To illustrate this, consider the following simple example. In column (2),
Income is rising by 2 per cent in each period, so the proportionate change in
the series, given in column (3), is always 0.02. The natural log of Income is
shown in column (4), and the absolute change in this series in column (5).
The absolute change in the log of Income in column (5) is constant at
0.0198, corresponding to the constant proportionate change in Income
itself in column (3) of 0.02.*
One very useful consequence of this property of logs is that it makes it pos-
sible to graph a time series for any variable on a proportionate basis by con-
verting it to logs. When a statistical program is used to plot such a graph it
typically offers the choice of making the vertical axis either linear or loga-
rithmic. If you select the former the program will plot the original values of
the variable; if you select the latter (usually referred to as a log scale) it will
convert the values to logs and plot these.
In the linear form, each absolute increment will be the same: the distance
on the vertical axis between, say, 100 and 200 units will be exactly the same
as the distance between 800 and 900. This can give a very misleading
impression. The change from 100 to 200 represents a doubling of the series;
the change from 800 to 900 is an increase of only 12.5 per cent.
* It may also be noted that the constant proportionate change in the series is approximately but not exactly equal to the constant absolute change in the log of the series. If the constant absolute change in the log of the series is converted back to ordinary numbers (by taking the exponential) the result is exactly equal to 1 + the proportionate change. Thus, in this example, the unrounded change in column (5) is 0.0198026, and the exponential of this is exactly 1.02.
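The arithmetic of the panel is easy to reproduce. This Python sketch builds a hypothetical income series growing by 2 per cent a period and confirms that the absolute change in its natural log is constant at 0.0198, whose exponential is exactly 1.02:

    import math

    income = [100 * 1.02 ** t for t in range(5)]   # hypothetical series rising 2% a period
    logs = [math.log(y) for y in income]           # natural log of the series

    for t in range(1, len(income)):
        prop_change = income[t] / income[t - 1] - 1   # proportionate change: 0.0200
        log_change = logs[t] - logs[t - 1]            # absolute change in logs: 0.0198
        print(f"{prop_change:.4f}  {log_change:.7f}  {math.exp(log_change):.2f}")
        # each row prints: 0.0200  0.0198026  1.02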
Figure 1.3 Miles of railway track added in the United States, 1831–1911, with fitted trend
Table 1.1 Calculation of 11-year moving average and deviations from trend, miles of railway track added in the United States, 1831–1850

          (1)        (2)         (3)           (4)
          Railway    Centred     Original      Deviation of
          miles      11-year     data as       original data
          added      moving      % of trend    from trend
                     average                   as % of trend
1831         72
1832        134
1833        151
1834        253
1835        465
1836        175      319.3        54.8         −45.2
1837        224      357.4        62.7         −37.3
1838        416      359.6       115.7          15.7
1839        389      363.4       107.1           7.1
1840        516      363.6       141.9          41.9
1841        717      348.4       205.8         105.8
1842        491      393.2       124.9          24.9
1843        159      409.0        38.9         −61.1
1844        192      495.6        38.7         −61.3
1845        256      610.8        41.9         −58.1
1846        297
1847        668
1848        398
1849      1,369
1850      1,656

Notes:
(1) Brinley Thomas, Migration and Economic Growth, Cambridge University Press, 1954, p. 288.
(2) See text.
(3) (1)/(2) × 100.
(4) [(1) − (2)]/(2) × 100.
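The calculations behind columns (2) to (4) are mechanical and can be delegated to a few lines of code. The following Python sketch reproduces the table from the railway series; it prints, for example, a trend of 319.3 for 1836 and 205.8 per cent of trend for 1841.

    # Miles of railway track added in the United States, 1831-1850 (column (1))
    miles = [72, 134, 151, 253, 465, 175, 224, 416, 389, 516,
             717, 491, 159, 192, 256, 297, 668, 398, 1369, 1656]
    years = range(1831, 1851)

    WINDOW = 11
    HALF = WINDOW // 2   # five years lost at each end of the series

    for i, year in enumerate(years):
        if i < HALF or i >= len(miles) - HALF:
            continue   # no centred moving average for the first and last five years
        trend = sum(miles[i - HALF : i + HALF + 1]) / WINDOW    # column (2)
        pct_of_trend = 100 * miles[i] / trend                   # column (3)
        deviation = pct_of_trend - 100                          # column (4)
        print(f"{year}  {trend:6.1f}  {pct_of_trend:6.1f}  {deviation:6.1f}")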
Compare the 11-year moving average shown in figure 1.4 (a) with the fitted trend in figure 1.3.
is immediately evident that the former does not smooth out the
fluctuations as successfully as the latter. It would, of course, be possible to
average over an even longer period, say 25 years, but only at the cost of still longer stretches at the beginning and end of the series for which there would
be no trend value. One additional disadvantage of the moving-average
procedure is that it has no mathematical form independent of the data
from which it is constructed, and so cannot be extrapolated by formula to
cover earlier or later periods. Despite these weaknesses it is very simple to
construct and has been widely used by historians.
We have so far referred to centred moving averages, but it is also pos-
sible to construct other forms of moving average. In particular, if the
focus of interest is the experience of the most recent years, it would be
appropriate to calculate an average of (say) the last five years in the series
at any given date, and to place the average against that date. This series
would then be moved forward one year at a time, so that at each date it
provided a measure of the movements in the underlying series over the
last five years. The ‘tracking polls’ used to provide a rolling measure of the
most recent poll results during the course of an election are one variant of
this. More elaborate variants might attach greater importance to the
experience of the most recent years by calculating a weighted average (see
§2.2.2), with progressively diminishing weights for the more distant
years.
Figure 1.4 Miles of railway track added in the United States: de-trending with an 11-year moving average
(a) Original series with centred 11-year moving average
(b) Original data as per cent of trend
(c) Deviation from trend as per cent of trend
was depressed and below its trend, have values below 100. The trend has
been removed from the series, and what is left in column (3) is the remain-
ing component: the regular and irregular fluctuations. The corresponding
de-trended series calculated with the 11-year moving average is shown in
figure 1.4 (b).k
Much the same result can be obtained by calculating the percentage
deviation from the trend. This is equal to the absolute difference between
the trend and the original series, expressed as a percentage of the trend, as is
done in column (4) of table 1.1 and figure 1.4 (c). This is also a de-trended
series and will show exactly the same fluctuations as column (3). The only
difference is that the former fluctuates around a mean of 100 per cent and
the latter around a deviation of 0 per cent.
When dealing with monthly time series, or other periods of less than a
year, it is possible to make separate estimates of the seasonal variations and
to separate these from the residual cyclical and irregular fluctuations. Such
seasonal adjustment plays no part in any further work in this text, but it is a
useful illustration of the way in which historical data – for example a series
for birth rates or for agricultural production – can be decomposed into
these three elements: long-run trend, recurrent seasonal fluctuations, and
residual fluctuations, thus enabling the historian to analyse separately each
of these contributions to the overall change in the series. A brief account of
one simple seasonal adjustment procedure is set out in panel 1.2 for those
who wish to see how this three-way split is made.14
k Since the ratio of two series can be calculated as the difference between their respective loga-
rithms, a de-trended series can also be derived in this way, as with the DEMAND variable meas-
ured by Benjamin and Kochin, ‘Unemployment’, pp. 452–3 (see also §1.6.3).
Panel 1.2 Decomposition of a time series into trend, seasonal factors, and
residual fluctuations
This panel describes a simple procedure for splitting a time series into trend,
seasonal factors, and residual fluctuations, using quarterly data on bank-
ruptcy in England for 1780–84 to illustrate the calculations.* The original
data for the five years are set out with a column for each quarter in row (1) of
table 1.2. The trend is calculated as a 25-quarter moving average over the
longer period 1777–87 (so that there are trend values for all quarters from
1780 I to 1784 IV), and is given in row (2).
Row (3) gives the original series as a ratio to the trend for each quarter
and this reveals the seasonal pattern in the bankruptcy statistics. The main
seasonal feature is clearly the persistently low level in the third quarter of
the year. Hoppit attributes this primarily to the strong favourable effect of
the grain harvest on a wide range of activities and to the higher level of
building. Conversely, bankruptcies were at their highest in the winter
months, and this is also reflected in the ratios for the first and fourth quar-
ters in row (3).
The temptation at this point is to calculate the average of the quarterly
ratios for the five-year period, but a more accurate measure is obtained by
calculating the ratio of the five-year totals for the original series and the
trend. This is given in row (4). Row (5) then makes a minor correction to
these ratios; the average of the four quarterly ratios in row (4) is 0.9721, and
each quarter must be multiplied by 1/0.9721. As a result of doing this, the
average of the corrected adjustment factors in row (5), weighted by the five-
year total of the trend, is exactly equal to 1. This in turn ensures that the total
number of bankruptcies over the five years is not changed by the seasonal
adjustment to the component quarters.
The original series is then divided by the adjustment factors in row (5) to
give the seasonally adjusted series. As can be seen in row (6), the low
autumn levels are raised and the higher levels in the other three quarters are
all reduced. This series thus reflects both the trend in bankruptcy and the
remaining irregular fluctuations from quarter to quarter after the seasonal
variations have been removed.
* The data are taken from Julian Hoppit, Risk and Failure in English Business, 1700–1800,
Cambridge University Press, 1987, pp. 187–96. For his own de-seasonalized series Hoppit
used a 61-quarter moving average, so our results are not exactly the same as his, though the
differences are small. For his interesting discussion of the short-term fluctuations in bank-
ruptcy see ibid., pp. 104–21.
The final adjustment, made in row (7), also eliminates the trend, thus
leaving only the irregular fluctuations in a de-trended and de-seasonalized
series. Since in this case there was little movement in the trend over these
years (see row (2)), the series in row (7) does not diverge much from the
series in row (6).
The decomposition of the original data into the three elements is now
easily checked. For example, for the first quarter of 1780, if we multiply the
trend (row (2)), by the recurrent seasonal factor (row (5)), and by the resid-
ual fluctuation (row (7)), and divide the result by 100 we get (133.3 × 1.0938 × 83.03)/100 = 121, which is the value of the original series in row (1).
Notes:
(1) Hoppit, Risk and Failure, pp. 195–6.
(2) 25-quarter moving average calculated on quarterly series for 1777–87.
(3) Row (1)/Row (2).
(4) The five-year total of row (1) divided by the five-year total of row (2).
(5) The quarterly correction factors in row (4) average 0.9721, and each is raised by 1/0.9721. This ensures that the five-year total for row (6) will be the same as for row (1).
(6) Row (1)/Row (5).
(7) Row (1)/(Row (2) × Row (5)) × 100.
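For readers who want to see the whole row-by-row procedure in code, here is a sketch in Python. The quarterly figures are invented (Hoppit's bankruptcy data are not reproduced here), the trend is a simple fitted straight line rather than a 25-quarter moving average, and the row (5) correction uses a plain unweighted mean, so the numbers are purely illustrative; the sequence of steps is what matters.

    # Invented quarterly series, five years of four quarters (illustration only)
    data = [121, 96, 74, 118, 125, 99, 70, 122, 130, 104,
            78, 127, 133, 108, 81, 131, 138, 112, 84, 136]
    n_years, n_q = 5, 4
    n = len(data)

    # Row (2): a trend value for each quarter. Panel 1.2 uses a 25-quarter
    # centred moving average; here we fit a simple linear trend instead.
    xbar = (n - 1) / 2
    ybar = sum(data) / n
    slope = (sum((i - xbar) * y for i, y in enumerate(data))
             / sum((i - xbar) ** 2 for i in range(n)))
    trend = [ybar + slope * (i - xbar) for i in range(n)]

    # Row (4): ratio of the five-year totals, quarter by quarter
    ratios = [sum(data[y * n_q + q] for y in range(n_years))
              / sum(trend[y * n_q + q] for y in range(n_years))
              for q in range(n_q)]

    # Row (5): correct the factors so that they average exactly 1
    # (the panel weights this correction by the trend totals; for brevity
    # we use the simpler unweighted mean of the four ratios)
    mean_ratio = sum(ratios) / n_q
    factors = [r / mean_ratio for r in ratios]

    # Row (6): seasonally adjusted series = original / seasonal factor
    adjusted = [data[i] / factors[i % n_q] for i in range(n)]

    # Row (7): de-trended and de-seasonalized residuals, as per cent of trend
    residual = [100 * data[i] / (trend[i] * factors[i % n_q]) for i in range(n)]

    for i in range(n):
        print(f"{data[i]:4d} {trend[i]:8.2f} {factors[i % n_q]:7.4f}"
              f" {adjusted[i]:8.1f} {residual[i]:7.1f}")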
Notes
1 Many statistical textbooks treat these as two separate levels of measurement, with the ratio as the highest level. The distinction between them is that the interval level does not have an inherently determined zero point. However, this issue is hardly ever encountered in historical research; the example usually given relates to temperature scales in which 0° is an arbitrarily defined point on the scale but does not represent an absence of temperature.
2 In some contexts it might be appropriate to think of an infinite as opposed to a finite population. In the case of the mills this would consist of all the mills which might hypothetically ever have been constructed in the past or which will be constructed in future; see further §9.2.3 below.
3 For a brief introduction to general sampling procedures, see H. M. Blalock, Social Statistics, 2nd edn., McGraw-Hill, 1979. R. S. Schofield, 'Sampling in historical research', in E. A. Wrigley (ed.), Nineteenth-Century Society, Essays in the Use of Quantitative Methods for the Study of Social Data, Cambridge University Press, 1972, pp. 146–90, is an excellent discussion of sampling procedures in an historical context.
4 It is not necessary to know any more about the constant e, but for those who are curious it is equal to the limiting value of the exponential expression (1 + 1/n)^n as n becomes infinitely large.
8 Among the classic studies of business cycles are T. S. Ashton, Economic Fluctuations in England, 1700–1800, Oxford University Press, 1959; A. D. Gayer, W. W. Rostow and A. J. Schwartz, The Growth and Fluctuation of the British Economy, 1790–1850, Oxford University Press, 1953; and Arthur F. Burns and W. C. Mitchell, Measuring Business Cycles, NBER, 1947.
9 Solomos Solomou, Phases of Economic Growth, 1850–1973, Kondratieff Waves and Kuznets Swings, Cambridge University Press, 1987, is a sceptical investigation of the evidence for these longer fluctuations in Britain, Germany, France, and the United States.
10 One such technique is known as spectral analysis; for a brief non-mathematical introduction see C. W. J. Granger and C. M. Elliott, 'A fresh look at wheat prices and markets in the eighteenth century', Economic History Review, 20, 1967, pp. 257–65.
11 The procedure for fitting a simple linear trend is described in §4.2.5. The more widely used trend based on a logarithmic transformation is explained in panel 12.1 of chapter 12, and other non-linear trends (such as the one illustrated in figure 1.3) are introduced in §12.4.
12 E. A. Wrigley and R. S. Schofield, The Population History of England 1541–1871, Edward Arnold, 1981, pp. 402–53.
13 Examples of historical studies based heavily on patterns determined by such a distinction between the trend and the cycle include Brinley Thomas, Migration and Economic Growth, Cambridge University Press, 1954; A. G. Ford, The Gold Standard 1880–1914, Britain and Argentina, Oxford University Press, 1962; and Jeffrey G. Williamson, American Growth and the Balance of Payments 1820–1913, A Study of the Long Swing, University of North Carolina Press, 1964.
14 There are alternative methods for calculating seasonally adjusted series using more advanced statistical techniques; one such method is described in panel 12.2 in §12.4.
Descriptive statistics
Table 2.1 Per capita expenditure on relief in 24 Kent parishes in 1831 (shillings)
Table 2.2 Frequency, relative frequency, and cumulative frequency of per capita relief
payments in Kent in 1831
Note:
In column (1) ≥ stands for 'equal to or greater than', and < for 'less than'.
[Figure: per capita relief payments in parishes grouped by county (Essex, Norfolk, Suffolk, Cambs, Beds, Sussex, Kent)]
Table 2.3 Frequency of per capita relief payments in 311 parishes in 1831
Note:
In column (1) ≥ stands for 'equal to or greater than', and < for 'less than'.
Figure 2.2 Histograms of per capita relief payments in 311 parishes in 1831
(a) With 8 equal class intervals, each 6 shillings wide
(b) With 12 equal class intervals, each 4 shillings wide
(c) With 6 class intervals of unequal width
In this case we could select these (or other) class intervals with equal
widths because we have the data for every individual parish in the data set.
But what would we do if the data had already been grouped before it was
made available to us, and the class intervals were not of equal width?
Assume, for example, that the information on relief payments in the 311
parishes had been published with only six class intervals, the first four with
a width of 6 shillings as in table 2.3, and the last two with a width of 12 shil-
lings (i.e. with absolute frequencies of 63 and 8, and relative frequencies of
20.26 and 2.57).
Since the width of these last two intervals has been doubled, they must be
shown on the histogram with their height set at only half that of the 6 shil-
ling intervals. The areas (i.e. width × height) of the two sets of intervals will
then be proportional. In figure 2.2 (c) the original eight-class equal-width
histogram of figure 2.2 (a) is reproduced, with the alternative rectangles for
the two wider class intervals superimposed on the last four intervals.
The height of the first of these 12-shilling intervals is set at 31.5, equal to
half the 63 parishes in this class; the height of the second at 4, equal to half
of the eight parishes in this final class. It is easy to see that when done on
this basis the area of each of the new rectangles is exactly equal to that of the
two corresponding rectangles in the previous version.
a We also need to assume that we can increase the number of cases we are dealing with (i.e. that we
can work with very much larger numbers than the 311 parishes in our Poor Law data set). By
increasing the number we avoid the irregularities in the distribution that might occur with small
numbers of cases in narrow class intervals.
Figure 2.3 Per capita relief payments in 311 parishes in 1831
(a) Frequency polygon
(b) Cumulative frequency distribution
to 1. This simple but important idea of the area under a smooth frequency
curve will be extremely useful in relation to the normal distribution, a
concept which plays a major part in many of the statistical techniques we
employ later in this book.
A different form of frequency polygon can also be used to display the
cumulative frequency distribution defined in §2.1.1. This diagram (also
known as an ogive) is illustrated in figure 2.3 (b) with the data from
column (4) of table 2.3 for the per capita relief payments by the 311 par-
ishes. The scale shown on the horizontal axis indicates the upper limit of
the successive class intervals, and that on the vertical axis indicates both the
cumulative relative and cumulative absolute frequency. (The values for the
latter are obtained by cumulating column (2) of table 2.3.)
The curve of the cumulative frequency distribution traces either the
proportion or the number of cases that are less than the corresponding
value of the variable shown on the horizontal axis. Since there is an exact
correspondence between the absolute and relative frequencies, the two
graphs are identical. For example, the position marked off by the broken
lines in figure 2.3 (b) indicates that relief payments of less than 18 shillings
per person were made by 177 parishes, corresponding to a proportion of
0.569 (56.9 per cent). The s-shape is characteristic of such cumulative fre-
quency diagrams.
(a) Which are the central (i.e. the most common or typical) values within
the distribution?
(b) How is the distribution spread (dispersed) around those central
values?
(c) What is the shape of the distribution?
Each of these features can be described by one or more simple statistics.
These are the basic elements of descriptive statistics, and together they
provide a precise and comprehensive summary of a data set. We shall now
look at each in turn.
There are three principal measures that are used to locate the most common or most typical cases. They are referred to collectively as measures of central tendency.

(a) The arithmetic meanb is obtained by adding the values of all the observations and dividing by the number of observations

\bar{X} = \frac{\sum X_i}{n}    (2.1)

where n is the number of observations in the data set for which there are values for the variable, X.c
(b) The median is the value that has one-half of the number of observa-
tions respectively above and below it, when the series is set out in an
ascending or descending array. Its value thus depends entirely on whatever
happens to be the value of the observation in the middle of the array; it is
not influenced by the values of any of the other observations in the series.
For example, if there are five observations, the median is the value of the
third observation, and there are two cases above this and two below. When
there is an even number of observations, the average of the two middle
observations is taken as the median.
(c) The mode is the value that occurs most frequently (i.e. is most fash-
ionable). When data are grouped in class intervals the mode is taken as
the mid-point of the category with the highest frequency. In a frequency
distribution the mode is represented by the highest point on the curve. A
distribution may have more than one mode; one with two modes is
described as bimodal.
b This measure is frequently referred to simply as the mean. When no further qualification is given the term 'mean' can be understood to refer to the arithmetic mean. However, it is also possible to calculate other means, such as a geometric mean, and when there is any possibility of confusion the full titles should be used. The geometric mean is calculated by multiplying all the values and taking the appropriate root: thus if there are five terms in a series (X_1, X_2, ..., X_5) the geometric mean would be the fifth root of their product:

\sqrt[5]{X_1 X_2 X_3 X_4 X_5}

c We will follow the convention of using lower case n to refer to the number of observations in a sample. The symbol for the corresponding number in the population is upper case N.
The three primary measures can be illustrated with the following simple
data set containing seven observations set out in ascending order:
3, 5, 5, 7, 9, 10 and 45
The arithmetic mean = \sum X / n = 84/7 = 12d
The median = 7
The mode = 5
The most useful of these measures for most purposes are the mean and
the median. The mean is generally to be preferred because it is based on all
the observations in the series, but this can also be its weakness if there are
some extreme values. For example, in the series just given the mean is
raised by the final very large value, whereas this observation has no effect
on the median. It is a matter of judgement in any particular study whether
or not to give weight to the extreme values in reporting a measure of central
tendency. For the data on RELIEF in table 2.1 the mean is 20.3 and the
median 19.4.
The mode is the only one of these three measures that can be used with
either nominal or ordinal level measurements.
d The geometric mean is 8.02, indicating that the extreme value (45) has less effect on this measure
than on the arithmetic mean.
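All three measures, and the geometric mean of footnote d, can be verified with Python's standard statistics module:

    import math
    import statistics

    data = [3, 5, 5, 7, 9, 10, 45]

    print(statistics.mean(data))     # arithmetic mean: 12
    print(statistics.median(data))   # median: 7
    print(statistics.mode(data))     # mode: 5

    # Geometric mean (footnote d): the seventh root of the product of the values
    print(math.prod(data) ** (1 / len(data)))   # approximately 8.02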
Table 2.4 Population and migration for four Irish counties in 1881 and 1911
Conversely Carlow has the highest rate of migration but the smallest popu-
lation. If we ignore these differences in the migration rates and size of the
counties and calculate the simple mean we get an average of 16.9 migrants
per 1,000 (as shown at the foot of table 2.4).
If we wish to take these differences into account, we must instead use a
weighted arithmetic mean. This requires that each county is given an
appropriate weight, in this case its population size in 1881. The weighted
mean is then calculated by multiplying each rate of migration by its weight,
and dividing the total of these products by the sum of the weights. The cal-
culation is set out at the foot of table 2.4, and the result is 10.7 per 1,000.
This is considerably lower than the unweighted mean, and gives a much
more accurate measure of the overall rate of migration from these
counties.
More generally, the formula for a weighted mean, \bar{X}, is

\bar{X} = \frac{\sum (w_i X_i)}{\sum w_i}    (2.1a)

where w_i is the appropriate weight for each observation, X_i, in the series.
We will also introduce one particular form of weighted average that is
very widely used in demographic and medical history, the standardized rate.
If we look at the 1911 data in columns (3) and (4) of table 2.4 we see that the
emigration rates have fallen sharply in all four counties, and that the popula-
tion in Dublin has increased since 1881 while that in the other three counties
has fallen. The weighted average rate of CNTYMIG for 1911 is only 3.0.
However, this decline as compared to 10.7 in 1881 is partly a result of
changes in migration from the individual counties, and partly a result of
changes in the relative importance of the different counties. If we wish to
get a summary measure of the rate of emigration which is unaffected by the
population changes we can calculate a standardized rate. In this case we
would ask, in effect: what would the 1911 rate have been if the distribution
of the population by county were fixed at the proportions of 1881? The
answer is obtained by weighting the four 1911 county rates by their respec-
tive 1881 populations, and is 3.2.
This procedure has very wide application, with the possibility of stan-
dardizing a variety of rates for one or more relevant factors. For example,
crude birth rates per 1,000 women are frequently standardized by age in
order to exclude the effects of changes in the age composition of the female
population. Crude death rates per 1,000 of the population might be stan-
dardized by both age and occupation; suicide rates by both geographical
region and gender; marriage rates by social class; and so on. In every case
the standardized rate is effectively a weighted average of a particular set of
specific rates such as the age-specific birth rates or the age- and occupa-
tion-specific death rates.
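Formula (2.1a) and the standardization procedure translate directly into code. In the Python sketch below the county populations and migration rates are invented stand-ins, since table 2.4 itself is not reproduced here; only the logic of the calculation is taken from the text.

    def weighted_mean(values, weights):
        """Formula (2.1a): sum of (w_i * X_i) divided by sum of w_i."""
        return sum(w * x for w, x in zip(weights, values)) / sum(weights)

    # Hypothetical data for four counties (invented, not table 2.4's figures)
    pop_1881  = [47000, 419000, 96000, 282000]   # 1881 populations
    rate_1881 = [28.0, 7.5, 22.0, 14.0]          # migrants per 1,000 in 1881
    rate_1911 = [9.0, 1.5, 6.5, 4.0]             # migrants per 1,000 in 1911

    print(sum(rate_1881) / len(rate_1881))     # simple (unweighted) mean
    print(weighted_mean(rate_1881, pop_1881))  # mean weighted by 1881 population

    # Standardized 1911 rate: 1911 rates weighted by the 1881 populations,
    # holding the county distribution of population fixed at 1881 proportions
    print(weighted_mean(rate_1911, pop_1881))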
(a) Quartiles: Instead of dividing the observations into two equal halves
we can divide them into four equal quarters. The first quartile is then
equal to the value that has one-quarter of the values below it, and three-
quarters above it. Conversely the third quartile has three-quarters of the
values below it, and one-quarter above. The second quartile has two quar-
ters above and two quarters below, and is thus identical to the median.
(b) Percentiles and deciles: Other common divisions are percentiles,
which divide the distribution into 100 portions of equal size, and
deciles, which divide it into 10 portions of equal size. So if you are told
at the end of your course on quantitative methods that your mark is at
the ninth decile, you will know that nine-tenths of the students had a
lower mark than you, and only one-tenth had a better mark.
The fifth decile and the fiftieth percentile are the same as the median; the
25th and 75th percentiles are the same as the first and third quartiles.
2.3.2 The mean deviation, the variance, and the standard deviation
If the objective is to get a good measure of the spread of all the observations
in a series, one method which suggests itself is to find out how much each
value differs from the mean (or some other central or typical value). If we
then simply added up all these differences, the result would obviously be
affected by the number of cases involved. So the final step must be to take
some kind of average of all the differences.
The three following measures all do this in varying ways.
(a) The mean deviation is obtained by calculating the difference between each observation (X_i) and the mean of the series (\bar{X}), and then finding the average of those deviations.
In making this calculation the sign of the deviation (i.e. whether the
value of the observation is higher or lower than the mean) has to be
ignored. If not, the sum of the positive deviations would always be exactly
equal to the sum of the negative deviations. Since these two sums would
automatically cancel out, the result would always be zero.
The mean deviation does take account of all the observations in the
series and is easy to interpret. However, it is less satisfactory from a theoret-
ical statistical point of view than the next two measures.
(b) The variance uses an alternative way of getting rid of the negative
signs: it does so by calculating the square of the deviations.e It is then cal-
culated by finding the mean of the squared deviations. This is equal to
the sum of the squared deviations from the mean, divided by the size of
the sample.f
THE VARIANCE

s^2 = \frac{\sum (X_i - \bar{X})^2}{n}    (2.2)
e Taking squares solves the problem because one of the rules of elementary arithmetic is that when a negative number is multiplied by another negative number the result is a positive number; e.g. −4 × −4 = 16.
f When we get to the deeper waters of chapter 5 we will find that the statisticians recommend that n should be replaced by n − 1 when using sample data in the calculation of the variance and certain other measures. We will ignore this distinction until we reach that point. Note, however, that if you use a computer program or calculator to check some of the calculations done 'by hand' in chapters 2–4 you will get different results if your program uses n − 1 rather than n. The distinction between the two definitions matters only when the sample size is small. If n = 30 or more there would be practically no difference between the two estimates.
Table 2.5 Calculation of the sample variance for two sets of farm incomes

                        County A                           County B
              X_i    (X_i − X̄)   (X_i − X̄)²     X_i    (X_i − X̄)   (X_i − X̄)²
              14        −4           16            2       −16          256
              16        −2            4            8       −10          100
              18         0            0           18         0            0
              20         2            4           29        11          121
              22         4           16           33        15          225
Sum           90         0           40           90         0          702

n = 5
Mean (X̄)            18                                  18
Variance (s²)         8                                 140.4
The calculation of the variance for the simple data set given above for
the hypothetical farming incomes in the two counties is set out in table 2.5.
The variance of County A = 40/5 = 8, while that of County B = 702/5 = 140.4, which is very much greater. Note that the result of squaring the devia-
tions from the mean is that the more extreme values (such as incomes of 2
dollars or 33 dollars) have a very large impact on the variance.
This example was deliberately kept very simple to illustrate how the var-
iance is calculated and to show how it captures the much greater spread of
incomes in County B. In a subsequent exercise we will take more realistic
(and larger) samples, and use a computer program to calculate the vari-
ance.
The variance has many valuable theoretical properties and is widely
used in statistical work. However, it has the disadvantage that it is expressed
in square units. In the farm income example we would have to say that the
variance in County A was 8 ‘squared dollars’, which is not very meaningful.
The obvious way to get rid of these awkward squared units is to take the
square root of the variance. This leads to our final measure of dispersion.
(c) The standard deviation is the square root of the variance. Thus:

THE STANDARD DEVIATION is equal to the square root of the arithmetic mean of the squared deviations from the mean

s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}    (2.3)
The standard deviation is the most useful and widely used measure of
dispersion. It is measured in the same units as the data series to which it
refers. Thus we can say that the mean income in County A is 18 dollars,
with a standard deviation (s.d.) of 2.8 dollars. Such results will often be reported in the form: X̄ = 18, s.d. = 2.8, or in even more summary form as: X̄ = 18 ± 2.8.
The standard deviation can be thought of as the average or typical
(hence standard) deviation from the mean. Thus it will be seen that in
County A the deviations from the mean (see column (2) of table 2.5) vary
between 0 and 4, and the standard deviation is 2.8. Similarly in County B
the spread of the deviations in column (5) is from 0 to 16, and the standard
deviation is 11.8. The standard deviation thus lies somewhere between the
smallest and largest deviation from the mean.
The variance and the standard deviation have several mathematical
properties that make them more useful than the mean deviation. Both play
an extremely important role in many aspects of statistics.
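The calculations in table 2.5 can be checked directly. This Python sketch uses the definitions with n in the denominator (see footnote f), and so reproduces the variances of 8 and 140.4 exactly; a package that defaults to n − 1 would report slightly larger values.

    import math

    def variance(xs):
        """Mean of the squared deviations from the mean (divisor n)."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    county_a = [14, 16, 18, 20, 22]
    county_b = [2, 8, 18, 29, 33]

    print(variance(county_a))              # 8.0
    print(variance(county_b))              # 140.4
    print(math.sqrt(variance(county_a)))   # standard deviation: about 2.8
    print(math.sqrt(variance(county_b)))   # about 11.8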
deviation was £17.50 (these are very rough approximations to the actual
data). Because growth and inflation have completely altered the level of
wage payments, it is impossible to tell from this whether the dispersion of
wages was larger or smaller in the later period.
To do this we need a measure of relative rather than absolute variation.
This can be obtained by dividing the standard deviation by the mean. The
result is known as the coefficient of variation, abbreviated to CV (or cv)
CV = s / \bar{X}    (2.4)
To illustrate this procedure, consider the data on per capita relief pay-
ments in table 2.3. Since there are 311 parishes, the median parish will be
the 156th. There are 84 parishes in the first two class intervals, so we need
another 72 to reach the median parish, and it will thus fall somewhere in
the third class interval, which has a lower limit of 12 shillings. There are a
total of 93 parishes in that class interval and, by assumption, they are
spread at equal distances along the interval of 6 shillings. The median
parish will therefore occur at the value equal to 12 shillings plus 72/93 of 6
shillings, or 16.6 shillings.i
For measures involving the mean, the corresponding assumption is that
all the cases within the group have a value equal to the mid-point of the
group. Thus we take a value of 15 shillings for all 93 parishes in table 2.3
with relief payments between 12 and 18 shillings, and similarly for the
other intervals.j
Let us denote each of these mid-points by X_i, and the frequency with which the cases occur (as shown in column (2) of table 2.3) by f. These two values can then be multiplied to get the product for each class interval, fX_i. The sum of these products is thus \sum fX_i, and the formula for the mean with grouped data is then this sum divided by the sum of the frequencies, \sum f

\bar{X} = \frac{\sum f X_i}{\sum f}    (2.5)
There are two points to note with regard to this formula. First, for the data in table 2.3, the denominator in this formula, \sum f, is 311, which is precisely what n would be when calculating the mean with ungrouped data
using the formula in (2.1). Secondly, the procedure is exactly equivalent to
the weighted arithmetic mean introduced in §2.2.1, as can be readily seen
by comparing this formula with the one in (2.1a). The mid-points of the
class intervals correspond to the observations in the series, and the fre-
quencies become the weights.
The variance with grouped data is calculated in the same way, with the squared deviations of the mid-points from the mean weighted by the frequencies

s^2 = \frac{\sum f (X_i - \bar{X})^2}{\sum f}    (2.6)

The standard deviation can similarly be calculated as the square root of the variance

s = \sqrt{\frac{\sum f (X_i - \bar{X})^2}{\sum f}}    (2.7)
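Both grouped-data calculations are easily coded. The median interpolation below uses only figures quoted in the text and reproduces the 16.6 shillings; for the grouped mean, the full column of frequencies is not quoted above, so the frequency list is a hypothetical stand-in (it is, however, consistent with the class totals that are quoted).

    # Interpolated median from the figures in the text: the median parish is
    # the 156th of 311; 84 parishes lie below the 12-18s. class, 93 within it
    lower, width = 12, 6
    position, below, in_class = 156, 84, 93
    median = lower + (position - below) / in_class * width
    print(round(median, 1))   # 16.6 shillings

    # Grouped mean, formula (2.5): class mid-points weighted by frequencies.
    # The frequencies are invented (only some class totals are quoted above).
    midpoints = [3, 9, 15, 21, 27, 33, 39, 45]
    freqs     = [30, 54, 93, 63, 40, 23, 6, 2]   # hypothetical; sums to 311
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
    print(round(mean, 2))   # about 17.55 with these invented frequencies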
k Such distributions are sometimes loosely referred to as log-normal. A true log-normal distribu-
tion is one that is strongly positively skewed when the data are entered in ordinary numbers and
normally distributed when the data are converted to logs.
Figure 2.4 Symmetrical and skewed frequency curves
with a tail made up of a small number of very low values (negative skew-
ness). This would be characteristic of data on the age of death in Britain or
the United States in any normal peacetime year. There would be a small
number of deaths at low ages as a result of accidents and illness, but the
great majority would occur at ages 60–90 and the right hand tail would
effectively end abruptly a little above age 100.
Figure 2.6 Histogram of per capita relief payments in 311 parishes in 1831 with normal curve superimposed
The superimposed curve is the theoretical normal curve with location and
spread given by the mean and standard deviation of that data. It is thus possible to
see how far the data conform to the normal distribution. This is done, for
example, in figure 2.6 with the histogram of parish relief payments previously
shown in figure 2.2 (b), and it can be seen that it is a reasonably good
approximation.
The fact that many actual distributions approximate the theoretical
normal distribution enables statisticians to make extensive use of the prop-
erties of the theoretical normal curve. One obvious property that we have
already noted implicitly is that the mean, median, and mode are all equal;
they coincide at the highest point of the curve and there is only one mode.
A second property of considerable importance relates to the area under
the normal curve. Irrespective of the particular mean or standard devia-
tion of the curve it will always be the case that a constant proportion of all the
cases will lie a given distance from the mean measured in terms of the standard
deviation. It is thus possible to calculate what the proportion is for any par-
ticular distance from the mean expressed in terms of standard deviations (std
devs).
● The distance of 1 std dev from the mean covers 68.26 per cent of all
cases
● The distance of 2 std devs from the mean covers 95.45 per cent of all
cases
● The distance of 3 std devs from the mean covers 99.73 per cent of all
cases

Since the distribution is perfectly symmetrical we also know that exactly
one-half of the above proportions are to the right of (greater than) the
mean and one-half are to the left of (smaller than) the mean. Thus, for
example:

● 90 per cent of all cases are within 1.645 std devs either side of the mean,
leaving 5 per cent in each of the two tails
● 95 per cent of all cases are within 1.96 std devs either side of the mean,
leaving 2.5 per cent in each of the two tails
● 99 per cent of all cases are within 2.58 std devs either side of the mean,
leaving 0.5 per cent in each of the two tails
To make this concrete, consider the heights of the sample of 40 male students in table 2.6, with a mean of 72 inches (6 ft) and a standard deviation of 5 inches.l If the heights are approximately normally distributed we know, for example:

● Roughly two-thirds are between 5 ft 7 in and 6 ft 5 in tall (the mean ± 1 std dev)
● That only a small minority, less than 5 per cent, are shorter than 5 ft 2 in or
taller than 6 ft 10 in (the mean ± 2 std devs)
These results should seem plausible, and should help you to understand
how to interpret information about the standard deviation for (approxi-
mately normal) distributions that are less familiar than heights.
This leads directly to the idea that areas under the normal curve can be
thought of in terms of the number of standard deviations.
$$Z = \frac{(X_i - \bar{X})}{s} \qquad (2.10)$$
l
For those more accustomed to think in metric units, the equivalent units are a mean of 180 cm
and a standard deviation of 13 cm. The height of roughly two-thirds of the students is thus
between 167 and 193 cm, and fewer than 5 per cent are either shorter than 154 cm or taller than
206 cm.
Table 2.6 Heights of a sample of 40 male students and the standardized distribution (n = 40)

Note: The standard deviation of these heights is s = √25 = 5.00. This standard deviation is then used in column (4)
to calculate Z. For Z, Z̄ = ΣZ/n = 0.00/40 = 0.0, and this is used in column (5) to calculate the variance
of Z = Σ(Zi − Z̄)²/n = 40/40 = 1.00. The standard deviation of Z is therefore √1.00 = 1.00.
n
A more complete table covering all values of Z can be consulted in D. V. Lindley and W. F. Scott,
New Cambridge Statistical Tables, Cambridge University Press, 2nd edn., 1995, p. 34, and in most
general statistics books.
Table 2.7 Areas under the standard normal curve for selected values of Z
Note:
The two-tailed area in column (5) refers to both the positive and the negative values of Z.
Source: Lindley and Scott, Statistical Tables, p. 34 for column (2); other columns as explained
in the text.
You will find that authors usually give only one of these five possible
proportions, and different authors choose different proportions. This
divergence in the way the table is printed is confusing, and means that if
you want to use one of these tables you must first establish the form in
which the information is presented.
The proportions are normally given to 3 or 4 decimal places. It is often
more convenient to think in terms of percentages, and these are obtained
by multiplying the proportions by 100 (i.e. moving the decimal point 2
places to the right): for example, 0.500 = 50 per cent.
Once the form in which the data are presented in any particular table is
clarified, it is a simple matter to find the proportion up to or beyond any
given value of Z. Take, for instance, the student in table 2.6 with a height of
77 inches. This corresponds to a value for Z of 1.0, and is thus 1.0 standard
deviations from the mean height of 72 inches. From column (4) of table 2.7
it can be seen that the proportion greater than this value of Z is 0.1587, or
15.9 per cent. Since there are 40 students this indicates that if the heights of
this sample are normally distributed there should be approximately six stu-
dents (15.9 per cent of 40) taller than this one. Table 2.6 in fact has six taller
students.
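The same look-up can be done without a printed table by computing the standard normal distribution function, which is available in Python through math.erf. This is a sketch of the worked example just given, not a replacement for table 2.7.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Proportion of a standard normal distribution below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# The student of table 2.6: height 77 inches, sample mean 72, s = 5.
z = (77 - 72) / 5                     # Z = 1.0, as in equation (2.10)
upper_tail = 1 - normal_cdf(z)        # proportion taller than Z = 1.0
print(round(upper_tail, 4))           # 0.1587, as in table 2.7
print(round(40 * upper_tail))         # about 6 of the 40 students
```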
Notes
1
Note that the precise class intervals depend on the extent to which the underlying
data have been rounded. The data in table 2.1 were rounded to 1 decimal place, so the
true upper limit of the first class interval in table 2.2 would be 9.94 shillings. All
values from 9.90 to 9.94 would have been rounded down to 9.9 shillings, and so
would be included in the interval 5 but <10. All values from 9.95 to 9.99 would
have been rounded up to 10 shillings, and so would be included in the interval 10
but <15.
2
If a perfect die is thrown once the probability of obtaining a six is 1/6, and the prob-
ability of obtaining each of the other face values from 1 to 5 would be exactly the
same, so these outcomes would represent a rectangular distribution. A pyramid-
shaped distribution would be obtained for the probability of each value if the die
were thrown twice. With each successive throw beyond two the distribution moves
towards the bell-shape of the normal curve. For the corresponding successive distri-
butions when the probabilities relate to tossing increasing numbers of coins, see
figure 5.1.
[Table for this question: proportions of single and married women, 1911 and 1991; the body of the table is not reproduced here.]
Note: Single includes single, widowed, and divorced women. Married refers to currently married women.
Sources: Census of Population, England and Wales, 1911 and 1991.
Wealth per person (£)    Number of parishes    Population
0 but <2                         14                26,023
2 but <3                         44                90,460
3 but <4                         83               123,595
4 but <5                         58                76,421
5 but <6                         48                48,008
6 but <7                         25                19,811
7 but <8                         16                11,086
8 but <9                          6                 3,820
9 but <10                         9                 5,225
10 and over                       8                 2,634

Note: No other information is available.
(i) Calculate total wealth in each class interval using (a) the number of
parishes, and (b) the population, as the weighting system. What
assumptions did you make about the level of wealth per person in
each class interval and why?
(ii) Calculate the mean, median, mode, upper and lower quartiles, vari-
ance, standard deviation, and coefficient of variation for total
wealth.
(iii) Change your assumption about the level of wealth per person in the
top and bottom classes and recalibrate the measures of central tendency
and dispersion.
(i) Calculate the average level of wealth per person for each county.
Repeat the exercise of question 5, using both the number of parishes
and the number of people in each county to calculate total wealth.
(ii) Compare the measures of central value and dispersion with those
produced by question 5. Identify and account for any discrepancies.
How do your results in questions 5 and 6 compare to the results of
question 2?
BRTHRATE
INFTMORT
INCOME
DENSITY
(i) In each case, inspect the shapes of the histograms and assess whether
the data are normally distributed, negatively skewed, or positively
skewed. Record your findings.
(ii) Now ask the computer to calculate the degree of skewness in each of
these variables. Compare these results to your own findings.
(i) How many of the women were first married between 20 and 29
years?
(ii) What was the minimum age of marriage of the oldest 5 per cent of
the sample?
(iii) What proportion of the women married at an age that differed from
the mean by more than 1.8 standard deviations?
Correlation
This chapter is devoted to one of the central issues in the quantitative study
of two variables: is there a relationship between them? Our aim is to explain
the basic concepts, and then to obtain a measure of the degree to which the
two variables are related. The statistical term for such a relationship or
association is correlation, and the measure of the strength of that relation-
ship is called the correlation coefficient.
We will deal first with the relationship between ratio or interval level
(numerical) variables, and then look more briefly in §3.3 at the treatment
of nominal and ordinal level measurements.a In this initial discussion we
ignore the further matters that arise because the results are usually based
on data obtained from a sample. Treatment of this important aspect must
be deferred until the issues of confidence intervals and hypothesis testing
have been covered in chapters 5 and 6.
If there is a relationship between the two sets of paired variables (for
example, between the level of relief expenditure (RELIEF) and the propor-
tion of unemployed labourers (UNEMP) in each of the parishes, or
between EMPFOR, the annual series for foreign employment and IRMIG,
the number of immigrants from Ireland), it may be either positive or nega-
tive. When there is positive correlation, high values of the one variable are
associated with high values of the other. When there is negative correla-
tion, high values of the one variable are associated with low values of the
other. In each case the closeness of the relationship may be strong or weak.
The third possibility is, of course, that there is no consistent relation-
ship: high values of one variable are sometimes associated with high values
of the other, and sometimes with low values.
a
These different levels of measurement were explained in §1.3.3.
b
T. H. Wonnacott and R. J. Wonnacott, Introductory Statistics, 5th edn., John Wiley, 1990.
Figure 3.1 Scatter diagrams showing different strengths and directions of relationships
between X and Y: (a) r = +1; (b) r = −1; (c) r = +0.6; (d) r = −0.8; (e) r = 0; (f) r = 0
Perfect correlation
In the upper panel, plot (a) on the left has perfect positive correlation, and
plot (b) on the right shows perfect negative correlation. In each case all the
points fall on a perfectly straight line.
Such very strong association would seldom, if ever, be encountered in
the social sciences. It might, however, occur in a controlled physical experi-
ment; for example, X in plot (a) might be a measure of the flow of water
into a closed tank, and Y the height of the water in the tank.
No linear relationship
The lower panel shows two examples where there is no linear relationship.
They are, however, very different. In (e) on the left there is no relationship
of any kind: the plots are all over the graph. In (f) on the right there very
clearly is a relationship, but it is not a linear (straight-line) one; it is U-
shaped.
Plot (f) is important as a reminder that the techniques for the measure-
ment of association that we are examining here (and until chapter 12)
relate only to linear relationships. Nevertheless, we will find in practice that
there are a great many interesting relationships that are, at least approxi-
mately, linear. So even with this restriction, correlation is a very useful
technique.
As an illustration of an historical relationship, a scatter diagram based
on the data for 24 Kent parishes reproduced in table 2.1 is given in figure
3.2. UNEMP is shown on the (horizontal) X axis and RELIEF on the (verti-
cal) Y axis.
3.1.3 Outliers
Note that one of the points in figure 3.2 (in the top right) is very different
from all the others. This is known as an outlier. One of the advantages of
plotting a scatter diagram before proceeding to more elaborate procedures
Figure 3.2 Scatter diagram of RELIEF (vertical axis, shillings) against UNEMP (horizontal axis, unemployment, per cent) for the 24 Kent parishes
is that it immediately reveals the presence of such outliers. The first step
should always be to check the data to ensure that the outlier is a genuine
observation and not an error in the original source or in the transcription
of the data. If it is genuine then its treatment in further analysis is a matter
of judgement and there is no simple rule.
If more is known about the circumstances creating the outlier it may be
decided that these represent such exceptional circumstances that the case
will distort the principal purposes of the study, and alternative procedures
should be invoked. This might apply, for example, when two time series
have been plotted on a scatter diagram and it is realized that the outlier rep-
resents a year in which there was some abnormal extraneous event such as a
war or a prolonged strike. In a regional cross-section it may be found that
the particular area includes a major city and that this is distorting the
results of what is intended to be primarily a study of rural areas.
Alternatively, it may be judged that despite the extreme values the
outlier is a fully representative case and should remain as one of the obser-
vations in the analysis and in the calculation of further statistical measures.
These and other issues relating to the treatment of outliers are consid-
ered in §11.3.4.
In figure 3.1, r equals +0.6 for the modest positive relationship in plot (c),
and it equals −0.8 for the stronger negative relationship in plot (d). If we
calculate the correlation coefficient for UNEMP and RELIEF in the 24 Kent
parishes in figure 3.2 the result is found to be +0.52.
THE CORRELATION COEFFICIENT, rc
is a measure of the
degree of linear relationship between two variables.
c
Or, more formally, as the Pearson product-moment correlation coefficient.
Positive correlation: large values of X with large values of Y, or
small values of X with small values of Y.
Negative correlation: large values of X with small values of Y, or
small values of X with large values of Y.
We have also seen in figure 3.1 that when there is perfect positive or neg-
ative correlation, the points cluster along the straight line (plots (a) or (b)).
The further the points deviate from this line the weaker is the correlation
between the two variables, until we reach the case in plot (e) where there is
no possible single line that would be a good fit through all the points.
To get a numerical measurement it is necessary to give a more precise
content to the scheme set out above, and to this sense of deviations from
the line of perfect correlation. What do we mean by ‘large’ or ‘small’ for any
particular pair of variables? Given the techniques that have already been
Figure 3.3 Deviations from the mean: four pairs of values of X and Y plotted in quadrants I–IV, defined by a vertical line at X̄ = 8 and a horizontal line at Ȳ = 10

Xi     Yi
 4      6
 6     11
10      7
12     16
Mean: X̄ = 8    Ȳ = 10
These four sets of points are plotted in figure 3.3, which is divided into four
quadrants by drawing a horizontal line through the mean value of Y
(Ȳ = 10) and a vertical line through the mean value of X (X̄ = 8).
The first pair of values (4, 6) appears in quadrant III, with both observations
below their respective means. Thus both deviations from the mean
will be negative, and their product will be positive.
● Any pair of values in quadrant I will have both values above their respec-
tive means, and both deviations from the mean will be positive. So the
product of the deviations for all such points will be positive.
● Any point in quadrant III will have both values below their respective
means, and both deviations from the mean will be negative. So the
product of the deviations for all such points will also be positive.
● Any pair of values in quadrants II and IV will have one of the pair above
its mean and one below, and so one deviation will be positive and the
other negative. So the product of the deviations for all such points will
be negative.
(a) We have obtained the total of the combined deviations from the mean
by adding up (summing) all the products of the deviations from the
mean. But this measure is obviously affected by the number of cases in
the data set: the larger the number the bigger the sum. To correct for
this we divide by the number of cases; this is equivalent to taking an
average of the product of the deviations. The resulting measure is
called the covariance of X and Y, or COV (XY).
(b) The covariance is measured in the units in which X and Y are meas-
ured, and is thus awkward to use. If the units were changed (for
example from feet to metres) this would cause an inappropriate
change in the covariance. In order to correct for this, the covariance is
divided by both the standard deviation of X and the standard devia-
tion of Y, thus neutralizing the specific units in which X and Y happen
to be measured.
d
If one negative number is multiplied by another negative number the product is positive; if a
negative number is multiplied by a positive number the product is negative.
COVARIANCE
The covariance of X and Y is the sum of the products of the
deviations of X and Y from their respective means, divided by
the number of cases.

$$\mathrm{COV}(XY) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n} \qquad (3.1)$$
Table 3.1 Calculation of the coefficient of correlation for a sample of four (imaginary)
values of X and Y

       Xi    Yi    (Xi−X̄)   (Yi−Ȳ)   (Xi−X̄)²   (Yi−Ȳ)²   (Xi−X̄)(Yi−Ȳ)
        4     6      −4        −4        16         16          16
        6    11      −2         1         4          1          −2
       10     7       2        −3         4          9          −6
       12    16       4         6        16         36          24
Sum    32    40       0         0        40         62          32
Mean    8    10

$$s_X = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}} = \sqrt{\frac{40}{4}} = 3.16$$

$$s_Y = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n}} = \sqrt{\frac{62}{4}} = 3.94$$

$$\mathrm{COV}(XY) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n} = 32/4 = 8.0$$
f
For the procedure for calculating the deviations from trend see §1.8. Her method of fitting a
trend is the one explained in §12.4.
Table 3.2 Coefficients of correlation with the British business cycle, 1854–1913a
(1) Same year; (2) Social series lagged one yearb
[The body of the table is not reproduced here.]
Notes:
a
The series generally cover a period from the mid-1850s to 1913, but in a few cases the
initial date is later than this. The majority cover only England and Wales but those for con-
sumption of alcohol and for emigration relate to the whole of the United Kingdom.
b
Except for birth rates, which are lagged by 2 years.
c
Excluding deaths from epidemic diseases.
d
Excluding deaths from diarrhoea.
e
Per 100,000 population.
f
Number of paupers relieved indoors (i.e. in workhouses) or outdoors (i.e. in their own
homes) per 1,000 population.
Source: Dorothy Swaine Thomas, Social Aspects of the Business Cycle, Routledge, 1925
prosperity. Her comments on the size of the coefficients show that she was
also sensitive to the issues of interpretation discussed in the previous sub-
section.
Thomas included in her discussion a comparison with the corresponding
results obtained for the United States. Marriage rates were found to
follow business conditions even more closely (r = 0.81) than in England
and Wales. The positive correlation between the cycle and the lagged death
rate was 'contrary to expectation' but was also found in the United States
(r = +0.49); and both countries showed the same positive correlation
between the cycle and lagged infant mortality rates, with a somewhat
higher coefficient in the United States (+0.43).
One of the more interesting contrasts was the relationship of divorce
rates to the cycle. In England and Wales there was no relationship between
the two series (r = −0.02), whereas in the United States they were highly
correlated (r = 0.70). The absence of correlation in the former case was
attributed to the fact that divorce was expensive and restricted to a small
class of wealthy people who were not much affected by the business cycle.
Thomas concluded from her investigation that there was good evidence
that the business cycle caused repercussions in various other spheres of
social activity, and that the interrelationship between these social phenom-
ena raised questions for social theorists as to the part played by group
influences on individual actions. She thought her study ‘brings to light ten-
dencies of great interest for the social reformer’ and that, above all, it points
strongly to the need for further research.3
g
Note that precisely the same result could be produced by using formula (3.2b) but the Spearman
coefficient is much simpler to calculate when dealing with ordinal data.
h
Simon Szreter, Fertility, Class and Gender in Britain, 1860–1914, Cambridge University Press,
1996. The issues to which we refer are analysed in chapter 7, especially pp. 335–50. The main
source for the data is table 35 in the Fertility of Marriage Report in the 1911 Census of Population.
i
The main measure refers to couples married in 1881–5 when the wife was aged 20–24; Szreter
also uses data for families formed at earlier and later ages of marriage and dates.
Table 3.3 Fifth-quintile male occupations rank ordered by completed fertility and
female age at marriage

Occupation                                  Fertility rank   Age-at-marriage rank    d²
Boilermakers                                      1                  21             400
Brass, bronze workers                             2                   5               9
China, pottery manufacture                        3                  12              81
Stone – miners, quarriers                         4                   2               4
Tanners, curriers                                 5                  13              64
Plasterers                                        6                  20             196
Refuse disposal                                   7                  27             400
Ironfoundry labourers                             8                  10               4
Plaster, cement manufacture                       9                  22             169
French polishers                                 10                  30             400
Steel – manufacture, smelting, founding          11                   9               4
Bricklayers                                      12                  24             144
Road labourers                                   13                  16               9
Gas works service                                14                   8              36
Oil (vegetable) – millers, refiners              15                  28             169
Navvies                                          16                  14               4
Agricultural labourers i/c of horses             17                   3             196
General labourers                                18                  15               9
Agricultural labourers i/c of cattle             19                   1             324
Shipyard labourers                               20                  35             225
Shepherds                                        21                   6             225
Agricultural labourers                           22                  18              16
Fishermen                                        23                  34             121
Bargemen, lightermen, watermen                   24                  39             225
Brick, terracotta makers                         25                  19              36
Dock, wharf labourers                            26                  25               1
Coal – mineworkers above ground                  27                  23              16
Builders' labourers                              28                  11             289
Glass manufacture                                29                  31               4
Masons' labourers                                30                   7             529
Coalheavers, coal porters                        31                  29               4
Tinplate manufacture                             32                   4             784
Ship – platers, riveters                         33                  33               0
(The rows for the remaining six occupations, with fertility ranks 34–39, are not reproduced here.)

Source: Szreter, Fertility, Appendix C, pp. 612–13. The rank order is given in Appendix C for
all 195 occupations and the occupations in the fifth quintile were re-ranked from 1 to 39 for
this table.
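For readers who want to reproduce this kind of calculation, here is a minimal Python sketch of the standard rank-correlation formula, $r_s = 1 - 6\sum d^2 / n(n^2 - 1)$ (the formula the chapter refers to as the simpler route for ordinal data). The five-occupation example is hypothetical; the Szreter table itself requires all 39 occupations.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman rank correlation from two sets of ranks, using
    r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical five occupations ranked by fertility and by
# female age at marriage (table 3.3 itself has 39 occupations):
fertility = [1, 2, 3, 4, 5]
marriage_age = [2, 1, 4, 3, 5]
print(spearman_from_ranks(fertility, marriage_age))   # 0.8
```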
Notes
1
The second expression is obtained by filling in the formulae for COV(XY) and the
two standard deviations. Note that when this is done all the terms in n cancel out.
The formula for COV(XY) above the line is divided by n; while those for sX and sY
below the line are each divided by the square root of n, and the product of √n × √n =
n. Thus, both numerator and denominator are divided by n, and so it is eliminated
from the formula.
2
It is beyond the scope of this text to prove that r can never be greater than 1, but it is
perhaps intuitively evident that if there is perfect correlation between X and Y so that
all the points lie on a straight line, then there will be a constant ratio between every
pair of values of X and Y. Let us call this ratio b (as in the equation for a straight line,
Y = a + bX, in §1.5). The ratio between the means of X and Y, and between their deviations
from their respective means, must then also be equal to b. All the terms in
(Yi − Ȳ) in the formula for r can thus be replaced by b(Xi − X̄). If this is done, the numerator
and the denominator in the formula both become bΣ(Xi − X̄)², and so r = 1.
3
Later work, also using correlation coefficients, has generally tended to confirm
Thomas’ findings. See, for example, Jay Winter, ‘Unemployment, nutrition and
infant mortality in Britain, 1920–50’, at pp. 240–45 and 255–6 in Jay Winter (ed.),
The Working Class in Modern British History, Cambridge University Press, 1983, pp.
232–56, on the inverse relationship between infant mortality and unemployment;
and Humphrey Southall and David Gilbert, 'A good time to wed? Marriage and economic
distress in England and Wales, 1839–1914', Economic History Review, 49,
1996, pp. 35–57, on the negative correlation between marriage and unemployment
(with no lag) in a sample of towns. For a graphical analysis of the incidence of crimi-
nal behaviour in England and Wales between 1805 and 1892 see V. A. C. Gattrell and
T. B. Hadden, ‘Criminal statistics and their interpretation’, in E. A. Wrigley (ed.),
Nineteenth-Century Society, Essays in the Use of Quantitative Methods for the Study of
Social Data, Cambridge University Press, 1972, pp. 363–96. Their finding that ‘more
people stole in hard times than good’, while the rate of violent crime was stimulated
by ‘high wages and high employment’ and a consequent ‘higher consumption of
liquor’ also supports Thomas.
Wheat Oats
2. Plot scatter diagrams of the following pairs of variables in the Boyer relief
data set.
(i) In each case, use visual inspection to determine whether the data are
uncorrelated, positively correlated, or negatively correlated, and
indicate the probable strength of the association. Record your
results.
(ii) Now calculate the correlation coefficient. Compare these calcula-
tions with the results of your visual inspection.
3. Plot a scatter diagram of the data on UNEMP and BENEFIT from the
Benjamin–Kochin data set. Are there any outliers? Calculate the correla-
tion coefficient between these two variables, including and excluding the
outlier. What does this calculation tell us about the importance of this
unusual observation?
4. Plot a scatter diagram for the variables RELIEF and UNEMP for the 28
Essex parishes in the Boyer relief data set. Are there any outliers? What cri-
teria did you use to identify any unusual observations? Compare the corre-
lation coefficients for the data with and without any outliers. Interpret your
results.
5. The following data have been extracted from Angus Maddison’s compi-
lation of national income statistics for the nineteenth and twentieth centu-
ries (Monitoring the World Economy, 1820-1992, OECD, 1995, p. 23). What
are the correlation coefficients between the levels of income in 1820 and
1870; 1870 and 1913; and 1820 and 1913?
A critic judges that the probable margins of error in the national income
data, especially for 1820 and 1870, make it ‘advisable to confine attention to
the ranking rather than the actual values’ of income per person. In
response, calculate the Spearman rank correlation coefficient for each pair
of years and compare the results to your previous calculations. Do the
comparisons validate the critic’s judgement?
6. Calculate the Spearman rank correlation coefficients for the following
pairs of variables in the Irish migration data set for 1881:
Calculate the Pearson correlation coefficients for the same pairs of vari-
ables. Interpret your results, being sure to indicate which measure is more
appropriate and why.
Simple linear regression

The aim in this chapter is to extend the analysis of the relationship between
two variables to cover the topic of regression. In this introductory discus-
sion we will deal only with linear (straight-line) relationships between two
variables (bivariate regression). In chapter 8 this analysis will be extended
to include more than two variables (multiple or multivariate regression),
and non-linear relationships will be discussed in chapter 12. As with the
discussion of correlation, problems arising from the use of sample data are
deferred until the issues of confidence intervals and hypothesis testing are
covered in chapters 5 and 6.
In fact we need to know only two pairs of values of X and Y to draw a straight line
since, by definition, all points given by the equation lie on the same straight
line.
As shown in figure 1.1 (see §1.5), b measures the slope of the line.
c
An initial visual answer to the first of these questions was obtained in §3.1.2 from a scatter
diagram. The correlation coefficient (r) was introduced in §3.2 as the measure of the strength
and direction of the association.
The larger b is, the steeper the slope, and thus the larger the magnitude of
the change in Y for any given change in X.
In the specific context of regression, b is known as the regression
coefficient, and the basic purpose of a great deal of quantitative analysis is
directed to the task of establishing the regression coefficients for particular
relationships. It is this coefficient that ideally quantifies the influence of X
on Y and enables us to ‘predict’ how Y will change when X changes.
A primary objective of the quantitative historian and social scientist is
thus to test the propositions (hypotheses) suggested by relevant theories
in order to discover whether predicted relationships between sets of
variables are supported by the data. And, if they are, to measure the
regression coefficients in order to give substance to the theory by specify-
ing exactly how the dependent variable is affected by the explanatory
variable.
Figure 4.1 Deviations from the regression line (the vertical distances between each observed value Yi and the predicted value Ŷi on the line)
deviations of the actual Ys from their predicted values are known as residuals.
The advantage of taking the vertical distance from the line as the measurement
of this deviation or residual (Yi − Ŷi) is that it can then be
compared directly with the corresponding vertical deviation of the point
from the mean – this was the distance adopted in §3.2.2 as the measure of
the total deviation.
However, if the residuals are calculated in this way the positive and neg-
ative values will automatically cancel out. The sum of all the deviations
below the best fitting regression line will equal the sum of all those above
the line. The position would thus be comparable to the one encountered in
measuring dispersion by the mean arithmetic deviation in §2.3.2. There is
a further difficulty: more than one line can be found which satisfies the
condition of equality of positive and negative vertical deviations from the
line.
The way to get round both these problems is to take the square of these
deviations, (Yi − Ŷi)². Since we are interested in all the points in the data set,
we then take the sum of these squared deviations from the regression line,
Σ(Yi − Ŷi)². There is only one line that minimizes this sum of squares of the
vertical deviations, and it is this line that is selected as the 'best fit'.
To show this visually, figure 4.2 repeats figure 4.1, but this time with
the actual squares drawn on each of the deviations for the four pairs of
observations.
Figure 4.2 The squares of the deviations from the regression line

THE LEAST SQUARES REGRESSION LINE
is the line that minimizes
the sum of the squares of the vertical deviations
of all the pairs of values of X and Y
from the regression line.
Table 4.1 Calculation of the regression coefficient for four values of X and Y

       Xi    Yi    (Xi−X̄)   (Yi−Ȳ)   (Xi−X̄)²   (Yi−Ȳ)²   (Xi−X̄)(Yi−Ȳ)
        4     6      −4        −4        16         16          16
        6    11      −2         1         4          1          −2
       10     7       2        −3         4          9          −6
       12    16       4         6        16         36          24
Sum    32    40       0         0        40         62          32
Mean    8    10

$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = 32/40 = 0.80$$

$$a = \bar{Y} - b\bar{X} = 10 - (0.8)(8) = 3.60$$
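The formulae at the foot of table 4.1 translate directly into Python:

```python
X = [4, 6, 10, 12]
Y = [6, 11, 7, 16]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# b = sum of products of deviations / sum of squared deviations of X
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / \
    sum((x - mx) ** 2 for x in X)
a = my - b * mx
print(b, a)   # 0.8 and 3.6, as in table 4.1
```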
For any given data set these two regressions are not the same. If we have
reason to be interested in both these forms of a mutual inter-relationship
(as in some of the examples discussed in §4.1.1), two separate regression
equations have to be found.
The two regressions will have different intercepts and different slopes. If,
for example, the two variables are expenditure on education and national
income, the amount by which national income grows, when there is an
increase in expenditure on education, will not be the same as the amount
by which expenditure on education increases, when there is a rise in
national income.
Figure 4.3 Scatter diagram of RELIEF and UNEMP for the 24 Kent parishes with regression lines (horizontal axis: unemployment, per cent)
as the explanatory variable. The ‘names’ of the actual years covered by the
series (1920, 1921, and so on) have no significance in this context, and we
would in the end obtain the same results if we substituted any other
sequence of numbers that increased by 1 unit each year. The choice of units
for the measure of time will, however, affect the size of the regression
coefficients, as we show below.
Given X and Y, the intercept, a, and the regression coefficient, b, can be
derived by the standard procedure given in §4.2.3, and these two constants
define the trend line. To illustrate the technique we will use the series for
NNP (real net national product in £ million at 1938 factor cost) from the
Benjamin and Kochin data set.3 We do so in two forms: in the first we use
the actual dates (1920–38) as the measure of time (YEAR); in the second
we replace these by the numbers 1–19 (TIME).
When we use YEAR as the explanatory variable, the regression line fitted
by our computer package is

$$\hat{Y} = -132{,}130.60 + 70.54\,\mathrm{YEAR}$$

When we use TIME as the explanatory variable, the regression is

$$\hat{Y} = 3{,}236.33 + 70.54\,\mathrm{TIME}$$
We see that the two results have the same slope but very different inter-
cepts. In both forms the trend line will rise by £70.54 million every year (i.e.
for every increase of 1 unit in either measure of time). But we are trying to
obtain the trend in NNP and so need to know the position of the line in
relation to this series.
If we take the first year, 1920, the equation with YEAR tells us that the predicted
value of the line (Ŷ in the terminology of §4.2.2) will be −132,130.6 +
(70.54 × 1920), which is £3,307 million. For the same year, the alternative
equation with TIME tells us that Ŷ will be 3,236.33 + (70.54 × 1), which is
again £3,307 million. We will similarly get the same result for every other
year, whichever form of the regression we use. The contrast is nevertheless a
useful reminder of the need to think about the units in which the dependent
and explanatory variables are measured before drawing any conclusions
from a regression. We return to this issue in §4.3.3.
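The equivalence of the two forms is easy to verify by computation. The sketch below uses an invented series standing in for NNP (the Benjamin–Kochin data are not reproduced here); the point about units of time holds for any series.

```python
def ols(x, y):
    """Least squares intercept and slope, as in section 4.2.3."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# An invented trending series for 1920-38, standing in for NNP:
nnp = [3300 + 70 * t + (-1) ** t * 25 for t in range(19)]
year = list(range(1920, 1939))
time = list(range(1, 20))

a1, b1 = ols(year, nnp)
a2, b2 = ols(time, nnp)
print(b1 == b2)                          # identical slopes
print(a1 + b1 * 1920, a2 + b2 * 1)       # identical fitted value for 1920
```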
The use of simple linear regression is in many respects an improvement
on the moving average procedure. There are no missing years at the begin-
ning and end of the series, the trend is perfectly smooth, and – if desired – it
could be extrapolated to earlier or later years. However, the method is itself
subject to the important limitation that it can generate only a linear trend,
and can thus be used only when inspection of the data indicates that a
straight line is appropriate. In this case, as can be seen in figure 4.4, the
linear trend does fit the data well. However, if there is clearly evidence of a
non-linear trend, then the calculation must be made with slightly more
advanced methods, notably the log-linear (semi-logarithmic) trend that
will be described in chapter 12 (see especially panel 12.1).
Figure 4.4 NNP, 1920–38, with fitted linear trend (vertical axis: £ million at 1938 factor cost; horizontal axis: year)
In history and the social sciences we always have to settle for something
less than this. However powerful the selected explanatory factor may be,
there will always be a variety of other factors – both systematic and random
– which will also affect the behaviour of the dependent variable (Y).
It is thus necessary to have a measure of how successful the regression
line is in explaining (in the limited sense of ‘accounting for’) the move-
ments in Y. This measure is known as the coefficient of determination (r²)
and is equal to the square of the correlation coefficient. The value of r² is
always between 0 and 1.e The closer it is to 1, the larger the proportion of
the variation in Y that has been ‘explained’ by the movements in the
explanatory variable, X.
The statistical package will quickly find r² for us, but in order to see more
clearly what is happening it will be useful to look briefly at the underlying
logic of the calculation.
The objective is to find a measure of the success with which the move-
ments in Y are explained by the movements in X. We thus need some way to
specify these movements in Y, and we take for this purpose the deviation
from the mean for each value in the series. The justification for this is that if
we knew nothing about the factors determining the behaviour of Y, our
'best guess' for the value of any individual case, Yi, would have to be that it
would be equal to the mean of the series, Ȳ.
e
Since r cannot be greater than ±1, r² cannot be greater than 1², and (−1) × (−1) = (+1) × (+1) = 1.
Figure 4.5 Explained and unexplained deviations: for each observation the total deviation (Yi − Ȳ) is split into a part explained by the regression line (Ŷi − Ȳ) and an unexplained residual (Yi − Ŷi)
The deviations (or variations) for each value of Y can then be analysed
as follows:

(Yi − Ȳ) = the total deviation
(Ŷi − Ȳ) = the part of the deviation predicted or 'explained' by the
regression line
(Yi − Ŷi) = the residual or part left unexplained = (Yi − Ȳ) − (Ŷi − Ȳ)
Alternative terms are sometimes used. The total variation may be called
the total sum of squares. The explained variation may be called either the
explained sum of squares or the regression sum of squares, and the unex-
plained variation may be called either the residual sum of squares or the
error sum of squares.
Equation (4.4) provides the means to measure the coefficient of
determination: r² is the ratio of the explained variation to the total variation,
$r^2 = \sum(\hat{Y}_i - \bar{Y})^2 / \sum(Y_i - \bar{Y})^2$.
Table 4.2 Calculation of the coefficient of determination for four values of X and Y

Note: For column (3), a = 3.60 and b = 0.80 as calculated in table 4.1.
We can now apply this measure to our basic parish example. For RELIEF
and UNEMP in Kent we get an r² of 0.275. Unemployment is thus able to
account for only 27.5 per cent of the variation in payments by these parishes,
and the greater part of the variation has thus to be explained by other
factors. These could include the generosity of the ratepayers, the different
systems of child allowances, the level of income from wages and other
sources such as allotments, the extent of sickness, and the proportion of
old people in the parish.
The actual details of the calculation of r² for the simple four-case data set
are shown in table 4.2.
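The decomposition can be verified in Python for the four-case example, using the values a = 3.60 and b = 0.80 from table 4.1:

```python
X = [4, 6, 10, 12]
Y = [6, 11, 7, 16]
a, b = 3.6, 0.8                      # from table 4.1
my = sum(Y) / len(Y)

y_hat = [a + b * x for x in X]       # predicted values on the line
tss = sum((y - my) ** 2 for y in Y)                     # total
ess = sum((yh - my) ** 2 for yh in y_hat)               # explained
rss = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))     # residual

print(round(tss, 1), round(ess, 1), round(rss, 1))  # 62.0, 25.6, 36.4
print(round(ess / tss, 2))                          # r-squared = 0.41
```

Note that the explained and residual sums of squares add up exactly to the total sum of squares, as the decomposition requires.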
The regression and correlation coefficients are both linked to the covariance:

$$b = \frac{\mathrm{COV}(XY)}{s_X^2} \qquad (4.6)$$

$$r = \frac{\mathrm{COV}(XY)}{s_X s_Y} \qquad (4.7)$$

The first of these two equations can easily be re-written as

$$b = \frac{\mathrm{COV}(XY)}{s_X s_Y} \times \frac{s_Y}{s_X} \qquad (4.6a)$$

and if the first of these terms in (4.6a) is replaced by r we have

$$b = r\,\frac{s_Y}{s_X} \qquad (4.8)$$

This states that the regression coefficient is equal to the correlation
coefficient (the square root of the coefficient of determination) multiplied
by the ratio of the standard deviations of Y and X. Three intuitive results follow
immediately from (4.8).
First, if r = 0, then b must also be 0.
Secondly, although r is independent of the units of measurement of X and
Y, b is not. To return to our earlier example, if RELIEF expenses were measured
not in shillings but in pounds, the standard deviation of X would fall to one-twentieth
of its level (since there are 20 shillings to a pound). The correlation
coefficient, r, would remain unchanged. Thus, b would be 20 times larger in
this alternative regression. This is an example of a scalar effect, which arises
when a variable is revalued consistently across all observations.f
The scalar effect of changing the denomination of a variable can be very
useful to researchers. Let us imagine, for example, that a PhD student is
interested in the demographic origins of homelessness in the United States
in the 1930s and considers urban size to be a major influence. She therefore
collects cross-section data on homelessness and total population for each
city. If she decided to regress the proportion of each city’s population who
were homeless in 1930 on the number of individuals in each city according
to the US Census of that year, she would find that the coefficient, b, would
be very small, on the order of 0.00000001.g
f
Note that the relationship between YEAR and TIME in the regression fitted to NNP in §4.2.5 is
not consistent in this sense: the ratio of 1920 to 1 is very different from the ratio of 1938 to 19.
g
Assume that there are five cities, with populations of 3,400,000, 5,000,000, 1,500,000, 500,000,
and 200,000. Proportions of homelessness are 0.03, 0.06, 0.02, 0.01, and 0.0085. From this, we
can calculate that sY = 0.018814; sX = 1,823,623; r = 0.97; and b = 0.00000001.
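The figures in the footnote can be checked directly. The sketch below also rescales the population variable to millions, showing the scalar effect on b while r is untouched.

```python
import math

pop = [3_400_000, 5_000_000, 1_500_000, 500_000, 200_000]
homeless = [0.03, 0.06, 0.02, 0.01, 0.0085]
n = len(pop)
mx, my = sum(pop) / n, sum(homeless) / n

cov = sum((x - mx) * (y - my) for x, y in zip(pop, homeless)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in pop) / n)       # ~1,823,623
sy = math.sqrt(sum((y - my) ** 2 for y in homeless) / n)  # ~0.018814
print(round(cov / (sx * sy), 2))   # r = 0.97
print(cov / sx ** 2)               # b is about 0.00000001

# Rescaling X to millions multiplies b by 1,000,000; r is unchanged.
pop_m = [x / 1_000_000 for x in pop]
mxm = sum(pop_m) / n
cov_m = sum((x - mxm) * (y - my) for x, y in zip(pop_m, homeless)) / n
sxm = math.sqrt(sum((x - mxm) ** 2 for x in pop_m) / n)
print(cov_m / sxm ** 2)            # b is now about 0.01
```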
Figure 4.6 Regression lines and scatter: (a) when sX is low relative to sY; (b) when sX is high relative to sY
Figure 4.7 Ballantine for a simple regression with one explanatory variable (two overlapping circles representing the variation in Y and in X)
different variance.5 How does this difference in the relative size of the vari-
ances affect our understanding of the Ballantine? In order to understand
this, let us delve into the derivation of the Ballantine a little further, with
the aid of our simple regression model from §4.2 based on the data given in
table 4.1.
The variances of X and Y can be calculated as 10 and 15.5, respectively
(from columns (5) and (6)); thus, in the Ballantine representation of this
regression, Y would have a larger circle than X. The covariance between X
and Y is 8.0 (from column (7)). The covariance is a measure of how far the
two variables move together, or covary; this is the statistical equivalent of
the shaded area in figure 4.7.
The size of the shaded area relative to the total area of each circle is equal
to the ratio of the covariance to the variance of X and Y, respectively. The
coefficient of determination is simply the product of these two ratios. Thus
$$r^2 = \frac{8}{10} \times \frac{8}{15.5} = 0.41$$
This result is thus the same as the one calculated in a different way in table
4.2.
In the bivariate case, the Ballantine also incorporates all the information
necessary to calculate the regression coefficient. Recalling that r is simply
the square root of the coefficient of determination, and that the standard
deviations of X and Y are equal to the square roots of their respective vari-
ances, we can use (4.8) to calculate the regression coefficient, b, as
$$b = \sqrt{0.41} \times \frac{\sqrt{15.5}}{\sqrt{10}} = 0.64 \times \frac{3.94}{3.16} = 0.80$$
Notes
1
The dependent variable may also be referred to as the regressand, or as the explained,
endogenous, or target variable. The explanatory variable can similarly be referred
to as the regressor or the predictor, or as the independent, exogenous, or control
variable.
2
Alternative possibilities would be to measure either the horizontal distance or the
distance represented by a line drawn at right angles to the regression line.
3
Note though that this is not the form of trend actually used by Benjamin and Kochin.
They calculated a so-called log-linear trend by a procedure we will explain in panel
12.1 in chapter 12.
4
The Ballantine was originally introduced by Jacob and Patricia Cohen, Applied
Multiple Regression/Correlation Analysis for the Behavioral Sciences, Lawrence
Erlbaum Associates, 1975, to discuss correlation coefficients, and later mis-spelt and
extended by Peter Kennedy to the analysis of regression techniques, Peter E.
Kennedy, ‘The “Ballentine”: a graphical aid for econometrics’, Australian Economic
Papers, 20, 1981, pp. 414–16.
5
One way to conceptualize the Ballantine with equal circles is to assume that all vari-
ables have been standardized by the Z transformation, so that all variances are forced
to one.
In which of these cases might you expect a mutual interaction between the
two variables?
X: 4, 8, 12, 16
Y: 10, 10, 3, 7
Calculate by hand:
The regression coefficient, b
The intercept, a
The total sum of squares
The explained sum of squares
The residual sum of squares
The coefficient of determination.
3. In the table below, (2) is an imaginary series. (3) is the official (London
Gazette) market price of wheat, rounded to the nearest 5 shillings. (4) is a
hypothetical series.
Year    (2) Disturbances    (3) Wheat price (shillings)    (4) Shipwrecks
1810          20                     105                        12
1811          50                      95                        23
1812          60                     125                        32
1813          40                     105                        22
1814          20                      75                         8
1815          10                      65                         7
1816          30                      75                        17
1817          30                      95                        17
1818          20                      85                         8
1819          40                      75                        18
1820          20                      65                         8
1821          20                      55                         8
(i) By hand, calculate the correlation coefficient for wheat prices and the
number of disturbances.
(ii) By hand, find the equation for the regression line for wheat prices
and the number of disturbances. (Hint: think first about which is the
appropriate explanatory variable.)
(iii) Plot the corresponding scatter diagram and regression line.
(iv) By hand, calculate the correlation coefficient for number of ship-
wrecks and the number of disturbances.
(v) Compare the coefficients calculated in (i) and (iv).
CATHOLIC
AGE
ILLITRTE
Record the coefficients and standard errors. Plot the residuals from each
regression. What do they tell you about the extent to which 1920 is an
outlier? Re-run these regressions dropping 1920. Record the coefficients
and standard errors and plot the residuals.
6. Critics of your analysis of disturbances and wheat prices in question 3
suggest the following errors in the statistics:
(i) The number of disturbances in 1819 was 60 not 40.
(ii) The price of wheat in 1819 was 70 not 75.
(iii) All the figures on the number of disturbances have a margin of error
of plus or minus 10 per cent.
(iv) The figures of wheat prices are too low, since they represent market
prices received by farmers, not retail prices paid by workers; retail
prices are 25 per cent higher in each year.
Draw up a statistical response to all of these criticisms. Generate new
results where applicable.
5.1 Introduction
The previous chapters have been concerned with a review of basic descrip-
tive statistics. We now move to the much more important topic of induc-
tive statistics. This chapter and chapter 6 will be devoted to an exploration
of some of the implications of working with data from samples. The basic
purpose of inductive statistics is to make it possible to say something
about selected characteristics of a population on the basis of what can be
inferred from one or more samples drawn from that population (see
§1.3.4).
Samples can be obtained in various ways. Typically, the historian works
with information extracted from surviving documents such as parish reg-
isters, manuscript census schedules, household inventories, farm accounts,
private diaries, and records of legal proceedings. She will usually be
working with a sample of such records, either because that is all that was
ever available (only some people keep diaries), or because only some of the
original documents have survived, or because the records are so detailed
that it would be too expensive and time-consuming to extract all the infor-
mation they contain.
Whenever sample data are used it is necessary to pose the fundamental
question: How good are the results from the sample? How much can we
learn about the population as a whole from the data available to us, and
with what confidence? The overall answer to these questions involves three
separate issues that must be considered briefly before turning to the statis-
tical aspects.
(a) Is the sample of records that has survived representative of the full set
of records that was originally created?
(b) Is the sample drawn from the records representative of the information
in those records?
(c) Is the information in those records representative of a wider popula-
tion than that covered by the records?
records survive more often for bigger firms than for smaller ones). In such
cases, it may be possible to develop sampling strategies to counteract the
bias in the surviving data and create a more representative data set. Finally,
there may be cases where it is not possible to establish the bona fides of the
surviving data at all, either because there is no basis for comparison with
the population (e.g. household inventories) or because the data are clearly
not representative (e.g. private diaries in an era of limited literacy).
We do not suggest that statistical analysis of unrepresentative evidence
is uninteresting, but in what follows we are assuming that the sample is rep-
resentative of the population in all three of the dimensions mentioned,
such that we can apply the logic of inferential statistics to recover informa-
tion about the population attributes from the sample available to us.
Once representativeness has been established, we can focus on the
nature of the sample results. So, when we ask, ‘How good are the results
obtained from the sample?’, we are asking, in effect, whether the same
results would have been obtained if a different sample had been used.
Unfortunately, it is normally not possible to find out by drawing more
samples (either because the information is not available or because the
exercise would be too expensive). Instead we have to rely on statistical
theory to tell us about the probability that the results we have obtained can
be regarded as a true reflection of the corresponding characteristics of the
underlying population. Our task is thus to learn how we can make infer-
ences about various parameters of the population, given what we know
from the sample statistics (see §1.3.4).
In particular, in this chapter we will consider how to achieve a given
probability that a specified range or confidence interval around the sample
statistic will cover the true (unknown) value we wish to estimate; and in
chapter 6 we will explore the closely related topic of the procedure for
testing statistical hypotheses. If we were considering only one variable,
these statistics might be, for example, estimates of the mean and standard
deviation of the sample. If we were dealing with the relationship between
samples of data for two variables, we might have sample statistics for the
correlation and regression coefficients.
As noted before (§1.4), Greek letters are usually used to denote charac-
teristics of the population, while Roman letters are used for the corre-
sponding estimates from samples from that population. The Greek and
Roman symbols we will use in this and other chapters are set out below.
a
Strictly, it should be all possible samples of the given size.
SAMPLING DISTRIBUTION
A sampling distribution
is the distribution of a sample statistic
that would be obtained if a large number of samplesa of a given size
were drawn from the same population.
b
A sample size of 10 or 20 will often be enough for this purpose, but anything less than 30 is
usually treated as a small sample. The reason for this is discussed in §5.4.2.
(b) On average, the known sample mean (X̄) will be equal to the mean of
the sampling distribution.
(c) It therefore follows that, on average, the known sample mean (X̄) will
be equal to μ, the unknown population mean.
The value of the sample mean can thus be taken as a point estimate of
the population mean.c Naturally any individual sample is likely to be a little
above or below the population mean. The second contribution made by
statistical theory enables us to calculate by how much any single sample is
likely to miss the target.
3 (a) The estimation error for any given point estimate, X̄, is determined
by the shape of the sampling distribution.
The flatter the sampling distribution, the more the possible values of X̄
will be spread out, and the wider the distance between any given X̄ and μ.
Conversely, the greater the clustering of sample means, the less the likely
estimation error of any given X̄ relative to μ. Because the standard deviation
of the sampling distribution thus indicates the range of possible error
in estimating μ owing to differences between samples, it is commonly
referred to as the standard error.
The estimated
standard deviation of the sampling distribution
is known as
the Standard Error.
(b) The size of the standard error depends on the size of the sample being
drawn from the population. As the sample size increases, the more
clustered will be the array of sampling means around the population
mean, μ, and the smaller will be the estimation error attributable to
any given sample mean, X̄.
The ideal formula for the standard error of the sample mean is σ/√n,
where σ (lower-case Greek sigma) is the standard deviation of the
population.
c
A point estimate is a single value. The alternative is an interval estimate covering a range of
values and thus making a less precise claim to identify the single (unknown) population value.
Since the population standard deviation, σ, is not known, the standard
deviation of the sample, s, is used in its place, giving

$$SE(\bar{X}) = \frac{s}{\sqrt{n}} \qquad (5.2)$$

This equation enables us to calculate the error involved in adopting X̄ as
an estimate of μ, on the basis of information known only from the sample
itself.d
d
As noted in chapter 2, computer programs and calculators may calculate the standard error
using n 1 rather than n. The reason for doing this is explained in §5.4.2. The distinction
between the two definitions matters only when the sample size is small. If n is 30 or more there
would be practically no difference between the two estimates.
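The behaviour of the standard error can be illustrated by simulation. The sketch below draws repeated samples from an artificial population (all values here are invented) and compares the spread of the resulting sample means with the theoretical σ/√n.

```python
import random
import statistics

random.seed(1)
# An artificial population with mean about 30 and std dev about 8:
population = [random.gauss(30, 8) for _ in range(100_000)]

n = 25
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(2_000)
]

# Spread of the sampling distribution vs the theoretical sigma/sqrt(n):
print(round(statistics.pstdev(sample_means), 2))            # about 1.6
print(round(statistics.pstdev(population) / n ** 0.5, 2))   # about 1.6
```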
BRTHRATE = 15.636 ± 0.308
We have thus far derived the standard error of the sample mean.
Standard errors can also be calculated for other sample statistics; it is neces-
sary, therefore, to indicate which standard error is being reported in any
case. This is often accomplished by adding the symbol for the characteristic
in brackets after the abbreviation SE (for standard error); so for the sample
mean it would be SE(X̄). At a later point in this chapter (see §5.6) we will
give the formulae for the standard error for some other common sample
statistics.
On the basis of these four theoretical propositions, we are able to use
statistics derived from a sample to estimate the value of a population
parameter and to indicate how accurate we consider the estimate to be. In
§5.5, we will use these simple rules to construct intervals within which we
believe our estimate of the unknown population mean to fall, and to calcu-
late the probability that we will be correct in this belief.
However, we are not yet ready for this. We must first spend some time
developing the procedures for calculating the confidence interval, and we
do this in §5.4. It is also necessary to pay some more attention to the
meaning of the sample being drawn, and one important aspect of this is
discussed in panel 5.1.
ably large compared to the sample. This is comforting; historians usually have
to make do with samples that are too small, so it is not likely that they will
often have to worry about samples which are too large.
For two of the most important statistics that we deal with in this chapter –
the mean and standard error of the sampling distribution – we can be
more precise. The following propositions may be stated without further
proof:*
The standard error of the mean of a sample drawn without replacement is the
standard error that would apply with replacement multiplied by the correction factor

$$\sqrt{\frac{N - n}{N - 1}}$$

where N is the size of the population and n the size of the sample.
If the sample is not more than about one-fifth of the population the cor-
rection needed is only 2–3 per cent of the standard error that would be
obtained if the sample were drawn with replacement, and can generally be
ignored. However, if the sample drawn without replacement is a larger frac-
tion of the population than this, the correction should generally be made.
Correction is desirable in such cases because under-statement of the
standard error will lead to a corresponding under-statement of the sam-
pling errors and confidence intervals explained below (see §5.5), and may
thus create a misleading impression of the reliability of the inferences that
can be made on the basis of the evidence from the sample.
It is also important to take the correction factor into account when com-
paring the means from two different samples, especially when both are
small relative to the population as a whole. Thus, if we wish to determine
whether the mean birth rate in Kent parishes was higher than in Sussex par-
ishes and whether this difference is statistically meaningful, we should rec-
ognize that both samples were drawn without replacement and construct
the standard errors of the means accordingly. We shall return to this issue
more formally in §6.7.1.
*
For illustrations of these propositions see exercises 2, 3, and 5 at the end of this chapter.
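A small helper function shows the size of the correction in practice; the population and sample sizes used below are illustrative only.

```python
import math

def standard_error(s, n, N=None):
    """SE of the sample mean; applies the finite population
    correction sqrt((N - n)/(N - 1)) when the population size N
    is supplied and sampling is without replacement."""
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# A sample of 311 from a population of 1,500: the correction matters.
print(round(standard_error(10, 311, 1500) / standard_error(10, 311), 3))
# A sample that is a tiny fraction of the population: it does not.
print(round(standard_error(10, 311, 100_000) / standard_error(10, 311), 3))
```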
We can thus see that there is only a one in four chance of getting either
no heads or two heads, and two in four chances of getting one head. If we
think of these chances in terms of probabilities they can be summarized as
follows:
Notice that the total of the probabilities is 1, because the three results
exhaust all the possible outcomes. These results represent the probability
distribution or probability density function for this particular discrete
random variable: the number of heads obtained in two tosses of a coin.
Exactly the same ideas can be applied to a continuous random variable
(which can take any value within a relevant range) such as GNP or age of
marriage, though the mathematics needed to determine the probability
distribution may be considerably more complicated. If such a probability
distribution is depicted graphically, the value of the random variable, X, is
measured along the horizontal axis, and the vertical axis shows the prob-
ability that X lies within some specified interval; for example, the probabil-
ity that the age of marriage of a woman lies between 16.0 and 18.0 years.4
The total area under the curve again exhausts all the possibilities and is
equal to 1.
One particular form of such a curve is the very important normal theo-
retical probability curve. This looks exactly like the normal curve dis-
cussed and illustrated in chapter 2. It is also a perfectly smooth and
symmetrical, bell-shaped, curve. The location and spread of the curve
along the horizontal axis are similarly determined by its mean and stan-
dard deviation. It too has a standardized equivalent, Z, with mean equal to
zero and standard deviation equal to 1. The way in which this shape is
derived from the probability distribution of a discrete variable is explained
in panel 5.2 and illustrated in figure 5.1.
It may, at first, seem rather confusing to regard a particular distribution
as both an empirical frequency distribution and a measure of probability.
However, this follows directly from the objective notion of probability as
long-run relative frequency. This is, in effect, what is meant when it is said
that the probability of getting a head when tossing a single coin is 1/2, or
that the probability of getting an ace when one card is drawn from a pack of
52 is 1/13. We would not expect to get one ace every time we drew 13 cards
from the pack, but if we drew a single card 13,000,000 times (from an ultra-
durable pack) then we would expect the number of aces to be very close to –
although not exactly equal to – 1,000,000.5
This standardized normal probability distribution has all the same
properties as the standard normal curve discussed in §2.6.2. We can thus
*
This is in effect a weighted average (see §2.3.2) calculated as the sum of the number of heads multiplied by their respective probabilities: (0 × 0.25) + (1 × 0.50) + (2 × 0.25) = 1.
** Note that the number of possible permutations of heads and tails is equal to 2n, where n is
the number of separate coin tosses. No matter how many tosses, only one permutation in
each sequence will be either all tails or all heads.
[Figure 5.1: Probability distributions of the proportion of heads for 10 and 20 coin tosses; horizontal axis: proportion of heads (0.0 to 1.0), vertical axis: probability (0.0 to 0.3).]
use the information in published tables such as table 2.7 to find the propor-
tion of the distribution in the area defined by any value of Z. We might, for
example, ascertain the proportion of the distribution up to or beyond a
specific value of Z, and can then interpret that proportion as the probability
of getting some given result that is greater than (or, if Z is negative, less
than) that value of Z.
Similarly the proportion of the distribution lying between two specific
values of Z can be interpreted as the probability of getting some result in
the range defined by those two values. Each of these values of Z corre-
sponds to some value in the original distribution before it was standard-
ized, according to the definition of Z given by (5.3).
particular standard deviation is known as the standard error (SE), and for a sample mean its value – as given in (5.2) – is s/√n. Using this SE to standardize the deviation of the sample mean from the population mean, we can define Z as
$$Z = \frac{\bar{X} - \mu}{s/\sqrt{n}} \qquad (5.4)$$
We would thus obtain a separate value for Z for every sample we draw
from a given population and for which we calculate a mean. These values of
Z represent a theoretical probability distribution. If we depicted this on a
graph, the successive values of X̄ would be measured along the horizontal
axis, and the vertical axis would record the probability of obtaining those
values. The total area under the curve again exhausts all the possibilities
and is equal to 1.
However, in order to obtain this formula for Z, it was necessary to replace the unknown σ by the known sample standard deviation, s. Such substitution will almost always be necessary in historical research, because there is usually no way of knowing the population values. Unfortunately, the statisticians have established that this procedure is not entirely satisfactory. It turns out that s/√(n − 1) is a better estimate of the standard deviation of the sampling distribution than s/√n. The reason for this can be
explained intuitively in terms of the concept of degrees of freedom (df).
This is discussed in panel 5.3.
As a consequence, an alternative standardized measure has been
devised, which is known as Student’s t, or simply t, where
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n-1}} \qquad (5.5)$$
Here, too, a separate value for t would be obtained for every sample
drawn from a given population for which a mean was calculated. These
values of t similarly represent a theoretical probability distribution of a
sample statistic, known as the t-distribution.
The t-distribution is symmetric and bell-shaped, with mean and mode
equal to zero. It is thus similar in appearance to the standard normal prob-
ability distribution, but its precise shape depends solely on a parameter, k,
which is determined by the number of degrees of freedom (df) in the esti-
mate of the population standard deviation. This equals n − 1, i.e. df is one less than the size of the sample.
There is thus a different distribution – and so different probabilities –
for each df, and the distribution is flatter (less peaked) and more spread out
the lower the number of df (see figure 5.2). However, the difference
between successive t-distributions diminishes rapidly as the sample size
(and so the number of df) increases, and the distributions become less
spread out. After we reach about 30 df, the t-distribution approximates
increasingly closely to the standard normal probability distribution.6
In principle, the t-distribution is always superior to the standard normal Z-distribution when σ is unknown, but its advantages are especially important when the sample size is small (less than about 30). In such cases the
[Figure 5.2: t-distributions for small numbers of degrees of freedom (e.g. df = 2) compared with the standard normal distribution.]
e
With the standard normal distribution, if the probability of getting a result greater than Z = 1.96 is 2.5 per cent, the probability of getting a result greater (further to the right) than that, such as Z = 2.228, would be less than 2.5 per cent, and thus less than the probability of getting that result with the t-distribution.
Source: D. V. Lindley and W. F. Scott, New Cambridge Statistical Tables, 2nd edn., Cambridge
University Press, 1995, table 10, p. 45.
The column headings in table 5.1 refer only to one tail, but because the
distribution is symmetrical the area in both tails marked off by any given
value of t is always exactly double the one-tailed area.f The value that marks
off a specified tail in this way is called the critical value or critical point, and
will feature prominently in chapter 6 (see §6.3.2).
f
Similarly, the corresponding cumulative and central areas, or the area to the right of the mean – which were shown in table 2.7 – can easily be found, given that the total area = 1 and the area to the right (or left) of the mean = 0.5.
$$\bar{X} \pm \text{the sampling error} \qquad (5.6)$$
We can also decide how large we want this margin of error to be. The
interval can be made wider or narrower depending on how confident we
want to be that the interval estimate we adopt does contain the true μ.
Given the sample size, if it is desired to be more confident – at the cost of
being less precise – a wider interval should be chosen, i.e. the sampling
error would be increased.
The usual practice is to opt for a 95 per cent confidence interval. This
means that we expect to be correct 19 times out of 20 (or 95 times out of
100) that our interval estimate does include μ. Equally, of course, it means that we have to recognize that we will be wrong (in the sense that our interval estimate will not actually include μ) once in every 20 times. In other
words, by choosing a 95 per cent confidence interval we are satisfied with a
95 per cent probability that the estimate will be correct.
It is important to note that this is not the same as saying that we can be 95 per cent confident that μ lies within the particular interval estimate derived from a single sample. The value of μ is fixed and cannot change; what can change is the value of X̄ obtained from successive samples, and thus the range around X̄ given by the confidence interval. With a 95 per cent confidence interval it is expected that only one in every 20 intervals derived from successive samples will not contain μ.
It is thus a statement of confidence in the procedure we are adopting,
embracing all the (hypothetical) interval estimates we would obtain if we
were able to repeat the calculations with a large number of samples of a
given size.
$$\bar{X} \pm \text{sampling error} \qquad (5.6)$$
$$= \bar{X} \pm t_{0.025}\,\mathrm{SE}(\bar{X}) \qquad (5.7)$$
where $t_{0.025}$ is the critical value of the t-distribution that cuts off 2.5 per cent of the distribution in each tail, and SE(X̄) is the standard error of the sampling distribution of the mean.
In §5.3 the standard error for BRTHRATE for the sample of 214 English parishes (s/√n) was found to be 0.308 births around a mean rate of 15.636 births per 100 families. If we look up $t_{0.025}$ for the t-distribution for 213 degrees of freedom in table 5.1 we find (by interpolation) that the critical value is 1.971.8
The 95 per cent confidence interval for the mean birth rate (births per 100 families) of the population is thus
$$15.636 \pm (1.971 \times 0.308) = 15.636 \pm 0.607$$
We are thus able to say that there is a 95 per cent probability that a range from 15.029 to 16.243 (i.e. from 15.636 − 0.607 to 15.636 + 0.607) births per 100 families will contain the (unknown) population mean.g
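The same interval can be reproduced by computer; a minimal sketch, using the figures from the example above (scipy assumed):

```python
from scipy import stats

mean, se, df = 15.636, 0.308, 213
t_crit = stats.t.ppf(0.975, df)   # two-tailed 5 per cent: 2.5 per cent in each tail
half = t_crit * se
print(f"95% CI: {mean - half:.3f} to {mean + half:.3f}")
```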
This is equivalent to saying that with the statistical procedure we are
using we can expect our sample to provide confidence intervals that will
include the population mean about 95 per cent of the time. If 20 samples
were taken, it is probable that the inference about the population mean, μ, that we draw from the sample mean, X̄, would be right 19 times and wrong
once. Of course, if we only have one sample there is no way of knowing
whether this is one of the 19 that gives the correct result, or the twentieth
one that gives the wrong result!
If the prospect of getting the wrong result 5 per cent of the time is alarm-
ing then it is always open to the investigator to adopt a higher standard.
Choosing a 99 per cent confidence interval would mean that the investiga-
tor could expect these (wider) intervals to include the unvarying mean
about 99 per cent of the time.
A familiar variant of this problem usually occurs during an election
when opinion polls are used to predict the proportion of the electorate that
will vote for a particular party or presidential candidate.h In a run of such
polls there are typically some that excite special interest by deviating sub-
stantially from the remainder in the series, but once it becomes apparent
that they have not detected a sudden shift in opinion the pollsters acknowl-
edge reluctantly that this deviant result was a ‘rogue poll’.
In other words, this is one of the incorrect results that any sampling pro-
cedure is bound to produce if it is repeated sufficiently often. The fre-
quency of rogue polls is systematically related to their sample size. The only
way for the pollsters to reduce the probability of errors would be to work
with larger samples, but that is often too expensive for their sponsors.
To see what the confidence interval would be for the mean BRTHRATE
if the sample size were smaller, imagine that instead of the 214 parishes the
sample had been confined to the 23 parishes for which information was
found for Kent (county 1).i The mean BRTHRATE obtained from this
sample was 18.547 with a standard deviation of 4.333. Since the sample size
is small it is important to replace √n by √(n − 1) in the formula for the
g
If we used s/√(n − 1) rather than s/√n, the standard error would change imperceptibly to 0.309 and the confidence interval would be 15.636 ± 0.608.
h
In this case the relevant sampling statistic is a proportion rather than a mean, but the statistical
principle is the same.
i
For those who recall that we worked with 24 Kent parishes in chapters 2–4 it may be helpful to
note that those statistics related to the first Boyer data set for poor relief. The number of parishes
in his birth rate data set is smaller, and in the case of Kent is 23.
standard error.j SE(X̄) is therefore 4.333/√22 = 0.924. With a sample of this size there are 22 df, and if we look up table 5.1 for the t-distribution with 22 df we find a critical value for $t_{0.025}$ of 2.074.
Thus with this sample the 95 per cent confidence interval for the mean
population birth rate is
$$18.547 \pm (2.074 \times 0.924) = 18.547 \pm 1.916$$
On the basis of this sample we are thus able to say that there is a 95 per
cent probability that a range from 16.631 to 20.463 births per 100 families
will contain the (unknown) population mean. With the smaller sample we
are able to place the population mean only within a fairly wide interval
(with 95 per cent confidence).
If a higher standard of confidence were required (say, 99 per cent) the
proportion to be excluded in each tail would be 0.005 (giving a total pro-
portion of 0.01 or 1 per cent in both tails). For 22 df, the critical value is $t_{0.005}$ = 2.819 (see column (6) of table 5.1) and the confidence interval for
the population BRTHRATE would be correspondingly wider: 15.942 to
21.152 births per 100 families.
It should also be noted that there is no overlap between the 95 per cent
confidence interval for the population mean birth rate obtained from the
sample of 23 Kent parishes and the corresponding interval obtained earlier
from the full sample of 214 parishes. The lower bound for the former
(16.631) is actually higher than the upper bound for the latter (16.243).
This is a sharp reminder that it would not be good sampling procedure to
attempt to derive an estimate of the mean birth rate in England as a whole
from a sample of parishes drawn exclusively from one county (see §1.3.4).
As should also be obvious, any confidence interval is directly affected by
the size of the sample on which it is based. The larger the sample, the
smaller will be the standard error, and hence the smaller the width of the
interval. The alternative to achieving a higher standard of confidence at the
expense of a wider interval is thus to increase the size of the sample where
this is not precluded by considerations of cost or availability of data.
j
See n. 3 (p. 144).
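The dependence of the interval on sample size is easy to demonstrate; a minimal sketch with hypothetical values, showing that quadrupling n roughly halves the width, since the standard error varies with 1/√n:

```python
import math

s, t_crit = 4.333, 1.96          # hypothetical sd; large-sample critical value
for n in (25, 100, 400):
    half = t_crit * s / math.sqrt(n - 1)
    print(n, round(half, 3))     # half-width roughly halves as n quadruples
```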
$$\mathrm{SE}(p) = \sqrt{\frac{\pi(1-\pi)}{n}} \qquad (5.8)$$
where π (lower-case Greek pi) is the (unknown) proportion in the population. Because π is not known it is estimated by using the sample proportion, p, and the formula becomes
$$\mathrm{SE}(p) = \sqrt{\frac{p(1-p)}{n-1}} \qquad (5.9)$$
The formula for the 95 per cent confidence interval for a sample size of (say) 41 (so df = 40) is thus
$$p \pm t_{0.025}\,\mathrm{SE}(p) = p \pm 2.02\sqrt{\frac{p(1-p)}{n-1}} \qquad (5.10)$$
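A minimal sketch of (5.10); the proportion p = 0.40 is purely hypothetical:

```python
import math
from scipy import stats

p, n = 0.40, 41
se = math.sqrt(p * (1 - p) / (n - 1))   # SE(p) as in (5.9)
t_crit = stats.t.ppf(0.975, n - 1)      # about 2.02 for 40 df
print(f"{p:.3f} +/- {t_crit * se:.3f}")
```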
In the light of the concepts discussed in this chapter it should now be clear
that this is not satisfactory.
It is necessary to ask the same question about the correlation coefficient
as about any other result obtained from a sample: is the value obtained for r
simply the result of random (chance) features of the particular sample, or
does it reflect systematic and consistent features of the relationship
between the population of the two variables? In other words, if the correla-
tion coefficient for the population mean is (lower case Greek rho), is this
sample correlation coefficient, r, a good estimator of ?
To answer this question in terms of confidence intervals we need to
know the standard error of the correlation coefficient, SE(r). This is
$$\mathrm{SE}(r) = \frac{1-\rho^2}{\sqrt{n-1}} \qquad (5.11)$$
Unfortunately, this formula is of limited use, both because the value of ρ is not usually known and because the sampling distribution of r is highly skewed unless the sample size is quite large (a sample of at least 100 is recommended). To replace it statisticians have developed a transformation, Fisher's z, which is obtained from r as follows9
$$z = \frac{1}{2}\log_e\!\left(\frac{1+r}{1-r}\right) \qquad (5.12)$$
k
Lindley and Scott, Statistical Tables, table 17, p. 59.
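Fisher's z in (5.12) is easily computed directly; a minimal sketch (the value r = 0.5 is hypothetical):

```python
import math

def fisher_z(r):
    # Fisher's z-transformation: (1/2) * ln((1 + r) / (1 - r))
    return 0.5 * math.log((1 + r) / (1 - r))

print(round(fisher_z(0.5), 3))   # ~0.549
```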
present, we can simply note that SE (b), the standard error of a sample
regression coefficient for the slope, b, of a regression line with only one
explanatory variable, is
$$\mathrm{SE}(b) = \sqrt{\frac{\sum(Y_i-\hat{Y})^2}{(n-2)\sum(X_i-\bar{X})^2}} \qquad (5.14)$$
The formula for the 95 per cent confidence interval for the population regression coefficient, β, and a sample size of (say) 32 (df = 30) is therefore
$$b \pm t_{0.025}\,\mathrm{SE}(b) = b \pm 2.04\,\mathrm{SE}(b) \qquad (5.15)$$
Substituting the formula for SE(b) from (5.14) in this equation gives
$$b \pm 2.04\sqrt{\frac{\sum(Y_i-\hat{Y})^2}{(n-2)\sum(X_i-\bar{X})^2}} \qquad (5.16)$$
In this chapter we have reviewed the formulae for the standard errors of
the sampling distributions of a mean, proportion, regression coefficient,
and correlation coefficient. These four statistics illustrate the basic princi-
ples. For other sample statistics your statistical program and computer will
provide the required formula for the standard error and do the necessary
calculations.
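For instance, SE(b) in (5.14) can be computed directly from the residuals; a minimal sketch with hypothetical data (numpy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical explanatory variable
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # hypothetical dependent variable

b, a = np.polyfit(x, y, 1)                     # slope and intercept
resid = y - (a + b * x)
n = len(x)
se_b = np.sqrt((resid ** 2).sum() / ((n - 2) * ((x - x.mean()) ** 2).sum()))
print(round(b, 3), round(se_b, 3))
```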
Notes
1
A full discussion of all the statistical (and financial) issues that arise in drawing an
appropriate sample cannot be provided here. A brief introduction to the topic is
given in the references given in §1.3.4. W. G. Cochran, Sampling Techniques, 3rd
edn., Wiley & Sons, 1977, is a good example of a more comprehensive text.
2
This is the procedure adopted by Boyer in his analysis of the English Poor Law Data
(An Economic History of the English Poor Law, Cambridge University Press, 1990, pp.
131, 149). Parishes that did not return relatively complete questionnaires to the Poor
Law Commission were excluded from his sample. However, every parish returned
some information and Boyer was able to compare these results with those for his
sample to indicate that incomplete respondents were not systematically different
from those that did reply in full. He thus decided that he could treat his data as
though they were drawn from a random sample.
3
The statisticians actually make a subtle distinction here. Although they tell us that the sample standard deviation, s, is the best estimate of σ, they say that the best estimate of the standard deviation of a sampling distribution is not s/√n, but s/√(n − 1). If the sample size is large (more than 30) the difference made by using s/√n rather than s/√(n − 1) in the denominator hardly matters, as can be seen by reworking the BRTHRATE example given below, but with smaller samples it becomes progressively more important. We will say more about this distinction in §5.4.2 and offer an intuitive explanation for it in panel 5.3.
4
We are told by the statisticians to refer in this way to the probability of a continuous
random variable having a value within some specified interval, because the probabil-
ity of any one specific value is always zero.
5
In this context, each card must of course be replaced immediately it has been drawn.
The number of aces would not be expected to be exactly 1,000,000 with replacement
because each draw is completely independent. Assume, for example, that at any point
in the sequence of drawing the cards, the proportion of aces had fallen 20 below 1 in
13. The cards have no memory of the previous draws, and so there is absolutely no
reason to expect that the next 20 draws will all be aces. At any stage, the probability
for each succeeding draw is exactly the same as it was at the outset of the experiment:
1 in 13.
6
The variance of the t-distribution is given by the formula k/(k − 2), where k equals the number of df (n − 1). This means that the larger the sample size (and hence k)
becomes, the closer the variance is to 1, and hence the more closely the distribution
approximates the standard normal probability distribution. As we know from
§2.6.2, the standard deviation (and hence also the variance) of this standard normal
distribution is also 1.
7
The t-distribution should be used in preference to the Z-distribution whenever σ is not known, irrespective of the sample size. However, it should be intuitively obvious that the difference between s and σ will be greater the smaller the size of the sample
on which s is based, and so the use of the t-distribution is again especially important
when dealing with small samples.
8
When the number of df is large it is recommended that interpolation should be har-
monic. For an explanation see Lindley and Scott, Statistical Tables, p. 96.
9
An alternative version of (5.12) is sometimes quoted, which is not calculated in natural logs but in logs to the base 10 (see §1.6.2). This is
$$z = 1.1513\log_{10}\!\left(\frac{1+r}{1-r}\right),$$
and gives the identical result. The value of Fisher’s z for any given r can also be
obtained from published tables, for example, Lindley and Scott, Statistical Tables,
table 16, p. 58.
(iii) Plot the distributions obtained in (i) and (ii) and give an intuitive
explanation of why the distribution approximates more closely to
the normal distribution as the sample size increases.
(i) Calculate by hand the mean and standard deviation of the popula-
tion. (Note: this is the entire population; not a sample.)
(ii) Specify all possible samples of size 2 which can be drawn with
replacement from this population, and calculate the means of each
of these samples. (Hint: this means that you can have both 2, 3 and
3, 2 and also 2, 2, etc.)
(iii) Calculate the mean of the sampling distribution of means. Compare
your result with the population mean.
(iv) Calculate the standard error of the sampling distribution of means.
Compare your result with the value obtained from the formula SE = σ/√n.
(v) Comment on your results.
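For checking answers, a minimal sketch that enumerates all ordered samples of size 2 drawn with replacement (the population values are hypothetical, not those of the exercise):

```python
import itertools, math, statistics

pop = [2, 3, 5, 8]                                   # hypothetical population
means = [sum(s) / 2 for s in itertools.product(pop, repeat=2)]
print(statistics.mean(means), statistics.mean(pop))  # the two means are equal
sigma = statistics.pstdev(pop)
print(round(statistics.pstdev(means), 4),
      round(sigma / math.sqrt(2), 4))                # both equal sigma / sqrt(n)
```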
3. Take the same population as in question 2 and undertake the same exer-
cises, but using the approach of sampling without replacement. Comment
on your results, especially with regard to any contrasts and similarities with
the answers to question 2.
4. Using the Boyer relief data set, calculate the mean and standard devia-
tion of INCOME for the 24 Kent parishes:
(i) Calculate by hand the 90 per cent and 99 per cent confidence inter-
vals for mean INCOME for the 24 Kent parishes. Compare your
result with the confidence intervals given by your statistical
program.
(ii) Calculate by hand the 99 per cent confidence interval for the propor-
tion of Kent parishes with a workhouse. Compare your result with
the confidence interval given by your statistical program. (Hint: treat
this as a mean.)
(iii) If the 95 per cent confidence interval for INCOME is calculated for a sample of 24 Kent parishes it is found to be 34.362 ± 2.129. How many parishes would need to be included in the sample to reduce the 95 per cent confidence interval to ±1? (You can take $t_{0.025}$ = 1.96.)
(iv) By how many times would the sample need to be increased in order to reduce the confidence interval to one-eighth of the size of ±1 adopted in question (iii), i.e. to ±0.125? What general result can be
derived from this?
(v) Calculate the 95 per cent confidence interval for the regression
Hypothesis testing
were drawn from additional parishes in each county. If so, the apparent
difference shown by the two existing samples of parishes is simply a
reflection of random factors that are of no real interest to historians.
If the answer is (b), then it is a potentially interesting finding and histo-
rians of the Poor Law should proceed further to consider why per capita
relief payments were some 28 per cent higher in Sussex than in Kent.
The jury rapidly acquitted A. They also acquitted B but only after a long
discussion. C was found guilty but on a majority verdict. D was quickly and
unanimously found to be guilty.
The jury’s decisions can be interpreted as follows. In each case the ques-
tion they had to ask themselves was: What is the chance that the man would
have been doing what the police reported if he was innocent? If they judged
the probability to be high, they would decide that he was innocent. If they
judged it to be low, they would reject the presumption of innocence and
find the man guilty. In making their judgement they have to recognize
that if they set their probability standard too high they run the risk of
acquitting a guilty man; if they set it too low they may convict someone
who is innocent.
In these four cases the jury evidently drew their dividing line between
high and low probability somewhere in the region between the chance of B
being seen driving away from the store if he was innocent, and the prob-
ability of C being seen near the warehouse and then being overheard boast-
ing in the pub if he was innocent. In A’s case the likelihood of innocence
was found to be very high, in D’s it was obviously very low.
We will return to this story in §6.3.6.
In the subsequent sections we will explain each of these stages, and then
in §6.7 we will illustrate three specific tests of hypotheses involving a
difference of sample means, a correlation coefficient, and a regression
coefficient.
(a) the underlying population from which the sample was drawn; and
(b) the research procedures used to obtain the sample data.
It is only if one is certain about (a) and (b) that it is possible to isolate the
null hypothesis as the issue that is to be decided by the selected test.
One of the crucial questions with respect to (a) is whether or not the
population distribution is approximately normal (i.e. if it looks approxi-
mately like the theoretical frequency curve described in §2.6). If it is, then
one set of tests, in which this normality is assumed, is appropriate. If it is
not, then another set, known as non-parametric tests, which will be intro-
duced in chapter 7, must be employed. These latter tests do not assume a
normal distribution and are also applicable to nominal and ordinal level
data.
The crucial question with respect to (b) is whether or not the sample is
truly random. (A brief indication of what is required to satisfy this condi-
tion was given in §1.3.4.) If the sample is not random, then it is essential to
consider the extent to which there might be a bias in the data because of
this. If there is a serious bias, the data are contaminated and the results
cannot be relied on.
It may, nevertheless, be possible to argue that even though the sample is
not random, the results are not invalidated. For example, as noted in
chapter 5 (see n. 2), Boyer (Poor Law, p. 129) claims this for his data, on the
grounds that the parishes that completed the returns to the Poor Law
Commission from which his samples are drawn were not systematically
different from those that did not.
The longer he waits until he is sure it is safe to cross, the less likely he is
to be run over. But if he tries to achieve maximum safety he might wait
forever and would never get to the other side of the road.
(a) If the scientists do not reject H0 when it is true (i.e. the drug does have
adverse side effects) they will make the correct decision. The drug will
not be introduced and no patients will suffer from the adverse
effects.
(b) If they reject H0 when it is true (and instead fail to reject H1) they will
make a Type I error: the drug will be introduced and some unfortu-
nate patients will suffer the adverse effects.
(c) If they do not reject H0 when it is false the drug will not be introduced.
This will represent a Type II error, and the beneficial treatment the
drug could have provided will be lost.
(d) If they reject H0 when it is false (and instead fail to reject H1) they will
again make the correct decision: the drug can be introduced and
patients will be treated without the risk of adverse side effects.
These possible outcomes can be summarized in the box below.
Decision:
                   Reject H0            Do not reject H0
H0 is true         Type I error         Correct decision
H0 is false        Correct decision     Type II error
SIGNIFICANCE LEVEL
The significance level of a test, designated α (alpha), is the probability of making a Type I error: rejecting the null hypothesis when it is in fact true.
The more there is at stake in rejecting a true null hypothesis, the greater the need to select a low probability of making a Type I error, even though, as we have seen, this will increase the prospect of making a Type II error.
The probability of making a Type II error is designated as β. The probability of not committing a Type II error (1 − β) is known as the power (or efficiency) of a test of the null hypothesis.
[Figure: critical regions for a two-tailed test (the null hypothesis is rejected in either tail and not rejected in the central region) and for a one-tailed test (a single critical region in one tail).]
The trade-off
There is, however, a price to be paid for this. Consider, for example, a com-
parison of the 1 per cent and 5 per cent significance levels. At the 1 per cent
significance level the critical (rejection) region is smaller, and the non-
rejection region is bigger, than at the 5 per cent level. Therefore, at the 1 per
cent level, with a smaller critical region, there is less likelihood of a Type I
error (rejecting H0 when it is correct). But then there will automatically be a
larger non-rejection region, which means there is more likelihood of a
Type II error (not rejecting H0 when it is false).
There is thus an inescapable trade-off: the less the chance of a Type I
error, the greater the chance of a Type II error. The standard approach to
hypothesis testing is to assume that Type I errors are less desirable than Type
II errors and to proceed accordingly. For this reason, the power of a test is
rarely invoked in empirical analysis. The conventional methodology
instead focuses on setting the significance level and then choosing a
method that minimizes the chances of committing a Type I error.
Thus in the example of the drug trial, the consequences of introducing a
drug with adverse side effects are usually viewed as far more serious than
the consequences of not introducing a drug with potential life-saving or
life-enhancing properties.1 This logic is reinforced by the recognition that,
in general, it is very hard to identify the sampling distribution of the
research hypothesis, and one cannot therefore easily observe the probabil-
ity of committing a Type II error. As a rule, therefore, researchers are content to recognize the trade-off by not reducing α to extremely low levels, but beyond this little more is done or said.
H0: μ1 = μ2
H1: μ1 ≠ μ2
that she will adopt the 5 per cent level and then finds that her results are
significant at the 1 per cent level, she will normally note this with special
satisfaction, and the results might then be described as ‘highly significant’.
If, however, the research related to something like the experiment to test
a new drug for adverse side effects, it would be advisable for the drug
company to require a higher degree of certainty that the drug was safe. It
would thus be necessary to choose a lower level of significance in order to
have an extremely low probability of making a Type I error (namely, reject-
ing the null hypothesis of adverse effects when it was true, i.e. when the
drug actually did have such effects). In such a crucial context a level even
lower than 1 per cent might be appropriate.
It is an important principle of good research practice that the appropri-
ate level should be decided in advance; it should not be influenced by the
results once they have been obtained.
Significance levels can also be regarded as the complement of the
confidence intervals discussed in chapter 5. If a null hypothesis is rejected
at the 5 per cent level, i.e. with a 5 per cent probability of error in coming to
this conclusion, this is exactly equivalent to saying that it lies outside the 95
per cent confidence interval. Conversely, if it is not rejected at the 5 per cent
level, this is equivalent to saying that it falls within the 95 per cent
confidence interval.
PROB-VALUE OR p-VALUE
Prob-value or p-value
(or sig or statistical significance)
is the probability
that the outcome observed would be present
if
the null hypothesis were true.
from the drug store if he was innocent and the lower probability of C being
seen near the warehouse and then being overheard boasting in the bar if he
was innocent.
A’s case falls well into the non-rejection region: there is a very good
chance (high probability) that he could have been noticed outside the
house a couple of days before the theft even if he was innocent. Conversely,
D’s case is well into the critical region: the jury obviously thought that there
was very little chance (very low probability) that he would have been trying
to sell the stolen goods to a dealer the morning after the theft if the pre-
sumption of his innocence were true. A low probability thus leads to rejec-
tion of the null hypothesis.
Note that Type II errors (failing to reject the null hypothesis when it is
false) do not come into this story at all. The emphasis in the Anglo-
American legal system on the notion of reasonable doubt is consistent with
setting a low significance level, (but not too low, since it is reasonable
rather than absolute doubt that is at issue). In other words, these judicial
systems are primarily concerned to avoid making Type I errors: finding
accused persons guilty when they are innocent.
However, in jurisdictions in which the rights of the accused are less sac-
rosanct, or in a situation where it is seen as more important to avoid cases
where the guilty are acquitted, the judicial system might be more con-
cerned to avoid Type II errors: finding the accused innocent when they are
actually guilty. In these circumstances it is considered more important to
lock up potential wrongdoers without too much concern for any infringe-
ment of their civil rights if they are indeed innocent. A prime example of
this in the case of the United Kingdom was the policy of internment of sus-
pected terrorists during the 1980s. The movement towards parole denial in
certain categories of crime in the United States (especially involving vio-
lence against children and sexual deviancy) may be viewed similarly.
● The test statistic expresses the specific result calculated from the
sample data in a form suitable for comparison with the sampling distri-
bution.
● The sampling distribution is associated with a theoretical probability
distribution for all possible outcomes of that particular test statistic.
distribution was that the value of the population mean, μ (mu), was taken as the benchmark, and the deviation (or distance) from that benchmark of a particular sample mean, X̄, was expressed in terms of the standard deviation of the sampling distribution of the sample mean.
This standard deviation of a sampling distribution was given the special name of standard error. It is commonly designated by the abbreviation SE followed by the statistic to which it relates. Thus for a sample mean it would be SE(X̄), for a sample correlation coefficient it would be SE(r), and so on.
The question thus posed is: How many standard errors from the bench-
mark is the value of the mean obtained from the given sample?
It was stated in §5.3 that SE(X̄) is equal to σ/√n. So in order to answer that question, it would be necessary to know the population standard deviation (σ), but that information is hardly ever available. However, as explained in §5.4.2, if the sample size is large enough (generally taken to be more than 30), we can use the standard deviation of the sample, s, as an estimate of σ. The standardized deviation of the sample mean from the population benchmark can then be measured as
$$Z = \frac{\bar{X}-\mu}{\mathrm{SE}(\bar{X})} = \frac{\bar{X}-\mu}{s/\sqrt{n}} \qquad (6.1)$$
The table for the proportionate areas under the curve of the standard
normal distribution can then be used to assess the probability of getting a
particular value greater (or less) than Z, or the probability of getting a par-
ticular value within some range, for example from the mean to Z.
The alternative t-distribution replaces s/√n by a better estimate of the standard error, s/√(n − 1), to give
$$t = \frac{\bar{X}-\mu}{\mathrm{SE}(\bar{X})} = \frac{\bar{X}-\mu}{s/\sqrt{n-1}} \qquad (6.2)$$
It was further noted that although the t-distribution is, in principle, always superior to the Z-distribution when σ is unknown, its advantages are especially important when the sample size is less than about 30.
We can now apply these distributions to our primary theme of hypothe-
sis tests. In what follows we refer, for simplicity, to the t-tests, since histo-
rians often have to work with small samples, but the same fundamental
ideas apply to both Z- and t-tests.
The basic idea behind the main test statistics is exactly the same as in the
above review of the theoretical distributions, except that in the context of
hypothesis tests it is the value specified in the null hypothesis that is taken as
the benchmark. In a test of a sample mean this would typically be the value
of the population mean as in (6.2).2 In a test of a sample regression
divided by
the standard error (SE)
of the sampling distribution of that statistic:
$$t_{calc} = \frac{\text{Sample estimate} - \text{Null hypothesis value}}{\mathrm{SE}} \qquad (6.3)$$
b
The corresponding test statistic calculated for a Z-test would be Zcalc, and so on. An alternative
notation uses an asterisk to distinguish the test statistic; thus t*, Z*, etc.
The test statistic given by (6.3) is used if the null hypothesis relates to
some specific non-zero value. This would apply if relevant values are given
by some other source; for example, if it is known from past experience that
the proportion of students failing their statistics course is 15 per cent, and it
is desired to test whether the current failure rate of 17 per cent is indicative
of a deterioration in performance. Alternatively, the specific value might be
derived from a theoretical proposition regarding, say, the slope of the
regression coefficient; for an illustration of the case where the null hypoth-
esis is b1 see §6.7.3.
However, if – as is often the case – the null hypothesis is that the value is
zero, then the second term in the numerator can be eliminated, and the
formula reduces to the simpler form known as the t-ratio
$$t_{calc} = \frac{\text{Sample estimate}}{\mathrm{SE}} \qquad (6.4)$$
t-RATIOS
Where the null hypothesis is that the value of the sample statistic = 0,
the test statistic, tcalc, becomes:
the ratio of
the sample estimate
to
the standard error (SE) of the sampling distribution of that statistic:
$$t_{calc} = \frac{\text{Sample estimate}}{\mathrm{SE}} \qquad (6.4)$$
1. Calculate the test statistic (e.g. tcalc) from the sample data using the rel-
evant formula.
2. Compare this result with the corresponding tabulated sampling distribution, which gives the theoretical probability of getting a result greater than the critical value selected by the choice of significance level and critical region(s).
3a. If the calculated statistic is larger than the critical value (and thus falls in the critical region) the null hypothesis is rejected: see figure 6.2 (a).
3b. If it is smaller than the critical value (and thus falls in the non-rejection region) the null hypothesis cannot be rejected: see figure 6.2 (b).
The logic of this procedure should be clear from the preceding discus-
sion. The fundamental principle is that H0 should be rejected only if the
results obtained from the sample are unlikely to have occurred if H0 is true.
What is meant by ‘unlikely’ has been decided in advance by the selection of
the significance level and critical region.
To recapitulate the full procedure, consider again the case of the drug
trial:
More generally, the statistical packages calculate the probability that the
observed relationship found in the given data set would be present if H0 is
true. The lower the prob-value the less likely it is that such a result would be
obtained if H0 were true. So if a very low prob-value (or equivalent term) is
reported the null can be rejected.
This procedure is often found to be rather confusing. There are a
number of reasons for this, and it may help to remember the following
points:
[Figure 6.2: (a) the test statistic falls in the critical region and H0 is rejected; (b) the test statistic falls in the non-rejection region and H0 cannot be rejected.]
distribution, and thus towards or into the critical region where the
probability is low.
● Secondly, when this occurs and the test statistic is in the critical region,
the procedure is to reject the null hypothesis, which sounds depressing.
But remember that H0 is usually specified as the undesirable outcome,
so rejection of the null is actually good news for the researcher.
We use the data for RELIEF for the counties of Kent and Sussex referred
to at the beginning of this chapter, and the issue to be considered is whether
the difference of 5.76 shillings between the sample means is statistically
significant. If so, it is clearly large enough to be also historically interesting.
The relevant summary statistics are:
[Table: summary statistics for RELIEF in Kent (n1 = 24, s1 = 7.641) and Sussex (n2 = 42, s2 = 8.041); the two sample means differ by 5.76 shillings.]
The null hypothesis is that there is no difference between the two sample
means:
H0: μ1 = μ2
Panel 6.1 Calculation of the standard error of the difference between two
sample means
The calculation of SE(X̄1 − X̄2) in the denominator of (6.6) depends on whether or not the variances of the (unknown) underlying populations from which the two samples are drawn, σ1² and σ2², can be assumed to be equal. (A special test, known as the F-test, can be applied to test the null hypothesis, H0: σ1² = σ2².)*
If this null hypothesis cannot be rejected, the population variances can be
assumed to be equal, and the information from the two samples can be
pooled. The formula for the pooled estimate of that common variance (sp2)
uses information from both samples as follows.
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} = \frac{(23)(7.641)^2 + (41)(8.041)^2}{24+42-2} = \frac{1342.852 + 2650.965}{64} = 62.403$$
This pooled variance (s2p) can then be used in the formula given in (6.8) to
calculate the SE of the difference between the two sample means:
$$\mathrm{SE}(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} \qquad (6.8)$$
$$= \sqrt{\frac{62.403}{24} + \frac{62.403}{42}} = \sqrt{2.600 + 1.486} = 2.021$$
What is done if the outcome of the F-test is that the null hypothesis of no difference between the variances of the populations can be rejected? In that
case it has to be assumed that the population variances are unequal and the
information from the two samples cannot be pooled. It is then necessary to
use each of the separate sample variances in the calculation of the SE. The
formula for the standard error is then:
* The F-test will be introduced in §9.3.1: it is applied to the testing of equal sample variances
in question 6 in the exercises for chapter 9. An alternative test for equality of sample vari-
ances, known as Levene’s test, is used by SPSS.
$$\mathrm{SE}(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{2.433 + 1.539} = 1.993 \qquad (6.9)$$
This is marginally smaller than the result with the pooled variance, so the test statistic for the difference between the means will be marginally larger.
However, it must be noted that if the alternative correction for unequal
variances does have to be applied, it is also necessary to correct for a change
in the number of degrees of freedom to be used in testing the SE for statisti-
cal significance. The correction, which arises from the inefficiency of the
additive formula (6.9) as an estimate of $\sigma_{\bar{X}_1-\bar{X}_2}$, may be quite large.
Thus, in the case of the comparison between Kent and Sussex, the
number of degrees of freedom for the case of unequal variances is 35, com-
pared to a figure of 64 under the assumption of equal variances.4 A compar-
ison of the critical t-values at the 5 per cent significance level in these two
cases (1.98 for 64 df; 2.04 for 35 df) indicates that the need to correct for
unequal variances tends to make the t-test more stringent. The smaller the
samples, the greater the effect of the correction.
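In practice the whole comparison, including the correction for unequal variances and its adjusted df, is routinely produced by statistical software. A minimal sketch with simulated stand-ins for the two county samples (these are not the Boyer data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
kent = rng.normal(20.6, 7.6, 24)     # hypothetical stand-in for the Kent sample
sussex = rng.normal(26.3, 8.0, 42)   # hypothetical stand-in for the Sussex sample

print(stats.ttest_ind(kent, sussex, equal_var=True))    # pooled variance
print(stats.ttest_ind(kent, sussex, equal_var=False))   # unequal variances, adjusted df
```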
that there is no difference between the means is only 0.6 per cent. This is
very low and so the null hypothesis can be rejected at a markedly lower
significance level than 5 per cent.3
To work this out we need to know SE(r), the standard error of a correla-
tion coefficient (i.e. the standard deviation of the sampling distribution of
all the correlation coefficients that would be derived from an infinite
number of random samples). The statisticians have calculated that for the
special case of ρ = 0
$$\mathrm{SE}(r) = \sqrt{\frac{1-r^2}{n-2}} \qquad (6.11)$$
where r is the correlation coefficient and n is the size of the sample, i.e. the
number of pairs of values of X and Y for which the strength of the associa-
tion is being tested.5
Substituting this formula for SE(r) in the previous equation gives the
test statistic in the following box.
$$t_{calc} = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \qquad (6.12)$$
with n − 2 degrees of freedom.
c
The Spearman rank correlation coefficient (rs) was introduced in §3.3 as an alternative measure
of correlation for use with ordinal or ranked data. For large samples (probably above 30 and
$$t_{calc} = \frac{2.892}{0.849} = 3.406$$
From table 5.1 we see that for n − 2 = 230 degrees of freedom the critical value for a two-tailed significance level of 1 per cent (0.005 in one tail) is 2.750. tcalc is considerably greater than this and thus lies well within the critical region, so the null hypothesis can be rejected at the 1 per cent level.6
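A minimal sketch of the test in (6.12); the values of r and n here are hypothetical placeholders, not those of the example above:

```python
import math
from scipy import stats

r, n = 0.25, 232
t_calc = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t_calc), n - 2)   # two-tailed prob-value
print(round(t_calc, 3), round(p, 4))
```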
$$\mathrm{SE}(b) = \sqrt{\frac{\sum(Y_i-\hat{Y}_i)^2}{(n-k)\sum(X_i-\bar{X})^2}} \qquad (6.14)$$
certainly above 50), the statistical significance of rs can be gauged by the statistic $z = r_s\sqrt{n-1}$,
using the procedure introduced in §2.5.2. For smaller samples, a special table must be used, such
as that provided by Sidney Siegel and N. John Castellan, Jr., Nonparametric Statistics for the
Behavioral Sciences, 2nd edn., McGraw-Hill, 1988, pp. 360–1.
$$t_{calc} = \frac{b}{\mathrm{SE}(b)} = \frac{b}{\sqrt{\dfrac{\sum(Y_i-\hat{Y}_i)^2}{(n-k)\sum(X_i-\bar{X})^2}}}$$
tcalc can then be compared with the theoretical value of the t-distribution in
the usual way and a decision reached with regard to the null hypothesis.
Before working through an illustration of this test, there are a few addi-
tional points that should be mentioned in relation to t-tests for regression
coefficients. The first relates to a widely used rule of thumb. Reference to
the table of the t-distribution shows that for a two-tailed test at the 5 per
cent level of significance the value of t is very close to 2 for sample sizes of
20 or more (see column (4) of table 5.1). (This is the small-sample equiva-
lent of the statement in §2.6.1 that 1.96 standard deviations cover 95 per
cent of a normal distribution, leaving 5 per cent in the two tails.)
This leads to a very convenient rule of thumb for establishing the statis-
tical significance of a regression coefficient without undertaking more
formal calculations. If the t-ratio (i.e. the ratio of the regression coefficient
to its standard error) is equal to 2 or more, the relationship between that
explanatory variable and the dependent variable is statistically significant
at the 5 per cent level.
Secondly, it is also straightforward to test other null hypotheses regard-
ing the relationship between the dependent and explanatory variables; for
example, that β has some specified value, say 1.7 This simply requires the
insertion of that value in the numerator in place of the zero, so the test sta-
tistic becomes
$$t_{calc} = \frac{b-1}{\sqrt{\dfrac{\sum(Y_i-\hat{Y}_i)^2}{(n-k)\sum(X_i-\bar{X})^2}}} \qquad (6.16)$$
Table 6.1 Calculation of the regression coefficient and t-ratio for the regression of
UNEMP on BWRATIO
Source: For (1) and (2) see the data set for inter-war unemployment.
(Yi − Ȳ), (Xi − X̄), and (Yi − Ŷ). The first step is to calculate the value of the regression coefficient, b, and the intercept, a, using the formulae given in (4.2) and (4.3) in §4.2.3. This gives
$$b = \frac{3.472}{0.222} = 15.64$$
and
$$a = 13.70 - 15.64(0.467) = 6.40$$
$$UNEMP = 6.40 + 15.64\,BWRATIO$$
As the final step in the procedure this value for tcalc is compared with the
theoretical t-distribution (summarized in table 5.1). In this case the
number of degrees of freedom (df) is (n − 2) = 17. For 17 df and a two-tailed
test at the 5 per cent significance level the critical value is 2.110. Since tcalc is
less than this it falls well within the non-rejection region, and H0 cannot be
rejected.
If a statistical computer program is used in place of table 5.1 it is possible
to be more precise. For this regression it reports that the prob-value for the
explanatory variable is actually 0.101. This indicates that the probability of
getting a t-statistic of 1.74 when H0 is that there is no relationship between
UNEMP and BWRATIO is as much as 10.1 per cent. The null hypothesis
would thus fail to be rejected even if the significance level had been set at 10
per cent.
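The whole test is produced automatically by regression software; a minimal sketch with hypothetical stand-ins for the two series (these are not the inter-war data):

```python
import numpy as np
from scipy import stats

bwratio = np.array([0.35, 0.40, 0.44, 0.47, 0.50, 0.53, 0.55])  # hypothetical
unemp = np.array([10.2, 12.5, 13.1, 13.9, 14.6, 14.4, 15.8])    # hypothetical

res = stats.linregress(bwratio, unemp)
t_calc = res.slope / res.stderr          # the t-ratio for the slope
print(round(res.slope, 2), round(t_calc, 2), round(res.pvalue, 3))
```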
Notes
1
A classic illustration of the consequences if such an approach is not adopted was the
case of Thalidomide. This anti-nausea drug was introduced in the 1950s for the
benefit of pregnant women, but was later demonstrated to have caused substantial
numbers of severe birth defects among the children of those who took the drug. Note
that in the wake of such cases, the legal rules of liability certainly intensify the prefer-
ence for significance over efficiency among drug companies.
2
Since we have laid stress on the fact that the population standard deviation, σ, is typically unknown, it might legitimately occur to you to ask how we know the value of the population mean, μ. The answer would be that we very seldom do know what μ is, and the test statistics historians calculate are normally variants of (6.2) that do not involve hypotheses about the value of a single mean, and thus do not involve μ. See, for example, §6.7.
3
For a number of applications of a difference of means test by an historian see Ann
Kussmaul, Servants in Husbandry in Early Modern England, Cambridge University
Press, 1981, pp. 57–9 and 64–5. For example, she has two samples giving information
on the distances travelled by servants between successive hirings in the late eight-
eenth century. The larger sample, based on the Statute Sessions held at Spalding (in
Lincolnshire), indicates a mean distance for male servants of 12.32 km. The figure
calculated from the smaller sample, based on settlement examinations (held to
determine entitlement to poor relief) in Suffolk, was 8.24 km. The two estimates
were found to differ significantly (p = 0.0031). Kussmaul also uses the corresponding
procedure to test for differences of proportions.
4
The corrected number of degrees of freedom will be calculated automatically by the
computer, using the formula:
$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1-1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2-1}}$$
5. The example in the text of the difference of means test of relief payments
in Kent and Sussex does not take into consideration the fact that these data
were collected by sampling without replacement. Using the procedures
introduced in chapter 5, recalculate the example incorporating the finite
population correction. Report your results. Does the correction weaken or
strengthen the conclusions reached in the text? Explain your conclusions.
Can any general rules be learned from this example?
6. Use the Boyer relief data set to test the hypothesis that workhouses in
Southern England were more likely to be built in parishes with more aggre-
gate wealth at their disposal (you will need to construct a new variable that
measures total wealth per parish; this can be achieved by multiplying
average wealth by parish population).
A critic of your analysis of the relationship between wealth and the work-
house notes that the distribution of total wealth by parishes is not even
approximately normally distributed. Indeed, the critic observes that the
distribution is log-normal (see §2.4.1).
(ii) Generate bar charts of total wealth and the log of total wealth by
parish to assess these criticisms. Then repeat (i) with the log of total
wealth. Does this cause you to amend your evaluation of the main-
tained hypothesis that wealthier parishes were more likely to build
workhouses?
(iii) Is there a more straightforward answer to your critic, that does not
involve recalculation of the data? (Hint: remember the Central Limit
Theorem.)
(i) Use the data to test for equality in the average level of births per 100
families in Bedfordshire and Essex, assuming equal variances in the
two counties. Note the relative size of the standard deviations of the
birth rate for the two counties. What difference does it make to your
conclusions if you treat the population variances in the two counties
as unequal rather than equal?
(ii) Use the data to test for equality in the average level of births per 100
families in Buckinghamshire and Sussex. Once again, there is a
(i) With this new evidence in hand, you decide to reassess the results of
question 7. The number of observations has been increased, such
that there are now 35 parishes reported for Bedfordshire and Essex;
37 parishes for Buckinghamshire, and 45 for Sussex.
(ii) What do you find when you repeat the exercise? How would you
explain any change in the results?
(iii) What do the results of questions 7 and 8 tell us about the robustness
of the t-test for difference of means between samples?
9. In Steckel’s data set on the patterns of leaving home in the United States,
1850–60, there are two variables, YNGERFAM and OLDERFAM, that
record the number of younger and older children in each household.
(i) Calculate the mean and standard deviation of each variable.
(ii) You wish to record the total number of children in the household.
Create a new variable, TOTALFAM. Calculate the mean and stan-
dard deviation of the new variable.
(iii) Compare the mean and standard deviation of TOTALFAM with the
means and standard deviations of YNGERFAM and OLDERFAM.
What do these comparisons tell you about the additivity of these
measures of central tendency?
10. In a revision of Benjamin and Kochin’s analysis of inter-war unemploy-
ment in Britain, a researcher addresses the question of the separate effects
of benefits and wages on unemployment. He wishes to test whether the
impact on unemployment of a change in the benefit level (BENEFITS),
holding the wage rate constant, was the same as a change in the average
wage (WAGES), holding the benefit rate constant. He also wishes to test
whether the coefficient on the benefit rate is significantly different from
Non-parametric tests
7.1 Introduction
Historians cannot always work with problems for which all the relevant
data are based on quantitative measurements. Very often the only informa-
tion available for analysis relates to the number of cases falling into
different categories; the category itself cannot be quantified.a Thus house-
hold heads might be classified according to their sex, political affiliation, or
ethnicity. Wars might be grouped into epic, major, and minor conflicts.
Women might be subdivided by their religion, the forms of birth control
they practised, or the socio-economic status of their fathers.
Alternatively the historian might have data that can be ranked in order,
but the precise distance between the ranks either cannot be measured or is
unhelpful for the problem under consideration. One example might be a
ranking of all the universities in the country by a newspaper combining on
some arbitrary basis a medley of criteria such as the quality of students
admitted, library expenditure per student, and total grants received for
research. Another might be a ranking of the power of politicians in an
assembly on the basis of some measure of their influence on voting in the
assembly.1 Similarly, an historian of religion might construct a ranking of
the intensity of religious belief of the members of a community according
to the frequency of their church attendance in a given period.
In all such cases it is not possible to apply the techniques of hypothesis
testing described in chapter 6. There are two fundamental reasons why par-
ametric tests such as Z and t cannot be used.b First, these statistical tests
a
In the terminology of §1.3.3 the information is at the nominal level of measurement, as opposed
to the interval or ratio level of most economic data.
b
Statistical tests such as Z and t (or the F-test that will be introduced in §9.3) are known as para-
metric tests because they test hypotheses about specific parameters (characteristics of the popu-
lation) such as the mean or the regression coefficient.
cannot be made with nominal or ordinal levels of data; they require variables that are measured on an interval scale. With nominal level data it is
simply not possible to perform any of the basic operations of arithmetic
such as would be needed to calculate a mean. With some forms of ordinal
level data, for example, the ranking of politicians on the basis suggested
above, it might be possible to perform such operations, but the interpreta-
tion of the results would be problematic. The successive differences
between any two ‘scores’ are not true intervals with substantive meaning,
and consequently the interpretation of test results, such as the position of
the test statistic in relation to the critical region, may be misleading.
Secondly, the parametric tests assume that the sample observations
satisfy certain conditions, in particular, that they are drawn from popula-
tions that are approximately normally distributed (see §6.2.1).
To take the place of the parametric tests in situations where they are not
appropriate, a large battery of non-parametric tests has been developed.
These can be applied to nominal or ordinal level data, and require the
minimum of assumptions about the nature of the underlying distribution.
It is thus appropriate to use a non-parametric test when there is reason to
think that the relevant population distribution is not normal (or its exact
form cannot be specified). This is particularly likely to be the case when
dealing with small samples.
The fundamental principles of hypothesis testing outlined in chapter 6
apply equally to the various non-parametric tests. It is again necessary:
c
Tables are given in D. V. Lindley and W. F. Scott, New Cambridge Statistical Tables, 2nd edn.,
Cambridge University Press, 1995, for all the tests referred to in this chapter.
d
In addition to these tests for two independent samples there are other non-parametric tests for
two related samples (i.e. those that are not independent), or for three or more samples, usually
referred to collectively as k samples. There are also tests for a single sample, and three examples of
these tests will be discussed in §7.4 and §7.5.
Table 7.1 Rankings of two independent samples of students (imaginary data, four
variants)
University I II III IV
Harvard 1 1 1 1
Harvard 2 2 2 3
Harvard 3 3 3 5
Harvard 4 4 4 7
Harvard 5 5 9 9
Harvard 6 11 10 11
Harvard 7 12 11 13
Harvard 8 13 13 15
Harvard 9 14 14 17
Harvard 10 20 20 19
Yale 11 6 5 2
Yale 12 7 6 4
Yale 13 8 7 6
Yale 14 9 8 8
Yale 15 10 12 10
Yale 16 15 15 12
Yale 17 16 16 14
Yale 18 17 17 16
Yale 19 18 18 18
Yale 20 19 19 20
means of a runs test would be to list the 20 observations in rank order. If the
symbols H and Y are used to designate the respective samples, the list for
variant III of table 7.1 – ranked in descending order – would look like this:
H Y Y Y Y Y H H Y H H H Y Y Y Y H H H H
Each sequence of the same letter constitutes a run. In the above illustration
there are a total of seven runs: 4 in H and 3 in Y.
The basic idea behind the test is that if the underlying distributions are
identical, the letters representing the two samples will be randomly scat-
tered all through the list, creating large numbers of short runs. On the other
hand, if the underlying distributions are different, the ranks from one of the
samples are likely to predominate in some parts of the list, and those from
the other sample in other parts, giving rise to a small number of long runs.
The test cannot say anything about how the distributions might differ
– it might be, for example, with respect to either the location (mean or
median) or the shape (dispersion or skewness) of the distribution, or to
both. It is thus mainly useful as a quick and easy test when the focus of
interest is not on the specific features of the distribution but only on detect-
ing a general difference.
The null hypothesis is that there is no difference between the two
samples, i.e. they come from the same population. Even if no direction is
predicted, the runs test is effectively a one-tailed test because H0 can be
rejected only if there are a small number of long runs.e However a direction
may be predicted in advance; for example, the research hypothesis might
be that one sample would have a higher ranking than the other. This would
be a possible reason for selecting a lower significance level, such as 2.5 per
cent, thus forming a smaller critical region and so making it harder to reject
the null hypothesis.
The test statistic, r, is the number of runs and can be counted very easily.
Given the choice of significance level and the test statistic, the decision can
be made on whether or not to reject the null hypothesis. These final steps in
the test can be performed on the computer by your statistical software,
though for small samples it may be as quick to carry out the work by hand.
There is no need to memorize the formulae the computer will apply, but
they are discussed below so that those who wish to increase their under-
standing can see how the results are obtained. It is also useful to show expli-
citly that this non-parametric procedure belongs to the same family of test
procedures as those already discussed in chapter 6.
If the samples are small (neither more than 20), the exact sampling dis-
tribution for the number of runs, r, can be calculated for different combi-
nations of sample size, and is given in a published table.f
If the sample sizes are larger than this, the Central Limit Theorem (see
§5.3.1) applies and the sampling distribution of r is approximately normal.g
The standardized value of Z is calculated from the mean, μr, and standard
deviation, σr, of the sampling distribution of the number of runs, r.
e
This is an exception, therefore, to the general principle stated in §6.3.3 that a one-tailed test
should be used only when direction is predicted.
f
Lindley and Scott, Statistical Tables, table 18, pp. 60–2. The table is in two parts, the upper and
lower percentage points (corresponding to the left- and right-hand tails of the distribution), but
for the present test it is only the latter we are interested in. (The upper percentage points are
required when the runs test is used below for the one-sample test of randomness, for which both
tails may be relevant; see §7.4.)
g
Recall from §5.3.1 that it is possible for a sampling distribution to be approxi-
mately normal even when the underlying population from which the samples are drawn is not
assumed to be normal.
$$\mu_r = \frac{2n_1n_2}{n_1 + n_2} + 1 \qquad (7.1)$$

$$\sigma_r = \sqrt{\frac{2n_1n_2(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2(n_1 + n_2 - 1)}} \qquad (7.2)$$
The standardized value Zcalc can thus be derived in the usual way by cal-
culating the difference in units of standard errors between the actual
number of runs given by the particular sample, and the theoretical mean
given by the sampling distribution.h
This is

$$Z_{calc} = \frac{r - \mu_r}{\sigma_r} \qquad (7.3)$$
To make the test decision, the resulting Zcalc is compared with the theo-
retical sampling distribution. For small samples this is given by the special
table; for large samples the table for the standard normal distribution (par-
tially reproduced in table 2.7) is used.i
The rank-ordered sequences for the four variants of table 7.1, and the
resulting number of runs, are:

I II III IV
(2 runs) (5 runs) (7 runs) (20 runs)
H H H H
H H H Y
H H H H
H H H Y
H H Y H
H Y Y Y
H Y Y H
H Y Y Y
H Y H H
H Y H Y
Y H H H
Y H Y Y
Y H H H
Y H H Y
Y Y Y H
Y Y Y Y
Y Y Y H
Y Y Y Y
Y Y Y H
Y H H Y
Note:
* Probability of obtaining this value of r if the null hypothesis is correct. Both sample sizes are
less than 20, so the exact distribution of the test statistic can be used rather than the normal
approximation.
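To make the mechanics concrete, here is a minimal sketch in Python of the
large-sample version of the test, following (7.1)–(7.3). The function name
and the hard-coded variant III sequence are our illustration, not part of
the original text.

    from math import sqrt

    def runs_test_z(seq):
        # Wald-Wolfowitz runs test, normal approximation, as in (7.1)-(7.3)
        labels = sorted(set(seq))
        n1, n2 = seq.count(labels[0]), seq.count(labels[1])
        r = 1 + sum(a != b for a, b in zip(seq, seq[1:]))   # number of runs
        mu_r = 2 * n1 * n2 / (n1 + n2) + 1                  # (7.1)
        sigma_r = sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                       / ((n1 + n2) ** 2 * (n1 + n2 - 1)))  # (7.2)
        return r, (r - mu_r) / sigma_r                      # (7.3)

    # variant III of table 7.1, listed in rank order
    print(runs_test_z(list("HHHHYYYYHHHYHHYYYYYH")))  # 7 runs, Zcalc about -1.84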
the sample means (or of some other measure of location), but can also be
used as a test for differences in dispersion or skewness. The basic idea is that
if the underlying distributions are identical, the ranking of the observa-
tions from the two samples will be broadly similar. If they are not identical,
the ranks of the one sample will be predominantly higher (or lower) than
those of the other. This difference might occur because, for example, the
mean of one sample was above the mean of the other, or because one
sample was more spread out than the other.
The null hypothesis is thus that the two distributions are identical. The
research hypothesis might either be simply that they are different (a two-
tailed test); or, more specifically, that the rank values from one of the distri-
butions are predominantly higher (lower) than those from the other (a
one-tailed test in the specified direction).
The hypothetical ranking of the 20 students from Harvard and Yale is
again used to illustrate the procedure, which is as follows:
(a) Combine both samples, and rank the students in ascending (or
descending) order from 1 to 20.j
(b) Take the lowest rank in one of the samples (if the samples differ in size
it is usual to take the smaller but this does not affect the result – we will
take the Harvard sample), and count the number of students with
higher ranks than this in the second sample (Yale). Repeat this for each
successive rank in the Harvard sample.
(c) Add the results of this counting exercise for all 10 Harvard students.
This sum is the test statistic, U.
(d) Compare U with the appropriate theoretical sampling distribution to
test the probability that the observed difference in ranks could have
come from the same distribution.
The Wilcoxon rank sum test is a variant of the Mann–Whitney test that
simply counts the sum of the ranks of the smaller (or first) sample. This
sum, W, is the test statistic, and there is a fixed relationship between W and
U.3 Furthermore, the exact significance levels of W and U are the same, so
the two tests always give the same result.
The relevant calculations for the subsequent stages of this test are again
given in order to clarify what the computer does if asked to perform the
test. If the samples are small (neither more than 20) it is possible to use
j
If the underlying scores result in a tie, the rank allocated is the mean of the ranks that would have
otherwise been allocated to the tied scores. If the tie is within one of the samples the calculations
are otherwise unaffected; if it is a tie between samples a correction to the formula for Zcalc is
required.
a special table that gives the exact sampling distribution for the
Mann–Whitney test statistic, U.k
With larger sample sizes the sampling distribution of U will be approxi-
mately normal. In this case the mean and standard deviation of this sam-
pling distribution can be calculated as in the following box:
$$\mu_U = \frac{n_1n_2}{2} \qquad (7.4)$$

$$\sigma_U = \sqrt{\frac{n_1n_2(n_1 + n_2 + 1)}{12}} \qquad (7.5)$$
A standardized value for the test statistic Zcalc in units of standard errors
can be derived as before.
This is

$$Z_{calc} = \frac{U - \mu_U}{\sigma_U} \qquad (7.6)$$
To make the test decision, the resulting Zcalc is compared with the theo-
retical sampling distribution. For small samples this is given by the special
table; for large samples the table for the standard normal distribution (par-
tially reproduced in table 2.7) is used.l
k
Lindley and Scott, Statistical Tables, table 21, pp. 66–7. The table gives the lower percentage point
of the distribution.
l
For the complete table of the normal distribution see Lindley and Scott, Statistical Tables, table 4,
pp. 34–5.
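As a minimal sketch of steps (a)–(d), here are the variant III ranks from
table 7.1 in Python. This is our own illustration; note that it counts the
Yale students ranked above each Harvard student (i.e. with numerically
smaller rank numbers), which is the convention under which W = U + n₁(n₁ + 1)/2,
as in n. 3.

    from math import sqrt

    harvard = [1, 2, 3, 4, 9, 10, 11, 13, 14, 20]   # variant III ranks
    yale = [5, 6, 7, 8, 12, 15, 16, 17, 18, 19]

    # steps (b)-(c): count, for each Harvard student, the Yale students
    # ranked above them, and sum the counts
    U = sum(1 for h in harvard for y in yale if y < h)
    W = sum(harvard)                                 # Wilcoxon rank sum
    n1, n2 = len(harvard), len(yale)
    print(U, W, W == U + n1 * (n1 + 1) // 2)         # 32 87 True

    # large-sample normal approximation, (7.4)-(7.6)
    mu_u = n1 * n2 / 2                               # (7.4)
    sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)     # (7.5)
    print((U - mu_u) / sigma_u)                      # Zcalc, as in (7.6)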
Table 7.3 The Mann–Whitney U-test and the Wilcoxon rank sums test

1 Calculation of W and U, starting from each sample in turn

First sample               Second sample
rank (W)    U count        rank (W)    U count
19          9              20          10
17          8              18          9
15          7              16          8
13          6              14          7
11          5              12          6
9           4              10          5
7           3              8           4
5           2              6           3
3           1              4           2
1           0              2           1
W = 100     U = 45         W = 110     U = 55
2 Test statistics, significance values, and results for all four series in table 7.1
Note:
* Probability of obtaining this value of U (or W) if the null hypothesis is correct. Both sample
sizes are less than 20, so the exact distribution of the test statistic can be used rather than the
normal approximation.
using the tables, but with the precise prob-values we have more
information.8
$$KS\ Z = D\sqrt{\frac{n_1n_2}{n_1 + n_2}} \qquad (7.7)$$
Table 7.4 The Kolmogorov–Smirnov test: test statistics, significance values,
and results for all four series in table 7.1
(columns: Series; D; n₁n₂D; Kolmogorov–Smirnov Z; Significance (2-tailed)*;
Decision; the values themselves are not reproduced here)
Note:
* Probability of obtaining this value of D if the null hypothesis is correct. The significance
levels quoted are an approximation.
$$KS\ Z = 0.4\sqrt{\frac{10 \times 10}{10 + 10}} = 0.4 \times 2.2361 = 0.8944$$
The values of KS Z for all four cases are reported in column (4) of the lower
panel of table 7.4.9 These are then used to calculate the approximate
significance levels in column (5).
The prob-value for I is 0.00 (zero probability of getting this result if H0 is
true) and so H0 must clearly be rejected. The prob-values then rise from
0.164 for II to 0.40 for III and finally – for the extreme case of IV – to 1.00
(100 per cent probability of getting this result if H0 is true). The decisions
listed in column (6) are thus identical to those made on the basis of the
tables.10
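For readers who want to replicate the computer's work, here is a sketch
(ours) of the two-sample D and its conversion to KS Z via (7.7):

    import numpy as np

    def ks_two_sample(x, y):
        # D is the largest absolute gap between the two empirical CDFs
        x, y = np.sort(x), np.sort(y)
        grid = np.concatenate([x, y])
        cdf_x = np.searchsorted(x, grid, side="right") / len(x)
        cdf_y = np.searchsorted(y, grid, side="right") / len(y)
        d = np.abs(cdf_x - cdf_y).max()
        ks_z = d * np.sqrt(len(x) * len(y) / (len(x) + len(y)))  # (7.7)
        return d, ks_z

With n₁ = n₂ = 10 and D = 0.4 this reproduces the KS Z of 0.8944 calculated
above.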
o
We particularly recommend Sidney Siegel and N. John Castellan Jr., Nonparametric Statistics for
the Behavioral Sciences, 2nd edn., McGraw-Hill, New York, 1988.
not reflected in a numerical value or a rank order, but instead involve its
classification into two or more categories. For example, a study of voting
patterns in the United States in the presidential election of 1896 might clas-
sify voting both by degree of urbanization (large cities, other urban areas,
rural areas) and by political choice (Democrat, Republican, Populist). It
would then be possible to see whether there were any significant differences
between voters related to their location. Or a historical demographer
might classify women in a given locality on the basis of their fathers’ occu-
pation (merchant, farmer, craftsman, labourer), and of their own marital
status (single, married, widowed, remarried). The purpose would be to see
whether there was any relationship between their fathers’ occupation and
their own marital status.
Our aim in the remaining sections of this chapter is to explain some of
the procedures that can be used when dealing with nominal (categorical)
data of this sort. On some occasions a simple comparison of percentages
may reveal such clear differences that nothing more is needed. But if the
differences are small, or the relationship between different categories is
more complex, then it would be necessary to adopt an appropriate statisti-
cal procedure.
We begin in §7.3.1 with a brief digression to introduce the basic idea of
contingency tables or cross-tabulations used for such data. In §7.3.2 we
discuss an application of the χ²-test (χ is the lower-case Greek letter chi,
pronounced ki-squared) that can be used when it is desired to compare one or more
samples with respect to a variable that is classified into two or more catego-
ries. This is a test of whether or not the differences between the categories
are statistically significant. We also introduce a new theoretical probability
distribution that is required for this test.
In §7.5 we will look at two other procedures that can be used when there
is only one sample and the object is to test the observed distribution against
some theoretical alternative. Finally, some of the corresponding non-para-
metric measures of the strength of association between the categories are
discussed in §7.6.
q
Anne Digby, Madness, Morality and Medicine, A Study of the York Retreat, 1796–1914, Cambridge
University Press, 1985.
Table 7.5 Patients at the York Retreat: treatment outcomes under successive
superintendents' regimes

Numbers
Superintendent            Recovered   Improved or relieved   Not improved   Total
Jepson (1796–1823)           117             31                   11          159
Allis (1823–41)               96             29                   22          147
Thurnam and Kitching
(1841–74)                    158             68                   30          256
Total                        371            128                   63          562

Percentage
Superintendent            Recovered   Improved or relieved   Not improved   Total
Jepson (1796–1823)          20.8            5.5                  2.0         28.3
Allis (1823–41)             17.1            5.2                  3.9         26.2
Thurnam and Kitching
(1841–74)                   28.1           12.1                  5.3         45.5
Total                       66.0           22.8                 11.2        100.0

Source: Digby, Madness, p. 231. The classification excludes patients who died in the Retreat.
can then be applied to establish whether the observed joint frequency dis-
tribution in table 7.5 could have occurred by chance.11 The use of the test
for this purpose is sometimes referred to as a test of independence (or, con-
versely, of association) between the variables.
The χ²-test involves a comparison of the observed frequencies (fo) with
those that would be expected if there were no relationship between the vari-
ables (fe). At the core of this measure is the difference between the observed
and the expected frequency in each cell.
The difference is squared (for familiar reasons – compare §2.3.2), and
the squares are then standardized by dividing by the expected frequency in
each cell.r Finally, this measure is summed over all cells. If fo and fe agree
exactly, χ² = 0. The greater the discrepancy between the observed and
expected frequencies, the larger the value of χ².
r
This ensures that the biggest contributions to χ² come from the biggest discrepancies between fo
and fe , not from the cells with the largest number of cases.
THE χ²-TEST

It is defined as follows:

$$\chi^2 = \sum\frac{(f_o - f_e)^2}{f_e} \qquad (7.8)$$
Take a normally distributed variable and standardize it. The result will have
a mean of zero and a standard deviation of 1. Then square this to obtain Z₁².
If a second independent variable is standardized and squared in the same
way, this will be Z₂², a third will be Z₃², and so on.
The total number of such squared terms is given by the parameter k, and k
can have any value from 1 to infinity. If we designate the chi-square random
variable as V, then
$$V = Z_1^2 + Z_2^2 + \dots + Z_k^2 \qquad (7.9)$$
The precise shape of the distribution, including its mean and standard
deviation, depends solely on k, which represents the number of degrees of
freedom (df) of the chi-square distribution. Just as the normal curve was
completely defined by two parameters, its mean and standard deviation, so
the chi-square distribution is completely defined by this one parameter, k. A
further property of this distribution is that the mean is also k, and the vari-
ance is 2k.
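The construction is easy to check by simulation. In this sketch (our
illustration) 100,000 values of V are formed from k = 4 squared standard
normal variables, and their mean and variance come out close to k and 2k:

    import numpy as np

    rng = np.random.default_rng(0)
    k = 4
    z = rng.standard_normal((100_000, k))  # k independent standard normals
    v = (z ** 2).sum(axis=1)               # V = Z1^2 + ... + Zk^2, as in (7.9)
    print(v.mean(), v.var())               # approximately k and 2k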
Because the chi-square distribution is formed by the process of squaring
a variable it can have only positive numbers, and when df is small the distri-
bution is heavily skewed to the right. However as df increases the distribu-
tion becomes more symmetric, and at about 30 df it becomes effectively the
same as a normal distribution. Some illustrative distributions are shown in
figure 7.1.*
*
The percentage points of the chi-square distribution are given for selected df from 1 to 100 in
Lindley and Scott, Statistical Tables, table 8, p. 41.
[Figure 7.1 The chi-square distribution for different degrees of freedom
(including df = 10 and df = 30); horizontal axis: chi-square; vertical axis:
probability density.]
The test statistic χ²calc measures the difference between the observed fre-
quencies (shown in table 7.5) and the frequencies that would be expected if
there were no relationship between the two variables.
Out of a total of 562 patients in the three samples, 371 (66.0 per cent)
were classified as recovered. If there were no relationship between regime
and outcome, the same proportion would be expected in all patients. The
expected frequencies would thus be 66 per cent of the 159 patients admitted
by Jepson, 66 per cent of the 147 admitted by Allis, and 66 per cent of the
256 admitted by Thurnam and Kitching.
This gives expected frequencies for the patients classified as recovered of
104.9, 97.0, and 169.0 to compare with the observed frequencies of 117, 96,
and 158. Expected frequencies for the patients classified as improved (22.8
per cent) and not improved (11.2 per cent) can be derived in the same way.
The full calculation of χ² is carried out in table 7.6, and produces a result of
χ²calc = 9.64.
In order to compare this with the theoretical chi-square distribution so
as to make a decision regarding its significance, it is first necessary to deter-
mine the number of degrees of freedom associated with the cross-tabula-
tion. This concept was first introduced in panel 5.3 in §5.4.2, and its
application in the present context is explained in panel 7.2. We see from
this that for the 3 × 3 tabulation of table 7.6 there are 4 degrees of
freedom (df).
If we enter the published tables for the theoretical probability distribu-
tion of chi-square with 4 degrees of freedom (df) we find that for a 5 per
cent significance level, the critical point is 9.488.s In other words, 5 per cent
of the area under the curve lies to the right of this value. Any calculated
value for the test statistic larger than that falls in the critical (rejection)
region; any statistic lower than that falls in the region in which the null
hypothesis cannot be rejected. Since the calculated test statistic is 9.64 it is
greater than the critical value and the null hypothesis of no difference can
be rejected.
It is thus established that there is a statistically significant association
between the regimes of the successive superintendents and the results of
treatment at the Retreat. As with all tests of statistical significance, the
result is partly determined by the sample size. It is important to be aware
that the value of χ²calc is directly proportional to the size of the combined
samples used in the calculation. If the data in table 7.6 were all scaled down
by half, the value of χ² would also be halved. It is thus relatively easy to get a
large and statistically significant result with large samples.
s
Lindley and Scott, Statistical Tables, table 8, pp. 40–1.
Table 7.6 χ²-test for statistical significance: calculations using data for York Retreat,
classified by superintendent's regime and treatment outcomes

                         fo (1)   fe (2)   fo − fe   (fo − fe)²   (fo − fe)²/fe
Jepson
Recovered                 117     104.9     +12.1      145.3         1.38
Improved or relieved       31      36.2      −5.2       27.5         0.76
Not improved               11      17.8      −6.8       46.3         2.60
Allis
Recovered                  96      97.0      −1.0        1.0         0.01
Improved or relieved       29      33.5      −4.5       20.4         0.61
Not improved               22      16.5      +5.5       30.7         1.86
Thurnam and Kitching
Recovered                 158     169.0     −11.0      121.6         0.72
Improved or relieved       68      58.4      +9.8       95.3         1.63
Not improved               30      28.7      +1.3        1.6         0.06
Total                     562     562.0       0.0      489.7         9.64

Note:
(1) See table 7.5.
(2) See text.
The test has shown that the differences are statistically significant. It
could be, however, that it is only a very weak association, of no real histori-
cal interest for medical historians. It remains for historians of the institu-
tion to investigate the strength of the association, and to evaluate its
underlying causes and consequences. A number of such measures of the
strength of association will be considered in §7.6.
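The whole calculation can be reproduced in a few lines (a sketch assuming
Python with scipy is available; the array follows table 7.5):

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: Jepson, Allis, Thurnam and Kitching
    # columns: recovered, improved or relieved, not improved
    observed = np.array([[117, 31, 11],
                         [96, 29, 22],
                         [158, 68, 30]])
    chi2, p, df, expected = chi2_contingency(observed)
    # about 9.6 with 4 df (table 7.6, with more rounding, gives 9.64);
    # p comes out just below 0.05
    print(round(chi2, 2), df, round(p, 3))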
extremes, and a runs test would be appropriate for this. If the sample is
large, with either or both the number of events (H and L) greater than 20,
then the distribution of r is approximately normal, with mean, μr, and
standard deviation, σr, exactly as given previously by (7.1) and (7.2). The
test statistic, Zcalc, can thus be obtained as in (7.3). This can then be compared
with the normal distribution (as in table 2.7) to obtain the prob-value for
either a two-tailed or a one-tailed test.t
If the sample is small, with neither event (H or L) greater than 20, the
required result can be obtained directly from a published table. The statis-
ticians have worked out the sampling distribution of r that could be
expected from repeated random samples, and we can refer to the tables of
this distribution to establish what intermediate number of runs can be
treated as consistent with a random sample.u With small samples we can
either ask the computer to make the calculation or we can do it directly
from the table. For larger samples the computer is obviously quicker.v
To see how the procedure works for a small sample we will consider four
cases. I and II are those already mentioned, with r = 2 and 25, respectively.
In III the sequence gave 11 runs, and in IV it gave 8 runs.w
Entering the table for the distribution of the number of runs for n₁ = 12
and n₂ = 13 and adopting the 5 per cent level, we find that the upper bound
is 18 and the lower is 9. The upper bound indicates that if r is 19 or higher it
is so large that the probability of obtaining it with a random sample is 5 per
cent or less. Similarly, the lower bound indicates that if r is 8 or lower the
probability of obtaining it with a random sample is 5 per cent or less.13
Accordingly, the null hypothesis cannot be rejected for values of r
between 9 and 18, but can be rejected for any samples with r outside this
range. Since three of our variants (I, II, and IV) fall outside this range it
t
For the complete table of the normal distribution see Lindley and Scott, Statistical Tables, table 4,
pp. 34–5.
u
Lindley and Scott, Statistical Tables, table 18, pp. 60–2.
v
Note that in order to perform the test with a computer package it may be necessary to recode the
sequences from letters to numbers; for example by replacing all the Ls by a 1 and all the Hs by a 2.
w
To simplify the illustration we have kept the number of both events constant in each case, but if a
random selection had really been undertaken four times we would not expect that there would
always be 12 Hs and 13 Ls on the list.
leaves only one, III, with 11 runs, for which the null hypothesis of a random
sample cannot be rejected.
Note that the essence of this test is that it is based solely on the order in
which the events occur, not on their frequency. It is the pattern of the
ranking that informs us that the null hypothesis should be rejected for var-
iants I, II, and IV. There is nothing in the frequencies (in every case 12 Hs
and 13 Ls) to identify either the extreme lack of randomness in samples
I and II, or the narrower difference in respect of randomness between III
and IV.
It is also worth pointing out the difference between this test and the two-
sample test described in §7.2.1. In the present test, perfect alternation (as in
II) – and more generally a large number of runs – leads to rejection of the
null hypothesis of randomness. By contrast, in the Wald–Wolfowitz runs
test, perfect alternation (as in variant IV of table 7.1) – and more generally
a large number of runs – means that the null hypothesis of no difference
between the two samples cannot be rejected.
The rival hypothesis is that the runs in prices are largely independent of
the sequence of yields because of the possibility of grain storage. An excep-
tionally abundant harvest in one year will enable grain to be carried over
not only for the next year, but also for subsequent years. Even if the follow-
ing year was one of below average yields, prices might remain below
average because grain was being sold from store. Thus there would be runs
in prices that would not reflect corresponding runs in yields.
In the light of this analysis, one element in the investigation of the data
on prices (and on yields where that exists) would be to establish whether or
not the clusters of above and below average observations are random. Tony
Wrigley applied a one-tailed runs test to French data on wheat yields and
prices over the period 1828–1900, and showed that the null hypothesis of
randomness could not be rejected for the yields (prob-value = 0.36), but
was easily rejected for the prices (prob-value = 0.0028).x
When the same one-tailed test was applied to a series on wheat prices at
Exeter over three subperiods between 1328 and 1789, the prob-values were
respectively 0.0436, 0.0021 and 0.0003. As Wrigley noted, the null hypoth-
esis can only just be rejected at the 5 per cent significance level for the first
subperiod, but in the two other subperiods ‘it is increasingly and ultimately
extremely improbable that the runs were random’ (p. 129).
On the basis of these findings Wrigley suggested that the presence of non-
random runs in the price data, but not – for France – the data on yields, sup-
ported the storage rather than the seed corn hypothesis; and further, that the
trend over time towards a stronger pattern of non-random fluctuations in
prices also suggested that storage was the relevant factor since the impact of
seed corn would have been most pronounced in medieval times.
This conclusion has been reinforced by the application of the runs test
to data on both wheat yields and prices at the Winchester manors in the
years 1283–1350. For yields the prob-value was 0.402, whereas for prices it
was 0.0017. This suggested a high probability that the yield sequence was
random but the price sequence was not. The contrast thus confirmed the
message of the French yield data. Further support for the storage hypothe-
sis was given by the application of this and other tests to data on butter
prices. For butter, unlike wheat, the null hypothesis of random fluctuations
in price could not be rejected, and this is what would be expected for a
commodity that could not be stored.y
x
E. A. Wrigley, ‘Some reflections on corn yields and prices in pre-industrial economies’, in E. A.
Wrigley, People, Cities and Wealth, Blackwell, 1987, pp. 92–132.
y
Randall Nielsen, ‘Storage and English government in early modern grain markets’, Journal of
Economic History, 57, 1997, pp. 1–33. For further discussion of this issue see Karl Gunnar
Persson, Grain Markets in Europe 1500–1900, Cambridge University Press, 1999.
z
See for example, Voth, Time and Work, p. 86.
aa
Lindley and Scott, Statistical Tables, table 23, p. 70 has this form, and it is this value which SPSS
reports under the name of Kolmogorov–Smirnov Z.
$$\phi = \sqrt{\frac{\chi^2}{n}} \qquad (7.11)$$
$$V = \sqrt{\frac{\chi^2}{n \times \min(r - 1,\ c - 1)}} \qquad (7.12)$$
where r and c are the numbers of rows and columns in the contingency
table. In a 3 × 3 table, for example, both (r − 1) and (c − 1) would be
(3 − 1) = 2, V would be equal to √(χ²/2n), and this is the same as φ/√2.
The advantage of Cramer's V is that the maximum value it can take is
always 1.
A third measure of association based on χ² is the contingency
coefficient, denoted by C, and calculated as

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}} \qquad (7.13)$$
C is easily computed once χ² has been calculated, but suffers from the limi-
tation that its upper limit depends on the number of categories in the con-
tingency table, although it is always less than 1. For example, if it is a 2 × 2
table the upper limit is 0.707, if it is a 3 × 3 table the maximum value is
0.816, and so on.
The final measure we will refer to is known as Goodman and Kruskal’s
tau, where tau is the name for the Greek letter τ. This measure of associa-
tion is calculated on a quite different basis, and involves a procedure
(which we need not consider here) for using values of the independent var-
iable to predict values of the dependent variable.
If there is no association between the variables, the explanatory variable
is no help in predicting the dependent variable and τ = 0. If there is perfect
association between the variables the explanatory variable would be a
perfect predictor and τ = 1. This version of the measure is denoted τb
because the A columns of the explanatory variable are used to predict the B
rows of the dependent variable.
In some circumstances, however, the researcher may simply be inter-
ested in the strength of association between the two variables without
regarding either one as the possible explanatory variable. It is then possible
to calculate an alternative version, denoted τa, in which the procedure is
reversed and the B rows are used to predict the A columns. Except for the
extreme cases of zero and perfect correlation, the numerical values of
τa and τb will not be the same.
$$V = \sqrt{\frac{168.59}{379(4 - 1)}} = \sqrt{\frac{168.59}{1137}} = \sqrt{0.1483} = 0.38$$
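In code the three χ²-based measures are one line each. This sketch (ours)
reproduces the value of V just calculated:

    from math import sqrt

    def association(chi2, n, r, c):
        phi = sqrt(chi2 / n)                        # (7.11)
        v = sqrt(chi2 / (n * min(r - 1, c - 1)))    # (7.12)
        cc = sqrt(chi2 / (n + chi2))                # (7.13)
        return phi, v, cc

    # Kussmaul's cross-tabulation: chi-square = 168.59, n = 379, a 4 x 4 table
    print(association(168.59, 379, 4, 4))           # V is about 0.385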
bb
Ann Kussmaul, A General View of the Rural Economy of England 1538–1840, Cambridge
University Press, 1990. This brief description of the classification does not do justice to the depth
and subtlety of the work.
[Table: parishes cross-tabulated by seasonal type (X, A, P, H) in successive
periods; the counts are not reproduced here.]
Source: Kussmaul, General View, p. 77. For the classification of parishes by seasonal type
see text.
By the final pair of periods, 1741–80 and 1781–1820, the strength of the
association over time had risen to 0.60. The full set of calculations thus
shows the constancy of seasonal typing rising steadily from the late eight-
eenth century, as parishes settled into increasingly specialized patterns.cc
Notes
1
A number of such measures of political power are discussed in Allan C. Bogue,
‘Some dimensions of power in the thirty-seventh Senate’, in William O. Aydelotte,
Allan C. Bogue and Robert W. Fogel (eds.), The Dimensions of Quantitative Research
in History, Oxford University Press, 1972. For example, one measure attributed more
power to Senator A than Senator B, if the number of times A voted with the majority
on a set of proposals when B was in the minority was greater than the number of
times B voted with the majority when A was opposed. All senators could then be
ranked on the basis of the total number of their positive power relationships in all
possible pairings.
2
The value given for Zcalc in table 7.2 incorporates a small correction to the formula given
in (7.3). The computer program (in this instance SPSS) has added 0.5 to the numera-
tor when Zcalc is negative, and –0.5 when it is positive. The scale of the correction is thus
equal to 0.5/σr, and since σr is determined solely by the size of the two samples (see
(7.2)) the correction diminishes as the sample sizes increase. In table 7.2 the sample
sizes are the same for each variant, and so the correction is constant at 0.230.
It is thus a correction that is effective when using the approximation of the
sampling distribution to the normal distribution with small samples. When Zcalc is
negative the correction makes it smaller; when it is positive the correction makes it
cc
Kussmaul, General View, pp. 76–8. For other examples of the use of Cramer’s V by Kussmaul, see
pp. 68 and 134–6.
larger. It thus moves the critical value further into the respective tails, making it more
difficult to reject the null hypothesis.
3
Given the lower of the two possible values for U, the corresponding minimum value
of W can be obtained as follows:
$$W = U + \frac{n_1(n_1 + 1)}{2}$$
where n1 is the size of the sample with which the calculation of U was started.
4
It is not actually necessary to repeat the procedure with the other sample as the start-
ing point since there is a simple relationship between U and U′, the sum obtained by
starting with the second sample: U′ = (n₁n₂) − U, where n₁ and n₂ are the numbers in
the respective samples. W′ is then easily derived from U′ by the formula for W and U
in n. 3.
5
We have specified a two-tailed test so the research hypothesis is simply H₁: μ₁ ≠ μ₂.
If, however, we had selected a one-tailed test with direction predicted it might have
been H₁: μ₁ < μ₂, in which case we could have moved to the 2.5 per cent level. The
table gives a critical value for this level of 23. Since the decision rule would be the
same (reject H₀ for any value of U equal to or less than this), it would be harder to
reject the null hypothesis. The other possibility is H₁: μ₁ > μ₂. To test this alterna-
tive it is necessary to move to the higher of the two possible values for U obtained by
starting with the other sample (for example, to U′ = 55 for series D). With this as the
research hypothesis the decision rule would be to reject the null hypothesis if U′
(rather than U) is equal to or less than the critical value.
6
As can easily be seen, the value of Z is the same (though the sign is different) whichever
sample one starts from. The fact that U and U′ give the same absolute result for Z
(though with different signs) can be seen from the following example. For variant B
and two samples each of 10 observations, μU = ½(n₁n₂) = ½(100) = 50. Since U = 45,
the numerator of Z, U − μU, is 45 − 50 = −5. If we take the higher value, U′ = 55, the
numerator is 55 − 50 = +5. The sign of Z is relevant only if direction is predicted, i.e. if
H₁ is that the mean of one of the samples is larger (smaller) than the other.
7
Because our examples are based on small samples SPSS is able to report two meas-
ures of significance. One is described as ‘exact significance’ and this is the only one
quoted in table 7.3. It is based on the exact sampling distribution of the test statistic
for small samples, and is thus consistent with the critical values given for these
samples in the published tables. The other is described as ‘asymptotic significance’
and is the approximation based on the normal distribution. For large samples this is
the only measure of significance available; it becomes progressively more accurate as
the sample size increases.
8
For an example of the use of the Mann–Whitney test by an historian see James H.
Stock, ‘Real estate mortgages, foreclosures, and Midwestern agrarian unrest,
1865–1920’, Journal of Economic History, 44, 1984, pp. 89–106. Stock used this non-
parametric test (see pp. 97–9) to explore the relationship of three qualitative levels of
agrarian protest (ranked across 12 Midwestern states) with the corresponding rank-
ings of various measures of the levels and volatility of farm debt in those states. The
null hypothesis of no relationship between debt and unrest was rejected at the 1 per
cent level.
See also Hans-Joachim Voth, Time and Work in England, 1750–1830, Oxford
University Press, 2001, pp. 88–94, 227–8. Voth assembled evidence of people being at
work on different days of the week from statements by witnesses who appeared before
the Old Bailey in London between 1749 and 1763. He calculated the Mann–Whitney
U to test whether Monday was statistically different from other days of the week
(because workers were observing the practice of St Monday); and also to test whether
work was markedly less frequent on the 46 days of the year that were the old Catholic
holy days. In both cases the null hypothesis of no difference was rejected.
9
The test reported in table 7.4 is a two-tailed test. If a one-tailed test is required, it is
necessary only to divide the reported prob-value for KS Z by 2. There is, however, an
alternative presentation of the one-tailed test statistic, known as χ², calculated as

$$\chi^2 = 4D^2\,\frac{n_1n_2}{n_1 + n_2}$$

This is equal to four times the square of KS Z. The χ² test statistic is compared to the
chi-squared distribution with 2 degrees of freedom; this distribution will be
described in §7.3. The corresponding critical value of D at the 5 per cent level is

$$1.36\sqrt{\frac{n_1 + n_2}{n_1n_2}}$$
Note:
The body–mass index is calculated as (703 × weight in lb)/(height in inches)².
Source: British Parliamentary Papers, Report on Partial Exemption from School Attendance
(Cd. 4887, 1909), p. 281.
The data were used to support Mr Morley’s argument that ‘the effect of
half-time [labour] is nearly universally bad’.
Apply the Wald–Wolfowitz runs test to determine whether the statistics
bear out this interpretation.
(i) Use the body–mass index for March 1908 to determine whether
these boys were drawn from the same population.
(ii) Then use the change in the body–mass index between March and
October to test whether there are any systematic differences in the
change in the body–mass index between the two groups after the
beginning of half-time work.
3. The sequence of US presidential administrations (R indicates
Republican, D indicates Democratic) since the Civil War has been as
follows:
1865–1932: R R R R R D R D R R R R D D R R R
1932–2000: D D D D D R R D D R R D R R R D D
(i) By hand, use the runs test for randomness to determine whether
there is evidence that the incumbent party has an advantage in suc-
cessive elections over the entire period, 1865–2000.
(ii) Are your results any different if you subdivide the period into two
parts: 1865–1932, and 1932–2000?
4. The density of retailers in different parts of Manchester in 1871 was as
follows (higher scores indicate a greater density; the average for the whole
of Manchester being 1):
Bakers: 2.04, 1.30, 1.50, 2.10, 1.68, 0.87, 1.80, 1.12, 1.22
Butchers: 1.40, 0.82, 1.01, 1.27, 0.95, 0.86, 1.11, 0.58, 0.95
Source: Roger Scola, Feeding the Victorian City: The Food Supply of Victorian
Manchester, 1770–1870, Manchester University Press, 1992, p. 299.
(i) Do these figures support the hypothesis that there was no difference
in the distribution of butchers and bakers in these regions? Choose
the appropriate non-parametric test and make all necessary calcula-
tions by hand.
(ii) Test the hypothesis that butchers were more concentrated in some
parts of Manchester than others.
5. Evaluate the hypothesis that average wealth holdings in the neighbour-
ing counties of Berkshire and Buckinghamshire are drawn from the same
population. Apply the parametric difference of means test as well as the
non-parametric Mann–Whitney and Kolmogorov–Smirnov tests.
What differences do you detect between these test results? Which is to be
preferred and why? Consider both historical and statistical reasons in your
answer.
6. A survey of households in Liverpool in 1927–9 recorded information on
the first occupation of children after leaving secondary school and on the
occupation of their fathers.
Child’s occupation
Occupations Professional Clerical and Manual All
of father and business commercial labour occupations
Source: D. Caradog Jones (ed.), The Social Survey of Merseyside, 3, Liverpool University Press,
1934, pp. 178–180.
(i) Do the data support the report’s conclusion that, ‘the occupational
grade of the parent has considerable weight in the determination of
the occupational grade of the child’?
(ii) Do the data support the inference that the children of ‘higher social
class’ (as measured by a higher occupational grade) are more likely to
enter higher occupations?
(iii) Would your responses to (i) and (ii) differ if you were asked to evalu-
ate whether children of each occupational class were more likely to
be selected from the same occupational class as their father?
(iv) Would your responses to (i) and (ii) differ if the survey were based
on a sample of 171 (i.e. one-tenth as large)?
Write a few sentences explaining what these figures indicate about the
extent and pattern of social mobility in inter-war Britain.
7. By hand, construct a contingency table of occupational structure by
region, using the Steckel data set. (Hint: the table will be 5 × 4.) Express the
cross-tabulations as both counts and relative frequencies.
(i) By hand, calculate the χ² test statistic. Compare to the critical point
at an appropriate level of significance, being careful to specify cor-
rectly the number of degrees of freedom.
(ii) Does the evidence support the hypothesis that the occupational
structure was identical in all US regions in 1850?
(iii) How would you test the hypothesis that occupational structure was
identical in all regions outside the North-East? What do you find?
8. By hand, use the one-sample Kolmogorov–Smirnov test on the data for
Cambridge parishes in the Boyer relief data set to determine whether there
are statistically significant differences between their populations. (Hint:
order your data from lowest to highest.) Report your findings.
Why does it matter that you order the data properly before undertaking
the calculation?
9. Using the data constructed for question 7, calculate the four measures of
strength of association discussed in §7.6.1. Report your findings. Do the
four tests reach the same conclusion about the strength of association
between region and occupational structure? How do you explain any
differences?
10. Can you think of a historical case in which data are available for related
samples? (Hint: look at the exercises for this chapter.)
Multiple linear regression
Multiple relationships
In this chapter we once again take up the subject of regression, first intro-
duced in chapter 4, and this will now be our central theme for the remain-
der of this book. In chapter 4 we dealt only with simple regression, with one
dependent and one explanatory variable. In the present chapter we will
extend the model to see what happens when there is more than one explan-
atory variable. We introduce this idea in §8.1, and explain various aspects
of the concept of multiple regression in §8.2. The related concepts of
partial and multiple correlation are covered in §8.3.
In chapter 9 we will examine some of the underlying ideas in more
depth, and will also deal with some of the issues arising from the fact that
the data underlying our regressions are typically drawn from samples and
so are subject to sampling error. Two further extensions of the basic linear
regression model, the use of dummy variables and of lagged values, are
then introduced in chapter 10.
it might now be

$$Y = a + b_1X_1 + b_2X_2 - b_3X_3 \qquad (8.2)$$
Thus in an analysis of the level of per capita relief payments by English
parishes in 1831, Y would be RELIEF and X1 might still be UNEMP, as in
§4.2.3, but additional explanatory variables could also be incorporated in
the investigation to assess their effect on RELIEF. For example, X2 might be
FARMERS, the proportion of labour-hiring farmers in the total number of
parish taxpayers; and X3 might be LONDON, a measure of the distance of
the parish from London, used as a proxy for the cost of migration. The
signs before the explanatory variables indicate that Y is expected to rise as
X1 and X2 rise, but to fall as X3 rises.
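By way of illustration, such an equation might be estimated as follows (a
sketch using Python's statsmodels; the file name, and the assumption that
the variables are stored under these column names, are ours):

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("boyer_relief.csv")  # hypothetical file of the parish data
    X = sm.add_constant(df[["UNEMP", "FARMERS", "LONDON"]])
    model = sm.OLS(df["RELIEF"], X).fit()
    print(model.summary())                # coefficients, t-statistics, R-squared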
The fundamental underlying theoretical principles and statistical pro-
cedures required for the estimation of multiple regression equations, and
for the evaluation of the coefficients, are essentially the same as those
already discussed in earlier chapters in relation to simple regression. The
formulae and calculations naturally become more laborious and compli-
cated with the inclusion of additional explanatory variables, and the com-
putations can be left to the statistical computer programs.
There are, however, a few points relating to the interpretation of the
results of multiple regression and correlation that it will be helpful to
discuss.
The notation and terminology used for simple regression and correla-
tion is adapted as shown in the right-hand column of table 8.1. As will be
seen, the introduction of additional explanatory variables produces four
new terms. In the following sections we will look at these terms and con-
sider some of the more important changes in interpretation involved when
dealing with multivariate relationships.
Table 8.1 Notation for simple and multiple correlation and regression
rYX₁·X₂    Partial correlation coefficients (see §8.3.1–8.3.3)
RY·X₁X₂    Multiple correlation coefficient (see §8.3.4)
R²         Coefficient of multiple determination: the proportion of the
           variation in Y explained by the several explanatory variables
           in a multiple regression (see §8.2.1)
The value of R2 always lies between 0 and 1, and the higher it is the more
the variation in Y that has been explained (sometimes referred to as an
indication of ‘the goodness of fit’). However, the total variation in Y (which
is the denominator in the calculation of R2) is unaffected by the number of
explanatory variables, whereas each additional explanatory variable will
add a term to the numerator. As a consequence of this, the addition of
another explanatory variable will almost always raise the value of R2, and
will certainly never reduce it. A small correction to R2 to take account of the
number of explanatory variables used is described in §8.2.8.
[Figure 8.1 Scatter diagrams of WELFARE (vertical axis) against HOMELESS
(horizontal axis); panel (b) identifies the effect of TEMP on the relationship
between WELFARE and HOMELESS.]
throughout the year. When she re-ran her initial regression with FOOD-
COST as a second explanatory variable, the results were:
WELFARE = −6,182 + 846 HOMELESS + 77.5 FOODCOST (8.5)
(3.0) (5.9) (3.3)
With the addition of this variable the R2 has increased from 0.63 to 0.83.
The impact of adding the cost of food to the regression is revealing. The
overall explanatory power of the equation has, once again, risen consider-
ably. Clearly, the added variable is helping to explain more of the variation
in the level of WELFARE over the year. Moreover, the high t-statistic on
FOODCOST indicates that it is a statistically significant explanatory vari-
able (the prob-value is 0.9 per cent).
But what is really interesting is that the coefficient on HOMELESS has
barely changed at all from the simple regression model of (8.3). Moreover,
its t-statistic, far from collapsing with the addition of a new statistically
significant coefficient, has actually risen (from 4.2 to 5.9, or from a prob-
value of 0.2 per cent to one below 0.05 per cent). The addition of FOOD-
COST has sharpened the relationship between welfare payments and the
level of homelessness.
Why has this happened? Why is the impact of this alternative second
variable so very different from the addition of TEMP? The answer is to be
found in the relationships among the explanatory variables. In the case of
HOMELESS and TEMP, the reason why the addition of TEMP to the equa-
tion has such a startling effect on HOMELESS is that the two variables were
not only strongly related to WELFARE, but were also strongly related to
each other.
The correlation between HOMELESS and TEMP is 0.73, indicating a
strong degree of interdependence between them.d So when TEMP was
added to the equation, it essentially replicated much of the information
previously provided by HOMELESS. Once it is also recognized that TEMP
was more strongly correlated with WELFARE than was HOMELESS, it
becomes clear that TEMP has more power to explain the behaviour of the
dependent variable. In effect, the addition of TEMP swamps HOMELESS
as an explanatory variable.
In contrast, the movements of HOMELESS and FOODCOST were
almost completely independent of each other.e The correlation of the two
variables is only 0.0026. The addition of the new variable thus provided new
d
When two explanatory variables are highly correlated with each other, they are sometimes
referred to as being highly collinear.
e
When two explanatory variables exhibit independence, as measured by a very low correlation,
they are sometimes referred to as orthogonal to each other.
and independent information that the regression model could use to explain
WELFARE. This information did not replicate any of the information con-
tained in the movement of HOMELESS; its addition therefore did not alter
the relationship between HOMELESS and WELFARE contained in (8.3).
Two morals emerge from this simple example. First, the effect of con-
trolling for additional explanatory variables is a matter of empirical analy-
sis and cannot be determined a priori. Second, it is important for the
historian to understand the underlying structure of the model being tested.
The regression equation produces parameter values net of interdepen-
dences among the variables; but in many cases, it is these interdependen-
cies that are important for a full understanding of the historical process
being analysed.
A good start to a better understanding of the connections among the
explanatory variables is to ask the computer to construct a correlation
matrix, which sets out the simple correlation coefficients for each pair of
variables (including the dependent variable) in tabular form (see table 8.3
in §8.3.1 for an example). A full discussion of how these simple correla-
tions may be used to measure the impact of controlling for an additional
explanatory variable is presented in the analysis of partial and multiple
correlation in §8.3.
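Continuing the hypothetical data frame from the earlier sketch (and
assuming GRAIN is also among its columns), the matrix is a single call;
pandas computes the zero-order correlation for every pair:

    cols = ["RELIEF", "UNEMP", "FARMERS", "GRAIN", "LONDON"]
    print(df[cols].corr().round(2))   # 5 x 5 matrix of simple correlations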
in this case is equal to the sum of the green, orange, and brown areas. The
yellow area shows the unexplained sum of squares, that part of Y that is not
explained by the behaviour of X1 or X2. The R2 is thus equal to the ratio of
the green, orange, and brown areas to the total area of the yellow circle.
Although all three areas determine the value of R2, only that part of the
overlap that is unique to each variable is used to determine the partial regres-
sion coefficient. Thus, only the information contained in the green area is
used to determine the value of b1, the coefficient on X1. Similarly, only the
orange area is used to determine the value of b2, the coefficient on X2.
The regression analysis ignores the information contained in the brown
area, since it is impossible to allocate it between X1 and X2. This is the
reason why the regression coefficient on X1 changes with the addition of a
second explanatory variable. It is not so much that some part of the original
explanatory power of X1 is displaced onto X2 as that the addition of X2
alters the information on which the estimate of the relationship between X1
and Y is based.
One further effect of the inclusion of a second variable will be that the
relationship between Y and X1 is now estimated with less information (the
size of the overlap is smaller). This means that the accuracy with which we
can measure the value of the regression coefficient declines – the standard
error of b1 increases, and its t-statistic drops.
The Ballantine clearly indicates what is happening in the case of the
Chicago welfare mission. In the original regression formulation in §8.2.3 it
was assumed that only the proportion of homelessness among the popula-
tion influenced outlays on welfare. In terms of the Ballantine, the green and
brown areas in combination were used to determine b1 and R2.
When TEMP was added to the regression, more information (in the
form of the orange area) was added to the overall model of welfare outlays.
A higher proportion of the yellow circle is thus overlapped, and R2 rises. At
the same time, less information was used to estimate the value of b1, which
now depends on the green area alone, producing both a different
coefficient and a different standard error. With the addition of TEMP, not
only does the coefficient on HOMELESS change (indicating a high degree
of correlation between the temperature on the street and the number of
people sleeping rough), but its statistical significance declines.g
The Ballantine further indicates that the impact on b1 of introducing an
additional explanatory variable will be greater, the higher the degree of
correlation between X1 and X2. In figure 8.3(a), on the pullout sheet, p. 527,
the two variables are independent (or orthogonal to each other). There is
no overlap between them. Thus, the addition of X2 to a regression of Y on
g
The simple correlation of TEMP and HOMELESS is 0.73.
stant; the angle of the slope in the other direction (from top to bottom)
measures the effect of X2 while X1 is held constant.
It is in this sense that multiple regression holds one explanatory variable
constant, while measuring the effect of the other. It was reflected in figure 8.1
(b) by the dotted parallel lines (each with the same slope) showing the effect
of homelessness while controlling for the influence of the temperature.
If there are more than two explanatory variables, all but one have to be
held constant in the calculation of the partial regression coefficient for the
remaining variable. This makes the calculations quite complex, and we can
be grateful that computers do them for us.
R̄² is equal to R² corrected for the number of explanatory variables plus
the intercept:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k} \qquad (8.7)$$
*
This explanation is based on Peter E. Kennedy, A Guide to Econometrics, 3rd edn., Blackwell,
1992, pp. 66–7.
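As a one-line sketch (ours), with k counting the explanatory variables plus
the intercept as in (8.7):

    def adjusted_r2(r2, n, k):
        # R-bar-squared, equation (8.7)
        return 1 - (1 - r2) * (n - 1) / (n - k)

For example, adjusted_r2(0.83, 12, 3) gives about 0.79 for a regression with
two explanatory variables and an intercept fitted to twelve observations
(the numbers here are purely illustrative).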
The number of control variables on the right of the point determines what is
called the order of the partial correlation coefficient. Thus simple correlation,
which ignores any other variables, is called a zero-order correlation. With two
control variables, as in the example in the previous paragraph, it is a second-
order correlation (or partial), with three it is a third-order, and so on.
Consider, for example, the inter-relationship between five variables
from the Poor Law data set for a sample of 311 parishes: RELIEF, UNEMP,
FARMERS, GRAIN, and LONDON. As a first step in the analysis of the
relationships between these five variables a statistical computer program
might be used to calculate the zero-order correlation between all possible
pairs; in this case there would be 10 pairs. The results are typically pre-
sented in a matrix of five rows and five columns, as in the upper panel of
table 8.3, together with the two-tailed significance levels of the coefficients.
With each of these simple (bivariate) correlations no account is taken of
any possible effect of the three remaining variables. In order to do this it is
necessary to calculate first-, second-, and third-order partials, taking
various combinations of two variables while controlling for one, two, or
three of the others.
The lower panel of table 8.3 shows two of the third-order partials. In the
first, RELIEF is correlated with UNEMP while controlling for FARMERS,
GRAIN, and LONDON. In the second, RELIEF is correlated with
LONDON, while controlling for the three remaining variables.
These tables can then be scrutinized to see how the coefficients change
with the introduction of the various control variables. For example, the
simple (bivariate) correlation coefficient for RELIEF and LONDON was
–0.35; when the influence of the three other variables was controlled for,
the third-order coefficient dropped to –0.26.
This partial correlation procedure can thus help one to get a better
understanding of the possible causal links within a set of related variables.
It not only measures the strength of the various combinations of relation-
ships, but can also be a very effective way of detecting hidden relationships.
Whereas the OLS regression coefficients indicate the net effect of the rela-
tionships among the explanatory variables, partial correlation coefficients
enable the researcher to diagnose more completely the process by which
the regression coefficients are generated.
They can also help to uncover spurious relationships, by revealing when
an apparent relationship between X1 and Y is actually the result of two sep-
arate causal relationships between these two variables and a third variable,
X2. The fact that the actual causal relationship is not between X1 and Y
would emerge through the calculation of a first-order partial correlation
coefficient in which X1 was correlated with Y while controlling for X2.
Table 8.3 Correlation coefficients and partial correlation coefficients for five
variables (data relate to 311 parishes in England and Wales)
[Upper panel: zero-order correlation coefficients (r), with two-tailed
significance levels (Sig), for all pairs of RELIEF, UNEMP, FARMERS, GRAIN,
and LONDON. Lower panel: third-order partial correlations of RELIEF with
UNEMP and of RELIEF with LONDON, each controlling for the three remaining
variables. The values themselves are not reproduced here.]
Table 8.4 Zero-order and partial correlations between church attendance, population
density, and church density in England in 1851a
Zero-order correlations
1 ATTEND and POPDENS 0.353** 0.536**
2 ATTEND and CHURCHES 0.572** 0.418**
Notes:
a
Note that log values are used for both POPDENS and CHURCHES in the urban analysis.
** indicates p < 0.001.
Source: Crockett, ‘Variations in churchgoing rates’.
i
Alasdair Crockett, ‘Variations in churchgoing rates in England in 1851: supply-side deficiency or
demand-led decline?’, in Alasdair Crockett and Richard O’Leary (eds), Religion in Modernity:
Patterns and Processes in Europe and America, Cambridge University Press, forthcoming.
The first of these fitted regression lines explains as much of the variation
in POPDENS as can be attributed to CHURCHES. The remaining, unex-
plained, variation is reflected in the residuals. ATTEND will be one factor
contributing to these residuals.
The second regression line explains as much of the variation in
ATTEND as can be attributed to CHURCHES. The remaining, unex-
plained, variation is again reflected in the residuals. POPDENS will be one
factor contributing to these residuals.
With these two regressions CHURCHES has been allowed to explain as
much of the variation in the two other variables as it can. The final step is
then to correlate the residuals from the two regressions. This gives the
desired measure of the relationship between POPDENS and ATTEND that
is independent of any influence from CHURCHES.
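The two-regression procedure just described is short to code. This sketch
(ours, with the three variables assumed to be numeric arrays) correlates
the two sets of residuals:

    import numpy as np

    def residuals(y, z):
        # residuals from the simple regression of y on z
        slope, intercept = np.polyfit(z, y, 1)
        return y - (intercept + slope * z)

    def partial_corr(a, b, control):
        # first-order partial correlation of a and b, controlling for control
        return np.corrcoef(residuals(a, control), residuals(b, control))[0, 1]

    # e.g. partial_corr(popdens, attend, churches) for the 1851 census data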
R is a measure of the
closeness of the linear relationship
between all the variables
taken together.
The letter R is used for this measure, with subscripts to indicate which
variables have been included in the calculation. Thus RY·X₁X₂X₃ would
indicate the correlation of Y with three other variables.
j
We have simplified the presentation by comparison with table 8.3 by omitting both the own cor-
relations (e.g. WELFARE with WELFARE) and the statistical significance.
Numerically, for any given set of variables, R is simply the square root of
the corresponding coefficient of multiple determination, R2 (see §8.2.1).
By convention R is always presented as positive, but the sign of the overall
correlation coefficient has no meaning, since some of the explanatory vari-
ables may be positively correlated and others negatively. R can vary
between 0 and 1. A value of 0 indicates no linear relationship, but – as with
the two-variable case – it is always possible that there is a strong non-linear
relationship.
Notes
1 The name Ballantine was originally given to the shape of interlocking circles because
of the similarity between figure 8.2 and the advertising logo of P. Ballantine & Co., a
producer of beer and ales in Newark, New Jersey in the nineteenth century. The
Ballantine logo consisted of three interlocking rings, each symbolizing the essential
characteristics of Ballantine ale. Before Prohibition, these characteristics were
identified as Purity, Strength, and Flavor; after the repeal of the 18th Amendment in
1933, the ring of Strength was renamed the ring of Body.
2 For examples of the use of beta coefficients to compare the relative magnitudes of the
effects on the dependent variable of different explanatory variables see William A.
Sundstrom and Paul A. David, ‘Old-age security motives, labor markets and farm
family fertility in antebellum America’, Explorations in Economic History, 25, 1988,
pp. 190–2; and Michael Edelstein, Overseas Investment in the Age of High
Imperialism, Columbia University Press, 1982, pp. 102–9.
2. You wish to build a model of family size in Irish counties in the early
twentieth century. Use the Hatton and Williamson cross-section data set to
test the hypotheses that family size is directly related to religion (CATHOLIC),
age (AGE), and literacy (ILLITRTE).
This chapter builds on our previous work and takes a succession of giant
strides towards realistic quantitative analysis of relationships between two
or more variables. After completing this material it will be possible to
examine a large range of potential historical problems, and to read criti-
cally many of the studies in history and the social sciences that make use of
regression and related quantitative techniques.
We will begin in §9.1 with a brief discussion of the concept of a ‘model’
and of the associated methodology of quantitative research into the rela-
tionships between two or more variables. The basic technique of linear
regression (discussed in chapters 4 and 8) is then extended in §9.2 to
examine the reasons why the observed values of the dependent variable
deviate from the regression line, and to consider the implications of these
deviations. In §9.3 a new test statistic, the F-test, is introduced and used to
test the significance of the multiple regression as a whole. Finally, §9.4 is
devoted to a further useful summary statistic, the standard error of the
estimate.
For the demographer it may be the trends and fluctuations in fertility and
mortality or in international migration.
In each of these and many similar problems a quantitative analysis
begins with the specification of a model.a This is the researcher’s formula-
tion of the relationship between the phenomenon that she wishes to
analyse, and other factors that she thinks will help to explain that phenom-
enon. In our earlier terminology she has one dependent variable, and
attempts to specify which are the relevant explanatory variables, and how
they are related to the dependent variable.
In most historical writing this is, of course, done implicitly. The non-
quantitative historian does not say in as many words: ‘here is the model I
have formulated to explain why the number of criminal offences first
increased and then declined’, but her discussion of the issue is, in effect, an
attempt to do this.1
The distinctive feature of the quantitative approach is first, that the rela-
tionships are made explicit; secondly, that they can be measured and tested.
Whether or not this is a better approach depends largely on how well the
relationships can be specified by the researcher, and how accurately the
major relevant variables can be measured. Even if the results are statisti-
cally poor, the exercise will probably help to clarify the crucial issues.
The specification of the model will thus determine:
a This is also referred to as the formulation of a maintained hypothesis.
It is now necessary to consider more carefully the reasons for these deviations
from the regression line and to examine their implications. The reasons why
the observed values of Y do not all lie neatly on the regression line can be
broadly divided into two groups: errors of measurement and specification,
discussed in §9.2.1, and stochastic factors, considered in §9.2.2. The first
group covers the following possibilities.
(a) The dependent variable, Y, may be measured incorrectly. This might, for
example, be the result of inaccuracies in the collection or processing of
the data, or of ambiguities about the precise definition of Y.
(b) Relevant explanatory variables may be omitted from the model. This
could happen because it did not occur to the researcher to include them;
or because the necessary data were not available; or because they consist
of aspects such as tastes or attitudes that are not readily measurable.
(c) There may be other forms of mis-specification of the model, including:
(i) Assuming a linear relationship between the variables when it is
actually non-linear.c
(ii) Including irrelevant explanatory variables, because the researcher
wrongly thought they were important.
(iii) Assuming that a time-series relationship is constant over time (so
that there is no change in the intercept or regression coefficients)
when it is actually changing.
b The technical aspects of ‘model specification’, i.e. of the formulation of a relationship between a
dependent variable and one or more independent variables, are examined in §11.2 and there is a
discussion of the more general issues in §12.5 and §12.6.
c The specification of non-linear models will be discussed in chapter 12.
Y = a + bX  (9.1)
In order to recognize both the existence of possible errors of measurement
and specification, and the potential importance of the stochastic factors,
statisticians prefer to add to this an additional ‘error term’ designated by e.d
d Some authors refer to the error term as a disturbance or a stochastic term, and it is also
designated in some texts by u or by the Greek letter ε (epsilon).
Y = a + bX + e  (9.2)
This adds an additional (random) variable that covers all the possible
reasons for the deviation of Yi from the value Ŷi indicated by the regression
line.
Since the value of this random variable, ei, cannot be actually observed
in the same way as the other explanatory variables, certain assumptions
have to be made about how it might behave. These assumptions are of crit-
ical importance to the evaluation of the model.
Y = α + βX  (9.3)

We do not know the values of α and β for the population; all we have are
the estimates of a and b in the regression line derived from the sample. It is
thus necessary to establish how reliable a is as an estimate of α, and how
reliable b is as an estimate of β. In order to do this we must revert to the
issue of the error term.
e This would be analogous to the exercise specified in question 7 of chapter 5, except that the
information assembled from the successive samples would be the values of the two variables
which are assumed to be related, not the sample means of a single variable.
Figure 9.1 The population of Y for different values of X: (a) Hypothetical
populations; (b) The form of the population of Y assumed in simple linear
regression. [Each panel shows the frequency distribution of Y, with means
µ1, µ2, and µ3, at the values X1, X2, and X3.]
1. The mean value of the random error, e, is zero (0) for all values of
X.
2. The variance of the random error, e, is the same for all values of X, and
is equal to the population variance, σ².
3. Values of e for different values of X are statistically independent so that
if, say, Y1 is large, there is no reason to think that this will have any
effect on Y2, making it more likely to be either large or small.
4. The mean values of Y all lie on a straight line. This is the population
regression line.
* The reason for this was explained in §6.2.1.
Figure 9.2 The estimated regression line, Ŷ = a + bX, and the true
regression line, Y = α + βX. [The sample values Y1, Y2, and Y3 are
observed at X1, X2, and X3.]
If we had the information for all possible values of the relationship between
X and Y we could fit the true population regression line, Y = α + βX.
Since we have only the sample information the best that can be done is to
estimate the sample regression line, Ŷ = a + bX, assuming that the distribution
of the error term is as specified in panel 9.1.
Since the true population regression line is not actually known, the error
terms representing the deviation of the actual values of Y from the true
regression line cannot be measured. We can, however, observe the devia-
tions of the estimated regression line from the sample values of Y.
The relationship between this estimated regression line and the true
(unknown) regression line that would be given by the population data is
shown in figure 9.2. The values of Y observed in the sample are Y1, Y2, and
Y3, and the estimated regression line passes through the values Ŷ1, Ŷ2, and
Ŷ3. Because of the random error terms, e1, e2, and e3, the values of Y pre-
dicted by the true regression line differ from the observed values of Y. These
error terms reflect the combined effect of the errors of measurement and
specification discussed in §9.2.1, and the stochastic factors considered in
§9.2.2.
f What happens if any of these assumptions are violated is discussed in chapter 11.
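The distinction between the true and the estimated regression line can be illustrated with a small simulation. In this sketch (Python; the values of α and β are arbitrary assumptions, not taken from any of the data sets) repeated samples are drawn from the same population, and each sample yields a slightly different estimate of a and b.

    import numpy as np

    alpha, beta = 2.0, 0.5                  # assumed 'true' population parameters
    rng = np.random.default_rng(1)

    for sample in range(3):
        x = rng.uniform(0, 10, size=20)
        y = alpha + beta * x + rng.normal(0, 1, size=20)   # population line plus error
        b, a = np.polyfit(x, y, 1)          # estimated slope and intercept
        print(f"sample {sample + 1}: a = {a:.2f}, b = {b:.2f}")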
F-TESTS

The F-test is the ratio of two sample variances, s1² and s2², where s1² ≥ s2²:

F = s1²/s2²  (9.4)
g Another use of the F-test is in hypothesis testing of statistical differences between sample
variances; see panel 6.1.
then

F = (V1/k1)/(V2/k2)  (9.7)
The shape of the F-distribution depends solely on these two parame-
ters: k1 in the numerator and k2 in the denominator of the ratio. The dis-
tribution can take any value between 0 and infinity, and for small values
of k1 and k2 it is skewed, with a long tail to the right. As the two parame-
ters become large, however, the F-distribution approximates to the
normal distribution. It is illustrated for selected values of k1 and k2 in
figure 9.3.
Published tables for the F-distribution are given in many statistical
texts, though they are usually heavily abridged since it is necessary to cover a
range of combinations of the two degrees of freedom (k1 and k2 ) for each
proportion of the distribution (for example, the upper 5 per cent).**
* The other two are the Z- and t-distributions encountered in §5.4.
** D. V. Lindley and W. F. Scott, New Cambridge Statistical Tables, 2nd edn., Cambridge
University Press, 1995, pp. 50–5, give the upper percentage points (right-hand tail) of the F-
distribution for proportions from 10 per cent to 0.1 per cent for various combinations of df.
Fcalc can be compared in the usual way with the theoretical probability of
getting this particular result as determined by the theoretical F-distribution
(see panel 9.2) at the chosen level of significance, where the required
degrees of freedom are k − 1 and n − k. If Fcalc is greater than the theoretical
(tabulated) value of F the null hypothesis is rejected. In other words, we
reject the proposition that all the βs taken together are zero and have no
influence on the dependent variable.
Figure 9.3 The F-distribution for selected combinations of degrees of
freedom: F(4,2), F(8,4), and F(20,30). [Probability density plotted
against F.]
It can be shown by simple manipulation of these terms that the relationship
between Fcalc and the equivalent coefficient of multiple determination, R²,
is as follows:

Fcalc = [R²/(k − 1)]/[(1 − R²)/(n − k)]  (9.9)
If the regression model explains very little of the variations in the depen-
dent variable, R2 will be very small and therefore, as the above formula
shows, the value of Fcalc will be very low. Conversely, the stronger the rela-
tionship specified in the model, the higher the value of Fcalc. Thus a high
value of Fcalc is generally also an indicator of strong overall relationships
between the dependent variable and the set of explanatory variables as a
whole.
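The relationship in (9.9) is easy to compute directly. In the following sketch (Python, with scipy; the values of R², k, and n are illustrative assumptions, not taken from any of the data sets) Fcalc is derived from R² and its significance obtained from the theoretical F-distribution.

    from scipy import stats

    R2, k, n = 0.45, 3, 60                       # illustrative values only
    F_calc = (R2 / (k - 1)) / ((1 - R2) / (n - k))
    p_value = stats.f.sf(F_calc, k - 1, n - k)   # upper-tail probability
    print(F_calc, p_value)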
SEE = √(Σ(Yi − Ŷi)²/(n − 2))  (9.10)

It is used to calculate a confidence interval around the regression line.
The coefficients from this regression could be used to calculate the mean
value of UNEMP that would be predicted for any given value of BWRATIO
(say, X0).i However, the regression line in (9.11) is estimated on the basis of
only one set of observations and thus represents only one sample from the
population of all possible relationships between the two variables. Other
samples would give different values for the constant, a, and the regression
coefficient, b1, and thus different answers to the calculation.
It is thus necessary to have a confidence interval for this prediction of
the mean value of UNEMP when BWRATIO = X0, and to specify the
desired level for this, say 95 per cent.
i Note that X0 is not one of the specific values taken by the explanatory variable. It is any value
chosen by the researcher, and may well lie outside the actual range of values for BWRATIO used
in the estimation of the regression line if the aim is to predict what UNEMP might be at some
other level.
(a + b1X0) ± t0.025 · SEE · √(1/n + (X0 − X̄)²/Σ(Xi − X̄)²)  (9.12)
Note two features of (9.12). First, if the mean value of UNEMP is calculated
when the value of X0 chosen for BWRATIO is equal to the mean value
of that series, then X0 = X̄. The numerator in the second term under the
square root, (X0 − X̄)², thus becomes 0, and the confidence interval
simplifies to ± t0.025 · SEE · √(1/n).

Second, when X0 is not equal to the sample mean value of X, then the
further it is from X̄ in either direction, the larger that second term under the
square root becomes, and so the greater the width of the confidence interval
around the predicted mean value of UNEMP. This widening of the interval
as the value of the explanatory variable moves away from its mean value is
illustrated in figure 9.4 (where a broken vertical line is drawn through X̄).
Thus while an estimated regression line can be used to extrapolate to
periods (or cases) outside the sample, it is essential to be cautious in
attempting to predict the mean value of Y for a value of X that is any dis-
tance away from the sample mean of X.
SEE = √(305.17/17) = 4.24
The next step is to obtain the values required for (9.12). This has three
components in addition to SEE. First, the regression line specified in (9.11)
must be estimated to obtain the values for the constant, a, and the regres-
sion coefficient, b1. These are found to be 6.40 and 15.64, respectively.
Secondly, the table for the theoretical t-distribution (see table 5.1)
shows that for 17 degrees of freedom, the two-tailed value of t at the 5 per
cent level is 2.110. Thirdly, the values in the final term of (9.12) must be
derived. Let us take X0 at a value of 0.10, a little below the lowest value of
BWRATIO observed in the sample. X̄, the mean value of BWRATIO in the
sample, is 0.467, and the sum of the squared deviations of X from its mean,
Σ(Xi − X̄)², is 0.222. Since n is 19, the required term is
√(1/n + (X0 − X̄)²/Σ(Xi − X̄)²) = √(1/19 + (0.10 − 0.467)²/0.222)
= √(0.053 + 0.135/0.222) = √0.660 = 0.813
The full calculation of the 95 per cent confidence interval for X0 = 0.10
on the basis of (9.12) is thus

Y0 = a + b1X0 ± t0.025 · SEE · √(1/n + (X0 − X̄)²/Σ(Xi − X̄)²)
   = 6.40 + (15.64 × 0.10) ± (2.110 × 4.24 × 0.813)
   = 7.96 ± 7.27
   = 0.69 to 15.23
At this distance from the mean level of BWRATIO the width of the
confidence band around a forecast of the mean value of the unemployment
is thus rather large.
The same procedure was followed for a range of alternative values of X0
to derive the full set of confidence intervals drawn in figure 9.4.
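The arithmetic of the worked example can be restated as a short calculation; all the values here (a, b1, SEE, t, n, X̄, and Σ(Xi − X̄)²) are those given in the text.

    import math

    a, b1 = 6.40, 15.64            # estimated intercept and slope of (9.11)
    see, t = 4.24, 2.110           # SEE and the 5 per cent two-tailed t for 17 df
    n, xbar, ssx = 19, 0.467, 0.222
    x0 = 0.10

    half_width = t * see * math.sqrt(1 / n + (x0 - xbar) ** 2 / ssx)
    y0 = a + b1 * x0
    print(y0 - half_width, y0 + half_width)   # approximately 0.69 to 15.23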
(a + b1X0) ± t0.025 · SEE · √(1 + 1/n + (X0 − X̄)²/Σ(Xi − X̄)²)  (9.13)
Note

1 For a provocative statement of a leading quantitative economic historian’s views on
‘covert theory’, see Robert W. Fogel, ‘The specification problem in economic
history’, Journal of Economic History, 27, 1967, pp. 283–308.
(i) A series on incomes based on tax data, given tax evasion by the rich
(ii) A series on poverty based on reported income, given welfare fraud
by the poor
(iii) A series compiled by a census taker who leaves out every other house
(iv) A series compiled by census subjects who round their ages to the
nearest fifth year
(v) A series compiled by a government official who inverts every
number, e.g. writing 75 not 57
(vi) A series compiled by a government official who inverts every other
number
k The first term diminishes because as n gets larger, 1/n gets smaller. The second term diminishes
because the bigger the sample, the more cases of X there are to include in the sum of the
deviations of Xi from X̄ in the denominator. Σ(Xi − X̄)² thus gets larger and larger relative to
(X0 − X̄)².
each observation. Re-run the basic regression above and report the
results.
(a) What do you notice about the statistical significance of the
intercept?
(b) What do you notice about the relationship between the F- and t-
statistics?
(c) What do you notice about the overall statistical significance of
the equation? What does this tell you about the relationship
between the F-test and the intercept?
(iii) Now run a variant of the basic regression, separating out the component
parts of BWRATIO in the form

UNEMP = a + b1BENEFIT + b2WAGE + e
(a) Report the F-statistic and its significance.
(b) Use the formula (9.9) to explain why the value of the F-statistic
has fallen so much, despite the small increase in R2.
8. How would you use the F-test to determine the statistical significance of
the R2 statistic? (Hint: use (9.9).)
9. Calculate by hand the standard error of the estimate (SEE) for the regres-
sion coefficient of disturbances on wheat prices in chapter 4, question 3.
(i) What is the 95 per cent confidence interval around the predicted
number of disturbances at a wheat price of 100 shillings per qtr?
(ii) Will this be larger or smaller than the 95 per cent confidence interval
around the predicted number of disturbances at 85 shillings per qtr?
(Hint: this second question should not require a second calculation.)
In chapter 8 the simple linear regression model was extended to cover the
introduction of two or more explanatory variables, and in chapter 9 the
model was given its essential stochastic form.
In the present chapter two further extensions are described. First, §10.1
is devoted to the use of dummy variables. This is a procedure developed to
enable us to include in a regression a variable that cannot be measured in
the same way as a continuous numerical value (for example, income, or age
at marriage) but is instead represented by two or more categories (for
example, single, married, or widowed). This special form of a nominal (or
categorical) scale is known as a dummy variable.
Secondly, in §10.2 we develop the implications of the idea that it may be
appropriate for one or more of the explanatory variables to refer to an
earlier period than the one to which the dependent variable relates. Such
lagged values recognize the fact that there may be a delay before the
changes in the explanatory variable make their full impact.
were paid. These are dealt with by assigning a value of 1 whenever the
answer is Yes, and a value of 0 whenever the answer is No. Other examples
of classifications into two nominal categories that could be represented
by a dummy variable include male/female, urban/rural, graduate/non-
graduate, and house owner/not house owner.
A corresponding procedure in the context of a time series would be to
assign a value of 1 to every year in the period when there was some excep-
tional event such as a war, a strike, or an earthquake, and a value of 0 to all
other years.
Dummy variables can also be applied to a nominal variable with more
than two categories. For example, in a study of earnings it might be appro-
priate to distinguish four categories of social class: professional, skilled,
semi-skilled, and unskilled. Similarly, in a political study of US policy in
the period since the Second World War it might be useful to include the
role of different presidents as one of the explanatory variables, assigning a
separate dummy variable for each of the periods in which successive presi-
dents had held office. Other multiple category variables for which dummy
variables might be used include religion, occupation, region, or time
period.
Whatever the number of categories, one of them must be selected as the
reference category (also referred to as the control category or benchmark),
and this reference category must be omitted from the regression. An alterna-
tive, but generally less satisfactory procedure, is to omit the intercept rather
than the reference category. Failure to adopt one or other of these proce-
dures is known as the dummy variable trap. It is essential not to fall into
this trap, because what is measured if neither the control category nor the
intercept is omitted is effectively an identity with a perfect linear relation-
ship between the dummies and the intercept. In this situation, the R²
becomes 1 and the standard errors on all the coefficients become 0. Clearly
such results are nonsensical.
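As a practical illustration, the following sketch (Python, using pandas; the social-class categories follow the example mentioned above, and the choice of ‘unskilled’ as the reference category is our assumption) creates one dummy for each category and then explicitly drops the reference category to avoid the trap.

    import pandas as pd

    # Hypothetical observations on a four-category nominal variable
    df = pd.DataFrame({"class": ["professional", "skilled", "semi-skilled",
                                 "unskilled", "skilled", "unskilled"]})

    # One dummy per category...
    dummies = pd.get_dummies(df["class"], prefix="CLASS")
    # ...then omit the reference category to avoid the dummy variable trap
    dummies = dummies.drop(columns="CLASS_unskilled")
    print(dummies)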
To illustrate the proper application of dummy variables, consider the
data set for children leaving home in the United States in the decade
1850–60. One of the explanatory variables was the place of residence of the
family in 1850, subdivided into four areas, the Northeast, North-central,
South, and Frontier.a Northeast was selected as the reference category.
There are then three separate dummies (one fewer than the number of cate-
gories), each of which takes a value of either 1 or 0, as follows.
a The Frontier area was defined as Minnesota, Iowa, Kansas, Oklahoma, Texas, and states further
west: Richard H. Steckel, ‘The age at leaving home in the United States, 1850–1860’, Social
Science History, 20, 1996, p. 512.
R² = 0.357   N = 239
This illustrates two aspects of the use of dummy variables very clearly. First,
the four dummy variables enable districts to be subdivided into five catego-
ries, and the separate regression coefficients measure the effect on wages as
the districts get progressively further from London. Secondly, the districts
more than 100 miles from London are used as the reference category, and
are thus the basis for comparison with wages in districts nearer to the
capital.
INCOME = a + b1EDUC + e  (10.3)

For all black workers, however, BLACK = 1, so −40 × 1 = −40. Since this
is a constant it can be added to (or, rather, because in this case it is negative,
deducted from) the intercept, and the equation can be written as
Figure 10.1 [Regression lines for white and black workers with an intercept
dummy: the two lines are parallel, with the line for black workers $40 below
that for white workers. X-axis: EDUC (number of years education beyond
primary school).]
their position relative to the intercept depending on the sign and size of the
coefficients on the respective dummy variables.
The workers can thus be subdivided into four categories, with the two
dummy variables given the values of 1 or 0 as appropriate, and there would
be a separate equation for each category. For example, the equation for
white female workers would be
INCOME = a + b1EDUC + (b2 × 0) + (b3 × 1) + e

which reduces to

INCOME = a + b1EDUC + 0 + b3 + e  (10.7)

The intercept for white female workers would thus be (a + b3). Treating
the other three categories in the same way (and omitting the error term),
the right-hand side of the four regressions can be summarized as

           Males                       Females
White      (a + 0 + 0) + b1EDUC       (a + 0 + b3) + b1EDUC
Black      (a + b2 + 0) + b1EDUC      (a + b2 + b3) + b1EDUC
The intercept for white males is a and, as before, the coefficients b2 and
b3 are added to this intercept to determine the difference it makes to
INCOME if a worker is in one of the other categories (controlling for
EDUC). Thus for any level of EDUC, INCOME for male black workers
will differ from that for male white workers by b2; INCOME for female
white workers will differ by b3; and INCOME for female black workers
will differ by b2 + b3. We might expect to find that both b2 and b3 are
negative.d

An alternative specification, with a separate dummy for each of the three
non-reference categories (b2 for black males, b3 for black females, and b4
for white females), can be summarized in the same way:

           Males                           Females
White      (a + 0 + 0 + 0) + b1EDUC       (a + 0 + 0 + b4) + b1EDUC
Black      (a + b2 + 0 + 0) + b1EDUC      (a + 0 + b3 + 0) + b1EDUC
d For an application of this approach to the measure of wage discrimination against women in the
United States see Claudia Goldin, Understanding the Gender Gap. An Economic History of
American Women, Oxford University Press, 1990, pp. 84–7.
INCOME = a + b1EDUC + b2BLACK + b3(EDUC × BLACK) + e  (10.9)

For all white workers, both BLACK and EDUC × BLACK are zero. So
the coefficients b2 and b3 are eliminated and the coefficient on EDUC
remains as b1. However, for all black workers BLACK = 1, and so
EDUC × BLACK = EDUC. The intercept becomes (a + b2), and the
coefficient on EDUC becomes (b1 + b3).
The regression for white workers thus reduces to
INCOME = a + b1EDUC + e  (10.10a)

whereas that for black workers is

INCOME = (a + b2) + (b1 + b3)EDUC + e  (10.10b)

With the estimated coefficients, the line for white workers is
INCOME = 190 + 32EDUC, and that for black workers is
INCOME = (190 − 65) + 27EDUC.
The corresponding regression lines for the separate groups are shown in
figure 10.2. As can be seen, the line for black workers has both a different
intercept and a different slope.
The degree of discrimination revealed by this second data set is thus
greater than that in the data underlying figure 10.1. In that case each addi-
tional year of post-primary education had the same effect on the incomes
of the two groups of workers: incomes rose by $32 for each extra year. In
figure 10.2 that is still true for white workers, but for black workers the
return to an extra year of education is only $27. In addition, when we
control for the effects of education there is a bigger gap: the difference
between the two intercepts is now $65 rather than $40.
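A sketch of how such an interaction model might be estimated is given below (Python, using statsmodels formulas). The data are simulated to be consistent with the intercepts and slopes read off figure 10.2; the variable names follow the text, but the data set itself is hypothetical.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 200
    df = pd.DataFrame({
        "EDUC": rng.integers(0, 11, size=n),
        "BLACK": rng.integers(0, 2, size=n),
    })
    # Incomes consistent with figure 10.2: intercepts 190 and 125,
    # slopes 32 and 27, plus a random error term
    df["INCOME"] = (190 - 65 * df["BLACK"]
                    + (32 - 5 * df["BLACK"]) * df["EDUC"]
                    + rng.normal(0, 10, size=n))

    # EDUC * BLACK expands to EDUC + BLACK + their interaction, as in (10.9)
    model = smf.ols("INCOME ~ EDUC * BLACK", data=df).fit()
    print(model.params)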
However, this formulation of the two regression lines immediately
raises an awkward question: has anything been achieved by working
through the procedure for interacting a dummy variable with a numerical
variable? We have simply ended with two separate regressions. Why could
we not simply have formed two separate regressions to begin with?
The answer is that in this case we would get exactly the same estimates
from the regression with the EDUCBLACK interaction term as we
would from two separate regressions, one for white and a second for black
workers. In order for it to be worthwhile using the interaction between a
Figure 10.2 [Regression lines with both an intercept and a slope dummy:
for black workers (BLACK = 1), Y = (190 − 65) + 27EDUC. X-axis: EDUC
(number of years education beyond primary school).]
Suppose, for example, that investment (INV) depends partly on the rate of
interest (INT) and partly on the value of goods sold (SALES) in the previous year.
Using the subscripts t for the current year, and t − 1 for the year before
that, the regression would be specified as

INVt = a + b1INTt + b2SALESt−1 + et  (10.11)
SALESt−1 is an example of what is known as a lagged variable. Lagged
variables are treated in exactly the same way as any other explanatory vari-
ables and in a simple model such as the above do not require any special
procedures. Their use is not restricted to annual time periods; the periods
might equally be months or quarters if the required data are available.
When the effects of a variable are spread over several time periods in this
way the resulting model is known as a distributed lag model. Such models
are particularly common in economic analysis, where it is assumed that
firms, households, or other agents need time to assemble and react to the
information which influences their actions. However, these models might
also be used in other contexts.
For example, a political historian might think that voters’ political
choices are influenced by the economic conditions they experienced in the
year before an election and not just by the conditions prevailing in the year
of the election. She might then model this by including lagged values of rel-
evant economic variables, such as income or inflation.2 Other possible
explanatory variables reflecting successes or failures in foreign policy, or
major political scandals, might be similarly lagged.
One question that is immediately posed by such models is the length of
time over which this influence from the past is effective. One assumption,
which avoids the need to select a specific period, is that the effects last
forever: all previous periods are relevant. This is known as an infinite dis-
tributed lag model. Alternatively, the researcher may assume that the effect
lasts for only a fixed period of time, say five years.
For example, the model for INV in (10.11) might be altered to specify
that INV is affected by SALES both in the current year and also in each of
the five preceding years
INVt = a + b1INTt + b2SALESt + b3SALESt−1 + b4SALESt−2
       + b5SALESt−3 + b6SALESt−4 + b7SALESt−5 + et  (10.12)
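In practice the lagged terms in (10.12) are constructed by shifting the SALES series. A minimal sketch (Python, using pandas and statsmodels; the INV, INT, and SALES series are simulated, and the true coefficients are our assumptions) is:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 40
    data = pd.DataFrame({
        "INT": rng.normal(5, 1, size=n),
        "SALES": 50 + np.cumsum(rng.normal(1, 2, size=n)),
    })
    # Investment responds to the interest rate and to last year's sales
    data["INV"] = (2 - 0.5 * data["INT"]
                   + 0.3 * data["SALES"].shift(1, fill_value=50)
                   + rng.normal(0, 1, size=n))

    # Construct SALES lagged one to five years, as in (10.12)
    for lag in range(1, 6):
        data[f"SALES_{lag}"] = data["SALES"].shift(lag)

    # The first five observations are lost to the lags
    model = smf.ols("INV ~ INT + SALES + SALES_1 + SALES_2 + SALES_3 + "
                    "SALES_4 + SALES_5", data=data.dropna()).fit()
    print(model.params)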
However, the choice of five lagged values is rather arbitrary since the
theory of investment on which the model is based does not really provide
guidance on this issue. Furthermore, equations like this might be difficult
to estimate for two reasons. First, each time period involves a separate
regression coefficient, and thus one additional parameter that has to be
estimated; and each one uses up one further degree of freedom. In addition,
the successive lagged values of SALES are likely to be highly correlated
with each other, raising the problem of multicollinearity.e
e Multicollinearity will be discussed more fully in §11.4.2.
f When a model assumes in this way that the dependent variable is influenced by its own past
values this is referred to as autoregression. Special care is necessary when interpreting the results
from such a model because the inclusion of a lagged dependent variable violates the assumptions
underlying the standard OLS regression model. This issue is further discussed in §11.3.3.
The Koyck approach assumes that λ (lower-case Greek lambda) is the rate
of decline of the distributed lag and is a positive number with a value
greater than 0 but less than 1. We can incorporate this in an infinite
version of (10.12).
Notes
1 We may note here that it is also possible to form an interaction variable as the
product of two numerical variables. One way in which this can be done will be
covered in §12.1, where we consider the possibility of specifying a non-linear rela-
tionship (such as a parabola) that includes X2 as one of the explanatory variables.
Since X2 is obtained by multiplying a numerical variable, X, by itself, a squared term
such as this can equally be thought of as a further type of interaction variable. The
final type is then one in which one numerical variable is multiplied by another, as in
the example of AGRIC and LAND which is discussed in §14.2.2. In both these forms
of interaction variable the effect is that the slope of the regression relationship is
changing continuously.
2 See, for example, Howard S. Bloom and H. Douglas Price, ‘Voter response to short-
run economic conditions: the asymmetric effect of prosperity and recession’,
American Political Science Review, 69, 1975, pp. 1240–54. Their model of voting in
the United States from 1896 to 1970 included a measure of income designed to repre-
sent ‘the change in the “average” voter’s real spending power prior to each election’.
The best available measure of income as perceived by the voter was judged to be the
‘percentage change in real per capita personal income during the year preceding each
election’ (p. 1243, italics added).
3 This geometric pattern is also known as a Koyck distributed lag. This is an attractive
specification of the influence of past events at a constant rate over time. However,
this may not be the most appropriate form of lag. There are many other patterns
which have been suggested, including an arithmetic lag, in which the impact
declines linearly, and a polynomial lag, which can take various shapes, for example
one in which the impact first rises for earlier values of the variable and then falls.
To illustrate this last pattern, consider a change in the rate of interest that is
thought to build up to its maximum effect on INV over two years and then to die
away. The appropriate polynomial for a model with INVt as the dependent variable
would make the effect larger for a change in period t 1 than for one in t, larger still
for a change in t 2 than for one in t 1, but then progressively smaller for a change
in t 3 and all earlier periods.
4 There is a further point that might be made. Although Koyck’s transformation solves
certain problems, it is purely a mathematical manipulation and can be criticized
because it does not have any theoretical justification. There are, however, a number
of alternative models that do have a basis in economic theory, including the adaptive
expectations model and the stock adjustment (or partial adjustment) model.
Economic historians can find further discussion of these models in most economet-
ric textbooks.
the population; share of union workers in the labour force; size of the
government sector in the economy). The data for the model are drawn
from five successive elections between 1968 and 1984; a number of the
explanatory variables are in the form of dummy variables. Specify the
dummy variables for the regression if:
2. Let us imagine that a historian believes that certain counties were more
generous in their provision of poor relief than others and that this is a signifi-
cant explanation of the variation in relief payments across Southern parishes.
She finds qualitative support for this hypothesis for Sussex in the archives.
Use the dummy variable technique to test the hypothesis that Sussex
parishes were more generous than all others, holding other parish charac-
teristics constant. What do you find? Does the researcher’s insight stand up
to statistical scrutiny? What is the impact on the coefficients and standard
errors on the other variables in the Boyer equation?
3. The researcher, emboldened by her findings, sets out to test the more
general hypothesis that county was an important determinant of variation
across all Southern parishes.
Once again, use the dummy variable procedure to test this hypothesis.
Be certain to exclude one county to avoid the dummy variable trap. Which
county did you choose as the default and why? Do the coefficients and stan-
dard errors on the additional variables bear out the researcher’s hypothe-
sis? How do you interpret the size and signs of the coefficients?
Examine the coefficients and standard errors on the other variables. Do
these new results cause you to re-evaluate Boyer’s interpretation?
4. In §10.1.1, we reported the results of Boyer’s analysis of the influence of
proximity to London on agricultural wage levels in Southern England in
1903. Use the dummy variable procedure to repeat the exercise for 1831,
using the information in Boyer’s relief data set on agricultural wages
(INCOME), distance from LONDON, and specialization in GRAIN
where the variables are as defined in §10.2. It is unclear from the discussion
of these results whether the relationship between investment and sales is to
be interpreted as a Koyck distributed lag model, in which this year’s invest-
ment is determined by the history of sales over many previous years, or as a
statement that this period’s investment depends on last year’s sales.
As a commentator on these results, how would you interpret the coeffi-
cients on both INT and SALES if the underlying relationship between INV
and SALES were (a) a one-period lag, or (b) an infinite distributed lag?
The conditions required for OLS to satisfy the criteria for BLUE fall into
three categories. The first group concerns the specification of the model.
(a) The OLS approach assumes that the relationship between the depen-
dent and explanatory variables is linear. In a simple bivariate regres-
sion, the regression line is a straight line; in a multivariate regression,
the relationship between the dependent variable and each of the
explanatory variables is a straight line (see §11.2.1).
(b) The list of explanatory variables is complete and none is redundant.
There are no omissions of relevant explanators, no inclusion of irrele-
vant variables (see §11.2.2).
(c) The relationship between the dependent and explanatory variables is
stable across all observations (whether across individuals or over time)
(see §11.2.3).
The second group concerns the specification of the error term.

(a) The mean value of the error term across all observations is zero (see
§11.3.1)
(b) Each of the error terms has the same variance (the assumption of
homoscedasticity) (see §11.3.2)
(c) The individual error terms are uncorrelated with each other (see
§11.3.3).
The third group concerns the measurement and specification of the variables.

(a) The explanatory variables are correctly measured, i.e. they are not
systematically distorted so as to be always too high or always too low
(see §11.4.1).
(b) The explanatory variables are not systematically correlated with each
other (they are not collinear) (see §11.4.2).
(c) The explanatory variables are not systematically correlated with the
error term (see §11.4.3).
The rest of this chapter shows what happens when these various
assumptions are not maintained. We shall also suggest some simple tools
for detecting and correcting these violations of the classical linear regres-
sion model. However, there are two important points it is worth noting at
the outset. First, not all deviations from the list of fundamental assump-
tions are equally disastrous. Second, not all violations are easily detectable
or correctable. Some can be identified or corrected only with techniques far
beyond the scope of this simple introduction; some cannot be corrected at
all without finding new and better data.
Figure 11.1 [A U-shaped cost curve: costs plotted against output of
textbooks (000).]
When there is more than one explanatory variable, this may not be a
reliable methodology. In this case, the simplest method may be to split the
sample into subgroups according to the size of the explanatory variable
suspected of having a non-linear impact on Y, and run separate regressions
for each subgroup. In the absence of non-linearities, the coefficients should
not be significantly different from each other.
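The subgroup comparison just described can be sketched as follows (Python, using statsmodels; the data are simulated with a deliberately non-linear ‘true’ relationship, so the two subgroup slopes should differ sharply).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 20, size=200)
    y = 0.5 * x ** 2 - 4 * x + 30 + rng.normal(0, 5, size=200)  # non-linear 'truth'

    def subgroup_fit(mask):
        return sm.OLS(y[mask], sm.add_constant(x[mask])).fit()

    low = subgroup_fit(x < np.median(x))
    high = subgroup_fit(x >= np.median(x))
    # Sharply different slopes across the subgroups point to non-linearity
    print(low.params[1], high.params[1])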
Not all non-linearities are as severe in form or consequence as the text-
book example in figure 11.1. But without correction, all will produce
biased and inefficient estimates of the ‘true’ relationships among the vari-
ables. Fortunately, most are amenable to the solution of transforming the
variables. The procedures for doing this are explained and illustrated in
detail in chapter 12 (§12.2).
If the model is properly specified, the green area (and only the green area) will be used to
estimate the impact of X1 on Y; the orange area (and only the orange area),
the impact of X2 on Y; the brown area (for reasons given in §8.2.5) is
ignored. In this case, the estimators will be unbiased. But, if X2 is omitted,
X1 will use the information in both the green and brown areas to estimate
the relationship with Y, generating biased regression coefficients.b The
extent of the bias will be directly related to the extent to which the two
circles overlap (i.e. the extent to which X1 and X2 are collinear).c
The Ballantine for examining the case of the redundant variable is
shown in figure 11.2 (b), on the pullout. In this case, the true model for
estimating Y uses only information about X1; the coefficient is estimated
using the information in the green and brown areas. However, an addi-
tional variable (or set of variables) X2 is added spuriously; it has no real
impact on Y, although because of its strong collinearity with X1, it will
overlap with Y in the brown (and orange) areas.
This overlap reduces the information available to the regression model
for the estimation of the coefficient on X1. The resulting coefficient is not
biased (since the green area uses information unique to X1), but it is
inefficient: it has a higher variance than would result in the correctly
specified model (the green area being smaller than the green–brown area,
the coefficient is measured with too little information).
Thus, omission is more hazardous than redundancy. But this sort of logic
may lead to ‘kitchen-sink’ regression analysis in which every measurable var-
iable is included in a regression on the off-chance that it will turn out to be
important. The problem, as may readily be imagined, is that the more redun-
dant variables are added, the more the efficiency of the relevant explanatory
variable will suffer. In the extreme, the coefficient on X1 will have such a large
standard error that it will appear to be statistically insignificant, and will be
ignored when explaining the evolution of the dependent variable.d
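A small simulation makes the contrast between omission and redundancy concrete. In this sketch (Python; the degree of collinearity and the true coefficients are assumptions) X2 is omitted from a model in which it belongs, and the coefficient on X1 absorbs part of its effect.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 1000
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)   # collinear with x1
    y = x1 + x2 + rng.normal(size=n)                # true coefficients are 1.0

    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    short = sm.OLS(y, sm.add_constant(x1)).fit()
    print(full.params[1])    # close to the true value of 1.0
    print(short.params[1])   # biased upwards, by roughly the 0.7 overlap with x2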
b At the same time, because it uses more information to estimate the coefficient, it will have a
smaller variance. This indicates the danger of comparing t-statistics from different regressions to
determine which is the superior specification.
c Obviously, if X1 and X2 are orthogonal (see §8.2.5), the Ballantine circles will not overlap and
omission of one variable will not generate bias in the evaluation of the other. If the researcher is
interested only in a hypothesis test of the impact of X1 on Y, the omission of orthogonal variables
is acceptable. However, if the purpose is to explain the behaviour of Y, omission of X2 will
constitute a mis-specification since the model does not capture all the influences on Y.
d You may wish to add a fourth circle to the Ballantine of figure 11.2 (b) that intersects points A
and B; this will reduce the total amount of information used to link X1 to Y to a thin sliver,
effectively reducing its explanatory power to almost nothing.
Figure 11.3 Illustration of the use of dummy variables to allow for
structural changes: (a) Introduction of an intercept dummy variable;
(b) Introduction of a slope dummy variable; (c) Introduction of both an
intercept and a slope dummy variable. [Each panel plots Y against X.]
where Di = 0 if Xi < Xt
      Di = 1 if Xi ≥ Xt
In figure 11.3 (b) the slope changes after the break (this is equivalent to
the threshold effect of the DVD case). This is commonly referred to as a
piecewise linear regression. In this case, it is necessary to add a slope
dummy variable, obtained by the interaction of the relevant dummy vari-
able and the explanatory (numerical) variable thought to be affected by the
structural break. The dummy variable will have a value of 0 up to the
threshold, and of 1 at and above it. In this case, the model takes the follow-
ing shape
Yi = a0 + b1Xi + b2(Xi − Xt)Di  (11.2)

where Di = 0 if Xi < Xt
      Di = 1 if Xi ≥ Xt
In figure 11.3 (c) the entire relationship changes after the break: both
slope and intercept are different. It is therefore necessary to incorporate
both an intercept dummy and a slope dummy. In this case, it is imperative
that the regression also includes the untransformed variables (the constant
and X). Thus, the model takes the following form
Yi = a0 + a1Di + b1Xi + b2(Xi − Xt)Di

where Di = 0 if Xi < Xt
      Di = 1 if Xi ≥ Xt
The dummy variables should be evaluated in the same way as any other
variable in the regression. Clearly, if they are not statistically significant, the
source of the structural change has been misdiagnosed. Alternatively, it
may be that the Chow test is identifying some other problem in the con-
struction of the model.
11.3.2 Heteroscedasticity
Heteroscedasticity is a complicated term that describes a fairly simple
violation of the assumptions of classical linear regression methods. The
CLR approach assumes that the variance of the error term is equal across
all observations; this is known as homoscedasticity (equal spread).
Heteroscedasticity (unequal spread) arises when the variance of the error
term differs across observations.
In figure 11.4 it can easily be seen that the variance of the error (the
breadth of the error distribution) in Y becomes progressively larger as X
increases. This case of heteroscedastic errors may be compared to figure 9.1
(b), where the error term is shown, by assumption, as homoscedastic. Note
that the violation of the CLR assumption occurs only when the error term
shows a clear, systematic, pattern of distortion; random differences in the
size of the error term across observations do not constitute heteroscedas-
ticity.
Heteroscedasticity may arise for a number of reasons.
(a) In time-series analysis, it may arise because of a process of learning by
experience, or, as econometricians commonly say, error-learning. If, for
example, the number of errors your class makes on statistics exams
declines as you take more of them, the error term in a regression of
Figure 11.4 [Heteroscedastic errors: the spread of the distribution of Y
widens as X increases from X0 to X3.]
11.3.3 Autocorrelation
The CLR approach requires that the errors associated with each observa-
tion are independent. If there is a systematic relationship among the errors,
this is known as autocorrelation.
The most common environment for autocorrelation to occur is in time-
series analysis. In this case, the error in one time period is influenced by the
error in a previous period, which is in turn influenced by the error in a
period before that, and so on. The structure of errors when mapped against
time will show a systematic pattern, rather than a random distribution as
required by the CLR model. Time-series autocorrelation is also referred to
as serial correlation.
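One standard diagnostic for first-order serial correlation, the Durbin–Watson statistic, can be computed from the regression residuals. The sketch below (Python, using statsmodels; the AR(1) error process and its coefficient of 0.8 are assumptions made for illustration) shows the statistic falling well below 2, its approximate value when the errors are independent.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(6)
    n = 100
    x = np.arange(n, dtype=float)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = 0.8 * e[t - 1] + rng.normal()   # AR(1) errors
    y = 2 + 0.5 * x + e

    res = sm.OLS(y, sm.add_constant(x)).fit()
    print(durbin_watson(res.resid))            # well below 2 for these errors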
Serial correlation of errors happens if a shock to the system being
described by the regression model creates echo or ripple effects over more
than one period. A collapse of the stock market, for example, will
reverberate through the financial and economic system for some time
after its initial occurrence, and the errors may follow a distinctly
non-linear trajectory over time, diverging from the linear specification
implicit in OLS regression.

Figure 11.5 Typical patterns of autocorrelated errors. [Four panels,
(a)–(d), plot regression errors against time, 1850–1880.]

Although autocorrelation is normally associated with time-series data,
it may also crop up in cross-section analysis. In this case, the error asso-
ciated with one observation (one Irish county, or one Poor Law district) is
systematically related to the error in another observation (or observa-
tions). Such spatial autocorrelation could occur in migration analysis, for
example, if there were information chains which linked one county (say,
Dublin) to another (say, Limerick). In this case, a shock to the level of emi-
gration of Dubliners would also increase emigration from Limerick.
Similarly, if the authorities in charge of setting Poor Law rates in one Kent
parish were influenced by rate-setting in other parishes in the county, it
g
The alternative regression lines gradually converge towards the true relationship because the
influence of shocks tends to diminish over time.
Figure 11.6 Autocorrelation. [The true relationship and alternative
fitted OLS lines plotted over time, 1850–1880.]
11.3.4 Outliers
A potential further problem results when one (or more) of the errors in the
error term is unusually large, relative to the rest of the distribution. While
this does not formally contravene any of the assumptions of the CLR
model it does have some negative consequences, especially for OLS estima-
tion, and should not be ignored.
A potential warning sign of this problem is the presence of outliers
in a sample. These are observations in a data set that do not conform to
the patterns suggested by the remainder of the sample.i They are unusual
observations.
Two types of outlier may be identified. One occurs when the value of an
explanatory variable for an observation is very different from the remain-
ing values in the sample. The other occurs when the dependent variable for
an observation is very different from that predicted by the regression
model (it is measured with a large residual).
Both types of outlier may seriously affect regression results. An unusual
value of an explanatory variable relative to the rest of the sample may con-
stitute a leverage point, giving it unusual influence over the value of the
regression coefficient. Thus, in the bivariate model of figure 11.7 (a), most
of the data points are clustered together in the centre of the distribution,
while one observation clearly diverges. The slope of the OLS regression line
in this example depends entirely on the position of that one data point.
h One test that was much used in previous econometric work to test for serial correlation in
autoregression is Durbin’s h. However, recent analysis suggests that this is an inappropriate
procedure.
i For a preliminary analysis of the problem, see §3.1.3.
The second type of outlier is depicted in figure 11.7 (b), in which one
observation is measured with a large error. The least squares method will
cause the regression to pivot towards that observation, as it searches to
minimize the aggregate value of the residuals (‘the sum of squares’), thus
distorting the results. Once again, a single observation has a large influence
on the regression results.
Thus, the key issue is not whether an observation is an outlier per se, but
whether it is an influential observation – i.e. whether it has unusual
influence on the regression model.
Not all influential observations are equally harmful. Indeed, it is often
pointed out that the existence of explanatory variables far from the centre
of the sample is a good thing, because it gives the regression model more
information with which to discern patterns in the relationship among var-
iables. As a general rule, the more tightly clustered the values of a particular
explanatory variable, the more poorly defined its coefficient estimate.
The real problem is not with leverage points (unless it can be shown that
they are misleading), but with rogue observations. These are observations
that do not belong with the rest of the data set, not because one or more of
the explanatory variables were unusual, but rather because the model
linking X and Y is not the same for that observation as for the rest of the
data. Rogue observations carry the potential to distort parameter estimates,
invalidate test statistics, and may lead to incorrect statistical inference.
How should we test for outliers and determine whether they are good or
bad influences?
Clearly, the first step is to determine whether there are any influential
observations in the data set. The crucial question is whether the regression
model is significantly changed by the presence of any single observation.
This can be addressed by comparing the model results with and without
each data point in sequence. Tests can determine whether the absence of an
observation causes any regression coefficient to change by a significant
amount (the dfBeta test), or whether it significantly affects the measure-
ment of the dependent variable (the dfFits test).
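These observation-by-observation comparisons are automated in most statistical packages. A minimal sketch (Python, using statsmodels’ influence measures, which include dfbetas and dffits; the data, and the single planted rogue observation, are our own) is:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    x = rng.normal(size=50)
    y = 1 + 2 * x + rng.normal(size=50)
    y[0] += 15                                # plant one rogue observation

    res = sm.OLS(y, sm.add_constant(x)).fit()
    influence = res.get_influence()
    print(influence.dfbetas[0])               # coefficient changes if obs 0 is dropped
    dffits, threshold = influence.dffits      # dffits values and a rule-of-thumb cutoff
    print(dffits[0], threshold)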
Let us say that a model has been tested and one or more observations are
revealed to be influential. What should be done about them? It is tempting
to discard the observations entirely. But, because of the potential value of
leverage points, this should be resisted. Rogue observations are more
clearly expendable. But how can they be distinguished?
Possible leverage points are often identified by graphing a scatter plot of
the data and searching for unusual observations.j But there are clear limits
j This was done in figure 3.2, in our preliminary discussion of the problem.
Figure 11.7 [(a) An outlier as a leverage point; (b) A rogue outlier.
Each panel plots Y against X.]
to the usefulness of graphical aids, especially when dealing with more com-
plicated multivariate models, where it may be difficult to spot an outlier by
eyeballing. More helpful are tests (such as Hadi’s) that systematically
examine the distributional parameters of the data set and determine
whether any observations fall into the upper or lower tails.
There are also a number of tests for outliers due to error in the estimat-
ing equation.k One is to include observation-specific dummies in the
regression, and to test whether the coefficients on these are statistically
significant. Ideally, this test should be run over all observations, in order to
pick up any outlier – it is a test of the entire system, not just of one apparent
odd case.
Regardless of the outcome of these tests, if outliers are identified, the best
course is to abandon OLS estimation, and to utilize one of the various
regression methods that have been developed to accommodate outliers.
These are known as robust regressions, implying that their results are robust
to problems of this sort. There are a variety of robust estimators, almost all of
which use weighting techniques to reduce the emphasis on high-error obser-
vations. All are superior to OLS methods in this situation. They may be run
over the entire data set and their results compared to the OLS specification.
If the robust regression method generates very different results, it suggests
some further problems with the data. For an illustration, see the robust
regression we run in §14.1.2 to test the Benjamin and Kochin model.
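A sketch of such a comparison (Python, using statsmodels’ RLM estimator with Huber weights as one example of a robust, weights-based method; the data and the planted outlier are simulated) is given below.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    x = rng.uniform(0, 10, size=40)
    y = 3 + 1.5 * x + rng.normal(size=40)
    y[-1] += 25                                      # one large-error observation

    X = sm.add_constant(x)
    ols = sm.OLS(y, X).fit()
    robust = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
    print(ols.params)       # slope pulled towards the outlier
    print(robust.params)    # the down-weighted fit is much less distorted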
The Government are very keen on amassing statistics – they collect them, add
them, raise them to the nth power, take the cube root and prepare wonderful dia-
grams. But what you must never forget is that every one of those figures comes in
the first instance from the chowty dar (village watchman), who just puts down what
he damn well pleases.
k Visual inspection of the regression residuals is not one of them, since the pivoting of the OLS line
to minimize the total sum of squares will tend to mask individually large residuals (see figure
11.7 (b)).
l Sir Josiah Stamp, Some Economic Factors in Modern Life, P. S. King & Son, 1929, p. 258.
11.4.2 Multicollinearity
Multicollinearity is a term that was originally introduced to capture the
effect of a perfect, or exact, linear relationship between two (or more) of
the explanatory (right-hand) variables in a regression.
m Kennedy discusses the instrumental variables approach in a Ballantine framework in Peter E.
Kennedy, A Guide to Econometrics, 3rd edn., Blackwell, 1992, pp. 143–5.
We have already met this problem in the form of the ‘dummy variable
trap’ (see §10.1.1). The problem could also arise if, in a regression on
county-level Irish migration, both AGRIC (the share of the labour force in
agriculture) and NONAG (the share of the labour force employed outside
agriculture) were included; or if a regression of aggregate relief payments
per county were to include both the number of persons receiving relief and
the average payment per recipient. But, as these examples make clear, the
chances of encountering perfect linear relationships outside of design flaws
in the model are rare.
Given the rarity of this situation, the term multicollinearity is now gen-
erally applied more broadly to situations in which two or more explanatory
variables are highly, rather than perfectly, correlated with each other.
Within our data sets, there are numerous cases where we would expect
to find significant correlations among variables. In the cross-section analy-
sis of migration by Irish county, for example, we might well expect RELIEF
and HOUSING to be significantly related to each other, as well as URBAN
and AGRIC. Neither would be perfectly collinear, but statistical tests will
reveal high correlation coefficients across the data set.
What will happen as a result of such multicollinearity? In the extreme
case of perfect collinearity, the computer will generate coefficient estimates
that are indeterminate with standard errors that are infinite. Indeed, if any
regression produces such a result (see, for example, question 3), it is an
indication that the researcher has either fallen prey to multicollinearity, or
has fallen into the dummy variable trap. The solution is easy: the offending
supernumerary variable must be identified and dropped from the list of
right-hand variables.
What about the weaker, but more frequent, case of high but imperfect
collinearity? The standard errors will be very high relative to the size of the
coefficient, generating a low t-statistic, and suggesting that the explanatory
power of the affected variable will be weak. The reported confidence inter-
val around the measured coefficient will be large; frequently, it will include
zero, suggesting that the variable is without statistical significance. In cases
of true multicollinearity (in which more than two explanatory variables
are highly correlated with each other), the statistical significance of all
affected variables will be pushed towards zero. At the same time, the overall
explanatory power of the equation, as measured by R2, may be very high
(e.g. 0.9 or higher).
But not all cases of insignificant t-statistics can be blamed on multicolli-
nearity. In many cases, it may simply be that there is no systematic relation-
ship between X and Y. How can one distinguish between these two cases?
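One common diagnostic, though not the only one, is the variance inflation factor (VIF), which measures how much the variance of each coefficient is inflated by its correlation with the other explanatory variables. The sketch below (Python, using statsmodels; the URBAN/AGRIC-style variables are simulated, and the conventional warning threshold of about 10 is a rule of thumb) illustrates the calculation.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(9)
    urban = rng.normal(size=100)
    agric = -0.95 * urban + rng.normal(scale=0.2, size=100)   # highly collinear
    X = sm.add_constant(np.column_stack([urban, agric]))

    for i in (1, 2):   # column 0 is the constant
        print(variance_inflation_factor(X, i))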
11.4.3 Simultaneity
If we run a regression of Y on X, we are asserting that Y is the dependent
variable and X is its explanator. There is a strict causal link from the right-
hand side of the equation to the left.
But (as we have already noted in §4.1.1) it is not difficult to think of
circumstances when the relationship between X and Y is not unidirec-
tional. In such cases, Y is influenced by X but X is also influenced by Y. The
classic example relates to the market for a particular commodity. Increases
in the supply of this commodity will tend to reduce its price (as its scarcity
declines, people have to pay less to purchase it). At the same time, however,
as the price of the commodity falls, its supply tends to drop (as producers
shift towards manufacturing other items that will generate more revenue).
Price is determined by supply (and demand); at the same time, supply (and
demand) are determined by price.
This is the case of simultaneous equations, in which there is no truly
independent, or exogenous, variable determined outside the framework of
the model. All variables are, to a greater or lesser extent, endogenous – that
is, determined within the model. The explanatory variables are no longer
independent of the error term (since a shift in the error term directly
changes the dependent variable, which in turn changes the explanatory
variable because of the feed-back effect), thus violating one of the principal
assumptions of the classical linear regression.
Simultaneity generates biased estimates of the relationship between X
and Y. Moreover, the bias does not disappear as the sample size increases (it
is therefore also inconsistent). In the extreme case, this arises because it is
simply not possible to characterize the relationships between X and Y with
a single estimating equation. The extent of bias will depend on the degree
of correlation between the explanatory variables and the error term: the
higher the correlation, the greater the bias.
What can be done about simultaneity? If the extent of such bias is small, it is tempting to do nothing. In other words, if the extent of feed-back from Y to X is very limited, the correlation between X and e will be small, and the bias in the estimated coefficient on X will also be small.n
If, however, the problem is considered to be more significant, then one of two solutions may be followed. The first is to model the system more appropriately and then estimate the relationships as a system, rather than as a series of separate regressions. Such an approach, however, falls far beyond the limits of this book. A more limited response is to use single-equation methods, but to do so as a two-part procedure. This is the technique known as two-stage least squares (2SLS), a special case of the more general method of instrumental variables.
n  Both errors-in-variables and simultaneous equations raise the same problem of correlation between X and e, although the causal connection is significantly different in each case.
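For readers who wish to see the mechanics, the following sketch (in Python, with simulated data rather than any of the book's data sets; all names and values are invented) illustrates the two stages: the endogenous explanatory variable is first regressed on an instrument, Z, and its fitted values then replace it in the equation for Y.

    import numpy as np

    # A minimal two-stage least squares sketch with simulated data.
    # Z is an instrument: correlated with X but not with the error term e.
    rng = np.random.default_rng(0)
    n = 500
    Z = rng.normal(size=n)
    e = rng.normal(size=n)
    X = 0.8 * Z + 0.5 * e + rng.normal(size=n)   # X is endogenous: correlated with e
    Y = 1.0 + 2.0 * X + e                        # true slope is 2

    def ols(design, y):
        # Ordinary least squares coefficients.
        return np.linalg.lstsq(design, y, rcond=None)[0]

    ones = np.ones(n)
    # Stage 1: regress X on the instrument and keep the fitted values.
    stage1 = np.column_stack([ones, Z])
    X_hat = stage1 @ ols(stage1, X)
    # Stage 2: regress Y on the fitted values of X.
    b_2sls = ols(np.column_stack([ones, X_hat]), Y)
    b_ols = ols(np.column_stack([ones, X]), Y)
    print("OLS slope (biased):", b_ols[1], "2SLS slope:", b_2sls[1])

Note that the standard errors printed by a naive second-stage regression would need correction; dedicated 2SLS routines handle this automatically.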
We have treated each of the violations of the CLR model separately. However,
although our analysis implies that each problem of mis-specification may be
treated in isolation, most practitioners recommend that more general strate-
gies be employed that investigate the model as a whole. We discuss this and
other aspects of best-practice model evaluation in §12.6.
Notes
1. An alternative methodology for identifying structural breaks is to run the regression
over one subgroup, and then apply the estimated coefficients to the known explana-
tory variables to forecast the value of the dependent variable in that part of the
sample thought to lie beyond the structural break or threshold. The measured values
are then compared to the forecast; if the measured values fall outside the predicted
confidence interval, this is prima facie evidence of parameter change.
2. Note that autocorrelation may occur because of specification problems in either
model or variable selection. We observed an example of model mis-specification in
our discussion of the U-shaped cost curve in §11.1.1, where we noted that the resid-
uals from figure 11.1 follow an inverted U-shape against firm size. This situation is similar to that depicted in figure 11.5 (d) and represents a clear case of autocorrelation. A similar outcome may result from excluding a relevant explanator from a
regression, especially if the missing variable follows a systematic pattern over time.
In these cases, the correct procedure is to correct for the mis-specification; if this is
done properly, it should also take care of the autocorrelation.
3. The analysis in this chapter assumes first-order autocorrelation, i.e. that the correla-
tion across errors is from this period to the last, and so on. In quarterly data, it may be
that the error in this quarter’s data is correlated not to the last quarter’s error, but to
the error from one year (four quarters) ago. If this is the case, the DW statistic will
not diagnose the problem. More advanced textbooks discuss the procedures appro-
priate to this situation.
4. Simultaneity is not to be confused with the situation of reverse causation in which a
model is compromised because the causation really runs in the opposite direction
from that specified by the researcher; see, for example, the discussion of such a possi-
bility in relation to Benjamin and Kochin’s model in §14.1.1. Reverse causation is a
problem of model mis-specification that arises not from contraventions of the statis-
tical properties of the CLR model, but rather from a fundamental misreading of the
historical evidence. It is another instance of the principle that correlation need not
imply causation.
(i) Check that your results are the same as those reprinted in table 15.2.
(ii) Instruct the computer to generate residuals (i.e. Ŷi − Yi) from the
estimating equation. Plot these residuals against the value of (a)
BRTHRATE; (b) INCOME; and (c) POP.
Describe and interpret your findings. Are there any indications of model
mis-specification in these results?
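As a hint on mechanics, the residuals can be generated and plotted along the following lines (Python; the file name and the dependent variable, here called DEPVAR, are placeholders for the actual estimating equation of table 15.2).

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    df = pd.read_csv("data.csv")   # hypothetical file name
    model = smf.ols("DEPVAR ~ BRTHRATE + INCOME + POP", data=df).fit()

    # Plot the residuals against each explanatory variable in turn;
    # any systematic pattern is a warning of mis-specification.
    for var in ["BRTHRATE", "INCOME", "POP"]:
        plt.scatter(df[var], model.resid)
        plt.axhline(0)
        plt.xlabel(var)
        plt.ylabel("residual")
        plt.show()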
2. With the aid of the combined Boyer data set, construct a model that tests
the importance of omitted and redundant variables for evaluating regres-
sions. For example, in a model explaining INCOME, you might choose to
exclude LONDON; alternatively, you might choose to include COUNTY
(as a continuous, rather than as a dummy, variable). Be sure to examine the impact of both problems on t-statistics and R². Is it correct to assert that the
impact of omitted and redundant variables is larger, the smaller the
number of explanatory variables included in the regression?
5. Write a brief essay, identifying what you consider to be the most likely
sources of error in historical statistics, being sure to indicate which errors
are most serious for statistical analysis and how you might try to evaluate
their significance.
6. A researcher uses the Boyer relief data set to construct a model explain-
ing the cross-section pattern of wealth per person across early nineteenth-
century English parishes. The model is simply a linear regression of
WEALTH on three variables, LONDON, DENSITY, and GRAIN. Is there
any reason to suspect that the data may be contaminated by heteroscedas-
ticity?
As a preliminary test, you examine the variance of wealth across
parishes in the data set. What do you find and how do your findings
compare with your prior expectations? (Hint: instruct the computer to
group the parish data by size, rather than trying to analyse the problem
using parishes singly.)
As the next step, you re-examine the researcher’s model for signs of
heteroscedastic disturbances. Run the regression model and examine the
residuals to determine whether there are signs of heteroscedasticity, and if
so whether:
(a) they conform with the preliminary analysis of the variance of wealth
(b) they appear to be serious.
Write a brief report to explain your findings.
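One way of carrying out the preliminary grouping step is sketched below (Python; the file name and the parish-size column, here POP, are assumptions rather than the actual names in the data set).

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("boyer.csv")   # hypothetical file name
    model = smf.ols("WEALTH ~ LONDON + DENSITY + GRAIN", data=df).fit()
    df["resid"] = model.resid

    # Group parishes into five size classes and compare residual variances:
    # variance that rises (or falls) systematically with parish size is a
    # classic sign of heteroscedasticity.
    df["size_group"] = pd.qcut(df["POP"], 5, labels=False)
    print(df.groupby("size_group")["resid"].var())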
7. Test the Benjamin–Kochin data set for signs of serial correlation in:
The dummy variable test focuses on the stability of only one explanatory
variable. You wish to evaluate the stability of the entire model over the
period. In order to do so, you choose the Chow test. A working description
of the test may be found in §14.1.2.
(iii) Use the Chow test to evaluate whether there is a structural break in
the emigration model in 1896. What do you find?
(iv) Compare the coefficient estimates from the two subsamples with the
regression run over the entire period.
What is your overall response to the critic? How, if at all, does this alter our
understanding of Irish emigration in this period?
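For those computing the Chow test by hand, the F-statistic can be built from the residual sums of squares of the pooled and split regressions, as in the sketch below (Python; the file name and the right-hand-side variables are stand-ins for the actual emigration model).

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("emigration.csv")   # hypothetical file and column names
    formula = "EMIG ~ WAGE + URBAN"      # stand-in for the emigration model

    full = smf.ols(formula, data=df).fit()
    pre = smf.ols(formula, data=df[df.YEAR < 1896]).fit()
    post = smf.ols(formula, data=df[df.YEAR >= 1896]).fit()

    k = len(full.params)                 # number of estimated parameters
    rss_split = pre.ssr + post.ssr       # residual sums of squares, subsamples
    n = len(df)

    # Chow F-statistic: does allowing the coefficients to differ before and
    # after 1896 significantly reduce the residual sum of squares?
    F = ((full.ssr - rss_split) / k) / (rss_split / (n - 2 * k))
    print("Chow F =", F)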
Chapter 12 Further topics in regression analysis
[Figure: a variable plotted against age (years), from 0 to 36, on the horizontal axis; the vertical axis runs from 40 to 140.]
a  If any of these relationships were used to model data in regression analysis, the model would also have to include an error term.
b  Equation (12.2) has a single explanatory variable, X, which appears as X and also squared. It is one of a family of curves (known as polynomials). A quadratic might have additional explanatory variables (X2, X3, and so on). A higher-order polynomial will have higher powers than X²; for example, the cubic equation will have a term in X³, and the model might be: Y = a + b1X + b2X² + b3X³. Any polynomial will be non-linear.
c  The mathematical name for a line such as this, which a curve approaches progressively closer to without ever reaching it, is an asymptote.
d  In each case, we have set a = 400; b1 = 45; and b2 = 1.5. Only the signs have been changed.
[Figure: curves of Y against X for b = 3.5, b = 3, and b = 2.5 (upper panel), and for negative values of b (b = −1, −2, −3; lower panel).]
[Figure: curves of Y against X for b = 2.5, b = 2.2, and b = 2 (upper panel), and for values of b less than 1 (b = 0.9, 0.7, 0.5; lower panel).]
[Figure 12.4 Reciprocal curves: Y = a + b/X. Four panels: a and b both positive; a is negative, b is positive; a is positive, b is negative; a and b both negative.]
The reciprocal term included in this expression ensures that the relation between X and Y is non-linear; as X rises, the value of the expression and, therefore, the slope of the curve will change.
In the case of the lower panel, figure 12.5 (b), the location of the turning point, at which the slope of the curve changes (either from positive to negative, or vice versa), is when X equals −b1/(2b2); setting the slope, b1 + 2b2X, equal to zero gives this value. In the left-hand curve, b1 equals 45 and b2 is −1.5; thus, the curve reaches its maximum when X equals 15. In the right-hand curve, b1 is −45 and b2 equals 1.5; the curve reaches its minimum when X equals 15.e
e  Note the similarity between the top curve in figure 12.5 (a) and the geometric curve with b > 1; and also the similarity between the lower curve in figure 12.5 (b) below the turning point and the exponential curve with b < 1.
[Figure 12.5 Quadratic curves: Y = a + b1X + b2X². The panels show curves for different sign combinations of b1 and b2, including one in which b1 is negative and b2 is positive.]
The final non-linear relationship to be considered is the logistic. The most general form of the logistic curve is given in (12.7). With four constants and the variable, X, there are numerous possible permutations, and we will mention only a selection of the various shapes the curve can take. If the constants g and h are both positive, b is greater than 1, and the power x in the denominator is negative, then the logistic curve slopes upward from left to right in an elongated S-shape (also known as a sigmoid curve).
[Figure 12.6: logistic curves, including one labelled b = 1.5, plotted for x from −10.0 to 10.0.]
The constant, a, determines the point at which the curve begins to rise;
the greater a is, the further to the right will be the point at which the curve
begins to bend upwards. The final constant, b, determines the slope of the
curve: the greater it is, the steeper the slope.
Figure 12.6 shows two logistic curves, for both of which g = 0 and h = 1, so that the upper limit of both curves is 1. The two curves have the same value of a (= 2), but different slopes: for one b = 1.5 and for the other b = e (the exponential constant introduced in §1.6.2, with a value to four decimal places of 2.7183). The reason for choosing this value for b is explained below.
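A sketch of the curve is easy to generate. The code below assumes the logistic takes the form Y = g + (h − g)/(1 + a b^(−x)); this particular parameterization is an assumption for illustration, chosen to reproduce the behaviour described in the text (lower bound g, upper bound h, slope governed by b).

    import numpy as np
    import matplotlib.pyplot as plt

    def logistic(x, a=2.0, b=np.e, g=0.0, h=1.0):
        # Assumed form: Y = g + (h - g) / (1 + a * b**(-x)).
        # With g = 0 and h = 1 the curve rises from 0 towards an upper limit of 1.
        return g + (h - g) / (1 + a * b ** (-x))

    x = np.linspace(-10, 10, 200)
    plt.plot(x, logistic(x, b=1.5), label="b = 1.5")
    plt.plot(x, logistic(x, b=np.e), label="b = e")
    plt.legend()
    plt.show()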
f  For example, if (12.5) were modelled with an additive rather than a multiplicative error, so that the model was Y = aX1^b1 X2^b2 + e rather than Y = aX1^b1 X2^b2 e, then it would be intractably non-linear.
Defining a new variable Z ≡ X², the quadratic model (12.2) becomes

Y = a + b1X + b2Z + e    (12.2a)

in which there are no non-linear terms. This procedure is adopted in §12.4 to fit a non-linear trend to a time series. Similarly, the reciprocal model in (12.3) becomes (with Z ≡ 1/X)

Y = a + bZ + e    (12.3a)

This simple procedure can be followed for all models in which the non-linearity occurs only in the X variables, not in the regression coefficients.
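In practice the substitution is a single line of code. The simulated example below (Python) uses the quadratic of footnote d, with a = 400, b1 = 45, and b2 = −1.5, and shows that an ordinary least squares regression on X and Z ≡ X² recovers the parameters.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 30, 200)
    Y = 400 + 45 * X - 1.5 * X**2 + rng.normal(0, 20, 200)

    Z = X ** 2                                   # the substitution
    exog = sm.add_constant(np.column_stack([X, Z]))
    print(sm.OLS(Y, exog).fit().params)          # close to 400, 45, -1.5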
g  (i) two numbers can be multiplied by taking the sum of their logs, e.g. log (X × Z) = log X + log Z; and (ii) the log of a number raised to a power equals the exponent times the log of the number, e.g. log X³ = 3 log X.
h  The multivariate version in (12.5) is essentially the same but has one extra term in log X2. The transformation is thus: log Y = log a + b1 (log X1) + b2 (log X2).
i  Some writers also refer to this as a log-linear model because – as can be seen in figure 12.6 – it is linear when plotted in logs. However, this usage may cause confusion, and we prefer to reserve the term log-linear for semi-logarithmic models such as (12.6a) and (12.9), where only the dependent variable is in logs. By extension, the semi-log model (12.8), where only the explanatory variable is in logs, is sometimes called a lin-log model. Using the terms in this fashion has the advantage of indicating immediately whether the logged variables are on the left-hand side of the equation (log-linear), the right-hand side (lin-log), or both (log-log).
j  See Timothy J. Hatton and Jeffrey G. Williamson, ‘After the famine: emigration from Ireland, 1850–1913’, Journal of Economic History, 53, 1993, p. 581; and Daniel K. Benjamin and Lewis A. Kochin, ‘Searching for an explanation for unemployment in interwar Britain’, Journal of Political Economy, 87, 1979, p. 453.
[Figure: the logarithmic transformation of a geometric model, with a = 2 and b = −1. (a) The original curve, with Y and X on arithmetic scales; (b) the double logarithmic transformation, log Y = log a + b (log X), with Y and X on logarithmic scales.]
Models of this form are less common, but are used, for example, in
studies of earnings, with the log of the wage taken as the dependent vari-
able and factors such as age, education, and a dummy variable for union
membership as the explanatory variables.k
A third version of the semi-log (or log-linear) model is obtained by transformation of the exponential curve (12.6), giving an equation in which both Y and the coefficient b are in logs, but X is not. In what follows, the logged variables are denoted Y* ≡ log Y and X* ≡ log X.
[Figure: the logarithmic transformation of an exponential model, with a = 2 and b = 2.5. (a) The original curve, Y = ab^X, on arithmetic scales; (b) the same curve with Y on a logarithmic scale.]
Y = 2X^1.75    (12.4d)

Essentially the same procedure is followed for the estimation of the semi-log models, except that there is only one new variable to be created in each case: X* ≡ log X for the model of (12.8), and Y* ≡ log Y for the model of (12.9). In neither case is the intercept a in log form.
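The whole procedure, from transformation to estimation, might look as follows (Python; the data are simulated from the geometric model of (12.4d), Y = 2X^1.75, with a multiplicative error).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    X = rng.uniform(1, 10, 200)
    Y = 2 * X ** 1.75 * np.exp(rng.normal(0, 0.05, 200))   # multiplicative error

    # Double logarithmic (log-log) model: log Y = log a + b (log X)
    res = sm.OLS(np.log(Y), sm.add_constant(np.log(X))).fit()
    a_hat = np.exp(res.params[0])    # the intercept is estimated as log a
    b_hat = res.params[1]
    print(a_hat, b_hat)              # should be close to 2 and 1.75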
(a) The first case is the double logarithmic (log-log) model in which both the dependent and the explanatory variable are in logs. In this form the regression coefficient informs us that when there is a small proportionate change in X there will be a small proportionate change in Y equal to b. So

ΔY/Y = b (ΔX/X)

and if the proportionate change in X is 1 per cent, the proportionate change in Y will be b per cent.
Economists will recognize that if this relationship is re-written

b = (ΔY/Y) / (ΔX/X)    (12.10)

it shows that b defines an elasticity. Thus if this double logarithmic model was applied to the demand for a commodity, in which Y was some measure of quantity and X was the price, the regression coefficient, b, would measure the price elasticity (and would have a negative sign).
In other contexts this expression for the regression coefficient, b, can also be interpreted as the constant rate of growth of Y associated with a constant rate of growth of X. In the example used in §12.2.2 in relation to (12.4c) above, b was 1.75. This means that when X grows at 1 per cent p.a., Y grows at 1.75 per cent p.a.
(b) The second case is the semi-log (lin-log) model such as (12.8) in which the explanatory variable, X, is in log form but not the dependent variable, Y. In this case the interpretation of the regression coefficient, b, is that when there is a small proportionate change in X, there will be an absolute change of b units in Y. So

ΔY = b (ΔX/X)

If, for example, the proportionate change in X is 0.01 (or 1 per cent) then the absolute change in Y will be 0.01b; this is the same as b/100.
(c) Thirdly, we can consider the alternative form of semi-log (log-lin) model in which the dependent variable, Y, is in log form but not the explanatory variable, X; for example (12.9). In this case the regression coefficient tells us that when there is an absolute change in X, there will be a proportionate change in Y of b. So

ΔY/Y = b ΔX
12.3.2 Elasticities
It is often interesting to establish the elasticity implied by a particular regression model. This is the ratio of the proportional change in the dependent variable to the proportional change in the explanatory variable, or

(ΔY/Y) / (ΔX/X)

As noted above (see (12.10)) when the model is specified as a double logarithmic relationship, the elasticity can be read directly from the regression coefficient, b. However, it is only for this model that b is identical to the elasticity; it does not hold for any other model.
In every other case, if the researcher wants to determine the elasticity she can always do so, but it requires an additional manipulation of the regression coefficient. For example, in the case of the lin-log model considered in paragraph (b) above, where it is the explanatory variable X that is in log form, dividing both sides of ΔY = b (ΔX/X) by Y shows directly that the elasticity is b/Y. For the log-lin model of paragraph (c) we have

ΔY/Y = b ΔX    (12.13)

In order to convert this into an elasticity it is necessary to multiply both sides by X, giving⁴

(ΔY/Y) / (ΔX/X) = bX    (12.14)
We have thus established that for a semi-log model in which the explan-
atory variable is in log form, the elasticity is measured by b/Y. For the alter-
native semi-log model in which the dependent variable is in log form, the
proportionate change in Y caused by a proportionate change in X, or elas-
ticity, is measured by bX.
In the same way it can be shown that for a standard linear model the elasticity is

b (X/Y)

and for the reciprocal model it is

−b (1/XY)
It is important to note, however, that all these measures of elasticity
differ in one vital respect from the measure associated with the double log-
arithmic model. In that model the elasticity is constant irrespective of the
values of X and Y. In all the other models the elasticity is variable and its
precise value will depend on the value of X and Y. So when these elasticities
are reported, the single result given is typically calculated at the mean
values of the variables.
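The conversions described above can be collected into a single helper, evaluated at the sample means; the function and values below are illustrative only.

    import numpy as np

    def elasticity(model, b, X_bar, Y_bar):
        # Elasticity implied by the estimated coefficient b for each
        # functional form, evaluated at the means (X_bar, Y_bar).
        if model == "log-log":
            return b                  # constant; read directly from b
        if model == "lin-log":
            return b / Y_bar          # explanatory variable in logs
        if model == "log-lin":
            return b * X_bar          # dependent variable in logs
        if model == "linear":
            return b * X_bar / Y_bar
        raise ValueError(model)

    print(elasticity("linear", b=0.8, X_bar=50, Y_bar=200))   # 0.2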
In this form of the model only the current value of SALES is shown on the right-hand side. In this case the regression coefficient, b, measures the change in INV caused by a change in SALES in the current period, and is referred to as an impact multiplier. The total effect of the change in SALES in all previous periods, or long-run multiplier (which we will refer to as b̃ to distinguish it from the short-run coefficients), is not given directly by (12.15). However, it can easily be derived as

b̃ = b/(1 − λ)

where b is the coefficient on the current SALES, and λ is the rate of decline of the distributed lag and is given by the coefficient on the lagged dependent variable, INVt−1.⁵
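The arithmetic of the long-run multiplier is simple; with invented values for the two estimated coefficients:

    # b is the coefficient on current SALES; lam is the coefficient on the
    # lagged dependent variable (the rate of decline of the distributed lag).
    b, lam = 0.30, 0.60           # illustrative values, not from the text
    long_run = b / (1 - lam)      # b/(1 - lambda)
    print(long_run)               # 0.75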
This measure of the trend in NNP would normally be left in logs and plotted in that form. If, however, it were desired to convert back to ordinary numbers, the procedure for 1920 would be to take the exponential of the fitted (log) trend value for that year.m
m  This is the series given in the data set and used previously when fitting a linear trend in §4.2.5.
errors. There will, however, be many series for which a linear trend is not a
good fit, and it is then necessary to fit a more complex model.
To illustrate one such case, we revert to the series used in §1.8 for miles
of railway track added in the United States from 1831 to 1913 (RAILWAY);
this was shown together with its fitted trend in figure 1.3. To model this we
use a quadratic in which a term in X² is added to the model used previously (see (12.16)). For this version we take TIME as the explanatory variable, using a sequence of numbers from 1 to 83. The model is thus

log RAILWAY = log a + log b1 (TIME) + log b2 (TIME)² + e    (12.22)

The trend in actual miles can then be derived for each year by applying these results and finding the exponential.n For example, for 1881 (which is 51 in the series for TIME) the trend value is found by evaluating the fitted equation at TIME = 51 and taking the exponential, and similarly for all other years. This is the long-term trend in construction
of railway track that was fitted to the data in figure 1.3.
The key to understanding why this model produces a trend which first rises and then falls is to notice the effect of the term for (TIME)². Its coefficient, log b2, is smaller than the coefficient on TIME, log b1, but is negative. In the early years of the series, the positive impact of b1 outweighs the negative effect of b2, even though TIME is much smaller than (TIME)². But in later years, (TIME)² becomes so large relative to TIME that eventually its negative coefficient dominates, and the trend starts to decline.
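A sketch of how such a trend might be fitted (Python; the file and column names are hypothetical, and the series must contain no zero values if logs are to be taken):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rail = pd.read_csv("railway.csv")["RAILWAY"]   # hypothetical file and column
    time = np.arange(1, len(rail) + 1)

    # log RAILWAY = log a + (log b1) TIME + (log b2) TIME^2 + e
    exog = sm.add_constant(np.column_stack([time, time ** 2]))
    res = sm.OLS(np.log(rail), exog).fit()

    trend = np.exp(res.fittedvalues)   # convert the fitted logs back to miles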
To illustrate this effect, we can compare the above calculation for 1881
with comparable calculations for 1891, when the trend value was close to
n  These results are rounded to simplify the presentation, but the calculations are very sensitive to the precise coefficients, and the illustrations for 1881 and other years were done with figures to 7 decimal places: b1 = 0.1185348 and b2 = −0.0009457.
Xt = X0 (1 + r/100)^t    (12.17a)

where X0 and Xt are the values of the series in the two selected years, r is the rate of growth, and t the number of years between X0 and Xt over which the growth is to be calculated.
*  Note that when the rate of growth is small, a close approximation to the compound growth rate can be obtained more simply from the regression coefficient itself (multiplied by 100). In the present example b = 0.01781 so the approximation would be 1.78 per cent p.a.
**  The linear trend fitted in §4.2.5 was also based on all points in the series, but the growth rate cannot be obtained directly from the regression, and because the trend is not linear in the logs, the slope of the line does not correspond to a constant rate of growth. The growth rate can be calculated by applying the compound interest formula to two selected years, but it will vary according to the years selected.
In order to solve (12.17a) it must first be converted to logarithms, giving

log Xt = log X0 + t log (1 + r/100)    (12.17b)

so

log (1 + r/100) = (log Xt − log X0) / t
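The formula is easily checked in code, with invented values for the two end-points:

    import numpy as np

    # Compound growth rate between two selected years, from (12.17b).
    X0, Xt, t = 1000.0, 1500.0, 20        # illustrative values
    log_growth = (np.log(Xt) - np.log(X0)) / t
    r = 100 * (np.exp(log_growth) - 1)    # per cent per annum
    print(round(r, 2))                    # about 2.05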
If this is estimated for the period from 1777QI to 1787QIV, using successive numbers 1, 2, 3, … for each quarter to represent TIME, the computer obtains an intercept of 4.9237 and a trend coefficient of −0.0032. The trend value for 1777QI (in ordinary numbers) would thus be exp[4.9237 − (0.0032 × 1)] = exp(4.9205) = 137.07; for 1777QII it would be exp[4.9237 − (0.0032 × 2)] = exp(4.9173) = 136.63, and so on.
Having calculated the trend, in the second stage the log of BANKRUPT
is regressed on its trend (TRENDB) and the seasonal dummy variables. The
trend is converted back to ordinary numbers because it is again a semi-log
model (see (12.6a)), with only the dependent variable in logs.
Dummies are created for three of the four quarters, leaving the fourth as the control quarter. Thus, if QIV is taken as the control quarter, the dummy variables are created for QI, QII, and QIII.
Each of the coefficients on the dummies captures the effect of its own
seasonal variations on the pattern of bankruptcy, relative to the pattern in
the control quarter.
The t-statistics show that the seasonal dummy for the third quarter is
highly significant, though those for the first and second quarters are not.
Together with the F-test (for which the prob value is 0.000) there is clear evi-
dence of a strong seasonal pattern in bankruptcies in this period.
Since we have used a semi-log model the seasonal coefficients are in
logs, and are multiplicative rather than additive. The first step in calculating
the quarterly adjustment factors is thus to obtain their exponentials (and
their relationship then becomes additive). The second step is to convert
these from adjustments relative to the control quarter, QIV, into the actual
adjustments required.
To do this we need some simple algebra, based on the following four
equations:
DQI = 1.0057 DQIV    (12.20a)
DQII = 1.0277 DQIV    (12.20b)
DQIII = 0.7283 DQIV    (12.20c)
DQI + DQII + DQIII + DQIV = 4    (12.21)
The first three of these equations are simply statements of the value of
the seasonal dummies relative to QIV, based on the exponentials of the
regression coefficients. Equation (12.21) incorporates the rule that the four
adjustment factors must sum to 4; in other words, their average must be
exactly 1 so that the annual totals will not be changed by the removal of the
seasonal pattern.
Since all four equations relate to DQIV we can substitute the first three
in (12.21) to get
DQIV (1 + 1.0057 + 1.0277 + 0.7283) = 4

so

DQIV = 4/(3.7617) = 1.0633
We now know the seasonal adjustment factor for QIV. Therefore the
factor for QI is 1.0057 × 1.0633 = 1.0694, and similar calculations for QII
and QIII complete the exercise. The full set of seasonal adjustments is set
out below, with those previously calculated in chapter 1 given for compari-
son. The differences are not large and lie within the 95 per cent confidence
intervals derived from the t-statistics.
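The calculation can be reproduced in a few lines (Python; the three relatives are the exponentials of the reported dummy coefficients):

    rel = {"QI": 1.0057, "QII": 1.0277, "QIII": 0.7283}

    # The four adjustment factors must sum to 4, so
    # DQIV * (1 + sum of the three relatives) = 4.
    DQIV = 4 / (1 + sum(rel.values()))
    factors = {q: v * DQIV for q, v in rel.items()}
    factors["QIV"] = DQIV
    print({q: round(v, 4) for q, v in factors.items()})
    # {'QI': 1.0694, 'QII': 1.0928, 'QIII': 0.7744, 'QIV': 1.0633}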
its peak level; and with 1901, by when it had declined. Since the constant
term is always the same we ignore that and concentrate on the two remain-
ing terms. Since TIME increases in each case by 10 years, the change in the
term log b1 (TIME) will always be 10 × 0.11853 = 1.1853. It remains to
work out the change in the second term and to calculate the net effect of the
two terms. The former is done in columns (2) and (3) of table 12.2, the
latter in column (4).
This calculation shows that between 1881 and 1891 the combined effect
was an increase of 0.1261; between 1891 and 1901 it was a fall of 0.063.
The corresponding exponentials are 1.1344 and 0.9389. This is a (multivar-
iate) semi-log model with only the dependent variable in logs, so these
exponentials represent the ratios of the dependent variable. The final result
is thus that the ratio of 1891 to 1881 is 1.1344, corresponding to an increase
in the trend value by 13.44 per cent, from 4,295 to 4,873 miles. By contrast
the ratio of 1901 to 1891 is 0.9389, corresponding to a decrease over the
second period by 6.11 per cent, from 4,873 to 4,575 miles.
The exercise also underlines the fact that although the net change in the
two explanatory variables can be used to calculate the percentage change in
Table 12.2 Calculation of change in fitted trend, miles of railway track added in the United States, 1881, 1891, and 1901
Notes:
(1) and (2) calculated from results of trend fitted by (12.22), with TIME measured by a series running from 1 (for 1831) to 83 (for 1913).
(4) = (1) + (3).
the trend over any period, this will not be a constant rate of growth; it will
vary according to the position of the selected dates on the fitted trend.
Inspection of the t-statistics and other tests suggests that this quadratic
provides a reasonably good fit for this series. If it did not, we could explore
higher-order polynomials such as cubics or quartics (i.e. models with terms in X³ or X⁴). These additional terms would act in a similar fashion to the term in (TIME)², and would eventually produce additional turning points in the
trend.6
Alternatively we could try other curves. In his classic study of the trends in
a large number of production series for the United Kingdom, United States,
and other countries Simon Kuznets fitted a variety of non-linear functional
forms including parabolas and logistic curves.o The logistic is appropriate
for many production series that start at a low level, rise rapidly for a time, and
then level off as a result of factors such as declining prices and profitability as
a consequence of the entry of other producers, saturation of the market, and
competition from new products. The upper and lower bounds of this
process are captured by the coefficients g and h discussed in §12.1.
o  Simon Kuznets, Secular Movements in Production and Prices, Houghton Mifflin, 1930.
p  Some researchers who wish to use logs, but who are confronted with data points equal to 0, have been known to add a small positive number (e.g. 0.001) to all the values in the series to permit a logarithmic transformation.
q  There are advanced procedures, such as RESET, discussed in most econometrics textbooks, that test whether the most appropriate functional form for the data is linear or non-linear.
r  As can be seen, for example, by comparison of figure 12.2 when b < 0 with figure 12.5 (a), when both b1 and b2 < 0; or of figure 12.2 when b = −2, with figure 12.3 when b = 0.7.
obtained in this way can have no credibility; neither low standard errors
nor high R2s are worth anything in such circumstances. If a model has no
underlying principle there is absolutely no reason to believe that any rela-
tionship obtained by the regression reflects any sort of causal relationship
between the dependent and explanatory variables.
The implementation of these general ideas can now be illustrated by a
few examples from recent literature. In each case we have space to identify
only one or two principal features of the models, and our brief comments
do not do justice to the richness of the analytical frameworks developed in
these studies. In each case, we are focusing on the specific issue of including
and specifying a non-linear relationship in the model, although it should
be clear that the examples chosen have been selected for their general sensi-
tivity to the imperatives of model specification.
Working from these ideas Gillian Hamilton was able to develop a suc-
cessful model of the duration of apprenticeship contracts in late eight-
eenth- and early nineteenth-century Montreal.s Most factors in her
model entered the regression as linear explanators. But theoretical con-
siderations recommended the addition of age-squared as a variable to test
for non-linear effects in age. Younger boys were both less mature and less
experienced, so that employers would require longer service contracts to
ensure full recompense for the expense of training. Moreover, the risks
that the apprentice would under-perform were highest for the very
young, about whom the employer had little or no prior knowledge. The
reported coefficient on age is negative while that on the square of age is
positive, indicating that contracts became increasingly short, the older
the apprentice.t
One way for employers to find out something about the potential risk
was to require a probationary period before negotiating a contract.
Hamilton found that contract lengths were much shorter for probationers,
as theory would suggest. Moreover, boys who were sponsored by a guar-
dian served longer terms, consistent with the hypothesis that information
was a key ingredient in contract bargaining. The signs and significance of
other variables in the regression equation further support the underlying
theory, with lengthier contracts for apprentices who negotiated money at
the end of the apprenticeship (a form of head-right), or access to schooling
or church during it.11
A further illustration of the value of human capital theory in historical
applications may be found in the analysis of the earnings of immigrants
into the United States compared to those of native-born workers in the
same occupations. The theory tells us that earnings will be systematically
related to variables that affect individual productivity, such as education,
work experience, and age. Moreover, it indicates that the relationships will
be non-linear, albeit in different ways. Thus, there are diminishing returns
to each additional year of experience; whereas the returns to added years of
education are likely first to rise (as literacy and learning skills are devel-
oped), and then to fall (as knowledge becomes less functional). For manual
workers, the relationship between age and earnings certainly has a strong
s  Gillian Hamilton, ‘The market for Montreal apprentices: contract length and information’, Explorations in Economic History, 33, 1996, pp. 496–523.
t  The respective coefficients are −2.033 and 0.043. Thus at age 10 their combined effect on the length of contract is (10 × −2.033) + (100 × 0.043) = −16.03. At age 20 it is (20 × −2.033) + (400 × 0.043) = −23.46, a reduction in contract length of only 7.4, not 10, years. The implied shape of the age–duration relationship is similar to that of an exponential curve (figure 12.3) with a value of b of about 0.9.
u  S. J. Prais and H. S. Houthakker, The Analysis of Family Budgets, Cambridge University Press, 1955, pp. 87–100.
The combination of theoretical insights and empirical observation, and of ideas drawn from epi-
demiological work on the spread of disease, stimulated modelling of the
diffusion of new technologies in terms of an S-shaped logistic curve, with
the dependent variable measuring the proportion of firms adopting the
innovative technology.
The general proposition is that there is an initial period of slow growth
when the product or process is first introduced by a small number of more
innovative firms; more rapid diffusion (comparable to contagion) as the
advantages of the technology become more widely known; and a final
phase in which diffusion slows down once the majority of potential users
have installed the technology. Factors underlying this pattern include the
initial lack of information about the advantages of the innovation and the
costs of acquiring this other than by observing the experience of early
adopters, and the gradual reduction of uncertainty as a common percep-
tion of the value of the innovation is formed.
For innovations to which these factors apply, these trends are effectively
modelled by a logistic curve, where the parameter, b, measures the speed of
diffusion, and is itself related positively to the profitability of the innova-
tion and negatively to the capital investment required to adopt it.16 A very
similar model is used in relation to demand for household appliances. In
this context, it is suggested that the speed of diffusion is determined by
whether the innovation merely increases the amount of discretionary time,
as with time-saving appliances such as vacuum cleaners and washing
machines – in which case diffusion is slow; or actually enhances its quality,
as with time-using appliances such as radios and TV sets – in which case
the appliances spread much more rapidly.17
Not all theories emanate from economics, of course. It is a strength of
economic history that it is able to draw upon a well-established array of
testable theories. But theory need not be formal to establish hypotheses
against which to judge the empirical evidence. Indeed, some critics of
model-building in economic history have suggested that its theoretical
base is too dependent on a narrow range of assumptions, arguing that the
less formal and more inductive theorizing of other disciplines has its
advantages.
Among the central themes of demographic history are the decline in birth and death rates, and the association of these
trends with the processes of industrialization and modernization. We
mention only a few studies here out of the wide array of quantitative analy-
ses of the demographic transition.
The pioneer of quantitative demography was William Farr, the first
Statistical Superintendent of the General Registry Office in London who,
during his tenure from 1838 to 1880, wrote a series of reports evaluating
the so-called ‘laws of mortality’. Among these was Farr’s Law, which
posited a mathematical relationship between the crude death rate (the
ratio of total deaths to population) and urban density. The best fit to the
data for registration districts outside London in the 1860s was a geometric
model, with b equal to 0.122. Farr’s Law has been criticized by more recent
scholars, but it represented an important first step towards understanding
mortality patterns in mid-Victorian Britain.18
A more elaborate model of late-nineteenth-century mortality in Britain
has been estimated on data for a sample of 36 towns. As one component of
this study, a double logarithmic specification was adopted in which the log
of age-specific mortality was regressed on the three sets of explanatory var-
iables, all in logs. These covered measures of density, of public health indi-
cators such as expenditure on sanitation and water, and a measure of
nutritional status proxied by the volume of food affordable with the level of
real income in the different towns.v
A comparative analysis of child mortality in Britain and America at the
end of the nineteenth century analysed differences across occupations,
focusing primarily on the influence of environment and income.
Environment was proxied by the proportion of workers in each occupation
who lived in urban centres; this was entered in the regression as a linear
variable. Income, however, was entered in log form, ‘since we expect to
observe diminishing returns in the mortality effects of income gains’.w
An earlier study of fertility analysed the number of surviving children
under the age of four among a survey of American workers in the late nine-
teenth century. Among the variables included in the model were measures
of household income, the industry and occupation of the (male) house-
hold head, whether the wife was working, the ethnicity of the husband, the
age of the wife, and the square of wife’s age. The last two variables were
included to capture ‘the curvilinear relationship of fertility to age of
woman’. The analysis found that the age variables were large and highly
v  Robert Millward and Frances N. Bell, ‘Economic factors in the decline of mortality in late nineteenth century Britain’, European Review of Economic History, 2, 1998, pp. 263–88.
w  Samuel H. Preston and Michael R. Haines, Fatal Years: Child Mortality in Late Nineteenth-Century America, Princeton University Press, 1991, pp. 194–8.
x  Michael Haines, Fertility and Occupation: Population Patterns in Industrialization, Academic Press, 1979, pp. 212–23.
The implication of this finding is thus that for property crime the moti-
vation effect was the most powerful, while for personal crime the opportu-
nity effect dominated. The study explores some of the reasons why this
might be so, and also looks at lagged effects to explain the difference
between short-term and long-term patterns.
Turning next to the politics of democracy, a variety of issues have been
the subject of quantitative analysis. These include voter participation, the
effects of incumbency, and – in the United States – the presidential ‘coat-
tails’ effect; but perhaps the most interesting is the famous proposition
associated with President Clinton: ‘It’s the economy, stupid’. The underly-
ing notion is that a vote represents a rational choice between alternatives,
and that one of the fundamental factors which influences that choice is the
state of the economy and its impact on the individual voter.22
One of the more persuasive of several attempts to investigate this rela-
tionship is the multivariate regression model proposed by Bloom and
Price.aa Their model postulates that voting behaviour is determined by
both long- and short-run forces, and is tested on data for US elections for
the House of Representatives from 1896 to 1970. It is a linear model, with
the dependent variable represented by the deviation of the Republican
share of the two-party vote in those elections from the share that would be
expected on the basis of a measure of long-run party identification. The
explanatory variable is simply the percentage change in real per capita
income during the year preceding each election, but two separate regres-
sions are estimated, for periods with Republican and Democratic presi-
dents respectively.
The model does not assume that economic conditions are the sole deter-
minants of voting behaviour. Other factors, such as foreign policy and can-
didates’ personalities, are clearly relevant and must be regarded as omitted
variables in the terminology of §11.2.2. They are thus represented by the
error term as sources of the unexplained variation in the dependent vari-
able, but as long as they are not correlated with the changes in the explana-
tory variable their omission will not bias resulting estimates of the impact
of economic factors on the vote.
In addition to the test over the full set of 37 elections, the regression was
run separately for elections preceded by declining income and those pre-
ceded by rising income. One of the principal findings is that the relation-
ship between voting and economic conditions is asymmetric. Politicians
aa  Howard S. Bloom and H. Douglas Price, ‘Voter response to short-run economic conditions: the
asymmetric effect of prosperity and recession’, American Political Science Review, 69, 1975, pp.
1240–54.
are punished by the voters for economic downturns, but are not rewarded
for economic upturns.
If the economy can influence the outcome of elections it is natural to
consider the possibility that politicians may seek to influence the economy
for political reasons. This aspect of political debate has stimulated numer-
ous quantitative studies of the ‘political cycle’. What are the motivations of
political parties when formulating their economic policies? Were periods
with Republican (or Labour) administrations more or less likely to be asso-
ciated with low inflation, full employment, rapid growth, and small budget
deficits than periods when Democrats (or Conservatives) were in office?
According to the ‘opportunistic’ theory, political parties have no policy
preferences of their own, and simply choose economic policies that max-
imize their chances of election. The rival ‘partisan’ theory postulates that
left-wing parties are more strongly motivated to promote growth and
reduce unemployment than to curb inflation, and that these priorities are
reversed for right-wing parties. In recent developments these models have
been further refined by the incorporation of rational expectations.
A comprehensive study of this issue by Alesina et al. has formulated a
series of predictions based on variants of these models, and tested them in a
large set of multivariate regressions for the United States, United Kingdom,
and other industrial democracies. For target variables such as output,
inflation, and the money supply their concern is with rates of growth and
so the dependent variable is formulated in logs, but for other targets such as
unemployment and the budget deficit it is the level that is relevant and so
the relationship is linear. The key explanatory variables are lagged values of
the dependent variable, and dummy variables for the periods when the
various political parties were in office. The major conclusion, for both the
United States and other countries, is that the post-war data generally
support the rational partisan theory, particularly with respect to growth
and unemployment.bb
bb  Alberto Alesina, Nouriel Roubini and Gerald D. Cohen, Political Cycles and the Macroeconomy,
MIT Press, 1997. For a quantitative analysis of the same issues of policy formation in the United
Kingdom see Paul Mosley, The Making of Economic Policy, Harvester Press, 1984.
the very statistics used to evaluate the success of the model. If the
specification search is motivated by a desire to maximize the value of R2
and t-statistics, then it is inappropriate to employ them as neutral arbiters
of the success or failure of a model.
How, then, should a conscientious reader approach an article to deter-
mine whether the researcher has resisted the temptations of specification
searches in his work? How should a reader approach empirical controver-
sies? Clearly, we have suggested that traditional arbiters of success, such as
R² and t-statistics, may mislead. This is not to say that authors who only
report such matters are obviously guilty of worst practice, nor that we
should automatically throw out their results. But readers would be wise to
look for evidence that the research has been undertaken in a less proble-
matical way.
Best-practice methodology focuses on mis-specification searches, i.e.
the use of the various instruments discussed in chapter 11 to determine
whether the model, or its empirical implementation, suffers from fatal
errors of omission or commission. Almost all such tests focus on the prop-
erties of the residuals from the regression; the best work indicates that these
residuals have been evaluated for problems, such as serial correlation,
heteroscedasticity, omitted variables, structural change, etc. Ideally, these
tests should be carried out simultaneously, in order to minimize the pos-
sibility that the test statistics for each procedure are themselves contami-
nated by the order in which they are carried out.
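Most statistical packages make such checks straightforward. As one illustration (Python, with simulated data standing in for a fitted model), the Durbin–Watson statistic and a Breusch–Pagan test can be read off the residuals as follows.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(3)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ [1.0, 0.5, -0.3] + rng.normal(size=100)
    res = sm.OLS(y, X).fit()

    # Durbin-Watson: values near 2 suggest no first-order autocorrelation.
    print(durbin_watson(res.resid))

    # Breusch-Pagan: a small p-value is evidence that the error variance
    # is related to the regressors (heteroscedasticity).
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, res.model.exog)
    print(lm_pvalue)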
If any of the tests do reveal clear evidence of a problem, the researcher
should show evidence of having considered the possibility that the model is
mis-specified and needs revision, rather than having immediately turned
to one of the many tools at her disposal for ‘correcting’ for the problem.
Obviously, the menu of possible specification errors depends on the
type of regression exercise – cross-section models should be tested for
some problems, and time-series models for others. But such tests are an
essential ingredient of good model-building and should be evaluated by
readers.
Any model is, of course, only as good as the data on which it is based.
Problems can arise in many ways, including errors of measurement or of
omission in the underlying source, errors in drawing samples from the
data, and errors in the procedures subsequently adopted to record or
process the data. Even if a series is reliable for one purpose, it may be
unsuitable if used as a measure of a different concept.cc Careful scrutiny of
cc  For example, unwillingness of victims to report crime may mean that an accurate measure of recorded crime is a very poor measure of the crimes actually committed; evasion of tax may mean that an accurate measure of income assessed for taxation is a very poor measure of the income actually received by taxpayers.
all data before they are used for regression analysis is thus something to
which historians must give the highest priority, and visual inspection of the
data can be particularly valuable.
It is also very unwise to run regression models borrowed from other
studies on a data set that might well have very different properties. Thus,
the tendency among some economists to employ historical data in order to
run regressions derived from contemporary economic theory as a test of
the general validity of the theory, without any attempt to examine the
empirical and institutional differences between past and present, is to be
deplored.dd As a general conclusion, the more information about the pro-
cedures employed in evaluating the data before and after the regression is
run, the sounder are likely to be the results and the more convinced the
reader should be of their value.
Assume, however, that two researchers have both shown exemplary
commitment to the principles of best-practice modelling, but have none-
theless produced divergent results. Are there any other clues that might
assist the reader in choosing between them?
One procedure that is often invoked as a good way to test the strength of a
model is out-of-sample prediction. This involves reserving some of the data
points from the sample, running the model over the rest, and using the
regression results to predict the value of the dependent variable outside the
period over which the regression was estimated. If the model is a good one –
and if the underlying structural conditions have not changed – the predictions
should compare well with the actual values over this additional period.
Clearly, if one model has better predictive powers than another over the
same omitted sample, this is indicative of model superiority.
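In code, out-of-sample prediction amounts to reserving observations before fitting; a minimal sketch with simulated data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = np.linspace(0, 10, 100)
    y = 3 + 0.8 * x + rng.normal(0, 1, 100)
    X = sm.add_constant(x)

    res = sm.OLS(y[:-10], X[:-10]).fit()   # estimate on all but the last ten points
    forecast = res.predict(X[-10:])        # predict the reserved observations

    # Root mean squared forecast error: smaller is better when comparing
    # rival models over the same reserved period.
    rmse = np.sqrt(np.mean((y[-10:] - forecast) ** 2))
    print(rmse)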
Another comparative strategy employed by some researchers is the
development of a general model that incorporates both rival interpreta-
tions of the event or process being studied. The general model is then grad-
ually pared down as empirically marginal features are discarded, or as rival
elements are tested for mis-specification and found wanting.
dd  The point might be illustrated by studies of the labour market, where theories of wage determi-
nation formulated in the late twentieth century might be totally inappropriate in relation to
earlier periods in which both the role of trade unions and the size of firms were very different.
More fundamentally, human behaviour itself is not immutable. Modern mores and codes of
conduct that underpin current theories of individual action in economics, politics, and other
disciplines are likely not to have applied in the distant past. See also the more detailed discussion
of quantitative studies of historical changes in Larry W. Isaac and Larry J. Griffin, ‘Ahistoricism
in time-series analysis of historical process: critique, redirection and illustrations from US labor
history’, American Sociological Review, 54, 1989, pp. 873–90.
Notes
1. We are indebted to Roderick Floud for these data. They are taken from J. M. Tanner,
R. H. Whitehouse and M. Takaishi, ‘Standards from birth to maturity for height,
weight, height velocity and weight velocity: British children, 1965 (Part I)’, Archives
of Disease in Childhood, 41, 1966, pp. 454–71; and I. Knight and J. Eldridge, The
Heights and Weights of Adults in Great Britain, HMSO, 1984.
2. If the symbols are changed, economists may recognise (12.5) as the familiar Cobb–Douglas production function, in which output (Q) is determined by inputs of labour (L) and capital (K): Q = A L^α K^β. The two features of this specification are (i) the equation is non-linear in variables but linear in parameters, and (ii) the effect of the explanatory variables is multiplicative, not additive.
3. The manipulation is based on the rules for powers given in §1.6.1, in particular, the rule that any number raised to the power 0 is equal to 1. Thus e^x e^−x = e^(x−x) = e^0 = 1, and we can substitute e^x e^−x for 1 in both the numerator and the denominator of (12.7a). We therefore get

Y = (e^x e^−x) / (e^x e^−x + e^−x) = (e^x e^−x) / (e^−x (e^x + 1))

The terms in e^−x in the numerator and the denominator then cancel out, to leave the version in (12.7b).
4. The following points may help some students to see more easily why it is necessary to multiply by X. First, the definition of an elasticity, which was initially given as

(ΔY/Y) / (ΔX/X)

may be re-written as

(ΔY/Y) × (X/ΔX) = (ΔY/ΔX) × (X/Y)

Secondly, the definition of b in this semi-log model, which was initially given as

ΔY/Y = b ΔX

may be re-written as

b = (ΔY/Y) × (1/ΔX) = (ΔY/ΔX) × (1/Y)

Multiplication of the second term in this revised definition by X then gives the required elasticity.
5. b/(1 − λ) gives the value for the long-run multiplier because the sum of all the coefficients on current and past values for SALES in the untransformed (10.15) in §10.2.2 can be written as b(1 + λ + λ² + … + λⁿ), and the mathematicians tell us that the expression in brackets is an example of a geometric progression which sums to 1/(1 − λ).
6. See, for example, Robert Woods, The Demography of Victorian England and Wales,
Cambridge University Press, 2000, pp. 377–8, where a third-order polynomial is
used to plot the relationship between population density and distance from centre of
London. The trend shows an initial rise in density as distance increases and then a
decline. In quadratics and higher-order polynomials, the number of turning points
is always one less than the size of the highest power.
7. Richard Layard and Stephen Nickell, ‘The labour market’, in Rudiger Dornbusch and Richard Layard, The Performance of the British Economy, Oxford University Press, 1987, pp. 131–79; see especially p. 139.
8. When the problem is confined to two or more of the explanatory variables it repre-
sents a form of multicollinearity, a problem briefly discussed in §11.4.2.
9. Recent developments in time-series statistics have produced even better procedures,
such as error-correction mechanisms, but these are too advanced for this text.
10. For an example of a relatively simple Keynesian model see T. Thomas, ‘Aggregate
demand in the United Kingdom 1918–1945’, in Roderick Floud and Donald
McCloskey (eds.), The Economic History of Britain since 1700, II, 1860 to the 1970s, 1st
edn., Cambridge University Press, 1981, pp. 332–46. A much more advanced and
comprehensive model covering the same period is developed in Nicholas H.
Dimsdale and Nicholas Horsewood, ‘Fiscal policy and employment in inter-war
Britain; some evidence from a new model’, Oxford Economic Papers, 47, 1995,
pp. 369–96.
11. An earlier investigation of servant indentures in colonial America also found some
evidence of non-linearity. In this case, age was entered as a series of dummy variables
for each year, giving flexibility to the measured relationship between age and length
of indenture. The coefficients on the dummies fell from 2.75 for those aged less than
15 to 0.17 for age 19, showing that age and length of indenture were negatively
related, while the change in the coefficient for each successive year showed that the
relationship was not linear. David Galenson, White Servitude in Colonial America,
Cambridge University Press, 1981, pp. 97–113.
12. For a stylized representation of the age–earnings profile derived from data on US
households in 1889–90, see Hartmut Kaelble and Mark Thomas, ‘Introduction’, in Y.
S. Brenner, H. Kaelble and M. Thomas (eds.), Income Distribution in Historical
Perspective, Cambridge University Press, 1991, pp. 19–23.
13. These points are illustrated by Timothy J. Hatton, ‘The immigration assimilation
puzzle in late nineteenth-century America’, Journal of Economic History, 57, 1997,
pp. 34–63; and Christopher Hanes, ‘Immigrants’ relative rate of wage growth in the
late 19th century’, Explorations in Economic History, 33, 1996, pp. 35–64.
14. Several models of asset accumulation have found this relationship with age. See, for
example, J. R. Kearl, Clayne L. Pope and Larry T. Wimmer, ‘Household wealth in a
settlement economy: Utah, 1850–1870’, Journal of Economic History, 40, 1980, pp.
477–96; and Livio Di Matteo, ‘Wealth accumulation and the life-cycle in economic
history: implications of alternative approaches to data’, Explorations in Economic
History, 35, 1998, pp. 296–324.
15. On American industry see Jeremy Atack, ‘Economies of scale and efficiency gains in
the rise of the factory in America, 1820–1900’, in Peter Kilby (ed.), From Quantity to
Quiddity, Wesleyan University Press, 1987, pp. 286–335; and John A. James,
‘Structural change in American manufacturing, 1850–1890’, Journal of Economic
History, 43, 1983, pp. 433–59. James Foreman-Peck and Robert Millward, Public and
Private Ownership of British Industry 1820–1990, Oxford University Press, 1994, pp.
197–239, is a masterly study of the performance of public and private industry. The
Anglo-American productivity gap is analysed in S. N. Broadberry and N. F. R. Crafts,
‘Britain’s productivity gap in the 1930s: some neglected factors’, Journal of Economic
History, 52, 1992, pp. 531–58.
16. The classic study is Z. Griliches, ‘Hybrid corn: an exploration in the economics of
technical change’, Econometrica, 25, 1957, pp. 501–22. See also the application to the
diffusion of the stationary high-pressure engine in Cornish mining in the early nine-
teenth century in G. N. von Tunzelmann, Steam Power and British Industrialization,
Oxford University Press, 1978, pp. 252–64.
17. For the application of the logistic curve to expenditure on household durables see Sue Bowden and Paul Turner, ‘The demand for consumer durables in the United Kingdom in the interwar period’, Journal of Economic History, 53, 1993, pp. 244–58.
The suggested explanation for the logistic pattern is given by Sue Bowden and Avner
Offer, ‘Household appliances and the use of time: the United States and Britain since
the 1920s’, Economic History Review, 47, 1994, pp. 725–48.
18. For a critical evaluation of Farr’s Law see Woods, Demography of Victorian England
and Wales, pp. 190–202. Woods also analyses the evidence for a relationship between
the incidence of specific diseases, such as measles and tuberculosis, and population
density using log-linear models, see pp. 317–25.
19. See, for example, the exemplary studies of the surviving sixteenth-century judicial
records of suicide in S. J. Stevenson, ‘The rise of suicide verdicts in south-east
England, 1530–1590: the legal process’, Continuity and Change, 2, 1987, pp. 37–75;
of violent crime in J. S. Cockburn, ‘Patterns of violence in English society: homicide
in Kent, 1560–1985’, Past and Present, 130, 1991, pp. 70–106; and of the concept of
‘crowds’ in Mark Harrison, Crowds and History, Mass Phenomena in English Towns,
1790–1835, Cambridge University Press, 1988.
20. Some other examples include Edward Shorter and Charles Tilly, Strikes in France
1830–1968, Cambridge University Press, 1974; Neil Sheflin, Leo Troy and C.
Timothy Koeller, ‘Structural stability in models of American trade union growth’,
Quarterly Journal of Economics, 96, 1981, pp. 77–88; A. R. Gillis, ‘Crime and state sur-
veillance in nineteenth-century France’, American Journal of Sociology, 95, 1989, pp.
307–41; and E. M. Beck and Stewart E. Tolnay, ‘The killing fields of the deep South,
the market for cotton and the lynching of blacks, 1882–1930’, American Sociological
Review, 55, 1990, pp. 526–39.
21
For an application of a linear regression model of the relationship between crimes
against property and fluctuations in prices in the seventeenth and eighteenth centu-
ries, see the major study by J. M. Beattie, Crime and the Courts in England,
1660–1800, Oxford University Press, 1986, pp. 199–237. Prices are measured by the
Schumpeter–Gilboy index; the regressions are run separately for urban and for rural
parishes in Sussex, and for Surrey, and also distinguish wartime and peacetime
periods. The general finding is that there was a clear positive relationship between
the fluctuations in prosecutions and in prices (with an R2 of 0.41 for Surrey and of
0.52 for Sussex), but there were some variations over time and between urban and
rural parishes.
22
The early history of research on this topic is reviewed in one of the seminal modern
studies, Gerald H. Kramer, ‘Short-term fluctuations in US voting behavior,
1896–1964’, American Political Science Review, 65, 1971, pp. 131–43. Subsequent
studies in the United States include Edward R. Tufte, Political Control of the
Economy, Princeton University Press, 1978; Gregory B. Markus, ‘The impact of
personal and national economic conditions on the presidential vote: a pooled
cross-sectional analysis’, American Journal of Political Science, 32, 1988, pp. 137–54;
and Robert S. Erikson, ‘Economic conditions and the Congressional vote: a review
of the macrolevel evidence’, American Journal of Political Science, 34, 1990, pp.
373–99. For some of the corresponding studies of the United Kingdom see James
E. Alt, The Politics of Economic Decline, Cambridge University Press, 1979,
pp. 113–38.
23
No doubt part of the reason for this approach lies in the preference among research-
ers and readers for providing positive results, i.e. finding that a hypothesis is consis-
tent with the data. Were the criteria for judging intellectual quality to include
analyses that demonstrate that a model is inconsistent with the data, researchers
would have less incentive to employ specification searches until ‘good’ R2 and t-
statistics were produced.
(i) Draw the curve of this relationship freehand (and without using the
computer), indicating the direction of slope at younger and older
ages.
(ii) Calculate the age at which earnings begin to decline. What are the
average earnings at this age?
(iii) Can the average worker expect to earn more on retirement (at age
65) than when he started work at age 15?
3. In a model of death rates by age for the Massachusetts population in
1865, the fitted regression line is
(iii) What is the expected death rate for this population: At the age of 10?
At the age of 70?
4. Evaluate the following propositions about non-linear models, indica-
ting whether they are true or false and explaining why:
(i) ‘Logarithmic transformation is appropriate for multiplicative non-
linear relationships; substitution is appropriate for additive non-
linear relationships.’
(ii) ‘The advantage of the linear trend model is that, since it is a straight
line, it produces a constant growth rate to the data.’
(iii) ‘The linear growth rate is measured as the elasticity of the dependent
variable against time.’
(iv) ‘The number of turning points in a polynomial series is always one
less than the size of the power; thus there is 1 turning point in a
quadratic equation, 2 in a cubic, etc.’
(v) ‘Equation (12.15) indicates that the long-run multiplier will always
be larger than the impact multiplier.’
5. You review a regression result of the following form:
Y = 5 + 2X1 + X2
Would you consider that a great deal of information has been lost by apply-
ing the short-hand formula in this case?
(ii) Re-run the log-linear regression for IRMIG over the years,
1877–1913. Report your results and work out the implied growth
rate of the emigration rate. How does this rate compare to that cal-
culated from the compound growth formula, fitted to the end-
points alone?
Would you consider that a great deal of information has been lost by apply-
ing the short-hand formula, in this case?
Are there any general lessons about the applicability of the compound
growth-rate formula to be learned from a comparison of your findings for
the two periods?
8. Use the basic Benjamin–Kochin (OLS) regression to calculate the elastic-
ity of unemployment with respect to the benefit–wage ratio.
Once again, you wish to compare the fitted values for 1936–8 from a regres-
sion run over 1920–35 to the actual unemployment rates in those years and
to the fitted values from a regression run over the entire sample. Report
your results and draw any appropriate inferences.
What does this exercise suggest about the relative strengths of the two
models for explaining inter-war unemployment in Britain?
10. Set up a model explaining the time-series behaviour since the Second
World War of any ONE of (a) US imports of oil; (b) the divorce rate in the
United Kingdom; (c) the level of foreign aid from the G7 countries to
African states. Be sure to specify the explanatory variables to be used in the
model; the form of the relationship between each explanatory variable and
the dependent variable; and what you would look for when analysing the
results.
whose values are bounded in this way are known as censored. Most such
variables are censored from below; that is, they cannot fall below a certain
level (usually, they cannot take on negative values). Less commonly, vari-
ables are censored from above. One example is the measurement of house-
hold savings in a given period, which cannot be greater than the total
income received.c
c
A related problem occurs if a sample excludes observations according to some criterion, such as
a minimum height requirement in a sample of army recruits, or maximum family size in a
survey of household expenditures. Such cases are known as truncated samples. Their solution
requires more complex econometric theory than is possible in this text. Unhappily, truncated
samples do occur with some frequency in historical research and researchers should always be
careful to consider the problem.
[Figure: probability of automobile ownership, Detroit, 1919, plotted against household income ($000), showing the S-shaped logistic curve against the linear OLS regression line; a second panel shows the same curve on a standardized scale (standard deviations from the mean).]
d
It would in some respects be more appropriate to refer to the probit model as the normit,
reflecting its origin in the normal curve, just as the logit is derived from the logistic curve.
where
and
e
An alternative way of writing ‘e to the power X’ is exp(X), and this variant is occasionally seen in
articles and textbooks. In this chapter, we will retain the use of e^x.
[Figure: the likelihood function for the sample, plotted against the proportion of left-handers, p, from 0 to 1; the likelihood rises from zero to a peak of about 0.002 near p = 0.3 and falls away again.]
that there is only one parameter that matters, namely the proportion of left-
handed people in the population. We symbolize this proportion by p. What
proportion is likely to produce this result?
Clearly, it cannot be that all Americans are either left- or right-
handed; thus the likelihoods of p = 0 and p = 1 are both zero. If the
proportion were 0.1 (1 in 10), each occurrence of a right-handed observation
would occur with a probability of 0.9, and each occurrence of a left-
handed with a probability of 0.1. Since the individuals were chosen ran-
domly from a very large population, each observation is independent,
such that the probability of drawing this specific sample is the product of
each separate probability.
Thus, if p = 0.1, the likelihood of generating our sample is
0.9 × 0.9 × 0.1 × 0.1 × 0.9 × 0.9 × 0.9 × 0.9 × 0.1 × 0.9 = 0.0004783
If instead p = 0.2, the likelihood is
0.8 × 0.8 × 0.2 × 0.2 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2 × 0.8 = 0.0016777
This picture sketches the likelihood function. The formula for the like-
lihood function in this case is:
L(p) = Π_{yi=1} p × Π_{yi=0} (1 − p)    (13.6)
where Π_{yi=1} indicates the product of all the positive terms and Π_{yi=0} the product
of all the negative terms. Dealing with product terms can be complicated,
especially when working with more complicated distributional forms.
Generally, therefore, researchers prefer to use the logarithm of the likeli-
hood function, or the log-likelihood function, which is additive rather than
multiplicative in the terms. This is defined for our simple example as:
log L(p) = n1 loge p + (n − n1) loge (1 − p)
where n and n1 indicate the total number of observations and the number of
positive observations, respectively.
Both the likelihood and the log-likelihood functions reach their highest
values when p0.3. This is the maximum-likelihood estimate of p for this
class of distribution and this particular sample.
The example establishes the methodology commonly applied in
maximum-likelihood models:
(a) Select the distribution most appropriate to the data being investigated.
(b) Establish the likelihood function for given parameters of the distribu-
tion.
(c) Determine the value of the parameters that maximize the value of the
likelihood function, given the data.
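To make the three steps concrete, here is a minimal Python sketch (our illustration, not part of the text's data sets) that reproduces the left-handers example by a simple grid search over candidate values of p:

import math

# the hypothetical sample: 1 = left-handed, 0 = right-handed (3 of 10)
y = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]

def log_likelihood(p):
    # step (b): n1*log(p) + (n - n1)*log(1 - p) for this distribution
    n, n1 = len(y), sum(y)
    return n1 * math.log(p) + (n - n1) * math.log(1 - p)

# steps (a) and (c): with the distribution chosen, search a grid of
# candidate values of p for the maximum of the log-likelihood
candidates = [i / 100 for i in range(1, 100)]
mle = max(candidates, key=log_likelihood)
print(mle)                            # 0.3
print(math.exp(log_likelihood(0.1)))  # 0.0004783
print(math.exp(log_likelihood(0.2)))  # 0.0016777

In practice the computer package maximizes the function analytically or numerically rather than by grid search, but the logic is the same.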
The parallels between our simple model and the general problem
involving dichotomous dependent variables should be clear. A given
sample produces a series of observations that are binary, taking the
value of 0 or 1. We interpret these values as probabilities and try to
determine what factors influence the particular distribution of 0s and
1s among the sample.
In the case of the logit model, the assumed distribution of the data is
the logistic. The likelihood function is constructed by assigning (13.4) to
the positive observations and (13.5c) to the negative observations; taking
logarithms then gives the log-likelihood
log L = Σ_{yi=1} loge [e^w/(1 + e^w)] + Σ_{yi=0} loge [e^−w/(1 + e^−w)]    (13.9)
and to limit inferences to the sign and significance of each coefficient.f The
signs and the standard errors of logit coefficients may be interpreted in
exactly the same way as for OLS regressions. For example, if the sign of a
coefficient is negative, it indicates that the dependent variable falls as the
explanatory variable increases in value. If the t-statistic falls below a critical
level, it indicates that the logit coefficient is not statistically significantly
different from zero at a certain level of α. Similarly, the larger the t-statistic,
the smaller the confidence interval around the reported coefficient.g
However, in other cases, an historian may not consider it sufficient to
establish the sign and significance of the logit coefficients. She may also be
interested in making comparisons across coefficients, to determine which
explanatory variable has a greater impact on a particular outcome.
Alternatively, she may wish to discover how big an impact a certain change
in the explanatory variable (say, income) has on the value of the dependent
variable (say, the probability of owning a car). In order to accomplish any
of these tasks, it is necessary to understand exactly how the logit coefficient
is to be interpreted.
It may be obvious from the way this has been stated that interpreting
logit coefficients is not as straightforward as it is with an OLS regression.
Unlike the OLS formulation, the coefficient, bk, is not a direct measure of
the change in the dependent variable, P(Yi = 1), for a unit change in the explan-
atory variable, Xk. Moreover, we cannot simply substitute the value of Xk
f
Note, however, that no empirical meaning can be ascribed to the constant term in a logit func-
tion.
g
Some programs also produce prob-values for the logit coefficients.
into the equation and read off the value of the dependent variable directly,
as with OLS formulations. The information can be abstracted, but it
requires some manipulation of the coefficients.
To understand what is at issue, let us return to our sample of Detroit
households in 1919. We have estimated a logit model for the pattern of
automobile ownership with income as the sole explanatory variable, X1.
The coefficient on income is found to be 0.6111; the constant term is
−3.6921. We wish to know the predicted level of automobile ownership at
a certain income, say, $2 thousand. The value of the logit function at that
income level can be calculated by substituting the coefficients and the value
of the explanatory variable into the regression equation, a + b1X1. This
produces
w = −3.6921 + (0.6111 × 2) = −2.4699
Clearly, this figure cannot be the expected probability of owning a car, since
that must lie between 0 and 1. How can we move from what we see (the
value of the logit function), to what we want to know (the probability of car
ownership at X1 = 2)?
The first step in answering this question is to note that if we divide (13.4)
by (13.5b) we get
P(Yi = 1)/P(Yi = 0) = e^w    (13.10a)
where e^w is the expected value of the dependent variable from the regres-
sion.5 The ratio of the two probabilities on the left-hand side of the equa-
tion is known as the odds ratio; it is thus equal to
P(Yi = 1)/P(Yi = 0)    (13.10b)
Taking logarithms of both sides of (13.10a) gives
w = loge [P(Yi = 1)/P(Yi = 0)]    (13.11)
To complete the derivation, recall that w is our shorthand term for the com-
plete set of explanatory variables, a + Σ bkXk. Thus, the log odds ratio is equal to
the estimated value of the logit function at a given set of explanatory variables.
We can now reverse the procedure to convert the estimated value of the logit function, −2.4699, back into a probability:
(i) Take the exponent of −2.4699; this equals 0.0846. This step trans-
forms the log odds ratio into the odds ratio proper (e^w, as in (13.10a)).
(ii) Divide 0.0846 by 1.0846; this produces 0.078. This step extracts the
value of P(Yi = 1) from the odds ratio (using e^w/(1 + e^w)), as previously
shown in (13.4).
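A hedged Python sketch of this calculation (coefficient values as reported above; the variable names are ours):

import math

a, b1 = -3.6921, 0.6111    # constant and coefficient on income ($000)

w = a + b1 * 2.0           # value of the logit function at income = $2 thousand
odds = math.exp(w)         # step (i): the odds ratio e^w,  0.0846
p = odds / (1 + odds)      # step (ii): P(Yi = 1),          0.078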
The procedure may also be used to calculate the average level of auto-
mobile ownership for the sample as a whole from the logit regression.
This could be accomplished by first estimating P̂Y for each observation
and then calculating its average. However, such a laborious process is
unnecessary, since the average value of P̂Y is precisely equal to the mean of
P(Yi = 1) for the sample as a whole (which we designate as P̄Y). Thus,
Σ P̂Y/n = P̄Y. This result is the maximum-likelihood equivalent of the stan-
dard OLS finding that the sum of all the deviations from the regression
line is equal to zero; the logit regression provides the best fit to the data on
average.7
We discuss the effects of an absolute change in §13.2.3 and §13.2.4, and will
deal with the proportionate changes in §13.2.5.
The marginal effect, ∂PY/∂Xk, is the impact on the dependent variable,
P(Yi = 1), of an infinitesimal change in an explanatory variable, Xk.
The alternative impact effect measures the change in P(Yi = 1) relative to
a (larger) finite change in Xk. The impact effect of an absolute change in Xk is
commonly indicated by the expression ΔPY/ΔXk, where Δ (capital Greek
delta) is the mathematical symbol for a finite change. It is usually calcu-
lated for a change of one unit in Xk, although the impact of other finite
changes may also be estimated. Since ΔPY/ΔXk is a ratio (unlike ∂PY/∂Xk),
the scale of the impact effect differs as ΔXk changes. If the effect is measured
for a change of one unit then ΔXk = 1, and the impact effect is simply equal
to ΔPY.
i
∂PY/∂Xk is the mathematical symbol representing a very small change in P(Yi = 1) caused by a
very small change in one explanatory variable, Xk, while any other explanatory variables are held
constant. (For the mathematically inclined, it is the partial derivative of the logistic function
with regard to Xk at the point of measurement.) Note that it must be thought of as a single
expression; unlike ΔPY/ΔXk, it is not a ratio of a numerator to a denominator, in which the value
of the denominator can change from one calculation to another.
The impact effect, ΔPY/ΔXk, is the change in the dependent variable,
P(Yi = 1), relative to a finite change in an explanatory variable, Xk.
[Figure 13.4 Impact and marginal effects: probability of automobile ownership, Detroit, 1919, plotted against household income ($000), with the impact effect and the marginal effect shown at tangency points A and B on the logit curve.]
Likewise, the slope of the dotted line also varies along the curve. This
can be seen by comparing the tangency lines at A and B in figure 13.4. It
happens that the slope (and therefore the size of the marginal effect) is
largest at the middle of the logit curve (i.e. when the predicted value of
P(Yi = 1) = 0.50). The further we move from the middle of the curve, the
more the slope will diminish (and the marginal effect will decline in
value). The varying slopes are due to the inherent non-linearity of the
logit curve.
It is therefore important not to place too much emphasis on the meas-
urement of either the marginal or the impact effect of a change in an
explanatory variable at a single point on the curve. But on the other hand,
researchers have no wish to report a whole battery of results for each indi-
vidual, or for each possible permutation of the explanatory variables. They
invariably do provide one indicator of the effect of an absolute change, and
this is usually the average marginal, or average impact, effect for all obser-
vations. This collapses the multiplicity of possible outcomes into a single
index, which gives equal weight to the effect of a change in, say, income on
car ownership for every household in the sample.
ΔPY/ΔXk = (1/n) Σ_i (ΔPYi/ΔXk)    (13.13a)
There remains the question of how the reported value of the marginal
effect should be interpreted. Is 0.0942 a large number or a small one?
Without a point of comparison, it is hard to tell. In a bivariate model,
therefore, with only one explanatory variable, the marginal effect is of
limited interest. In a multivariate model, however, it provides a means to
compare the impact on the dependent variable of very small changes in
each of the explanatory variables. One variant of this approach that is
often to be found in logit (and probit) exercises is the publication of the
ratio of the marginal effects, indicating the relative strength of the
explanatory variables in influencing the behaviour of the dependent
variable.
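The averaging itself is easily mechanized. A minimal Python sketch, using the result from note 8 ((13.12a)) that each household's marginal effect is bk P̂Yi (1 − P̂Yi); the list of incomes is a hypothetical stand-in for the data set:

import math

a, b1 = -3.6921, 0.6111    # constant and income coefficient from the text

def p_hat(income):
    # fitted probability of ownership at a given income ($000)
    w = a + b1 * income
    return math.exp(w) / (1 + math.exp(w))

def average_marginal_effect(incomes):
    # mean over all households of b1 * p * (1 - p), cf. (13.12a)
    return sum(b1 * p_hat(x) * (1 - p_hat(x)) for x in incomes) / len(incomes)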
For a unit change in Xk, the log odds ratio changes by exactly the logit coefficient:
bk = Δ loge [P(Yi = 1)/P(Yi = 0)]    (13.14)
The new value of the log odds ratio after the change is thus equal to the origi-
nal value plus the regression coefficient.
We can illustrate the procedure with reference to our Detroit sample,
and we again take the individual household with an income of $2 thou-
sand.9 The predicted value of P̂Y for this household has already been calcu-
lated (in §13.2.2) at 0.078. The sequence of steps to calculate the new
predicted value of P̂Y for an increase in income of one unit ($1 thousand) is
as follows:
(i) New value of log odds ratio = original value + logit coefficient on
income
= −2.4699 + 0.6111
= −1.8588
(ii) New value of odds ratio = exponent of the new log odds ratio
= e^−1.8588
= 0.1559
(iii) New value of P(Yi = 1) = new odds ratio/(1 + new odds ratio)
= 0.1559/1.1559
= 0.1349
(iv) Change in P(Yi = 1) = new value of P(Yi = 1) − old value
= 0.1349 − 0.0780
= 0.0569.j
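The same arithmetic in a short Python sketch (values from the worked example above):

import math

a, b1 = -3.6921, 0.6111
old_w = a + b1 * 2.0                       # -2.4699
old_p = math.exp(old_w) / (1 + math.exp(old_w))  # 0.078
new_w = old_w + b1                         # step (i):   -1.8588
new_odds = math.exp(new_w)                 # step (ii):   0.1559
new_p = new_odds / (1 + new_odds)          # step (iii):  0.1349
change = new_p - old_p                     # step (iv):   0.0569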
In the unit case, where ΔXk = 1 for all observations, this simplifies to
ΔPY/ΔXk = (1/n) Σ_i ΔP̂Yi    (13.15b)
On the face of it, this requires the tedious process of going through
steps (i) through (iv) above for every household in the sample.
j
Researchers sometimes report odds-ratio coefficients rather than logit coefficients when pre-
senting their results. These are exactly equivalent: the odds-ratio coefficient is simply the expo-
nential of the logit coefficient. But, if the results are reported in this way, it becomes possible to
reduce the number of steps to calculate the change in P̂Y for a unit change in Xk. The new value of
the odds ratio (step (ii)) is equal to the odds-ratio coefficient multiplied by the original odds
ratio. The remaining steps are the same.
εk = (ΔPY/ΔXk) × (Xk/PY)    (13.16b)
Equation (13.16b) shows that the calculation of the elasticity involves
two separate elements: the ratio of the changes in PY and Xk, and the ratio of
the values of Xk and PY. The first element measures the slope of the relation-
ship between PY and Xk; the second element measures the location at which
this slope is to be calculated.
Since there are two ways to measure the slope of the logit curve at a given
point, it follows that there are two ways to calculate the elasticity.k For the
point elasticity, the first element is the marginal effect, ∂PY/∂Xk.
k
We have adopted the terms ‘point’ and ‘unit’ to distinguish the two elasticities, but it should be
noted that some authors simply refer to ‘the elasticity’ without specifying exactly which they
have measured.
The corresponding calculation for the unit elasticity starts from the
formula
εk = (ΔPY/ΔXk) × (X̄k/P̄Y)    (13.19)
However, we saw in (13.15b) that for the sample as a whole, the first
term on the right-hand side was equal to
(1/n) Σ_i ΔP̂Yi
This term can be substituted in (13.19) to give the formula for the unit elas-
ticity for the sample as a whole
εk = [(1/n) Σ_i ΔP̂Yi] × (X̄k/P̄Y)    (13.20)
We may return once again to our Detroit example to illustrate the appli-
cation of these two formulae. In §13.2.4 we measured the marginal and
impact effects of a change in income for the sample as a whole at 0.0942 and
0.074, respectively. These may be converted to point and unit elasticities by
multiplying by the ratio of the sample means of income (X̄k = $2.95 thou-
sand) and car ownership (P̄Y = 0.175). The point elasticity is thus measured
at 1.5885; the unit elasticity at 1.2474.
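A two-line Python check of these conversions (sample means as reported in the text):

marginal_effect = 0.0942    # average marginal effect
impact_effect = 0.074       # average impact effect for a unit change
mean_income = 2.95          # X-bar, $000
mean_ownership = 0.175      # P-bar

point_elasticity = marginal_effect * mean_income / mean_ownership   # ~1.59
unit_elasticity = impact_effect * mean_income / mean_ownership      # ~1.25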
However, the point elasticity may mislead if the change in the explanatory
variable is not small. This would always be the case, for example, if we were
estimating the effect of a change in a dummy explanatory variable from 0
to 1. Under these circumstances, it is clearly better to calculate the unit elas-
ticity, even if its derivation is more complicated.
At this stage it may be helpful to summarize the rather complex set of
procedures for interpreting logit coefficients. We do so in two ways. First,
with the aid of figure 13.5, which provides a simple flow diagram incorpo-
rating all the relevant formulae for measuring the effect of the absolute and
proportionate changes for the sample as a whole.
Secondly, to show how these formulae and procedures work out in prac-
tice, we have also given in table 13.1 some stylized results from the simple
bivariate model of car ownership in Detroit.
Column (1) in table 13.1 indicates the level of income at thousand-
dollar intervals of the income distribution, from $1 thousand to $10 thou-
sand. The remaining entries follow logically from this information by the
formulae adopted above. The final row in the table gives the results for the
sample as a whole; it is a weighted average of the figures reported in the
income specific rows.l This row summarizes the average response of auto-
mobile ownership to changing income across all observations.
The non-linearity of the logit function is clearly revealed in the table.
The measurement of the impact of both the absolute and proportionate
l
It is a weighted, rather than a simple, average because the income is not distributed evenly across
all income levels in the original data set.
[From the flow diagram of figure 13.5: the average point elasticity is εk = bk Σ_i [P̂Yi(1 − P̂Yi)/n] × (X̄k/P̄Y).]
Table 13.1 Deriving the findings from a logit regression (hypothetical data)
Notes:
(2) is the predicted value of the logit function from the regression, a + bX, where X is
income, a = −3.6921, and b = 0.6111.
(3) is the exponent of (2).
(4) is calculated using the formula OR/(1 + OR), where OR is the value of the odds ratio in
(3).
(5) is (4) × [1 − (4)] × the regression coefficient, b [0.6111].
(6) is the difference between successive values of (4) (the value in row 10 is the difference
between 0.918 and the value of (4) if X = 11).
(7) is (5) × (1)/(4).
(8) is (6) × (1)/(4).
changes clearly varies within the sample; the point elasticity, for example, is
0.58 at an income level of $1 thousand, rising to almost 2.0 at $6 thousand,
and declining steadily to 0.50 at $10 thousand.
For this reason, although our calculations thus far have focused primarily
on the average effect of changing income for the sample as a whole, it is clear
that such a summary calculation hides a great deal of information about the
behaviour of automobile ownership in our data set. The averages are clearly
useful, but it will often be desirable for a researcher to go behind them to
investigate changes at specific points of the logistic curve. For example, she
may be interested in the responsiveness of automobile ownership to changes
in income for the median household, or she may wish to compare the beha-
viour of ownership for lower- and upper-income families. The non-linearity
of the logit function ensures that these valuations will diverge, as table 13.1
makes clear.
                   ΔPY/ΔX1    X1/P̂Y        ε1
Lower quartile:    0.044      1.50/0.059    1.119
Median income:     0.069      2.42/0.099    1.688
Upper quartile:    0.114      3.75/0.198    2.159
13.2.7 Goodness of fit
Researchers are often interested in the goodness of fit for a regression equa-
tion. Since the logit model does not use least squares methods, it is not pos-
sible to estimate the R2 directly. Instead, it is necessary to construct other
measures of the goodness of fit, which have become known as pseudo R2
measures.
As explained in panel 13.1, the optimal evaluation of the parameters of
the logit model is determined by maximizing the value of the log-
likelihood function. It is this function that is the basis for constructing
the pseudo R2 measures. Since there is no ideal value of the log-likelihood
function against which the model estimate can be compared (corre-
sponding to the explanation of 100 per cent of the variance in an OLS
model), the goodness-of-fit measure compares the value of the model
estimate to the value of the log-likelihood function if the model explained
nothing of the variation in PY. This estimate is known as the base likeli-
hood.
Let the maximum log likelihood be symbolized as log L1 and the log
value of the base likelihood as log L0.
The first goodness-of-fit measure is defined as
pseudo R² = 1 − 1/(1 + [2(log L1 − log L0)/n])    (13.21)
where n is the number of observations.
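Using the log-likelihood values reported in table 13.2 below (and n = 1,000 observations), a short Python sketch reproduces both goodness-of-fit measures; McFadden's R² is computed as 1 − log L1/log L0, a standard definition that matches the figure in the table:

log_l0 = -463.726   # base log likelihood (model explains nothing)
log_l1 = -253.717   # maximum log likelihood of the fitted model
n = 1000            # observations in the Detroit sample

pseudo_r2 = 1 - 1 / (1 + 2 * (log_l1 - log_l0) / n)   # 0.2958, as (13.21)
mcfadden_r2 = 1 - log_l1 / log_l0                     # 0.4529
lr_statistic = -2 * (log_l0 - log_l1)                 # 420.02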
m
The crucial change is in step (i). The new value of the log odds ratio will be −1.8894 + (0.6111
× 2.05) = −0.6366. The rest of the sequence proceeds accordingly.
n
A perfect fit would require that the model precisely match the distribution of (Yi 0) and (Yi
1) for all observations, in which case the value of log L1 would be zero.
o
Another method of measuring goodness of fit in logit models, which begins from a similar
premise, is to calculate the number of correct predictions of positive outcomes. Great care needs
to be taken when interpreting such measures, for technical reasons that are beyond the scope of
this text.
p
The critical value of the chi-squared distribution with 3 degrees of freedom at the 1 per cent level
is 6.635.
Table 13.2 Detroit automobile ownership, 1919: the full logit model
(hypothetical data)
CONSTANT              0.7285
                      (1.137)
Log L0 = −463.726     pseudo R² = 0.2958
Log L1 = −253.717     McFadden’s R² = 0.4529
−2 log λ = 420.02
Notes:
*** No constant term is reported for the odds-ratio specification, since the expected value of
the dependent variable is completely independent of the intercept term.
*** These effects and elasticities are reported for the sample averages.
*** The unit elasticity for HOME is evaluated at the sample mean of the dummy variable
(0.185).
Figures in parentheses are t-statistics.
q
This is, of course, because the change in HOME is not being measured at its sample mean, which
is neither 0 nor 1.
evaluate the impact of changing family size. The procedure for calculating
these local effects is the same as outlined in §13.2.6.11
It will be seen that there is very little difference from the results of table
13.2, run over the original 1,000 observations, indicating the power of the
grouped methodology when properly applied.12
§.
How do we transform this result into something more familiar and useful?
As with the logit model, we can calculate the impact of two forms of abso-
lute change in an explanatory variable, and two forms of proportionate
change.
We begin with the calculation of the impact effect of a unit change in Xk ,
the explanatory variable, INCOME. The coefficient on INCOME in the
probit regression, given in §13.4.1, is 0.3575. This probit coefficient has a
special interpretation: it indicates that a unit increase in income raises the
probit index by 0.3575 standard deviations. The impact of such a change on
the expected value of automobile ownership will vary according to where
on the standardized normal distribution this change of 0.3575 s.d. takes
place.
To measure this we need to translate the value of the probit index into
the expected value of AUTO, and this is done by means of the published
tables of the area under the standardized normal distribution, an excerpt
from which was presented in table 2.7. The value of the cumulative area to
Z1.4084 is 0.9205.r In order to calculate AUTO we need to find the value
in the tail beyond Z. This is equal to 1 0.92050.0795, which gives us P̂Y,
the average expected level of car ownership at the means of the explanatory
variables.
r
D. V. Lindley and W. F. Scott, New Cambridge Statistical Tables, 2nd edn., Cambridge University
Press, 1995, table 4, p. 34, by interpolation between the values for Z of 1.40 (0.9192) and 1.41
(0.9207) to get 0.9205. Because the table is symmetrical, the area in the lower tail beyond
−1.4084 is identical to the area in the upper tail beyond +1.4084.
The value of the probit index after a unit change in INCOME is equal to
the original value of the probit index plus the regression coefficient on
INCOME, which equals −1.4084 + 0.3575 = −1.0509. In the same way as
before, this can be translated into an expected value of AUTO using the
table for the standardized normal distribution. This gives an area to Z of
0.8533, so the new value is (1 − 0.8533) = 0.1467. The original value was
0.0795, so the absolute increase in the level of car ownership is 0.1467 −
0.0795 = 0.0672. The estimate of the impact effect of a unit change in
INCOME, calculated at the means of all the explanatory variables, is thus
0.0672.
The alternative measure, the marginal effect (∂PY/∂Xk), is derived by
multiplying the probit coefficient for the relevant explanatory variable (in
this case 0.3575) by a tabulated statistic we have not previously encoun-
tered: the height of the normal density function, valued at the probit
index.15 For a probit index of −1.4084 the figure is 0.1480, so the marginal
effect is equal to 0.3575 × 0.1480 = 0.0529.
We turn next to the two elasticities. As with the logit elasticities in
§13.2.5 the calculations involve two elements: the ratio of the proportion-
ate changes in PY and Xk, and the ratios of the values of PY and Xk. For the
sample as a whole, the latter are again taken at their mean values, i.e. at X̄k
and P̂Y.
The unit elasticity can thus be calculated in the usual way (see (13.19)) as
the ratio of the change in the dependent variable (ΔPY) to the change in the
explanatory variable (ΔXk), multiplied by the ratio of the two variables at
their mean values (X̄k/P̂Y). We have already calculated ΔPY = 0.0672 and
P̂Y = 0.0795. We know that the mean level of the explanatory variable, X̄k, is
$2.95 thousand, and for a unit change ΔXk is 1. So the unit elasticity is
(0.0672/1) × (2.95/0.0795) = 2.4936.
The alternative point elasticity is similarly calculated by multiplying the
marginal effect (∂PY/∂Xk) by the ratio of the two variables at their mean
values. The first term was calculated above as 0.0529 so the point elasticity
is 0.0529 × (2.95/0.0795) = 1.9633.
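The whole probit sequence can be sketched in a few lines of Python, using the closed form of the normal cdf via the error function instead of printed tables (values from the text's example; variable names are ours):

import math

def pdf(z):
    # height of the standard normal density at z
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def cdf(z):
    # area under the standard normal curve to the left of z
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

b_income = 0.3575        # probit coefficient on INCOME
index = -1.4084          # probit index at the means of the variables

p_old = cdf(index)                        # 0.0795
p_new = cdf(index + b_income)             # 0.1467
impact_effect = p_new - p_old             # 0.0672
marginal_effect = b_income * pdf(index)   # 0.3575 * 0.1480 = 0.0529

x_mean, p_mean = 2.95, p_old              # sample mean income; P-hat at the means
unit_elasticity = impact_effect * x_mean / p_mean     # 2.4936
point_elasticity = marginal_effect * x_mean / p_mean  # 1.9633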
The difference in scale between the point and unit elasticity measure-
ments is the probit equivalent of the difference between the slope of the
marginal and impact effects in the logit function, as shown in figure 13.4.
These probit calculations are indicative of the procedure. However, they
are based on only one point on the normal distribution function. Since this
function is no less non-linear than the logistic curve, we have to be careful
about generalizing from local results to the sample as a whole. As with the
logit estimates of the average effects and elasticities, if we wish to know the
effect of a change in Xk for the overall sample, the correct procedure is to
take the average effect of a given (infinitesimal or unit) change for each
observation. And, as with the logit model, theory and the computer make
this a relatively straightforward exercise. Once again, the average value of
P̂Y for the observations as a whole is P̄Y. And once again, the computer will
calculate the effect of a change in the explanatory variable on the probit
index for each observation, from which the impact on P(Yi = 1) for the
sample as a whole can be calculated.
s
In what follows, we leave the technical details to econometrics texts and the method of solution
to the computer package.
The second type of response model applies where the assigning of values
to states is random rather than ordered. In such cases the appropriate
multi-response model is either the multinomial logit or the multinomial
probit model. A model investigating emigration from Ireland might
analyse the pattern of migrant destinations according to a set of individual
and locational characteristics. A dichotomous model of Irish emigration
that distinguished only between those who did not emigrate and those who
did, could be extended by subdividing the latter by assigning 1 to those who
went to the United Kingdom, and 2, 3, and 4 to emigrants to the United
States, Canada, and the Antipodes, respectively. Similarly, a study of auto-
mobile ownership in Detroit in 1919 might try to model brand selection,
by assigning 1 to Ford, 2 to Chevrolet, 3 to Buick, etc. In both cases, the
order by which the values are assigned is random, so clearly the ordered
probit model would be inappropriate.17
In the case of multinomial models, there is a significant difference
between the logit and probit alternatives, in contrast to their use in binary
choice models. The logit model, although less computationally awkward,
has one significant drawback, known as the assumption of the independence
of irrelevant alternatives. It arises because the logit model assumes that the
error terms are independent across the alternative outcomes, so that the
relative odds of any two outcomes are unaffected by the availability of others.
Let us assume that the model of Irish emigration is recast by separating
the Antipodes into Australia and New Zealand. The logical expectation
would be that this change would have no impact on the measured probabil-
ities of emigrating to the other destinations. Unfortunately, in the multino-
mial logit model, this is not the case; all the probabilities will decline by the
same proportion. Thus, unless this pattern of proportional substitution
across categories is plausible, the multinomial logit may be inappropriate.
The alternative is to use multinomial probit, which does not assume
independence of the error term across alternative states. Its disadvantage is
its high computational complexity and expense, limiting analysis to fewer
than five categories.
t
James Tobin, ‘Estimation of relationships for limited dependent variables’, Econometrica, 1958,
26, pp. 24–36.
[Figure: two scatter diagrams illustrating censored dependent variables: (a) hours of work plotted against the hourly wage; (b) expenditure on life insurance plotted against income, showing the true line and the biased OLS line fitted to the censored observations.]
The model that Tobin developed to deal with the problem of censored
dependent variables has become known as the tobit. Very simply the tobit
model sets up the regression problem in the following form:
Yi = w + e  if expenditure is positive    (13.24)
Yi = 0      if expenditure is zero
where w again stands for the constant and all the explanatory variables
with their coefficients in the regression equation.
The tobit approach thus distinguishes the decision to buy from the deci-
sion of how much to buy. Note that the structure of (13.24) is similar to
that of (13.1), simply substituting a continuous variable for the binary
choice in the first restriction. The tobit model may thus be seen as an exten-
sion of the logit-probit method.u
The model is solved simultaneously using maximum-likelihood
methods, by setting up a likelihood function with two elements, one relat-
ing to the zero observations, the other to the positive outcomes. Each part
of the likelihood function is based on the standard normal distribution.
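For reference, the two elements can be written out explicitly. A standard statement of the tobit likelihood (our notation; the text does not spell it out), with σ the standard deviation of the error term, Φ the standard normal cdf, and φ its density:

L = \prod_{Y_i = 0} \Bigl[ 1 - \Phi\Bigl( \frac{w_i}{\sigma} \Bigr) \Bigr] \; \prod_{Y_i > 0} \frac{1}{\sigma}\, \phi\Bigl( \frac{Y_i - w_i}{\sigma} \Bigr)

The first product covers the censored (zero) observations; the second covers the positive outcomes.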
The tobit coefficients have a dual interpretation. They measure the
impact of a change in the explanatory variables both on the probability of
making a purchase and on the size of the purchase made. By solving as a
single system, the tobit does not allow for differential effects of the set of
explanatory variables on the two components of the household decision.
Rather it asserts that the variables that determine whether to buy operate in
the same way to determine how much to buy. So, in the case of our Detroit
households, the sensitivity of expenditure on automobiles is the same
throughout the income distribution; but those families below a certain
income level, although desirous of a car, cannot buy one because it is not
available at their offer price.
Since Tobin’s pioneering work, deeper understanding of the underlying
theory behind censored regressions has unearthed some problems with the
tobit solution. In particular, it produces biased coefficient estimates if the
explanatory variables are not normally distributed. This is a serious
u
Tobit models may be applied to censoring from above as well as below. They may also be adapted
to cases in which the dependent variable is constrained from above and below (e.g. when hours
of work cannot fall below 20 or rise above 50).
§.
v
If income is distributed log-normally, it is possible to rescue the tobit specification by using the
logarithm of income instead, but other variables in the regression model may also be non-
normal.
Two-stage model
pseudo R²    −2 log λ    R²
0.1272       420.02      0.4165
Note:
Figures in parentheses are t-statistics.
shown in table 13.3. The most striking difference between the single-equa-
tion and the two-stage model is in the interpretation of the two secondary
explanatory variables, HOME and FAMSIZE. Whereas the tobit regression
suggests that the value of the automobile dropped as family size increased
and rose as wealth increased, the two-stage model permits a more sophisti-
cated interpretation.
A comparison of the probit and the second-stage OLS equations in
columns (2) and (3) of the table indicates that HOME was primarily a
determinant of whether a family purchased a car or not, but was not a sta-
tistically significant factor in explaining the value of the car owned. A
bigger family size reduced the probability of owning a car for any given
level of income or wealth; but for those with incomes above the threshold, a
larger family translated into a more expensive (perhaps larger?) car. It
seems clear that the two-part specification permits a richer interpretation
of the data than the tobit.
Notes
1
The following are a few recent examples of studies that use logit or probit procedures
for the analysis of binary dependent variables. The effects of early industrialization
on the employment of children, in Sara Horrell and Jane Humphries, ‘“The exploita-
tion of little children”: child labor and the family economy in the industrial revolu-
tion’, Explorations in Economic History, 32, 1995, pp. 485–516. The retirement and
related decisions made by a sample of Union army veterans in 1900 and 1910, in
Dora L. Costa, The Evolution of Retirement, University of Chicago Press, 1998. The
probability of dying from or contracting disease for a sample of Civil War recruits
during their military service, in Chulhee Lee, ‘Socio-economic background, disease
and mortality among Union army recruits: implications for economic and demo-
graphic history’, Explorations in Economic History, 34, 1997, pp. 27–55. The prob-
ability of voting for or against tariff repeal in the British Parliament in the 1840s, in
Cheryl Schonhardt-Bailey, ‘Linking constituency interests to legislative voting beha-
viour: the role of district economic and electoral composition in the repeal of the
Corn Laws’, Parliamentary History, 13, 1994, pp. 86–118.
2
This can be easily shown by rewriting the OLS regression, Yi = a + bXi + ei, in the
form ei = Yi − a − bXi. When Yi = 1, the error term is equal to (1 − a − bXi); when
Yi = 0, it is equal to (−a − bXi). The error term will be larger for households for whom
Yi = 1 than for those for whom Yi = 0.
3
The logistic curve is the cdf of the hyperbolic-secant-square (sech2) distribution.
4
We can derive (13.5b) by re-writing (13.5a)
P(Yi = 0) = 1 − e^w/(1 + e^w)
as follows
P(Yi = 0) = (1 + e^w)/(1 + e^w) − e^w/(1 + e^w) = (1 + e^w − e^w)/(1 + e^w) = 1/(1 + e^w)
to produce (13.5b).
Similarly, we may derive (13.5c) as follows, recalling from §1.6.1 that the multipli-
cation of powers is achieved by addition of the exponents, so that e^w × e^−w = e^(w−w) = e^0 = 1,
then
P(Yi = 0) = 1/(1 + e^w) = e^−w/[e^−w (1 + e^w)] = e^−w/(e^−w + e^−w e^w) = e^−w/(e^−w + 1) = e^−w/(1 + e^−w)
which produces (13.5c).
5
Dividing (13.4) by (13.5b) is equivalent to multiplying (13.4) by the reciprocal
(inverse) of (13.5b), i.e.
[e^w/(1 + e^w)] × [(1 + e^w)/1] = e^w
to produce (13.10).
6
To see why the logarithmic transformation of e^w produces w, we apply the rules for
the log of a number raised to a power (see §1.6.2) to the expression loge(e^w) to
produce w loge e. Since the log of e to the base e = 1, this gives w.
7
However, the individual predicted values of car ownership (P̂Yi) from the regression
almost never take on the exact value of the original data, although they are con-
strained to the range (0,1). In the Detroit data set, the fitted values range from 0.04 to
0.95. The logit function, although clearly superior to a linear specification, nonethe-
less produces only an approximation to the actual distribution of data points.
8
Equation (13.12) can be obtained by calculating the derivative of the logit function
with respect to Xk. For an individual household, this gives
∂PYi/∂Xk = bk e^w/(1 + e^w)² = bk [e^w/(1 + e^w)] × [1/(1 + e^w)]    (13.12a)
From (13.4) the first term is equal to P(Yi = 1), and from (13.5b) the second term is
equal to P(Yi = 0). Hence ∂PYi/∂Xk = P(Yi = 1) × P(Yi = 0) × bk.
9
It was suggested previously that the impact effect is best calculated for discrete vari-
ables, such as a dummy explanatory variable. In our simple bivariate model, we have
no dummies; we therefore calculate the impact effect for a unit change in income.
The procedure set out in the text may be generalized to the case of a discrete variable.
The one significant difference in the procedure for binary dummy variables is that
the absolute effect is measured by comparing P̂Y for the cases when the dummy
equals one and zero. When calculating the impact effect of a change in a dummy var-
iable for the sample as a whole, it is necessary to set the dummy variable to zero for all
observations to provide the base-line for comparison, and then to compare this
result to the case when the dummy variable is set to one for all observations.
10
The unit elasticity may either be measured by comparing the impact effect of moving
from 0 to 1 relative to the average value of the dummy in the sample as a whole, or by
using the average value of the dummy as the starting point for estimating the impact
effect, e.g. moving from 0.185 to 1.185 and using 1/0.185 as the second term in the
elasticity formula.
11
One cautionary note. It is important to ensure that the selection of the values of Xk
are internally consistent. Thus, if the researcher is working through the effects of
changing income at the lower quartile of INCOME in the sample, it is essential that
she use the average values of FAMSIZE and HOME at this income level, rather than
the lower quartiles of each variable, which would be inconsistent and, therefore,
inappropriate.
12
For an interesting illustration of the logit model applied to group data see Konrad H.
Jarausch and Gerhard Arminger, ‘The German teaching profession and Nazi party
(1/√(2π)) exp(−z²/2)
where π is the constant pi and z is the probit index. The table is reproduced in G. S.
Maddala, Introduction to Econometrics, 2nd edn., Prentice-Hall, 1992, pp. 610–11.
16
An insightful application of an ordered response model to analyse successive votes in
the US Senate in 1929–1930 on the introduction of the Smoot–Hawley tariff is found
in Colleen M. Callahan, Judith A. McDonald and Anthony Patrick O’Brien, ‘Who
voted for Smoot–Hawley?’ Journal of Economic History, 54, 1994, pp. 683–90. The
variable VOTE has a value of 2 for a representative who voted yes for both the initial
passage of the bill in May 1929 and the final passage in June 1930; 1 if he or she voted
in favour of initial passage but against final passage; and 0 if the representative voted
no on both occasions.
17
One example of a model that uses multinomial logit regressions is Robert A. Margo,
‘The labor force participation of older Americans in 1900: Further Results,’
Explorations in Economic History, 30, 1993, pp. 409–23, in which the procedure was
used to distinguish between not working, being currently employed, and being in
long-term unemployment. Another is Daniel A. Ackerberg and Maristella Botticini,
‘The choice of agrarian contracts in early Renaissance Tuscany: risk sharing, moral
hazard, or capital market imperfections?’, Explorations in Economic History, 37,
2000, pp. 241–57, in which the distinctions were between owner-occupation of land,
fixed-rent contracts, and share contracts.
Report the values of the pseudo R2 and the two log-likelihood values.
You wish to work out the level of HOME when INCOME is $2 thousand
and $5 thousand. Report your answer, showing the requisite steps.
(c) Larger families are less able to migrate, because of the increased
expense
(d) Immigrants, who have already migrated once, are more likely to
migrate again
(e) Urban households are more likely to migrate than rural
(f) The absence of a wife increases the probability of moving
(g) Migration tends to be more frequent in some regions than others
(h) Unskilled workers tend to move less often than other occupations
(i) Migration is less frequent for households without property
(j) For those with property, the wealthier the family, the less likely they
are to move.
(i) Set up a model to test the hypotheses. (Note: you will need to instruct
the computer to generate new dummy variables in order to test
hypotheses (c), (d), (e), and (i)).
Use the logit procedure to test these hypotheses. Record your results
and write a brief essay interpreting them. How effective is the model at
explaining household migration? Which hypotheses are supported and
which are not? Be sure not only to examine the signs and significance of
the coefficients, but pay attention to the historical significance of the
results. (Hint: think about the elasticities.) Is this a complete model of
household migration?
(ii) Run a probit regression and compare the results to the logit formula-
tion. Analyse any differences.
5. Take the probit results from question 4 and use them to interpret the
value of the probit index at the mean of the variables. Apply the method
introduced in §13.6 to calculate the impact effect of a unit change in the
value of real estate on the probability of migration at the mean. Compare
the result of this exercise to the impact effect of a unit change in PROPER-
TY using the logit coefficients.
6. A researcher wishes to analyse why some Southern parishes built work-
houses while others did not. Her maintained hypothesis is that workhouses
tended to be built in large, rich, densely settled parishes. These parish char-
acteristics are captured by POP, WEALTH, INCOME, and DENSITY.
(i) Set up a logit model to test the hypotheses and report the results.
(ii) Re-organize the data into groups and re-run the logit model. Set up
your groups carefully, using the criteria established in §13.3.2.
Report your results, being sure to identify any differences from the
ungrouped results.
7. Set up a model using the Steckel data set to investigate the relationship
between the value of the household’s real property and the characteristics
of the family. Motivate your selection of explanatory variables.
The researcher further argues that the number of children at which allow-
ance is paid is influenced by the same variables. In order to test this hypoth-
esis, it is necessary to generate a multinomial variable, which takes the value
of 0 if no child allowance is paid; 1 if allowance begins with 3 children; 2 if it
begins with 4 children; and 3 if it begins with 5 or more children.
We noted in chapter 1 that the most important of our three aims in this text
was to enable our readers to read, understand, and evaluate articles or
books that make use of modern quantitative methods to support their
analyses of historical questions. We selected four specific case studies that
we have referred to and used as illustrations at various points in preceding
chapters. In this and the following chapter we want to look more closely at
each of these studies, and to see in particular how their models were
specified, how any technical problems such as autocorrelation or simultan-
eity were handled, and how the regression coefficients, standard errors
(and t-ratios or prob-values), and other statistical results were interpreted
and related to the historical issues addressed in the articles.
The present chapter covers two studies investigating the causes of
unemployment in inter-war Britain and of the nineteenth-century emigra-
tion from Ireland to the United States and other countries. Chapter 15 is
devoted to the impact of the Old Poor Law on relief payments, earnings,
and employment in England in the 1830s, and to the factors that
influenced children’s decisions to leave the family home in the United
States in the mid-nineteenth century. We assume that students will by now
have carefully read the case studies and thought about the institutional and
historical questions with which they deal. The comments that follow are
thus concerned primarily with their statistical aspects, and we will refer to
the historical background only where this is necessary to give a context for
the statistical issues.
Each of these studies is briefly introduced in appendix A and the vari-
ables used are described and listed in tables A.1–A.5. Each series is given
an abbreviated name written in capital letters, and we shall refer here to
the series by these names (as we have done throughout the text) rather
than by the symbols or alternative references used in the original studies.
They estimate this model using ordinary least squares (OLS), and report
the following results (p. 453), with t-statistics shown in parentheses1
R² = 0.84, R̄² = 0.82, DW = 2.18, SEE = 1.90
a
Unless otherwise noted all page references in §14.1 are to Daniel K. Benjamin and Levis A.
Kochin, ‘Searching for an explanation of unemployment in interwar Britain’, Journal of Political
Economy, 87, 1979, pp. 441–78.
We can see immediately that the t-ratios are all well in excess of 2 and thus
all the coefficients are statistically significant (see §6.7.3 and (6.15)). The
coefficients are also correctly signed: positive for BWRATIO (a rise in the rel-
ative level of benefits increases UNEMP) and negative for DEMAND (the
higher output is relative to its trend value, the lower the level of UNEMP).
The size of the coefficients appears to give greater predictive importance
to DEMAND than to BWRATIO. However, since coefficient size depends
crucially on the size of the explanatory variable, it makes sense to scale the
two coefficients by calculating standardized beta coefficients (see §8.2.7).
The betas are 0.4533 for BWRATIO and −0.8422 for DEMAND, indicating
that a change of one standard deviation in DEMAND has a much greater
impact on UNEMP than a change of one standard deviation in BWRATIO.
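A standardized beta is simply the regression coefficient rescaled by the standard deviations of the two variables. A minimal Python sketch (the series names are hypothetical stand-ins for the data set):

import statistics

def beta_coefficient(b, x, y):
    # standardized beta: the slope coefficient rescaled by the ratio of
    # the standard deviations of the explanatory and dependent variables
    return b * statistics.stdev(x) / statistics.stdev(y)

# e.g. beta_coefficient(18.3, bwratio, unemp) for the BWRATIO beta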
The coefficient of multiple determination, R², is very high at 0.84, and
because there are only two explanatory variables the adjusted R̄² is only
marginally lower than this. The model thus accounts for 82 per cent of the
marginally lower than this. The model thus accounts for 82 per cent of the
variation in inter-war unemployment. However, we should not attach
too much importance to this since a high R2 is not uncommon with time-
series data in which many variables may fluctuate together even if not
causally related. The value of DW, the Durbin–Watson statistic, is close to
2, thus indicating that there is no problem of autocorrelation (see
§11.3.3). SEE is the standard error of the estimate, as defined in (9.10)
and §9.4.
Benjamin and Kochin (pp. 453–5) also consider the possibility that
there might be a problem of mis-specification if the direction of causation
runs in the opposite direction from that assumed by their model. They
discuss two mechanisms that might lead to such reverse causation, in
which case a high level of unemployment would be the cause of a high
benefit–wage ratio rather than an effect. However, they conclude that
neither was operating in inter-war Britain.
The first mechanism might have occurred if high levels of unemploy-
ment caused low levels of wages and thus raised the BWRATIO. They
dismiss this possibility on the grounds that most of the variation in the
inter-war BWRATIO was caused by movements in BENEFITS rather than
in WAGES; and also show that if BENEFITS and WAGES are entered separ-
ately in an alternative specification of the model, the coefficient on the
former remains strongly significant (with a t-ratio of 4.04) whereas the
latter is insignificant (with a t-ratio of –0.82).
The second mechanism might have occurred if the level of benefits had
been influenced by the level of unemployment. Benjamin and Kochin argue
that there is no evidence in the parliamentary debates to suggest that this
was a relevant consideration, and that the most important determinant of
the scale of benefits was the current and prospective financial condition of
the Fund from which they were paid.3
In addition to the primary model in (14.1), they also experiment with a
variety of other specifications (p. 453, n. 16). These include one in which
the relationship between UNEMP and BWRATIO is modelled as non-
linear, thus allowing for the possibility that the ratio of benefits to wages
might have a very different impact on UNEMP at different levels of
BWRATIO. In another a time trend (a series which simply increases by 1
unit each year) was added to (14.1);b and in others the primary model was
estimated for different time periods, omitting successively 1920, 1920–1,
1920–2 and 1920–3. They find, however, that the results of all these alterna-
tive functional forms are statistically very similar to those given by (14.1).
Benjamin and Kochin (pp. 464–70) use the regression coefficients on
BWRATIO and DEMAND from (14.1) to estimate the effect of the unem-
ployment benefit scheme. Unfortunately, their model involves a complica-
tion which means that the coefficient on BWRATIO cannot be taken in the
usual way to measure its impact on the dependent variable, and the manner
in which they deal with this involves some rather complicated economic
and mathematical reasoning that goes well beyond the scope of this text. We
simply note that their final conclusion is that over the period as a whole ‘the
insurance system raised the average unemployment rate by about five to
eight percentage points’ (p. 468), and thus had a very substantial effect.
for UNEMP and BWRATIO in which the observation for 1920 appears to
be a clear outlier, and the remaining years fall into two groups. The obser-
vations for 1921–9 lie on one straight line with a negative slope, and those
for 1930–8 on another line, again with a negative slope, though for these
years a very slight one.d A negative slope is, of course, the opposite of what
Benjamin and Kochin find, since their regression coefficient on BWRATIO
is positive.
In order to demonstrate the sensitivity of the BWRATIO coefficient to
the sample of years for which the equation is estimated, Ormerod and
Worswick re-estimate the equation successively omitting the years from
1920. Their first alternative is thus for 1921–38, the second for 1922–38,
and so on for 10 equations ending with 1930–8.
As one test of sensitivity they take a 95 per cent confidence interval
around the original regression coefficient on BWRATIO of 18.3, giving
them a range from 10.6 to 26.1.e They then show that six of the 10
coefficients on BWRATIO estimated for their 10 alternative sample periods
fall outside this range.f
As a further test Ormerod and Worswick examine the robustness of the
Benjamin and Kochin results when a time trend is added to their 10 alter-
native time periods.4 The 95 per cent confidence interval is considerably
larger in this case (because the standard error on the regression coefficient,
BWRATIO, for the period 1920–38 is much larger) and the range when the
time trend is included is from 4.3 to 33.6. None of the coefficients for
periods starting in 1923 or later falls within this range, and only one is sta-
tistically significant.
In their vigorous reply, Benjamin and Kochin reject these tests of the
stability of their coefficients as being inappropriate.g In their place, they
apply two other tests for parameter stability. The first is a dummy variable
test, based on the equation for a piecewise linear regression, introduced in
§11.2.3. This test takes the basic regression and adds a slope dummy vari-
able, equal to the interaction of the explanatory variable whose stability is
being tested (in this case, the benefit–wage ratio) and a categorical variable
d You have already plotted this scatter diagram in question 3 of chapter 3.
e The 95 per cent confidence interval corresponds broadly to 2 standard errors (SE) either side of the estimate; it can be calculated as shown in §5.6.3, using (5.16). Ormerod and Worswick give the SE as 3.87, and their confidence interval is thus 18.3 ± 2(3.87).
f Ormerod and Worswick, 'Unemployment', table 3, p. 405.
g Daniel K. Benjamin and Levis A. Kochin, 'Unemployment and unemployment benefits in twentieth-century Britain: a reply to our critics', Journal of Political Economy, 90, 1982, pp. 410–36; see especially pp. 412–15.
set at 0 below the threshold of change and 1 above. The revised equation is
thus
UNEMP = a + b1 BWRATIO + b2 DEMAND + b3 (DUMMY × BWRATIO) + e   (14.3)
The test statistic is the t-statistic on b3, evaluated at standard levels of
significance. If the t-statistic falls below the critical level, then the model
does not indicate structural change and the null hypothesis of stability
should not be rejected. Since Benjamin and Kochin are testing for general
parameter stability, rather than the existence of a single structural break,
they ran the dummy variable test over a range of subperiods, beginning
with 1921–38 and changing the beginning period by one year up to 1930–8.
In no case was the t-statistic on b3 significant at the 5 per cent level.
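This procedure is simple to sketch in code. The following Python fragment, using statsmodels, is our illustration rather than Benjamin and Kochin's own program; the array names (unemp, bwratio, demand, years) are hypothetical:

    import numpy as np
    import statsmodels.api as sm

    def slope_dummy_t(unemp, bwratio, demand, years, break_year):
        """t-statistic on b3 in equation (14.3) for a given break year."""
        dummy = (years >= break_year).astype(float)   # 0 before, 1 from the break year
        X = sm.add_constant(np.column_stack([bwratio, demand, dummy * bwratio]))
        return sm.OLS(unemp, X).fit().tvalues[3]      # t on the interaction term

Looping break_year over 1921–30 and checking each t-statistic against the 5 per cent critical value reproduces the sequence of tests described above.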
The dummy variable test asks whether a single parameter is stable over
the regression sample. The Chow test asks whether the entire model is
stable, i.e. whether all the parameters are stable. The Chow procedure
divides the sample into two (or more) subperiods, running the same
regression over each, and then comparing their results to those from a
regression run over the entire sample. The comparison is based not on
parameter values, but on the overall explanatory power of the equations, as
measured by the unexplained variation, or residual sum of squares (RSS),
first introduced in §4.3.2.h
Let us identify the RSS for the entire sample as RSS1; for the two sub-
samples, we use RSS2 and RSS3. The test statistic for the Chow test is:
F = [(RSS1 − RSS2 − RSS3)/k] / [(RSS2 + RSS3)/(n2 + n3 − 2k)]   (14.4)
where n2 and n3 indicate the number of observations in the subsamples and
k is the number of explanatory variables in the original (and by extension,
the subsample) regression.
This calculated statistic may be compared to the critical value of the F-statistic from any published F-table. The critical value (Fm,n) is reported for two degrees of freedom, one for the numerator and one for the denominator. The df in the Chow test are equal to k and (n2 + n3 − 2k), respectively. If
the calculated F is larger than the critical value at the appropriate level of
significance, then the null hypothesis of model stability is rejected; if F falls
below the critical value, the null hypothesis is not rejected.
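Equation (14.4) translates directly into code. A minimal sketch, assuming the three residual sums of squares have already been computed (scipy supplies the critical value of F):

    from scipy import stats

    def chow_test(rss1, rss2, rss3, n2, n3, k, alpha=0.05):
        """Chow test of model stability, following equation (14.4)."""
        f = ((rss1 - rss2 - rss3) / k) / ((rss2 + rss3) / (n2 + n3 - 2 * k))
        crit = stats.f.ppf(1 - alpha, k, n2 + n3 - 2 * k)   # critical value of F
        return f, crit, f > crit    # True in the last slot means reject stability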
Benjamin and Kochin once again ran the Chow test over a range of pos-
sible subsamples in order to test for general model stability. In no case do
they find the calculated F-statistic to be greater than the critical value at the 5 per cent level of significance, indicating that the null hypothesis of parameter stability should not be rejected.
h The unexplained variation is also referred to as the error sum of squares (ESS).
The starting point for Ormerod and Worswick’s criticism was that 1920
appeared to be a clear outlier. This suggestion has also been made by other
historians critical of Benjamin and Kochin’s results. To our knowledge, no
one has formally evaluated whether 1920 really is an outlier.5 In §11.3.4, we
discussed the general problem of outliers and influential observations. In
that section, we distinguished ‘good’ outliers from ‘bad’ outliers, suggest-
ing that leverage points, in which the values of explanatory variables are
significantly different from the rest of the sample, may provide valuable
information about the model as a whole. In contrast, rogue observations
marred by large prediction errors should be discarded.
Is 1920 a good or a bad outlier? A scatter-plot of the data on UNEMP
and BWRATIO certainly marks it out as a leverage point, and this is
confirmed by a more formal statistical test, due to Hadi, which determines
whether the value of the observation falls into the tail of the sample distri-
bution as a whole. The values of BWRATIO for 1920 and 1921 are
significantly different from the rest of the sample at the 5 per cent level (the
Hadi test detects no outliers in either UNEMP or DEMAND).
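Hadi's procedure is iterative and beyond our scope, but the underlying idea of a leverage point can be illustrated with the standard hat-value diagnostic, which is not Hadi's test itself. A sketch in Python, flagging observations whose leverage exceeds the common rule-of-thumb threshold of twice the average:

    import numpy as np

    def high_leverage(X):
        """Indices of observations with hat values above 2k/n."""
        X = np.column_stack([np.ones(len(X)), X])          # add a constant
        hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverage h_i of each observation
        n, k = X.shape
        return np.where(hat > 2 * k / n)[0]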
To determine whether any observations in the Benjamin–Kochin data
constituted outliers due to estimation error, we ran two tests. A dummy
variable test, which evaluated the statistical significance of observation
dummies in the estimating equation, identified 1921 and 1930, but not
1920, as outliers at the 5 per cent level. A second test involves running the
regression as many times as there are data points, dropping each observation in turn and predicting the value of the dependent variable for the missing observation. This is a variant of the method of out-of-sample
prediction. An observation is identified as an outlier if the difference
between the predicted and actual values of the dependent variable is sta-
tistically large.6 The null hypothesis of no difference between the reported
and predicted unemployment rates was rejected for a larger number of
observations, including 1921, but excluding both 1920 and 1930.i On this
basis, 1920 does not appear to have been an estimation outlier, while 1921
clearly was.
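A sketch of this leave-one-out procedure in Python, again using statsmodels; y and X stand for UNEMP and the matrix of explanatory variables, and the 2.131 critical value is the t-statistic cited in n. 6:

    import numpy as np
    import statsmodels.api as sm

    def loo_outliers(y, X, t_crit=2.131):
        """Drop each observation, predict it, flag large prediction errors."""
        X = sm.add_constant(X)
        flagged = []
        for i in range(len(y)):
            keep = np.arange(len(y)) != i
            res = sm.OLS(y[keep], X[keep]).fit()
            pred = res.get_prediction(X[i:i + 1])
            se = np.sqrt(pred.var_pred_mean[0] + res.scale)   # forecast standard error
            if abs(y[i] - pred.predicted_mean[0]) > t_crit * se:
                flagged.append(i)
        return flagged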
Thus it appears that 1920 was a leverage point, but not a rogue observa-
tion. 1921, on the other hand, may have been both. But how did these data
points influence the overall equation? The substance of Ormerod and Worswick's critique is that the high and significant value of the coefficient on BWRATIO depends heavily on the inclusion of the earliest observations in the sample.
i The other outliers by this criterion were 1924–6 and 1936–8.
j As with all such tests, it is imperative that they be run over the entire sample; 'cherry-picking' certain observations, such as 1920–1, for special evaluation is inappropriate.
k If 1921 is excluded, the OLS regression result gives R² = 0.89, R̄² = 0.88, SEE = 1.56. It is not possible to generate a Durbin–Watson statistic since the data are no longer continuous.
l Or rather, to our results as given in n. 2. Note that robust regression techniques do not produce standard goodness-of-fit measures.
Benjamin and Kochin’s response to their critics was not the last word on
the origins of unemployment in the inter-war period, as might be pre-
dicted from the tone of the debate.m More sophisticated econometric mod-
elling of the inter-war labour market has been undertaken, both by
advocates of a benefit story and by those who find it unpersuasive. Much of
the debate has continued to work with annual observations, although
some scholars have attempted to increase the degrees of freedom of such
models by introducing quarterly data.7
However, much the most intriguing developments in inter-war labour
history have emerged from micro-economic analysis of survey data for the
1920s and 1930s. The pioneering work was undertaken by Barry
Eichengreen, who sampled the records of the New Survey of London Life
and Labour that was undertaken between 1928 and 1932. With this data,
Eichengreen was able to test Benjamin and Kochin’s hypothesis about the
influence of rising benefits on labour supply decisions for individual
workers, rather than generalizing from market aggregates.
Eichengreen’s conclusions offered a substantial revision of the benefit
argument. For household heads, who dominated employment in this period,
the ratio of benefits to wages played almost no role in determining whether
they were in work or out of work. But for non-household heads the replace-
ment rate did matter, perhaps causing their unemployment rate to rise to as
much as double what it would have been had replacement rates stayed at 1913
levels. However, since these secondary workers accounted for only a very
small part of the London labour force, the weighted average of the two effects
was small, leading Eichengreen to conclude that, ‘it does not appear that the
generosity of the dole had much effect on the overall unemployment rate’.n
m There are clearly several respects in which their modelling procedure would not conform with what is today regarded as best practice in the sense of §12.6, but the article reflects standard practice of the time it was written.
n Barry Eichengreen, 'Unemployment in inter-war Britain: dole or doldrums?', in N. F. R. Crafts, N. H. Dimsdale and S. Engerman (eds.), Quantitative Economic History, Oxford University Press, 1991, pp. 1–27, quotation at p. 22.
o All page references in this section are to Timothy J. Hatton and Jeffrey G. Williamson, 'After the famine: emigration from Ireland, 1850–1913', Journal of Economic History, 53, 1993, pp. 575–600.
Table 14.1 Determinants of total Irish emigration rates, time-series data, 1877–1913
(1) Coefficient   (2) t-statistics
R² = 0.86
Residual sum of squares = 80.05
Durbin–Watson = 1.90
LM(1) = 0.15
p The other statistic reported in table 14.1 is the residual sum of squares (RSS), which indicates how much of the variance in the dependent variable is not explained by the regression. It is used by Hatton and Williamson to enable comparisons between (14.5) run on total emigration, and other versions run on slightly different data sets.
q Remember that EMPFOR is measured as 1 minus the proportion of the labour force unemployed each year. A fall in unemployment from 10 to 1 per cent is equivalent to a rise in employment from 90 per cent to 99 per cent, i.e. by 10 per cent.
Since Hatton and Williamson are interested in the long-term trend as well
as the annual fluctuations, they also formulate an alternative long-run version
of the model. In this the long run is characterized by the elimination of any
impact from short-term changes in wage and employment rates, or in the
emigration rate. This is achieved by setting the short-run change terms in EMPFOR, EMPDOM, and IRWRATIO equal to zero, and by regarding IRISHMIGt and IRISHMIGt−1
as equal. Making these changes (after which the time scripts are no longer
needed), and moving the second term for IRISHMIG to the left-hand side,
the model becomesr

IRISHMIG = [b4/(1 − b8)] log EMPFOR + [b5/(1 − b8)] log EMPDOM + [b6/(1 − b8)] log IRWRATIO + [b7/(1 − b8)] MIGSTOCK + e   (14.6b)
r The left-hand side is written as (1 − b8) IRISHMIG. This is the same as IRISHMIG − b8IRISHMIG.
s This procedure for deriving long-run multipliers was also described in §12.3.3.
The long-run coefficient on log EMPFOR is thus b̃4 = b4/(1 − b8), and

b̃4/100 = 1.153
As in the previous case we are determining the effect of a 10 per cent rather
than a 1 per cent change, so must multiply this by 10. The result is, there-
fore, 11.5, or roughly the same magnitude as the short-run effect, as noted
by Hatton and Williamson (p. 584).
Hatton and Williamson then turn to the effect of a sustained rise of 10
per cent in IRWRATIO, the foreign-to-domestic wage ratio, and find that
this ‘would lead ultimately to an increase in the emigration rate of 2.35 per
1,000’ (p. 584). Since the form of this relationship is exactly the same as the
previous one, the calculation follows the same lines. Replacing b4 by b6
we get

b̃6 = 13.16/(1 − 0.44) = 23.5

and so

b̃6/100 = 0.235
t The t−1 time subscripts on both variables in table 14.1 are irrelevant when considering the long-term model.
The corresponding calculation for EMPDOM of the effect of a 10 per cent change in this variable follows the same lines as those for EMPFOR and IRWRATIO, and is about −3 per 1,000 (p. 585).
For the final variable, the migrant stock, Hatton and Williamson calcu-
late that the long-run impact was that ‘for every 1,000 previous migrants
an additional 41 were attracted overseas each year’ (p. 585). In this case the
MIGSTOCK variable was not entered in the model in logarithmic form so
we are dealing with a simple linear relationship (as in the first row of table
12.1). A change of 1 unit in MIGSTOCK will cause a change of b units in
IRISHMIG (both measured per 1,000 population) and the long-run
impact is derived as
b̃7 = 22.87/(1 − 0.44) = 40.8
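All three long-run multipliers follow the same pattern, b̃ = b/(1 − b8). A one-loop check of the arithmetic in Python:

    # long-run multipliers: divide each short-run coefficient by (1 - b8)
    b8 = 0.44                                   # coefficient on lagged IRISHMIG
    for name, b in [("IRWRATIO", 13.16), ("MIGSTOCK", 22.87)]:
        print(name, round(b / (1 - b8), 1))     # 23.5 and 40.8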
Hatton and Williamson conclude from this part of their analysis that the
principal cause of the long-run downward trend in Irish emigration was the
increase in Irish real wages relative to those available to potential migrants
in the United States and other receiving countries. They measure the actual
fall in IRWRATIO between 1852–6 and 1909–13 at about 43 per cent, which
would have ‘lowered the long-run emigration rate by as much as 10 per
1,000, accounting for much of the secular decline in the emigration rate that
we observe in the data’ (p. 586).u They further conclude, on the basis of the
value of R2 in (14.5), that over 80 per cent of the variation in the short-run
fluctuations in emigration is explained by the combination of variables
(other than EMPDOM and IRWRATIO) specified in their model.
u Their wage data cover only the period from 1876 and they estimate the decline from 1852–6 by extrapolation of the later rate. The long-run effect on IRISHMIG of a change in IRWRATIO of 43 per cent would be 43 × 0.235 = 10.1 per 1,000.
v The critical distinction between controlling for other variables and simply ignoring them was discussed in §8.2.3.
Table 14.2 Determinants of Irish county emigration rates, panel data, 1881–1911
(1) Coefficient   (2) t-statistics
R² = 0.70
Residual sum of squares = 1420.9
HETERO = 0.73
Their model uses data for 32 counties for each of four census years, and is thus a
pooled data set of 128 observations. They initially included year dummies
to allow for the possibility that the coefficients would not be stable across
all four dates, but this was found not to be the case and the year dummies
were omitted.11
The results of this pooled model are given in table 14.2, again focusing
exclusively on total emigration. R2 is 0.70 and the model thus explains 70
per cent of the variation in CNTYMIG, a good performance by the stan-
dard usually attained in cross-sectional regressions.
HETERO reports a test statistic for heteroscedasticity. The test employed
by Hatton and Williamson is a variant of the Lagrange Multiplier tests (such
as the Breusch–Pagan and White tests) introduced in §11.3.2. The test statis-
tic is derived by running the square of the residuals from the fitted regression
equation against a constant and the fitted values of the dependent variable. If
there is heteroscedasticity, this regression should show a systematic relation-
ship between the size of the residual and the size of the fitted value.
The test statistic for the null hypothesis of homoscedasticity is nR², where n is the number of observations and R² is the coefficient of determination from this auxiliary regression.
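The auxiliary regression is easy to reproduce. A sketch in Python with statsmodels, assuming resid and fitted are the residuals and fitted values from the main regression:

    import statsmodels.api as sm
    from scipy import stats

    def lm_hetero(resid, fitted):
        """LM test: regress squared residuals on fitted values; statistic is n * R^2."""
        aux = sm.OLS(resid ** 2, sm.add_constant(fitted)).fit()
        lm = len(resid) * aux.rsquared
        return lm, stats.chi2.sf(lm, df=1)   # p-value; one slope in the auxiliary regression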
Hatton and Williamson report that these year dummies were statistically insignificant. They therefore employ the unrestricted pooled regression, with a single intercept term.

The random effects model rejects the notion that shifts in the regression line arising from unknown influences can be captured by dummy variables and posits instead that these unknown factors should be incorporated in the error term. Thus the regression model takes the standard form, but with a composite error term combining a county-specific component and the usual random disturbance.
From the coefficient on the relative wage term (which has a t-ratio of 2.12) it can be said 'that a 10 per cent rise in the relative wage would raise the emigration rate by 0.64 per 1,000' (p. 592). As Hatton and
Williamson note, this estimate is noticeably smaller than the correspond-
ing short-run coefficient estimated on time series, and much lower than
the long-run time-series coefficient. The coefficient on the former is given
in table 14.1 as 13.16 so the effect of a 10 per cent rise would be 13.16
100by 10 1.32; the latter was found in the preceding section to be 2.35.
Since the model specifies a simple linear relationship between the
dependent variable and all the remaining explanatory variables, the inter-
pretation of the regression coefficients on each of these variables is
straightforward. A change in an explanatory variable of 1 unit will cause a
change in CNTYMIG of b units; that is of b per 1,000 of the population,
because this is the unit in which the dependent variable is measured. Since
all except one of the explanatory variables are measured as proportions (or
in the case of RELIEF as a percentage), we can thus say that a change of 1
per cent (equivalent to a change in a proportion of 0.01) will have an effect
on CNTYMIG of b units.
The exception is FAMSIZE, where the unit is average number of persons
per family. This variable has a large and highly significant coefficient which
‘implies that a one-person reduction in average family size would lower the
emigration rate by 7.93 per 1,000. This would support those who have
argued that demographic forces continued to influence emigration in the
late nineteenth century. High birth rates led to large families which precip-
itated a large flow of emigration’ (p. 593).
If we look next at the other economic factors, we can see that the
coefficients on AGRIC, and on the interaction variable created by the
product of LANDHLDG and AGRIC, are both large and statistically
significant. For the former alone a 1 per cent rise in the proportion of the
labour force in agriculture would raise the emigration rate by 16.4 per
1,000, for the latter a rise of 1 per cent would lower the rate by as much as 42
per 1,000. The interaction variable can be interpreted as indicating that the
higher the proportion of smallholdings, the smaller was emigration, con-
trolling for the share of the labour force in agriculture. As Hatton and
Williamson conclude: ‘these results strongly suggest that lack of opportu-
nities in agriculture (including opportunities to obtain or inherit a small
farm) was an important cause of emigration’ (p. 592).12
Turning now to the remaining variables we find from the coefficient on
AGE in table 14.2 that a 1 per cent rise in the proportion of the population
in the prime emigration age group would raise the rate by 13.6 per 1,000,
but that this coefficient is not statistically significant. However, Hatton and
Williamson report this result in a somewhat different form from usual.
§.
What they say is: ‘the point estimate for the proportion of the population
between 15 and 34 indicates, plausibly, that emigration would be 13.6 per
1,000 higher for this group than for the population as a whole’ (p. 592, italics
ours). How do they get to this?
Note that the total migration rate may be expressed as the weighted
average of two migration rates – that for 15–34-year-olds and that for the
rest of the population. Let p be the proportion of the population aged
15–34; since the proportions must sum to one, the rest of the population is 1 − p. If the migration rates of the two groups are M1 and M2, we can
express total migration (MIG) as:
MIG = M1(p) + M2(1 − p)   (14.9a)
This may be re-written as
MIG = M1(p) + M2 − M2(p) = M2 + (M1 − M2)p   (14.9b)
If we were to run a regression with MIG as the dependent variable and p
as the explanatory variable, we would have
MIG = a + b(p) + e   (14.10)
By comparing (14.9b) and (14.10) it is easily seen that b, the coefficient on p in the regression, is an estimate of (M1 − M2), i.e. of the difference
between the migration rate of those aged 15–34 and that of the remainder
of the population.w Hence we could report the regression coefficient, b, as
the amount by which the migration rate for 15–34-year-olds exceeds the
rate for the rest of the population: 13.6 per 1,000. (This is not quite what
Hatton and Williamson suggested, which was the rate by which migration
for 15–34 year-olds exceeded the overall migration rate.) Alternatively, the
coefficient could be reported in the usual way by noting that an increase in
the proportion of the population aged 15–34 by 1 unit (for example, from
0.33 to 0.34) would raise the migration rate by 13.6 per cent (for example,
from 11 per 1,000 to 12.5 per 1,000).
It follows from the principle captured by (14.9) and (14.10) that the
coefficients on any other explanatory variable that relates to a proportion of
the relevant population (for example, share of the labour force in manufac-
turing, share of the housing stock which is owner-occupied) can also be
interpreted in this way. Thus, the coefficient on URBAN in table 14.2 indi-
cates that the migration rate of urban inhabitants was 3.02 per 1,000 higher
than that for non-urbanites; while the migration rate for Catholics was on
average 4.97 per 1,000 lower than for non-Catholics. All such coefficients
are, of course, estimated holding other characteristics constant.
w Also note that the intercept term is an estimate of M2.
URBAN has the expected positive sign but the effect of urbanization is
only 3.0 per 1,000 and the coefficient is again not significant. Of the two
poverty variables, RELIEF is highly significant, but the effect on emigration
is relatively small: a rise of 1 per cent in the proportion of the population
receiving poor relief would raise the emigration rate by 1.96 per 1,000.
HOUSING has a smaller impact: a 1 per cent increase in the proportion of
families living in the worst types of housing would raise the emigration rate
by 0.08 per 1,000.x
Finally we see that the sign of the coefficient on CATHOLIC is negative,
and that on ILLITRTE is positive. In both cases this is the opposite of what
was expected, though the coefficient is insignificant in the latter case (p. 594).
(1) Mean value 1881   (2) Mean value 1911   (3) Change 1881–1911   (4) Regression coefficient   (5) Effect on emigration rate

Notes:
(1) and (2) from the original Hatton and Williamson data set, not the version rounded to 2 decimal places in 'Emigration', p. 598.
(3) = (2) − (1)
(4) from table 14.2, column (1)
(5) = (3) × (4). These results differ very slightly from those reported by Hatton and Williamson, p. 595, because of rounding.
Table 14.4 Restricted model of Irish county emigration rates, panel data, 1881–1911
(1) Coefficient   (2) t-statistics
R² = 0.57
Residual sum of squares = 2145.9
HETERO = 0.66
these aspects are directly represented and so the effect of relative wages is
diminished.
To test this assertion, Hatton and Williamson run a revised version of
their panel model. In this they omit those aspects of living standards other
than the relative wage, and also the minor variables that were not statisti-
cally significant or were found to have little or no effect. This alternative
model thus contains only two of the original explanatory variables: CYWRATIO and FAMSIZE.
We report their results for total emigration in table 14.4. They show how
with this very restricted specification the relative wage term captures some
of the effects previously attributed to the other measures of living stan-
dards. The impact of a 10 per cent increase almost doubles from 0.64 to
1.19 per 1,000, about the same as the short-run coefficient in the time-
series equation of 1.32, and about half the long-run coefficient of 2.35.
They conclude from this that the results of the time-series and cross-
section models ‘can be largely reconciled’ (p. 596).
Notes
1 At the time this paper was published it was not yet standard practice to give actual prob-values. The significance of regression coefficients was indicated by reporting either the standard errors or the t-statistics (i.e. the ratios of the regression coefficients to their standard errors) or, occasionally, both. Note also that the t-ratios in this paper are expressed with the same sign as the coefficient; in Hatton and Williamson's paper discussed below, the t-ratios are expressed as absolute values. While it is more accurate to report the sign, it is no longer standard practice, largely because the sign is irrelevant to the interpretation of statistical significance.
2 We regret that we have not been able to reproduce Benjamin and Kochin's results precisely. When (14.2) is run on the data on the web site (with the two explanatory variables taken to 5 decimal places) the result gives R² = 0.86, R̄² = 0.84, D–W = 2.22, SEE = 1.80. Note that the results reported in Ormerod and Worswick are slightly different again. None of the discrepancies is large enough to affect the interpretation of the model, and in what follows we report the original results from Benjamin and Kochin.
3 Benjamin and Kochin also reject the possibility of simultaneity (see §11.4.3), whereby unemployment influences the replacement rate at the same time as the replacement rate causes unemployment, though they do so without the benefit of a formal test, such as the two-stage least squares model introduced in §15.1.1 and panel 15.3.
4 Ormerod and Worswick recognize that Benjamin and Kochin had experimented with the addition of a time trend for the full period 1920–38, but had found that the coefficient on the time trend was itself insignificant and that there was little impact on the estimated value of the other coefficients. However, they say that this conclusion is correct only for the full period, not for the shorter subperiods.
5 Strictly, tests for parameter stability are not the same as tests for outliers, although Benjamin and Kochin's variant of the Chow test does go some way to providing a formal statistical test of the status of 1920.
6 The test statistic in this case is whether the reported unemployment rate falls within 2.131 standard errors of predicted unemployment, where 2.131 is the t-statistic at the 5 per cent level with 15 df.
7 Timothy J. Hatton, 'A quarterly model of the labour market in interwar Britain', Oxford Bulletin of Economics and Statistics, 50, February 1988, pp. 1–25; Nicholas H. Dimsdale, S. Nickell, and N. J. Horsewood, 'Real wages and unemployment in Britain during the 1930s', Economic Journal, 99, June 1989, pp. 271–92.
8 For a separate paper showing how this model is formally derived from appropriate economic principles, see Timothy J. Hatton, 'A model of UK emigration, 1870–1913', Review of Economics and Statistics, 77, 1995, pp. 407–15.
9 Note that Hatton and Williamson's procedure for calculating the impact of a 10 per cent change in each explanatory variable is an extrapolation from the impact of an infinitesimal change. An alternative method for calculating the effect of a given per cent change in EMPFOR would be to multiply the coefficient b4 by the change in the value of log EMPFOR. Thus, a 10 per cent increase in EMPFOR (e.g. from 90 to 99 per cent) would generate a 10.38 per cent increase in IRISHMIG (the change in log EMPFOR is 0.09531, and 0.09531 × 108.95 = 10.384).
10 In contrast, Hatton and Williamson enter the home and foreign employment rate as separate variables in the main regression. They do this because they believe that potential migrants interpret information about the labour market in Ireland and abroad in different ways, so that it would be inappropriate to combine the two elements into one ratio (see p. 582).
11 To include year dummies one of the four census years would be chosen as the control year, say 1881, and three separate year dummies would be created, which we can call 1891, 1901, and 1911. The year dummy 1891 would take a value of 1 for all observations relating to 1891, and 0 otherwise; 1901 would be 1 for all observations relating to 1901, 0 otherwise; 1911 would be 1 for all observations relating to 1911, 0 otherwise.
12 The interaction variable would normally be constructed as the product of the two component variables. Unfortunately, at a late stage of their work, Hatton and Williamson changed the value of their AGRIC variable without changing the interaction term. The unchanged values are included in the data set on the web page to enable readers to reproduce the published regression results. As can be seen by constructing the correct interaction variable, this discrepancy makes little difference to the results.
b The sample parishes were selected on the basis of the completeness of their returns to the Commission, but – for the reasons mentioned in n. 2 of chapter 5 – Boyer believes that there is no obvious bias as a result of this (pp. 129, 149).
The first problem that Boyer faced in specifying his model is created by
the interdependence of several of the key hypotheses advanced by contem-
poraries and later historians (p. 127). It is widely held that both wage rates
for agricultural labourers (INCOME) and unemployment among labour-
ers (UNEMP) were determinants of RELIEF, but at the same time it is also
claimed that RELIEF led to lower INCOME and higher UNEMP.
Moreover, Boyer’s approach assumes that labour-hiring farmers were able
to choose what they considered would be for them the most profitable
combination of the three variables: the wages they paid, the employment
they offered during non-peak seasons, and the level of benefits paid
through the Poor Law to unemployed labourers.
These three variables, INCOME, UNEMP, and RELIEF are referred to as
endogenous, meaning that they are determined within the system of rela-
tionships Boyer is modelling. By contrast, the hypothesized explanatory
variables are referred to as exogenous or pre-determined. Exogenous vari-
ables are determined outside the model in question, and cannot be affected
by any changes in the other variables within the system of relationships.
Given this mutual interaction between the three endogenous variables
there is clearly a problem of simultaneity, and thus a violation of one of the
assumptions of the classical linear regression (CLR) model (see §11.4.3).
In these circumstances it is not appropriate to use a single-equation model
to explain cross-parish variations in relief expenditures, and Boyer adopts
two alternative procedures; one known as a reduced-form model, the
second as a simultaneous-equations or a structural model (p. 133). A full
explanation of these two types of model is not possible at the level of expo-
sition appropriate to this book, but a very elementary introduction to what
is involved is given in panels 15.1 and 15.2 for those who wish to know a
little more. The panels should be read after completing this section.
Ordinary least squares (OLS) can be used to estimate the reduced-form
equations, but because of the effects of the simultaneity in the relationships
this procedure should not be employed to estimate a simultaneous-equa-
tions model. Instead Boyer adopts the alternative two-stage least squares
(2SLS) procedure introduced in §11.4.3. Panel 15.3 gives a brief explana-
tion of the procedure. Once the model has been estimated by 2SLS, the
interpretation of the regression results is exactly the same as for models
estimated by OLS.
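Written out explicitly, the two stages are just two OLS passes. A minimal sketch in Python with statsmodels; the array names are hypothetical, with y the dependent endogenous variable (say RELIEF), endog the endogenous regressor (INCOME), exog the included exogenous variables, and extra_iv the exogenous variables excluded from this equation:

    import numpy as np
    import statsmodels.api as sm

    def two_stage_ls(y, endog, exog, extra_iv):
        """Two-stage least squares written as two explicit OLS regressions."""
        # Stage 1: regress the endogenous regressor on all exogenous variables
        Z = sm.add_constant(np.column_stack([exog, extra_iv]))
        endog_hat = sm.OLS(endog, Z).fit().fittedvalues
        # Stage 2: replace the endogenous regressor by its stage-1 fitted values
        X = sm.add_constant(np.column_stack([endog_hat, exog]))
        return sm.OLS(y, X).fit()

Note that this explicit two-pass version reproduces the 2SLS coefficients but not the correct standard errors, which dedicated 2SLS routines adjust automatically.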
and
INCOME = a + b1 COTTIND + b2 ALLOTMNT + b3 LONDON + b4 CHILDALL + b5 SUBSIDY + b6 LABRATE + b7 ROUNDSMN + b8 DENSITY + b9 RELIEF + e   (15.3)
Two features of this model should be noted. First, when one of the endog-
enous variables is the dependent variable, the other is not omitted (as was the
case in the reduced-form model) but appears on the right-hand side as an
explanatory variable. We thus have INCOME in (15.2) and RELIEF in (15.3).
Because INCOME is included on the right-hand side of (15.2) it is not
necessary to include also those explanatory variables that help to explain
INCOME but do not directly affect RELIEF. This explains why the four
variables in (15.1) that represent specific forms of outdoor relief (CHIL-
DALL, SUBSIDY, LABRATE, and ROUNDSMN) are omitted from (15.2).
They are expected to have a negative effect on INCOME, but are not a
direct determinant of the overall level of RELIEF.
Secondly, unlike the reduced-form model, the equations in the simulta-
neous-equations model for RELIEF and INCOME ((15.2) and (15.3)) do
not share the same explanatory variables. FARMERS is included in (15.2)
to test whether parishes in which labour-hiring farmers had greater politi-
cal power were able to pass on to other ratepayers more of the cost of main-
taining their workers, thus leading to higher RELIEF. However, this power
is not regarded as a determinant of INCOME, and so this variable does not
appear in (15.3). GRAIN is designed to be a proxy for the extent of season-
ality in the demand for labour and is omitted from (15.2) because it should
affect RELIEF only through its effect on UNEMP, which is already entered
in the equation. And as noted in the previous paragraph, the four variables
representing forms of outdoor relief are included as determinants of
INCOME but not of RELIEF.
We can now turn to the discussion of the results of the regression
procedures.
(a) Unlike all the other models we have considered hitherto, this model
consists of a set of two equations that must be considered together and
contains two dependent variables.
(b) The relationships are interdependent, as indicated by the model:
changes in INCOME will lead to changes in RELIEF, and at the same
time changes in RELIEF will cause changes in INCOME.
(c) Because this interdependence or simultaneity is present in the rela-
tionships, one of the assumptions of classical linear regression (CLR)
is violated (see §11.4.3). For this reason the simultaneous-equations
model cannot be reliably estimated by the ordinary least squares
(OLS) procedure. Boyer adopts an alternative technique known as
two-stage least squares (2SLS), explained in panel 15.3.
Equation 15.1 and the two corresponding equations for INCOME and
UNEMP are of this type though they were not derived in precisely this way
from the corresponding structural model. All these reduced-form equa-
tions can be estimated in the standard way by ordinary least squares (OLS).
The change from the simultaneous-equations to the reduced-form
model not only changes the right-hand side to exclude the endogenous var-
iables, it also changes the value of the parameters, i.e. of the constant and the
regression coefficients. This happens because in the reduced-form model
the coefficients measure both
(a) the direct effect of the respective exogenous variables on the endoge-
nous variables, and
(b) their indirect effects through the changes in the endogenous variables
which react simultaneously on each other.
* This can be thought of as a special case of the instrumental variable technique (see §11.4.1) in the sense that all the exogenous variables are taken together to create an instrumental variable.
The expected signs are either positive (+), negative (−), or uncertain (?); the actual results are either positive (+) or negative (−) or not statistically significant (0).
We reproduce in table 15.1 the results Boyer reports for both models for
RELIEF and INCOME. R2 is 0.30 for the reduced-form equation for
RELIEF and 0.36 for INCOME. This is quite satisfactory for cross-sectional
models, which typically yield lower coefficients of multiple determination
than time-series models. In the latter, aspects of the change through time
are typically common to many variables, creating a broadly similar pattern
of variation for dependent and explanatory variables. By contrast, with
cross-sectional models, there are likely to be many other factors at work in
the individual cases (in this instance parishes) that will influence the
dependent variable (see §3.2.3).3
Boyer reports both the t-ratios and the prob-values.c The two are, of
course, different ways of assessing the same issue. For example,
ALLOTMNT (in the reduced-form equation for RELIEF) has a very low
t-statistic (0.13) and a correspondingly high prob-value of 0.899; in other
words there is a probability of 89.9 per cent of getting a value greater than t
if the null hypothesis of no association between RELIEF and ALLOTMNT
is correct, and the null hypothesis therefore cannot be rejected.
By contrast the t-ratio for LONDON is very high (3.85) and there is a
correspondingly low prob-value of 0.0001; in other words there is only a
0.01 per cent probability of getting a value greater than t if there is no asso-
ciation, and the null hypothesis can be rejected.
More generally, as would be expected from the discussion of this matter
in §6.3.5, we see that coefficients with a t-ratio of 2 or higher have a corre-
sponding prob-value of 0.05 or lower and are thus significant at the 5 per
cent level or better. Those (such as COTTIND) with a t-statistic a little
under 2 are significant at the 10 per cent level though not at the 5 per cent
level. As a more precise indicator, the prob-value allows more discretion in
the assessment of regression results.
We turn now to the values obtained for the regression coefficients. All
the relationships are linear so the results should all take the form ‘a small
absolute change in X will cause a small absolute change in Y of b units’. It is
necessary to bear in mind, however, that many of the explanatory variables
are dummy variables, and that some of the numerical variables are meas-
ured in proportions or percentages.
The first quantitative result given by Boyer is that per capita relief expen-
ditures were between 1.7 shillings and 3.4 shillings lower in parishes with
c The two parallel lines either side of t in table 15.1 are known as a modulus. This simply indicates that the author is stating the prob-value without regard to the sign of t.
Table 15.1 The agricultural labour market and the Old Poor Law, simultaneous-equations and reduced-form models, cross-section data, c. 1831
R² = 0.363
cottage industry than in those without (p. 138). This follows directly from
the coefficients on the dummy variable for COTTIND. The range of values
is given by the point estimates for COTTIND in the reduced-form and
simultaneous-equation models, and the sign indicates that the effect on
RELIEF was negative. As with all the other dummy variables in the model
there are only two categories, and so there is a straight comparison between
those parishes that have cottage industry and the control category that do
not. The subsequent statements about the effects of COTTIND and
ALLOTMNT on INCOME (p. 140), and of CHILDALL on INCOME
(p. 142) are obtained in precisely the same way.
Boyer qualifies the results based on dummy variables, particularly for
ALLOTMNT. These are included in the model to test the hypothesis that a
further cause of the rise in per capita relief expenditure was the need to
compensate labourers for the loss of land caused by enclosures and other
forms of engrossment. However, because ‘dummy variables measure the
occurrence of a phenomenon rather than its magnitude, one cannot always
make meaningful time-series inferences from their cross-sectional
coefficients’ (p. 127). In this case, the typical amount of land lost by labour-
ers was much larger than the size of the average allotment, and therefore
‘the coefficient from the cross-sectional analysis understates the long-term
effect of labourers’ loss of land’ (p. 143).
The next set of quantitative estimates is a little more complicated. Boyer
observes that distance from LONDON (a proxy for the cost of migration)
had a negative effect on both RELIEF and INCOME. A 10 per cent increase
in LONDON resulted in a reduction in RELIEF of between 1.5 and 2.9 per
cent, and in INCOME of between 1.3 and 1.4 per cent (p. 141). How does
he get these results?
We know directly from the regression coefficients that a change in
LONDON of 1 mile will lead to an absolute fall in RELIEF of between 0.04
and 0.08 shillings and in INCOME of between £0.06 and £0.07.4 The values
come from the two models reported in table 15.1 and all the coefficients are
negative. However, Boyer has chosen to convert these absolute changes to
relative ones, and he does this in relation to the mean values of the three
variables.
The mean distance of all parishes from London is 65.1 miles, the mean
of RELIEF is 18.0 shillings, and the mean of INCOME is £29.6.d A 10 per
cent change in LONDON would thus be 6.51 miles, and the effect of this
d The mean values of RELIEF and INCOME (and of most of the other variables) are given by Boyer on p. 149. The mean distance from London is not given but can easily be calculated from the data set.
The regression coefficient gives the ratio of the absolute changes in the two variables:

ΔY/ΔX = 0.34

However, as explained in §12.3.2 this is not the same as the elasticity, which is defined as the ratio of the proportionate changes in the two variables:

(ΔY/Y) ÷ (ΔX/X)

If we re-arrange the terms we see that this definition can also be written as

(ΔY/ΔX) × (X/Y)

This can be done in relation to the mean values of the variables, respectively, 7.4 per cent for UNEMP and 18.0 shillings for RELIEF. We thus have

0.34 × (7.4/18.0) = 0.14
Since this is not a constant elasticity (see again §12.3.2) its value would
be different at other levels of the variables, but it is usual to evaluate the
elasticity ‘at the mean’.
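The calculation is a one-liner. In Python, with the values above:

    # elasticity at the mean: (change in Y / change in X) * (mean X / mean Y)
    b, mean_unemp, mean_relief = 0.34, 7.4, 18.0
    print(round(b * mean_unemp / mean_relief, 2))   # 0.14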
e The log-log transformation of the OLS regression is designed to assist comparison of the coefficients with the non-linear model of fertility that is introduced later in this section. The coefficients are reported in columns (1) and (4) of table 15.2.
f The feature that makes the model non-linear occurs in the error term in Boyer's equation (2), p. 162.
The coefficient on INCOME indicates an elasticity of 0.44, a rise in income of 10 per cent resulting in a rise in BRTHRATE of 4.4 per cent. Lack of accommodation and high population density both had a negative effect. The coefficient on HOUSING of −0.28 indicates that an increase of 10 per cent
in the ratio of families to the number of houses would account for a fall in
BRTHRATE of 2.8 per cent; and a 10 per cent increase in DENSITY would
result in a reduction of 1.0 per cent.
These cross-section results show that child allowances, INCOME,
HOUSING, and DENSITY all had a statistically significant impact on the
birth rate, but – as always – it is also necessary to consider whether these
factors were historically important. In the present context that can be done
by ascertaining what their effect was in relation to the rise in the crude birth
rate during the late eighteenth and early nineteenth century (pp. 167–71).
For this purpose Boyer assumes that there were no child allowances at
the beginning of this period. To obtain an estimate of their effect at the end
of the period he constructs an average across the different regions in
England in 1824. The parishes in each region were subdivided into four
categories: those not giving child allowances, and those giving allowances
beginning at three, four, or five or more children. The impact of child
allowances for each category is assumed to be the same in each region as in
the southern parishes covered by the regression (zero for those not paying
allowances, and 0.25, 0.17, and 0.17, for the other three categories). The
regions were then weighted according to their population in 1821. The
final result is a weighted average of 0.142. In other words, for England as a
whole, the payment of child allowances caused birth rates to be 14.2 per
cent higher in 1824, other things being equal, than they would have been in
the absence of this form of relief (p. 168).
For the other explanatory variables Boyer obtains estimates of the per-
centage change over the four decades from 1781 to 1821, and multiplies
these by the elasticities from the regression results for (15.6). The results
are summarized in table 15.3. The table indicates that the five variables would have caused the birth rate to rise by between 5.0 and 7.8 per cent. If child
allowances had not been adopted the birth rate would have declined by
between 6.4 and 9.2 per cent. According to Wrigley and Schofield the actual
increase over this period was 14.4 per cent.g
As Boyer concludes, ‘the early-nineteenth-century increase in birth
rates cannot be understood without taking child allowance policies into
account’ (p. 171). Similarly, the abolition of child allowances following the
introduction of the New Poor Law in 1834 was a major reason why birth
g E. A. Wrigley and R. S. Schofield, The Population History of England, 1541–1871, Edward Arnold, 1981, p. 529.
Table 15.3 Impact of child allowances and other factors on the increase in the English
birth rate, 1781–1821
Notes:
(1) From Boyer, Poor Law, p. 170;
(2) From table 15.2, column (4);
(3) = (1) multiplied by (2).
rates remained broadly stable between 1831 and 1851, despite a more rapid
increase in workers’ real incomes.
[Figure 15.1 Departure rates of children, by age in 1850: horizontal axis, age in 1850 (5–6 to 21–22); vertical axis, percentage leaving home within the next 10 years]
Steckel matched the names of household heads in the two years to determine which of the children who were listed in the census in 1850 were not present in the household 10 years later. This
allowed Steckel to determine the probability that a child of a given age
would leave the family home within the next 10 years.
The complexity of American migration patterns made it impossible to
match all households from 1860 to 1850. Therefore, Steckel’s data set is not
a random sample of all US families, but rather focuses on households that
were already established in 1850 and for which the household head sur-
vived until 1860. It excludes all households established between 1850 and
1860, as well as those in which all the children left before 1860.
Nonetheless, consideration of other evidence leads Steckel to conclude that
his results are representative of the circumstances governing the decision to
leave the family home in this period.
The calculated departure rates are depicted in figure 15.1.8 The graph
shows the proportion of children of a given age in 1850 who had left the
household by 1860. The horizontal axis is the age of the children in 1850; the
vertical axis shows the departure rate over the next 10 years. It is important
to emphasize that Steckel’s data do not mean that 15.7 per cent of 5–6-year-
old boys were leaving the family home in the 1850s, but rather that 15.7 per
cent of those aged 5–6 in 1850 left home sometime over the next 10 years. The
graph shows the departure rates of children aged from 5 to 22 in 1850.9
The graph shows a broadly sigmoid (or logistic) pattern of departures
for both sexes. The leaving rate is low for 5–6-year-olds, rises steeply
through the early teenage years, before reaching a high plateau for children
over 17. It is evident from the graph that younger girls tended to leave the
family home faster than boys, and also that women who remained in the
household after 18 were less likely to leave than men of the same age.
Similarly, there are three dummy variables for father’s birthplace, for
which there are four categories (the three listed above and the omitted cat-
egory, born in the United States). A child with a US-born father would have
0 values for each ethnic dummy; a child with an English-born father would
record 1 for BORNENG and 0 for the others; and similarly for the two
remaining categories. There are four regional variables and three dummies
(with Northeast being omitted); and there are three urban variables and
two dummies (with rural as the omitted category). In each case, the
omitted variables are in the constant term, and in each case, the value of the
coefficient on the dummy variable is relative to the value of LEAVE for the rel-
evant omitted variable.
Steckel divided his sample by sex into groups aged 5–11 and 12–18 in
1850.10 There are over 1,000 observations for the younger age categories
and 579 observations for the older. The sample was divided evenly between
girls and boys. The results for each of these groups are shown in table 15.4.
h The mean value of LEAVE corresponds to the mean value of P(Yi = 1) in the terminology employed in chapter 13, and the mean value of stayers corresponds to P(Yi = 0).
Table 15.4 The age at leaving home in the United States, 1850–1860, explaining the
probability of departure, boys and girls aged 5–11 and 12–18
Source: Richard H. Steckel, ‘The age at leaving home in the United States, 1850–1860’, Social
Science History, 20, 1996, pp. 521–2.
i Steckel's estimate of the elasticity is 1.582, rather than the 1.589 cited here. The difference is very slight and is almost certainly the result of some rounding on either his part or ours.
As table 15.4 shows, most of the elasticities associated with a unit change in the explanatory variables are small. As Steckel notes (p. 520) the t-statistics and other tests indicate that socio-economic variables were more powerful influences on the younger children than on those in the older age group.
The most important single influence on the timing and frequency of
leaving home was clearly the age of the child. The large elasticity of AGECHILD suggests that a difference in the child's age of 1 year in 1850
increased the probability of leaving the home by almost 1.6 per cent for
boys, and by as much as 2.4 per cent for young girls.
Yet, after holding this influence constant, some of the hypotheses set out
in §15.2.1 received consistent support from the model. Thus, the
coefficient on the number of young children in the household (YNGER-
FAM) was large and significant for all cases except the older boys, suggest-
ing that younger children tended to push older ones out of crowded homes.
The age of the mother (AGEM) was also consistently significant, and
inversely related to the decision to leave home: an additional year of age of
the mother reduced the probability of leaving by about 1 per cent for both
age-groups of boys and for young girls. If this is a proxy for birth order, the
result is consistent with evidence from other societies that later-born chil-
dren tended to remain with the family longer.
Older children might be forced out by younger siblings, but the oldest
tended to stay behind, perhaps in their capacity as heirs apparent to the
family estate. Similarly, father’s occupation clearly had a strong influence
on departure from the home, especially for younger children. The
coefficients for all four occupations are positive, indicating that farm chil-
dren (the control category) were generally least likely to leave home. For
young boys the chances of leaving were markedly higher for the sons of
blue collar workers, and lowest for those in unskilled and ‘other’ occupa-
tions; while the young daughters of unskilled workers were among the least
likely to stay. For older children, the t-statistics on the occupation dummies
for the father’s occupation are uniformly low, indicating that this was not a
significant influence on their decision to leave home.
Households resident on the American frontier were more likely to see
their children leave, reflecting perhaps a combination of greater opportu-
nities and a greater awareness of the advantages of going further west.
Some part of the frontier effect can be attributed to other features of that
area. As Steckel observes (p. 523) the frontier was less urbanized, and fami-
lies living there had younger women as mothers, and had more children
under 10 per family, than those in the Northeast. When these characteris-
tics are controlled for, the expected chances of departure were higher for
young boys on the frontier than in the Northeast (the control category) or
the South, and were highest in the north-central states. For young girls the
rate of departure was highest on the frontier, while for older girls it was
higher there than in the northeast and north-central states, but even higher
in the South.
Some of the other hypotheses received at best weak or inconsistent
support from the logit results. Thus, living in an urban community clearly
reduced the rate of leaving home relative to being in a rural setting for all
categories, although the elasticities were small and only one coefficient was
statistically significant. Similarly, the impact of the mother’s absence from
the household had a measurable effect only on the departure rate of young
boys. The wealth of the household, as proxied by the value of its real estate,
had little systematic impact on the decision to leave home; parental literacy
and ethnicity were likewise unimportant. Previous relocation decisions
also counted for little.
With so many variables being insignificant (in the regression on older
boys, only two variables plus the constant are significant at the 10 per cent
critical value), it might be asked whether the model as a whole has statisti-
cal power. The value of −2logλ given at the base of table 15.4 tests the proposition, by providing a test of whether the logit regression is a better
predictor of the odds ratio than a model in which the proportion of leavers
is constant across all households (see §13.2.7). It is thus a joint test of the
statistical significance of all the explanatory variables. The null hypothesis
is that the explanatory variables have no effect on the odds ratio.
This statistic is distributed as a chi-square distribution with (k − 1) degrees of freedom, where k is the total number of regressors, including the constant. Steckel observes that the df for this test is 21.

For the sample of boys, 12–18, the baseline log-likelihood is log L0 = −374.468; the log-likelihood of the logit model is log L1 = −333.442. The test statistic is thus 2 × (−333.442 − (−374.468)) = 82.05. The 1 per cent critical value of chi-square with 21 df is 38.93, indicating that we may safely reject the null hypothesis that the logit specification produced no added
value. It is easily seen from table 15.4 that the logit specifications for the
other samples also pass muster.
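The statistic can be reproduced in a few lines of Python, using the log-likelihoods quoted above:

    from scipy import stats

    log_l0, log_l1 = -374.468, -333.442      # baseline and logit log-likelihoods
    lr = 2 * (log_l1 - log_l0)               # likelihood-ratio statistic, 82.05
    crit = stats.chi2.ppf(0.99, df=21)       # 1 per cent critical value, 38.93
    print(lr > crit)                         # True: reject the null hypothesis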
In chapter 13, we discussed the relative merits of the logit and probit
specifications of models with dichotomous dependent variables. As a
comparative exercise, we ran the probit regression using the Steckel data
set for boys aged 12–18. The two models produce almost identical results,
allowing for the systematic difference in the size of the coefficients
explained in §13.4.1. The coefficient values in the logit formulation are
about 1.81 times as large as in the probit regression (the ratio is not precise
because of the slight difference in the shape of the two distributions and
the non-linearity of the specifications); the t-statistics are almost identi-
cal. The value of the −2logλ test is very similar in both specifications, as
are the log-likelihood value and the two goodness-of-fit tests introduced
in §13.2.7: the pseudo R2 and McFadden’s R2. All in all, it appears that
there is little to distinguish between the probit and logit specifications of
this particular model.
a Substitute (15.8) and (15.9) in (15.10) and collect all the terms in NNP on the left-hand side:

NNP = [b1NNP] + [b2NNP + b3NNPt−1] + GOV
NNP − b1NNP − b2NNP = b3NNPt−1 + GOV
(1 − b1 − b2)NNP = b3NNPt−1 + GOV

NNP = [b3/(1 − b1 − b2)]NNPt−1 + [1/(1 − b1 − b2)]GOV   (15.11)

CON = b1 {[b3/(1 − b1 − b2)]NNPt−1 + [1/(1 − b1 − b2)]GOV}
    = [b1b3/(1 − b1 − b2)]NNPt−1 + [b1/(1 − b1 − b2)]GOV   (15.12)

INV = b2 {[b3/(1 − b1 − b2)]NNPt−1 + [1/(1 − b1 − b2)]GOV} + b3NNPt−1
    = {[b2b3 + b3(1 − b1 − b2)]/(1 − b1 − b2)}NNPt−1 + [b2/(1 − b1 − b2)]GOV
    = [b3(1 − b1)/(1 − b1 − b2)]NNPt−1 + [b2/(1 − b1 − b2)]GOV   (15.13)
We now have three reduced-form equations, (15.11)–(15.13), each of which has only the exogenous variables GOV and NNPt−1 as explanatory variables. The equations all have very complicated regression coefficients.
Of course if we estimate the equations we will not see these complicated
coefficients, with separate results for b1, b2, and b3. What we will get, for
example if we estimate (15.13), is simply a regression coefficient for each
explanatory variable; we will call them respectively c1 and c2:
INV = c1NNPt−1 + c2GOV   (15.13a)

where c1 measures both the direct effect of NNPt−1 on INV (equivalent to b3 in (15.11) in the full simultaneous-equations model) and also the combination of all the indirect effects involving b1 and b2. There are corresponding equations for (15.11) and (15.12):
c2 = b2/(1 − b1 − b2)

c4 = 1/(1 − b1 − b2)

so that

c2/c4 = b2   and   c6/c4 = b1

Similarly

c3 = b3/(1 − b1 − b2)

and since we now know c3, b1, and b2, it is possible to work out what b3 must be.
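The algebra of the panel reduces to a short function. A sketch, assuming estimates of the reduced-form coefficients c2, c3, c4, and c6 as defined above:

    def structural_from_reduced(c2, c3, c4, c6):
        """Recover b1, b2, b3 from the reduced-form coefficients."""
        b1 = c6 / c4    # c6 = b1/(1 - b1 - b2), c4 = 1/(1 - b1 - b2)
        b2 = c2 / c4    # c2 = b2/(1 - b1 - b2)
        b3 = c3 / c4    # c3 = b3/(1 - b1 - b2)
        return b1, b2, b3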
Notes
1 Mark Blaug, 'The myth of the old poor law and the making of the new', Journal of Economic History, 23, 1963, pp. 151–84; Anne Digby, Pauper Palaces, Routledge & Kegan Paul, 1978.
2 It was necessary for Boyer to eliminate UNEMP because per capita expenditure on relief is equal to the product of (a) the rate at which relief is paid, and (b) the frequency with which it is paid, and the latter factor is obviously correlated with the unemployment rate. This correlation makes it impossible to model the independent effect of the generosity of relief payments on unemployment. See further, Boyer, Poor Law, p. 133.
3
Although Boyer does not report the R2 from the 2SLS regressions, they are broadly
similar to the OLS estimates (0.325 for RELIEF, 0.362 for INCOME).
4
These are the values as reported after rounding to 2 decimal places. To reproduce the
results as published by Boyer we need to work to 3 decimal places for the coefficient
in the simultaneous-equation model for INCOME, i.e. with a coefficient of £0.065
rather than £0.07, obtained by replicating the regression. The other three regression
coefficients are not sensitive to the additional decimal place.
5
J. P. Huzel, ‘Malthus, the Poor Law, and population in early nineteenth century
England’, Economic History Review, 22, 1969, pp. 430–52; and Huzel, ‘The demo-
graphic impact of the old poor law: more reflexions on Malthus’, Economic History
Review, 33, 1980, pp. 367–81.
6 In the first model the coefficient on INFTMORT is negative though not statistically
significant, while in the second it is positive and highly significant. Boyer regards this
as confirmation of the existence of a spurious negative relationship between
BRTHRATE and INFTMORT in (15.6).
7 In order to facilitate matching, Steckel sampled only households with at least one
native-born child over 10 years old in 1860. Steckel’s technique for matching was to
look for households by the name of the household head in the two years. In cases
where families stayed in the same state, this would be relatively simple, since there are
indexes of the names of household heads for each state in 1850 and 1860. For those
households that could not be located in the same state in 1850, Steckel was able to use
the census manuscript information on the birthplace of all family members. The
state in which any child aged 10 or above was born would be a good clue to the house-
hold’s place of residence in 1850; Steckel then consulted the name index of house-
hold heads for that state in 1850 to effect a match.
8 Steckel had to make one final correction in his calculation of average departure rates.
The absence of a child in the household in 1860 could be caused by mortality as well
as migration. Steckel used information on age-specific mortality rates to estimate the
number of absences due to death at each age. The expected death toll was then sub-
tracted from the overall number of absences by age to generate departure rates. The
figures shown in figure 15.1 are for departures without correction for mortality.
9 Steckel excluded children under 5 in 1850, since their departure was probably domi-
nated by premature mortality; he also excluded children over 22, because there were
too few of them to generate statistically reliable data. Married children were also
excluded from the sample.
10 The separation by gender follows from the hypotheses in §15.2.1. The separation into age groups is designed to minimize the problem of not being able to separate out mortality from voluntary departure in the dependent variable. Age-specific
mortality rates are stable between the ages of 12 and 18, suggesting that the bias will
be small. For younger children, however, the problem may be more severe. If we treat
this as a stochastic error in the measurement of the dependent variable, this will have
no effect on the coefficients in table 15.4 (see §11.4.1). The implicit assumption in
Steckel’s model is that child mortality is not systematically related to the explanatory
variables such that it disguises the true relationship between voluntary departure
and the independent variables. This assumption receives support from the relative stability of the coefficients and elasticities in the two age groups.
The following sections provide a brief description of the four data sets that
are referred to throughout the book and form the basis for the case studies
discussed in chapters 14 and 15. The original works should be consulted for
further discussion of the historical aspects and for more detailed information
about the sources. The data sets are not reproduced in this book but can be
easily accessed without charge via a special page on the Cambridge University
Press web site: <http://uk.cambridge.org/resources/0521806631/>
Some of the series are entered in the regression models in the form of
logarithms or as first differences, but all series are given in the data sets in
their original form, and if any manipulation or transformation is required
this must be done by users of the data set.
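Such transformations are one-line operations in most statistical packages. A minimal Python sketch using the pandas library; the file name and the column name are placeholders, not those of the actual downloads:

```python
import numpy as np
import pandas as pd

# Placeholder file and column names: substitute those of the data set in use
df = pd.read_csv("dataset.csv")

df["LOG_NNP"] = np.log(df["NNP"])   # logarithm of a series
df["D_NNP"] = df["NNP"].diff()      # first difference (missing for the first year)
```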
a George R. Boyer, An Economic History of the English Poor Law, Cambridge University Press, 1990.
Table A.1 Data set for investigation of relief payments in England in 1831
(b) Series for which the answer for each parish is either ‘yes’ (recorded as 1) or
‘no’(recorded as 0)
COTTIND Does cottage industry exist in the parish?
ALLOTMNT Do labourers have allotments of farm land?
WORKHSE Is there a workhouse in the parish?
CHILDALL Does the parish pay child allowances?
SUBSIDY Does the parish subsidize the wage rates of privately employed
labourers?
LABRATE Does the parish use a labour rate?*
ROUNDSMN Does the parish use the roundsman system?**
Notes:
*  Under the labour rate unemployed labourers were apportioned among all occupiers of
property according to ‘the extent of occupation, acreage rent or number of horses
employed’.
** Under this system unemployed labourers were sent round the ratepayers in the parish
with the expectation that they would be offered work.
Table A.2 Data set for investigation of the birth rate in southern England,
c. 1826–1830*
(b) Series for which the answer for each parish is either ‘yes’ (recorded as 1) or
‘no’(recorded as 0)
CHILDAL3 Does the parish begin payment of child allowances at three
children?
CHILDAL4 Does the parish begin payment of child allowances at four
children?
CHILDAL5 Does the parish begin payment of child allowances at five or more
children?
Notes:
* In addition the following four series listed in table A.1 are also used in the study of birth
rates: INCOME, DENSITY, ALLOTMNT, and COTTIND.
population this was more than double the rate of emigration from any
other European country. The overwhelming majority went to the United
States, but smaller numbers moved to Great Britain and a few also went to
Canada, Australia, and New Zealand. In an article published in 1993
Timothy Hatton and Jeffrey Williamson explored several features of this
emigration.b They investigated the trend over time, the annual fluctua-
tions, and the possible characteristics within each county of Ireland that
influenced the outflow, for example, religion, age structure, dependence
on agriculture, and living standards.3
The authors compiled two basic data sets for this purpose. One consists
of annual series for 1877–1913, the period for which the emigration and
other data are thought to be reasonably reliable. The five basic series are
listed in the top panel of table A.3; for some purposes these series are also
subdivided by the country which received the emigrants or by gender; or
they are used with various modifications, for example by taking the change
in a series rather than its actual level, or by taking logs.
b Timothy J. Hatton and Jeffrey G. Williamson, ‘After the famine: emigration from Ireland,
1850–1913’, Journal of Economic History, 53, 1993, pp. 576–600.
Table A.3 Data set for investigation of emigration from Ireland, 1877–1913
(b) Data for each county for each of the four census dates, 1881–1911
CNTYMIG The proportion per 1,000 of the county population that
emigrated
AGE The proportion of the county population aged 15–34
URBAN The proportion of the county population living in towns of 2,000
or more
AGRIC The proportion of the county male labour force in agriculture
LANDHLDG The proportion of the county agricultural holdings less than 5
acres
CYWRATIO The ratio of the foreign real wage (defined and weighted as above)
to the domestic real wage in agriculture in the county
Notes:
* Emigration from Ireland to Britain is believed to be seriously under-stated, so Hatton and
Williamson construct an adjusted total using double the recorded flow to Britain.
The second set provides data for each of the 32 counties at four census
dates (1881, 1891, 1901, and 1911) for 11 items, listed in the lower panel of
the table.4
that the persistently high level of unemployment ‘was due in large part . . .
to high unemployment benefits relative to wages’. Apart from the severe
crises of 1921 and 1930–2, the high unemployment of other inter-war
years ‘was the consequence almost solely of the dole. The army of the
unemployed standing watch in Britain at the publication of the General
Theory was largely a volunteer army’.d
The insurance scheme that they highlighted had first been introduced in
Britain in 1911, but only for a very small proportion of the workforce. An
Act of 1920 extended coverage to almost all manual workers over the age of
16 (the two main exceptions were agricultural workers and domestic ser-
vants – both relatively stable sectors with low unemployment) and pro-
vided much more generous levels of benefit. There were numerous
subsequent changes, but Benjamin and Kochin argue that the level of the
benefits – together with various institutional features of the system – made
the inter-war scheme ‘more generous relative to wages than ever before or
since’. In their view it was primarily this generosity that explains why
unemployment was so much higher than in earlier or later periods.
Their paper provoked a flurry of historical and statistical criticisms, but
most historians would now recognize that the benefit system did make
some contribution to the severity of inter-war unemployment, though not
by as much as Benjamin and Kochin had originally claimed.5
Their data set consisted of four annual primary series for the years
1920–38: UNEMP, WAGES, BENEFITS, and NNP.6 They are defined in table
A.4; all are taken from published sources.7 From these they constructed two
further series. BWRATIO is the replacement rate, and is a measure of the gen-
erosity of BENEFITS relative to WAGES (i.e. of the extent to which they are a
replacement for wages). DEMAND is a measure of the effects of changes in
aggregate demand calculated as the ratio of NNP to NNP*, its trend value.e
d The phrases quoted are from Benjamin and Kochin, p. 474; the reference to the General Theory is
to the book which the famous economist, John Maynard Keynes, published in 1936 to explain
his theory of the nature and causes of unemployment.
e The ratio of output to its trend is included in the regression model as the difference between the
log of actual output (log NNP) and the log of its trend value (log NNP*). The use of logs in this
way as a means of measuring proportions was explained in §1.6.3. The log-linear trend value for
NNP* was estimated by using the procedure explained in §12.4.
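The construction of the two derived series can be sketched as follows. This is a minimal illustration with invented figures, and ordinary least squares on a time trend stands in for the trend-fitting procedure of §12.4:

```python
import numpy as np
import pandas as pd

# Invented figures for illustration, not the actual Benjamin-Kochin series
df = pd.DataFrame({
    "WAGES":    [50.0, 51.0, 52.0, 53.0, 54.0],
    "BENEFITS": [20.0, 22.0, 23.0, 24.0, 25.0],
    "NNP":      [100.0, 97.0, 101.0, 104.0, 108.0],
})

# BWRATIO: the replacement rate, benefits relative to wages
df["BWRATIO"] = df["BENEFITS"] / df["WAGES"]

# DEMAND: log NNP minus log NNP*, where NNP* is a fitted log-linear trend
t = np.arange(len(df))
slope, intercept = np.polyfit(t, np.log(df["NNP"]), 1)
df["DEMAND"] = np.log(df["NNP"]) - (intercept + slope * t)
print(df)
```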
torical answers to these questions Richard Steckel extracted data from the
manuscript schedules for the censuses of the United States for 1850 and
1860.f His analysis dealt with two issues: the average age at which the chil-
dren left home and the factors that influenced them to make this decision,
but our concern here is only with the data and procedures relating to the
latter issue.
To obtain his data set Steckel initially drew from the manuscript sched-
ules for the 1860 Census a random sample of households with at least one
native-born child aged 10 or above still living at home. Those same house-
holds were then identified in the 1850 Census, using the state in which the
child was born as a pointer to locate where the family had been enumerated
in 1850. This information was obtained for 1,600 male-headed households,
and these yielded a national sample of unmarried children aged 5–29 who
were recorded as living at home in 1850.
It was thus possible to determine from these matched samples of house-
holds which of the children living at home in 1850 were still there a decade
later. To investigate what might have influenced the decisions of those who
had departed, Steckel assembled information on a range of socio-
economic and demographic characteristics of the father, the mother, the
f Richard H. Steckel, ‘The age at leaving home in the United States, 1850–1860’, Social Science
History, 20, 1996, pp. 507–32. The manuscript schedules are the original household returns from
which the published census reports were compiled.
Table A.5 Data set for investigation of decisions to leave home in the United States,
1850–1860
Note:
The terms dummy variable and control category are explained in §10.1.1.
child, and the household. All this information relates to the position in
1850 and was extracted from the Census schedules. The full list of charac-
teristics is set out in table A.5.
The results are reported for 1,714 boys and 1,617 girls in two age groups,
5–11, and 12–18. In order to keep the working data set for this textbook to a
manageable size we have reproduced only the information relating to 579
boys aged 12–18, but the identical procedures are used for the younger boys
and for the girls, and the results for the whole sample are discussed in
chapter 15.
It should be noted that the children are subdivided according to their
age in 1850, so that there is some overlap between the two groups in respect
of the age at which they eventually left home. The children aged 5–11 could
have departed at any age from 6 (a 5-year-old who left in 1851) to 21 (an
11-year-old who left shortly before the census was taken in 1860). The cor-
responding range of age at departure for those in the group aged 12–18
would be 13–28. Steckel estimates that the mean age at which children left
home for the sample as a whole was 26.5 years for males and 22 years for
females.g
g Steckel, ‘Leaving home’, p. 517.

Notes
1 For other quantitative studies of English poor relief see Mark Blaug, ‘The myth of the
old poor law and the making of the new’, Journal of Economic History, 23, 1963, pp.
151–84; J. P. Huzel, ‘Malthus, the poor law, and population in early nineteenth
century England’, Economic History Review, 22, 1969, pp. 430–52 and ‘The demographic impact of the old poor law: more reflexions on Malthus’, Economic History Review, 33, 1980, pp. 367–81; and G. S. L. Tucker, ‘The old poor law revisited’,
Explorations in Economic History, 12, 1975, pp. 233–52.
2 The detailed replies from parishes were printed in four large volumes as Appendix B
of the H. M. Commissioners for Inquiry into the Administration and Practical
Operation of the Poor Laws, Report, Appendix B1, Answers to Rural Questions (H.C.
44), Parliamentary Papers, 1834, XXX–XXXIII; for the 1831 Census of Population of
England and Wales see Parliamentary Papers, 1833 (H.C. 149), XXXVI–XXXVIII.
The data on the number of births and infant deaths listed in table A.2 were extracted
by Boyer from unpublished parish returns for the 1831 Census located in the Public
Record Office (PRO, HO 71).
3 For good general discussions of the issues posed in analysis of emigration see J. D.
Gould, ‘European inter-continental emigration, 1815–1914: patterns and causes’,
Journal of European Economic History, 8, 1979, pp. 593–679; Gould, ‘European inter-
continental emigration: the role of diffusion and feedback’, Journal of European
Economic History, 9, 1980, pp. 267–315; and Dudley Baines, ‘European emigration,
1815–1930: looking at the emigration decision again’, Economic History Review, 47, 1994, pp. 525–44. For some more recent quantitative studies see Timothy J. Hatton
and Jeffrey G. Williamson (eds.), Migration and the International Labour Market,
1850–1939, Routledge, 1994.
4 The main source for the annual migration series was Imre Ferenczi and Walter F.
Willcox, International Migrations: vol I, Statistics, NBER, 1929. The annual home
and foreign real wage series were compiled by Jeffrey G. Williamson, ‘The evolution
of global labor markets in the first and second world since 1830: background and evi-
dence’, NBER Working Papers on Historical Factors in Long-Run Growth, 36, NBER,
1992. The unemployment series are from J. R. Vernon, ‘Unemployment rates in
post-bellum America, 1869–1899’, University of Florida, 1991, manuscript; Charles
H. Feinstein, National Income, Expenditure and Output of the United Kingdom,
1855–1965, Cambridge University Press, 1972; Wray Vamplew (ed.), Australian
Historical Statistics, Fairfax, Syme & Weldon, 1987, p. 153; and a series specially con-
structed by Hatton and Williamson for Canada. The county data were taken mainly
from various volumes of the decennial Census of Ireland, Parliamentary Papers, or
the Emigration Statistics of Ireland, Parliamentary Papers, for the four dates.
5 The main initial contributions to the debate were Michael Collins, ‘Unemployment
in interwar Britain: still searching for an explanation’, Journal of Political Economy,
90, 1982, pp. 369–79; David Metcalfe, Stephen J. Nickell and Nicos Floros, ‘Still
searching for an explanation of unemployment in interwar Britain’, Journal of
Political Economy, 90, 1982, pp. 386–99; P. A. Ormerod and G.D.N. Worswick,
‘Unemployment in inter-war Britain’, Journal of Political Economy, 90, 1982, pp.
400–9; and Daniel K. Benjamin and Levis A. Kochin, ‘Unemployment and unem-
ployment benefits in twentieth-century Britain: a reply to our critics’, Journal of
Political Economy, 90, 1982, pp. 410–36. For later quantitative work on this topic see
also Timothy J. Hatton, ‘Unemployment benefits and the macroeconomics of the
interwar labour market’, Oxford Economic Papers, 35 (Supplement), 1983, pp.
486–505; and Barry Eichengreen, ‘Unemployment in interwar Britain: dole or dol-
drums?’, in N. F. R. Crafts, N. H. D. Dimsdale and S. Engerman (eds.), Quantitative
Economic History, Oxford University Press, 1991, pp. 1–27.
6 Benjamin and Kochin also tested their hypothesis with additional series for juvenile
unemployment (for 1924–35) and for male and female unemployment (for
1923–37), but these series are not reproduced in the data set.
7 Wages are from Agatha L. Chapman, Wages and Salaries in the United Kingdom,
1920–1938, Cambridge University Press, 1953; benefits from Eveline Burns, British
Unemployment Programs, 1920–1938, Social Science Research Council, 1941; unem-
ployment from official Ministry of Labour statistics reproduced in Board of Trade,
Statistical Abstract of the United Kingdom, HMSO, 1936 and 1939, and net national
product from Feinstein, National Income.
Index numbers
In §1.7 we noted that many of the series used in quantitative analysis of his-
torical trends and fluctuations take the form of index numbers of the
changes in either prices or quantities, and gave a few examples of controver-
sies in which they have figured prominently. The aim of the present appen-
dix is to outline the main principles involved in the construction of index
numbers. Slightly more formal presentation of the principal concepts and
definitions is also given in panels B.1 and B.2 for those comfortable with the
algebra.
It may help to clarify the nature of a proper index number if we refer first
to price and quantity relatives. These are not index numbers, although
they look exactly like them because they also have one item in the series
shown as 100. However these relatives are merely series that have been con-
verted from an absolute to a relative basis. All that this involves is choosing
some particular year for which the value of the series is taken as 100, and
then expressing all the other items in the series as a ratio to the value in the
chosen year. The advantage of doing this is that it makes it easier to see the
relationship between values; fewer digits are required and the comparative
dimension is more readily grasped.
To illustrate this simple procedure, data on the quantity of bituminous
coal produced in the United States each year from 1900 to 1913 are given in
column (1) of table B.1. The corresponding quantity relative is calculated
in column (2) with 1900 as 100 and in column (3) with 1913 as 100. Both
relatives are rounded to one decimal point. In the same way a price relative
could be constructed from data on (say) the annual average price of a
specific grade of raw cotton at New Orleans in cents per lb.
The important point to note about such relatives is that the relationship
between any pair of years is completely unaffected by either the switch to
Table B.1 Bituminous coal output in the United States, 1900–1913, original data
and two quantity relatives
Source: US Department of Commerce, Historical Statistics of the United States, series G13,
Washington, DC, 1949.
the relative form or the choice of year to take as 100. For example, the
output of coal in 1910 is always 60.3 per cent higher than the output in
1902, whichever column is used for the computation. Relatives are useful,
but they do not provide any additional information that is not already con-
tained in the original series. The distinguishing feature of a proper index number is that it does exactly that: it provides additional information that cannot be obtained from any of its constituent series taken on its own.
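The point is easy to verify with a few lines of Python; the output series here is invented rather than copied from table B.1:

```python
# Invented annual output series for 1900-1913 (cf. table B.1)
output = [212, 225, 260, 283, 279, 315, 342, 394, 332, 379, 417, 405, 450, 478]

rel_1900 = [100 * q / output[0] for q in output]    # relative with 1900 = 100
rel_1913 = [100 * q / output[-1] for q in output]   # relative with 1913 = 100

# The ratio between any pair of years, here 1910 (index 10) and 1902 (index 2),
# is the same whichever base year is used
print(rel_1900[10] / rel_1900[2])
print(rel_1913[10] / rel_1913[2])
```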
ucts. In principle, a separate series should be included for every good pro-
duced in the United Kingdom in these years. However, that is not normally
practicable, and instead a representative sample of products would be
selected.
How then should the output data for these products be combined? It
would clearly not be sensible to simply add together a mishmash of figures
for tons of coal, yards of cotton cloth, gallons of beer, and so on. It is neces-
sary to find some way of combining the data that circumvents the difficulty
created by the heterogeneity of the units in which the quantities are meas-
ured. Furthermore, there were enormous differences in the rates at which
individual industries increased production over this period. To take an
extreme example, the increase in the output of cotton yarn was a staggering
6,500 per cent, whereas that for beer was only a paltry 43 per cent. It is
therefore essential to combine the series in such a way that the overall index
reflects the relative importance of the various industries, and this requires
some definition of ‘relative importance’ and some system of weighting the
components.
There are a number of ways in which these requirements can be
satisfied, but in practice only two main types of index number are widely
used, and we will concentrate on these. One type uses the prices of a fixed
year to convert the component quantities to comparable values and simul-
taneously to weight them. This is known as a Laspeyres quantity index.
The year used for the weights is referred to as the base year.
The other type uses a changing set of prices to value the quantities and is
known as a Paasche quantity index. In both cases a weighted arithmetic
mean (see §2.2.2) is used to combine the different quantities. The year
shown as 100 for the index is called the reference year and, as we shall see,
this may not be the same as the base year.
Each of these quantity indices has its corresponding price index. In the
Laspeyres price index the quantities of a fixed base year are used to weight
the component prices. In the Paasche price index a changing set of quan-
tities is used to weight the prices. The price and quantity indices are thus
perfectly symmetrical.
In order to illustrate some of the key features of these two types of
index number we have invented a very simple data set in which there are
only two commodities, cotton cloth and beer. Table B.2 sets out all the rel-
evant information on the prices and quantities of these two goods for
the three years, 1900, 1901, and 1902. The quantities produced in each year are assumed to be identical to the amounts consumed, and the product of the prices and quantities in columns (3) and (6) can be interpreted as the value at current prices of the expenditure on the individual products.1
Table B.2 Illustrative data for construction of price and quantity index numbers

                    Cotton cloth                      Beer
        Quantity   Price   Value      Quantity   Price   Value     Total value
Year       (1)      (2)     (3)          (4)      (5)     (6)          (7)
1900        1       10       10           2       15       30           40
1901        5       11       55           3       18       54          109
1902       10       12      120           4       20       80          200

Notes:
(3) = (1) × (2); (6) = (4) × (5); (7) = (3) + (6).
Table B.3 Price and quantity indices for the data in table B.2 (1900 = 100)

             (1)            (2)           (3)        (4)        (5)
          Laspeyres      Laspeyres     Paasche     Fisher    Divisia
        (1900 weights) (1902 weights)              Ideal

Price indices
1900      100.0          100.00        100.00     100.00     100.00
1901      117.5          113.75        114.74     116.11     116.12
1902      130.0          125.00        125.00     127.48     127.73

Quantity indices
1900      100.0          100.00        100.00     100.00     100.00
1901      237.5          230.77        231.92     234.69     236.25
1902      400.0          384.62        384.62     392.23     394.06
1900:  [(10 × 1) + (15 × 2)] / [(10 × 1) + (15 × 2)] × 100 = (10 + 30)/40 × 100 = 40/40 × 100 = 100.0

1901:  [(11 × 1) + (18 × 2)] / [(10 × 1) + (15 × 2)] × 100 = (11 + 36)/40 × 100 = 47/40 × 100 = 117.5

1902:  [(12 × 1) + (20 × 2)] / [(10 × 1) + (15 × 2)] × 100 = (12 + 40)/40 × 100 = 52/40 × 100 = 130.0
In each case, the first term in the brackets is the price of the commodity; the
second is the base-year quantity. It will be seen that this quantity (1 unit for
cloth, 2 units for beer) does not change.
This is the index given in column (1) of table B.3. It has a very straight-
forward interpretation: the cost of purchasing the 1900 basket of goods
would be 17.5 per cent more at 1901 prices than it was at the actual prices
prevailing in 1900, and it would be 30 per cent more at 1902 prices.
This aggregate method reveals most clearly the fundamental nature of an
index number and should be carefully studied. However, it involves working
with all the detail of the original prices and quantities, and in a real-life cal-
culation there would of course be many more than two commodities.
It is usually simpler, therefore, to use a second method, in which the
prices are first converted to price relatives with the fixed base year (in this
case 1900) as 100. These relatives are then multiplied by an appropriate
measure of the relative importance of the item to obtain a weighted average
of the price relatives. For the present example the appropriate weights are
the share of each product in total expenditure in the base year, as given by
the data for 1900 in columns (3), (6), and (7) of table B.2.
In our example, the successive relatives are 100, 110, and 120 for cloth; 100, 120, and 133.33 for beer. Because the weights do not change in the Laspeyres index it is convenient to express them as a proportion, and then use these fixed proportions in the calculations for each year.a The 1900 proportion for cloth is thus 10/40 = 0.25, and for beer it is 30/40 = 0.75. The required index with 1900 = 100 is then formed, and the results obtained are identical to those given by the aggregate method:

1901:  (0.25 × 110) + (0.75 × 120) = 27.5 + 90 = 117.5
1902:  (0.25 × 120) + (0.75 × 133.33) = 30 + 100 = 130.0
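Both routes can be mechanized directly from the table B.2 figures. A short Python sketch, computing the index first by the aggregate method and then as a weighted average of the price relatives:

```python
# Prices and base-year (1900) quantities from table B.2
prices = {1900: {"cloth": 10, "beer": 15},
          1901: {"cloth": 11, "beer": 18},
          1902: {"cloth": 12, "beer": 20}}
q0 = {"cloth": 1, "beer": 2}

def laspeyres_aggregate(year):
    # Sum(Pt x Q0) / Sum(P0 x Q0) x 100
    num = sum(prices[year][g] * q0[g] for g in q0)
    den = sum(prices[1900][g] * q0[g] for g in q0)
    return 100 * num / den

def laspeyres_relatives(year):
    # Weighted average of price relatives, using base-year expenditure shares
    exp0 = {g: prices[1900][g] * q0[g] for g in q0}   # 10 for cloth, 30 for beer
    total0 = sum(exp0.values())                       # 40
    return sum((exp0[g] / total0) * (100 * prices[year][g] / prices[1900][g])
               for g in q0)

for y in (1900, 1901, 1902):
    print(y, laspeyres_aggregate(y), laspeyres_relatives(y))
# both methods give 100.0, 117.5, 130.0
```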
The aggregate method and the weighted average of the price relatives
appear to follow different routes but the underlying arithmetic operations
are actually identical, and the two methods will always give the same result
provided that the year taken as 100 for the relatives is the same as the base
year. If it is not, the outcome is not an alternative measure of the change in
prices, it is simply wrong. The way in which failure to observe this crucial
condition will generate a meaningless result is demonstrated in the alge-
braic presentation in panel B.1.
With realistic data sets, the second method – using relatives – is gener-
ally easier to use, and is the method most commonly adopted for the con-
struction of index numbers. Because the two methods give identical
results, users of index numbers frequently refer to the shares in expendi-
ture as though these were the true weights, but this is not correct. As is
demonstrated in panel B.1, the true weights of the Laspeyres price index,
regardless of whether it is compiled by the first or the second method, are
the actual quantities in the base year.
This distinction matters for historians and other users of index
numbers, because changes in expenditure shares reflect changes in both
prices and quantities, whereas it is only the changes in quantities that are rel-
a Expressing the weights as proportions (rather than percentages) has the further advantage that
the sum of the weights is 1, and it is thus unnecessary to divide the sum of the weighted relatives
by the sum of the weights; compare (2.1a) in §2.2.2.
evant when analysing what effect different sets of weights might have on
the movement in a price index.b
The appropriate quantities to use for the weights are determined by the
nature of the price index. For example, for an index of farm prices it would
be the quantities of the different farm products sold in the base year; for an
index of the cost of living for pensioners it would be the quantities pur-
chased by pensioners in the base year; for an index of share prices the
number of shares issued by the company in the base year; and so on.
For index numbers, unlike relatives, the chosen base year is potentially a
crucial determinant of the measurement. It is thus important that the year
selected as the base for a price index should not be one in which the quan-
tities were seriously distorted by wars, strikes, or other abnormal events.
Indeed, it is quite common to base an index on an average of several years
rather than a single year, in the expectation that this will minimize any dis-
tortions in the relative quantities.
However, even when the selection is confined to broadly ‘normal’ years,
the results obtained with different base years may differ markedly because
the relative quantities used as weights will change in response to the under-
lying economic processes.
b See, for example, the discussion of alternative weights in Charles H. Feinstein, ‘Pessimism per-
petuated, real wages and the standard of living in Britain during and after the industrial revolu-
tion’, Journal of Economic History, 58, 1998, pp. 640–1.
             (1)                             (2)
             Laspeyres price index           Laspeyres quantity index

Year 0:      [ΣP0Q0 / ΣP0Q0] × 100          [ΣP0Q0 / ΣP0Q0] × 100
Year 1:      [ΣP1Q0 / ΣP0Q0] × 100          [ΣP0Q1 / ΣP0Q0] × 100
Year 2:      [ΣP2Q0 / ΣP0Q0] × 100          [ΣP0Q2 / ΣP0Q0] × 100
By running one’s eye down the left-hand side of the successive formu-
lae in column (1) it can be seen that for each year the prices of that year are
being compared with prices in the base year, always using the quantities of
the base year as weights. Similarly, on the right-hand side of the quantity
index in column (2), the quantities for each year are being compared with
quantities in the base year, using the prices of the base year as fixed
weights.
The second method of constructing the same price index involves
weighting the price relatives for each commodity by the base year share of
that commodity in total expenditure. For example, for the price indices in
years 1 and 2 (year 0 = 100) this would be
[Σ(P1/P0)P0Q0 / ΣP0Q0] × 100    and    [Σ(P2/P0)P0Q0 / ΣP0Q0] × 100
Since the P0 in the denominator of the price relative for each separate commodity would cancel out with the P0 in the expenditure share for that commodity, this will give a formula identical to the one in column (1)
above. Similarly, in the corresponding formula for a quantity relative, the
Q0 in the denominator of the relative would cancel out with the Q0 in the
expenditure share, leaving exactly the same formula as the one in column
(2) above.
The formula for the use of relatives demonstrates two important points
about index numbers. First, the fact that the formula for the second method
reduces in this way to the one used for the first method shows that the
weights for the index are actually the base year quantities, and are not the
base year expenditure shares. Secondly, it is easy to see that if the year taken
as 100 for the relatives is not the same as the base year, then the terms would
not cancel out in the formula above, and the result would be a meaningless
expression.
For example, if the index is calculated using year 0 expenditure shares
while the relatives have year 2 as 100, the result for year 1 would be
[Σ(P1/P2)P0Q0 / ΣP0Q0] × 100 = [Σ(P1P0Q0/P2) / ΣP0Q0] × 100
We thus find that with 1902 weights the change in prices between
1900 and 1902 is not a rise of 30 per cent, but of only 25 per cent.
The fact that the increase in prices in our simple example is greater with
1900 weights than with 1902 weights is not accidental. The data used for our
simple two-good economy exhibit one of the fundamental characteristics
of a market economy: an inverse relationship between relative movements
in prices and quantities. Cotton cloth shows a smaller increase in price than
beer, and a larger increase in output. Inverse relationships of this type are an
essential feature of index numbers, and one of which historians should
always be acutely aware.
They are driven by powerful forces on both the supply and the demand
sides of the economy. On the demand side, consumers generally tend to
switch their consumption in favour of those goods that are becoming rela-
tively cheaper. On the supply side, it is usually the case that the greater the
output, the lower the cost, and so products with the largest expansion in
output tend to have the smallest increase in price (the process known to
economists as economies of scale). However the process is initiated, these
two forces interact cumulatively: greater demand stimulates increased
output, and higher output reduces prices and increases demand.
For any economy that experiences this process, a price index con-
structed with a base year early in the period under review will always show a
greater increase in prices than one constructed with a late base year. This
occurs because the early-year index gives a relatively larger weight to those
products that show a relatively large increase in prices.
By contrast, by the end of the period, the production of those relatively
more expensive goods will have declined in relative importance, and the
quantity weights attached to them will be correspondingly lower than they
were at the beginning of the period. In our simple example 2 units of beer
were consumed for every 1 unit of cloth in 1900, but by 1902 the ratio was
only 0.4 units of beer for every unit of cloth.
Exactly the same tendency applies to quantity indices. A quantity index
constructed with a base year early in the period will always show a greater
increase in quantities than one constructed with a base year late in the
period. The more rapid the rate of structural change in an economy, the
more important this phenomenon will be.
One of the most striking historical examples of this effect occurred in
the period of exceptionally rapid economic change in the USSR between
1928 and 1937. At the beginning of this period the economy had relatively
few machines and other capital goods, and consequently their price was
relatively very high. By the end of the period of Stalin’s programme of
forced industrialization and collectivization, it was food and consumer
goods that were scarce and expensive relative to machinery.2 As a result of
§.
this structural shift there were marked discrepancies in the results accord-
ing to whether index numbers of output used early or late years for their
weights, and similarly for prices.
For example, a Laspeyres index of industrial production with early-year
weights gave a relatively high weight to the initially relatively expensive –
but fast-growing – machinery, and a low weight to the initially relatively
cheap – but slow-growing – consumer goods. Conversely, late-year weights
were relatively low for the machinery and relatively high for the consumer
goods. One of the most reliable estimates of the GNP of the USSR for 1928
and 1937 showed an increase of 175 per cent when measured at the prices
of 1928, compared to only 62 per cent when 1937 was taken as the base
year.3
Given the possibility of more than one answer, what should the histo-
rian do in practice? In statistical terms, both early-year and late-year
answers are equally valid and there is no justification for choosing one
rather than the other.4 Furthermore, the fact of a marked divergence
between the early-year and late-year measures is a valuable signal to the
historian that there has been significant structural change. Notwith-
standing this, it may be desirable for some purposes to have a single
measure of the change in prices or quantities, and there are a number of
possible ways in which this can be accomplished.
and the series can then be reassembled as an index by starting at 100 and
multiplying by the successive exponentials of the weighted averages.
The procedure may seem somewhat complicated, and we can illustrate
it with a Divisia price index based on our simple example. From 1900 to
1901 the growth in prices is (log 11 − log 10) = (2.3979 − 2.3026) = 0.0953 for cloth, and (log 18 − log 15) = (2.8904 − 2.7081) = 0.1823 for beer. The arithmetic means of the weights for the beginning and end of this year are 0.5 × (10/40 + 55/109) = 0.3773 for cloth and 0.5 × (30/40 + 54/109) = 0.6227 for beer. So the weighted average growth for this year is (0.3773 × 0.0953) + (0.6227 × 0.1823) = 0.1495. The exponential of this is 1.16125, or a rate of growth of 16.125 per cent.
The corresponding calculation for the growth rate from 1901 to 1902 would be (0.5523 × 0.0870) + (0.4477 × 0.1054) = 0.0952, and the exponential of this is 1.0999, or a growth rate of 9.99 per cent.
Taking 1900 as 100, the Divisia price index would thus show an increase
of 16.125 per cent on 100 to equal 116.12 for 1901; and a further increase of
9.99 per cent on this, to reach 127.73 for 1902. This gives an index (see
column (5) of table B.3) that is fractionally higher in each year (relative to
1900) than the corresponding Fisher Ideal index. As can be seen from
columns (1) and (2) of table B.3, it also falls between the indices with early-
year and late-year weights.c For the corresponding quantity indices, given
in the lower panel of table B.3, the differences are marginally larger
(because the growth of the quantities is more rapid than the growth of the
prices) but the overall pattern relative to the other indices is the same.
The Divisia indices should also provide a perfect decomposition of the
change in value into changes in price and in quantity, but because growth is
calculated over discrete intervals (rather than continuously), the decom-
position is not exact.
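The whole calculation can be written compactly in Python. The sketch below reproduces the Divisia price index of column (5) of table B.3 from the table B.2 data, using natural logarithms throughout:

```python
import math

prices = {1900: {"cloth": 10, "beer": 15},
          1901: {"cloth": 11, "beer": 18},
          1902: {"cloth": 12, "beer": 20}}
qty = {1900: {"cloth": 1, "beer": 2},
       1901: {"cloth": 5, "beer": 3},
       1902: {"cloth": 10, "beer": 4}}

def share(year, good):
    # Expenditure share of one good in one year
    total = sum(prices[year][g] * qty[year][g] for g in prices[year])
    return prices[year][good] * qty[year][good] / total

index = {1900: 100.0}
for y0, y1 in [(1900, 1901), (1901, 1902)]:
    growth = sum(0.5 * (share(y0, g) + share(y1, g))
                 * math.log(prices[y1][g] / prices[y0][g])
                 for g in ("cloth", "beer"))
    index[y1] = index[y0] * math.exp(growth)

print({y: round(v, 2) for y, v in index.items()})
# {1900: 100.0, 1901: 116.12, 1902: 127.73}
```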
Chained indices
The second way of dealing with the index number problem is to construct a
chained index which links (or splices) a succession of shorter price indices
in order to cover a longer span of time than would be appropriate with a
single base year. These successive indices form a single continuous series,
with a single reference year but a number of underlying base years. The
actual procedure for doing this is usually no more than an arithmetic
adjustment based on one or more overlapping years. In this way the new
c For the use of Fisher Ideal and Divisia indices with historical data, see N. F. R. Crafts, British
Economic Growth during the Industrial Revolution, Cambridge University Press, 1985, p. 26.
§.
Table B.4 Chained index numbers with changes in the reference year (imaginary data)
subindex is brought up to the same level as the old subindex over those
years, and the subsequent years are scaled proportionately, as illustrated in
column (3) of table B.4.
There are two main reasons for adopting this procedure. The first is that
over time it usually becomes impossible to continue the components of a
single index. Some goods either cease to be produced (horse carriages) or
their quality changes so radically (computers) that those in the market at
the end of the period are no longer comparable with those produced at the
beginning. It is then simply not possible to find market prices in the year
2000 which could be used to value directly either a carriage of the type con-
structed in 1800, or a computer of the type produced in 1950.
The second is that even when this extreme situation does not arise, a
chained index provides a means of circumventing the index number
problem by constantly revising the base year so that the quantities used are
always up to date and thus fully relevant as weights for the constituent
prices.d Each subindex is thus appropriate to its specific time period, and
the gap between the early-year and late-year weights is too short to matter.
For contemporary index numbers generated by official agencies this is now
standard practice. Researchers working with historical series may not be
d The Divisia index can be interpreted as a chained index which is rebased in every time period; in
our example every year.
able to do this if the data required are not available, but the chaining proce-
dure should be considered whenever possible.e
It is worth repeating that the year shown as 100 in a chained index will
not be the base year except for – at most – one of the subperiods. All other
subperiods in the index will have their own base year. Since, as we have
already demonstrated, the actual base year can have a substantial impact on
any measurement derived from index numbers, it is highly desirable to
identify the true base year in any long-run historical series.
The way in which chaining can conceal the true base year is illustrated
by the example given in table B.4. One imaginary index is given in column
(1) for 1900–2 with 1900 as the base year, and a second in column (2) for
1902–5 with 1902 as base year. The two indices could then be spliced in
1902 as shown in column (3). The resulting spliced index is then converted
in column (4) to an index with 1905 as the reference year.
This final procedure has not altered the pattern of price change in any
way, but it has effectively suppressed the actual base year. The good histo-
rian should always take pains to uncover the truth, so that her evaluation of
the change in the indices is not distorted by misunderstanding of the true
base year used to weight the index.
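The splicing arithmetic itself is a simple scaling on the overlap year, as this sketch with invented subindices of the table B.4 kind shows:

```python
# Two imaginary subindices: the first with 1900 as base year, the second 1902
old = {1900: 100.0, 1901: 112.0, 1902: 125.0}
new = {1902: 100.0, 1903: 108.0, 1904: 115.0, 1905: 124.0}

# Scale the new subindex to match the old one in the overlap year, 1902
scale = old[1902] / new[1902]
chained = dict(old)
for year, value in new.items():
    if year not in chained:
        chained[year] = value * scale

# Optional conversion to 1905 as reference year; the base years are unchanged
rebased = {y: round(100 * v / chained[1905], 1) for y, v in sorted(chained.items())}
print(chained)
print(rebased)
```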
1900:  [(10 × 5) + (15 × 3)] / [(11 × 5) + (18 × 3)] × 100 = (50 + 45)/(55 + 54) × 100 = 95/109 × 100 = 87.15
e For an example of such linking for a nineteenth-century series see Charles H. Feinstein,
‘Pessimism perpetuated, real wages and the standard of living in Britain during and after the
Industrial Revolution’, Journal of Economic History, 58, 1998, p. 634; and for linking of constitu-
ent series over a much longer period, Henry Phelps Brown and Sheila V. Hopkins, A Perspective
of Wages and Prices, Methuen, 1981, p. 40.
and then the comparison of 1900 and 1902 with 1902 weights:

1900:  [(10 × 10) + (15 × 4)] / [(12 × 10) + (20 × 4)] × 100 = (100 + 60)/(120 + 80) × 100 = 160/200 × 100 = 80.0
             (1)                             (2)
             Paasche price index             Paasche quantity index

Year 0:      [ΣP0Q0 / ΣP0Q0] × 100          [ΣP0Q0 / ΣP0Q0] × 100
Year 1:      [ΣP1Q1 / ΣP0Q1] × 100          [ΣP1Q1 / ΣP1Q0] × 100
Year 2:      [ΣP2Q2 / ΣP0Q2] × 100          [ΣP2Q2 / ΣP2Q0] × 100
Running one’s eye down the successive Paasche formulae shows how for this index the weights change every year. Thus for each year in the price
index the prices of that year are compared with the reference year prices,
using the quantities of the given year as weights: Q1 for year 1, Q2 for year 2,
and so on.
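These formulae can be checked against column (3) of table B.3 with a few lines of Python, re-using the table B.2 prices and quantities (rounding in the final digit may differ slightly from the printed table):

```python
prices = {1900: {"cloth": 10, "beer": 15},
          1901: {"cloth": 11, "beer": 18},
          1902: {"cloth": 12, "beer": 20}}
qty = {1900: {"cloth": 1, "beer": 2},
       1901: {"cloth": 5, "beer": 3},
       1902: {"cloth": 10, "beer": 4}}
goods = ("cloth", "beer")

def paasche_price(year):
    # Sum(Pt x Qt) / Sum(P0 x Qt) x 100: given-year quantities as weights
    num = sum(prices[year][g] * qty[year][g] for g in goods)
    den = sum(prices[1900][g] * qty[year][g] for g in goods)
    return 100 * num / den

def paasche_quantity(year):
    # Sum(Pt x Qt) / Sum(Pt x Q0) x 100: given-year prices as weights
    num = sum(prices[year][g] * qty[year][g] for g in goods)
    den = sum(prices[year][g] * qty[1900][g] for g in goods)
    return 100 * num / den

for y in (1900, 1901, 1902):
    print(y, round(paasche_price(y), 2), round(paasche_quantity(y), 2))
# price index: 100.0, 114.74, 125.0; quantity index: 100.0, 231.91, 384.62
```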
An important feature of the Paasche indices is that, strictly speaking,
each given year should only be compared directly with the reference year. If, instead, a Paasche index is used to measure the change in prices between years 1 and 2, what we actually have is a comparison that mixes two different sets of quantity weights, Q1 for the first year and Q2 for the second, so it does not isolate the change in prices.

When a Paasche index is constructed from price relatives, two elements of the procedure are critical. For year 1 the index is the reciprocal of the weighted average of the relatives, taken with the given year as 100:

[1 / (Σ(P0/P1)P1Q1 / ΣP1Q1)] × 100

As long as the prices are expressed relative to the given year, the P1s in the denominator of the relative and the numerator of the expenditure share will cancel out to give

ΣP0Q1 / ΣP1Q1

and when this is inverted the formula is identical to the one for year 1 in column (1) above.
Students (and others) often omit these two critical elements in the pro-
cedure for a Paasche index based on relatives. If the reciprocal is omitted,
the index measures the prices of year 0 relative to those of year 1, instead of
the reverse. If the price relative is not recalculated every year to have the
given year as 100 the result is quite meaningless. For example, if the price
relative for year 1 had been taken with year 0 as 100 the calculation would be
1 / [Σ(P1/P0)P1Q1 / ΣP1Q1] = ΣP1Q1 / Σ(P1²Q1/P0)
a complex expression for which there is no intelligible interpretation.
It is because of possible pitfalls of this type that it is so important to
think of index numbers in terms of a notation such as the one adopted in
this panel. It is always possible to make an arithmetic calculation which
looks correct on the surface, but it is only by scrutiny of its underlying struc-
ture that one can decide whether or not the index is sensible.
relatives and then multiplying by 100, and this is the form in which a
Paasche index is often calculated. (It is presented in this form in panel B.2.)
We can go directly to this version of the procedure to calculate the index
for 1902 with 1900 as the reference year. When 1902 is the given year, the respective shares in expenditure are 120/200 = 0.6 and 80/200 = 0.4, and the price relatives (now with 1902 as 100) are (10/12) × 100 = 83.33 for cloth, and (15/20) × 100 = 75 for beer. The full calculation is thus

[100 / ((0.6 × 83.33) + (0.4 × 75))] × 100 = (100/80) × 100 = 125.0
On the basis of the first of these decompositions one could say, there-
fore, that the increase in the value of total expenditure from 100 to 500 con-
sisted of an increase in prices of 25 per cent and an increase in quantities (a
rise ‘in real terms’) of 300 per cent. The fact that the other two methods
would give a different partitioning between price and quantity is simply
another manifestation of the index number problem.
Finally, it should be noted that in the perspective of a comparison of
1900 and 1902, the calculation of the Paasche price index for 1902 (with
1902 quantities as weights) is identically equal at 125 to the result that
would be obtained using a Laspeyres price index with 1902 as the base year,
because the latter also has 1902 quantities as weights. This can be seen in
the figures for 1902 in the upper panel of table B.3 (and for the correspond-
ing quantity indices in the lower panel, for which both are equal to 384.62).
For this reason the index number problem discussed in §B.4 is some-
times presented as the difference between a Laspeyres and a Paasche
measure, but this arbitrarily assumes that the base of the former is an early
year and the base of the latter a late year, neither of which is necessarily
correct.
Notes
1 The product of the prices and quantities might also be interpreted as the value of the
output of beer and cloth produced. However, the measure of output relevant for the
construction of a quantity index of output, such as an index of industrial produc-
tion, would be the value added, i.e. for each constituent industry, the difference
between the value at which its products are sold and the value of the raw materials,
fuel, and other inputs which it purchased. The corresponding ‘value added prices’
are thus not the same as the prices at which the goods were purchased by consumers.
The prices in table B.2 are assumed to be consumer prices and so we refer to the series
in columns (3), (6), and (7) as the value of expenditure. It is measured at ‘current
prices’, i.e. at the prices prevailing in each year.
2 For a very clear and full discussion of these issues see Abram Bergson, The Real
National Income of Soviet Russia since 1928, Harvard University Press, 1961, pp.
25–41; and Janet G. Chapman, Trends in Consumption in the Soviet Union, Rand
Corporation, 1964, pp. 27–44.
3 Bergson, Real National Income, p. 217. For illustrations of a substantial index
number problem in relation to measures of industrial production in the United
Kingdom between 1924 and 1948 and in the United States between 1909 and 1937,
see W. E. G. Salter, Productivity and Technical Change, Cambridge University Press,
1960, pp. 151–2, 170.
4 There are, however, substantial differences in the underlying economic interpretation
of the alternative index number formulae. For an introduction to these principles see
Robin Marris, Economic Arithmetic, Macmillan, 1958, pp. 227–83 or Dan Usher, The
Measurement of Economic Growth, Basil Blackwell, 1980, pp. 12–64. The economic
principles are discussed briefly in an historical context in N. F. R. Crafts, British
Economic Growth during the Industrial Revolution, Cambridge University Press,
1985, pp. 25–8.
5 The form of this index used for empirical work is an approximation to a theoretical
version which cannot be applied in practice because it treats time as a continuous
variable, whereas empirical work can be undertaken only with discrete periods of
time such as a month or a year.
[Figure 8.2  Ballantine for a regression with two explanatory variables: Y, X1, X2]
[Figure 8.3  Ballantine for a regression with two explanatory variables: (a) with two independent explanatory variables; Y, X1, X2]
Bergson, Abram, The Real National Income of Soviet Russia since 1928,
Harvard University Press, 1961
Blalock, H. M., Social Statistics, 2nd edn., McGraw-Hill, 1979
Blaug, Mark, ‘The myth of the old poor law and the making of the new’,
Journal of Economic History, 23, 1963, pp. 151–84
Bloom, Howard S. and H. Douglas Price, ‘Voter response to short-run eco-
nomic conditions: the asymmetric effect of prosperity and recession’,
American Political Science Review, 69, 1975, pp. 1240–54
Bogue, Allan C., ‘Some dimensions of power in the thirty-seventh senate’, in
William O. Aydelotte, Allan C. Bogue and Robert W. Fogel (eds.), The
Dimensions of Quantitative Research in History, Oxford University Press,
1972
Bowden, Sue and Avner Offer, ‘Household appliances and the use of time:
the United States and Britain since the 1920s’, Economic History Review,
47, 1994, pp. 725–48
Bowden, Sue and Paul Turner, ‘The demand for consumer durables in the
United Kingdom in the interwar period’, Journal of Economic History, 53,
1993, pp. 244–58
Boyer, George R., An Economic History of the English Poor Law, Cambridge
University Press, 1990
‘The influence of London on labor markets in southern England,
1830–1914’, Social Science History, 22, 1998, pp. 257–85
Broadberry, Stephen N. and N. F. R. Crafts, ‘Britain’s productivity gap in the
1930s: some neglected factors’, Journal of Economic History, 52, 1992, pp.
531–58
Burns, Arthur F. and W. C. Mitchell, Measuring Business Cycles, NBER,
1947
Burns, Eveline, British Unemployment Programs, 1920–1938, Social Science
Research Council, 1941
Callahan, Colleen M., Judith A. McDonald and Anthony Patrick O’Brien,
‘Who voted for Smoot–Hawley?’, Journal of Economic History, 54, 1994,
pp. 683–90
Cantor, D. and K. C. Land, ‘Unemployment and crime rates in the post-
World War II United States: a theoretical and empirical analysis’,
American Sociological Review, 44, 1979, pp. 588–608
Caradog Jones, D. (ed.), The Social Survey of Merseyside, 3, Liverpool
University Press, 1934
Census of Population of England and Wales, 1831, Parliamentary Papers,
1833 (H.C. 149), XXXVI–XXXVIII
Chapman, Agatha L., Wages and Salaries in the United Kingdom, 1920–1938,
Cambridge University Press, 1953
Huzel, J. P., ‘The demographic impact of the old poor law: more reflexions on Malthus’,
Economic History Review, 33, 1980, pp. 367–81
Isaac, Larry W. and Larry J. Griffin, ‘Ahistoricism in time-series analysis of
historical process: critique, redirection and illustrations from US labor
history’, American Sociological Review, 54, 1989, pp. 873–90
Jackson, Robert V., ‘Rates of industrial growth during the industrial revolu-
tion’, Economic History Review, 45, 1992, pp. 1–23
James, John A., ‘Structural change in American manufacturing, 1850–1890’,
Journal of Economic History, 43, 1983, pp. 433–59
Jarausch, Konrad H. and Gerhard Arminger, ‘The German teaching profes-
sion and Nazi party membership: a demographic logit model’, Journal of
Interdisciplinary History, 20 (Autumn 1989), pp. 197–225
John, A. H., ‘The course of agricultural change, 1660–1760’, in L. Pressnell
(ed.), Studies in the Industrial Revolution, University of London, 1960, p.
136
Kaelble, Hartmut and Mark Thomas, ‘Introduction’, in Y. S. Brenner, H.
Kaelble and M. Thomas (eds.), Income Distribution in Historical
Perspective, Cambridge University Press, 1991, pp. 1–56
Kearl, J. R., Clayne L. Pope, and Larry T. Wimmer, ‘Household wealth in a
settlement economy: Utah, 1850–1870’, Journal of Economic History, 40,
1980, pp. 477–96
Kennedy, Peter E., ‘The “Ballentine”: a graphical aid for econometrics’,
Australian Economic Papers, 20, 1981, pp. 414–16
A Guide to Econometrics, 3rd edn., Blackwell, 1992
Knight, I. and J. Eldridge, The Heights and Weights of Adults in Great Britain,
HMSO, 1984
Kramer, Gerald H., ‘Short-term fluctuations in US voting behavior,
1896–1964’, American Political Science Review, 65, 1971, pp. 131–43
Kussmaul, Ann, Servants in Husbandry in Early Modern England, Cambridge
University Press, 1981
A General View of the Rural Economy of England 1538–1840, Cambridge
University Press, 1990
Kuznets, Simon, Secular Movements in Production and Prices, Houghton
Mifflin, 1930
Layard, Richard and Stephen Nickell, ‘The labour market’, in Rudiger
Dornbusch and Richard Layard, The Performance of the British Economy,
Oxford University Press, 1987, pp. 131–79
Lee, Chulhee, ‘Socio-economic background, disease and mortality among
Union army recruits: implications for economic and demographic
history’, Explorations in Economic History, 34, 1997, pp. 27–55
Lewis, W. Arthur, Growth and Fluctuations, 1870–1913, Allen & Unwin, 1978
maturity for height, weight, height velocity and weight velocity: British
children, 1965 (Part I)’, Archives of Disease in Childhood, 41, 1966, pp.
454–71
Thomas, Brinley, Migration and Economic Growth, Cambridge University
Press, 1954
Thomas, Dorothy Swaine, Social Aspects of the Business Cycle, Routledge,
1925
Thomas, T., ‘Aggregate demand in the United Kingdom 1918–1945’, in
Robert Floud and Donald McCloskey (eds.), The Economic History of
Britain since 1700, II, 1860 to the 1970s, 1st edn., Cambridge University
Press, 1981, pp. 332–46
Tobin, James, ‘Estimation of relationships for limited dependent variables’, Econometrica, 26, 1958, pp. 24–36
Tucker, G. S. L., ‘The old poor law revisited’, Explorations in Economic
History, 12, 1975, pp. 233–52
Tufte, Edward R., Political Control of the Economy, Princeton University
Press, 1978
US Department of Commerce, Historical Statistics of the United States, Series
G13, Washington, DC, 1949
Usher, Dan, The Measurement of Economic Growth, Basil Blackwell, 1980
Vamplew, Wray (ed.), Australian Historical Statistics, Fairfax, Syme &
Weldon, 1987
Vernon, J. R., ‘Unemployment rates in post-bellum America, 1869–1899’,
University of Florida, 1991, manuscript
von Tunzelmann, G. N., Steam Power and British Industrialization, Oxford
University Press, 1978
Voth, Hans-Joachim, Time and Work in England, 1750–1830, Oxford
University Press, 2001
Williamson, Jeffrey G., American Growth and the Balance of Payments
1820–1913, A Study of the Long Swing, University of North Carolina Press,
1964
‘The evolution of global labor markets in the first and second world since
1830: background and evidence’, NBER Working Papers on Historical
Factors in Long-Run Growth, 36, NBER, 1992
Winter, Jay, ‘Unemployment, nutrition and infant mortality in Britain,
1920–50’, in Jay Winter (ed.), The Working Class in Modern British
History, Cambridge University Press, 1983, pp. 232–56
Wonnacott, T. H. and R. J. Wonnacott, Introductory Statistics, 5th edn., John
Wiley, 1990
Woods, Robert, The Demography of Victorian England and Wales, Cambridge
University Press, 2000
Subject index
(Page references for boxes with definitions of the main terms and concepts are shown in bold
type.)
unemployment data set, see Benjamin–Kochin data set
unit elasticity,
  in logit models, 406–9, 409, 410, 411, 412, 430(n.10), 485–9
  in probit models, 421–2
variables
  categorical, 9, 385, 422
  continuous, 10, 56,
Wald–Wolfowitz runs test, 188–93, 191, 210, 214
  standardized test statistic for, 191
White test for heteroscedasticity, 311, 453
Wilcoxon rank sum test, 194, 196, 197
Z-distributions, 61–5, 131–4, 165
Z-tests, 164–5, 185–6
Zcalc, 166(n.), 191, 193, 195–7, 221(n.2)
zero-order correlation, 249, 250, 251, 253
Name Index