Ecmet
Ecmet
Ecmet
Hans G. Ehrbar
Chapter 23. The Mean Squared Error as an Initial Criterion of Precision 629
23.1. Comparison of Two Vector Estimators 630
Chapter 28. Updating of Estimates When More Observations become Available 731
Chapter 35. Least Squares as the Normal Maximum Likelihood Estimate 855
Chapter 64. Pooling of Cross Section and Time Series Data 1353
64.1. OLS Model 1354
64.2. The Between-Estimator 1356
64.3. Dummy Variable Model (Fixed Effects) 1357
xx CONTENTS
Preface
These are class notes from several different graduate econometrics and statistics
classes. In the Spring 2000 they were used for Statistics 6869, syllabus on p. ??, and
in the Fall 2000 for Economics 7800, syllabus on p. ??. The notes give a careful and
complete mathematical treatment intended to be accessible also to a reader inexpe-
rienced in math. There are 618 exercise questions, almost all with answers. The
R-package ecmet has many of the datasets and R-functions needed in the examples.
P. 547 gives instructions how to download it.
Here are some features by which these notes may differ from other teaching
material available:
xxiii
xxiv 1. PREFACE
totality, transfactual efficacy, etc., can and should be used. These comments are still
at an experimental state, and are the students are not required to know them for the
exams. In the on-line version of the notes they are printed in a different color.
After some more cleaning out of the code, I am planning to make the AMS-LATEX
source files for these notes publicly available under the GNU public license, and up-
load them to the TEX-archive network CTAN. Since I am using Debian GNU/Linux,
the materials will also be available as a deb archive.
The most up-to-date version will always be posted at the web site of the Econom-
ics Department of the University of Utah www.econ.utah.edu/ehrbar/ecmet.pdf.
You can contact me by email at ehrbar@econ.utah.edu
Hans Ehrbar
CHAPTER 2
Probability Fields
• Markets: the total personal income in New York State in a given month.
• Meteorology: the rainfall in a given month.
• Uncertainty: the exact date of Noah’s birth.
• Indeterminacy: The closing of the Dow Jones industrial average or the
temperature in New York City at 4 pm. on February 28, 2014.
• Chaotic determinacy: the relative frequency of the digit 3 in the decimal
representation of π.
• Quantum mechanics: the proportion of photons absorbed by a polarization
filter
• Statistical mechanics: the velocity distribution of molecules in a gas at a
given pressure and temperature.
Problem 1. (This question will not be asked on any exams) Rényi says: “Ob-
serving how long one has to wait for the departure of an airplane is an experiment.”
Comment.
2.1. THE CONCEPT OF PROBABILITY 3
Answer. Rény commits the epistemic fallacy in order to justify his use of the word “exper-
iment.” Not the observation of the departure but the departure itself is the event which can be
theorized probabilistically, and the word “experiment” is not appropriate here.
What does the fact that probability theory is appropriate in the above situations
tell us about the world? Let us go through our list one by one:
• Games of chance: Games of chance are based on the sensitivity on initial
conditions: you tell someone to roll a pair of dice or shuffle a deck of cards,
and despite the fact that this person is doing exactly what he or she is asked
to do and produces an outcome which lies within a well-defined universe
known beforehand (a number between 1 and 6, or a permutation of the
deck of cards), the question which number or which permutation is beyond
their control. The precise location and speed of the die or the precise order
of the cards varies, and these small variations in initial conditions give rise,
by the “butterfly effect” of chaos theory, to unpredictable final outcomes.
A critical realist recognizes here the openness and stratification of the
world: If many different influences come together, each of which is gov-
erned by laws, then their sum total is not determinate, as a naive hyper-
determinist would think, but indeterminate. This is not only a condition
for the possibility of science (in a hyper-deterministic world, one could not
know anything before one knew everything, and science would also not be
4 2. PROBABILITY FIELDS
necessary because one could not do anything), but also for practical human
activity: the macro outcomes of human practice are largely independent of
micro detail (the postcard arrives whether the address is written in cursive
or in printed letters, etc.). Games of chance are situations which delib-
erately project this micro indeterminacy into the macro world: the micro
influences cancel each other out without one enduring influence taking over
(as would be the case if the die were not perfectly symmetric and balanced)
or deliberate human corrective activity stepping into the void (as a card
trickster might do if the cards being shuffled somehow were distinguishable
from the backside).
The experiment in which one draws balls from urns shows clearly an-
other aspect of this paradigm: the set of different possible outcomes is
fixed beforehand, and the probability enters in the choice of one of these
predetermined outcomes. This is not the only way probability can arise;
it is an extensionalist example, in which the connection between success
and failure is external. The world is not a collection of externally related
outcomes collected in an urn. Success and failure are not determined by a
choice between different spacially separated and individually inert balls (or
playing cards or faces on a die), but it is the outcome of development and
struggle that is internal to the individual unit.
2.1. THE CONCEPT OF PROBABILITY 5
not completely capricious either, since both are mice. It can be predicted
probabilistically. Those mechanisms which make them mice react to the
smoke. The probabilistic regularity comes from the transfactual efficacy of
the mouse organisms.
• Meteorology: the rainfall in a given month. It is very fortunate for the
development of life on our planet that we have the chaotic alternation be-
tween cloud cover and clear sky, instead of a continuous cloud cover as in
Venus or a continuous clear sky. Butterfly effect all over again, but it is
possible to make probabilistic predictions since the fundamentals remain
stable: the transfactual efficacy of the energy received from the sun and
radiated back out into space.
• Markets: the total personal income in New York State in a given month.
Market economies are a very much like the weather; planned economies
would be more like production or life.
• Uncertainty: the exact date of Noah’s birth. This is epistemic uncertainty:
assuming that Noah was a real person, the date exists and we know a time
range in which it must have been, but we do not know the details. Proba-
bilistic methods can be used to represent this kind of uncertain knowledge,
but other methods to represent this knowledge may be more appropriate.
2.1. THE CONCEPT OF PROBABILITY 7
making the changes in the same way as in encrypting using the key string which is
known to the receiver.
Problem 4. Why is it important in the above encryption scheme that the key
string is purely random and does not have any regularities?
Problem 5. [Knu81, pp. 7, 452] Suppose you wish to obtain a decimal digit at
random, not using a computer. Which of the following methods would be suitable?
• a. Open a telephone directory to a random place (i.e., stick your finger in it
somewhere) and use the unit digit of the first number found on the selected page.
Answer. This will often fail, since users select “round” numbers if possible. In some areas,
telephone numbers are perhaps assigned randomly. But it is a mistake in any case to try to get
several successive random numbers from the same page, since many telephone numbers are listed
several times in a sequence.
• c. Roll a die which is in the shape of a regular icosahedron, whose twenty faces
have been labeled with the digits 0, 0, 1, 1,. . . , 9, 9. Use the digit which appears on
2.1. THE CONCEPT OF PROBABILITY 11
top, when the die comes to rest. (A felt table with a hard surface is recommended for
rolling dice.)
Answer. The markings on the face will slightly bias the die, but for practical purposes this
method is quite satisfactory. See Math. Comp. 15 (1961), 94–95, for further discussion of these
dice.
• f. Ask a friend to think of a random digit, and use the digit he names.
Answer. No, people usually think of certain digits (like 7) with higher probability.
• g. Assume 10 horses are entered in a race and you know nothing whatever about
their qualifications. Assign to these horses the digits 0 to 9, in arbitrary fashion, and
after the race use the winner’s digit.
Answer. Okay; your assignment of numbers to the horses had probability 1/10 of assigning a
given digit to a winning horse.
etc. But we cannot distinguish between the first die getting a one and the second a
two, and vice versa. I.e., if we define the sample set to be U = {1, . . . , 6}×{1, . . . , 6},
i.e., the set of all pairs of numbers between 1 and 6, then certain subsets are not
observable. {(1, 5)} is not observable (unless the dice are marked or have different
colors etc.), only {(1, 5), (5, 1)} is observable.
If the experiment is measuring the height of a person in meters, and we make
the idealized assumption that the measuring instrument is infinitely accurate, then
all possible outcomes are numbers between 0 and 3, say. Sets of outcomes one is
usually interested in are whether the height falls within a given interval; therefore
all intervals within the given range represent observable events.
If the sample space is finite or countably infinite, very often all subsets are
observable events. If the sample set contains an uncountable continuum, it is not
desirable to consider all subsets as observable events. Mathematically one can define
quite crazy subsets which have no practical significance and which cannot be mean-
ingfully given probabilities. For the purposes of Econ 7800, it is enough to say that
all the subsets which we may reasonably define are candidates for observable events.
The “set of all possible outcomes” is well defined in the case of rolling a die
and other games; but in social sciences, situations arise in which the outcome is
open and the range of possible outcomes cannot be known beforehand. If one uses
a probability theory based on the concept of a “set of possible outcomes” in such
14 2. PROBABILITY FIELDS
proof derived from the definitions of A ∪ B etc. given above, should remember that a
proof of the set-theoretical identity A = B usually has the form: first you show that
ω ∈ A implies ω ∈ B, and then you show the converse.
• a. Prove that A ∪ B = B ⇐⇒ A ∩ B = A.
Answer. If one draws the Venn diagrams, one can see that either side is true if and only
if A ⊂ B. If one wants a more precise proof, the following proof by contradiction seems most
illuminating: Assume the lefthand side does not hold, i.e., there exists a ω ∈ A but ω ∈ / B. Then
ω∈ / A ∩ B, i.e., A ∩ B 6= A. Now assume the righthand side does not hold, i.e., there is a ω ∈ A
with ω ∈/ B. This ω lies in A ∪ B but not in B, i.e., the lefthand side does not hold either.
• b. Prove that A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
Answer. If ω ∈ A then it is clearly always in the righthand side and in the lefthand side. If
there is therefore any difference between the righthand and the lefthand side, it must be for the
ω∈/ A: If ω ∈ / A and it is still in the lefthand side then it must be in B ∩ C, therefore it is also in
the righthand side. If ω ∈ / A and it is in the righthand side, then it must be both in B and in C,
therefore it is in the lefthand side.
Answer. Proof: If ω in lefthand side, then it is in A and in at least one of the Bi , say it is
in Bk . Therefore it is in A ∩ Bk , and therefore it is in the righthand side. Now assume, conversely,
that ω is in the righthand side; thenSit is at least in one of the A ∩ Bi , say it is in A ∩ Bk . Hence it
is in A and in Bk , i.e., in A and in Bi , i.e., it is in the lefthand side.
Answer. There is a proof in [HT83, p. 12]. Draw A and B inside a box which represents U ,
and shade A0 from the left (blue) and B 0 from the right (yellow), so that A0 ∩ B 0 is cross shaded
(green); then one can see these laws.
Answer.
∞ ∞
1 1
[ \
(2.2.11) ,2 = (0, 2) 0, =∅
n n
n=1 n=1
∞ ∞ h
[h1 i \ 1
i
(2.2.12) , 2 = (0, 2] 0, 1 + = [0, 1]
n n
n=1 n=1
S∞ 1 1
Explanation of n=1 n
, 2 : for every α with 0 < α ≤ 2 there is a n with n
≤ α, but 0 itself is in
none of the intervals.
The set operations become logical operations if applied to events. Every experi-
ment returns an element ω∈U as outcome. Here ω is rendered green in the electronic
version of these notes (and in an upright font in the version for black-and-white
printouts), because ω does not denote a specific element of U , but it depends on
2.2. EVENTS AS SETS 19
chance which element is picked. I.e., the green color (or the unusual font) indicate
that ω is “alive.” We will also render the events themselves (as opposed to their
set-theoretical counterparts) in green (or in an upright font).
The set F of all observable events must be a σ-algebra, i.e., it must satisfy:
∅∈F
A ∈ F ⇒ A0 ∈ F
[
A1 , A2 , . . . ∈ F ⇒ A1 ∪ A2 ∪ · · · ∈ F which can also be written as Ai ∈ F
i=1,2,...
\
A1 , A2 , . . . ∈ F ⇒ A1 ∩ A2 ∩ · · · ∈ F which can also be written as Ai ∈ F.
i=1,2,...
Here an infinite sum is mathematically defined as the limit of partial sums. These
axioms make probability what mathematicians call a measure, like area or weight.
In a Venn diagram, one might therefore interpret the probability of the events as the
area of the bubble representing the event.
Answer. Follows from the fact that A and A0 are disjoint and their union U has probability
1.
Problem 10. 2 points Prove that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B].
Answer. For Econ 7800 it is sufficient to argue it out intuitively: if one adds Pr[A] + Pr[B]
then one counts Pr[A ∩ B] twice and therefore has to subtract it again.
The brute force mathematical proof guided by this intuition is somewhat verbose: Define
D = A ∩ B 0 , E = A ∩ B, and F = A0 ∩ B. D, E, and F satisfy
(2.3.4) D ∪ E = (A ∩ B 0 ) ∪ (A ∩ B) = A ∩ (B 0 ∪ B) = A ∩ U = A,
(2.3.5) E ∪ F = B,
(2.3.6) D ∪ E ∪ F = A ∪ B.
22 2. PROBABILITY FIELDS
You may need some of the properties of unions and intersections in Problem 6. Next step is to
prove that D, E, and F are mutually exclusive. Therefore it is easy to take probabilities
Problem 11. 1 point Show that for arbitrary events A and B, Pr[A ∪ B] ≤
Pr[A] + Pr[B].
Answer. From Problem 10 we know that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B], and from
axiom (2.3.2) follows Pr[A ∩ B] ≥ 0.
Problem 12. 2 points (Bonferroni inequality) Let A and B be two events. Writ-
ing Pr[A] = 1 − α and Pr[B] = 1 − β, show that Pr[A ∩ B] ≥ 1 − (α + β). You are
2.3. THE AXIOMS OF PROBABILITY 23
allowed to use that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] (Problem 10), and that
all probabilities are ≤ 1.
Answer.
(2.3.11) Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] ≤ 1
(2.3.12) Pr[A] + Pr[B] ≤ 1 + Pr[A ∩ B]
(2.3.13) Pr[A] + Pr[B] − 1 ≤ Pr[A ∩ B]
(2.3.14) 1 − α + 1 − β − 1 = 1 − α − β ≤ Pr[A ∩ B]
Problem 13. (Not eligible S for in-class exams) Given a rising sequence of events
∞
B 1 ⊂ B 2 ⊂ B 3 · · · , define B = i=1 B i . Show that Pr[B] = limi→∞ Pr[B i ].
0 0
Answer.Sn Define C 1 = BS 1 , C 2 = B 2 ∩ B 1 , C 3 = B 3 ∩ B 2 , etc. Then C i ∩ C j = ∅ for i 6= j,
∞
and B n = i=1 C i and B = i=1 C i . In other words, now we have represented every B n and B
as
P∞ a union of disjoint sets, and can therefore apply the third probability axiomP (2.3.3): Pr[B] =
n
i=1
Pr[C i ]. The infinite sum is merely a short way of writing Pr[B] = limn→∞ i=1
Pr[C i ], i.e.,
P n
the infinite sum is the limit of the finite sums. But since these finite sums are exactly i=1
Pr[C i ] =
Sn
Pr[ i=1 C i ] = Pr[B n ], the assertion follows. This proof, as it stands, is for our purposes entirely
acceptable. One can makeS some steps in this proof still more stringent. ForS∞instance, one might use
n
induction to prove B n = i=1 C i . And how does one show that B = i=1 C i ? Well, one knows
S∞ S∞
that C i ⊂ B i , therefore i=1 C i ⊂ i=1 B i = B. Now take an ω ∈ B. Then it lies in at least one
24 2. PROBABILITY FIELDS
of the B i , but it can be in many of them. Let k be the smallest k for which ω ∈ B k . If k = 1, then
ω ∈ C 1 = B 1 as well. Otherwise, ω ∈ /SB k−1 , and therefore ω ∈ C k . I.e., any element in B lies in
∞
at least one of the C k , therefore B ⊂ i=1 C i .
Problem 14. (Not eligible for in-class exams) From problem 13 T derive also
the following: if A1 ⊃ A2 ⊃ A3 · · · is a declining sequence, and A = i Ai , then
Pr[A] = lim Pr[Ai ].
Answer. If the Ai are declining, then their complementsS B i = A0i are rising: B 1 ⊂ B 2 ⊂
B 3 · · · are rising; therefore I know the probability of B = B i . Since by de Morgan’s laws, B = A0 ,
this gives me also the probability of A.
The results regarding the probabilities of rising or declining sequences are equiv-
alent to the third probability axiom. This third axiom can therefore be considered a
continuity condition for probabilities.
If U is finite or countably infinite, then the probability measure is uniquely
determined if one knows the probability of every one-element set. We will call
Pr[{ω}] = p(ω) the probability mass function. Other terms used for it in the lit-
erature are probability function, or even probability density function (although it
is not a density, more about this below). If U has more than countably infinite
elements, the probabilities of one-element sets may not give enough information to
define the whole probability measure.
2.3. THE AXIOMS OF PROBABILITY 25
Mathematical Note: Not all infinite sets are countable. Here is a proof, by
contradiction, that the real numbers between 0 and 1 are not countable: assume
there is an enumeration, i.e., a sequence a1 , a2 , . . . which contains them all. Write
them underneath each other in their (possibly infinite) decimal representation, where
0.di1 di2 di3 . . . is the decimal representation of ai . Then any real number whose
decimal representation is such that the first digit is not equal to d11 , the second digit
is not equal d22 , the third not equal d33 , etc., is a real number which is not contained
in this enumeration. That means, an enumeration which contains all real numbers
cannot exist.
On the real numbers between 0 and 1, the length measure (which assigns to each
interval its length, and to sets composed of several invervals the sums of the lengths,
etc.) is a probability measure. In this probability field, every one-element subset of
the sample set has zero probability.
This shows that events other than ∅ may have zero probability. In other words,
if an event has probability 0, this does not mean it is logically impossible. It may
well happen, but it happens so infrequently that in repeated experiments the average
number of occurrences converges toward zero.
26 2. PROBABILITY FIELDS
• b. 1 point Assume that p + q + r > 1. Name three lotteries which Arnold would
be willing to buy, the net effect of which would be that he loses with certainty.
Answer. Among those six we have to pick subsets that make him a sure loser. If p + q + r > 1,
then we sell him a bet on A, one on B, and one on C. The payoff is always 1, and the cost is
p + q + r > 1.
• c. 1 point Now assume that p + q + r < 1. Name three lotteries which Arnold
would be willing to buy, the net effect of which would be that he loses with certainty.
Answer. If p + q + r < 1, then we sell him a bet on A0 , one on B 0 , and one on C 0 . The payoff
is 2, and the cost is 1 − p + 1 − q + 1 − r > 2.
disjunction of events, then the sum of their probabilities is 1, is additive, i.e., Pr[A ∪
B] = Pr[A] + Pr[B].
Answer. Since r is his subjective probability of C, 1 − r must be his subjective probability of
C 0 = A ∪ B. Since p + q + r = 1, it follows 1 − r = p + q.
This last problem indicates that the finite additivity axiom follows from the
requirement that the bets be consistent or, as subjectivists say, “coherent” with
each other. However, it is not possible to derive the additivity for countably infinite
sequences of events from such an argument.
11 12 13 14
21 22 23 5
Answer. 31 32 , i.e., 10 out of 36 possibilities, gives the probability 18
.
41
• b. 1 point What is the probability that both of the numbers shown are five or
less?
11 12 13 14 15
21 22 23 24 25 25
Answer. 31 32 33 34 35 , i.e., .
41 42 43 44 45 36
51 52 53 54 55
• c. 2 points What is the probability that the maximum of the two numbers shown
is five? (As a clarification: if the first die shows 4 and the second shows 3 then the
maximum of the numbers shown is 4.)
15
25 1
Answer. 35 , i.e., .
45 4
51 52 53 54 55
In this and in similar questions to follow, the answer should be given as a fully
shortened fraction.
The multiplication principle is a basic aid in counting: If the first operation can
be done n1 ways, and the second operation n2 ways, then the total can be done n1 n2
ways.
Definition: A permutation of a set is its arrangement in a certain order. It was
mentioned earlier that for a set it does not matter in which order the elements are
30 2. PROBABILITY FIELDS
written down; the number of permutations is therefore the number of ways a given
set can be written down without repeating its elements. From the multiplication
principle follows: the number of permutations of a set of n elements is n(n − 1)(n −
2) · · · (2)(1) = n! (n factorial). By definition, 0! = 1.
If one does not arrange the whole set, but is interested in the number of k-
tuples made up of distinct elements of the set, then the number of possibilities is
n!
n(n − 1)(n − 2) · · · (n − k + 2)(n − k + 1) = (n−k)! . (Start with n and the number
of factors is k.) (k-tuples are sometimes called ordered k-tuples because the order in
which the elements are written down matters.) [Ame94, p. 8] uses the notation Pkn
for this.
This leads us to the next question: how many k-element subsets does a n-element
set have? We already know how many permutations into k elements it has; but always
k! of these permutations represent the same subset; therefore we have to divide by
k!. The number of k-element subsets of an n-element set is therefore
n! n(n − 1)(n − 2) · · · (n − k + 1) n
(2.5.1) = = ,
k!(n − k)! (1)(2)(3) · · · k k
Problem 17. 5 points Compute the probability of getting two of a kind and three
of a kind (a “full house”) when five dice are rolled. (It is not necessary to express it
as a decimal number; a fraction of integers is just fine. But please explain what you
are doing.)
Answer. See [Ame94, example 2.3.3 on p. 9]. Sample space is all ordered 5-tuples out of 6,
which has 65 elements. Number of full houses can be identified with number of all ordered pairs of
distinct elements out of 6, the first element in the pair denoting the number which appears twice
and the second element that which appears three times, i.e., P26 = 6 · 5. Number of arrangements
of a given full house over the five dice is C25 = 5·4 1·2
(we have to specify the two places taken by the
two-of-a-kind outcomes.) Solution is therefore P26 · C25 /65 = 50/64 = 0.03858. This approach uses
counting.
Alternative approach, using conditional probability: probability of getting 3 of one kind and
then two of a different kind is 1 · 61 · 16 · 56 · 61 = 654 . Then multiply by 52 = 10, since this is the
number of arrangements of the 3 and 2 over the five cards.
Problem 18. What is the probability of drawing the King of Hearts and the
Queen of Hearts if one draws two cards out of a 52 card game? Is it 5212 ? Is it
1
52 2
(52)(51) ? Or is it 1 2 = (52)(51) ?
Answer. Of course the last; it is the probability of drawing one special subset. There are two
ways of drawing this subset: first the King and then the Queen, or first the Queen and then the
King.
32 2. PROBABILITY FIELDS
n
Answer. Because n−k
counts the complements of k-element sets.
n n−1 n−1
(2.6.1) k = k−1 + k .
2.6. RELATIONSHIPS INVOLVING BINOMIAL COEFFICIENTS 33
Why? When the n factors a + b are multiplied out, each of the resulting terms selects
n−k k
fromeach of the n original factors either a or b. The term a b occurs therefore
n n
n−k = k times.
As an application: If you set a = 1, b = 1, you simply get a sum of binomial
coefficients, i.e., you get the number of subsets in a set with n elements: it is 2n
(always count the empty set as one of the subsets). The number of all subsets is
easily counted directly. You go through the set element by element and about every
element you ask: is it in the subset or not? I.e., for every element you have two
34 2. PROBABILITY FIELDS
Pr[B ∩ A]
(2.7.1) Pr[B|A] =
Pr[A]
How can we motivate (2.7.1)? If we know that A has occurred, then of course the only
way that B occurs is when B ∩ A occurs. But we want to multiply all probabilities
of subsets of A with an appropriate proportionality factor so that the probability of
the event A itself becomes = 1.
Problem 20. 3 points Let A be an event with nonzero probability. Show that
the probability conditionally on A, i.e., the mapping B 7→ Pr[B|A], satisfies all the
2.7. CONDITIONAL PROBABILITY 35
Answer. Pr[U |A] = Pr[U ∩A]/ Pr[A] = 1. Pr[B|A] = Pr[B∩A]/ Pr[A] ≥ 0 because Pr[B∩A] ≥
0 and Pr[A] > 0. Finally,
(2.7.5)
∞ S∞ S∞ ∞ ∞
[ Pr[( i=1 B i ) ∩ A] Pr[ i=1 (B i ∩ A)] 1 X X
Pr[ B i |A] = = = Pr[B i ∩ A] = Pr[B i |A]
Pr[A] Pr[A] Pr[A]
i=1 i=1 i=1
First equal sign is definition of conditional probability, second is distributivity of unions and inter-
sections (Problem 6 d), third because the B i are disjoint and therefore the B i ∩ A are even more
disjoint: B i ∩ A ∩ B j ∩ A = B i ∩ B j ∩ A = ∅ ∩ A = ∅ for all i, j with i 6= j, and the last equal sign
again by the definition of conditional probability.
Problem 21. You draw two balls without replacement from an urn which has 7
white and 14 black balls.
If both balls are white, you roll a die, and your payoff is the number which the
die shows in dollars.
36 2. PROBABILITY FIELDS
If one ball is black and one is white, you flip a coin until you get your first head,
and your payoff will be the number of flips it takes you to get a head, in dollars again.
If both balls are black, you draw from a deck of 52 cards, and you get the number
shown on the card in dollars. (Ace counts as one, J, Q, and K as 11, 12, 13, i.e.,
basically the deck contains every number between 1 and 13 four times.)
Show that the probability that you receive exactly two dollars in this game is 1/6.
Answer. You know a complete disjunction of events: U = {ww}∪{bb}∪{wb}, with Pr[{ww}] =
7 6 1
21 20
= 10 ; Pr[{bb}] = 14 13
21 20
= 13
30
7 14
; Pr[{bw}] = 21 20
+ 14 7
21 20
7
= 15 . Furthermore you know the con-
ditional probabilities of getting 2 dollars conditonally on each of these events: Pr[{2}|{ww}] = 16 ;
1
Pr[{2}|{bb}] = 13 ; Pr[{2}|{wb}] = 41 . Now Pr[{2} ∩ {ww}] = Pr[{2}|{ww}] Pr[{ww}] etc., therefore
(2.7.6) Pr[{2}] = Pr[{2} ∩ {ww}] + Pr[{2} ∩ {bw}] + Pr[{2} ∩ {bb}]
1 7 6 1 7 14 14 7 1 14 13
(2.7.7) = + + +
6 21 20 4 21 20 21 20 13 21 20
1 1 1 7 1 13 1
(2.7.8) = + + =
6 10 4 15 13 30 6
Problem 22. 2 points A and B are arbitrary events. Prove that the probability
of B can be written as:
(2.7.9) Pr[B] = Pr[B|A] Pr[A] + Pr[B|A0 ] Pr[A0 ]
2.7. CONDITIONAL PROBABILITY 37
Problem 23. 2 points Prove the following lemma: If Pr[B|A1 ] = Pr[B|A2 ] (call
it c) and A1 ∩ A2 = ∅ (i.e., A1 and A2 are disjoint), then also Pr[B|A1 ∪ A2 ] = c.
Answer.
Pr[B ∩ (A1 ∪ A2 )] Pr[(B ∩ A1 ) ∪ (B ∩ A2 )]
Pr[B|A1 ∪ A2 ] = =
Pr[A1 ∪ A2 ] Pr[A1 ∪ A2 ]
Pr[B ∩ A1 ] + Pr[B ∩ A2 ] c Pr[A1 ] + c Pr[A2 ]
(2.7.10) = = = c.
Pr[A1 ] + Pr[A2 ] Pr[A1 ] + Pr[A2 ]
• b. 4 points What is the probability that all red balls are together? What is the
probability that all white balls are together?
Answer. All red balls together is the same as 3 reds first, multiplied by 6, because you may
have between 0 and 5 white balls before the first red. 38 72 61 · 6 = 28
3
. For the white balls you get
5 4 3 2 1 1
8 7 6 5 4
· 4 = 14
.
BTW, 3 reds first is same probability as 3 reds last, ie., the 5 whites first: 85 47 36 25 14 = 83 27 16 .
Problem 26. The first three questions here are discussed in [Lar82, example
2.6.3 on p. 62]: There is an urn with 4 white and 8 black balls. You take two balls
out without replacement.
2.7. CONDITIONAL PROBABILITY 39
3 4 4 8 1
(2.7.11) = + = .
3+84+8 7+48+4 3
This is the same as the probability that the first ball is white. The probabilities are not dependent
on the order in which one takes the balls out. This property is called “exchangeability.” One can
see it also in this way: Assume you number the balls at random, from 1 to 12. Then the probability
for a white ball to have the number 2 assigned to it is obviously 31 .
• e. 1 point What is the probability that both of them have the same color?
14 1 17 68
Answer. The sum of the two above, 33
+ 11
= 33
(or 132
).
40 2. PROBABILITY FIELDS
• g. 1 point Compute the probability that at least two of the three are black.
42 672 28 (8)(7)(6) 336 14
Answer. It is 55
. For exactly two: 1320
= 55
. For three it is (12)(11)(10)
= 1320
= .
1008 42 8
55
Together 1320
= 55
. One can also get is as: it is the complement of the last, or as 3
+
8
4
12
2 1 3
.
• h. 1 point Compute the probability that two of the three are of the same and
the third of a different color.
960 40 8 4
8 4 8 12
Answer. It is 1320
= 55
= 11
, or 1 2
+ 2 1 3
.
• i. 1 point Compute the probability that at least two of the three are of the same
color.
Answer. This probability is 1. You have 5 black socks and 5 white socks in your drawer.
There is a fire at night and you must get out of your apartment in two minutes. There is no light.
2.7. CONDITIONAL PROBABILITY 41
You fumble in the dark for the drawer. How many socks do you have to take out so that you will
have at least 2 of the same color? The answer is 3 socks.
Problem 27. If a poker hand of five cards is drawn from a deck, what is the prob-
ability that it will contain three aces? (How can the concept of conditional probability
help in answering this question?)
Answer. [Ame94, example 2.3.3 on p. 9] and [Ame94, example 2.5.1 on p. 13] give two
alternative ways to do it. The second answer uses conditional probability: Probability to draw
5·4·3
4 3 2 48 47
three aces in a row first and then 2 nonaces is 52 51 50 49 48
Then multiply this by 53 = 1·2·3 = 10
This gives 0.0017, i.e., 0.17%.
Problem 28. 2 points A friend tosses two coins. You ask: “did one of them
land heads?” Your friend answers, “yes.” What’s the probability that the other also
landed heads?
1 3 1
Answer. U = {HH, HT, T H, T T }; Probability is /
4 4
= 3
.
Problem 29. (Not eligible for in-class exams) [Ame94, p. 5] What is the prob-
ability that a person will win a game in tennis if the probability of his or her winning
a point is p?
42 2. PROBABILITY FIELDS
Answer.
20p(1 − p)3
(2.7.12) p4 1 + 4(1 − p) + 10(1 − p)2 +
1 − 2p(1 − p)
How to derive this: {ssss} has probability p4 ; {sssf s}, {ssf ss}, {sf sss}, and {f ssss} have
prob-
ability 4p4 (1 − p); {sssf f s} etc. (2 f and 3 s in the first 5, and then an s, together 52 = 10
possibilities) have probability 10p4 (1 − p)2 . Now {sssf f f } and 63 = 20 other possibilities give
deuce at least once in the game, i.e., the probability of deuce is 20p3 (1 − p)3 . Now Pr[win|deuce] =
p2 + 2p(1 − p)Pr[win|deuce], because you win either if you score twice in a row (p2 ) or if you get
deuce again (probablity 2p(1−p)) and then win. Solve this to get Pr[win|deuce] = p2 / 1−2p(1−p)
and then multiply this conditional probability with the probability of getting deuce at least once:
Pr[win after at least one deuce] = 20p3 (1 − p)3 p2 / 1 − 2p(1 − p) . This gives the last term in
(2.7.12).
Problem 30. (Not eligible for in-class exams) Andy, Bob, and Chris play the
following game: each of them draws a card without replacement from a deck of 52
cards. The one who has the highest card wins. If there is a tie (like: two kings and
no aces), then that person wins among those who drew this highest card whose name
comes first in the alphabet. What is the probability for Andy to be the winner? For
Bob? For Chris? Does this probability depend on the order in which they draw their
cards out of the stack?
Answer. Let A be the event that Andy wins, B that Bob, and C that Chris wins.
2.7. CONDITIONAL PROBABILITY 43
One way to approach this problem is to ask: what are the chances for Andy to win when he
draws a king?, etc., i.e., compute it for all 13 different cards. Then: what are the chances for Bob
to win when he draws a king, and also his chances for the other cards, and then for Chris.
It is computationally easier to make the following partitioning of all outcomes: Either all three
cards drawn are different (call this event D), or all three cards are equal (event E), or two of the
three cards are equal (T ). This third case will have to be split into T = H ∪ L, according to whether
the card that is different is higher or lower.
If all three cards are different, then Andy, Bob, and Chris have equal chances of winning; if all
three cards are equal, then Andy wins. What about the case that two cards are the same and the
third is different? There are two possibilities. If the card that is different is higher than the two
that are the same, then the chances of winning are evenly distributed; but if the two equal cards
are higher, then Andy has a 23 chance of winning (when the distribution of the cards Y (lower)
and Z (higher) among ABC is is ZZY and ZY Z), and Bob has a 31 chance of winning (when
the distribution is Y ZZ). What we just did was computing the conditional probabilities Pr[A|D],
Pr[A|E], etc.
Now we need the probabilities of D, E, and T . What is the probability that all three cards
3
drawn are the same? The probability that the second card is the same as the first is 51 ; and the
2 (3)(2) 6
probability that the third is the same too is 50
; therefore the total probability is (51)(50) = 2550 .
48 44 2112
The probability that all three are unequal is 51 50
= 2550
. The probability that two are equal and
3 48 432
the third is different is 3 51 50
= 2550 . Now in half of these cases, the card that is different is higher,
and in half of the cases it is lower.
44 2. PROBABILITY FIELDS
I.e., the probability that A wins is 926/2550 = 463/1275 = .363, the probability that B wins is
848/2550 = 424/1275 = .3325, and the probability that C wins is 776/2550 = 338/1275 = .304.
Here we are using Pr[A] = Pr[A|E] Pr[E] + Pr[A|H] Pr[H] + Pr[A|L] Pr[L] + Pr[A|D] Pr[D].
Problem 31. 4 points You are the contestant in a game show. There are three
closed doors at the back of the stage. Behind one of the doors is a sports car, behind
the other two doors are goats. The game master knows which door has the sports car
behind it, but you don’t. You have to choose one of the doors; if it is the door with
the sports car, the car is yours.
After you make your choice, say door A, the game master says: “I want to show
you something.” He opens one of the two other doors, let us assume it is door B,
and it has a goat behind it. Then the game master asks: “Do you still insist on door
A, or do you want to reconsider your choice?”
2.8. RATIO OF PROBABILITIES AS STRENGTH OF EVIDENCE 45
Can you improve your odds of winning by abandoning your previous choice and
instead selecting the door which the game master did not open? If so, by how much?
Answer. If you switch, you will lose the car if you had initially picked the right door, but you
will get the car if you were wrong before! Therefore you improve your chances of winning from 1/3
to 2/3. This is simulated on the web, see www.stat.sc.edu/∼west/javahtml/LetsMakeaDeal.html.
It is counterintuitive. You may think that one of the two other doors always has a goat behind
it, whatever your choice, therefore there is no reason to switch. But the game master not only shows
you that there is another door with a goat, he also shows you one of the other doors with a goat
behind it, i.e., he restricts your choice if you switch. This is valuable information. It is as if you
could bet on both other doors simultaneously, i.e., you get the car if it is behind one of the doors B
or C. I.e., if the quiz master had said: I give you the opportunity to switch to the following: you
get the car if it is behind B or C. Do you want to switch? The only doubt the contestant may have
about this is: had I not picked a door with the car behind it then I would not have been offered
this opportunity to switch.
well, etc. It can be shown that this strategy will not help: if his rival’s hypothesis is
true, then the probability that he will ever be able to publish results which seem to
show that his own hypothesis is true is still ≤ 1/k. I.e., the sequence of independent
observations ωi(2) , ωi(2) , . . . is such that
hYn Yn i 1
(2.8.1) Pr2 Pr1 [{ωi(j) }] ≥ k Pr2 [{ωi(1) }] for some n = 1, 2, . . . ≤
j=1 j=1
k
Bayes’s theorem tells us therefore: if we know that the effect happened, how sure
can we be that the cause happened? Clearly, Bayes’s theorem has relevance for
statistical inference.
Let’s stay with the example with learning for the exam; assume Pr[A] = 60%,
Pr[B|A] = .8, and Pr[B|A0 ] = .5. Then the probability that a student who passed
(.8)(.6) .48
the exam has learned for it is (.8)(.6)+(.5)(.4) = .68 = .706. Look at these numbers:
The numerator is the average percentage of students who learned and passed, and
the denominator average percentage of students who passed.
Problem 34. AIDS diagnostic tests are usually over 99.9% accurate on those
who do not have AIDS (i.e., only 0.1% false positives) and 100% accurate on those
who have AIDS (i.e., no false negatives at all). (A test is called positive if it indicates
that the subject has AIDS.)
• a. 3 points Assuming that 0.5% of the population actually have AIDS, compute
the probability that a particular individual has AIDS, given that he or she has tested
positive.
50 2. PROBABILITY FIELDS
Answer. A is the event that he or she has AIDS, and T the event that the test is positive.
Pr[T |A] Pr[A] 1 · 0.005
Pr[A|T ] = = =
Pr[T |A] Pr[A] + Pr[T |A0 ] Pr[A0 ] 1 · 0.005 + 0.001 · 0.995
100 · 0.5 1000 · 5 5000 1000
= = = = = 0.834028
100 · 0.5 + 0.1 · 99.5 1000 · 5 + 1 · 995 5995 1199
Even after testing positive there is still a 16.6% chance that this person does not have AIDS.
• b. 1 point If one is young, healthy and not in one of the risk groups, then the
chances of having AIDS are not 0.5% but 0.1% (this is the proportion of the applicants
to the military who have AIDS). Re-compute the probability with this alternative
number.
Answer.
1 · 0.001 100 · 0.1 1000 · 1 1000 1000
= = = = = 0.50025.
1 · 0.001 + 0.001 · 0.999 100 · 0.1 + 0.1 · 99.9 1000 · 1 + 1 · 999 1000 + 999 1999
Pr[B] = Pr[B ∩ A]/ Pr[A]. Therefore we will adopt as definition of independence the
so-called multiplication rule:
Definition: B and A are independent, notation B⊥A, if Pr[B ∩A] = Pr[B] Pr[A].
This is a symmetric condition, i.e., if B is independent of A, then A is also
independent of B. This symmetry is not immediately obvious given the above defi-
nition of independence, and it also has the following nontrivial practical implication
(this example from [Daw79a, pp. 2/3]): A is the event that one is exposed to some
possibly carcinogenic agent, and B the event that one develops a certain kind of
cancer. In order to test whether B⊥A, i.e., whether the exposure to the agent does
not increase the incidence of cancer, one often collects two groups of subjects, one
group which has cancer and one control group which does not, and checks whether
the exposure in these two groups to the carcinogenic agent is the same. I.e., the
experiment checks whether A⊥B, although the purpose of the experiment was to
determine whether B⊥A.
Problem 35. 3 points Given that Pr[B ∩ A] = Pr[B] · Pr[A] (i.e., B is inde-
pendent of A), show that Pr[B ∩ A0 ] = Pr[B] · Pr[A0 ] (i.e., B is also independent of
A0 ).
Answer. If one uses our heuristic definition of independence, i.e., B is independent of event
A if Pr[B|A] = Pr[B|A0 ], then it is immediately obvious since definition is symmetric in A and
A0 . However if we use the multiplication rule as the definition of independence, as the text of
52 2. PROBABILITY FIELDS
this Problem suggests, we have to do a little more work: Since B is the disjoint union of (B ∩ A)
and (B ∩ A0 ), it follows Pr[B] = Pr[B ∩ A] + Pr[B ∩ A0 ] or Pr[B ∩ A0 ] = Pr[B] − Pr[B ∩ A] =
Pr[B] − Pr[B] Pr[A] = Pr[B](1 − Pr[A]) = Pr[B] Pr[A0 ].
1
Problem 36. 2 points A and B are two independent events with Pr[A] = 3 and
Pr[B] = 41 . Compute Pr[A ∪ B].
1
Answer. Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] = Pr[A] + Pr[B] − Pr[A] Pr[B] = 3
+ 14 − 12
1
=
1
2
.
Problem 37. 3 points You have an urn with five white and five red balls. You
take two balls out without replacement. A is the event that the first ball is white,
and B that the second ball is white. a. What is the probability that the first ball
is white? b. What is the probability that the second ball is white? c. What is the
probability that both have the same color? d. Are these two events independent, i.e.,
is Pr[B|A] = Pr[A]? e. Are these two events disjoint, i.e., is A ∩ B = ∅?
Answer. Clearly, Pr[A] = 1/2. Pr[B] = Pr[B|A] Pr[A] + Pr[B|A0 ] Pr[A0 ] = (4/9)(1/2) +
5 4
(5/9)(1/2) = 1/2. The events are not independent: Pr[B|A] = 4/9 6= Pr[B], or Pr[A ∩ B] = 10 9
=
2/9 6= 1/4. They would be independent if the first ball had been replaced. The events are also not
disjoint: it is possible that both balls are white.
2.10. INDEPENDENCE OF EVENTS 53
2.10.2. Independence of More than Two Events. If there are more than
two events, we must require that all possible intersections of these events, not only
the pairwise intersections, follow the above multiplication rule. For instance,
Pr[A ∩ B] = Pr[A] Pr[B];
Pr[A ∩ C] = Pr[A] Pr[C];
(2.10.1) A, B, C mutually independent ⇐⇒
Pr[B ∩ C] = Pr[B] Pr[C];
Pr[A ∩ B ∩ C] = Pr[A] Pr[B] Pr[C].
This last condition is not implied by the other three. Here is an example. Draw a ball
at random from an urn containing four balls numbered 1, 2, 3, 4. Define A = {1, 4},
B = {2, 4}, and C = {3, 4}. These events are pairwise independent but not mutually
independent.
Problem 38. 2 points Flip a coin two times independently and define the fol-
lowing three events:
A = Head in first flip
(2.10.2) B = Head in second flip
C = Same face in both flips.
Are these three events pairwise independent? Are they mutually independent?
54 2. PROBABILITY FIELDS
HH HT 1
Answer. U = TH TT . A = {HH, HT }, B = {HH, T H}, C = {HH, T T }. Pr[A] = 2
,
1 1 1
Pr[B] = 2
, Pr[C] = 2
. They are pairwise independent, but Pr[A ∩ B ∩ C] = Pr[{HH}] = 4
6=
Pr[A] Pr[B] Pr[C], therefore the events cannot be mutually independent.
formulas,
Pr[A ∩ C] Pr[B ∩ C] Pr[A ∩ B ∩ C]
(2.10.3) = .
Pr[C] Pr[C] Pr[C]
Problem 40. 5 points Show that A⊥B|C is equivalent to Pr[A|B∩C] = Pr[A|C].
In other words: independence of A and B conditionally on C means: once we know
that C occurred, the additional knowledge whether B occurred or not will not help us
to sharpen our knowledge about A.
Literature about conditional independence (of random variables, not of events)
includes [Daw79a], [Daw79b], [Daw80].
'$
'$
'$
R S T
V
U W
&% &%
X
&%
One sees, this is very cumbersome, and usually unnecessarily so. If we toss a coin
5 times, the only thing we usually want to know is how many successes there were.
As long as the experiments are independent, the question how the successes were
distributed over the n different trials is far less important. This brings us to the
definition of a random variable, and to the concept of a sufficient statistic.
c
p p
p
c a
p b p p
or 2a2 + ac − √
c2 = (2a − √c)(a + c) = 0. The positive solution is therefore c = 2a. This gives
a + c = 3a = b 3, or b = a 3.
And the function quadplot, also written by Jim Ramsey, does quadrilinear plots,
meaning that proportions for four categories are plotted within a regular tetrahe-
dron. Quadplot displays the probability tetrahedron and its points using XGobi.
Each vertex of the triangle or tetrahedron corresponds to the degenerate probabil-
ity distribution in which one of the events has probability 1 and the others have
probability 0. The labels of these vertices indicate which event has probability 1.
2.11. HOW TO PLOT FREQUENCY VECTORS AND PROBABILITY VECTORS 59
The script kai is an example visualizing data from [Mor65]; it can be run using
the command ecmet.script(kai).
Example: Statistical linguistics.
In the study of ancient literature, the authorship of texts is a perplexing problem.
When books were written and reproduced by hand, the rights of authorship were
limited and what would now be considered forgery was common. The names of
reputable authors were borrowed in order to sell books, get attention for books, or the
writings of disciples and collaborators were published under the name of the master,
or anonymous old manuscripts were optimistically attributed to famous authors. In
the absence of conclusive evidence of authorship, the attribution of ancient texts
must be based on the texts themselves, for instance, by statistical analysis of literary
style. Here it is necessary to find stylistic criteria which vary from author to author,
but are independent of the subject matter of the text. An early suggestion was to use
the probability distribution of word length, but this was never acted upon, because
it is too dependent on the subject matter. Sentence-length distributions, on the
other hand, have proved highly reliable. [Mor65, p. 184] says that sentence-length
is “periodic rather than random,” therefore the sample should have at least about
100 sentences. “Sentence-length distributions are not suited to dialogue, they cannot
be used on commentaries written on one author by another, nor are they reliable on
such texts as the fragmentary books of the historian Diodorus Siculus.”
60 2. PROBABILITY FIELDS
Answer. In a text, passages with long sentences alternate with passages with shorter sen-
tences. This is why one needs at least 100 sentences to get a representative distribution of sen-
tences, and this is why fragments and drafts and commentaries on others’ writings do not exhibit
an average sentence length distribution: they do not have the melody of the finished text.
Besides the length of sentences, also the number of common words which express
a general relation (“and”, “in”, “but”, “I”, “to be”) is random with the same distri-
bution at least among the same genre. By contrast, the occurrence of the definite
article “the” cannot be modeled by simple probabilistic laws because the number of
nouns with definite article depends on the subject matter.
Table 1 has data about the epistles of St. Paul. Abbreviations: Rom Romans; Co1
1st Corinthians; Co2 2nd Corinthians; Gal Galatians; Phi Philippians; Col Colos-
sians; Th1 1st Thessalonians; Ti1 1st Timothy; Ti2 2nd Timothy; Heb Hebrews. 2nd
Thessalonians, Titus, and Philemon were excluded because they were too short to
give reliable samples. From an analysis of these and other data [Mor65, p. 224] the
first 4 epistles (Romans, 1st Corinthians, 2nd Corinthians, and Galatians) form a
consistent group, and all the other epistles lie more than 2 standard deviations from
the mean of this group (using χ2 statistics). If Paul is defined as being the author of
2.11. HOW TO PLOT FREQUENCY VECTORS AND PROBABILITY VECTORS 61
Galatians, then he also wrote Romans and 1st and 2nd Corinthians. The remaining
epistles come from at least six hands.
Rom Co1 Co2 Gal Phi Col Th1 Ti1 Ti2 Heb
no kai 386 424 192 128 42 23 34 49 45 155
one 141 152 86 48 29 32 23 38 28 94
two 34 35 28 5 19 17 8 9 11 37
3 or more 17 16 13 6 12 9 16 10 4 24
Problem 43. Enter the data from Table 1 into xgobi and brush the four epistles
which are, according to Morton, written by Paul himself. 3 of those points are almost
on top of each other, and one is a little apart. Which one is this?
Answer. In R, issue the commands library(xgobi) then data(PaulKAI) then quadplot(PaulKAI,
normalize = TRUE). If you have xgobi but not R, this dataset is one of the default datasets coming
with xgobi.
CHAPTER 3
Random Variables
3.1. Notation
Throughout these class notes, lower case bold letters will be used for vectors
and upper case bold letters for matrices, and letters that are not bold for scalars.
The (i, j) element of the matrix A is aij , and the ith element of a vector b is bi ;
the arithmetic mean of all elements is b̄. All vectors are column vectors; if a row
vector is needed, it will be written in the form b> . Furthermore, the on-line version
of these notes uses green symbols for random variables, and the corresponding black
symbols for the values taken by these variables. If a black-and-white printout of
the on-line version is made, then the symbols used for random variables and those
used for specific values taken by these random variables can only be distinguished
63
64 3. RANDOM VARIABLES
the turtle has run 10 meters, and when Achilles has run the 10 meters, then the turtle
has run 1 meter, etc. The Greeks were actually arguing whether Achilles would ever
reach the turtle.
This may sound like a joke, but in some respects, modern mathematics never
went beyond the level of the Greek philosophers. If a modern mathematicien sees
something like
n
1 X 1 10
(3.2.1) lim = 0, or lim i
= ,
i→∞ i n→∞ 10 9
i=0
then he will probably say that the lefthand term in each equation never really reaches
the number written on the right, all he will say is that the term on the left comes
arbitrarily close to it.
This is like saying: I know that Achilles will get as close as 1 cm or 1 mm to the
turtle, he will get closer than any distance, however small, to the turtle, instead of
simply saying that Achilles reaches the turtle. Modern mathematical proofs are full
of races between Achilles and the turtle of the kind: give me an ε, and I will prove to
you that the thing will come at least as close as ε to its goal (so-called epsilontism),
but never speaking about the moment when the thing will reach its goal.
Of course, it “works,” but it makes things terribly cumbersome, and it may have
prevented people from seeing connections.
66 3. RANDOM VARIABLES
Maybe a few years from now mathematics will be done right. We should not let
this temporary backwardness of mathematics allow to hold us back in our intuition.
∆y
The equation ∆x = 2x does not hold exactly on a parabola for any pair of given
(static) ∆x and ∆y; but if you take a pair (∆x, ∆y) which is moving towards zero
then this equation holds in the moment when they reach zero, i.e., when they vanish.
Writing dy and dx means therefore: we are looking at magnitudes which are in the
process of vanishing. If one applies a function to a moving quantity one again gets a
moving quantity, and the derivative of this function compares the speed with which
the transformedPn quantity moves with the speed of the original quantity. Likewise,
the equation i=1 21n = 1 holds in the moment when n reaches infinity. From this
point of view, the axiom of σ-additivity in probability theory (in its equivalent form
of rising or declining sequences of events) indicates that the probability of a vanishing
event vanishes.
Whenever we talk about infinitesimals, therefore, we really mean magnitudes
which are moving, and which are in the process of vanishing. dVx,y is therefore not,
as one might think from what will be said below, a static but small volume element
located close to the point (x, y), but it is a volume element which is vanishing into
the point (x, y). The probability density function therefore signifies the speed with
which the probability of a vanishing element vanishes.
68 3. RANDOM VARIABLES
This “inverse image” mapping is well behaved with respect to unions and inter-
sections, etc. In other words, we have identities x−1 (A ∩ B) = x−1 (A) ∩ x−1 (B) and
x−1 (A ∪ B) = x−1 (A) ∪ x−1 (B), etc.
Problem 44. Prove the above two identities.
Answer. These are a very subtle proofs. x−1 (A ∩ B) = {ω ∈ U : x(ω) ∈ A ∩ B} = {ω ∈
U : x(ω) ∈ A and x(ω) ∈ B = {ω ∈ U : x(ω) ∈ A} ∩ {ω ∈ U : x(ω) ∈ B} = x−1 (A) ∩ x−1 (B). The
other identity has a similar proof.
Problem 45. Show, on the other hand, by a counterexample, that the “direct
image” mapping defined by x(E) = {r ∈ R : there exists ω ∈ E with x(ω) = r} no
longer satisfies x(E ∩ F ) = x(E) ∩ x(F ).
By taking inverse images under a random variable x, the probability measure
on F is transplanted into a probability measure on the subsets of R by the simple
prescription Pr[B] = Pr x−1 (B) . Here, B is a subset of R and x−1 (B) one of U , the
Pr on the right side is the given probability measure on U , while the Pr on the left is
the new probability measure on R induced by x. This induced probability measure
is called the probability law or probability distribution of the random variable.
Every random variable induces therefore a probability measure on R, and this
probability measure, not the mapping itself, is the most important ingredient of
a random variable. That is why Amemiya’s first definition of a random variable
70 3. RANDOM VARIABLES
(definition 3.1.1 on p. 18) is: “A random variable is a variable that takes values
acording to a certain distribution.” In other words, it is the outcome of an experiment
whose set of possible outcomes is R.
Equation (3.4.5) is the definition of continuity from the right (because the limit
holds only for ε ≥ 0). Why is a cumulative distribution function continuous from
the right? For every nonnegative sequence ε1 ,T ε2 , . . . ≥ 0 converging to zero which
also satisfies ε1 ≥ ε2 ≥ . . . follows {x ≤ a} = i {x ≤ a + εi }; for these sequences,
therefore, the statement follows from what Problem 14 above said about the proba-
bility of the intersection of a declining set sequence. And a converging sequence of
nonnegative εi which is not declining has a declining subsequence.
A cumulative distribution function need not be continuous from the left. If
limε→0,ε>0 F (x − ε) 6= F (x), then x is a jump point, and the height of the jump is
the probability that x = x.
It is a matter of convention whether we are working with right continuous or
left continuous functions here. If the distribution function were defined as Pr[x < a]
72 3. RANDOM VARIABLES
(some authors do this, compare [Ame94, p. 43]), then it would be continuous from
the left but not from the right.
Answer. (3.4.6) does not hold generally, since its rhs is always = 0; the other two equations
always hold.
Answer. If q ≥ 0 then
√ √
(3.4.9) Fq (q) = Pr[z 2 ≤q] = Pr[− q≤z≤ q]
√ √
(3.4.10) = Pr[z≤ q] − Pr[z < − q]
√ √
(3.4.11) = Pr[z≤ q] − Pr[z> q]
√ √
(3.4.12) = Fz ( q) − (1 − Fz ( q))
√
(3.4.13) = 2Fz ( q) − 1.
Instead of the cumulative distribution function Fy one can also use the quan-
tile function Fy−1 to characterize a probability measure. As the notation suggests,
the quantile function can be considered some kind of “inverse” of the cumulative
distribution function. The quantile function is the function (0, 1) → R defined by
(3.4.14) Fy−1 (p) = inf{u : Fy (u) ≥ p}
or, plugging the definition of Fy into (3.4.14),
(3.4.15) Fy−1 (p) = inf{u : Pr[y≤u] ≥ p}.
The quantile function is only defined on the open unit interval, not on the endpoints
0 and 1, because it would often assume the values −∞ and +∞ on these endpoints,
and the information given by these values is redundant. The quantile function is
continuous from the left, i.e., from the other side than the cumulative distribution
74 3. RANDOM VARIABLES
Problem 49. You throw a pair of dice and your random variable x is the sum
of the points shown.
• a. Draw the cumulative distribution function of x.
Answer. This is Figure 1: the cdf is 0 in (−∞, 2), 1/36 in [2,3), 3/36 in [3,4), 6/36 in [4,5),
10/36 in [5,6), 15/36 in [6,7), 21/36 in [7,8), 26/36 on [8,9), 30/36 in [9,10), 33/36 in [10,11), 35/36
on [11,12), and 1 in [12, +∞).
q q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
Problem 50. 1 point Give the formula of the cumulative distribution function
of a random variable which is uniformly distributed between 0 and b.
Answer. 0 for x ≤ 0, x/b for 0 ≤ x ≤ b, and 1 for x ≥ b.
that value is twice as high, etc. The empirical cumulative distribution function can
be considered an estimate of the cumulative distribution function of the probability
distribution underlying the sample. [Rei89, p. 12] writes it as a sum of indicator
functions:
1X
(3.4.17) F = 1[xi ,+∞)
n i
Among the other probability measures we are only interested in those which can
be represented by a density function (absolutely continuous). A density function is a
nonnegative integrable function which, integrated over the whole line, gives 1. Given
78 3. RANDOM VARIABLES
Rb
such a density function, called fx (x), the probability Pr[x∈(a, b)] = a fx (x)dx. The
density function is therefore an alternate way to characterize a probability measure.
But not all probability measures have density functions.
Those who are not familiar with integrals should read up on them at this point.
Start with derivatives, then: the indefinite integral of a function is a function whose
derivative is the given function. Then it is an important theorem that the area under
the curve is the difference of the values of the indefinite integral at the end points.
This is called the definite integral. (The area is considered negative when the curve
is below the x-axis.)
The intuition of a density function comes out more clearly in terms of infinitesi-
mals. If fx (x) is the value of the density function at the point x, then the probability
that the outcome of x lies in an interval of infinitesimal length located near the point
x is the length of this interval, multiplied by fx (x). In formulas, for an infinitesimal
dx follows
(3.5.1) Pr x∈[x, x + dx] = fx (x) |dx| .
The name “density function” is therefore appropriate: it indicates how densely the
probability is spread out over the line. It is, so to say, the quotient between the
probability measure induced by the variable, and the length measure on the real
numbers.
3.6. TRANSFORMATION OF A SCALAR DENSITY FUNCTION 79
Absolute values are multiplicative, i.e., |t0 (x)dx| = |t0 (x)| |dx|; divide by |dx| to get
This is the transformation formula how to get the density of x from that of y. This
formula is valid for all x ∈ A; the density of x is 0 for all x ∈
/ A.
|dy|
Heuristically one can get this transformation as follows: write |t0 (x)| = |dx| , then
one gets it from fx (x) |dx| = fy (t(x)) |dy| by just dividing both sides by |dx|.
In other words, this transformation rule consists of 4 steps: (1) Determine A,
the range of the new variable; (2) obtain the transformation t which expresses the
old variable in terms of the new variable, and check that it is one-to-one on A; (3)
plug expression (2) into the old density; (4) multiply this plugged-in density by the
absolute value of the derivative of expression (2). This gives the density inside A; it
is 0 outside A.
An alternative proof is conceptually simpler but cannot be generalized to the
multivariate case: First assume t is monotonically increasing. Then Fx (x) = Pr[x ≤
x] = Pr[t(x) ≤ t(i)] = Fy (t(x)). Now differentiate and use the chain rule. Then
also do the monotonically decresing case. This is how [Ame94, theorem 3.6.1 on
pp. 48] does it. [Ame94, pp. 52/3] has an extension of this formula to many-to-one
functions.
3.6. TRANSFORMATION OF A SCALAR DENSITY FUNCTION 81
Problem 52. 4 points [Lar82, example 3.5.4 on p. 148] Suppose y has density
function
(
1 for 0 < y < 1
(3.6.4) fy (y) =
0 otherwise.
Answer. (1) Since y takes values only between 0 and 1, its logarithm takes values between
−∞ and 0, the negative logarithm therefore takes values between 0 and +∞, i.e., A = {x : 0 < x}.
(2) Express y in terms of x: y = e−x . This is one-to-one on the whole line, therefore also on A.
(3) Plugging y = e−x into the density function gives the number 1, since the density function does
not depend on the precise value of y, as long is we know that 0 < y < 1 (which we do). (4) The
derivative of y = e−x is −e−x . As a last step one has to multiply the number 1 by the absolute
value of the derivative to get the density inside A. Therefore fx (x) = e−x for x > 0 and 0 otherwise.
Problem 53. 6 points [Dhr86, p. 1574] Assume the random variable z has
the exponential distribution with parameter λ, i.e., its density function is fz (z) =
λ exp(−λz) for z > 0 and 0 for z ≤ 0. Define u = − log z. Show that the density
function of u is fu (u) = exp µ − u − exp(µ − u) where µ = log λ. This density will
be used in Problem 151.
82 3. RANDOM VARIABLES
Answer. (1) Since z only has values in (0, ∞), its log is well defined, and A = R. (2) Express
old variable in terms of new: −u = log z therefore z = e−u ; this is one-to-one everywhere. (3)
plugging in (since e−u > 0 for all u, we must plug it into λ exp(−λz)) gives . . . . (4) the derivative of
z = e−u is −e−u , taking absolute values gives the Jacobian factor e−u . Plugging in and multiplying
−u
gives the density of u: fu (u) = λ exp(−λe−u )e−u = λe−u−λe , and using λ exp(−u) = exp(µ − u)
this simplifies to the formula above.
Alternative without transformation rule for densities: Fu (u) = Pr[u≤u] = Pr[− log z≤u] =
R +∞ −u
Pr[log z≥ − u] = Pr[z≥e−u ] = −u λe−λz dz = −e−λz |+∞ e−u
= e−λe , now differentiate.
e
Problem 54. 4 points Assume the random variable z has the exponential dis-
√ its density function is fz (z) = exp(−z) for z ≥ 0 and 0
tribution with λ = 1, i.e.,
for z < 0. Define u = z. Compute the density function of u.
√
Answer. (1) A = {u : u ≥ 0} since always denotes the nonnegative square root; (2) Express
2
old variable in terms of new: z = u , this is one-to-one on A (but not one-to-one on all of R);
(3) then the derivative is 2u, which is nonnegative as well, no absolute values are necessary; (4)
multiplying gives the density of u: fu (u) = 2u exp(−u2 ) if u ≥ 0 and 0 elsewhere.
function of x is
n k
(3.7.1) px (k) = Pr[x=k] = p (1 − p)(n−k) k = 0, 1, 2, . . . , n
k
Proof is simple, every subset of k elements represents one possibility of spreading
out the k successes.
We will call any observed random variable a statistic. And we call a statistic t
sufficient for a parameter θ if and only if for any event A and for any possible value
t of t, the conditional probability Pr[A|t≤t] does not involve θ. This means: after
observing t no additional information can be obtained about θ from the outcome of
the experiment.
Problem 55. Show that x, the number of successes in the Bernoulli trial with
parameters p and n, is a sufficient statistic for the parameter p (the probability of
success), with n, the number of trials, a known fixed number.
Answer. Since the distribution of x is discrete, it is sufficient to show that for any given k,
Pr[A|x=k] does not involve p whatever the event A in the Bernoulli trial. Furthermore, since the
Bernoulli trial with n tries is finite, we only have to show it if A is an elementary event in F , i.e.,
an event consisting of one element. Such an elementary event would be that the outcome of the
trial has a certain given sequence of successes and failures. A general A is the finite disjoint union
of all elementary events contained in it, and if the probability of each of these elementary events
does not depend on p, then their sum does not either.
84 3. RANDOM VARIABLES
Pr[A ∩ {x=k}]
(3.7.2) Pr[A|x=k] = .
Pr[x=k]
If A is an elementary event whose number of sucesses is not k, then A ∩ {x=k} = ∅, therefore its
probability is 0, which does not involve p. If A is an elementary event which has
kk successes, then
A ∩ {x=k} = A, which has probability pk (1 − p)n−k . Since Pr[{x=k}] = n k
p (1 − p) n−k , the
n
terms in formula (3.7.2) that depend on p cancel out, one gets Pr[A|x=k] = 1/ k
. Again there is
no p in that formula.
• a. 3 points You make 4 independent trials. Show that the probability that the
first trial is successful, given that the total number of successes in the 4 trials is 3,
is 3/4.
Answer. Let B = {sf f f, sf f s, sf sf, sf ss, ssf f, ssf s, sssf, ssss} be the event that the first
trial is successful,
and let {x=3} = {f sss, sf ss, ssf s, sssf } be the event that there are 3 successes,
it has 43 = 4 elements. Then
Pr[B ∩ {x=3}]
(3.7.3) Pr[B|x=3] =
Pr[x=3]
3.8. PITFALLS OF DATA REDUCTION: THE ECOLOGICAL FALLACY 85
Now B ∩ {x=3} = {sf ss, ssf s, sssf }, which has 3 elements. Therefore we get
3 · p3 (1 − p) 3
(3.7.4) Pr[B|x=3] = = .
4 · p3 (1 − p) 4
lower suicide rates. Durkheim concluded from this that Protestants are more likely
to commit suicide than Catholics. But this is not a compelling conclusion. It may
have been that Catholics in predominantly Protestant provinces were taking their
own lives. The oversight of this logical possibility is called the “Ecological Fallacy”
[Sel58].
This seems like a far-fetched example, but arguments like this have been used to
discredit data establishing connections between alcoholism and unemployment etc.
as long as the unit of investigation is not the individual but some aggregate.
One study [RZ78] found a positive correlation between driver education and
the incidence of fatal automobile accidents involving teenagers. Closer analysis
showed that the net effect of driver education was to put more teenagers on the
road and therefore to increase rather than decrease the number of fatal crashes in-
volving teenagers.
Problem 57. 4 points Assume your data show that counties with high rates of
unemployment also have high rates of heart attacks. Can one conclude from this that
the unemployed have a higher risk of heart attack? Discuss, besides the “ecological
fallacy,” also other objections which one might make against such a conclusion.
Answer. Ecological fallacy says that such a conclusion is only legitimate if one has individual
data. Perhaps a rise in unemployment is associated with increased pressure and increased workloads
among the employed, therefore it is the employed, not the unemployed, who get the heart attacks.
3.9. INDEPENDENCE OF RANDOM VARIABLES 87
Even if one has individual data one can still raise the following objection: perhaps unemployment
and heart attacks are both consequences of a third variable (both unemployment and heart attacks
depend on age or education, or freezing weather in a farming community causes unemployment for
workers and heart attacks for the elderly).
But it is also possible to commit the opposite error and rely too much on indi-
vidual data and not enough on “neighborhood effects.” In a relationship between
health and income, it is much more detrimental for your health if you are poor in a
poor neighborhood, than if you are poor in a rich neighborhood; and even wealthy
people in a poor neighborhood do not escape some of the health and safety risks
associated with this neighborhood.
Another pitfall of data reduction is Simpson’s paradox. According to table 1,
the new drug was better than the standard drug both in urban and rural areas. But
if you aggregate over urban and rural areas, then it looks like the standard drug was
better than the new drug. This is an artificial example from [Spr98, p. 360].
of the event A and y indicator function of the event B, i.e., x takes the value 1 if A
occurs, and the value 0 otherwise, and similarly with y and B. Show that according
to the above definition of independence, x and y are independent if and only if the
events A and B are independent. (Hint: which are the only two events, other than
the certain event U and the null event ∅, that can be defined in terms of x)?
Answer. Only A and A0 . Therefore we merely need the fact, shown in Problem 35, that if A
and B are independent, then also A and B 0 are independent. By the same argument, also A0 and
B are independent, and A0 and B 0 are independent. This is all one needs, except the observation
that every event is independent of the certain event and the null event.
Answer. Because there is no guarantee that the sample frequencies converge. It is not phys-
ically impossible (although it is highly unlikely) that certain outcome will never be realized.
Note the difference between the sample mean, i.e., the average measured in a
given sample, and the “population mean” or expected value. The former is a random
variable, the latter is a parameter. I.e., the former takes on a different value every
time the experiment is performed, the latter does not.
Note that the expected value of the number of dots on a die is 3.5, which is not
one of the possible outcomes when one rolls a die.
Expected value can be visualized as the center of gravity of the probability mass.
If one of the tails has its weight so far out that there is no finite balancing point then
the expected value is infinite of minus infinite. If both tails have their weights so far
out that neither one has a finite balancing point, then the expected value does not
exist.
3.10. LOCATION AND DISPERSION PARAMETERS 91
It is trivial to show that for a function g(x) (which only needs to be defined for
those values which x can assume with nonzero probability), E[g(x)] = p1 g(x1 ) + · · · +
pn g(xn ).
Example of a countable probability mass distribution which has .Pan infinite ex-
∞ 1
pected value: Pr[x = x] = xa2 for x = 1, 2, . . .. (a is the constant 1 i=1 i2 .) The
P∞ a
expected value of x would be i=1 i , which is infinite. But if the random variable
is bounded, then its expected value exists.
The expected value of a continuous random variable is defined in terms of its
density function:
Z +∞
(3.10.2) E[x] = xfx (x) dx
−∞
It can be shown that for any function g(x) defined for all those x for which fx (x) 6= 0
follows:
Z
(3.10.3) E[g(x)] = g(x)fx (x) dx
fx (x)6=0
Here the integral is taken over all the points which have nonzero density, instead of
the whole line, because we did not require that the function g is defined at the points
where the density is zero.
92 3. RANDOM VARIABLES
Problem 60. Let the random variable x have the Cauchy distribution, i.e., its
density function is
1
(3.10.4) fx (x) =
π(1 + x2 )
Show that x does not have an expected value.
Answer.
Z Z Z
x dx 1 2x dx 1 d(x2 ) 1
(3.10.5) = = = ln(1 + x2 )
π(1 + x2 ) 2π 1 + x2 2π 1 + x2 2π
Rules about how to calculate with expected values (as long as they exist):
(3.10.6) E[c] = c if c is a constant
(3.10.7) E[ch] = c E[h]
(3.10.8) E[h + j] = E[h] + E[j]
Problem 61. 2 points You make two independent trials of a Bernoulli experi-
ment with success probability θ, and you observe t, the number of successes. Compute
the expected value of t3 . (Compare also Problem 197.)
Proof. The Jensen inequality holds with equality if h(x) is a linear func-
tion (with a constant term), i.e., in this case, E[h(x)] = h(E[x]). (2) Therefore
Jensen’s inequality is proved if we can find a linear function h with the two prop-
erties h(E[x]) = g(E[x]), and h(x) ≤ g(x) for all other x—because with such a
h, E[g(x)] ≥ E[h(x)] = h(E[x]). (3) The existence of such a h follows from con-
vexity. Since g is convex, for every point a ∈ B there is a number β so that
94 3. RANDOM VARIABLES
g(x) ≥ g(a) + β(x − a). This β is the slope of g if g is differentiable, and other-
wise it is some number between the left and the right derivative (which both always
exist for a convex function). We need this for a = E[x].
This existence is the deepest part of this proof. We will not prove it here, for a
proof see [Rao73, pp. 57, 58]. One can view it as a special case of the separating
hyperplane theorem.
Problem 62. Use Jensen’s inequality to show that (E[x])2 ≤ E[x2 ]. You are
allowed to use, without proof, the fact that a function is convex on B if the second
derivative exists on B and is nonnegative.
Problem 63. Show that the expected value of the empirical distribution of a
sample is the sample mean.
Other measures of locaction: The median is that number m for which there is
as much probability mass to the left of m as to the right, i.e.,
1 1
(3.10.11) Pr[x≤m] = or, equivalently, Fx (m) = .
2 2
It is much more robust with respect to outliers than the mean. If there is more than
one m satisfying (3.10.11), then some authors choose the smallest (in which case the
median is a special case of the quantile function m = F −1 (1/2)), and others the
average between the biggest and smallest. If there is no m with property (3.10.11),
3.10. LOCATION AND DISPERSION PARAMETERS 95
i.e., if the cumulative distribution function jumps from a value that is less than 12 to
a value that is greater than 21 , then the median is this jump point.
The mode is the point where the probability mass function or the probability
density function is highest.
Problem 64. Here we make the simple step from the definition of the variance
to the usually more convenient formula (3.10.13).
• a. 2 points Derive the formula var[x] = E[x2 ] − (E[x])2 from the definition of a
variance, which is var[x] = E[(x − E[x])2 ]. Hint: it is convenient to define µ = E[x].
Write it down carefully, you will lose points for missing or unbalanced parentheses
or brackets.
Answer. Here it is side by side with and without the notation E[x] = µ:
var[x] = E[(x − E[x])2 ] var[x] = E[(x − µ)2 ]
= E[x2 − 2x(E[x]) + (E[x])2 ] = E[x2 − 2xµ + µ2 ]
(3.10.17) 2 2 2
= E[x ] − 2(E[x]) + (E[x]) = E[x2 ] − 2µ2 + µ2
= E[x2 ] − (E[x])2 . = E[x2 ] − µ2 .
Problem 65. If all y i are independent with same variance σ 2 , then show that ȳ
has variance σ 2 /n.
3.10. LOCATION AND DISPERSION PARAMETERS 97
The standard deviation is the square root of the variance. Often preferred be-
cause has same scale as x. The variance, on the other hand, has the advantage of a
simple addition rule.
Standardization: if the random variable x has expected value µ and standard
deviation σ, then z = x−µ
σ has expected value zero and variance one.
An αth quantile or a 100αth percentile of a random variable x was already
defined previously to be the smallest number x so that Pr[x≤x] ≥ α.
Problem 66. 4 points Consumer M has an expected utility function for money
income u(x) = 12x − x2 . The meaning of an expected utility function is very simple:
if he owns an asset that generates some random income y, then the utility he derives
from this asset is the expected value E[u(y)]. He is contemplating acquiring two
assets. One asset yields an income of 4 dollars with certainty. The other yields an
expected income of 5 dollars with standard deviation 2 dollars. Does he prefer the
certain or the uncertain asset?
98 3. RANDOM VARIABLES
Answer. E[u(y)] = 12 E[y] − E[y 2 ] = 12 E[y] − var[y] − (E[y])2 . Therefore the certain asset
gives him utility 48 − 0 − 16 = 32, and the uncertain one 60 − 4 − 25 = 31. He prefers the certain
asset.
dk
(3.10.19) E[xk ] = m (t) .
x
dtk t=0
3.10. LOCATION AND DISPERSION PARAMETERS 99
Proof:
t2 x2 t3 x3
(3.10.20) etx = 1 + tx + + + ···
2! 3!
t2 t3
(3.10.21) mx (t) = E[etx ] = 1 + t E[x] + E[x2 ] + E[x3 ] + · · ·
2! 3!
2
d t
(3.10.22) mx (t) = E[x] + t E[x2 ] + E[x3 ] + · · ·
dt 2!
d2
(3.10.23) mx (t) = E[x2 ] + t E[x3 ] + · · · etc.
dt2
2. The moment generating function is also good for determining the probability
distribution of linear combinations of independent random variables.
a. it is easy to get the m.g.f. of λx from the one of x:
(3.10.24) mλx (t) = mx (λt)
λtx
because both sides are E[e ].
b. If x, y independent, then
(3.10.25) mx+y (t) = mx (t)my (t).
The proof is simple:
(3.10.26) E[et(x+y) ] = E[etx ety ] = E[etx ] E[ety ] due to independence.
100 3. RANDOM VARIABLES
√
The characteristic function is defined as ψx (t) = E[eitx ], where i = −1. It has
the disadvantage that it involves complex numbers, but it has the advantage that it
always exists, since exp(ix) = cos x + i sin x. Since cos and sin are both bounded,
they always have an expected value.
And, as its name says, the characteristic function characterizes the probability
distribution. Analytically, many of its properties are similar to those of the moment
generating function.
3.11. Entropy
3.11.1. Definition of Information. Entropy is the average information gained
by the performance of the experiment. The actual information yielded by an event
A with probabbility Pr[A] = p 6= 0 is defined as follows:
1
(3.11.1) I[A] = log2
Pr[A]
This is simply a transformation of the probability, and it has the dual interpretation
of either how unexpected the event was, or the informaton yielded by the occurrense
of event A. It is characterized by the following properties [AD75, pp. 3–5]:
• I[A] only depends on the probability of A, in other words, the information
content of a message is independent of how the information is coded.
3.11. ENTROPY 101
This formula uses log2 , logarithm with base 2, which can easily be computed from the
natural logarithms, log2 x = log x/ log 2. The choice of base 2 is convenient because
in this way the most informative Bernoulli experiment, that with success probability
p = 1/2 (coin flip), has entropy 1. This is why one says: “the entropy is measured
in bits.” If one goes over to logarithms of a different base, this simply means that
one measures entropy in different units. In order to indicate this dependence on the
measuring unit, equation (3.11.2) was written as the definition H[F ]
bits instead of H[F]
itself, i.e., this is the number one gets if one measures the entropy in bits. If one uses
natural logarithms, then the entropy is measured in “nats.”
Entropy can be characterized axiomatically by the following axioms [Khi57]:
• The uncertainty associated with a finite complete scheme takes its largest
value if all events are equally likely, i.e., H(p1 , . . . , pn ) ≤ H(1/n, . . . , 1/n).
• The addition of an impossible event to a scheme does not change the amount
of uncertainty.
• Composition Law: If the possible outcomes are arbitrarily combined into
m groups W 1 = X 11 ∪ · · · ∪ X 1k1 , W 2 = X 21 ∪ · · · ∪ X 2k2 , . . . , W m =
X m1 ∪ · · · ∪ X mkm , with corresponding probabilities w1 = p11 + · · · + p1k1 ,
w2 = p21 + · · · + p2k2 , . . . , wm = pm1 + · · · + pmkm , then
3.11. ENTROPY 103
H(p1 , . . . , pn ) = H(w1 , . . . , wn ) +
+ w1 H(p11 /w1 + · · · + p1k1 /w1 ) +
+ w2 H(p21 /w2 + · · · + p2k2 /w2 ) + · · · +
+ wm H(pm1 /wm + · · · + pmkm /wm ).
Since pij /wj = Pr[X ij |Wj ], the composition law means: if you first learn half the
outcome of the experiment, and then the other half, you will in the average get as
much information as if you had been told the total outcome all at once.
The entropy of a random variable x is simply the entropy of the probability
field induced by x on R. It does not depend on the values x takes but only on the
probabilities. For discretely distributed random variables it can be obtained by the
following “eerily self-referential” prescription: plug the random variable into its own
probability mass function and compute the expected value of the negative logarithm
of this, i.e.,
H[x]
(3.11.3) = E[− log2 px (x)]
bits
One interpretation of the entropy is: it is the average number of yes-or-no ques-
tions necessary to describe the outcome of the experiment. For instance, consider an
experiment which has 32 different outcomes occurring with equal probabilities. The
104 3. RANDOM VARIABLES
entropy is
32
H X 1
(3.11.4) = log2 32 = log2 32 = 5 i.e., H = 5 bits
bits i=1
32
which agrees with the number of bits necessary to describe the outcome.
Problem 67. Design a questioning scheme to find out the value of an integer
between 1 and 32, and compute the expected number of questions in your scheme if
all numbers are equally likely.
Answer. In binary digits one needs a number of length 5 to describe a number between 0 and
31, therefore the 5 questions might be: write down the binary expansion of your number minus 1.
Is the first binary digit in this expansion a zero, then: is the second binary digit in this expansion a
zero, etc. Formulated without the use of binary digits these same questions would be: is the number
between 1 and 16?, then: is it between 1 and 8 or 17 and 24?, then, is it between 1 and 4 or 9 and
12 or 17 and 20 or 25 and 28?, etc., the last question being whether it is odd. Of course, you can
formulate those questions conditionally: First: between 1 and 16? if no, then second: between 17
and 24? if yes, then second: between 1 and 8? Etc. Each of these questions gives you exactly the
entropy of 1 bit.
Problem 69. [CT91, example 2.1.2 on pp. 14/15]: The experiment has four
possible outcomes; outcome x=a occurs with probability 1/2, x=b with probability
1/4, x=c with probability 1/8, and x=d with probability 1/8.
• a. 2 points The entropy of this experiment (in bits) is one of the following
three numbers: 11/8, 7/4, 2. Which is it?
• d. 3 points Assume we know about the first outcome that x6=a. What is the
entropy of the remaining experiment (i.e., under the conditional probability)?
• e. 5 points Show in this example that the composition law for entropy holds.
3.11. ENTROPY 107
• a. 3 points How many questions does one need in the average to determine the
outcome of the roll of an unbiased die? In other words, pick a certain questioning
scheme (try to make it efficient) and compute the average number of questions if
this scheme is followed. Note that this average cannot be smaller than the entropy
H /bits, and if one chooses the questions optimally, it is smaller than H /bits + 1.
Answer. First question: is it bigger than 3? Second question: is it even? Third question (if
necessary): is it a multiple of 3? In this scheme, the number of questions for the six faces of the
die are 3, 2, 3, 3, 2, 3, therefore the average is 46 · 3 + 62 · 2 = 2 23 . Also optimal: (1) is it bigger than
2? (2) is it odd? (3) is it bigger than 4? Gives 2, 2, 3, 3, 3, 3. Also optimal: 1st question: is it 1 or
108 3. RANDOM VARIABLES
2? If anser is no, then second question is: is it 3 or 4?; otherwise go directly to the third question:
is it odd or even? The steamroller approach: Is it 1? Is it 2? etc. gives 1, 2, 3, 4, 5, 5 with expected
number 3 13 . Even this is here < 1 + H /bits.
Problem 71.
• a. 1 point Compute the entropy of a roll of two unbiased dice if they are
distinguishable.
Answer. Just twice the entropy from Problem 70.
H 1 1 1 ln 36
(3.11.7) = ln 36 + · · · + ln 36 = = 5.170
bits ln 2 36 36 ln 2
• b. Would you expect the entropy to be greater or less in the more usual case
that the dice are indistinguishable? Check your answer by computing it.
Answer. If the dice are indistinguishable, then one gets less information, therefore the exper-
iment has less entropy. One has six like pairs with probability 1/36 and 6 · 5/2 = 15 unlike pairs
with probability 2/36 = 1/18 each. Therefore the average information gained is
H 1 1 1 1 1 5
(3.11.8) = 6· ln 36 + 15 · ln 18 = ln 36 + ln 18 = 4.337
bits ln 2 36 18 ln 2 6 6
3.11. ENTROPY 109
• c. 3 points Note that the difference between these two entropies is 5/6 = 0.833.
How can this be explained?
Answer. This is the composition law (??) in action. Assume you roll two dice which you first
consider indistinguishable and afterwards someone tells you which is which. How much information
do you gain? Well, if the numbers are the same, then telling you which die is which does not give
you any information, since the outcomes of the experiment are defined as: which number has the
first die, which number has the second die, regardless of where on the table the dice land. But if
the numbers are different, then telling you which is which allows you to discriminate between two
outcomes both of which have conditional probability 1/2 given the outcome you already know; in
this case the information you gain is therefore 1 bit. Since the probability of getting two different
numbers is 5/6, the expected value of the information gained explains the difference in entropy.
All these definitions use the convention 0 log 01 = 0, which can be justified by the
following continuity argument: Define the function, graphed in Figure 3:
(
w log w1 if w > 0
(3.11.9) η(w) =
0 if w = 0.
η is continuous for all w ≥ 0, even at the boundary point w = 0. Differentiation gives
η 0 (w) = −(1 + log w), and η 00 (w) = −w−1 . The function starts out at the origin with
a vertical tangent, and since the second derivative is negative, it is strictly concave
for all w > 0. The definition of strict concavity is η(w) < η(v) + (w − v)η 0 (v) for
w 6= v, i.e., the function lies below all its tangents. Substituting η 0 (v) = −(1 + log v)
110 3. RANDOM VARIABLES
and simplifying gives w − w log w ≤ v − w log v for v, w > 0. One verifies that this
inequality also holds for v, w ≥ 0.
Problem 72. Make a complete proof, discussing all possible cases, that for
v, w ≥ 0 follows
(3.11.10) w − w log w ≤ v − w log v
Answer. We already know it for v, w > 0. Now if v = 0 and w = 0 then the equation reads
0 ≤ 0; if v > 0 and w = 0 the equation reads 0 ≤ v, and if w > 0 and v = 0 then the equation reads
w − w log w ≤ +∞.
his payoff. And the maximum expected value of this payoff is exactly the negative
of the entropy of the experiment.
Proof: Assume the correct value of the probability is p, and the number Clarence
tells Tina is q. For every p, q between 0 and 1 we have to show:
(3.11.11) p log p + (1 − p) log(1 − p) ≥ p log q + (1 − p) log(1 − q).
For this, plug w = p and v = q as well as w = 1 − p and v = 1 − q into equation
(3.11.10) and add.
3.11.4. The Inverse Problem. Now let us go over to the inverse problem:
computing those probability fields which have maximum entropy subject to the in-
formation you have.
If you know that the experiment has n different outcomes, and you do not know
the probabilities of these outcomes, then the maximum entropy approach amounts
to assigning equal probability 1/n to each outcome.
Problem 73. (Not eligible for in-class exams) You are playing a slot machine.
Feeding one dollar to this machine leads to one of four different outcomes: E 1 :
machine returns nothing, i.e., you lose $1. E 2 : machine returns $1, i.e., you lose
nothing and win nothing. E 3 : machine returns $2, i.e., you win $1. E 4 : machine
returns $10, i.e., you win $9. Events E i occurs with probability pi , but these proba-
bilities are unknown. But due to a new “Truth-in-Gambling Act” you find a sticker
112 3. RANDOM VARIABLES
1
w log w
.
6 @
@
1 . .......................................................
................ ...........
........
e
@ .........
....... ........
...... ........
..
...... @ .......
..
.... .......
.......
....
. @ .......
... ......
. ......
... ......
......
..
.
@ ......
... ......
......
..
. @ ......
.... ......
.....
.. . @ .....
....
....
- w
....
....
....
....
....
....
1 ....
....
e 1 ....
....
....
....
....
....
....
....
....
....
....
....
....
....
....
....
....
....
....
....
..
1
Figure 3. η : w 7→ w log w is continuous at 0, and concave everywhere
3.11. ENTROPY 113
on the side of the machine which says that in the long run the machine pays out only
$0.90 for every dollar put in. Show that those values of p1 , p2 , p3 , and p4 which
maximize the entropy (and therefore make the machine most interesting) subject to
the constraint that the expected payoff per dollar put in is $0.90, are p1 = 0.4473,
p2 = 0.3158, p3 = 0.2231, p4 = 0.0138.
Answer. SolutionP is derived in [Rie85, P pp. 68/9 and P74/5], and he refers to [Rie77]. You
have to maximize − pn log pn subject to pn = 1 and cn pn = d. In our case c1 = 0, c2 = 1,
c3 = 2, and c4 = 10, and d = 0.9, but the treatment below goes through for arbitrary ci as long as
not all of them are equal. This case is discussed in detail in the answer to Problem 74.
• a. Difficult: Does the maximum entropy approach also give us some guidelines
how to select these probabilities if all we know is that the expected value of the payout
rate is smaller than 1?
Answer. As shown in [Rie85, pp. 68/9 and 74/5], one can give the minimum value of the
entropy for all distributions with payoff smaller than 1: H < 1.6590, and one can also give some
bounds for the probabilities: p1 > 0.4272, p2 < 0.3167, p3 < 0.2347, p4 < 0.0214.
• b. What if you also know that the entropy of this experiment is 1.5?
Answer. This was the purpose of the paper [Rie85].
P
Problem 74. (Not eligible for in-class exams) Let p1 , p2 , . . . , pn ( pi = 1) be
the proportions of the population of a city living in n residential colonies. The cost of
114 3. RANDOM VARIABLES
living in colony i, which includes cost of travel from the colony to the central business
district, the cost of the time this travel consumes, the rent or mortgage payments,
and other costs associated with living in colony i, is represented by the monetary
amount ci . Without loss of generality we will assume that the ci are numbered in
such a way that c1 ≤ c2 ≤ · · · ≤ cn . We will also assume that the ci are not all
equal. We assume that the ci are known and that also the average expenditures on
travel etc. in the population is known; its value is d. One approach to modelling the
population distribution is to maximize the entropy subject to the average expenditures,
i.e., to choose p1 , p2 , . . . pn such that H = pi log p1i is maximized subject to the two
P
P P
constraints pi = 1 and pi ci = d. This would give the greatest uncertainty about
where someone lives.
(3.11.15) pi = Pexp(−λci )
exp(−λci )
Now all the pi depend on the same unknown λ, and this λ must be chosen such that the second
constraint holds. This is the Maxwell-Boltzmann distribution if µ = kT where k is the Boltzmann
constant and T the temperature.
(3.11.18)
X X X X X X
2 ai c2j aj − ci a i cj a j = (c2i + c2j − 2ci cj )ai aj = (ci − cj )2 ai aj ≥ 0
i j i j i,j i,j
and as long as the cost numbers are not all equal, the pi are uniquely determined by
the above entropy maximization problem.
Answer. Here is the derivative; it is negative because of the mathematical lemma just shown:
P P P 2
u0 v − uv 0 exp(−λci ) c2i exp(−λci ) − ci exp(−λci )
0
(3.11.20) f (λ) = =− 2 <0
v2
P
exp(−λci )
Since c1 ≤ c2 ≤ · · · ≤ cn , it follows
P P P
c exp(−λci ) c exp(−λci ) c exp(−λci )
(3.11.21) c1 = P1 ≤ Pi ≤ Pn = cn
exp(−λci ) exp(−λci ) exp(−λci )
Now the statement about the limit can be shown if not all cj are equal, say c1 < ck+1 but c1 = ck .
The fraction can be written as
Pn−k Pn−k
kc1 exp(−λc1 ) + ck+i exp(−λck+i ) kc1 + c
i=1 k+i
exp(−λ(ck+i − c1 ))
(3.11.22) P i=1 n−k = P n−k
k exp(−λc1 ) + i=1 exp(−λck+i ) k+ exp(−λ(ck+i − c1 ))
i=1
Although λ depends on d, show that ∂∂dH = λ, i.e., it is the same as if λ did not
depend on d. This is an example of the “envelope theorem,” and it also gives an
interpretation of λ.
Pexp(−λci )
P
Answer. We have to plug the optimal pi = into the formula for H = − pi log pi .
exp(−λci )
P
For this note that − log pi = λci + k(λ) where k(λ) = log( exp(−λcj )) does not depend on i.
pi = λd+k(λ), and ∂∂dH = λ+d ∂λ +k0 (λ) ∂λ
P P P
Therefore H = pi (λci +k(λ)) = λ pi ci +k(λ) ∂d ∂d
.
0
Now we need the derivative of k(λ), and we discover that k (λ) = −f (λ) where f (λ) was defined in
(3.11.19). Therefore ∂∂dH = λ + (d − f (λ)) ∂λ
∂d
= λ.
• e. 5 points Now assume d is not known (but the ci are still known), i.e., we
know that (3.11.12) holds for some λ but we don’t know which. We want to estimate
this λ (and therefore all pi ) by taking a random sample of m people from that met-
ropolitan area and asking them what their regional living expenditures are and where
they live. Assume xi people in this sample live in colony i. One way toPestimate this
xi
λ would be to use the average consumption expenditure of the sample, m ci , as an
estimatePof the missing d in the above procedure, i.e., choose that λ which satisfies
xi
f (λ) = m ci . Another procedure, which seems to make a better use of the infor-
mation given by the sample, would be to compute the maximum likelihood estimator
of λ based on all xi . Show that these two estimation procedures are identical.
3.11. ENTROPY 119
Answer. The xi have the multinomial distribution. Therefore, given that the proportion pi
of the population lives in colony i, and you are talking a random sample of size m from the whole
population, then the probability to get the outcome x1 , . . . , xn is
m!
(3.11.24) L= px1 px2 · · · px
n
n
x1 ! · · · xn ! 1 2
This is what we have to maximize, subject to the condition that the pi are an entropy maximizing
population distribution. Let’s take logs for computational simplicity:
X X
(3.11.25) log L = log m! − log xj ! + xi log pi
j
All we know about the pi is that they must be some entropy maximizing probabilities, but we don’t
know yet which ones, i.e., they dependP on the unknown λ. Therefore we need the formula again
− log pi = λci + k(λ) where k(λ) = log( exp(−λcj )) does not depend on i. This gives
X X X X
(3.11.26) log L = log m!− log xj !− xi (λci +k(λ)) = log m!− log xj !−λ xi ci +k(λ)m
j j
P
(for this last term remember that xi = m. Therefore the derivative is
1 ∂ X xi
(3.11.27) log L = ci − f (λ)
m ∂λ m
I.e., using the obvious estimate for d is the same as maximum likelihood under the assumption of
maximum entropy.
120 3. RANDOM VARIABLES
121
122 4. RANDOM NUMBER GENERATION AND ENCRYPTION
If xn is the current value of the “random seed” then a call to the random number
generator first computes
(4.0.28) xn+1 = (αxn + γ) mod µ
as the seed for the next call, and then returns xn+1 /µ as independent observation of
a pseudo random number which is uniformly distributed in (0, 1).
a mod b is the remainder in the integer division of a by b. For instance 13 mod 10 =
3, 16 mod 8 = 0, etc.
The selection of α, γ, and µ is critical here. We need the following criteria:
• The random generator should have a full period, i.e., it should produce all
numbers 0 < x < µ before repeating. (Once one number is repeated, the
whole cycle is repeated).
• The function should “appear random.”
4. RANDOM NUMBER GENERATION AND ENCRYPTION 123
theory behind the fact that linear congruential random number generators are good
generators.
If γ = 0 then the period is shorter: then the maximum period is µ − 1 because
any sequence which contains 0 has 0 everywhere. But not having to add γ at every
step makes computation easier.
Not all pairs α and µ give good random number generators, and one should only
use random number generators which have been thoroughly tested. There are some
examples of bad random number generators used in certain hardware or software
programs.
Problem 76. The dataset located at www.econ.utah.edu/ehrbar/data/randu.txt
(which is available as dataset randu in the R-base distribution) has 3 columns and
400 rows. Each row is a consecutive triple of numbers generated by the old VAX
FORTRAN function RANDU running under VMS 1.5. This random generator,
which is discussed in [Knu98, pp. 106/7], starts with an odd seed x0 , the n + 1st
seed is xn+1 = (65539xn ) mod 231 , and the data displayed are xn /231 rounded to 6
digits. Load the data into xgobi and use the Rotation view to check whether you
can see something suspicious.
Answer. All data are concentrated in 15 parallel planes. All triplets of observations of randu
fall into these planes; [Knu98, pp. ??] has a mathematical proof. VMS versions 2.0 and higher use
a different random generator.
4.1. ALTERNATIVES TO LINEAR CONGRUENTIAL 125
Efficient algorithms exist but are not in the repertoire of most computers. This
generator is completely free of the lattice structure of multiplicative congruential
generators.
126 4. RANDOM NUMBER GENERATION AND ENCRYPTION
Combine several random generators: If you have two random generators with
modulus m, use
(4.1.4) xm − ym mod µ
Equidistribution: either a Chi-Square test that the outcomes fall into d intervals,
or a Kolmogoroff-Smirnov test.
Serial test: that all integer pairs in the integer-valued outcome are equally likely.
Gap test: for 0 ≤ α < β ≤ 1 a gap of length r is a sequence of r + 1 consecutive
numbers in which the last one is in the interval, and the others are not. Count the
occurrence of such gaps, and make a Chi Squared test with the probabilities of such
occurrences. For instance, if α = 0 and β = 1/2 this computes the lengths of “runs
above the mean.”
Poker test: consider groups of 5 successive integers and classify them into the
7 categories: all different, one pair, two pairs, three of a kind, full house, four of a
kind, five of a kind.
Coupon collectors test: observe the length of sequences required to get a full set
of integers 0, . . . , d − 1.
Permutation test: divide the input sequence of the continuous random variable
into t-element groups and look at all possible relative orderings of these k-tuples.
There are t! different relative orderings, and each ordering has probability 1/t!.
Run test: counts runs up, but don’t use Chi Square test since subsquent runs
are not independent; a long run up is likely to be followed by a short run up.
128 4. RANDOM NUMBER GENERATION AND ENCRYPTION
Maximum-of-t-Test: split the sample into batches of equal length and take the
maximum of each batch. Taking these maxima to the tth power should again give
an equidistributed sample.
Collision tests: 20 consecutive observations are all smaller than 1/2 with prob-
ability 2−20 ; and every other partition defined by combinations of bigger or smaller
than 1/2 has the same probability. If there are only 214 observations, then on the
average each of these partitions is populated only with probability 1/64. We count
the number of “collisions”, i.e., the number of partitions which have more than 1 ob-
servation in them, and compare this with the binomial distribution (the Chi Square
cannot be applied here).
Birthday spacings test: lagged Fibonacci generators consistently fail it.
Serial correlation test: a statistic which looks like a sample correlation coefficient
which can be easily computed with the Fast Fourier transformation.
Tests on subsequences: equally spaced subsequences are usually worse than the
original sequence if it is a linear congruential generator.
the function WH.from.current.seed() gets a number between 0 and 1 from its argument (which
has the same default). Both functions are one-liners:
The original computer program doing this kind of encryption was written by the
programmer Phil Zimmerman. The program is called PGP, “Pretty Good Privacy,”
and manuals are [Zim95] and [Sta95]. More recently, a free version of this program
has been written, called GNU Privacy Guard, textttwww.gnupg.org, which does not
use the patented IDEA algorithm and is under the Gnu Public License.
Here is the mathematics of it. I am following [Sch97, p. 120, 130], a relevant
and readable book. First some number-theoretic preliminaries.
Fermat’s theorem: for a prime p and an integer b not divisible by p, bp−1 mod p =
1.
Euler’s φ function or Euler’s “totient” is the number of positive integers r smaller
than m that are coprime to m, i.e., have no common divisors with m. Example: for
m = 10, the coprime numbers are r = 1, 3, 7, 9, therefore φ(m) = 4.
If m is prime, then φ(m) = m − 1.
If m and n are coprime, then φ(mn) = φ(m)φ(n).
If p and q are to different prime numbers, then φ(pq) = (p − 1)(q − 1).
Euler’s theorem extends Fermat’s theorem: if b is coprime with e, then bφ(e) mod e =
1.
Application to digital encryption (RSA algorithm): r is a large “modulus” (in-
dicating the largest message size which is encrypted in one step) and the plaintext
message is a number M with 1 < M < r which must be coprime with r. (A real
4.4. PUBLIC KEY CRYPTOLOGY 135
life text is first converted into a sequence of positive integers M < r which are then
encrypted individually. Indeed, since the RSA algorithm is rather slow, the message
is encrypted with a temporary ordinary secret key, and only this key is encrypted
with the RSA algorithm and attached to the conventionally encrypted message.) By
applying Euler’s theorem twice one can show that pairs of integers s and t exist such
that encryption consists in raising to the sth power modulo r, and decryption in rais-
ing to the tth power modulo r. I.e., in order to encrypt one computes E = M s mod r,
and one can get M back by computing M = E t mod r.
If s is any number coprime with φ(r), then t = sφ(φ(r))−1 mod φ(r) is the de-
cryption key belonging to s. To prove this, we will first show that E t mod r =
M st mod r = M . Now st = sφ(φ(r)) , and since s is coprime with φ(r), we can apply
Euler’s theorem to get st mod φ(r) = 1, i.e., a k exists with st = 1 + kφ(r). Therefore
E t mod r = M st mod r = (M M kφ(r) ) mod r = M (M φ(r) mod r)k mod r. A second
application of Euler’s theorem says that M φ(r) mod r = 1, therefore M st mod r =
M mod r = M . Finally, since M φ(r) mod r = 1, we get M st mod r = M st mod φ(r) mod r.
If r is a prime and s is coprime with r − 1, then someone who has enough
information to do the encryption, i.e., who knows s and r, can also easily compute
t: t = sφ(r−1)−1 .
But if r is the product of two different big primes, call them p and q, then someone
who knows p and q can compute pairs s, t fairly easily, but it is computationally very
136 4. RANDOM NUMBER GENERATION AND ENCRYPTION
• c. 2 points Therefore the public key is {7, 33} and the private key {t, 33} with
the t just computed. Now take a plaintext consisting of the number 5, use the public
key to encrypt it. What is the encrypted text? Use the private key to decrypt it again.
4.4. PUBLIC KEY CRYPTOLOGY 137
Answer. If the plaintext = 5, then encryption is the computation of 57 mod 33 = 78125 mod 33 =
14. Decryption is the computation of 143 mod 33 = 2744 mod 33 = 5.
• d. 1 point This procedure is only valid if the plaintext is coprime with t. What
should be done about this?
Answer. Nothing. t is huge, and if it is selected in such a way that it does not have many
different prime multipliers, the chance that a text happens to be not coprime with it is minuscule.
• e. 2 points Now take the same plaintext and use the private key to encrypt it.
What is the encrypted text? Then use the public key to decrypt it.
Answer. If the plaintext = 5, then encryption is the computation of 53 mod 33 = 125 mod 33 =
26. Decryption is the computation of 267 mod 33 = 8031810176 mod 33 = 5.
CHAPTER 5
5.1. Binomial
We will begin with mean and variance of the binomial variable, i.e., the number
of successes in n independent repetitions of a Bernoulli trial (3.7.1). The binomial
variable has the two parameters n and p. Let us look first at the case n = 1, in which
the binomial variable is also called indicator variable: If the event A has probability
p, then its complement A0 has the probability q = 1 − p. The indicator variable of
A, which assumes the value 1 if A occurs, and 0 if it doesn’t, has expected value p
and variance pq. For the binomial variable with n observations, which is the sum of
n independent indicator variables, the expected value (mean) is np and the variance
is npq.
139
140 5. SPECIFIC RANDOM VARIABLES
Problem 79. The random variable x assumes the value a with probability p and
the value b with probability q = 1 − p. Show that var[x] = pq(a − b)2 .
Answer. E[x] = pa + qb; var[x] = E[x2 ] − (E[x])2 = pa2 + qb2 − (pa + qb)2 = (p − p2 )a2 −
2pqab + (q − q 2 )b2 = pq(a − b)2 . For this last equality we need p − p2 = p(1 − p) = pq.
The Negative Binomial Variable is, like the binomial variable, derived from the
Bernoulli experiment; but one reverses the question. Instead of asking how many
successes one gets in a given number of trials, one asks, how many trials one must
make to get a given number of successes, say, r successes.
First look at r = 1. Let t denote the number of the trial at which the first success
occurs. Then
(5.1.1) Pr[t=n] = pq n−1 (n = 1, 2, . . .).
This is called the geometric probability.
Is the probability derived in this way σ-additive? The sum of a geometrically
declining sequence is easily computed:
(5.1.2) 1 + q + q 2 + q 3 + · · · = s Now multiply by q:
(5.1.3) q + q 2 + q 3 + · · · = qs Now subtract and write 1 − q = p:
(5.1.4) 1 = ps
5.1. BINOMIAL 141
• b. 2 points Let m and n be two positive integers with m < n. Show that
Pr[t=n|t>m] = Pr[t=n − m].
Pr[t=n] pq n−1
Answer. Pr[t=n|t>m] = Pr[t>m]
= qm
= pq n−m−1 = Pr[t=n − m].
• c. 1 point Why is this property called the memory-less property of the geometric
random variable?
Answer. If you have already waited for m periods without success, the probability that success
will come in the nth period is the same as the probability that it comes in n − m periods if you
start now. Obvious if you remember that geometric random variable is time you have to wait until
1st success in Bernoulli trial.
Now let us look at the negative binomial with arbitrary r. What is the probability
that it takes n trials to get r successes? (That means, with n − 1 trials we did not yet
have r successes.) The probability that the nth trial is a successis p. The probability
that there are r − 1 successes in the first n − 1 trials is n−1
r−1 p
r−1 n−r
q . Multiply
those to get:
n − 1 r n−r
(5.1.13) Pr[t=n] = p q .
r−1
This is the negative binomial, also called the Pascal probability distribution with
parameters r and p.
5.1. BINOMIAL 145
One easily gets the mean and variance, because due to the memory-less property
it is the sum of r independent geometric variables:
r rq
(5.1.14) E[t] = var[t] = 2
p p
Some authors define the negative binomial as the number of failures before the
rth success. Their formulas will look slightly different than ours.
Problem 82. 3 points A fair coin is flipped until heads appear 10 times, and x
is the number of times tails appear before the 10th appearance of heads. Show that
the expected value E[x] = 10.
Answer. Let t be the number of the throw which gives the 10th head. t is a negative binomial
with r = 10 and p = 1/2, therefore E[t] = 20. Since x = t − 10, it follows E[x] = 10.
Problem 83. (Banach’s match-box problem) (Not eligible for in-class exams)
There are two restaurants in town serving hamburgers. In the morning each of them
obtains a shipment of n raw hamburgers. Every time someone in that town wants
to eat a hamburger, he or she selects one of the two restaurants at random. What is
the probability that the (n + k)th customer will have to be turned away because the
restaurant selected has run out of hamburgers?
Answer. For each restaurant it is the negative binomial probability distribution in disguise:
if a restaurant runs out of hamburgers this is like having n successes in n + k tries.
146 5. SPECIFIC RANDOM VARIABLES
But one can also reason it out: Assume one of the restaurantes must turn customers away
after the n + kth customer. Write down all the n + k decisions made: write a 1 if the customer
goes to the first restaurant, and a 2 if he goes to the second. I.e., write down n + k ones and twos.
Under what conditions will such a sequence result in the n + kth move eating the last hamburgerthe
first restaurant? Exactly if it has n ones and k twos, a n + kth move is a one. As in the reasoning
for the negative binomial probability distribution, there are n+k−1 n−1
possibilities, each of which
has probability 2−n−k . Emptying the
1−n−k second restaurant has the same probability. Together the
probability is therefore n+k−1
n−1
2 .
pick y white balls from the set of all white balls (there are wy possibilities to do
pick m − y black balls from the set of all black balls, which can
that), and then you
n−w
be done in m−y different ways. Every union of such a set of white balls with a set
of black balls gives a set
of m elements with exactly y white balls, as desired. There
are therefore wy m−y
n−w
different such sets, and the probability of picking such a set
is
w n−w
y m−y
(5.2.1) Pr[Sample of m elements has exactly y white balls] = n
.
m
Problem 84. You have an urn with w white and n − w black balls in it, and you
take a sample of m balls with replacement, i.e., after pulling each ball out you put it
back in before you pull out the next ball. What is the probability that y of these balls
are white? I.e., we are asking here for the counterpart of formula (5.2.1) if sampling
is done with replacement.
Answer.
y m−y
w n−w m
(5.2.2)
n n y
148 5. SPECIFIC RANDOM VARIABLES
Without proof we will state here that the expected value of y, the number of
white balls in the sample, is E[y] = m w
n , which is the same as if one would select the
balls with replacement.
Also without proof, the variance of y is
w (n − w) (n − m)
(5.2.3) var[y] = m .
n n (n − 1)
This is smaller than the variance if one would choose with replacement, which is
represented by the above formula without the last term n−m n−1 . This last term is
called the finite population correction. More about all this is in [Lar82, p. 176–183].
success λt
n . (This is an approximation since some of these intervals may have more
than one occurrence; but if the intervals become very short the probability of having
two occurrences in the same interval becomes negligible.)
In this discrete approximation, the probability to have k successes in time t is
n λt k λt (n−k)
(5.3.1) Pr[x=k] = 1−
k n n
1 n(n − 1) · · · (n − k + 1) k
λt n λt −k
(5.3.2) = (λt) 1 − 1 −
k! nk n n
k
(λt) −λt
(5.3.3) → e for n → ∞ while k remains constant
k!
(5.3.3) is the limit because the second and the last term in (5.3.2) → 1. The sum
P∞ k
of all probabilities is 1 since k=0 (λt) λt
k! = e . The expected value is (note that we
can have the sum start at k = 1):
∞ ∞
X (λt)k X (λt)k−1
(5.3.4) E[x] = e−λt k = λte−λt = λt.
k! (k − 1)!
k=1 k=1
Answer. That which gives the right expected value, i.e., λ = np.
Problem 87. Two researchers counted cars coming down a road, which obey a
Poisson distribution with unknown parameter λ. In other words, in an interval of
length t one will have k cars with probability
(λt)k −λt
(5.3.8) e .
k!
Their assignment was to count how many cars came in the first half hour, and how
many cars came in the second half hour. However they forgot to keep track of the
time when the first half hour was over, and therefore wound up only with one count,
namely, they knew that 213 cars had come down the road during this hour. They
were afraid they would get fired if they came back with one number only, so they
applied the following remedy: they threw a coin 213 times and counted the number of
heads. This number, they pretended, was the number of cars in the first half hour.
• a. 6 points Did the probability distribution of the number gained in this way
differ from the distribution of actually counting the number of cars in the first half
hour?
Answer. First a few definitions: x is the total number of occurrences in the interval [0, 1]. y
is the number of occurrences in the interval [0, t] (for a fixed t; in the problem it was t = 12 , but we
152 5. SPECIFIC RANDOM VARIABLES
will do it for general t, which will make the notation clearer and more compact. Then we want to
compute Pr[y=m|x=n]. By definition of conditional probability:
How can we compute the probability of the intersection Pr[y=m and x=n]? Use a trick: express
this intersection as the intersection of independent events. For this define z as the number of
events in the interval (t, 1]. Then {y=m and x=n} = {y=m and z=n − m}; therefore Pr[y=m and
x=n] = Pr[y=m] Pr[z=n − m]; use this to get
(5.3.10)
n−m
λm tm −λt λ (1−t)n−m −λ(1−t)
Pr[y=m] Pr[z=n − m] m!
e (n−m)!
e n m
Pr[y=m|x=n] = = λn −λ
= t (1−t)n−m ,
Pr[x=n] n!
e m
k (λt)k (1−λ)k tk
Here we use the fact that Pr[x=k] = tk! e−t , Pr[y=k] = k! e−λt , Pr[z=k] = k!
e−(1−λ)t .
One sees that a. Pr[y=m|x=n] does not depend on λ, and b. it is exactly the probability of having m
successes and n − m failures in a Bernoulli trial with success probability t. Therefore the procedure
with the coins gave the two researchers a result which had the same probability distribution as if
they had counted the number of cars in each half hour separately.
• b. 2 points Explain what it means that the probability distribution of the number
for the first half hour gained by throwing the coins does not differ from the one gained
5.3. THE POISSON DISTRIBUTION 153
by actually counting the cars. Which condition is absolutely necessary for this to
hold?
Answer. The supervisor would never be able to find out through statistical analysis of the
data they delivered, even if they did it repeatedly. All estimation results based on the faked statistic
would be as accurate regarding λ as the true statistics. All this is only true under the assumption
that the cars really obey a Poisson distribution and that the coin is fair.
The fact that the Poisson as well as the binomial distributions are memoryless has nothing to
do with them having a sufficient statistic.
the customer, and if it shows tails, Karl does. Compute the probability that Herbert
has to serve exactly one customer during the hour. Hint:
1 1 1
(5.3.12) e = 1 + 1 + + + + ··· .
2! 3! 4!
• c. For any integer k ≥ 0, compute the probability that Herbert has to serve
exactly k customers during the hour.
Problem 89. 3 points Compute the moment generating function of a Poisson
k
variable observed over a unit time interval, i.e., x satisfies Pr[x=k] = λk! e−λ and
you want E[etx ] for all t.
tx
P∞ tk λk −λ P∞ (λet )k −λ λet −λ λ(et −1)
Answer. E[e ] = k=0
e k!
e = k=0 k!
e =e e =e .
Ft (t) = Pr[t≤t] = 1 − e−λt when t ≥ 0, and Ft (t) = 0 for t < 0. The density function
is therefore ft (t) = λe−λt for t ≥ 0, and 0 otherwise. This is called the exponential
density function (its discrete analog is the geometric random variable). It can also
be called a Gamma variable with parameters r = 1 and λ.
R∞ R∞ ∞ R ∞ u=t v 0 = e−λt
the second do it again: 2te−λt dt = uv 0 dt = uv − u0 vdt, where
0 0 0 0 u0 = 1 v = −(1/λ)e−λt
∞ R∞
Therefore the second term becomes 2(t/λ)e−λt +2 (1/λ)e−λt dt = 2/λ2 .
0 0
Problem 92. 2 points Does the exponential random variable with parameter
λ > 0, whose cumulative distribution function is Ft (t) = 1 − e−λt for t ≥ 0, and
0 otherwise, have a memory-less property? Compare Problem 80. Formulate this
memory-less property and then verify whether it holds or not.
Answer. Here is the formulation: for s<t follows Pr[t>t|t>s] = Pr[t>t − s]. This does indeed
Pr[t>t and t>s] Pr[t>t] e−λt
hold. Proof: lhs = Pr[t>s]
= Pr[t>s]
= e−λs
= e−λ(t−s) .
• b. 2 points What is the probability that an unemployment spell ends after time
t + h, given that it has not yet ended at time t? Show that this is the same as the
5.4. THE EXPONENTIAL DISTRIBUTION 157
Pr[t>t + h] e−λ(t+h)
(5.4.1) Pr[t>t + h|t>t] = = = e−λh
Pr[t>t] e−λt
158 5. SPECIFIC RANDOM VARIABLES
Problem 94. 3 points Compute the density function of t(3) , the time of the third
occurrence of a Poisson variable.
Answer.
(5.5.2) Pr[t(3) >t] = Pr[x=0] + Pr[x=1] + Pr[x=2]
λ2 2 −λt
(5.5.3) Ft(3) (t) = Pr[t(3) ≤t] = 1 − (1 + λt + t )e
2
∂ λ2 2 λ3 2 −λt
(5.5.4) ft(3) (t) = Ft(3) (t) = − −λ(1 + λt + t ) + (λ + λ2 t) e−λt = t e .
∂t 2 2
5.5. THE GAMMA DISTRIBUTION 159
If one asks for the rth occurrence, again all but the last term cancel in the
differentiation, and one gets
λr
(5.5.5) ft(r) (t) = tr−1 e−λt .
(r − 1)!
This density is called the Gamma density with parameters λ and r.
The following definite integral, which is defined for all r > 0 and all λ > 0 is
called the Gamma function:
Z ∞
(5.5.6) Γ(r) = λr tr−1 e−λt dt.
0
Problem 96. 3 points Show by partial integration that the Gamma function
satisfies Γ(r + 1) = rΓ(r).
Answer. Start with
Z ∞
(5.5.8) Γ(r + 1) = λr+1 tr e−λt dt
0
R R
and integrate by parts: u0 vdt = uv − uv 0 dt with u0 = λe−λt and v = λr tr , therefore u = −e−λt
and v 0 = rλr tr−1 :
∞ Z ∞
(5.5.9) Γ(r + 1) = −λr tr e−λt rλr tr−1 e−λt dt = 0 + rΓ(r).
+
0 0
Problem 97. Show that Γ(r) = (r − 1)! for all natural numbers r = 1, 2, . . ..
Answer. Proof by induction. First verify that it holds for r = 1, i.e., that Γ(1) = 1:
Z ∞ ∞
(5.5.10) Γ(1) = λe−λt dt = −e−λt =1
0
0
and then, assuming that Γ(r) = (r − 1)! Problem 96 says that Γ(r + 1) = rΓ(r) = r(r − 1)! = r!.
√
Without proof: Γ( 21 ) = π. This will be shown in Problem 161.
5.5. THE GAMMA DISTRIBUTION 161
Therefore the following defines a density function, called the Gamma density
with parameter r and λ, for all r > 0 and λ > 0:
λr r−1 −λx
(5.5.11) f (x) = x e for x ≥ 0, 0 otherwise.
Γ(r)
The only application we have for it right now is: this is the distribution of the time
one has to wait until the rth occurrence of a Poisson distribution with intensity λ.
Later we will have other applications in which r is not an integer.
Problem 98. 4 points Compute the moment generating function of the Gamma
distribution.
Answer.
Z ∞
λr r−1 −λx
(5.5.12) mx (t) = E[etx ] = etx x e dx
0
Γ(r)
Z ∞
λr (λ − t)r xr−1 −(λ−t)x
(5.5.13) = e dx
(λ − t)r Γ(r)
r 0
λ
(5.5.14) =
λ−t
since the integrand in (5.5.12) is the density function of a Gamma distribution with parameters r
and λ − t.
162 5. SPECIFIC RANDOM VARIABLES
Problem 99. 2 points The density and moment generating functions of a Gamma
variable x with parameters r > 0 and λ > 0 are
λr r−1 −λx
(5.5.15) fx (x) = x e for x ≥ 0, 0 otherwise.
Γ(r)
λ r
(5.5.16) mx (t) =
.
λ−t
Show the following: If x has a Gamma distribution with parameters r and 1, then v =
x/λ has a Gamma distribution with parameters r and λ. You can prove this either
using the transformation theorem for densities, or the moment-generating function.
Answer. Solution using density function: The random variable whose density we know is x;
1
its density is Γ(r) xr−1 e−x . If x = λv, then dx
dv
= λ, and the absolute value is also λ. Therefore the
λr
density of v is Γ(r)
v r−1 e−λv . Solution using the mgf:
1 r
(5.5.17) mx (t) = E[etx ] =
1−t
1 r λ r
(5.5.18) mv (t) E[etv ] = E[e(t/λ)x ] = =
1 − (t/λ) λ−t
but this last expression can be recognized to be the mgf of a Gamma with r and λ.
5.5. THE GAMMA DISTRIBUTION 163
Problem 101. Show that a Gamma variable x with parameters r and λ has
expected value E[x] = r/λ and variance var[x] = r/λ2 .
Answer. Proof with moment generating function:
r r+1
d λ r λ
(5.5.20) = ,
dt λ−t λ λ−t
r r(r+1)
therefore E[x] = λ , and by differentiating twice (apply the same formula again), E[x2 ] = λ2
,
therefore var[x] = λr2 .
R ∞ λr r−1 −λt
Proof using density function: For the expected value one gets E[t] = t· t e dt =
r 1
R∞ r Γ(r+1) r
R0 ∞ 2Γ(r)
λr r−1 −λt
λ Γ(r+1) 0
tr λr+1 e−λt dt = λ · Γ(r+1) = λ . Using the same tricks E[t2 ] = t · Γ(r) t e dt =
0
r(r+1)
R ∞ λ r+2
r+1 −λt r(r+1)
λ2
t e dt = λ2 .
0 Γ(r+2)
Therefore var[t] = E[t2 ] − (E[t])2 = r/λ2 .
164 5. SPECIFIC RANDOM VARIABLES
2 2
• c. 2 points Show that E[x2 ] = a +ab+b
3 .
1 b3 −a3
2
R b x2
Answer. E[x ] = dx = . Now use the identity b3 − a3 = (b − a)(b2 + ab + a2 )
a b−a b−a 3
(check it by multiplying out).
(b−a)2
• d. 2 points Show that var[x] = 12 .
Problem 103. x and y are two independent random variables distributed uni-
formly over the interval [0, 1]. Let u be their minimum u = min(x, y) (i.e., u
takes the value of x when x is smaller, and the value of y when y is smaller), and
v = max(x, y).
• a. 2 points Given two numbers q and r between 0 and 1. Draw the events u≤q
and v≤r into the unit square and compute their probabilities.
For v it is: Pr[v ≤ r] = r 2 ; this is at the same time the cumulative distribution function. Therefore
the density function is fv (v) = 2v for 0 ≤ v ≤ 1 and 0 elsewhere.
Z 1
1
2v 3 2
(5.7.2) E[v] = v2v dv = = .
0
3 3
0
Answer. Pr[2 < x≤5] = Pr[2−3 < x−3 ≤ 5−3] = Pr[ 2−3 2
< x−32
≤ 5−32
] = Pr[− 12 < x−3
2
≤
1 1 1
1] = Φ(1) − Φ(− 2 ) = Φ(1) − (1 − Φ( 2 )) = Φ(1) + Φ( 2 ) − 1 = 0.8413 + 0.6915 − 1 = 0.5328. Some
tables (Greene) give the area between 0 and all positive values; in this case it is 0.3413 + 0.1915.
z2 t2 1
(5.8.3) tz − = − (z − t)2 ;
2 2 2
168 5. SPECIFIC RANDOM VARIABLES
2 t2
Note that the first summand, t2 , no longer depends on z; therefore the factor e 2
can be written in front of the integral:
Z +∞
t2 1 1 2 t2
(5.8.4) mz (t) = e 2 √ e− 2 (z−t) dz = e 2 ,
−∞ 2π
because now the integrand is simply the density function of a N (t, 1).
A general univariate normal x ∼ N (µ, σ 2 ) can be written as x = µ + σz with
z ∼ N (0, 1), therefore
2 2
(5.8.5) mx (t) = E[e(µ+σz)t ] = eµt E[eσzt ] = e(µt+σ t /2)
.
Problem 105. Given two independent normal variables x ∼ N (µx , σx2 ) and
y ∼ N (µy , σy2 ). Using the moment generating function, show that
(5.8.6) αx + βy ∼ N (αµx + βµy , α2 σx2 + β 2 σy2 ).
Answer. Because of independence, the moment generating function of αx + βy is the product
of the m.g.f. of αx and the one of βy:
2 2 2 2 2 2 2 2 2 2 2
t /2 µy βt+σy β t /2
(5.8.7) mαx+βy (t) = eµx αt+σx α e = e(µx α+µy β)t+(σx α +σy β )t /2
,
which is the moment generating function of a N (αµx + βµy , α2 σx2 + β 2 σy2 ).
We will say more about the univariate normal later when we discuss the multi-
variate normal distribution.
5.8. THE NORMAL DISTRIBUTION 169
fz (z)
(5.8.8) E[z|z>z] = , var[z|z>z] = 1 − µ(µ − z), where µ = E[z|z>z].
1 − Fz (z)
This expected value is therefore the ordinate of the density function at point z divided
by the tail area of the tail over which z is known to vary. (This rule is only valid for
the normal density function, not in general!) These kinds of results can be found in
[JK70, pp. 81–83] or in the original paper [Coh50]
• b. 3 points Since it is the 63rd birthday of the owner of the dealership, all
cars in the dealership are sold for the price of $6300. You pick at random one of
the people coming out of the dealership. The probability that this person bought a car
Answer. This is the unconditional probability that the reservation price was higher than
$6300 + $500 = $6800. i.e., Pr[y≥6800. Define z = (y − $6000)/$1000. It is a standard normal, and
y≤$6800 ⇐⇒ z≤.8, Therefore p = 1 − Pr[z≤.8] = .2119.
The important part of this question is: it depends on the outcome of the experi-
ment whether or not someone is included in the sample sample selection bias.
Answer. Here we need the conditional probability:
Pr[y>$6800] 1 − Pr[y≤$6800]
(5.8.9) p = Pr[y>$6800|y>$6300] = = .
Pr[y>$6300] 1 − Pr[y≤$6300]
Again use the standard normal z = (y − $6000)/$1000. As before, y≤$6800 ⇐⇒ z≤.8, and
y≤$6300 ⇐⇒ z≤.3. Therefore
1 − Pr[z≤.8] .2119
(5.8.10) p= = = .5546.
1 − Pr[z≤.3] .3821
It depends on the layout of the normal distribution table how this should be looked up.
• d. 5 points We are still picking out customers that have bought the birthday
specials. Compute the median value m of such a customer’s consumer surplus. It is
defined by
• e. 3 points Is the expected value of the consumer surplus of all customers that
have bought a birthday special larger or smaller than the median? Fill in your answer
Call the cumulative distribution function of a standard normal Fz (z). Then the
cumulative distribution function of the χ2 variable q = z 2 is, according to Problem
√
47, Fq (q) = 2Fz ( q) − 1. To get the density of q take the derivative of Fq (q) with
respect to q. For this we need the chain rule, first taking the derivative with respect
√
to z = q and multiply by dz dq :
d √ d
(5.9.1) fq (q) = 2Fz ( q) − 1 = 2Fz (z) − 1
dq dq
dFz dz 2 2 1
(5.9.2) =2 (z) = √ e−z /2 √
dz dq 2π 2 q
1
(5.9.3) =√ e−q/2 .
2πq
√
Now remember the Gamma function. Since Γ(1/2) = π (Proof in Problem 161),
one can rewrite (5.9.3) as
normal density, but has much thicker tails. The density and characteristic functions
are (I am not asking you to compute the characteristic function)
1
(5.11.1) fx (x) = E[eitx ] = exp(− |t|).
π(1 + x2 )
√
Here i = −1, but you should not be afraid of it, in most respects, i behaves like any
real number. The characteristic function has properties very similar to the moment
generating function, with the added advantage that it always exists. Using the char-
acteristic functions show that if x and y are independent Cauchy distributions, then
(x + y)/2 has the same distribution as x or y.
Answer.
x+y t t t t
h i h i
(5.11.2) E exp it = E exp i x exp i y = exp(− ) exp(− ) = exp(− |t|).
2 2 2 2 2
This is perhaps the simplest case of a transcendental conclusion. But this sim-
plest case also vindicates another one of Bhaskar’s assumptions: these transcendental
conclusions cannot be arrived at in a non-transcendental way, by staying in the sci-
ence itself. It is impossible to decide, using statistical means alone, whether one’s
data come from a distribution which has finite expected values or not. The reason
is that one always has only finite datasets, and the empirical distribution of a finite
sample always has finite expected values, even if the sample comes from a population
which does not have finite expected values.
CHAPTER 6
179
180 6. SUFFICIENT STATISTICS AND THEIR DISTRIBUTIONS
factorized as follows:
pθ (ω) = g t(ω), θ · h(ω) for all ω ∈ U .
If U ⊂ Rn , we can write ω = (y1 , . . . , yn ). If Prθ is not discrete but generated by a
family of probability densities f (y1 , . . . , yn ; θ), then the condition reads
f (y1 , . . . , yn ; θ) = g t(y1 , . . . , yn ), θ · h(y1 , . . . , yn ).
Note what this means: the probability of an elementary event (or of an infinitesimal
interval) is written as the product of two parts: one depends on ω through t, while
the other depends on ω directly. Only that part of the probability that depends on
ω through t is allowed to also depend on θ.
Proof in the discrete case: First let us show the necessity of this factorization.
Assume that t is sufficient, i.e., that Prθ [ω|t=t] does not involve θ. Then one possible
factorization is
(6.1.1) Prθ [ω] = Prθ [t=t(ω)] · Pr[ω|t=t(ω)]
(6.1.2) = g(t(ω), θ) · h(ω).
Now let us prove that the factorization property implies sufficiency. Assume
therefore (6.1) holds. We have to show that for all ω ∈ U and t ∈ R, the conditional
probability Prθ [{ω}|{κ ∈ U : t(κ) = t}], which will in shorthand notation be written
6.1. FACTORIZATION THEOREM FOR SUFFICIENT STATISTICS 181
if t(ω) 6= t, this is zero, i.e., independent of θ. Now look at case t(ω) = t, i.e.,
{ω} ∩ {t=t} = {ω}. Then
g(t, θ)h(ω) h(ω)
(6.1.6) Prθ [ω|t=t] = = , which is independent of θ.
g(t, θ)k(t) k(t)
Problem 108. 6 points Using the factorization theorem for sufficient statistics,
show that in a n times repeated Bernoulli experiment (n is known), the number of
successes is a sufficient statistic for the success probability p.
• a. Here is a formulation of the factorization theorem: Given a family of discrete
probability measures Prθ depending on a parameter θ. The statistic t is sufficient for
182 6. SUFFICIENT STATISTICS AND THEIR DISTRIBUTIONS
parameter θ iff there exists a function of two variables g : R×Θ → R, (t, θ) 7→ g(t; θ),
and a function of one variable h : U → R, ω 7→ h(ω) so that for all ω ∈ U
Prθ [{ω}] = g t(ω), θ · h(ω).
Before you apply this, ask yourself: what is ω?
Answer. This is very simple: the probability of every elementary event depends on this ele-
ment only through the random variable t : U → N , which is the number of successes. Pr[{ω}] =
pt(ω) (1 − p)n−t(ω) . Therefore g(k; p) = pk (1 − p)n−k and h(ω) = 1 does the trick. One can
also
n
say: the probability of one element ω is the probability of t(ω) successes divided by t(ω) . This
gives another easy-to-understand factorization.
p
therefore θ = ln 1−p
. To compute b(θ) you have to express n ln(1 − p) as a function of θ and
p 1 1
then reverse the sign. The following steps are involved: exp θ = 1−p
= 1−p
− 1; 1 + exp θ = 1−p
;
ln(1 + exp θ) = − ln(1 − p); therefore b(θ) = n ln(1 + exp θ).
Problem 110. 2 points Show that the Poisson distribution (5.3.5) with t = 1,
i.e.,
λk −λ
(6.2.6) Pr[x=k] = e for k = 0, 1, . . .
k!
is a member of the exponential family. Compute the canonical parameter θ and the
function b(θ).
Problem 112. Show that the Gamma distribution is a member of the exponential
dispersion family.
Next observation: for the exponential and the exponential dispersion families,
the expected value is the derivative of the function b(θ)
∂b(θ)
(6.2.11) E[y] = .
∂θ
186 6. SUFFICIENT STATISTICS AND THEIR DISTRIBUTIONS
This follows from the basic theory associated with maximum likelihood estimation,
see (13.4.12). E[y] is therefore a function of the “canonical parameter” θ, and in the
generalized linear model the assumption is made that this function has an inverse,
i.e., the canonical parameter can be written θ = g(µ) where g is called the “canonical
link function.”
Problem 113. 2 points In the case of the Binomial distribution (see Problem
109) compute b0 (θ) and verify that it is the same as E[x].
1 p
Answer. b(θ) = n ln(1 + exp θ), therefore b0 (θ) = n 1+exp θ
exp(θ). Now exp(θ) = 1−p
;
0
plugging this in gives b (θ) = np, which is the same as E[x].
Problem 114. 1 point In the case of the Poisson distribution (see Problem 110)
compute b0 (θ) and verify that it is the same as E[x], and compute b00 (θ) and verify that
it is the same as var[x]. You are allowed, without proof, that a Poisson distribution
with parameter λ has expected value λ and variance λ.
Answer. b(θ) = exp θ, therefore b0 (θ) = b00 (θ) = exp(θ) = λ.
From (13.4.20) follows furthermore that the variance is the second derivative of
b, multiplied by a(φ):
∂ 2 b(θ)
(6.2.12) var[y] = a(φ)
∂θ2
6.2. THE EXPONENTIAL FAMILY OF DISTRIBUTIONS 187
Since θ is a function of the mean, this means: the variance of each observation is
the product of two factors, the first factor depends on the mean only, it is called
the “variance function,” and the other factor depends on φ. This is exactly the
specification of the generalized linear model, see Section 69.3.
CHAPTER 7
to.”) One does not need to know the full distribution of y for that, only its expected
value and standard deviation. We will give here a proof only if y has a discrete
distribution, but the inequality is valid in general. Going over to the standardized
variable z = y−µ 1
σ we have to show Pr[|z|≥k] ≤ k2 . Assuming z assumes the values
z1 , z2 ,. . . with probabilities p(z1 ), p(z2 ),. . . , then
X
(7.1.2) Pr[|z|≥k] = p(zi ).
i : |zi |≥k
Now multiply by k 2 :
X
(7.1.3) k 2 Pr[|z|≥k] = k 2 p(zi )
i : |zi |≥k
X
(7.1.4) ≤ zi2 p(zi )
i : |zi |≥k
X
(7.1.5) ≤ zi2 p(zi ) = var[z] = 1.
all i
The Chebyshev inequality is sharp for all k ≥ 1. Proof: the random variable
which takes the value −k with probability 2k12 and the value +k with probability
7.1. CHEBYSHEV INEQUALITY 191
1
2k2 ,and 0 with probability 1 − k12 , has expected value 0 and variance 1 and the
≤-sign in (7.1.1) becomes an equal sign.
Problem 115. [HT83, p. 316] Let y be the number of successes in n trials of a
Bernoulli experiment with success probability p. Show that
y 1
(7.1.6) Pr − p <ε ≥ 1 − .
n 4nε2
Hint: first compute what Chebyshev will tell you about the lefthand side, and then
you will need still another inequality.
Answer. E[y/n] = p and var[y/n] = pq/n (where q = 1 − p). Chebyshev says therefore
q
y pq 1
(7.1.7) Pr − p ≥k ≤ .
k2
n n
p
Setting ε = k pq/n, therefore 1/k2 = pq/nε2 one can rewerite (7.1.7) as
y pq
(7.1.8) Pr − p ≥ε ≤ .
nε2
n
Now note that pq ≤ 1/4 whatever their values are.
inequality says about this probability? Also, Pr[|z|≥2] is approximately 5%, again
look up the precise value. What does Chebyshev say?
Answer. Pr[|z|≥1] = 0.3174, the Chebyshev inequality says that Pr[|z|≥1] ≤ 1. Also,
Pr[|z|≥2] = 0.0456, while Chebyshev says it is ≤ 0.25.
• a. 3 points What happens to this result when the distribution from which the
y i are taken does not have an expected value or a variance?
7.3. CENTRAL LIMIT THEOREM 195
Answer. The result still holds but ȳ and s2 do not converge as the number of observations
increases.
ȳ n −µ
Why do we have the funny expression σ/ √ ? Because this is the standardized
n
version of ȳ n . We know from the law of large numbers that the distribution of
ȳ n becomes more and more concentrated around µ. If we standardize the sample
averages ȳ n , we compensate for this concentration. The central limit theorem tells
us therefore what happens to the shape of the cumulative distribution function of ȳ n .
If we disregard the fact that it becomes more and more concentrated (by multiplying
it by a factor which is chosen such that the variance remains constant), then we see
that its geometric shape comes closer and closer to a normal distribution.
Proof of the Central Limit Theorem: By Problem 120,
n n
ȳ n − µ 1 X yi − µ 1 X yi − µ
(7.3.2) √ =√ =√ zi where z i = .
σ/ n n i=1 σ n i=1 σ
Let m3 , m4 , etc., be the third, fourth, etc., moments of z i ; then the m.g.f. of z i is
t2 m3 t3 m4 t4
(7.3.3) mzi (t) = 1 + + + + ···
2! 3! 4!
Pn √
Therefore the m.g.f. of √1 z i is (multiply and substitute t/ n for t):
n i=1
t2 m3 t3 m4 t 4 n wn n
(7.3.4) 1+ + √ + 2
+ ··· = 1+
2!n 3! n3 4!n n
7.3. CENTRAL LIMIT THEOREM 197
where
t2 m3 t3 m4 t 4
(7.3.5) wn = + √ + + ··· .
2! 3! n 4!n
n
Now use Euler’s limit, this time in the form: if wn → w for n → ∞, then 1+ wnn →
2 t2
ew . Since our wn → t2 , the m.g.f. of the standardized ȳ n converges toward e 2 , which
is that of a standard normal distribution.
The Central Limit theorem is an example of emergence: independently of the
distributions of the individual summands, the distribution of the sum has a very
specific shape, the Gaussian bell curve. The signals turn into white noise. Here
emergence is the emergence of homogenity and indeterminacy. In capitalism, much
more specific outcomes emerge: whether one quits the job or not, whether one sells
the stock or not, whether one gets a divorce or not, the outcome for society is to
perpetuate the system. Not many activities don’t have this outcome.
ȳ n −µ Pn yi −µ
Problem 120. Show in detail that σ/ √ = √1
n n i=1 σ .
√ P √ P
√ Pn P
n 1 n 1 n 1 n n 1 n
Answer. Lhs = σ n i=1
y i −µ = σ n i=1
yi − n i=1
µ = σ n i=
µ = rhs.
198
7. CHEBYSHEV INEQUALITY, WEAK LAW OF LARGE NUMBERS, AND CENTRAL LIMIT THEO
Problem 121. 3 points Explain verbally clearly what the law of large numbers
means, what the Central Limit Theorem means, and what their difference is.
Problem 122. (For this problem, a table is needed.) [Lar82, exercise 5.6.1,
p. 301] If you roll a pair of dice 180 times, what is the approximate probability that
the sum seven appears 25 or more times? Hint: use the Central Limit Theorem (but
don’t worry about the continuity correction, which is beyond the scope of this class).
Answer. Let xi be the random variable that equals one if the i-th roll is a seven, and zero
otherwise. Since 7 can be obtained in six ways (1+6, 2+5, 3+4, 4+3, 5+2, 6+1), the probability
to get a 7 (which is at the same time the expected value of xi ) is 6/36=1/6. Since x2i = xi ,
P180
var[xi ] = E[xi ] − (E[xi ])2 = 16 − 36
1 5
= 36 . Define x = x . We need Pr[x≥25]. Since x
i=1 i
is the sum of many independent identically distributed random variables, the CLT says that x is
asympotically normal. Which normal? That which has the same expected value and variance as
x. E[x] = 180 · (1/6) = 30 and var[x] = 180 · (5/36) = 25. Therefore define y ∼ N (30, 25). The
CLT says that Pr[x≥25] ≈ Pr[y≥25]. Now y≥25 ⇐⇒ y − 30≥ − 5 ⇐⇒ y − 30≤ + 5 ⇐⇒
(y − 30)/5≤1. But z = (y − 30)/5 is a standard Normal, therefore Pr[(y − 30)/5≤1] = Fz (1), i.e.,
the cumulative distribution of the standard Normal evaluated at +1. One can look this up in a
table, the probability asked for is .8413. Larson uses the continuity correction: x is discrete, and
Pr[x≥25] = Pr[x>24]. Therefore Pr[y≥25] and Pr[y>24] are two alternative good approximations;
but the best is Pr[y≥24.5] = .8643. This is the continuity correction.
CHAPTER 8
In this chapter we will look at two random variables x and y defined on the same
sample space U , i.e.,
(8.0.6) x : U 3 ω 7→ x(ω) ∈ R and y : U 3 ω 7→ y(ω) ∈ R.
As we said before, x and y are called independent if all events of the form x ≤ x
are independent of any event of the form y ≤ y. But now let us assume they are
not independent. In this case, we do not have all the information about them if we
merely know the distribution of each.
The following example from [Lar82, example 5.1.7. on p. 233] illustrates the
issues involved. This example involves two random variables that have only two
possible outcomes each. Suppose you are told that a coin is to be flipped two times
199
200 8. VECTOR RANDOM VARIABLES
and that the probability of a head is .5 for each flip. This information is not enough
to determine the probability of the second flip giving a head conditionally on the
first flip giving a head.
For instance, the above two probabilities can be achieved by the following ex-
perimental setup: a person has one fair coin and flips it twice in a row. Then the
two flips are independent.
But the probabilities of 1/2 for heads and 1/2 for tails can also be achieved as
follows: The person has two coins in his or her pocket. One has two heads, and one
has two tails. If at random one of these two coins is picked and flipped twice, then
the second flip has the same outcome as the first flip.
What do we need to get the full picture? We must consider the two variables not
as a totality. In order to do this, we combine x and y into one
separately but jointly,
x
entity, a vector ∈ R2 . Consequently we need to know the probability measure
y
x(ω)
induced by the mapping U 3 ω 7→ ∈ R2 .
y(ω)
It is not sufficient to look at random variables individually; one must look at
them as a totality.
Therefore let us first get an overview over all possible probability measures on the
plane R2 . In strict analogy with the one-dimensional case, these probability measures
8. VECTOR RANDOM VARIABLES 201
derivatives:
∂2
(8.0.10) fx,y (x, y) = Fx,y (x, y).
∂x ∂y
Probabilities can be obtained back from the density function either by the in-
tegral condition, or by the infinitesimal condition. I.e., either one says for a subset
B ⊂ R2 :
Z Z
x
(8.0.11) Pr[ ∈ B] = f (x, y) dx dy,
y B
x
(8.0.12) Pr[ ∈ dVx,y ] = f (x, y) |dV |.
y
The vertical bars here do not mean the absolute value but the volume of the argument
inside.
8.1. EXPECTED VALUE, VARIANCES, COVARIANCES 203
Problem 124. Assume there are two transportation choices available: bus and
car. If you pick at random a neoclassical individual ω and ask which utility this
person derives from using bus or car, the answer will be two numbers that can be
u(ω)
written as a vector (u for bus and v for car).
v(ω)
u
• a. 3 points Assuming has a uniform density in the rectangle with corners
v
66 66 71 71
, , , and , compute the probability that the bus will be preferred.
68 72 68 72
Answer. The probability is 9/40. u and v have a joint density function that is uniform in
the rectangle below and zero outside (u, the preference for buses, is on the horizontal, and v, the
preference for cars, on the vertical axis). The probability is the fraction of this rectangle below the
diagonal.
204 8. VECTOR RANDOM VARIABLES
72
71
70
69
68
66 67 68 69 70 71
• b. 2 points How would you criticize an econometric study which argued along
the above lines?
Answer. The preferences are not for a bus or a car, but for a whole transportation systems.
And these preferences are not formed independently and individualistically, but they depend on
which other infrastructures are in place, whether there is suburban sprawl or concentrated walkable
cities, etc. This is again the error of detotalization (which favors the status quo).
Jointly
distributed random variables should be written as random vectors. In-
y
stead of we will also write x (bold face). Vectors are always considered to be
z
column vectors. The expected value of a random vector is a vector of constants,
8.1. EXPECTED VALUE, VARIANCES, COVARIANCES 205
notation
E[x1 ]
.
(8.1.2) E [x] = ..
E[xn ]
Problem 125. 3 points Using definition (8.1.3) prove the following formula:
Write it down carefully, you will lose points for unbalanced or missing parantheses
and brackets.
206 8. VECTOR RANDOM VARIABLES
Answer. Here it is side by side with and without the notation E[x] = µ and E[y] = ν:
cov[x, y] = E (x − E[x])(y − E[y]) cov[x, y] = E[(x − µ)(y − ν)]
= E xy − x E[y] − E[x]y + E[x] E[y] = E[xy − xν − µy + µν]
(8.1.7)
= E[xy] − E[x] E[y] − E[x] E[y] + E[x] E[y] = E[xy] − µν − µν + µν
= E[xy] − E[x] E[y]. = E[xy] − µν.
Problem 126. 1 point Using (8.1.6) prove the five computation rules with co-
variances (8.1.4) and (8.1.5).
Problem 127. Using the computation rules with covariances, show that
If one deals with random vectors, the expected value becomes a vector, and the
variance becomes a matrix, which is called dispersion matrix or variance-covariance
matrix or simply covariance matrix. We will write it V [x]. Its formal definition is
>
(8.1.9) V [x] = E (x − E [x])(x − E [x]) ,
8.1. EXPECTED VALUE, VARIANCES, COVARIANCES 207
but we can look at it simply as the matrix of all variances and covariances, for
example
x var[x] cov[x, y]
(8.1.10) V [ y ] = cov[y, x] .
var[y]
Hint: You need to multiply matrices, and to use the following computation rules for
covariances:
(8.1.13)
cov[x + y, z] = cov[x, z] + cov[y, z] cov[αx, y] = α cov[x, y] cov[x, x] = var[x].
208 8. VECTOR RANDOM VARIABLES
Answer. V [Ax] =
a b y ay + bz var[ay + bz] cov[ay + bz, cy + dz]
V[ ] = V[ ]=
c d z cy + dz cov[cy + dz, ay + bz] var[cy + dz]
Since the variances are nonnegative, one can see from equation (8.1.11) that
covariance matrices are nonnegative definite (which is in econometrics is often also
called positive semidefinite). By definition, a symmetric matrix Σ is nonnegative def-
inite if for all vectors a follows a>Σ a ≥ 0. It is positive definite if it is nonnegativbe
definite, and a>Σ a = 0 holds only if a = o.
Problem 129. 1 point A symmetric matrix Ω is nonnegative definite if and only
if a>Ω a ≥ 0 for every vector a. Using this criterion, show that if Σ is symmetric and
nonnegative definite, and if R is an arbitrary matrix, then R>Σ R is also nonnegative
definite.
One can also define a covariance matrix between different vectors, C [x, y]; its
i, j element is cov[xi , y j ].
8.1. EXPECTED VALUE, VARIANCES, COVARIANCES 209
Therefore the minimum value is a∗ = cov[y, z]/ var[y], for which the cross product term is −2 times
the first item:
(cov[y, z])2 2(cov[y, z])2
(8.1.18) 0 ≤ var[a∗ y − z] = − + var[z]
var[y] var[y]
(8.1.19) 0 ≤ −(cov[y, z])2 + var[y] var[z].
210 8. VECTOR RANDOM VARIABLES
This proves (8.1.15) for the case var[y] 6= 0. If var[y] = 0, then y is a constant, therefore cov[y, z] = 0
and (8.1.15) holds trivially.
x
By the definition of a product set: ∈ A × B ⇔ x ∈ A and y ∈ B. Split R into
Sy
many small disjoint intervals, R = i dVyi , then
X x
(8.2.3) Pr[x ∈ dVx ] = Pr ∈ dVx × dVyi
y
i
X
(8.2.4) = fx,y (x, yi )|dVx ||dVyi |
i
X
(8.2.5) = |dVx | fx,y (x, yi )|dVyi |.
i
P
Therefore i fx,y (x, y)|dVyi | is the density function we are looking for. Now the
|dVyi | are usually written as dy, and the sum is usually written as an integral (i.e.,
an infinite sum each summand of which is infinitesimal), therefore we get
Z y=+∞
(8.2.6) fx (x) = fx,y (x, y) dy.
y=−∞
In other words, one has to “integrate out” the variable which one is not interested
in.
212 8. VECTOR RANDOM VARIABLES
For a density function there is the problem that Pr[x=x] = 0, i.e., the conditional
probability is strictly speaking not defined. Therefore take an infinitesimal volume
element dVx located at x and condition on x ∈ dVx :
This no longer depends on dVx , only on its location x. The conditional density is
therefore
fx,y (x, y)
(8.3.5) fy|x (y, x) = .
fx (x)
As y varies, the conditional density is proportional to the joint density function, but
for every given value of x the joint density is multiplied by an appropriate factor so
that its integral with respect to y is 1. From (8.3.5) follows also that the joint density
function is the product of the conditional times the marginal density functions.
Problem 131. 2 points The conditional density is the joint divided by the mar-
ginal:
fx,y (x, y)
(8.3.6) fy|x (y, x) = .
fx (x)
Show that this density integrates out to 1.
Answer. The conditional is a density in y with x as parameter. Therefore its integral with
respect to y must be = 1. Indeed,
R +∞
Z +∞ fx,y (x, y) dy
y=−∞ fx (x)
(8.3.7) fy|x=x (y, x) dy = = =1
y=−∞
fx (x) fx (x)
214 8. VECTOR RANDOM VARIABLES
Problem 132. [BD77, example 1.1.4 on p. 7]. x and y are two independent
random variables uniformly distributed over [0, 1]. Define u = min(x, y) and v =
max(x, y).
• a. Draw in the x, y plane the event {max(x, y) ≤ 0.5 and min(x, y) > 0.4} and
compute its probability.
Answer. The event is the square between 0.4 and 0.5, and its probability is 0.01.
• b. Compute the probability of the event {max(x, y) ≤ 0.5 and min(x, y) ≤ 0.4}.
Answer. It is Pr[max(x, y) ≤ 0.5] − Pr[max(x, y) ≤ 0.5 and min(x, y) > 0.4], i.e., the area of
the square from 0 to 0.5 minus the square we just had, i.e., 0.24.
• e. Compute the joint density function of u and v. Note: this joint density is
discontinuous. The values at the breakpoints themselves do not matter, but it is very
important to give the limits within this is a nontrivial function and where it is zero.
Answer. One can see from the way the cumulative distribution function was constructed that
the density function must be
2 if 0 ≤ u ≤ v ≤ 1
(8.3.10) fu,v (u, v) =
0 otherwise
I.e., it is uniform in the above-diagonal part of the square. This is also what one gets from differ-
entiating 2vu − u2 once with respect to u and once with respect to v.
This can be explained as follows: The probability that the first x1 experiments
yield outcome 1, the next x2 outcome 2, etc., is px1 1 px2 2 · · · pxr r . Now every other
sequence of experiments which yields the same number of outcomes of the different
categories is simply a permutation of this. But multiplying this probability by n!
may count certain sequences of outcomes more than once. Therefore we have to
divide by the number of permutations of the whole n element set which yield the
same original sequence. This is x1 ! · · · xr !, because this must be a permutation which
permutes the first x1 elements amongst themselves, etc. Therefore the relevant count
n!
of permutations is x1 !···xr!
.
Problem 133. You have an experiment with r different outcomes, the ith out-
come occurring with probability pi . You make n independent trials, and the ith out-
come occurred xi times. The joint distribution of the x1 , . . . , xr is called a multino-
mial distribution with parameters n and p1 , . . . , pr .
• a. 3 points Prove that their mean vector and covariance matrix are
(8.4.2)
p1 − p21 −p1 p2 · · ·
p1 −p1 pr
x1 x 1 −p2 p1 p2 − p22 · · ·
p2 −p2 pr
µ = E [ ... ] = n . and Ψ = V [ ... ] = n . .. .
.. ..
.. .. . . .
xr xr 2
218 8. VECTOR RANDOM VARIABLES
Hint: use the fact that the multinomial distribution with parameters n and p1 , . . . , pr
is the independent sum of n multinomial distributions with parameters 1 and p1 , . . . , pr .
Answer. In one trial, x2i = xi , from which follows the formula for the variance, and for i 6= j,
xi xj = 0, since only one of them can occur. Therefore cov[xi , xj ] = 0 − E[xi ] E[xj ]. For several
independent trials, just add this.
• b. 1 point How can you show that this covariance matrix is singular?
Answer. Since x1 + · · · + xr = n with zero variance, we should expect
p1 − p21 −p1 p2 ··· −p1 pr 1 0
−p2 p1 p2 − p22 ··· −p2 pr 1 0
(8.4.3) n . = .
... ..
.
..
.
..
.
.. ..
−pr p1 −pr p2 ··· 2
pr − pr 1 0
y, i.e., all events of the form {x(ω) ∈ C} are independent of all events of the form
{y(ω) ∈ D} with arbitrary (measurable) subsets C ⊂ Rm and D ⊂ Rn .
For this it is sufficient that for all x ∈ Rm and y ∈ Rn , the event {x ≤ x}
is independent of the event {y ≤ y}, i.e., that the joint cumulative distribution
function is the product of the marginal ones.
Since the joint cumulative distribution function of independent variables is equal
to the product of the univariate cumulative distribution functions, the same is true
for the joint density function and the joint probability mass function.
Only under this strong definition of independence is it true that any functions
of independent random variables are independent.
Problem 134. 4 points Prove that, if x and y are independent, then E[xy] =
E[x] E[y] and therefore cov[x, y] = 0. (You may assume x and y have density func-
tions). Give a counterexample where the covariance is zero but the variables are
nevertheless dependent.
Answer. Just use that the joint
density
function
is the product of the marginals. It can
also be
done as follows: E[xy] = E E[xy|x] = E x E[y|x] = now independence is needed = E x E[y] =
E[x] E[y]. A counterexample is given in Problem 150.
Problem 135. 3 points Prove the following: If the scalar random variables x
and y are indicator variables (i.e., if each of them can only assume the values 0 and
220 8. VECTOR RANDOM VARIABLES
1), and if cov[x, y] = 0, then x and y are independent. (I.e., in this respect indicator
variables have similar properties as jointly normal random variables.)
Problem 136. If the vector random variables x and y have the property that
xi is independent of every y j for all i and j, does that make x and y independent
random vectors? Interestingly, the answer is no. Give a counterexample that this
fact does not even hold for indicator variables. I.e., construct two random vectors x
and y, consisting of indicator variables, with the property that each component of x
is independent of each component of y, but x and y are not independent as vector
random variables. Hint: Such an example can be constructed in the simplest possible
case that x has two components and y has one component; i.e., you merely have to
find three indicator variables x1 , x2 , and y with the
property
that x1 is independent
x1
of y, and x2 is independent of y, but the vector is not independent of y. For
x2
these three variables, you should use three events which are pairwise independent but
not mutually independent.
8.6. CONDITIONAL EXPECTATION AND VARIANCE 221
Problem 137. 4 points Prove that, if x and y are independent, then var[xy] =
(E[x])2 var[y] + (E[y])2 var[x] + var[x] var[y].
Answer. Start with result and replace all occurrences of var[z] with E[z 2 ]−E[z]2 , then multiply
out: E[x]2 (E[y 2 ] − E[y]2 ) + E[y]2 (E[x2 ] − E[x]2 ) + (E[x2 ] − E[x]2 )(E[y 2 ] − E[y]2 ) = E[x2 ] E[y 2 ] −
E[x]2 E[y]2 = E[(xy)2 ] − E[xy]2 .
Since E[y|x] is a random variable, it is possible to take its expected value. The
law of iterated expectations is extremely important here. It says that you will get
the same result as if you had taken the expected value of y:
(8.6.2) E E[y|x] = E[y].
Proof (for the case that the densities exist):
Z R
y fx,y (x, y) dy
E E[y|x] = E[g(x)] = fx (x) dx
fx (x)
(8.6.3) Z Z
= y fx,y (x, y) dy dx = E[y].
Problem 138. Let x and y be two jointly distributed variables. For every fixed
value x, var[y|x = x] is the variance of y under the conditional distribution, and
var[y|x] is this variance as a random variable, namely, as a function of x.
• a. 1 point Prove that
(8.6.4) var[y|x] = E[y 2 |x] − (E[y|x])2 .
This is a very simple proof. Explain exactly what, if anything, needs to be done to
prove it.
8.6. CONDITIONAL EXPECTATION AND VARIANCE 223
can be used to determine whether the regression function E[y|x] appears to be visu-
ally well-determined or not. Does a small or a big variance ratio indicate a well-
determined regression function?
Answer. For a well-determined regression function the variance ratio should be small. [Coo98,
p. 23] writes: “This ratio is reminiscent of a one-way analysis of variance, with the numerator rep-
resenting the average within group (slice) variance, and the denominator representing the varince
between group (slice) means.”
distribution has the unconditional mean in the center of the U, i.e., here the unconditional mean
does not lie on the curve drawn out by the conditional mean.
• c. 2 points Do you have any ideas how the strange-looking cluster of points in
the figure on page 225 was generated?
226 8. VECTOR RANDOM VARIABLES
`
`
6 `` ` `
` `
` `` `` ``
`` ` ` ` ``` ` `
` ` ` `` `
` ` ` `` ` ` ` ` ` `` ` ` `` ` `
`
` ` ` ` ` ` ```` `` `` ` ` ````` `` ` `` `` ```` `` `` ` `` ` ` ` `
` ` ` `` ` ` ` ` ``` ``` ` ` `` ` ` ` ` `
` `` ` `` ` ` `` ``` ```` ``` `` ``` ` `` `` ` `
` ` `` ` ` ` `` ` ` `
` `` ` ` ` ` `` ` `` ` ` ` `
`
` ` ` ` ``
` `
` `` `
`` ` ` `
`` `
-
`
6
`
-
8.7. EXPECTED VALUES AS PREDICTORS 227
Problem 140. 2 points Given two independent random variables x and y with
density functions fx (x) and gy (y). Write down their joint, marginal, and conditional
densities.
2 2
(8.7.2) E[ y − h(x) ] ≥ E[ y − E[y|x] ].
For this proof and the proofs required in Problems
143 and 144, you may use (1)
the theorem of iterated expectations E E[y|x] = E[y], (2) the additivity E[g(y) +
8.7. EXPECTED VALUES AS PREDICTORS 229
Here the cross product term E[(y − E[y|x])(h(x) − E[y|x])] is zero. In order to see this, first use the
law of iterated expectations
(8.7.4) E[(y − E[y|x])(h(x) − E[y|x])] = E E[(y − E[y|x])(h(x) − E[y|x])|x]
and then look at the inner term, not yet doing the outer expectation:
E[(y − E[y|x])(h(x) − E[y|x])|x] = (h(x) − E[y|x]) =
E[(y − E[y|x])|x] = (h(x) − E[y|x])(E[y|x] − E[y|x]) == (h(x) − E[y|x]) · 0 = 0
Plugging this into (8.7.4) gives E[(y − E[y|x])(h(x) − E[y|x])] = E 0 = 0.
This is one of the few clear cut results in probability theory where a best esti-
mator/predictor exists. In this case, however, all parameters of the distribution are
230 8. VECTOR RANDOM VARIABLES
known, the only uncertainty comes from the fact that some random variables are
unobserved.
Problem 143. Assume the vector x = [x1 , . . . xj ]> and the scalar y are jointly
distributed random variables, and assume conditional means exist. Define ε = y −
E[y|x].
• a. 5 points Demonstrate the following identities:
(8.7.5) E[ε|x] = 0
(8.7.6) E[ε] = 0
(8.7.7) E[xi ε|x] = 0 for all i, 1 ≤ i ≤ j
(8.7.8) E[xi ε] = 0 for all i, 1 ≤ i ≤ j
(8.7.9) cov[xi , ε] = 0 for all i, 1 ≤ i ≤ j.
Interpretation of (8.7.9): ε is the error in the best prediction of y based on x. If this
error were correlated with one of the components xi , then this correlation could be
used to construct a better prediction of y.
Answer. (8.7.5): E[ε|x] = E[y|x]−E E[y|x]|x = 0 since E[y|x] is a function of x and therefore
equal to its own expectation conditionally on x. (This is not the law of iterated expectations but
the law that the expected value of a constant is a constant.)
8.7. EXPECTED VALUES AS PREDICTORS 231
(8.7.6) follows from (8.7.5) (i.e., (8.7.5) is stronger than (8.7.6)): if an expectation is zero con-
ditionally on every possible outcome of x then it is zero altogether. In formulas, E[ε] = E E[ε|x] =
E[0] = 0. It is also easy to show it in one swoop, without using (8.7.5): E[ε] = E[y − E[y|x]] = 0.
Either way you need the law of iterated expectations for this.
(8.7.7): E[xi ε|x] = xi E[ε|x] = 0.
(8.7.8): E[xi ε] = E E[xi ε|x] = E[0] = 0; or in one swoop: E[xi ε] = E xi y − xi E[y|x] =
E xi y −E[xi y|x] = E[xi y]−E[xi y] = 0. The following “proof” is not correct: E[xi ε] = E[xi ] E[ε] =
E[xi ] · 0 = 0. xi and ε are generally not independent, therefore the multiplication rule E[xi ε] =
E[xi ] E[ε] cannot be used. Of course, the following “proof” does not work either: E[xi ε] = xi E[ε] =
xi · 0 = 0. xi is a random variable and E[xi ε] is a constant; therefore E[xi ε] = xi E[ε] cannot hold.
(8.7.9): cov[xi , ε] = E[xi ε] − E[xi ] E[ε] = 0 − E[xi ] · 0 = 0.
• b. 2 points This part can only be done after discussing the multivariate normal
distribution:If x and y are jointly normal, show that x and ε are independent, and
that the variance of ε does not depend on x. (This is why one can consider it an
error term.)
Answer. If x and y are jointly normal, then x and ε are jointly normal as well, and indepen-
dence follows from the fact that their covariance is zero. The variance
is constant
because in the
Normal case, the conditional variance is constant, i.e., E[ε2 ] = E E[ε2 |x] = constant (does not
depend on x).
232 8. VECTOR RANDOM VARIABLES
Problem 144. 5 points Under the permanent income hypothesis, the assumption
is made that consumers’ lifetime utility is highest if the same amount is consumed
every year. The utility-maximizing level of consumption c for a given consumer
depends on the actual state of the economy in each of the n years of the consumer’s
life c = f (y 1 , . . . , y n ). Since c depends on future states of the economy, which are
not known, it is impossible for the consumer to know this optimal c in advance; but
it is assumed that the function f and the joint distribution of y 1 , . . . , y n are known to
him. Therefore in period t, when he only knows the values of y 1 , . . . , y t , but not yet
the future values, the consumer decides to consume the amount ct = E[c|y 1 , . . . , y t ],
which is the best possible prediction of c given the information available to him. Show
that in this situation, ct+1 − ct is uncorrelated with all y 1 , . . . , y t . This implication of
the permanent income hypothesis can be tested empirically, see [Hal78]. Hint: you
are allowed to use without proof the following extension of the theorem of iterated
expectations:
(8.7.10) E E[x|y, z]y = E[x|y].
E[x|y], i.e., (8.7.10) says therefore that I cannot predict how I will change my mind
after better information becomes available.
Answer. In (8.7.10) set x = c = f (y 1 , . . . , y t , y t+1 , . . . , y n ), y = [y 1 , . . . , y t ]> , and z = y t+1
to get
(8.7.11) E E[c|y 1 , . . . , y t+1 ]y 1 , . . . , y t = E[c|y 1 , . . . , y t ].
Writing ct for E[c|y 1 , . . . , y t ], this becomes E[ct+1 |y 1 , . . . , y t ] = ct , i.e., ct is not only the best
predictor of c, but also that of ct+1 . The change in consumption ct+1 − ct is therefore the prediction
error, which is uncorrelated with the conditioning variables, as shown in Problem 143.
Problem 145. 3 points Show that for any two random variables x and y whose
covariance exists, the following equation holds:
(8.7.12) cov[x, y] = cov x, E[y|x]
Note: Since E[y|x] is the best predictor of y based on the observation of x, (8.7.12)
can also be written as
(8.7.13) cov x, (y − E[y|x]) = 0,
i.e., x is uncorrelated with the prediction error of the best prediction of y given x.
(Nothing to prove for this Note.)
234 8. VECTOR RANDOM VARIABLES
Problem 146. Assume x and y have a joint density function fx,y (x, y) which
is symmetric about the x-axis, i.e.,
fx,y (x, y) = fx,y (x, −y).
Also assume that variances and covariances exist. Show that cov[x, y] = 0. Hint:
one way to do it is to look at E[y|x].
Answer. We know that cov[x, y] = cov x, E[y|x] . Furthermore, from symmetry follows
E[y|x] = 0. Therefore cov[x, y] = cov[x, 0] = 0. Here is a detailed proof of E[y|x] = 0: E[y|x=x] =
R∞ f (x,y)
y x,y
fx (x)
dy. Now substitute z = −y, then also dz = −dy, and the boundaries of integration
−∞
are reversed:
Z −∞ Z −∞
fx,y (x, −z) fx,y (x, z)
(8.7.15) E[y|x=x] = z dz = z dz = − E[y|x=x].
∞
fx (x) ∞
fx (x)
One can also prove directly under this presupposition cov[x, y] = cov[x, −y] and therefore it must
be zero.
8.7. EXPECTED VALUES AS PREDICTORS 235
Problem 147. [Wit85, footnote on p. 241] Let p be the logarithm of the price
level, m the logarithm of the money supply, and x a variable representing real influ-
ences on the price level (for instance productivity). We will work in a model of the
economy in which p = m + γx, where γ is a nonrandom parameter, and m and x are
2
independent normal with expected values µm , µx , and variances σm , σx2 . According
to the rational expectations assumption, the economic agents know the probability
distribution of the economy they live in, i.e., they know the expected values and vari-
ances of m and x and the value of γ. But they are unable to observe m and x, they
can only observe p. Then the best predictor of x using p is the conditional expectation
E[x|p].
• a. Assume you are one of these agents and you observe p = p. How great
would you predict x to be, i.e., what is the value of E[x|p = p]?
cov(x,p)
Answer. It is, according to formula (10.3.18), E[x|p = p] = µx + var(p)
(p − E[p]). Now
E[p] = µm + γµx , cov[x, p] = cov[x, m] + γ cov[x, x] = γσx2 , and var(p) = 2 + γ 2 σ 2 . Therefore
σm x
γσx2
(8.7.16) E[x|p = p] = µx + 2 (p − µm − γµx ).
σm + γ 2 σx2
Answer.
γσx2
(8.7.17) ε = x − µx − 2 + γ 2 σ2
(p − µm − γµx ).
σm x
2
• c. In an attempt to fine tune the economy, the central bank increases σm . Does
that increase or decrease var(ε)?
Answer. From (8.7.20) follows that it increases the variance.
variable, whose density we want to compute; (2) express the old variable, the one
whose density/mass function is known, in terms of the new variable,
the one whose
x x
density or mass function is needed. If that of is known, set = t(u, v). Here
y y
q(u, v)
t is a vector-valued function, (i.e., it could be written t(u, v) = , but we will
r(u, v)
use one symbol t for this whole transformation), and you have to check that it is
one-to-one on A, i.e., t(u, v) = t(u1 , v1 ) implies u = u1 and v = v1 for all (u, v) and
u1 , v1 ) in A. (A function for which two different arguments (u, v) and u1 , v1 ) give
the same function value is called many-to-one.)
If the joint probability distribution of x and y is described by a probability mass
function, then the joint probability mass function of u and v can simply be obtained
by substituting t into the joint probability mass function of x and y (and it is zero
for any values which are not in A):
(8.8.1)
u u x
pu,v (u, v) = Pr = = Pr t(u, v) = t(u, v) = Pr = t(u, v) = px,y t(u, v) .
v v y
The second equal sign is where the condition enters that t : R2 → R2 is one-to-one.
238 8. VECTOR RANDOM VARIABLES
If one works with the density function instead of a mass function, one must
perform an additional step besides substituting t. Since t is one-to-one, it follows
u
(8.8.2) { ∈ dVu,v } = {t(u, v) ∈ t(dV )x,y }.
v
Therefore
(8.8.3)
u
fu,v (u, v)|dVu,v | = Pr[ ∈ dVu,v ] = Pr[t(u, v) ∈ t(dV )x,y ] = fx,y (t(u, v))|t(dV )x,y | =
v
|t(dV )x,y |
(8.8.4) = fx,y (t(u, v)) |dVu,v |.
|dVu,v |
|t(dV ) |
The term |dVu,vx,y| is the local magnification factor of the transformation t;
analytically it is the absolute value |J| of the Jacobian determinant
∂x ∂x ∂q
(u, v) ∂q (u, v)
(8.8.5) ∂u ∂v ∂u
J = ∂y ∂y = ∂r
∂v .
∂r
∂u ∂v ∂u (u, v) ∂v (u, v)
Remember, u, v are the new and x, y the old variables. To compute J one
has to express the old in terms of the new variables. If one expresses the new in
8.8. TRANSFORMATION OF VECTOR RANDOM VARIABLES 239
terms of the old, one has to take the inverse of the corresponding determinant! The
transformation rule for density functions can therefore be summarized as:
∂x ∂x
(x, y) = t(u, v) one-to-one ⇒ fu,v (u, v) = fx,y t(u, v) |J| where J = ∂u ∂v
∂y ∂y .
∂u ∂v
Problem 148. Let x and y be two random variables with joint density function
fx,y (x, y).
• a. 3 points Define u = x + y. Derive the joint density function of u and y.
Answer. You have to express the “old” x and y as functions of the “new” u and y:
∂x ∂x
x=u−y x 1 −1 u 1 −1
or = therefore J = ∂u
∂y
∂y
∂y
= = 1.
y=y y 0 1 y ∂u ∂y
0 1
Therefore
(8.8.6) fu,y (u, y) = fx,y (u − y, y).
• b. 1 point Derive from this the following formula computing the density func-
tion fu (u) of the sum u = x + y from the joint density function fx,y (x, y) of x and
y.
Z y=∞
(8.8.7) fu (u) = fx,y (u − y, y)dy.
y=−∞
240 8. VECTOR RANDOM VARIABLES
Answer. Write down the joint density of u and y and then integrate y out, i.e., take its integral
over y from −∞ to +∞:
Z y=∞ Z y=∞
(8.8.8) fu (u) = fu,y (u, y)dy = fx,y (u − y, y)dy.
y=−∞ y=−∞
x
i.e., one integrates over all with x + y = u.
y
To help evaluate this integral, here is the area in u, y-plane (u = x + y on the horizontal and y on
the vertical axis) in which fx,y (u − v, v) has the value 1:
8.8. TRANSFORMATION OF VECTOR RANDOM VARIABLES 241
6 q q
q q -
This is the area between (0,0), (1,1), (2,1), and (1,0).
One can also show it this way: fx,y (x, y) = 1 iff 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Now take any fixed
u. It must be between 0 and 2. First assume 0 ≤ u ≤ 1: then fx,y (u − y, y) = 1 iff 0 ≤ u − y ≤ 1
and 0 ≤ y ≤ 1 iff 0 ≤ y ≤ u. Now assume 1 ≤ u ≤ 2: then fx,y (u − y, y) = 1 iff u − 1 ≤ y ≤ 1.
Problem 151. [Ame85, pp. 296–7] Assume three transportation choices are
available: bus, train, and car. If you pick at random a neoclassical individual ω
and ask him or her which utility this person derives from using bus, train, and car,
the answer will be three numbers u1 (ω), u2 (ω), u3 (ω). Here u1 , u2 , and u3 are as-
sumed to be independent random variables with the following cumulative distribution
functions:
(8.8.11) Pr[ui ≤ u] = Fi (u) = exp − exp(µi − u) , i = 1, 2, 3.
I.e., the functional form is the same for all three transportation choices (exp in-
dicates the exponential function); the Fi only differ by the parameters µi . These
probability distributions are called Type I extreme value distributions, or log Weibull
distributions.
Often these kinds of models are set up in such a way that these µi to depend on
the income etc. of the individual, but we assume for this exercise that this distribution
applies to the population as a whole.
8.8. TRANSFORMATION OF VECTOR RANDOM VARIABLES 243
• a. 1 point Show that the Fi are indeed cumulative distribution functions, and
derive the density functions fi (u).
Individual ω likes cars best if and only if his utilities satisfy u3 (ω) ≥ u1 (ω) and
u3 (ω) ≥ u2 (ω). Let I be a function of three arguments such that I(u1 , u2 , u3 ) is the
indicator function of the event that one randomly chooses an individual ω who likes
cars best, i.e.,
(
1 if u1 ≤ u3 and u2 ≤ u3
(8.8.12) I(u1 , u2 , u3 ) =
0 otherwise.
Then Pr[car] = E[I(u1 , u2 , u3 )]. The following steps have the purpose to compute
this probability:
• b. 2 points For any fixed number u, define g(u) = E[I(u1 , u2 , u3 )|u3 = u].
Show that
(8.8.13) g(u) = exp − exp(µ1 − u) − exp(µ2 − u) .
244 8. VECTOR RANDOM VARIABLES
Random Matrices
The step from random vectors to random matrices (and higher order random
arrays) is not as big as the step from individual random variables to random vectors.
We will first give a few quite trivial verifications that the expected value operator
is indeed a linear operator, and them make some not quite as trivial observations
about the expected values and higher moments of quadratic forms.
245
246 9. RANDOM MATRICES
Theorem 9.1.11. V [x] is singular if and only if a vector a exists so that a> x
is almost surely a constant.
Proof: Call V [x] = Σ. Then Σ singular iff an a exists with Σa = o iff an a exists
with a>Σa = var[a> x] = 0 iff an a exists so that a> x is almost surely a constant.
This means, singular random variables have a restricted range, their values are
contained in a linear subspace. This has relevance for estimators involving singular
random variables: two such estimators (i.e., functions of a singular random variable)
should still be considered the same if their values coincide in that subspace in which
the values of the random variable is concentrated—even if elsewhere their values
differ.
Problem 154. [Seb77, exercise 1a–3 on p. 13] Let x = [x1 , . . . , xn ]> be a vector
of random variables, and let y 1 = x1 and y i = xi − xi−1 for i = 2, 3, . . . , n. What
must the dispersion matrix V [x] be so that the y i are uncorrelated with each other
and each have unit variance?
9.2. MEANS AND VARIANCES OF QUADRATIC FORMS 249
Here we used that tr(AB) = tr(BA) and, if c is a scalar, i.e., a 1 × 1 matrix, then
tr(c) = c.
In tile notation (see Appendix B), the proof of theorem 9.2.1 is much more
straightforward and no longer seems to rely on “tricks.” From y ∼ (η, Σ ), i.e., we
9.2. MEANS AND VARIANCES OF QUADRATIC FORMS 251
" y # η
(9.2.7) E = + Σ ; therefore
y η
" y # " y # η
(9.2.8) E A = E A = A + Σ A .
y y η
>
Answer. Write y = y1 y2 ... yn and Σ = diag( σ12 σ22 ... 2 ). Then the
σn
>
1 > 1 >
vector y 1 − ȳ y 2 − ȳ ... y n − ȳ can be written as (I − n
ιι )y. n ιι is idempotent,
1 > 1 >
therefore D = I − n
ιι
is idempotent too. Our estimator is n(n−1)
y Dy, and since the mean
vector η = ιη satisfies Dη = o, theorem 9.2.1 gives
1
(9.2.10) E[y > Dy] = tr[DΣ Σ] −
Σ] = tr[Σ tr[ιι>Σ ]
n
1
(9.2.11) = (σ12 + · · · + σn
2
) − tr[ι>Σ ι]
n
n−1 2 2
(9.2.12) = (σ1 + · · · + σn ).
n
Divide this by n(n − 1) to get (σ12 + · · · + σn
2 )/n2 , which is var[ȳ], as claimed.
For the variances of quadratic forms we need the third and fourth moments of
the underlying random variables.
Problem 156. Let µi = E[(y − E[y])i ] be the ith centered moment of y, and let
√
σ = µ2 be its standard deviation. Then the skewness is defined as γ1 = µ3 /σ 3 , and
kurtosis is γ2 = (µ4 /σ 4 ) − 3. Show that skewness and kurtosis of ay + b are equal to
those of y if a > 0; for a < 0 the skewness changes its sign. Show that skewness γ1
and kurtosis γ2 always satisfy
(9.2.13) γ12 ≤ γ2 + 2.
9.2. MEANS AND VARIANCES OF QUADRATIC FORMS 253
Problem 157. Show that any real numbers γ1 and γ2 satisfying (9.2.13) can be
the skewness and kurtosis of a random variable.
Answer. To show that all combinations satisfying this inequality are possible, define
p
r= γ2 + 3 − 3γ12 /4 a = r + γ1 /2 b = r − γ1 /2
and construct a random variable x which assumes the following three values:
a with probability 1/2ar
(9.2.15) x= 0 with probability 1/(γ2 + 3 − γ12 ),
−b
with probability 1/2br
This variable has expected value zero, variance 1, its third moment is γ1 , and its fourth moment
γ2 + 3.
called the skewness of these variables. Then the following holds for the third mixed
moments:
(
σ 3 γ1 if i = j = k
(9.2.16) E[εi εj εk ] =
0 otherwise
and from (9.2.16) follows that for any n × 1 vector a and symmetric n × n matrices
C whose vector of diagonal elements is c,
(9.2.17) E[(a>ε )(ε
ε> Cε
ε)] = σ 3 γ1 a> c.
One would like to have a matrix notation for (9.2.16) from which (9.2.17) follows by
a trivial operation. This is not easily possible in the usual notation, but it is possible
9.2. MEANS AND VARIANCES OF QUADRATIC FORMS 255
in tile notation:
" ε #
(9.2.20) E = γ1 σ 3 ∆ .
ε ε
Therefore
a
a
" ε #
(9.2.21) E = γ1 σ 3 ∆
ε ε
C
C
256 9. RANDOM MATRICES
4
σ (γ2 + 3) if i = j = k = l
σ 4
if i = j 6= k = l or i = k 6= j = l
(9.2.22) E[εi εj εk εl ] =
or i = l 6= j = k
0 otherwise.
It is not an accident that (9.2.22) is given element by element and not in matrix
notation. It is not possible to do this, not even with the Kronecker product. But it
9.2. MEANS AND VARIANCES OF QUADRATIC FORMS 257
" ε ε #
(9.2.23) E = σ4 + σ4 + σ4 + γ2 σ 4 ∆
ε ε
Problem 158. [Seb77, pp. 14–16 and 52] Show that for any symmetric n × n
matrices A and B, whose vectors of diagonal elements are a and b,
(9.2.24) ε> Aε
E[(ε ε> Bε
ε)(ε ε)] = σ 4 tr A tr B + 2 tr(AB) + γ2 a> b .
258 9. RANDOM MATRICES
Answer. (9.2.24) is an immediate consequence of (9.2.23); this step is now trivial due to
linearity of the expected value:
A
A A A A
ε ε
E = σ4 + σ4 + σ4 + γ2 σ 4 ∆
ε ε
B B B B
B
The first term is tr AB. The second is tr AB > , but since A and B are symmetric, this is equal
to tr AB. The third term is tr A tr B. What is the fourth term? Diagonal arrays exist with any
number of arms, and any connected concatenation of diagonal arrays is again a diagonal array, see
(B.2.1). For instance,
∆
(9.2.25) ∆ = .
∆
9.2. MEANS AND VARIANCES OF QUADRATIC FORMS 259
From this together with (B.1.4) one can see that the fourth term is the scalar product of the diagonal
vectors of A and B.
(9.2.26) ε> Cε
cov[ε ε, ε > Dε
ε] = σ 4 γ2 c> d + 2σ 4 tr(CD).
Problem 160. (Not eligible for in-class exams) Take any symmetric matrix A
and denote the vector of diagonal elements by a. Let x = θ + ε where ε satisfies the
conditions of theorem 9.2.2 and equation (9.2.23). Then
(9.2.27) var[x> Ax] = 4σ 2 θ > A2 θ + 4σ 3 γ1 θ > Aa + σ 4 γ2 a> a + 2 tr(A2 ) .
Answer. Proof: var[x> Ax] = E[(x> Ax)2 ] − (E[x> Ax])2 . Since by assumption V [x] = σ 2 I,
the second term is, by theorem 9.2.1, (σ 2 tr A + θ > Aθ)2 . Now look at first term. Again using the
notation ε = x − θ it follows from (9.2.3) that
We will take expectations of these terms one by one. Use (9.2.24) for first term:
(9.2.30) (ε ε)2 = σ 4 γ2 a> a + (tr A)2 + 2 tr(A2 ) .
ε> Aε
The third term is a constant which remains as it is; for the fourth term use (9.2.17)
(9.2.33) ε θ > Aε
ε > Aε ε b>ε
ε = ε > Aε
(9.2.34) E[ε ε] = σ 3 γ1 a> b = σ 3 γ1 a> Aθ
ε θ > Aε
ε> Aε
If one takes expected values, the fifth term becomes 2σ 2 tr(A) θ > Aθ, and the last term falls away.
Putting the pieces together the statement follows.
CHAPTER 10
see that this joint density integrates to 1, go over to polar coordinates x = r cos φ,
y = r sin φ, i.e., compute the joint distribution of r and φ from that of x and y: the
absolute value of the Jacobian determinant is r, i.e., dx dy = r dr dφ, therefore
Z y=∞ Z x=∞ Z 2π Z ∞
1 − x2 +y2 1 − r2
(10.1.2) e 2 dx dy = e 2 r dr dφ.
y=−∞ x=−∞ 2π φ=0 r=0 2π
1 −t ∞
By substituting t = r2 /2, therefore dt = r dr, the inner integral becomes − 2π e 0 =
1
2π ; therefore the whole integral is 1. Therefore the product of the integrals of the
marginal densities is 1, and since each such marginal integral is positive and they are
equal, each of the marginal integrals is 1 too.
R∞
Problem 161. 6 points The Gamma function can be defined as Γ(r) = 0 xr−1 e−x
√
Show that Γ( 21 ) = π. (Hint: after substituting r = 1/2, apply the variable transfor-
mation x = z 2 /2 for nonnegative x and z only, and then reduce the resulting integral
to the integral over the normal density function.)
dx
√
Answer. Then dx = z dz, √
x
= dz 2. Therefore one can reduce it to the integral over the
normal density:
Z ∞ Z ∞ Z ∞ √
1 √ 2 1 2 2π √
(10.1.3) √ e−x dx = 2 e−z /2
dz = √ e−z /2
dz = √ = π.
0
x 0 2 −∞ 2
10.1. MORE ABOUT THE UNIVARIATE CASE 263
vector
x is bivariate normal.
Take any nonsingular 2 × 2 matrix P and a 2 vector
µ u
µ= , and define = u = P x + µ. We need nonsingularity because otherwise
ν v
the resulting variable would not have a bivariate density; its probability mass would
be concentrated on one straight line in the two-dimensional plane. What is the
joint density function of u? Since P is nonsingular, the transformation is on-to-one,
therefore we can apply the transformation theorem for densities. Let us first write
down the density function of x which we know:
1 1
2 2
(10.3.1) fx,y (x, y) = exp − (x + y ) .
2πσ 2 2σ 2
For the next step, remember that we have to express the old variable in terms
of the new one: x = P −1 (u − µ). The Jacobiandeterminant is therefore
J =
x u − µ
det(P −1 ). Also notice that, after the substitution = P −1 , the expo-
y v−ν
>
x x
nent in the joint density function of x and y is − 2σ1 2 (x2 + y 2 ) = − 2σ1 2 =
y y
>
u−µ > u−µ
− 2σ1 2 P −1 P −1 . Therefore the transformation theorem of density
v−ν v−ν
10.3. BIVARIATE NORMAL 267
functions gives
1 >
1 −1
u−µ −1 > −1 u − µ
(10.3.2) fu,v (u, v) = det(P ) exp − 2 P P .
2πσ 2 2σ v − ν v−ν
This expression can be made nicer. Note that the covariance matrix of the
u >
transformed variables is V [ ] = σ 2 P P > = σ 2 Ψ, say. Since P −1 P −1 P P > = I,
v
−1 > −1
= Ψ and det(P −1 ) = 1/ det(Ψ), therefore
−1
p
it follows P P
1 1 1 u − µ>
−1 u − µ
(10.3.3) fu,v (u, v) = exp − Ψ .
2πσ 2 det(Ψ) 2σ 2 v − ν v−ν
p
This is the general formula for the density function of a bivariate normal with non-
singular covariance matrix σ 2 Ψ and mean vector µ. One can also use the following
notation which is valid for the multivariate Normal variable with n dimensions, with
mean vector µ and nonsingular covariance matrix σ 2 Ψ:
1
(10.3.4) fx (x) = (2πσ 2 )−n/2 (det Ψ)−1/2 exp − 2 (x − µ)> Ψ−1 (x − µ) .
2σ
Problem 163. 1 point Show that the matrix product of (P −1 )> P −1 and P P >
is the identity matrix.
268 10. MULTIVARIATE NORMAL
Problem 164. 3 points All vectors in this question are n × 1 column vectors.
Let y = α+ε ε, where α is a vector of constants and ε is jointly normal with E [ε
ε] = o.
ε] is not given directly, but a n×n nonsingular matrix
Often, the covariance matrix V [ε
T is known which has the property that the covariance matrix of T ε is σ 2 times the
n × n unit matrix, i.e.,
2
(10.3.5) V [T ε ] = σ I n .
Show that in this case the density function of y is
1 >
(10.3.6) fy (y) = (2πσ 2 )−n/2 |det(T )| exp − 2 T (y − α) T (y − α) .
2σ
Hint: define z = T ε , write down the density function of z, and make a transforma-
tion between z and y.
Answer. Since E [z] = o and V [z] = σ 2 I n , its density function is (2πσ 2 )−n/2 exp(−z > z/2σ 2 ).
Now express z, whose density we know, as a function of y, whose density function we want to know.
z = T (y − α) or
(10.3.7) z1 = t11 (y1 − α1 ) + t12 (y2 − α2 ) + · · · + t1n (yn − αn )
..
(10.3.8) .
(10.3.9) zn = tn1 (y1 − α1 ) + tn2 (y1 − α2 ) + · · · + tnn (yn − αn )
therefore the Jacobian determinant is det(T ). This gives the result.
10.3. BIVARIATE NORMAL 269
factor. In other words, by completing the square we wrote the joint density function
in its natural form as the product of a marginal and a conditional density function:
fu,v (u, v) = fu (u) · fv|u (v; u).
From this decomposition one can draw the following conclusions:
σv
(10.3.16) E[v|u = u] = ρ u,
σu
272 10. MULTIVARIATE NORMAL
We did this in such detail because any bivariate normal with zero mean has this
form. A multivariate normal distribution is determined by its means and variances
and covariances (or correlations coefficients). If the means are not zero, then the
densities merely differ from the above by an additive constant in the arguments, i.e.,
10.3. BIVARIATE NORMAL 273
if one needs formulas for nonzero mean, one has to replace u and v in the above
equations by u − µu and v − µv . du and dv remain the same, because the Jacobian
of the translation u 7→ u − µu , v 7→ v − µv is 1. While the univariate normal was
determined by mean and standard deviation, the bivariate normal is determined by
the two means µu and µv , the two standard deviations σu and σv , and the correlation
coefficient ρ.
10.3.2. Level Lines of the Normal Density.
Problem 166. 8 points Define the angle δ = arccos(ρ), i.e, ρ = cos δ. In terms
of δ, the covariance matrix (??) has the form
σu2
σu σv cos δ
(10.3.21) Ψ=
σu σv cos δ σv2
Show that for all φ, the vector
r σu cos φ
(10.3.22) x=
r σv cos(φ + δ)
satisfies x> Ψ−1 x = r2 . The opposite holds too, all vectors x satisfying x> Ψ−1 x =
r2 can be written in the form (10.3.22) for some φ, but I am not asking to prove
this. This formula can be used to draw level lines of the bivariate Normal density
and confidence ellipses, more details in (??).
274 10. MULTIVARIATE NORMAL
Problem 167. The ellipse in Figure 1 contains all the points x, y for which
−1
0.5 −0.25 x−1
(10.3.23) x−1 y−1 ≤6
−0.25 1 y−1
10.3. BIVARIATE NORMAL 275
−2 −1 x=0 1 2 3 4
......................................................
................... ............
......... .........
........ ........
...
. ...... ........
.......
.
.....
. .......
..
.... ......
......
...
. ......
3 ...
...
.
. ......
......
.....
.
... ....
....
... ....
. .. ....
....
... ....
.
... ....
. ....
.... ....
....
... ...
.... ...
...
... ...
... ...
...
... ...
... ...
2 ...
....
...
...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
1 ...
...
...
... ...
...
...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ...
... ..
.
... ...
=0 ... ....
... ..
... .
... .
. ...
276 10. MULTIVARIATE NORMAL
The set of all these points forms a band limited by two parallel lines. What is the
probability that [ xy ] falls between these two lines?
• g. 1 point It is our purpose to show that this band is again tangent to the
ellipse. This is easiest if we use matrix notation. Define
x 1 0.5 −0.25 a
(10.3.26) x= µ= Ψ= a=
y 1 −0.25 1 b
Equation (10.3.23) in matrix notation says: the ellipse contains all the points for
which
(10.3.27) (x − µ)> Ψ−1 (x − µ) ≤ 6.
Show that the band defined by inequality (10.3.25) contains all the points for which
2
a> (x − µ)
(10.3.28) ≤ 6.
a> Ψa
• h. 2 points Inequality (10.3.28) can also be written as:
(10.3.29) (x − µ)> a(a> Ψa)−1 a> (x − µ) ≤ 6
or alternatively
a a −1 x − 1
b Ψ−1
(10.3.30) x−1 y−1 a a b ≤ 6.
b b y−1
278 10. MULTIVARIATE NORMAL
• l. 2 points The vertical lines in Figure 1 which are not tangent to the ellipse
delimit a band which, if extended to infinity, has as much probability mass as the
ellipse itself. Compute the x-coordinates of these two lines.
10.3.3. Miscellaneous Exercises.
Problem 168. Figure 2 shows the level line for a bivariate Normal density which
contains 95% of the probability mass.
x
• a. 3 points One of the following matrices is the covariance matrix of . Ψ1 =
y
0.62 −0.56 1.85 1.67 0.62 0.56 1.85 −1.67
, Ψ2 = , Ψ3 = , Ψ4 = ,
−0.56 1.04 1.67 3.12 0.56 1.04 1.67 3.12
3.12 −1.67 1.04 0.56 3.12 1.67 0.62 0.81
Ψ5 = , Ψ6 = , Ψ7 = , Ψ8 = ,
−1.67 1.85
0.56 0.62 1.67 1.85 0.81 1.04
3.12 1.67 0.56 0.62
Ψ9 = , Ψ10 = . Which is it? Remember that for a uni-
2.67 1.85 0.62 −1.04
variate Normal, 95% of the probability mass lie within ±2 standard deviations from
the mean. If you are not sure, cross out as many of these covariance matrices as
possible and write down why you think they should be crossed out.
Answer. Covariance matrix must be symmetric, therefore we can cross out 4 and 9. It must
also be nonnegative definite (i.e., it must have nonnegative elements in the diagonal), therefore
280 10. MULTIVARIATE NORMAL
cross out 10, and a nonnegative determinant, therefore cross out 8. Covariance must be positive, so
cross out 1 and 5. Variance in x-direction is smaller than in y-direction, therefore cross out 6 and
7. Remains 2 and 3.
Of these it is number 3. By comparison with Figure 1 one can say that the vertical band
between 0.4 and 2.6 and the horizontal band between 3 and -1 roughly have the same probability
as the ellipse, namely 95%. Since a univariate Normal has 95% of its probability mass in an
interval centered around the mean which is 4 standard deviations long, standard deviations must
be approximately 0.8 in the horizontal and 1 in the vertical directions.
Ψ1 is negatively correlated; Ψ2 has the right correlation but is scaled too big; Ψ3 this is it; Ψ4
not symmetric; Ψ5 negatively correlated, and x has larger variance than y; Ψ6 x has larger variance
than y; Ψ7 too large, x has larger variance than y; Ψ8 not positive definite; Ψ9 not symmetric;
Ψ10 not positive definite.
The next Problem constructs a counterexample which shows that a bivariate dis-
tribution, which is not bivariate Normal, can nevertheless have two marginal densities
which are univariate Normal.
Problem 169. Let x and y be two independent standard normal random vari-
ables, and let u and v be bivariate normal with mean zero, variances σu2 = σv2 = 1,
and correlation coefficient ρ 6= 0. Let fx,y and fu,v be the corresponding density
10.3. BIVARIATE NORMAL 281
functions, i.e.,
1 a2 + b2 1 b
fx,y (a, b) = exp(− ) fu,v (a, b) = exp(−a2 + b2 − 2ρa )
ρ2 )
p
2π 2 2π 1 − ρ 2 2(1 −
Assume the random variables a and b are defined by the following experiment: You
flip a fair coin; if it shows head, then you observe x and y and give a the value
observed on x, and b the value observed of y. If the coin shows tails, then you
observe u and v and give a the value of u, and b the value of v.
• a. Prove that the joint density of a and b is
1 1
(10.3.33) fa,b (a, b) =
fx,y (a, b) + fu,v (a, b).
2 2
Hint: first show the corresponding equation for the cumulative distribution functions.
Answer. Following this hint:
(10.3.34) Fa,b (a, b) = Pr[a ≤ a and b ≤ b] =
(10.3.35) = Pr[a ≤ a and b ≤ b|head] Pr[head] + Pr[a ≤ a and b ≤ b|tail] Pr[tail]
1 1
(10.3.36) = Fx,y (a, b)
+ Fu,v (a, b) .
2 2
The density function is the function which, if integrated, gives the above cumulative distribution
function.
282 10. MULTIVARIATE NORMAL
Then you can see that the marginal is standard normal. Therefore you get a mixture of two
distributions each of which is standard normal, therefore it is not really a mixture any more.
It is not normal, it is a mixture of normals with different variances. This has mean zero and variance
1
2
(1 + (1 − ρ2 )) = 1 − 12 ρ2 .
• b. 2 points Find the joint probability density function of r and φ. Also indicate
the area in (r, φ) space in which it is nonzero.
1 2 2 2 1 2 2
Answer. fx,y (x, y) = 2πσ 2
e−(x +y )/2σ ; therefore fr,φ (r, φ) = 2πσ 2
re−r /2σ for 0 ≤ r <
∞ and 0 ≤ φ < 2π.
• c. 3 points Find the marginal distributions of r and φ. Hint: for one of the
integrals it is convenient to make the substitution q = r2 /2σ 2 .
284 10. MULTIVARIATE NORMAL
1 2 2 1
Answer. fr (r) = σ2
re−r /2σ for 0 ≤ r < ∞, and fφ (φ) = 2π
for 0 ≤ φ < 2π. For the latter
1
R∞ 2 2 1 1
we need 2πσ 2 re−r /2σ dr = 2π , set q= r 2 /2σ 2 , then dq = σ 2 r dr, and the integral becomes
0
1 ∞ −q
R
2π 0
e dq.
which means y is standard normal as well. In other words, every y i is univariate stan-
dard normal with same variance σ 2 and y i is independent of y j for i 6= j. Therefore
also any subvector of y, such as x, is standard normal. Since z > z−x> x = y > y−x> x
is the sum of the squares of those elements of y which are not in x, it follows that it
is an independent σ 2 χ2p−m .
Problem 171. Show that the moment generating function of a multivariate stan-
dard normal with variance σ 2 is mz (t) = E [exp(t> z)] = exp(σ 2 t> t/2).
r
X
(10.4.5) E[exp(qt)] = E[exp(t λi x2i )]
i=1
(10.4.6) = E[exp(tλ1 x21 )] · · · E[exp(tλr x2r )]
(10.4.7) = (1 − 2λ1 σ 2 t)−1/2 · · · (1 − 2λr σ 2 t)−1/2 .
288 10. MULTIVARIATE NORMAL
and sufficient:
(10.4.10) P 2i = P i for all i
(10.4.11) P iP j = O i 6= j
k
X
(10.4.12) rank(P i ) = p
i=1
I will give a brief overview in tile notation of the higher moments of the mul-
tivariate standard normal z. All odd moments disappear, and the fourth moments
are
" z z #
(10.5.2) E = + +
z z
Compared with (9.2.23), the last term, which depends on the kurtosis, is missing.
What remains is a sum of outer products of unit matrices, with every possibility
appearing exactly once. In the present case, it happens to be possible to write down
the four-way arrays in (10.5.2) in terms of Kronecker products and the commutation
matrix K (n,n) introduced in (B.5.21): It is
> >
(10.5.3) E [(zz ) ⊗ (zz )] = I n2 + K
(n,n)
+ (vec[I n ])(vec[I n ])>
292 10. MULTIVARIATE NORMAL
Π
Π Π Π
" z z #
(10.5.4) E = + +
z z
Π Π Π
Π
The first term is I n2 due to (B.5.26), the second is K (n,n) due to (B.5.35), and the
third is (vec[I n ])(vec[I n ])> because of (B.5.24). Graybill [Gra83, p. 312] considers
it a justification of the interest of the commutation matrix that it appears in the
higher moments of the standard normal. In my view, the commutation matrix is
ubiquitous only because the Kronecker-notation blows up something as trivial as the
crossing of two arms into a mysterious-sounding special matrix.
It is much easier to work with (10.5.2) without the detour over Kronecker prod-
ucts:
10.5. HIGHER MOMENTS OF THE MULTIVARIATE STANDARD NORMAL 293
Problem 173. [Gra83, 10.9.10 (1) on p. 366] Show that for symmetric A and
B E[z > Az z > Bz] = 2 tr(AB) + tr(A) tr(B).
Answer. This is (9.2.24) in the case of zero kurtosis, but here is a direct proof based on
(10.5.2):
A
A A A
z z
E = + +
z z
B B B
B
If one takes the variance-covariance matrix, which should in tile notation always
be written with a C , so that one knows which arms stick out in which direction, then
294 10. MULTIVARIATE NORMAL
= +
The sixth moments of the standard normal, in analogy to the fourth, are the
sum of all the different possible outer products of unit matrices:
10.5. HIGHER MOMENTS OF THE MULTIVARIATE STANDARD NORMAL 295
" z z z #
(10.5.5) E = + +
z z z
+ + + + +
+ + + + +
+ + + + .
Here is the principle how these were written down: Fix one branch, here the South-
west branch. First combine the Southwest branch with the Northwest one, and then
296 10. MULTIVARIATE NORMAL
you have three possibilities to pair up the others as in (10.5.2). Next combine the
Southwest branch with the North branch, and you again have three possibilities for
the others. Etc. This gives 15 possibilities altogether.
This can no longer be written as a Kronecker product, see [Gra83, 10.9.4 (3)
on p. 363]. However (10.5.5) can be applied directly, for instance in order to show
(10.5.6), which is [Gra83, 10.9.12 (1) on p. 368]:
(10.5.6) E[(z > Az)(z > Bz)(z > Cz)] = tr(A) tr(B) tr(C) + 2 tr(A) tr(BC)
+ 2 tr(B) tr(AC) + 2 tr(C) tr(AB) + 8 tr(ABC).
Answer.
z z z
(10.5.7) E B =
z z z
10.5. HIGHER MOMENTS OF THE MULTIVARIATE STANDARD NORMAL 297
A A A
= B + B + B +
C C C
A A A
+ B + B + B +
298 10. MULTIVARIATE NORMAL
A A A
+ B + B + B +
C C C
A A A
+ B + B + B +
10.6. THE GENERAL MULTIVARIATE NORMAL 299
A A A
+ B + B + B .
C C C
These 15 summands are, in order, tr(B) tr(AC), tr(ABC) twice, tr(B) tr(AC), tr(ABC) four
times, tr(A) tr(BC), tr(ABC) twice, tr(A) tr(BC), tr(C) tr(AB) twice, and tr(A) tr(B) tr(C).
In this case, clearly, E [y] = c and V [y] = σ 2 CC > , where σ 2 is the variance of
the standard normal.
300 10. MULTIVARIATE NORMAL
We will say: the vector is multivariate normal, and its elements or subvectors
x
are jointly normal, i.e., x and y are jointly normal if and only if is multivariate
y
normal. This is not transitive. If x and y are jointly normal and y and z are, then
x and z need not be. And even if all three pairs are jointly normal, this does not
mean they are multivariate normal.
For the proof we will use the following theorem: The distribution of a random
variable y is fully characterized by the univariate distributions of all a> y for all
vectors a. A proof can be found in [Rao73, p. 517].
Assume u = Cx + c and v = Dz + d where x and z are standard normal, and u
and v have equal mean and variances, i.e., c = d and CC > = DD > . We will show
that u and v or, equivalently, Cx and Dz are identically distributed, by verifying
that for every vector a, the distribution of a> Cx is the same as the distribution of
a> Dz. There are two cases: either a> C = o> ⇒ a> CC > = o> ⇒ a> DD > =
o> ⇒ a> D = o> , therefore a> Cx and a> Dz have equal distributions degenerate
at zero. Now if a> C 6= o> , then without loss of generality one can restrict oneself
10.6. THE GENERAL MULTIVARIATE NORMAL 301
to the a with a> CC > a = 1, therefore also a> DD > a = 1. By theorem 10.4.2, both
a> Cx and a> Dy are standard normal.
x
Theorem 10.6.3. If is multivariate normal and C [x, y] = O, then x and y
y
are independent.
Proof. Let µ = E [x] and ν = E [y], and A and B two matrices with AA> =
>
V [x] and
BB = V [y], and let u and v independent standard normal variables.
x
Then has the same distribution as
y
A O u µ
(10.6.1) + .
O B v ν
Since u and v are independent, x and y are also independent.
Problem 175. Show that, if y ∼ Nn (θ, σ 2Σ ), then
(10.6.2) ΣD > )
Dy + d ∼ Nk (Dθ + d, σ 2 DΣ
Answer. Follows immediately from our definition of a multivariate normal.
Theorem 10.6.4. Let y ∼ N (θ, σ 2Σ ). Then one can find two matrices B and
D so that z = B(y − θ) is standard normal, and y = Dz + θ.
302 10. MULTIVARIATE NORMAL
Problem 176. Show that a random variable y with expected value θ and non-
singular covariance matrix σ 2Σ is multivariate normal iff its density function is
1
fy (y) = (2πσ 2 )−n/2 (det Σ )−1/2 exp − (y − θ)>Σ −1 (y − θ) .
(10.6.4)
2
Hint: use the matrices B and D from theorem 10.6.4.
10.6. THE GENERAL MULTIVARIATE NORMAL 303
1 >
(10.6.5) fz (z) = (2πσ 2 )−n/2 exp(− z z).
2σ 2
> >
From this we get the √one of y. Since I = B Σ B,√it follows 1 = det(B Σ B) = (det B)2 det Σ ,
> > >
therefore J = det B = ± det Σ , and |J| = | det B| = det Σ . Since z z = (y − θ) B B(y − θ) =
(y − θ)>Σ −1 (y − θ), y has the density function (10.6.4).
Conversely, assume we know that y has the density function (10.6.4). Then let us derive from
this the density function of z = B(y − θ). Since√ Σ is nonsingular, one can
√ solve y = Dz + θ.
Since DD > = Σ , it follows J = det D = ± det Σ , and therefore |J| = det Σ . Furthermore,
(y − θ)>Σ −1 (y − θ) = z > z, i.e., the density of z is that of a standard normal. Since y is a linear
transformation of z, it is multivariate normal.
Problem 177. Show that the moment generating function of a multivariate nor-
mal y ∼ N (θ, σ 2Σ ) is
(10.6.6) my (t) = exp(t> θ + σ 2 t>Σ t/2).
Give a proof which is valid for singular as well as nonsingular Σ . You may use
the formula for the moment generating function of a multivariate Standard normal
for this proof.
304 10. MULTIVARIATE NORMAL
Answer.
(10.6.7)
my (t) = E exp(t> y) = E exp t> (Dz+θ) = exp(t> θ) E exp t> Dz = exp(t> θ) exp σ 2 t> DD >
n−1 2 2
2σ 2(n−1)σ 4
The alternative t2 = n+1
s ; therefore its bias is − n+1 and its variance is (n+1)2
, and the MSE
2σ 4
is n+1
.
Problem 179. TheP n × 1 vector y and distribution y ∼ N (ιθ, σ 2 I). Show that
ȳ is independent of q = (y i − ȳ)2 , and that q ∼ σ 2 χ2n−1 .
P 2
Answer. Set z = y − ιθ. Then q = (z i − z̄) and ȳ = z̄ + θ, and the statement follows from
theorem 10.4.2 with P = √1n ι> .
10.6. THE GENERAL MULTIVARIATE NORMAL 307
−1 ..........................................................
0 1 ........
..........
.............. 2 ......
.....
....
3
. .
..
..
. ...
........ ...
.
..
..
........ ...
..
..
...... ...
..... ...
3 .........
..
......
..
. ...
...
...
3
.
.
..
.....
. ...
...
........ .
......
. ..
...... .
....
...... ...
.
.... ...
.
.... ...
....
.... ..
..
..... ..
.
.
.. ..
.... ...
.... ...
... ..
....
2 .
...
.... .
.... 2
... ...
.... ...
.... ...
.
..... ...
.. ...
...
... ...
... ..
....
. ...
.
.. ...
...
... ...
... ..
.
.... .. ..
. .
... ...
1 ...
...
....
...
... 1
... .....
... ..
...
... ...
... ...
... .....
... ...
... ...
...
... ...
... .
....
... ...
...
... ....
... ....
0 ...
.
. .
....
. 0
... .
....
.. ...
... ....
... ....
..... .
.....
.
.
.. ....
... ....
....
... ....
.... .......
.
.. .....
..... .....
.....
... ......
... ...
......
... ......
......
−1 ...
...
... ......
......
−1
... ....
........
..
CHAPTER 11
Only for the sake of this exercise we will assume that “intelligence” is an innate
property of individuals and can be represented by a real number z. If one picks at
random a student entering the U of U, the intelligence of this student is a random
variable which we assume to be normally distributed with mean µ and standard
deviation σ. Also assume every student has to take two intelligence tests, the first
at the beginning of his or her studies, the other half a year later. The outcomes of
these tests are x and y. x and y measure the intelligence z (which is assumed to be
the same in both tests) plus a random error ε and δ, i.e.,
(11.0.14) x=z+ε
(11.0.15) y =z+δ
309
310 11. THE REGRESSION FALLACY
Here z ∼ N (µ, τ 2 ), ε ∼ N (0, σ 2 ), and δ ∼ N (0, σ 2 ) (i.e., we assume that both errors
have the same variance). The three variables ε, δ, and z are independent of each
other. Therefore x and y are jointly normal. var[x] = τ 2 + σ 2 , var[y] = τ 2 + σ 2 ,
2
cov[x, y] = cov[z + ε, z + δ] = τ 2 + 0 + 0 + 0 = τ 2 . Therefore ρ = τ 2τ+σ2 . The contour
lines of the joint density are ellipses with center (µ, µ) whose main axes are the lines
y = x and y = −x in the x, y-plane.
Now what is the conditional mean? Since var[x] = var[y], (10.3.17) gives the
line E[y|x=x] = µ + ρ(x − µ), i.e., it is a line which goes through the center of the
ellipses but which is flatter than the line x = y representing the real underlying linear
relationship if there are no errors. Geometrically one can get it as the line which
intersects each ellipse exactly where the ellipse is vertical.
Therefore, the parameters of the best prediction of y on the basis of x are not
the parameters of the underlying relationship. Why not? Because not only y but
also x is subject to errors. Assume you pick an individual by random, and it turns
out that his or her first test result is very much higher than the average. Then it is
more likely that this is an individual which was lucky in the first exam, and his or
her true IQ is lower than the one measured, than that the individual is an Einstein
who had a bad day. This is simply because z is normally distributed, i.e., among the
students entering a given University, there are more individuals with lower IQ’s than
Einsteins. In order to make a good prediction of the result of the second test one
11. THE REGRESSION FALLACY 311
must make allowance for the fact that the individual’s IQ is most likely lower than
his first score indicated, therefore one will predict the second score to be lower than
the first score. The converse is true for individuals who scored lower than average,
i.e., in your prediction you will do as if a “regression towards the mean” had taken
place.
The next important point to note here is: the “true regression line,” i.e., the
prediction line, is uniquely determined by the joint distribution of x and y. However
the line representing the underlying relationship can only be determined if one has
information in addition to the joint density, i.e., in addition to the observations.
E.g., assume the two tests have different standard deviations, which may be the case
simply because the second test has more questions and is therefore more accurate.
Then the underlying 45◦ line is no longer one of the main axes of the ellipse! To be
more precise, the underlying line can only be identified if one knows the ratio of the
variances, or if one knows one of the two variances. Without any knowledge of the
variances, the only thing one can say about the underlying line is that it lies between
the line predicting y on the basis of x and the line predicting x on the basis of y.
The name “regression” stems from a confusion between the prediction line and
the real underlying relationship. Francis Galton, the cousin of the famous Darwin,
measured the height of fathers and sons, and concluded from his evidence that the
heights of sons tended to be closer to the average height than the height of the
312 11. THE REGRESSION FALLACY
fathers, a purported law of “regression towards the mean.” Problem 180 illustrates
this:
Problem 180. The evaluation of two intelligence tests, one at the beginning
of the semester, one at the end, gives the following disturbing outcome: While the
underlying intelligence during the first test was z ∼ N (100, 20), it changed between
the first and second test due to the learning experience at the university. If w is the
intelligence of each student at the second test, it is connected to his intelligence z
at the first test by the formula w = 0.5z + 50, i.e., those students with intelligence
below 100 gained, but those students with intelligence above 100 lost. (The errors
of both intelligence tests are normally distributed with expected value zero, and the
variance of the first intelligence test was 5, and that of the second test, which had
more questions, was 4. As usual, the errors are independent of each other and of the
actual intelligence.)
• a. 3 points If x and y are the outcomes of the first and second intelligence
test, compute E[x], E[y], var[x], var[y], and the correlation coefficient ρ = corr[x, y].
Figure 1 shows an equi-density line of their joint distribution; 95% of the probability
mass of the test results are inside this ellipse. Draw the line w = 0.5z + 50 into
Figure 1.
Answer. We know z ∼ N (100, 20); w = 0.5z + 50; x = z + ε; ε ∼ N (0, 4); y = w + δ;
δ ∼ N (0, 5); therefore E[x] = 100; E[y] = 100; var[x] = 20 + 5 = 25; var[y] = 5 + 4 = 9;
11. THE REGRESSION FALLACY 313
The line y = 50 + 0.5x goes through the points (80, 90) and (120, 110).
6
• c. 2 points Another researcher says that w = 10 z + 40, z ∼ N (100, 100
6 ),
50
ε ∼ N (0, 6 ), δ ∼ N (0, 3). Is this compatible with the data?
314 11. THE REGRESSION FALLACY
6
Answer. Yes, it is compatible: E[x] = E[z]+E[ε] = 100; E[y] = E[w]+E[δ] = 10
100+40 = 100;
6 2
100 50
63 100 6
var[x] = 6
+ 6
= 25; var[y] = 10
var[z] + var[δ] = 100 6
+ 3 = 9; cov[x, y] = 10 var[z] =
10.
• d. 4 points A third researcher asserts that the IQ of the students really did not
change. He says w = z, z ∼ N (100, 5), ε ∼ N (0, 20), δ ∼ N (0, 4). Is this compatible
with the data? Is there unambiguous evidence in the data that the IQ declined?
Answer. This is not compatible. This scenario gets everything right except the covariance:
E[x] = E[z] + E[ε] = 100; E[y] = E[z] + E[δ] = 100; var[x] = 5 + 20 = 25; var[y] = 5 + 4 = 9;
cov[x, y] = 5. A scenario in which both tests have same underlying intelligence cannot be found.
Since the two conditional expectations are on the same side of the diagonal, the hypothesis that
the intelligence did not change between the two tests is not consistent with the joint distribution
of x and y. The diagonal goes through the points (82, 82) and (118, 118), i.e., it intersects the two
horizontal boundaries of Figure 1.
We just showed that the parameters of the true underlying relationship cannot
be inferred from the data alone if there are errors in both variables. We also showed
that this lack of identification is not complete, because one can specify an interval
which in the plim contains the true parameter value.
Chapter 53 has a much more detailed discussion of all this. There we will see
that this lack of identification can be removed if more information is available, i.e., if
one knows that the two error variances are equal, or if one knows that the regression
11. THE REGRESSION FALLACY 315
has zero intercept, etc. Question 181 shows that in this latter case, the OLS estimate
is not consistent, but other estimates exist that are consistent.
Problem 181. [Fri57, chapter 3] According to Friedman’s permanent income
hypothesis, drawing at random families in a given country and asking them about
their income y and consumption c can be modeled as the independent observations of
two random variables which satisfy
(11.0.19) y = yp + yt ,
(11.0.20) c = cp + ct ,
(11.0.21) cp = βy p .
Here y p and cp are the permanent and y t and ct the transitory components of income
and consumption. These components are not observed separately, only their sums y
and c are observed. We assume that the permanent income y p is random, with
E[y p ] = µ 6= 0 and var[y p ] = τy2 . The transitory components y t and ct are assumed
to be independent of each other and of y p , and E[y t ] = 0, var[y t ] = σy2 , E[ct ] = 0,
and var[ct ] = σc2 . Finally, it is assumed that all variables are normally distributed.
•a. 2 points Given the above information,
write down the vector of expected val-
ues E [ yc ] and the covariance matrix V [ yc ] in terms of the five unknown parameters
of the model µ, β, τy2 , σy2 , and σc2 .
316 11. THE REGRESSION FALLACY
Answer.
y µ y τ 2 + σ2 βτy2
(11.0.22) E = and V = y 2 y .
c βµ c βτy β 2 τy2 + σc2
• b. 3 points Assume that you know the true parameter values and you observe a
family’s actual income y. Show that your best guess (minimum mean squared error)
of this family’s permanent income y p is
σy2 τy2
(11.0.23) y p∗ = µ+ 2 y.
τy2 + σy2 τy + σy2
Note: here we are guessing income, not yet consumption! Use (10.3.17) for this!
Answer. This answer also does the math for part c. The best guess is the conditional mean
cov[y p , y]
E[y p |y = 22,000] = E[y p ] + (22,000 − E[y])
var[y]
16,000,000
= 12,000 + (22,000 − 12,000) = 20,000
20,000,000
11. THE REGRESSION FALLACY 317
or equivalently
τy2
E[y p |y = 22,000] = µ + (22,000 − µ)
τy2 + σy2
2
σy τy2
= µ+ 22,000
τy2 + σy 2 τy2 + σy2
= (0.2)(12,000) + (0.8)(22,000) = 20,000.
• d. 2 points If a family’s income is y, show that your best guess about this
family’s consumption is
σ2 τy2
y
(11.0.29) c∗ = β 2 µ + y .
τy + σy2 τy2 + σy2
Instead of an exact mathematical proof you may also reason out how it can be obtained
from (11.0.23). Give the numbers for a family whose actual income is 22,000.
Answer. This is 0.7 times the best guess about the family’s permanent income, since the
transitory consumption is uncorrelated with everything else and therefore must be predicted by 0.
This is an acceptable answer, but one can also derive it from scratch:
(11.0.30)
cov[c, y]
E[c|y = 22,000] = E[c] + (22,000 − E[y])
var[y]
βτy2 16,000,000
(11.0.31) = βµ + (22,000 − µ) = 8,400 + 0.7 (22,000 − 12,000) = 14,000
τy2 + σy2 20,000,000
(11.0.32)
σy2 τy2
or =β µ+ 22,000
τy2 + σy2 τy2 + σy2
(11.0.33) = 0.7 (0.2)(12,000) + (0.8)(22,000) = (0.7)(20,000) = 14,000.
11. THE REGRESSION FALLACY 319
The remainder of this Problem uses material that comes later in these Notes:
• e. 4 points From now on we will assume that the true values of the parameters
are not known, but two vectors y and c of independent observations are available.
We will show that it is not correct in this situation to estimate β by regressing c on
y with the intercept suppressed. This would give the estimator
P
ci y i
(11.0.34) β̂ = P 2
yi
Show that the plim of this estimator is
E[cy]
(11.0.35) plim[β̂] =
E[y 2 ]
Which theorems do you need for this proof ? Show that β̂ is an inconsistent estimator
of β, which yields too small values for β.
Answer. First rewrite the formula for β̂ in such a way that numerator and denominator each
has a plim: by the weak law of large numbers the plim of the average is the expected value, therefore
we have to divide both numerator and denominator by n. Then we can use the Slutsky theorem
that the plim of the fraction is the fraction of the plims.
1
P
cy E[cy] E[c] E[y] + cov[c, y] µβµ + βτy2 µ2 + τy2
β̂ = n
1
P i 2i ; plim[β̂] = 2
= 2
= 2 2 2
=β 2 .
n
yi E[y ] (E[y]) + var[y] µ + τy + σy µ + τy2 + σy2
320 11. THE REGRESSION FALLACY
• f. 4 points Give the formulas of the method of moments estimators of the five
paramaters of this model: µ, β, τy2 , σy2 , and σp2 . (For this you have to express these
five parameters in terms of the five moments E[y], E[c], var[y], var[c], and cov[y, c],
and then simply replace the population moments by the sample moments.) Are these
consistent estimators?
E[c]
Answer. From (11.0.22) follows E[c] = β E[y], therefore β = E[y]
. This together with
cov[y,c] cov[y,c] E[y]
cov[y, c] = βτy2 gives τy2 = β
= E[c]
. This together with var[y] = τy2 + σy2 gives
cov[y,c] E[y]
σy2 = var[y] − τy2 = var[y] − E[c]
. And from the last equation var[c] = β 2 τy2 + σc2 one get
cov[y,c] E[c]
σc2 = var[c] − E[y]
. All these are consistent estimators, as long as E[y] 6= 0 and β 6= 0.
• g. 4 points Now assume you are not interested in estimating β itself, but in
addition to the two n-vectors y and c you have an observation of y n+1 and you want
to predict the corresponding cn+1 . One obvious way to do this would be to plug the
method-of moments estimators of the unknown parameters into formula (11.0.29)
for the best linear predictor. Show that this is equivalent to using the ordinary least
squares predictor c∗ = α̂ + β̂y n+1 where α̂ and β̂ are intercept and slope in the simple
11. THE REGRESSION FALLACY 321
regression of c on y, i.e.,
P
(y i − ȳ)(ci − c̄)
(11.0.36) β̂ = P
(y i − ȳ)2
(11.0.37) α̂ = c̄ − β̂ȳ
Note that we are regressing c on y with an intercept, although the original model
does not have an intercept.
Answer. Here I am writing population moments where I should be writing sample moments.
First substitute the method of moments estimators in the denominator in (11.0.29): τy2 +σy2 = var[y].
Therefore the first summand becomes
1 E[c] cov[y, c] E[y] 1 cov[y, c] E[y] cov[y, c] E[y]
βσy2 µ = var[y]− E[y] = E[c] 1− = E[c]−
var[y] E[y] E[c] var[y] var[y] E[c] var[y]
cov[y,c]
But since var[y]
= β̂ and α̂ + β̂ E[y] = E[c] this expression is simply α̂. The second term is easier
to show:
τy2 cov[y, c]
β y= y = β̂y
var[y] var[y]
• h. 2 points What is the “Iron Law of Econometrics,” and how does the above
relate to it?
322 11. THE REGRESSION FALLACY
Answer. The Iron Law says that all effects are underestimated because of errors in the inde-
pendent variable. Friedman says Keynesians obtain their low marginal propensity to consume due
to the “Iron Law of Econometrics”: they ignore that actual income is a measurement with error of
the true underlying variable, permanent income.
Problem 182. This question follows the original article [SW76] much more
closely than [HVdP02] does. Sargent and Wallace first reproduce the usual argument
why “activist” policy rules, in which the Fed “looks at many things” and “leans
against the wind,” are superior to policy rules without feedback as promoted by the
monetarists.
They work with a very stylized model in which national income is represented by
the following time series:
(11.0.38) y t = α + λy t−1 + βmt + ut
Here y t is GNP, measured as its deviation from “potential” GNP or as unemployment
rate, and mt is the rate of growth of the money supply. The random disturbance ut
is assumed independent of y t−1 , it has zero expected value, and its variance var[ut ]
is constant over time, we will call it var[u] (no time subscript).
• a. 4 points First assume that the Fed tries to maintain a constant money
supply, i.e., mt = g0 + εt where g0 is a constant, and εt is a random disturbance
since the Fed does not have full control over the money supply. The εt have zero
11. THE REGRESSION FALLACY 323
expected value; they are serially uncorrelated, and they are independent of the ut .
This constant money supply rule does not necessarily make y t a stationary time
series (i.e., a time series where mean, variance, and covariances do not depend on
t), but if |λ| < 1 then y t converges towards a stationary time series, i.e., any initial
deviations from the “steady state” die out over time. You are not required here to
prove that the time series converges towards a stationary time series, but you are
asked to compute E[y t ] in this stationary time series.
• b. 8 points Now assume the policy makers want to steer the economy towards
a desired steady state, call it y ∗ , which they think makes the best tradeoff between
unemployment and inflation, by setting mt according to a rule with feedback:
(11.0.39) mt = g0 + g1 y t−1 + εt
Show that the following values of g0 and g1
(11.0.40) g0 = (y ∗ − α)/β g1 = −λ/β
represent an optimal monetary policy, since they bring the expected value of the steady
state E[y t ] to y ∗ and minimize the steady state variance var[y t ].
• c. 3 points This is the conventional reasoning which comes to the result that a
policy rule with feedback, i.e., a policy rule in which g1 6= 0, is better than a policy rule
324 11. THE REGRESSION FALLACY
without feedback. Sargent and Wallace argue that there is a flaw in this reasoning.
Which flaw?
• d. 5 points A possible system of structural equations from which (11.0.38) can
be derived are equations (11.0.41)–(11.0.43) below. Equation (11.0.41) indicates that
unanticipated increases in the growth rate of the money supply increase output, while
anticipated ones do not. This is a typical assumption of the rational expectations
school (Lucas supply curve).
(11.0.41) y t = ξ0 + ξ1 (mt − Et−1 mt ) + ξ2 y t−1 + ut
The Fed uses the policy rule
(11.0.42) mt = g0 + g1 y t−1 + εt
and the agents know this policy rule, therefore
(11.0.43) Et−1 mt = g0 + g1 y t−1 .
Show that in this system, the parameters g0 and g1 have no influence on the time
path of y.
• e. 4 points On the other hand, the econometric estimations which the policy
makers are running seem to show that these coefficients have an impact. During a
11. THE REGRESSION FALLACY 325
certain period during which a constant policy rule g0 , g1 is followed, the econome-
tricians regress y t on y t−1 and mt in order to estimate the coefficients in (11.0.38).
Which values of α, λ, and β will such a regression yield?
326 11. THE REGRESSION FALLACY
110 110
........................................................
........................
................. ......
........... ...
.......... ...
.....
........... .
..
.... .
...... ..
... ........ .
.
...
..
......... ....
.. ...
....... ....
...... ....
...... ....
...... .....
100 ......... .. ...... 100
..... ......
.... ......
.... ......
.... .......
.... ... ..........
.
... ........
... .......
.... .........
.........
...
.... .
. .................
....... .............
........................ .....................................................
..........
90 90
We will discuss here a simple estimation problem, which can be considered the
prototype of all least squares estimation. Assume we have n independent observations
y1 , . . . , yn of a Normally distributed random variable y ∼ N (µ, σ 2 ) with unknown
location parameter µ and dispersion parameter σ 2 . Our goal is to estimate the
location parameter and also estimate some measure of the precision of this estimator.
1. The location parameter of the Normal distribution is its expected value, and
by the weak law of large numbers, the probability limit for n → ∞ of the sample
mean is the expected value.
2. The expected value µ is sometimes called the “population mean,” while ȳ is
the sample mean. This terminology indicates that there is a correspondence between
population quantities and sample quantities, which is often used for estimation. This
is the principle of estimating the unknown distribution of the population by the
empirical distribution of the sample. Compare Problem 63.
3. This estimator is also unbiased. By definition, an estimator t of the parameter
θ is unbiased if E[t] = θ. ȳ is an unbiased estimator of µ, since E[ȳ] = µ.
4. Given n observations y1 , . . . , yn , the sample mean is the number a = ȳ which
minimizes (y1 − a)2 + (y2 − a)2 + · · · + (yn − a)2 . One can say it is the number whose
squared distance to the given sample numbers is smallest. This idea is generalized
in the least squares principle of estimation. It follows from the following frequently
used fact:
5. In the case of normality the sample mean is also the maximum likelihood
estimate.
12.1. SAMPLE MEAN AS ESTIMATOR OF THE LOCATION PARAMETER 329
Answer.
n n
X X 2
(12.1.2) (yi − α)2 = (yi − ȳ) + (ȳ − α)
i=1 i=1
n n n
X X X
(12.1.3) = (yi − ȳ)2 + 2 (yi − ȳ)(ȳ − α) + (ȳ − α)2
i=1 i=1 i=1
n n
X X
(12.1.4) = (yi − ȳ)2 + 2(ȳ − α) (yi − ȳ) + n(ȳ − α)2
i=1 i=1
.................................................................. ....................................................................................................................................
..................... . ..................
............................................................
..................... ..................
.................. .................. ....................................... ..................... ....................................... ..................
..................... ......................
...............................................................................................................................................
. q ............................................................................................................................
................. ....................... ........................
.....................
................
µ1 µ2 µ3 µ4
µ1 µ2 µ3 µ4
over (y1 + y2 )/2, there is one which is highest over y1 and y2 . Figure 4 shows the
densities for standard deviations 0.01, 0.05, 0.1, 0.5, 1, and 5. All curves, except
the last one, are truncated at the point where the resolution of TEX can no longer
distinguish between their level and zero. For the last curve this point would only be
reached at the coordinates ±25.
4) If we have many observations, then the density pattern of the observations,
as indicated by the histogram below, approximates the actual density function of y
itself. That likelihood function must be chosen which has a high value where the
points are dense, and which has a low value where the points are not so dense.
12.2. INTUITION OF THE MAXIMUM LIKELIHOOD ESTIMATOR 333
..........
............
............
.................
..........
...........
.............
.................
.........
................
..............
................
.........
............
.............
...............
..........
.................
....................
........................
............
.....................
..................
........................
.............
...................
...................
........................
...............
................
........................
. . . .............. ........ .................
... . . ... . .. ...
.... ............................. ....
............................. ... ........ .... ...................................................
...................... ......... ....................
................... ........ .. .. .... .. ..
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Figure 4. Only those centered over the two observations need to be considered
1 X
(12.3.2) s2u = (y i − ȳ)2 .
n−1
Let us compute the expected value of our two estimators. Equation (12.1.1) with
α = E[y] allows us to simplify the sum of squared errors so that it becomes easy to
take expected values:
n
X n
X
(12.3.3) E[ (y i − ȳ)2 ] = E[(y i − µ)2 ] − n E[(ȳ − µ)2 ]
i=1 i=1
n
X σ2
(12.3.4) = σ2 − n = (n − 1)σ 2 .
i=1
n
336 12. A SIMPLE EXAMPLE OF ESTIMATION
σ2
because E[(y i − µ)2 ] = var[y i ] = σ 2 and E[(ȳ − µ)2 ] = var[ȳ] = n . Therefore, if we
use as estimator of σ 2 the quantity
n
1 X
(12.3.5) s2u = (y − ȳ)2
n − 1 i=1 i
then this is an unbiased estimate.
Problem 186. 4 points Show that
n
1 X
(12.3.6) s2u = (y − ȳ)2
n − 1 i=1 i
is an unbiased estimator of the variance. List the assumptions which have to be made
about y i so that this proof goes through. Do you need Normality of the individual
observations y i to prove this?
Answer. Use equation (12.1.1) with α = E[y]:
n n
X X
(12.3.7) E[ (y i − ȳ)2 ] = E[(y i − µ)2 ] − n E[(ȳ − µ)2 ]
i=1 i=1
n
X σ2
(12.3.8) = σ2 − n = (n − 1)σ 2 .
n
i=1
12.3. VARIANCE ESTIMATION AND DEGREES OF FREEDOM 337
For testing, confidence intervals, etc., one also needs to know the probability
distribution of s2u . For this look up once more Section 5.9 about the Chi-Square
distribution. There we introduced the terminology that a random variable q is dis-
tributed as a σ 2 χ2 iff q/σ 2 is a χ2 . In our modelPwith n independent normal variables
y i with same mean and variance, the variable (y i − ȳ)2 is a σ 2 χ2n−1 . Problem 187
gives a proof of this in the simplest case n = 2, and Problem 188 looks at the case
σ2
n = 3. But it is valid for higher n too. Therefore s2u is a n−1 χ2n−1 . This is re-
2
markable: the distribution of su does not depend on µ. Now use (5.9.5) to get the
2σ 4
variance of s2u : it is n−1 .
Problem 187. Let y 1 and y 2 be two independent Normally distributed variables
with mean µ and variance σ 2 , and let ȳ be their arithmetic mean.
• a. 2 points Show that
2
X
(12.3.9) SSE = (y i − ȳ)2 ∼ σ 2 χ21
i−1
Hint: Find a Normally distributed random variable z with expected value 0 and vari-
ance 1 such that SSE = σ 2 z 2 .
338 12. A SIMPLE EXAMPLE OF ESTIMATION
Answer.
y1 + y2
(12.3.10) ȳ =
2
y1 − y2
(12.3.11) y 1 − ȳ =
2
y1 − y2
(12.3.12) y 2 − ȳ =−
2
(y 1 − y 2 )2 (y − y 2 )2
(12.3.13) (y 1 − ȳ)2 + (y 2 − ȳ)2 = + 1
4 4
2
(y 1 − y 2 )2 2 y1 − y2
(12.3.14) = =σ √ ,
2 2σ 2
√
and since z = (y 1 − y 2 )/ 2σ 2 ∼ N (0, 1), its square is a χ21 .
1 1
h i
−2
and V [Dy] = D V [y]D > = σ 2 D because V [y] = σ 2 I and D = 2
1 1 is symmetric and
−2 2
idempotent. D is singular because its determinant is zero.
√
start with ȳ n and generate n − 1 linear combinations of the y i which are pairwise
uncorrelated and have √ variances σ 2 . You are simply building an orthonormal co-
ordinate system with ȳ n as its first vector; there are many different ways to do
this.
Next let us show that ȳ and s2u are statistically independent. This is an ad-
vantage. Assume, hypothetically, ȳ and s2u were negatively correlated. Then, if the
observed value of ȳ is too high, chances are that the one of s2u is too low, and a look
at s2u will not reveal how far off the mark ȳ may be. To prove independence, we will
first show that ȳ and y i − ȳ are uncorrelated:
(12.3.18) cov[ȳ, y i − ȳ] = cov[ȳ, y i ] − var[ȳ]
1 σ2
(12.3.19) = cov[ (y 1 + · · · + y i + · · · + y n ), y i ] − =0
n n
By normality, ȳ is therefore independent of y i − ȳ for all i. Since all variables in-
volved are jointly normal, it follows from this that ȳ is independent of the vector
>
y 1 − ȳ · · · y n − ȳ ; therefore it is also independent of any function of this vec-
tor, such as s2u .
The above calculations explain why the parameter of the χ2 distribution has
the colorful name “degrees of freedom.” This term is sometimes used in a very
broad sense, referring to estimation in general, and sometimes in a narrower sense,
342 12. A SIMPLE EXAMPLE OF ESTIMATION
in conjunction with the linear model. Here is first an interpretation of the general use
of the term. A “statistic” is defined to be a function of the observations and of other
known parameters of the problem, but not of the unknown parameters. Estimators
are statistics. If one has n observations, then one can find at most n mathematically
independent statistics; any other statistic is then a function of these n. If therefore
a model has k independent unknown parameters, then one must have at least k
observations to be able to estimate all parameters of the model. The number n − k,
i.e., the number of observations not “used up” for estimation, is called the number
of “degrees of freedom.”
There are at least three reasons why one does not want to make the model such
that it uses up too many degrees of freedom. (1) the estimators become too inaccurate
if one does; (2) if there are no degrees of freedom left, it is no longer possible to make
any “diagnostic” tests whether the model really fits the data, because it always gives
a perfect fit whatever the given set of data; (3) if there are no degrees of freedom left,
then one can usually also no longer make estimates of the precision of the estimates.
Specifically in our linear estimation problem, the number of degrees of freedom
is n − 1, since one observation has been used up for estimating the mean. If one
runs a regression, the number of degrees of freedom is n − k, where k is the number
of regression coefficients. In the linear model, the number of degrees of freedom
becomes immediately relevant for the estimation of σ 2 . If k observations are used
12.3. VARIANCE ESTIMATION AND DEGREES OF FREEDOM 343
up for estimating the slope parameters, then the other n − k observations can be
combined into a n − k-variate Normal whose expected value does not depend on the
slope parameter at all but is zero, which allows one to estimate the variance.
If we assume that the original observations are normally distributed, i.e., y i ∼
σ2
NID(µ, σ 2 ), then we know that s2u ∼ n−1 χ2n−1 . Therefore E[s2u ] = σ 2 and var[s2u ] =
4 2
2σ /(n − 1). This estimate of σ therefore not only gives us an estimate of the
precision of ȳ, but it has an estimate of its own precision built in.P
(y i −ȳ)2
Interestingly, the MSE of the alternative estimator s2m = n is smaller
than that of s2u , although s2m is a biased estimator and s2u an unbiased estimator of
σ 2 . For every estimator t, MSE[t; θ] = var[t] + (E[t − θ])2 , i.e., it is variance plus
2σ 4
squared bias. The MSE of s2u is therefore equal to its variance, which is n−1 . The
2 4 4
2σ (n−1)
alternative s2m = n−1 2 σ
n su has bias − n and variance n2 . Its MSE is (2−1/n)σ
n .
2
Comparing that with the formula for the MSE of su one sees that the numerator is
smaller and the denominator is bigger, therefore s2m has smaller MSE.
Problem 190. 4 points Assume y i ∼ NID(µ, σ 2 ). Show that the so-called Theil
Schweitzer estimator [TS61]
1 X
(12.3.20) s2t = (y i − ȳ)2
n+1
344 12. A SIMPLE EXAMPLE OF ESTIMATION
n−1 2 2
2σ 2(n−1)σ 4
Answer. s2t = s ;
n+1 u
therefore its bias is − n+1 and its variance is (n+1)2
, and the
2σ 4 2n−1 2
MSE is n+1
.
That this is smaller than the MSE of s2m means n2
≥ n+1
, which follows from
(2n − 1)(n + 1) = 2n2 + n − 1 > 2n2 for n > 1.
19 19
(12.3.21) fs2 (x) = f 2 ( x).
u 2 χ19 2
12.3. VARIANCE ESTIMATION AND DEGREES OF FREEDOM 345
......................
....... ...............................................
..... ......... ...................
.... ..... ....................
... ............... ........................
.................................
.
...
..
..
.
...............
.
. ..................................................................
.
.................. .
0 1 2 3 4 5 6
• a. 2 points In the same plot, plot the density function of the Theil-Schweitzer
estimate s2t defined in equation (12.3.20). This gives a plot as in Figure 6. Can
one see from the comparison of these density functions that the Theil-Schweitzer
estimator has a better MSE?
Answer. Start with plotting the Theil-Schweitzer plot, because it is higher, and therefore it
will give the right dimensions of the plot. You can run this by giving the command ecmetscript(theil
The two areas between the densities have equal size, but the area where the Theil-Schweitzer density
is higher is overall closer to the true value than the area where the unbiased density is higher.
Problem 192. 4 points The following problem illustrates the general fact that
if one starts with an unbiased estimator and “shrinks” it a little, one will end up
with a better MSE. Assume E[y] = µ, var(y) = σ 2 , and you make n independent
observations y i . The best linear unbiased estimator of µ on the basis of these
346 12. A SIMPLE EXAMPLE OF ESTIMATION
nµ2 − σ 2
(12.3.22) <α<1
nµ2 + σ 2
then MSE[αȳ; µ] < MSE[ȳ; µ]. Unfortunately, this condition depends on µ and σ 2
and can therefore not be used to improve the estimate.
This cannot be true for α ≥ 1, because for α = 1 one has equality, and for α > 1, the righthand side
is negative. Therefore we are allowed to assume α < 1, and can divide by 1 − α without disturbing
the inequality:
Problem 193. [KS79, example 17.14 on p. 22] The mathematics in the following
problem is easier than it looks. If you can’t prove a., assume it and derive b. from
it, etc.
• a. 2 points Let t be an estimator
of the nonrandom scalar parameter θ. E[t − θ]
is called the bias of t, and E (t − θ)2 is called the mean squared error of t as an
estimator of θ, written MSE[t; θ]. Show that the MSE is the variance plus the squared
bias, i.e., that
2
(12.3.29) MSE[t; θ] = var[t] + E[t − θ] .
Answer. The most elegant proof, which also indicates what to do when θ is random, is:
(12.3.30) MSE[t; θ] = E (t − θ)2 = var[t − θ] + (E[t − θ])2 = var[t] + (E[t − θ])2 .
• b. 2 points For the rest of this problem assume that t is an unbiased estimator
of θ with var[t] > 0. We will investigate whether one can get a better MSE if one
348 12. A SIMPLE EXAMPLE OF ESTIMATION
• c. 1 point Show that, whenever a > 1, then MSE[at; θ] > MSE[t; θ]. If one
wants to decrease the MSE, one should therefore not choose a > 1.
Answer. MSE[at; θ]−MSE[t; θ] = (a2 −1) var[t]+(a−1)2 θ2 > 0 since a > 1 and var[t] > 0.
From this follows that the MSE of at is smaller than the MSE of t, as long as a < 1
and close enough to 1.
Answer. The derivative of (12.3.31) is
d
(12.3.33) MSE[at; θ] = 2a var[t] + 2(a − 1)θ2
da
Plug a = 1 into this to get 2 var[t] > 0.
12.3. VARIANCE ESTIMATION AND DEGREES OF FREEDOM 349
• e. 2 points By solving the first order condition show that the factor a which
gives smallest MSE is
θ2
(12.3.34) a= .
var[t] + θ2
Answer. Rewrite (12.3.33) as 2a(var[t] + θ2 ) − 2θ2 and set it zero.
• g. 4 points Using this density function (and no other knowledge about the
exponential distribution) prove that t is an unbiased estimator of 1/λ, with var[t] =
1/λ2 .
350 12. A SIMPLE EXAMPLE OF ESTIMATION
R∞ R R
Answer. To evaluate λt exp(−λt) dt, use partial integration uv 0 dt = uv − u0 v dt
0 ∞
with u = t, u0 = 1, v = − exp(−λt), v0 = λ exp(−λt). Therefore the integral is −t exp(−λt) +
R∞ R∞ 0
exp(−λt) dt = 1/λ, since we just saw that λ exp(−λt) dt = 1.
0 R∞ 0
To evaluate λt2 exp(−λt) dt, use partial integration with u = t2 , u0 = 2t, v = − exp(−λt),
0 ∞ R∞ R∞
2
v 0 = λ exp(−λt). Therefore the integral is −t2 exp(−λt) +2 t exp(−λt) dt = λ
λt exp(−λt) d
0 0 0
2/λ2 . Therefore var[t] = E[t2 ] − (E[t])2 = 2/λ2 − 1/λ2 = 1/λ2 .
• j. 3 points Assume q ∼ σ 2 χ2m (in other words, σ12 q ∼ χ2m , a Chi-square distri-
bution with m degrees of freedom). Using the fact that E[χ2m ] = m and var[χ2m ] = 2m,
compute that multiple of q that has minimum MSE as estimator of σ 2 .
Answer. This is a trick question since q itself is not an unbiased estimator of σ 2 . E[q] = mσ 2 ,
therefore q/m is the unbiased estimator. Since var[q/m] = 2σ 4 /m, it follows from (12.3.34) that
q m q
a = m/(m + 2), therefore the minimum MSE multiple of q is m m+2
= m+2 . I.e., divide q by m + 2
instead of m.
tor is n1 (y i − ȳ)2 . What are the implications of the above for the question whether
P
one should use the first or the second or still some other multiple of (y i − ȳ)2 ?
P
Answer. Taking that multiple of the sum of squared errors which makes thePestimator un-
biased is not necessarily a good choice. In terms of MSE, the best multiple of (y i − ȳ)2 is
1
P 2.
n+1
(y i − ȳ)
• l. 3 points We are still in the model defined in k. Which multiple of the sample
mean ȳ has smallest MSE as estimator of µ? How does this example differ from the
ones given above? Can this formula have practical significance?
352 12. A SIMPLE EXAMPLE OF ESTIMATION
2
µ
Answer. Here the optimal a = µ2 +(σ 2 /n) . Unlike in the earlier examples, this a depends on
the unknown parameters. One can “operationalize” it by estimating the parameters from the data,
but the noise introduced by this estimation can easily make the estimator worse than the simple ȳ.
Indeed, ȳ is admissible, i.e., it cannot be uniformly improved upon. On the other hand, the Stein
rule, which can be considered an operationalization of a very similar formula (the only difference
being that one estimates the mean vector of a vector with at least 3 elements), by estimating µ2
1 2
and µ2 + n σ from the data, shows that such an operationalization is sometimes successful.
We will discuss here one more property of ȳ and s2u : They together form sufficient
statistics for µ and σ 2 . I.e., any estimator of µ and σ 2 which is not a function of ȳ
and s2u is less efficient than it could be. Since the factorization theorem for sufficient
statistics holds even if the parameter θ and its estimate t are vectors, we have to
write the joint density of the observation vector y as a product of two functions, one
depending on the parameters and the sufficient statistics, and the other depending
on the value taken by y, but not on the parameters. Indeed, it will turn out that
this second function can just be taken to be h(y) = 1, since the density function can
12.3. VARIANCE ESTIMATION AND DEGREES OF FREEDOM 353
be rearranged as
Xn
2 2 −n/2
(12.3.38) fy (y1 , . . . , yn ; µ, σ ) = (2πσ ) exp − (yi − µ)2 /2σ 2 =
i=1
Xn
= (2πσ 2 )−n/2 exp − (yi − ȳ)2 − n(ȳ − µ)2 /2σ 2 =
(12.3.39)
i=1
(n − 1)s2 − n(ȳ + µ)2
(12.3.40) = (2πσ 2 )−n/2 exp − u
.
2σ 2
CHAPTER 13
Roughly, a consistent estimation procedure is one which gives the correct parameter
values if the sample is large enough. There are only very few exceptional situations
in which an estimator is acceptable which is not consistent, i.e., which does not
converge in the plim to the true parameter value.
Problem 194. Can you think of a situation where an estimator which is not
consistent is acceptable?
Answer. If additional data no longer give information, like when estimating the initial state
of a timeseries, or in prediction. And if there is no identification but the value can be confined to
an interval. This is also inconsistency.
g(t) we have to show that for all ε > 0, Pr[|g(t) − g(θ)| ≥ ε] → 0. Choose for the
given ε a δ as above, then |g(t) − g(θ)| ≥ ε implies |t − θ| ≥ δ, because all those
values of t for with |t − θ| < δ lead to a g(t) with |g(t) − g(θ)| < ε. This logical
implication means that
Since the probability on the righthand side converges to zero, the one on the lefthand
side converges too.
Different consistent estimators can have quite different speeds of convergence.
Are there estimators which have optimal asymptotic properties among all consistent
estimators? Yes, if one limits oneself to a fairly reasonable subclass of consistent
estimators.
Here are the details: Most consistent estimators we will encounter are asymp-
totically normal, i.e., the “shape” of their distribution function converges towards
the normal distribution, as we had it for the sample mean in the central limit the-
orem. In order to be able to use this asymptotic distribution for significance tests
and confidence intervals, however, one needs more than asymptotic normality (and
many textbooks are not aware of this): one needs the convergence to normality to
be uniform in compact intervals [Rao73, p. 346–351]. Such estimators are called
consistent uniformly asymptotically normal estimators (CUAN estimators)
358 13. ESTIMATION PRINCIPLES
If one limits oneself to CUAN estimators it can be shown that there are asymp-
totically “best” CUAN estimators. Since the distribution is asymptotically normal,
there is no problem to define what it means to be asymptotically best: those es-
timators are asymptotically best whose asymptotic MSE = asymptotic variance is
smallest. CUAN estimators whose MSE is asymptotically no larger than that of
any other CUAN estimator, are called asymptotically efficient. Rao has shown that
for CUAN estimators the lower bound for this asymptotic variance is the asymptotic
limit of the Cramer Rao lower bound (CRLB). (More about the CRLB below). Max-
imum likelihood estimators are therefore usually efficient CUAN estimators. In this
sense one can think of maximum likelihood estimators to be something like asymp-
totically best consistent estimators, compare a statement to this effect in [Ame94, p.
144]. And one can think of asymptotically efficient CUAN estimators as estimators
who are in large samples as good as maximum likelihood estimators.
All these are large sample properties. Among the asymptotically efficient estima-
tors there are still wide differences regarding the small sample properties. Asymptotic
efficiency should therefore again be considered a minimum requirement: there must
be very good reasons not to be working with an asymptotically efficient estimator.
Answer. If robustness matters then the median may be preferable to the mean, although it
is less efficient.
for every continuous function g which is and nonincreasing for x < 0 and nondecreas-
ing for x > 0
This list is from [Ame94, pp. 118–122]. But we will simply use the MSE.
Therefore we are left with dilemma (2). There is no single estimator that has
uniformly the smallest MSE in the sense that its MSE is better than the MSE of
any other estimator whatever the value of the parameter value. To see this, simply
think of the following estimator t of θ: t = 10; i.e., whatever the outcome of the
experiments, t always takes the value 10. This estimator has zero MSE when θ
happens to be 10, but is a bad estimator when θ is far away from 10. If an estimator
existed which had uniformly best MSE, then it had to be better than all the constant
estimators, i.e., have zero MSE whatever the value of the parameter, and this is only
possible if the parameter itself is observed.
Although the MSE criterion cannot be used to pick one best estimator, it can be
used to rule out estimators which are unnecessarily bad in the sense that other esti-
mators exist which are never worse but sometimes better in terms of MSE whatever
13.2. SMALL SAMPLE PROPERTIES 361
the true parameter values. Estimators which are dominated in this sense are called
inadmissible.
But how can one choose between two admissible estimators? [Ame94, p. 124]
gives two reasonable strategies. One is to integrate the MSE out over a distribution
of the likely values of the parameter. This is in the spirit of the Bayesians, although
Bayesians would still do it differently. The other strategy is to choose a minimax
strategy. Amemiya seems to consider this an alright strategy, but it is really too
defensive. Here is a third strategy, which is often used but less well founded theoreti-
cally: Since there are no estimators which have minimum MSE among all estimators,
one often looks for estimators which have minimum MSE among all estimators with
a certain property. And the “certain property” which is most often used is unbiased-
ness. The MSE of an unbiased estimator is its variance; and an estimator which has
minimum variance in the class of all unbiased estimators is called “efficient.”
The class of unbiased estimators has a high-sounding name, and the results
related with Cramer-Rao and Least Squares seem to confirm that it is an important
class of estimators. However I will argue in these class notes that unbiasedness itself
is not a desirable property.
362 13. ESTIMATION PRINCIPLES
estimate of the parameter, without utilizing prior observations, and then would use
the average of all these independent estimates as its updated estimate, it would end
up displaying a wrong parameter value on the screen.
A biased extimator gives, even in the limit, an incorrect result as long as one’s
updating procedure is the simple taking the averages of all previous estimates. If
an estimator is biased but consistent, then a better updating method is available,
which will end up in the correct parameter value. A biased estimator therefore is not
necessarily one which gives incorrect information about the parameter value; but it
is one which one cannot update by simply taking averages. But there is no reason to
limit oneself to such a crude method of updating. Obviously the question whether
the estimate is biased is of little relevance, as long as it is consistent. The moral of
the story is: If one looks for desirable estimators, by no means should one restrict
one’s search to unbiased estimators! The high-sounding name “unbiased” for the
technical property E[t] = θ has created a lot of confusion.
Besides having no advantages, the category of unbiasedness even has some in-
convenient properties: In some cases, in which consistent estimators exist, there are
no unbiased estimators. And if an estimator t is an unbiased estimate for the pa-
rameter θ, then the estimator g(t) is usually no longer an unbiased estimator for
g(θ). It depends on the way a certain quantity is measured whether the estimator is
unbiased or not. However consistency carries over.
364 13. ESTIMATION PRINCIPLES
Unbiasedness is not the only possible criterion which ensures that the values of
the estimator are centered over the value it estimates. Here is another plausible
definition:
Definition 13.3.1. An estimator θ̂ of the scalar θ is called median unbiased for
all θ ∈ Θ iff
1
(13.3.1) Pr[θ̂ < θ] = Pr[θ̂ > θ] =
2
This concept is always applicable, even for estimators whose expected value does
not exist.
Problem 196. 6 points (Not eligible for in-class exams) The purpose of the fol-
lowing problem is to show how restrictive the requirement of unbiasedness is. Some-
times no unbiased estimators exist, and sometimes, as in the example here, unbiased-
ness leads to absurd estimators. Assume the random variable x has the geometric
distribution with parameter p, where 0 ≤ p ≤ 1. In other words, it can only assume
the integer values 1, 2, 3, . . ., with probabilities
(13.3.2) Pr[x = r] = (1 − p)r−1 p.
Show that the unique unbiased estimator of p on the basis of one observation of x is
the random variable f (x) defined by f (x) = 1 if x = 1 and 0 otherwise. Hint: Use
13.3. COMPARISON UNBIASEDNESS CONSISTENCY 365
the mathematical
P∞ fact that a function φ(q) that can be expressed as a power series
φ(q) = j=0 aj q j , and which takes the values φ(q) = 1 for all q in some interval of
nonzero length, is the power series with a0 = 1 and aj = 0 for j 6= 0. (You will need
the hint at the end of your answer, don’t try to start with the hint!)
P∞
Answer. Unbiasedness means that E[f (x)] = r=1
f (r)(1 − p)r−1 p = p for all p in the unit
P∞ r−1
interval, therefore r=1
f (r)(1 − p) = 1. This is a power series in q = 1 − p, which must be
identically equal to 1 for all values of q between 0 and 1. An application of the hint shows that
the constant term in this power series, corresponding to the value r − 1 = 0, must be = 1, and all
other f (r) = 0. Here older formulation: An application of the hint with q = 1 − p, j = r − 1, and
aj = f (j + 1) gives f (1) = 1 and all other f (r) = 0. This estimator is absurd since it lies on the
boundary of the range of possible values for q.
Problem 197. As in Question 61, you make two independent trials of a Bernoulli
experiment with success probability θ, and you observe t, the number of successes.
• a. Give an unbiased estimator of θ based on t (i.e., which is a function of t).
• b. Give an unbiased estimator of θ2 .
• c. Show that there is no unbiased estimator of θ3 .
Hint: Since t can only take the three values 0, 1, and 2, any estimator u which
is a function of t is determined by the values it takes when t is 0, 1, or 2, call them
u0 , u1 , and u2 . Express E[u] as a function of u0 , u1 , and u2 .
366 13. ESTIMATION PRINCIPLES
Answer. E[u] = u0 (1 − θ)2 + 2u1 θ(1 − θ) + u2 θ2 = u0 + (2u1 − 2u0 )θ + (u0 − 2u1 + u2 )θ2 . This
is always a second degree polynomial in θ, therefore whatever is not a second degree polynomial in θ
cannot be the expected value of any function of t. For E[u] = θ we need u0 = 0, 2u1 −2u0 = 2u1 = 1,
therefore u1 = 0.5, and u0 − 2u1 + u2 = −1 + u2 = 0, i.e. u2 = 1. This is, in other words, u = t/2.
For E[u] = θ2 we need u0 = 0, 2u1 − 2u0 = 2u1 = 0, therefore u1 = 0, and u0 − 2u1 + u2 = u2 = 1,
This is, in other words, u = t(t − 1)/2. From this equation one also sees that θ3 and higher powers,
or things like 1/θ, cannot be the expected values of any estimators.
Problem 198. This is [KS79, Question 17.11 on p. 34], originally [Fis, p. 700].
• a. 1 point Assume t and u are two unbiased estimators of the same unknown
scalar nonrandom parameter θ. t and u have finite variances and satisfy var[u − t] 6=
0. Show that a linear combination of t and u, i.e., an estimator of θ which can be
written in the form αt + βu, is unbiased if and only if α = 1 − β. In other words,
any unbiased estimator which is a linear combination of t and u can be written in
the form
(13.3.4) t + β(u − t).
13.3. COMPARISON UNBIASEDNESS CONSISTENCY 367
• b. 2 points By solving the first order condition show that the unbiased linear
combination of t and u which has lowest MSE is
cov[t, u − t]
(13.3.5) θ̂ = t − (u − t)
var[u − t]
Hint: your arithmetic will be simplest if you start with (13.3.4).
• c. 1 point If ρ2 is the squared correlation coefficient between t and u − t, i.e.,
(cov[t, u − t])2
(13.3.6) ρ2 =
var[t] var[u − t]
show that var[θ̂] = var[t](1 − ρ2 ).
• d. 1 point Show that cov[t, u − t] 6= 0 implies var[u − t] 6= 0.
• e. 2 points Use (13.3.5) to show that if t is the minimum MSE unbiased
estimator of θ, and u another unbiased estimator of θ, then
(13.3.7) cov[t, u − t] = 0.
• f. 1 point Use (13.3.5) to show also the opposite: if t is an unbiased estimator
of θ with the property that cov[t, u − t] = 0 for every other unbiased estimator u of
θ, then t has minimum MSE among all unbiased estimators of θ.
368 13. ESTIMATION PRINCIPLES
There are estimators which are consistent but their bias does not converge to
zero:
(
θ with probability 1 − n1
(13.3.8) θ̂n =
n with probability n1
Then Pr(θ̂n − θ ≥ ε) ≤ n1 , i.e., the estimator is consistent, but E[θ̂] = θ n−1
n +1 →
θ + 1 6= 0.
Problem 199. 4 points Is it possible to have a consistent estimator whose bias
becomes unbounded as the sample size increases? Either prove that it is not possible
or give an example.
Answer. Yes, this can be achieved by making the rare outliers even wilder than in (13.3.8),
say
1
θ with probability 1 − n
(13.3.9) θ̂n = 1
n2 with probability n
Here Pr(θ̂n − θ ≥ ε) ≤ 1
n
, i.e., the estimator is consistent, but E[θ̂] = θ n−1
n
+ n → θ + n.
And of course there are estimators which are unbiased but not consistent: sim-
ply take the first observation x1 as an estimator if E[x] and ignore all the other
observations.
13.4. THE CRAMER-RAO LOWER BOUND 369
This holds for every value y, and integrating over y gives 1 − E[log fy (y)] ≤ 1 −
E[log g(y)] or
This is an important extremal value property which distinguishes the density function
fy (y) of y from all other density functions: That density function g which maximizes
E[log g(y)] is g = fy , the true density function of y.
This optimality property lies at the basis of the Cramer-Rao inequality, and it
is also the reason why maximum likelihood estimation is so good. The difference
370 13. ESTIMATION PRINCIPLES
between the left and right hand side in (13.4.2) is called the Kullback-Leibler dis-
crepancy between the random variables y and x (where x is a random variable whose
density is g).
The Cramer Rao inequality gives a lower bound for the MSE of an unbiased
estimator of the parameter of a probability distribution (which has to satisfy cer-
tain regularity conditions). This allows one to determine whether a given unbiased
estimator has a MSE as low as any other unbiased estimator (i.e., whether it is
“efficient.”)
expected values with respect to the “wrong” parameter values. The same notational
convention also applies to variances, covariances, and the MSE.
Throughout this problem we assume that the following regularity conditions hold:
(a) the range of y is independent of θ, and (b) the derivative of the density function
with respect to θ is a continuous differentiable function of θ. These regularity condi-
tions ensure that one can differentiate under the integral sign, i.e., for all function
t(y) follows
Z ∞ Z ∞
∂ ∂ ∂
(13.4.3) fy (y; θ)t(y) dy = fy (y; θ)t(y) dy = Eθ [t(y)]
−∞ ∂θ ∂θ −∞ ∂θ
Z ∞ Z ∞
∂2 ∂2 ∂2
(13.4.4) 2
fy (y; θ)t(y) dy = fy (y; θ)t(y) dy = Eθ [t(y)].
−∞ (∂θ) (∂θ)2 −∞ (∂θ)2
y − ∂b(θ)
∂θ
(13.4.7) q(y; θ) =
a(ψ)
Answer. This is a simple substitution: if
yθ − b(θ)
(13.4.8) fy (y; θ, ψ) = exp + c(y, ψ) ,
a(ψ)
then
∂b(θ)
∂ log fy (y; θ, ψ) y − ∂θ
(13.4.9) =
∂θ a(ψ)
13.4. THE CRAMER-RAO LOWER BOUND 373
Answer. Follows from the fact that the score function of the exponential family (13.4.7) has
zero expected value.
Again, for θ = θ◦ , we can simplify the integrand and differentiate under the integral sign:
Z +∞ Z +∞
∂2 ∂2 ∂2
(13.4.19) 2
fy (y; θ) dy = fy (y; θ) dy = 1 = 0.
−∞
∂θ ∂θ2 −∞
∂θ2
• f. Derive from (13.4.14) that, for the exponential dispersion family (6.2.9),
∂ 2 b(θ)
(13.4.20) var◦ [y] = a(φ)
∂θ2
◦
θ=θ
376 13. ESTIMATION PRINCIPLES
∂ 2 b(θ) 1
Answer. Differentiation of (13.4.7) gives h(θ) = − ∂θ 2 a(φ)
. This is constant and therefore
equal to its own expected value. (13.4.14) says therefore
∂ 2 b(θ)
1 1
(13.4.21) = E◦ [q 2 (θ◦ )] = 2 var◦ [y]
∂θ2 θ=θ◦ a(φ)
a(φ)
Problem 201.
• a. Use the results from question 200 to derive the following strange and in-
teresting result: for any random variable t which is a function of y, i.e., t = t(y),
follows cov◦ [q(θ◦ ), t] = ∂θ
∂
Eθ [t]θ=θ◦ .
Z ∞
1 ∂fy (y; θ)
(13.4.22) E◦ [q(θ)t] = t(y)fy (y; θ◦ ) dy
−∞
fy (y; θ) ∂θ
13.4. THE CRAMER-RAO LOWER BOUND 377
This is at the same time the covariance: cov◦ [q(θ◦ ), t] = E◦ [q(θ◦ )t] − E◦ [q(θ◦ )] E◦ [t] = E◦ [q(θ◦ )t],
since E◦ [q(θ◦ )] = 0.
E◦ [q(θ◦ )] = 0, we know var◦ [q(θ◦ )] = E◦ [q 2 (θ◦ )], and since t is unbiased, we know
var◦ [t] = MSE◦ [t; θ◦ ]. Therefore the Cauchy-Schwartz inequality reads
(13.4.26) MSE◦ [t; θ◦ ] ≥ 1/ E◦ [q 2 (θ◦ )].
This is the Cramer-Rao inequality. The inverse of the variance of q(θ◦ ), 1/ var◦ [q(θ◦ )] =
1/ E◦ [q 2 (θ◦ )], is called the Fisher information, written I(θ◦ ). It is a lower bound for
the MSE of any unbiased estimator of θ. Because of (13.4.14), the Cramer Rao
378 13. ESTIMATION PRINCIPLES
(Sometimes the first and sometimes the second expression is easier to evaluate.)
If one has a whole vector of observations then the Cramer-Rao inequality involves
the joint density function:
1 −1
(13.4.29) var[t] ≥ 2 = ∂2
.
∂ E[ log fy (y; θ)]
E[ ∂θ log fy (y; θ) ] ∂θ 2
This inequality also holds if y is discrete and one uses its probability mass function
instead of the density function. In small samples, this lower bound is not always
attainable; in some cases there is no unbiased estimator with a variance as low as
the Cramer Rao lower bound.
13.4. THE CRAMER-RAO LOWER BOUND 379
In order to apply (13.4.29) you can either square this and take the expected value
2
∂ 1
X
(13.4.33) E[ `(y; µ) ]= E[(y i − µ)2 ] = n/σ 2
∂µ σ4
alternatively one may take one more derivative from (13.4.32) to get
∂2 n
(13.4.34) `(y; µ) = − 2
∂µ2 σ
380 13. ESTIMATION PRINCIPLES
This is constant, therefore equal to its expected value. Therefore the Cramer-Rao Lower Bound
says that var[ȳ] ≥ σ 2 /n. This holds with equality.
2
• a. 2 points Show that s2 is an unbiased estimator of σ 2 , is distributed ∼ σn χ2n ,
and has variance 2σ 4 /n. You are allowed to use the fact that a χ2n has variance 2n,
which is equation (5.9.5).
13.4. THE CRAMER-RAO LOWER BOUND 381
Answer.
• b. 4 points Show that this variance is at the same time the Cramer Rao lower
bound.
382 13. ESTIMATION PRINCIPLES
Answer.
1 1 y2
(13.4.42) `(y, σ 2 ) = log fy (y; σ 2 ) = − log 2π − log σ 2 −
2 2 2σ 2
∂ log fy 1 y 2 2
y −σ 2
(13.4.43) (y; σ 2 ) = − 2 + =
∂σ 2 2σ 2σ 4 2σ 4
y2 − σ2
Since has zero mean, it follows
2σ 4
2
∂ log fy var[y 2 ] 1
(13.4.44) E[ (y; σ 2 ) ]= = .
∂σ 2 4σ 8 2σ 4
∂ 2 log fy y2 1
(13.4.45) 2 2
(y; σ 2 ) = − 6 +
(∂σ ) σ 2σ 4
∂ 2 log fy σ2 1 1
(13.4.46) E[ 2 2
(y; σ 2 )] = − 6 + =
(∂σ ) σ 2σ 4 2σ 4
(13.4.47)
λx −λ
(13.4.48) pxi (x) = Pr[xi = x] = e x = 0, 1, 2, . . . .
x!
A Poisson variable with parameter λ has expected value λ and variance λ. (You
are not required to prove this here.) Is there an unbiased estimator of λ with lower
variance than the sample mean x̄?
Here is a formulation of the Cramer Rao Inequality for probability mass func-
tions, as you need it for Question 204. Assume y 1 , . . . , y n are n independent ob-
servations of a random variable y whose probability mass function depends on the
unknown parameter θ and satisfies certain regularity conditions. Write the univari-
ate probability mass function of each of the y i as py (y; θ) and let t be any unbiased
estimator of θ. Then
1 −1
(13.4.49) var[t] ≥ 2 = ∂2
.
∂ n E[ ∂θ2 ln py (y; θ)]
n E[ ∂θ ln py (y; θ) ]
384 13. ESTIMATION PRINCIPLES
∂ 2 log px x
(13.4.53) (x; λ) = − 2
∂λ2 λ
2
∂ log px E[x] 1
(13.4.54) − E[ 2
(x; λ) ] = 2 = .
∂λ λ λ
λ
Therefore the Cramer Rao lower bound is n
, which is the variance of the sample mean.
If the density function depends on more than one unknown parameter, i.e., if
it has the form fy (y; θ1 , . . . , θk ), the Cramer Rao Inequality involves the following
steps: (1) define `(y; θ1 , · · · , θk ) = log fy (y; θ1 , . . . , θk ), (2) form the following matrix
13.4. THE CRAMER-RAO LOWER BOUND 385
t1
−1 ..
and (3) form the matrix inverse I . If the vector random variable t = .
tn
θ1
is an unbiased estimator of the parameter vector θ = ... , then the inverse of
θn
the information matrix I −1 is a lower bound for the covariance matrix V [t] in the
following sense: the difference matrix V [t] − I −1 is always nonnegative definite.
From this follows in particular: if iii is the ith diagonal element of I −1 , then
var[ti ] ≥ iii .
386 13. ESTIMATION PRINCIPLES
Problem 205. 5 points [Lar82, example 5.4.1 on p 266] Let y 1 and y 2 be two
random variables with same mean µ and variance σ 2 , but we do not assume that they
are uncorrelated; their correlation coefficient is ρ, which can take any value |ρ| ≤ 1.
Show that ȳ = (y 1 + y 2 )/2 has lowest mean squared error among all linear unbiased
estimators of µ, and compute its MSE. (An estimator µ̃ of µ is linear iff it can be
written in the form µ̃ = α1 y 1 + α2 y 2 with some constant numbers α1 and α2 .)
13.5. BEST LINEAR UNBIASED WITHOUT DISTRIBUTION ASSUMPTIONS 387
Answer.
(13.5.1) ỹ = α1 y 1 + α2 y 2
(13.5.2) var ỹ = α21 var[y 1 ] + α22 var[y 2 ] + 2α1 α2 cov[y 1 , y 2 ]
(13.5.3) = σ 2 (α21 + α22 + 2α1 α2 ρ).
Problem 206. You have two unbiased measurements with errors of the same
quantity µ (which may or may not be random). The first measurement y 1 has mean
squared error E[(y 1 − µ)2 ] = σ 2 , the other measurement y 2 has E[(y 1 − µ)2 ] =
τ 2 . The measurement errors y 1 − µ and y 2 − µ have zero expected values (i.e., the
measurements are unbiased) and are independent of each other.
388 13. ESTIMATION PRINCIPLES
• a. 2 points Show that the linear unbiased estimators of µ based on these two
measurements are simply the weighted averages of these measurements, i.e., they can
be written in the form µ̃ = αy 1 + (1 − α)y 2 , and that the MSE of such an estimator
is α2 σ 2 + (1 − α)2 τ 2 . Note: we are using the word “estimator” here even if µ is
random. An estimator or predictor µ̃ is unbiased if E[µ̃ − µ] = 0. Since we allow µ
to be random, the proof in the class notes has to be modified.
Answer. The estimator µ̃ is linear (more precisely: affine) if it can written in the form
(13.5.7) µ̃ = α1 y 1 + α2 y 2 + γ
for all possible values of E[µ]; therefore γ = 0 and α2 = 1 − α1 . To simplify notation, we will call
from now on α1 = α, α2 = 1 − α. Due to unbiasedness, the MSE is the variance of the estimation
error
• b. 4 points Define ω 2 by
1 1 1 σ2 τ 2
(13.5.10) = 2+ 2 which can be solved to give ω2 = .
ω2 σ τ σ2+ τ2
13.5. BEST LINEAR UNBIASED WITHOUT DISTRIBUTION ASSUMPTIONS 389
Show that the Best (i.e., minimum MSE) linear unbiased estimator (BLUE) of µ
based on these two measurements is
ω2 ω2
(13.5.11) ŷ = 2 y 1 + 2 y 2
σ τ
i.e., it is the weighted average of y 1 and y 2 where the weights are proportional to the
inverses of the variances.
Answer. The variance (13.5.9) takes its minimum value where its derivative with respect of
α is zero, i.e., where
∂
(13.5.12) α2 σ 2 + (1 − α)2 τ 2 = 2ασ 2 − 2(1 − α)τ 2 = 0
∂α
(13.5.13) ασ 2 = τ 2 − ατ 2
τ2
(13.5.14) α=
σ2 + τ2
In terms of ω one can write
τ2 ω2 σ2 ω2
(13.5.15) α= 2 2
= 2 and 1−α= = 2.
σ +τ σ σ2 +τ 2 τ
• c. 2 points Show: the MSE of the BLUE ω 2 satisfies the following equation:
1 1 1
(13.5.16) = 2+ 2
ω2 σ τ
390 13. ESTIMATION PRINCIPLES
Answer. We already have introduced the notation ω 2 for the quantity defined by (13.5.16);
therefore all we have to show is that the MSE or, equivalently, the variance of the estimation error
is equal to this ω 2 :
ω 2 2 2 ω 2 2 2 1 1 1
(13.5.17) var[µ̃ − µ] = σ + τ = ω4 2 + 2 = ω4 2 = ω2
σ2 τ2 σ τ ω
Examples of other classes of estimators for which a best estimator exists are: if
one requires the estimator to be translation invariant, then the least squares estima-
tors are best in the class of all translation invariant estimators. But there is no best
linear estimator in the linear model. (Theil)
Answer. Its high information requirements (the functional form of the density function must
be known), and computational complexity.
function as
1 (y−µ0 )2
(13.6.1) fy (y; µ0 ) = √ e− 2 .
2π
It is a function of y, the possible values assumed by y, and the letter µ0 symbolizes
a constant, the true parameter value. The same function considered as a function of
the variable µ, representing all possible values assumable by the true mean, with y
being fixed at the actually observed value, becomes the likelihood function.
In the same way one can also turn probability mass functions px (x) into likelihood
functions.
Now let us compute some examples of the MLE. You make n independent
observations y 1 , . . . , y n from a N (µ, σ 2 ) distribution. Write the likelihood function
as
n 1 n P
Y 1 2
(13.6.2) L(µ, σ 2 ; y 1 , . . . , y n ) = fy (y i ) = √ e− 2σ2 (yi −µ) .
2πσ 2
i=1
n n 1 X
(13.6.3) ` = ln L(µ, σ 2 ; y 1 , . . . , y n ) = − ln 2π − ln σ 2 − 2 (y i − µ)2 .
2 2 2σ
13.6. MAXIMUM LIKELIHOOD ESTIMATION 393
n
Set this zero, and write λ̂ instead of λ to get λ̂ = t1 +···+tn
= 1/t̄.
Usually the MLE is asymptotically unbiased and asymptotically normal. There-
fore it is important to have an estimate of its asymptotic variance. Here we can use
the fact that asymptotically the Cramer Rao Lower Bound is not merely a lower
bound for this variance but is equal to its variance. (From this follows that the max-
imum likelihood estimator is asymptotically efficient.) The Cramer Rao lower bound
itself depends on unknown parameters. In order to get a consistent estimate of the
Cramer Rao lower bound, do the following: (1) Replace the unknown parameters
in the second derivative of the log likelihood function by their maximum likelihood
estimates. (2) Instead of taking expected values over the observed values xi you may
simply insert the sample values of the xi into these maximum likelihood estimates,
and (3) then invert this estimate of the information matrix.
MLE obeys an important functional invariance principle: if θ̂ is the MLE of θ,
then g(θ̂) is the MLE of g(θ). E.g., µ = λ1 is the expected value of the exponential
variable, and its MLE is x̄.
• a. 2 points Show that the MLE of µx , based on the combined sample, is x̄. (By
symmetry it follows that the MLE of µy is ȳ.)
Answer.
m
m m 1 X
(13.6.10) `(µx , µy , σ 2 ) = − ln 2π − ln σ 2 − (xi − µx )2
2 2 2σ 2
i=1
n
n n 1 X
− ln 2π − ln σ 2 − (y j − µy )2
2 2 2σ 2
j=1
∂` 1 X
(13.6.11) =− 2 −2(xi − µx ) =0 for µx = x̄
∂µx 2σ
2
• b. 2 points Derive the MLE of σ , based on the combined samples.
Answer.
m n
∂` m+n 1
X X
(13.6.12) =− + (xi − µx )2 + (y j − µy )2
∂σ 2 2σ 2 2σ 4
i=1 j=1
m n
1
X X
(13.6.13) σ̂ 2 = (xi − x̄)2 + (y i − ȳ)2 .
m+n
i=1 j=1
396 13. ESTIMATION PRINCIPLES
13.8. M-Estimators
The class of M -estimators maximizes something other than a likelihood func-
tion: it includes nonlinear least squares, generalized method of moments, minimum
distance and minimum chi-squared estimators. The purpose is to get a “robust”
estimator which is good for a wide variety of likelihood functions. Many of these are
asymptotically efficient; but their small-sample properties may vary greatly.
13.9. SUFFICIENT STATISTICS AND ESTIMATION 397
Answer. You need sufficiency for the first part of the problem, the law of iterated expectations
for the second, and completeness for the third.
Set E = {p ≤ p} in the definition of sufficiency given at the beginning of the Problem to see
that the cdf of p conditionally on s being in any interval does not involve θ, therefore also E[p|s]
does not involve θ.
Unbiasedness follows from the theorem of iterated expectations E E[p|s] = E[p] = θ.
The independence on the choice of p can be shown as follows: Since the conditional expectation
conditionally on s is a function of s, we can use the notation E[p|s] = g1 (s) and E[q|s] = g2 (s).
From E[p] = E[q] follows by the law of iterated expectations E[g1 (s) − g2 (s)] = 0, therefore by
completeness g1 (s) − g2 (s) ≡ 0.
Answer.
(13.9.1) π = Pr[y i ≥ 0] = Pr[y i − µ ≥ −µ] = Pr[y i − µ ≤ µ] = Φ(µ)
because y i − µ ∼ N (0, 1). We needed symmetry of the distribution to flip the sign.
yj µ 1 1/n
(13.9.2) ∼N ,
ȳ µ 1/n 1/n
• f. 2 points Second step: From this joint distribution derive the conditional
distribution of y j conditionally on ȳ = ȳ. (Not just the conditional mean but the whole
conditional distribution.) For this you will need formula (10.3.18) and (10.3.20).
Answer. Here are these two formulas: if u and v are jointly normal, then the conditional
distribution of v conditionally on u = u is Normal with mean
cov[u, v]
(13.9.3) E[v|u = u] = E[v] + (u − E[u])
var[u]
and variance
(cov[u, v])2
(13.9.4) var[v|u = u] = var[v] − .
var[u]
402 13. ESTIMATION PRINCIPLES
Plugging u = ȳ and v = y j into (10.3.18) and (10.3.20) gives: the conditional distribution of
y j conditionally on ȳ = ȳ has mean
cov[ȳ, y j ]
(13.9.5) E[y j |ȳ = ȳ] = E[y j ] + (ȳ − E[ȳ])
var[ȳ]
1/n
(13.9.6) =µ+ (ȳ − µ) = ȳ
1/n
and variance
(cov[ȳ, y j ])2
(13.9.7) var[y j |ȳ = ȳ] = var[y j ] −
var[ȳ]
(1/n)2 1
(13.9.8) =1− =1− .
1/n n
Therefore the conditional distribution of y j conditional on ȳ is N (ȳ, (n − 1)/n). How can this
be motivated? if we know the actual arithmetic mean of the variables, then our best estimate is
that each variable is equal to this arithmetic mean. And this additional knowledge cuts down the
variance by 1/n.
Answer.
(13.9.9) var[y j ] = var E[y j |ȳ] + E var[y j |ȳ]
n−1 1 n−1
h i
(13.9.10) = var[ȳ] + E = +
n n n
• j. 1 point Finally, put all the pieces together and write down E[q(y j )|ȳ], the
conditional expectation of q(y j ) conditionally on ȳ, which by the Lehmann-Scheffé
404 13. ESTIMATION PRINCIPLES
theorem is the minimum MSE unbiased estimator of π. The formula you should
come up with is
p
(13.9.11) π̂ = Φ(ȳ n/(n − 1)),
Remark: this particular example did not give any brand new estimators, but it can
rather be considered a proof that certain obvious estimators are unbiased and efficient.
But often this same procedure gives new estimators which one would not have been
able to guess. Already when the variance is unknown, the above example becomes
quite a bit more complicated, see [Rao73, p. 322, example 2]. When the variables
have an exponential distribution then this example (probability of early failure) is
discussed in [BD77, example 4.2.4 on pp. 124/5].
13.10. THE LIKELIHOOD PRINCIPLE 405
Problem 210. 3 points You have a Bernoulli experiment with unknown pa-
rameter θ, 0 ≤ θ ≤ 1. Person A was originally planning to perform this experiment
12 times, which she does. She obtains 9 successes and 3 failures. Person B was
originally planning to perform the experiment until he has reached 9 successes, and
it took him 12 trials to do this. Should both experimenters draw identical conclusions
from these two experiments or not?
9
Answer. The probability mass function in the first is by (3.7.1) 12
9
θ (1 − θ)3 , and in the
11 9 3
second it is by (5.1.13) 8
θ (1 − θ) . They are proportional, the stopping rule therefore does not
matter!
parameter values and the data. Like all joint density function, it can be written
as the product of a marginal and conditional density. The marginal density of the
parameter value represents the beliefs the experimenter holds about the parameters
before the experiment (prior density), and the likelihood function of the experiment
is the conditional density of the data given the parameters. After the experiment has
been conducted, the experimenter’s belief about the parameter values is represented
by their conditional density given the data, called the posterior density.
Let y denote the observations, θ the unknown parameters, and f (y, θ) their
joint density. Then
(13.11.1) f (y, θ) = f (θ)f (y|θ)
(13.11.2) = f (y)f (θ|y).
Therefore
f (θ)f (y|θ)
(13.11.3) f (θ|y) = .
f (y)
In this formula, the value of f (y) is irrelevant. It only depends on y but not on
θ, but y is fixed, i.e., it is a constant. If one knows the posterior density function
of θ up to a constant, one knows it altogether, since the constant is determined by
the requirement that the area under the density function is 1. Therefore (13.11.3) is
13.11. BAYESIAN INFERENCE 409
here the lefthand side contains the posterior density function of the parameter, the
righthand side the prior density function and the likelihood function representing the
probability distribution of the experimental data.
The Bayesian procedure does not yield a point estimate or an interval estimate,
but a whole probability distribution for the unknown parameters (which represents
our information about these parameters) containing the “prior” information “up-
dated” by the information yielded by the sample outcome.
Of course, such probability distributions can be summarized by various measures
of location (mean, median), which can then be considered Bayesian point estimates.
Such summary measures for a whole probability distribution are rather arbitrary.
But if a loss function is given, then this process of distilling point estimates from
the posterior distribution can once more be systematized. For a concrete decision it
tells us that parameter value which minimizes the expected loss function under the
posterior density function, the so-called “Bayes risk.” This can be considered the
Bayesian analog of a point estimate.
For instance, if the loss function is quadratic, then the posterior mean is the
parameter value which minimizes expected loss.
410 13. ESTIMATION PRINCIPLES
There is a difference between Bayes risk and the notion of risk we applied previ-
ously. The frequentist minimizes expected loss in a large number of repetitions of the
trial. This risk is dependent on the unknown parameters, and therefore usually no
estimators exist which give minimum risk in all situations. The Bayesian conditions
on the data (final criterion!) and minimizes the expected loss where the expectation
is taken over the posterior density of the parameter vector.
The irreducibility of absence to presences: the absence of knowledge (or also
the absence of regularity itself) cannot be represented by a probability distribution.
Proof: if I give a certain random variable a neutral prior, then functions of this
random variable have non-neutral priors. This argument is made in [Roy97, p. 174].
Many good Bayesians drift away from the subjective point of view and talk about
a stratified world: their center of attention is no longer the world out there versus
our knowledge of it, but the empirical world versus the underlying systematic forces
that shape it.
Bayesians say that frequentists use subjective elements too; their outcomes de-
pend on what the experimenter planned to do, even if he never did it. This again
comes from [Roy97, p. ??]. Nature does not know about the experimenter’s plans,
and any evidence should be evaluated in a way independent of this.
CHAPTER 14
Interval Estimation
of 50% that the actual mean lies above 3.5, and the same with below. The sample
mean can therefore be considered a one-sided confidence bound, although one usually
wants higher confidence levels than 50%. (I am 95% confident that φ is greater or
equal than a certain value computed from the sample.) The concept of “confidence”
is nothing but the usual concept of probability if one uses an initial criterion of
precision.
The following thought experiment illustrates what is involved. Assume you
bought a widget and want to know whether it is defective or not. The obvious
way (which would correspond to a “final” criterion of precision) would be to open
it up and look if it is defective or not. Now assume we cannot do it: there is no
way telling by just looking at it whether it will work. Then another strategy would
be to go by an “initial” criterion of precision: we visit the widget factory and look
how they make them, how much quality control there is and such. And if we find
out that 95% of all widgets coming out of the same factory have no defects, then we
have the “confidence” of 95% that our particular widget is not defective either.
The matter becomes only slightly more mystified if one talks about intervals.
Again, one should not forget that confidence intervals are random intervals. Besides
confidence intervals and one-sided confidence bounds one can, if one regards several
parameters simultaneously, also construct confidence rectangles, ellipsoids and more
complicated shapes. Therefore we will define in all generality:
14. INTERVAL ESTIMATION 413
The important thing to remember about this definition is that these regions R(y)
are random regions; every time one performs the experiment one obtains a different
region.
Now let us go to the specific case of constructing an interval estimate for the
parameter µ when we have n independent observations from a normally distributed
population ∼ N (µ, σ 2 ) in which neither µ nor σ 2 are known. The vector of ob-
servations is therefore distributed as y ∼ N (ιµ, σ 2 I), where ιµ is the vector every
component of which is µ.
I will give you now what I consider to be the cleanest argument deriving the
so-called t-interval. It generalizes directly to the F -test in linear regression. It is not
the same derivation which you will usually find, and I will bring the usual derivation
below for comparison. Recall the observation made earlier, based on (12.1.1), that the
sample mean ȳ is that number ȳ = a which minimizes the sum of squared deviations
414 14. INTERVAL ESTIMATION
(yi − a)2 . (In other words, ȳ is the “least squares estimate” in this situation.) This
P
least squares principle also naturally leads to interval estimates for µ: we will say
that a lies in the interval for µ if and only if
(y − a)2
P
(14.0.6) P i ≤c
(yi − ȳ)2
for some number c ≥ 1. Of course, the value of c depends on the confidence level,
but the beauty of this criterion here is that the value of c can be determined by the
confidence level alone without knowledge of the true values of µ or σ 2 .
To show this, note first that (14.0.6) is equivalent to
and then apply the identity (yi − a)2 = (yi − ȳ)2 + n(ȳ − a)2 to the numerator
P P
to get the following equivalent formulation of (14.0.6):
n(ȳ − a)2
(14.0.8) P ≤c−1
(yi − ȳ)2
14. INTERVAL ESTIMATION 415
The confidence level of this interval is the probability that the true µ lies in an
interval randomly generated using this principle. In other words, it is
h n(ȳ − µ)2 i
(14.0.9) Pr P ≤ c − 1
(y i − ȳ)2
Although for every known a, the probability that a lies in the confidence interval
depends on the unknown µ and σ 2 , we will show now that the probability that the
unknown µ lies in the confidence interval does not depend on any unknown parame-
ters. First look at the distribution of the numerator: Since ȳ ∼ N (µ, σ 2 /n), it follows
(ȳ − µ)2 ∼ (σ 2 /n)χ21 . We alsoPknow the distribution of the denominator. Earlier we
have shown that the variable (y i − ȳ)2 is a σ 2 χ2n−1 . It is not enough to know the
distribution of numerator and denominator separately, we also need their joint distri-
bution. For this go back to our earlier discussion of variance estimation again; there
>
we also showed that ȳ is independent of the vector y 1 − ȳ · · · y n − ȳ ; there-
fore any function of ȳ is also independent of any function of this vector, from which
follows that numerator and denominator in our fraction are independent. Therefore
this fraction is distributed as an σ 2 χ21 over an independent σ 2 χ2n−1 , and since the
σ 2 ’s cancel out, this is the same as a χ21 over an independent χ2n−1 . In other words,
this distribution does not depend on any unknown parameters!
416 14. INTERVAL ESTIMATION
(ȳ − µ)2
(14.0.10) 1 1
P ∼ F 1,n−1
n n−1 (y i − ȳ)2
If one does not take the square in the numerator, i.e., works with ȳ − µ instead of
(ȳ − µ)2 , and takes square root in the denominator, one obtains a t-distribution:
ȳ − µ
(14.0.11) q q ∼ tn−1
1 1
P
n n−1 (y i − ȳ)2
The left hand side of this last formula has a suggestive form. It can be written as
(ȳ − µ)/sȳ , where sȳ is an estimate of the standard deviation of ȳ (it is the square
root of the unbiased estimate of the variance of ȳ). In other words, this t-statistic
can be considered an estimate of the number of standard deviations the observed
value of ȳ is away from µ.
Now we will give, as promised, the usual derivation of the t-confidence intervals,
which is based on this interpretation. This usual derivation involves the following
two steps:
14. INTERVAL ESTIMATION 417
(1) First assume that σ 2 is known. Then it is obvious what to do; for every
observation y of y construct the following interval:
(14.0.12) R(y) = {u ∈ R : |u − ȳ| ≤ N(α/2) σȳ }.
What do these symbols mean? The interval R (as in region) has y as an argument,
i.e.. it is denoted R(y), because it depends on the observed value y. R is the set of real
numbers. N(α/2) is the upper α/2-quantile of the Normal distribution, i.e., it is that
number c for which a standard Normal random variable z satisfies Pr[z ≥ c] = α/2.
Since by the symmetry of the Normal distribution, Pr[z ≤ −c] = α/2 as well, one
obtains for a two-sided test:
(14.0.13) Pr[|z| ≥ N(α/2) ] = α.
From this follows the coverage probability:
(14.0.14) Pr[R(y) 3 µ] = Pr[|µ − ȳ| ≤ N(α/2) σȳ ]
(14.0.15) = Pr[|(µ − ȳ)/σȳ | ≤ N(α/2) ] = Pr[|−z| ≤ N(α/2) ] = 1 − α
since z = (ȳ − µ)/σȳ is a standard Normal. I.e., R(y) is a confidence interval for µ
with confidence level 1 − α.
(2) Second part: what if σ 2 is not known? Here a seemingly ad-hoc way out
would be to replace σ 2 by its unbiased estimate s2 . Of course, then the Normal
distribution no longer applies. However if one replaces the normal critical values by
418 14. INTERVAL ESTIMATION
those of the tn−1 distribution, one still gets, by miraculous coincidence, a confidence
level which is independent of any unknown parameters.
Problem 211. If y i ∼ NID(µ, σ 2 ) (normally independently distributed) with µ
and σ 2 unknown, then the confidence interval for µ has the form
(14.0.16) R(y) = {u ∈ R : |u − ȳ| ≤ t(n−1;α/2) sȳ }.
Here t(n−q;α/2) is the upper α/2-quantile of the t distribution with n − 1 degrees
of freedom, i.e., it is that number c for which a random variable t which has a t
distribution with n − 1 degrees of freedom satisfies Pr[t ≥ c] = α/2. And sȳ is
obtained as follows: write down the standard deviation of ȳ and replace σ byps. One
can also say sȳ = σȳ σs where σȳ is an abbreviated notation for std. dev[y] = var[y].
• a. 1 point Write down the formula for sȳ .
σ2 √
Answer. Start with σȳ2 = var[ȳ] = n
, therefore σȳ = σ/ n, and
r
√ X (y − ȳ)2
i
(14.0.17) sȳ = s/ n =
n(n − 1)
because the expression in the numerator is a standard normal, and the expression in the denominator
is the square root of an independent χ2n−1 divided by n − 1. The random variable between the
absolute signs has therefore a t-distribution, and (14.0.22) follows from (41.4.8).
p=
n .750 .900 .950 .975 .990 .995
1 1.000 3.078 6.314 12.706 31.821 63.657
2 0.817 1.886 2.920 4.303 6.965 9.925
3 0.765 1.638 2.354 3.182 4.541 5.841
4 0.741 1.533 2.132 2.776 3.747 4.604
5 0.727 1.476 2.015 2.571 3.365 4.032
therefore
or
The ecmet package has a function confint.segments which draws such plots
automatically. Choose how many observations in each experiment (the argument
n), and how many confidence intervals (the argument rep), and the confidence level
level (the default is here 95%), and then issue, e.g. the command confint.segments(n
Here is the transcript of the function:
confint.segments <- function(n, rep, level = 95/100)
{
stdnormals <- matrix(rnorm(n * rep), nrow = n, ncol = rep)
midpts <- apply(stdnormals, 2, mean)
halfwidth <- qt(p=(1 + level)/2, df= n - 1) * sqrt(1/n)* sqrt(apply(st
frame()
x <- c(1:rep, 1:rep)
y <- c(midpts + halfwidth, midpts - halfwidth)
par(usr = c(1, rep, range(y)))
segments(1:rep, midpts - halfwidth, 1:rep, midpts + halfwidth)
abline(0, 0)
invisible(cbind(x,y))
}
This function draws the plot as a “side effect,” but it also returns a matrix with
the coordinates of the endpoints of the plots (without printing them on the screen).
424 14. INTERVAL ESTIMATION
This matrix can be used as input for the identify function. If you do for instance
iddata<-confint.segments(12,20) and then identify(iddata,labels=iddata[,2]
then the following happens: if you move the mouse cursor on the graph near one of
the endpoints of one of the intervals, and click the left button, then it will print on
the graph the coordinate of the bounday of this interval. Clicking any other button of
the mouse gets you out of the identify function.
CHAPTER 15
Hypothesis Testing
425
426 15. HYPOTHESIS TESTING
it otherwise. This is indeed an optimal decision rule, and we will discuss in what
respect it is, and how c should be picked.
Your decision can be the wrong decision in two different ways: either you decide
to go ahead with the investment although there will be no demand for the product,
or you fail to invest although there would have been demand. There is no decision
rule which eliminates both errors at once; the first error would be minimized by the
rule never to produce, and the second by the rule always to produce. In order to
determine the right tradeoff between these errors, it is important to be aware of their
asymmetry. The error to go ahead with production although there is no demand has
potentially disastrous consequences (loss of a lot of money), while the other error
may cause you to miss a profit opportunity, but there is no actual loss involved, and
presumably you can find other opportunities to invest your money.
To express this asymmetry, the error with the potentially disastrous consequences
is called “error of type one,” and the other “error of type two.” The distinction
between type one and type two errors can also be made in other cases. Locking up
an innocent person is an error of type one, while letting a criminal go unpunished
is an error of type two; publishing a paper with false results is an error of type one,
while foregoing an opportunity to publish is an error of type two (at least this is
what it ought to be).
15. HYPOTHESIS TESTING 427
Such an asymmetric situation calls for an asymmetric decision rule. One needs
strict safeguards against committing an error of type one, and if there are several
decision rules which are equally safe with respect to errors of type one, then one will
select among those that decision rule which minimizes the error of type two.
Let us look here at decision rules of the form: make the investment if ȳ > c.
An error of type one occurs if the decision rule advises you to make the investment
while there is no demand for the product. This will be the case if ȳ > c but µ ≤ 0.
The probability of this error depends on the unknown parameter µ, but it is at most
α = Pr[ȳ > c | µ = 0]. This maximum value of the type one error probability is called
the significance level, and you, as the director of the firm, will have to decide on α
depending on how tolerable it is to lose money on this venture, which presumably
depends on the chances to lose money on alternative investments. It is a serious
shortcoming of the classical theory of hypothesis testing that it does not provide
good guidelines how α should be chosen, and how it should change with sample size.
Instead, there is the tradition to choose α to be either 5% or 1% or 0.1%. Given α,
a table of the cumulative standard normal distribution function allows you to find
that c for which Pr[ȳ > c | µ = 0] = α.
Problem 213. 2 points Assume each y i ∼ N (µ, 1), n = 400 and α = 0.05, and
different y i are independent. Compute the value c which satisfies Pr[ȳ > c | µ = 0] =
α. You shoule either look it up in a table and include a xerox copy of the table with
428 15. HYPOTHESIS TESTING
the entry circled and the complete bibliographic reference written on the xerox copy,
or do it on a computer, writing exactly which commands you used. In R, the function
qnorm does what you need, find out about it by typing help(qnorm).
Answer. In the case n = 400, ȳ has variance 1/400 and therefore standard deviation 1/20 =
0.05. Therefore 20ȳ is a standard normal: from Pr[ȳ > c | µ = 0] = 0.05 follows Pr[20ȳ > 20c | µ =
0] = 0.05. Therefore 20c = 1.645 can be looked up in a table, perhaps use [JHG+ 88, p. 986], the
row for ∞ d.f.
Let us do this in R. The p-“quantile” of the distribution of the random variable y is defined
as that value q for which Pr[y ≤ q] = p. If y is normally distributed, this quantile is computed
by the R-function qnorm(p, mean=0, sd=1, lower.tail=TRUE). In the present case we need either
qnorm(p=1-0.05, mean=0, sd=0.05) or qnorm(p=0.05, mean=0, sd=0.05, lower.tail=FALSE) which
gives the value 0.08224268.
Choosing a decision which makes a loss unlikely is not enough; your decision
must also give you a chance of success. E.g., the decision rule to build the plant if
−0.06 ≤ ȳ ≤ −0.05 and not to build it otherwise is completely perverse, although
the significance level of this decision rule is approximately 4% (if n = 100). In other
words, the significance level is not enough information for evaluating the performance
of the test. You also need the “power function,” which gives you the probability
with which the test advises you to make the “critical” decision, as a function of
the true parameter values. (Here the “critical” decision is that decision which might
15. HYPOTHESIS TESTING 429
.........................
.......................................................
...................
...............
..
..
..
..
..
..............
...........
...........
.............
...............
....................
..............................................................................
-3 -2 -1 0 1 2 3
potentially lead to an error of type one.) By the definition of the significance level, the
power function does not exceed the significance level for those parameter values for
which going ahead would lead to a type 1 error. But only those tests are “powerful”
whose power function is high for those parameter values for which it would be correct
to go ahead. In our case, the power function must be below 0.05 when µ ≤ 0, and
we want it as high as possible when µ > 0. Figure 1 shows the power function for
the decision rule to go ahead whenever ȳ ≥ c, where c is chosen in such a way that
the significance level is 5%, for n = 100.
The hypothesis whose rejection, although it is true, constitutes an error of type
one, is called the null hypothesis, and its alternative the alternative hypothesis. (In the
examples the null hypotheses were: the return on the investment is zero or negative,
430 15. HYPOTHESIS TESTING
the defendant is innocent, or the results about which one wants to publish a research
paper are wrong.) The null hypothesis is therefore the hypothesis that nothing is
the case. The test tests whether this hypothesis should be rejected, will safeguard
against the hypothesis one wants to reject but one is afraid to reject erroneously. If
you reject the null hypothesis, you don’t want to regret it.
Mathematically, every test can be identified with its null hypothesis, which is
a region in parameter space (often consisting of one point only), and its “critical
region,” which is the event that the test comes out in favor of the “critical decision,”
i.e., rejects the null hypothesis. The critical region is usually an event of the form
that the value of a certain random variable, the “test statistic,” is within a given
range, usually that it is too high. The power function of the test is the probability
of the critical region as a function of the unknown parameters, and the significance
level is the maximum (or, if this maximum depends on unknown parameters, any
upper bound) of the power function over the null hypothesis.
Problem 214. Mr. Jones is on trial for counterfeiting Picasso paintings, and
you are an expert witness who has developed fool-proof statistical significance tests
for identifying the painter of a given painting.
• a. 2 points There are two ways you can set up your test.
15. HYPOTHESIS TESTING 431
a: You can either say: The null hypothesis is that the painting was done by
Picasso, and the alternative hypothesis that it was done by Mr. Jones.
b: Alternatively, you might say: The null hypothesis is that the painting was
done by Mr. Jones, and the alternative hypothesis that it was done by Pi-
casso.
Does it matter which way you do the test, and if so, which way is the correct one.
Give a reason to your answer, i.e., say what would be the consequences of testing in
the incorrect way.
Answer. The determination of what the null and what the alternative hypothesis is depends
on what is considered to be the catastrophic error which is to be guarded against. On a trial, Mr.
Jones is considered innocent until proven guilty. Mr. Jones should not be convicted unless he can be
proven guilty beyond “reasonable doubt.” Therefore the test must be set up in such a way that the
hypothesis that the painting is by Picasso will only be rejected if the chance that it is actually by
Picasso is very small. The error of type one is that the painting is considered counterfeited although
it is really by Picasso. Since the error of type one is always the error to reject the null hypothesis
although it is true, solution a. is the correct one. You are not proving, you are testing.
• b. 2 points After the trial a customer calls you who is in the process of acquiring
a very expensive alleged Picasso painting, and who wants to be sure that this painting
is not one of Jones’s falsifications. Would you now set up your test in the same way
as in the trial or in the opposite way?
432 15. HYPOTHESIS TESTING
region:
Also
since U ⊂ C and C were chosen such that the likelihood (density) function of the
alternative hypothesis is high relatively to that of the null hypothesis. Since W lies
outside C, the same argument gives
hence Pr[D|θ1 ] ≤ Pr[C|θ1 ]. In other words, if θ1 is the correct parameter value, then
C will discover this and reject at least as often as D. Therefore C is at least as
powerful as D, or the type two error probability of C is at least as small as that of
D.
Back to our fertilizer example. To make both null and alternative hypotheses
simple, assume that either µ = 0 (fertilizer is ineffective) or µ = t for some fixed
15.2. THE NEYMAN PEARSON LEMMA AND LIKELIHOOD RATIO TESTS 437
t > 0. Then the likelihood ratio critical region has the form
(15.2.6)
1 n 1 2 2
1 n 1 2 2
C = {y1 , . . . , yn : √ e− 2 ((y1 −t) +···+(yn −t) ) ≥ k √ e− 2 (y1 +···+yn ) }
2π 2π
(15.2.7)
1 1
= {y1 , . . . , yn : − ((y1 − t)2 + · · · + (yn − t)2 ) ≥ ln k − (y12 + · · · + yn2 )}
2 2
(15.2.8)
t2 n
= {y1 , . . . , yn : t(y1 + · · · + yn ) − ≥ ln k}
438 15. HYPOTHESIS TESTING
i.e., C has the form ȳ ≥ some constant. The dependence of this constant on k is not
relevant, since this constant is usually chosen such that the maximum probability of
error of type one is equal to the given significance level.
Problem 217. 8 points You have four independent observations y1 , . . . , y4 from
an N (µ, 1), and you are testing the null hypothesis µ = 0 against the alternative
hypothesis µ = 1. For your test you are using the likelihood ratio test with critical
region
(15.2.10) C = {y1 , . . . , y4 : L(y1 , . . . , y4 ; µ = 1) ≥ 3.633 · L(y1 , . . . , y4 ; µ = 0)}.
Compute the significance level of this test. (According to the Neyman-Pearson
lemma, this is the uniformly most powerful test for this significance level.) Hints:
In order to show this you need to know that ln 3.633 = 1.29, everything else can be
done without a calculator. Along the way you may want to show that C can also be
written in the form C = {y1 , . . . , y4 : y1 + · · · + y4 ≥ 3.290}.
Answer. Here is the equation which determines when y1 , . . . , y4 lie in C:
1 1 2
(15.2.11) (2π)−2 exp − (y1 − 1)2 + · · · + (y4 − 1)2 ≥ 3.633 · (2π)−2 exp − y1 + · · · + y42
2 2
1 1 2
2 2
(15.2.12) − (y1 − 1) + · · · + (y4 − 1) ≥ ln(3.633) − y1 + · · · + y42
2 2
(15.2.13) y1 + · · · + y4 − 2 ≥ 1.290
15.2. THE NEYMAN PEARSON LEMMA AND LIKELIHOOD RATIO TESTS 439
Since Pr[y 1 + · · · + y 4 ≥ 3.290] = Pr[z = (y 1 + · · · + y 4 )/2 ≥ 1.645] and z is a standard normal, one
obtains the significance level of 5% from the standard normal table or the t-table.
Note that due to the properties of the Normal distribution, this critical region,
for a given significance level, does not depend at all on the value of t. Therefore this
test is uniformly most powerful against the composite hypothesis µ > 0.
One can als write the null hypothesis as the composite hypothesis µ ≤ 0, because
the highest probability of type one error will still be attained when µ = 0. This
completes the proof that the test given in the original fertilizer example is uniformly
most powerful.
Most other distributions discussed here are equally well behaved, therefore uni-
formly most powerful one-sided tests exist not only for the mean of a normal with
known variance, but also the variance of a normal with known mean, or the param-
eters of a Bernoulli and Poisson distribution.
However the given one-sided hypothesis is the only situation in which a uniformly
most powerful test exists. In other situations, the generalized likelihood ratio test has
good properties even though it is no longer uniformly most powerful. Many known
tests (e.g., the F test) are generalized likelihood ratio tests.
Assume you want to test the composite null hypothesis H0 : θ ∈ ω, where ω is
a subset of the parameter space, against the alternative HA : θ ∈ Ω, where Ω ⊃ ω
is a more comprehensive subset of the parameter space. ω and Ω are defined by
440 15. HYPOTHESIS TESTING
Each of your three research assistants has to repeat a certain experiment 9 times,
and record whether each experiment was a success (1) or a failure (0). In all cases, the
experiments happen to have been successful 4 times. Assistant A has the following
sequence of successes and failures: 0, 1, 0, 0, 1, 0, 1, 1, 0, B has 0, 1, 0, 1, 0, 1, 0, 1, 0, and
C has 1, 1, 1, 1, 0, 0, 0, 0, 0.
On the basis of these results, you suspect that the experimental setup used by
B and C is faulty: for C, it seems that something changed over time so that the
first experiments were successful and the latter experiments were not. Or perhaps
the fact that a given experiment was a success (failure) made it more likely that also
the next experiment would be a success (failure). For B, the opposite effect seems
to have taken place.
From the pattern of successes and failures you made inferences about whether
the outcomes were independent or followed some regularity. A mathematical for-
malization of this inference counts “runs” in each sequence of outcomes. A run is a
sucession of several ones or zeros. The first outcome had 7 runs, the second 9, and
the third only 2. Given that the number of successes is 4 and the number of failures
is 5, 9 runs seem too many and 2 runs too few.
The “runs test” (sometimes also called “run test”) exploits this in the following
way: it counts the number of runs, and then asks if this is a reasonable number of
442 15. HYPOTHESIS TESTING
runs to expect given the total number of successes and failures. It rejects whenever
the number of runs is either too large or too low.
The choice of the number of runs as test statistic cannot be derived from a like-
lihood ratio principle, since we did not specify the joint distribution of the outcome
of the experiment. But the above argument says that it will probably detect at least
some of the cases we are interested in.
In order to compute the error of type one, we will first derive the probability
distribution of the number of runs conditionally on the outcome that the number of
successes is 4. This conditional distribution can be computed, even if we do not know
the probability of success of each experiment, as long as their joint distribution has
the following property (which holds under the null hypothesis of statistical indepen-
dence): the probability of a given sequence of failures and successes only depends on
the number of failures and successes, not on the order in which they occur. Then the
conditional distribution of the number of runs can be obtained by simple counting.
How many arrangements of 5 zeros and 4 ones are there? The answer is 94 =
(9)(8)(7)(6)
(1)(2)(3)(4) = 126. How many of these arrangements have 9 runs? Only one, i.e., the
probability of having 9 runs (conditionally on observing 4 successes) is 1/126. The
probability of having 2 runs is 2/126, since one can either have the zeros first, or the
ones first.
15.3. THE RUNS TEST 443
In order to compute the probability of 7 runs, lets first ask: what is the proba-
bility of having 4 runs of ones and 3 runs of zeros? Since there are only 4 ones, each
run of ones must have exactly one element. So the distribution of ones and zeros
must be:
1 − one or more zeros − 1 − one or more zeros − 1 − one or more zeros − 1.
In order to specify the distribution of ones and zeros completely, we must therefore
count how many ways there are to split the sequence of 5 zeros into 3 nonempty
batches. Here are the possibilities:
0 0 0 | 0 | 0
0 0 | 0 0 | 0
0 0 | 0 | 0 0
(15.3.1)
0 | 0 0 0 | 0
0 | 0 0 | 0 0
0 | 0 | 0 0 0
Generally, the number of possibilities is 42 because there are 4 spaces between those
m−1 n−1
m
s−1 s−1
(15.3.4) Pr[r = 2s] = 2 m+n
m
Some computer programs (StatXact, www.cytel.com) compute these probabilities
exactly or by monte carlo simulation; but there is also an asymptotic test based on
15.3. THE RUNS TEST 445
6
6
6
6
Figure 3. Distribution of runs in 7 trials, if there are 4 successes
and 3 failures
probability is also equal to the unconditional probability. The only problem here
is that, due to discreteness, we can make the probability of type one errors only
approximately equal; but with increasing sample size this problem disappears.
15.4. PEARSON’S GOODNESS OF FIT TEST. 447
Problem 218. Write approximately 200 x’es and o’s on a piece of paper trying
to do it in a random manner. Then make a run test whether these x’s and o’s were
indeed random. Would you want to run a two-sided or one-sided test?
The law of rare events literature can be considered a generalization of the run
test. For epidemiology compare [Cha96], [DH94], [Gri79], and [JL97].
Why does one get a χ2 distribution in the limiting case? Because the xi them-
selves are asymptotically normal, and certain quadratic forms of normal distributions
are χ2 . The matterP is made a little complicated by the fact that the xi are linearly
dependent, since xj = n, and therefore their covariance matrix is singular. There
are two ways to deal with such a situation. One is to drop one observation; one
will not lose any information by this, and the remaining r − 1 observations are well
behaved. (This explains, by the way, why one has a χ2r−1 instead of a χ2r .)
We will take an alternative route, namely, use theorems which are valid even
if the covariance matrix is singular. This is preferable because it leads to more
unified theories. In equation (10.4.9), we characterized all the quadratic forms of
multivariate normal variables that are χ2 ’s. Here it is again: Assume y is a jointly
normal vector random variable with mean vector µ and covariance matrix σ 2 Ψ, and
Ω is a symmetric nonnegative definite matrix. Then (y − µ)> Ω(y − µ) ∼ σ 2 χ2k iff
ΨΩΨΩΨ = ΨΩΨ and k is the rank of Ω. If Ψ is singular, i.e., does not have an
inverse, and Ω is a g-inverse of Ψ, then condition (10.4.9) holds. A matrix Ω is a
g-inverse of Ψ iff ΨΩΨ = Ψ. Every matrix has at least one g-inverse, but may have
more than one.
Now back to our multinomial distribution. By the central limit theorem, the xi
are asymptotically jointly normal; their mean and covariance matrix are given by
equation (8.4.2). This covariance matrix is singular (has rank r − 1), and a g-inverse
15.4. PEARSON’S GOODNESS OF FIT TEST. 449
is given by (15.4.2), which has in its diagonal exactly the weighting factors used in
the statistic for the goodness of fit test.
1
0 ··· 0
p1
1
10 p2 ··· 0
(15.4.2)
. .. .. ..
n .. . . .
1
0 0 ··· pr
Answer. Postmultiplied by the g-inverse given in (15.4.2), the covariance matrix from (8.4.2)
becomes
(15.4.3)
1 0 ··· 0
p1 − p21 −p1 p2 ··· −p1 pr p1 1 − p1 −p1 ··· −p1
2 1
−p2 p1 p2 − p2 · · · −p2 pr 1 0 p2 · · · 0 −p2 1 − p2 · · · −p2
= .
,
. .. .. .. . . . . . . ..
.. . . .
n .. .. .. .. .. .. ..
.
2 1
450 15. HYPOTHESIS TESTING
and if one postmultiplies this again by the covariance matrix, one gets the covariance matrix back:
1 − p1 −p1 ··· −p1 p1 − p21 −p1 p2 ··· −p1 pr p1 − p21 −p1 p2 ···
−p2 1 − p2 ··· −p2 −p2 p1 p2 − p22 ··· −p2 pr −p2 p1 p2 − p22 ···
. . .. . . . .. . = . . ..
.. .
. . .
. .
. .
. . .
.
.. .
. .
−pr −pr ··· 1 − pr −pr p1 −pr p2 ··· 2
pr − pr −pr p1 −pr p2 ···
In the left upper corner, matrix multiplication yields the element p1 − 2p21 + p31 + p21 (p2 + · · · +
pr ) = p1 − 2p21 + p21 (p1 + p2 + · · · + pn ) = p1 − p21 , and the first element in the second row is
−2p1 p2 + p1 p2 (p1 + p2 + · · · + pr ) = −p1 p2 . Since the product matrix is symmetric, this gives all
the typical elements.
Problem 220. Show that the covariance matrix of the multinomial distribution
given in (8.4.2) has rank n − 1.
Answer. Use the fact that the rank of Ψ is tr(ΨΨ− ), and one sees that the trace of the
matrix on the rhs of (15.4.3) is n − 1.
2
From this follows that asymptotically, i+1 (xi −np
Pr i)
npi has a χ2 distribution with
2
r − 1 degrees of freedom. This is only asymptotically a χ ; the usual rule is that n
must be large enough so that npi is at least 5 for all i. Others refined this rule: if
r ≥ 5, it is possible to let one of the npi be as small as 1 (requiring the others to be
5 or more) and still get a “pretty good” approximation.
15.4. PEARSON’S GOODNESS OF FIT TEST. 451
Problem 221. [HC70, pp. 310/11] I throw a die 60 times and get the following
frequencies for the six faces of the die: 13, 19, 11, 8, 5, 4. Test at the 5% significance
level whether the probability of each face is 16 .
Answer. The hypothesis must be rejected: observed value is 15.6, critical value 11.1.
smoker nonsmoker
(15.4.4) lung cancer y 11 y 12
no lung cancer y 21 y 22
452 15. HYPOTHESIS TESTING
The procedure described would allow us to test whether the data is compabile with
p p12
the four cell probabilities being any given set of values 11 . But the ques-
p21 p22
tion which is most often asked is not whether the data are compatible with a spe-
cific set of cell probabilities, but whether the criteria tabled are independent. I.e.,
whether there are two numbers p and q so that the cell probabilities have the form
pq p(1 − q)
. If this is so, then p is the probability of being a smoker,
(1 − p)q (1 − p)(1 − q)
and q the probability of having lung cancer. Their MLE’s are p̃ = (y 11 + y 12 )/n,
which is usually written as y 1. /n, and q̃ = (y 11 + y 21 )/n = y .1 /n. Therefore the
MLE’s of all four cell probabilities are
y 1. y .1 /n2 y 1. y .2 /n2
(15.4.5) .
y 2. y .1 /n2 y 2. y .2 /n2
Plugging these MLE’s into the formula for the goodness of fit test statistic, we get
X (xij − x.i x.j /n)2
(15.4.6) ∼ χ21 .
i,j
x.i x.j /n
Since there are four cells, i.e., three independent counts, and we estimated two param-
eters, the number of degrees of freedom is 3−2 = 1. Again, this is only asymptotically
a χ2 .
15.5. PERMUTATION TESTS 453
ā − b̄
(15.5.1) t= p
s2 /n a + s2 /nb
454 15. HYPOTHESIS TESTING
where
n
X
(15.5.2) na = di = 3
i=1
Xn
(15.5.3) nb = (1 − di ) = 2
i=1
1 X
(15.5.4) ā = yi
na
i : di =1
1 X
(15.5.5) b̄ = yi
nb
i : di =0
2 2
P P
2 i : di =1 (y i − ā) + i : di =0 (y i − b̄)
(15.5.6) s =
na + nb − 2
Assume for example that the observed values of the two random variables d
and y are d = [ 0 0 1 1 1 ] and y = [ 6 12 18 30 54 ]. I.e., the three subjects receiving
treatment A had the results 18, 30, and 54, and the two subjects receiving treatment
B the results 6 and 12. This gives a value of t = 1.81. Does this indicate a significant
difference between A and B?
15.5. PERMUTATION TESTS 455
d1 d2 d3 d4 d5 ā b̄ t
1 1 1 0 0 12 42 –3.00
1 1 0 1 0 16 36 –1.22
1 1 0 0 1 24 24 0.00
1 0 1 1 0 18 33 –0.83
1 0 1 0 1 26 21 0.25
1 0 0 1 1 30 15 0.83
0 1 1 1 0 20 30 –0.52
0 1 1 0 1 28 18 0.52
0 1 0 1 1 32 12 1.22
0 0 1 1 1 34 9 1.81
Since there are 10 possible outcomes, and the outcome observed is the most
extreme of them, the significance level is 0.1. i.e., in this case, the permutation test,
gives less evidence for a treatment effect than the ordinary t-test.
Problem 222. This is an example adapted from [GG95], which is also discussed
in [Spr98, pp. 2/3 and 375–379]. Table 1 contains artificial data about two firms
hiring in the same labor market. For the sake of the argument it is assumed that
both firms receive the exact same number of applications (100), and both firms hire
11 new employees. Table 1 shows how many of the applicants and how many of the
new hires were Minorities.
15.5. PERMUTATION TESTS 457
5 68 17
Answer. In Firm C the selection ratio is 32 40
= 64 = 0.265625. In firm A, the chances
for blacks to be hired is 24% that of whites, and in firm C it is 26%. Firm C seems better. But
if we compare the chances not to get hired we get a conflicting verdict: In firm A the ratio is
31 68
32 58
= 1.1357. In firm C it is 27 68
32 28
= 2.0491. In firm C, the chances not to get hired is twice as
high for Minorities as it is for Whites, in firm A the chances not to get hired are more equal. Here
A seems better.
15.5. PERMUTATION TESTS 459
This illustrates an important drawback of the selection ratio: if we compare the chances of
not being hired instead of those of being hired, we get (1 − p1 )/(1 − p2 ) instead of p1 /p2 . There
is no simple relationship between these two numbers, indeed (1 − p1 )/(1 − p2 ) is not a function of
p1 /p2 , although both ratios should express the same concept. This is why one can get conflicting
information if one looks at the selection ratio for a certain event or the selection ratio for its
complement.
The odds ratio and the differences in probabilities do not give rise to such discrepancies: the
odds ratio for not being hired is just the inverse of the odds ratio for being hired, and the difference
in the probabilities of not being hired is the negative of the difference in the probabilities of being
hired.
As long as p1 and p2 are both close to zero, the odds ratio is approximately equal to the
selection ratio, therefore in this case the selection ratio is acceptable despite the above criticism.
• e. 4 points Compute the significance levels for rejecting the null hypothesis of
equal treatment with the one-sided alternative of discrimination for each firm using
460 15. HYPOTHESIS TESTING
Fisher’s exact test. You will get a counterintuitive result. How can you explain this
result?
Answer. The R-commands can be run as ecmet.script(hiring). Although firm A hired a
lower percentage of applicants than firm B, the significance level for discrimination on Fisher’s exact
test is 0.07652 for firm A and 0.03509 for firm B. I.e., in a court of law, firm B might be convicted
of discrimination, but firm A, which hired a lower percentage of its minority applicants, could not.
[Spr98, p. 377] explains this as follows: “the smaller number of minority hirings reduces the
power of Fisher’s exact test applied to firm A relative to the power where there is a surplus of
minority hirings (firm B). This extra power is enough to produce a significant result despite the
higher percentage of promotions among minority hirings (or the higher odds ratio if one makes the
comparison on that basis).”
• f. 5 points Now let’s change the example. Table 3 has the same numbers as
Table 1, but now these numbers do not count hirings but promotions from the pool
of existing employees, and instead of the number of applicants, the column totals are
the total numbers of employees of each firm. Let us first look at the overall race
composition of the employees in each firm. Let us assume that 40% of the population
are minorities, and 32 of the 100 employees of firm A are minorities, and 48 of the
100 employees of firm B are minorities. Is there significant evidence that the firms
discriminated in hiring?
Answer. Assuming that the population is infinite, the question is: if one makes 100 inde-
pendent random drawings from a population that contains 40% minorities, what is the probabil-
ity to end up with 32 or less minorities? The R-command is pbinom(q=32,size=100,prob=0.4)
which is 0.06150391. The other firm has more than 40 black employees; here one might won-
der if there is evidence of discrimination against whites. pbinom(q=48,size=100,prob=0.4) gives
0.9576986 = 1 − 0.0423, i.e., it is significant at the 5% level. But here we should apply a two-sided
test. A one-sided test about discrimination against Blacks can be justified by the assumption “if
there is discrimination at all, it is against blacks.” This assumption cannot be made in the case
of discrimination against Whites. We have to allow for the possibility of discrimination against
Minorities and against Whites; therefore the critical value is at probability 0.975, and the observed
result is not significant.
462 15. HYPOTHESIS TESTING
• g. 2 points You want to use Table 3 to investigate whether the firms discrimi-
nated in promotion, and you are considering Fisher’s exact test. Do the arguments
made above with respect to Fisher’s exact still apply?
Answer. No. A conditional test is no longer appropriate here because the proportion of
candidates for promotion is under control of the firms. Firm A not only promoted a smaller
percentage of their minority employees, but it also hired fewer minority workers in the first place.
These two acts should be considered together to gauge the discrimination policies. The above
Sprent-quote [Spr98, p. 377] continues: “There is a timely warning here about the need for care
when using conditional tests when the marginal totals used for conditioning may themselves be
conditional upon a further factor, in this case hiring policy.”
which is equivalent to
rm − 1
(15.5.10) rm f1 + ≤ f2
rx − 1
Problem 223. Suppose that 60% of whites are hired, while only 40% of a minor-
ity group are hired. Suppose that a certain type of training or education was related
to the job in question, and it is believed that at least 10% of the minority group had
this training.
• a. 3 points Assuming that persons with this training had twice the chance of
getting the job, which percentage of whites would have had this qualification in order
to explain the disparity in the hiring rates?
Answer. Since 60% of whites are hired and 40% of the minority group, rm = 60/40 = 1.5.
Training is the factor x. Sind persons with training had twice the chance of getting the job, rx = 2.
Since 10% of the minority group had this training, f1 = 0.1. Therefore (15.5.10) implies that at
least 1.5 · 0.1 + 0.5
1
= 65% of whites had to have this qualification in order to explain the observed
disparity in hiring rates.
is an estimate of the covariance matrix of h(θ̃). I.e., one takes h(θ̃) twice and
“divides” it by its covariance matrix.
Now let us make more stringent assumptions. Assume the density fx (x; θ) of
x depends on the parameter vector θ. We are assuming that the conditions are
satisfied which ensure asymptotic normality of the maximum likelihood estimator θ̂
and also of θ̄, the constrained maximum likelihood estimator subject to the constraint
h(θ) = o.
There are three famous tests to test this hypothesis, which asymptotically are
all distributed like χ2q . The likehihood-ratio test is
maxh(θ)=o fy (y; θ)
(15.6.3) LRT = −2 log = 2(log fy (y, θ̂) − log fy (y, θ̄))
maxθ fy (y; θ)
It rejects if imposing the constraint reduces the attained level of the likelihood func-
tion too much.
The Wald test has the form
n ∂h ∂ 2 log f (y; θ) −1 ∂h> o−1
(15.6.4) Wald = −h(θ̂)> h(θ̂)
> θ̂ >
∂θ
∂θ ∂θ∂θ θ̂ θ̂
h 2 i−1
To understand this formula, note that − E ∂ ∂θ∂θ log f (y;θ)
> is the Cramer Rao
lower bound, and since all maximum likelihood estimators asymptotically attain the
15.6. WALD, LIKELIHOOD RATIO, LAGRANGE MULTIPLIER TESTS 467
CRLB, it is the asymptotic covariance matrix of θ̂. If one does not take the expected
value but plugs θ̂ into these partial derivatives of the log likelihood function, one
gets a consistent estimate of the asymtotic covariance matrix. Therefore the Wald
test is a special case of the generalized Wald test.
Finally the score test has the form
∂ log f (y; θ) ∂ 2 log f (y; θ) −1 ∂ log f (y; θ)>
(15.6.5) Score = −
∂θ > ∂θ∂θ > ∂θ
θ̄ θ̄ θ̄
This test tests whether the score, i.e., the gradient of the unconstrained log likelihood
function, evaluated at the constrained maximum likelihood estimator, is too far away
from zero. To understand this formula, remember that we showed in the proof of
the Cramer-Rao
h 2 ilower bound that the negative of the expected value of the Hessian
∂ log f (y;θ)
−E ∂θ∂θ >
is the covariance matrix of the score, i.e., here we take the score
twice and divide it by its estimated covariance matrix.
CHAPTER 16
[Gre97, 6.1 on p. 220] says: “An econometric study begins with a set of propo-
sitions about some aspect of the economy. The theory specifies a set of precise,
deterministic relationships among variables. Familiar examples are demand equa-
tions, production functions, and macroeconomic models. The empirical investigation
provides estimates of unknown parameters in the model, such as elasticities or the
marginal propensity to consume, and usually attempts to measure the validity of the
theory against the behavior of the observable data.”
[Hen95, p. 6] distinguishes between two extremes: “‘Theory-driven’ approaches,
in which the model is derived from a priori theory and calibrated from data evidence.
They suffer from theory dependence in that their credibility depends on the credi-
bility of the theory from which they arose—when that theory is discarded, so is the
469
470 16. GENERAL PRINCIPLES OF ECONOMETRIC MODELLING
in order to “control” for this effect, so that the effects of education and age will not
be confounded.
Problem 224. Why should a regression of income on education include not only
age but also the square of age?
Answer. Because the effect of age becomes smaller with increases in age.
This chapter establishes the connection between critical realism and Holland and
Rubin’s modelling of causality in statistics as explained in [Hol86] and [WM83, pp.
3–25] (and the related paper [LN81] which comes from a Bayesian point of view). A
different approach to causality and inference, [Roy97], is discussed in chapter/section
2.8. Regarding critical realism and econometrics, also [Dow99] should be mentioned:
this is written by a Post Keynesian econometrician working in an explicitly realist
framework.
Everyone knows that correlation does not mean causality. Nevertheless, expe-
rience shows that statisticians can on occasion make valid inferences about causal-
ity. It is therefore legitimate to ask: how and under which conditions can causal
473
474 17. CAUSALITY AND INFERENCE
A third variable indicates who receives the treatment. I.e, he has the “causal in-
dicator” s which can take two values, t (treatment) and c (control), and two variables
y t and y c , which, evaluated at individual ω, indicate the responses this individual
would give in case he was subject to the treatment, and in case he was or not.
Rubin defines y t − y c to be the causal effect of treatment t versus the control
c. But this causal effect cannot be observed. We cannot observe how those indi-
viuals who received the treatement would have responded if they had not received
the treatment, despite the fact that this non-actualized response is just as real as
the response which they indeed gave. This is what Holland calls the Fundamental
Problem of Causal Inference.
Problem 225. Rubin excludes race as a cause because the individual cannot do
anything about his or her race. Is this argument justified?
Does this Fundamental Problem mean that causal inference is impossible? Here
are several scenarios in which causal inference is possible after all:
• Temporal stability of the response, and transience of the causal effect.
• Unit homogeneity.
• Constant effect, i.e., yt (ω) − yc (ω) is the same for all ω.
• Independence of the response with respect to the selection process regarding
who gets the treatment.
476 17. CAUSALITY AND INFERENCE
and that
if given treatment, and for all ω ∈ T , i.e., for those patients who do receive treatment, we do not
know whether they would have recovered if not given treatment. In other words, neither Pr[T |S]
nor E[C|S 0 ] can be estimated as the frequencies of observable outcomes.
Pr[T ] − Pr[C] = Pr[T |S] Pr[S] + Pr[T |S 0 ](1 − Pr[S]) − Pr[C|S] Pr[S] + Pr[C|S 0 ](1 − Pr[S])
= Pr[T |S] Pr[S] + Pr[T |S](1 − Pr[S]) − Pr[C|S 0 ] Pr[S] + Pr[C|S 0 ](1 − Pr[S])
(17.0.8) = Pr[T |S] − Pr[C|S 0 ]
• c. 2 points Why were all these calculations necessary? Could one not have
defined from the beginning that the causal effect of the treatment is Pr[T |S]−Pr[C|S 0 ]?
Answer. Pr[T |S] − Pr[C|S 0 ] is only the empirical difference in recovery frequencies between
those who receive treatment and those who do not. It is always possible to measure these differences,
but these differences are not necessarily due to the treatment but may be due to other reasons.
478 17. CAUSALITY AND INFERENCE
The main message of the paper is therefore: before drawing causal conclusions
one should acertain whether one of these conditions apply which make causal con-
clusions possible.
In the rest of the paper, Holland compares his approach with other approaches.
Suppes’s definitions of causality are interesting:
• If r < s denote two time values, event Cr is a prima facie cause of Es iff
Pr[Es |Cr ] > Pr[Es ].
• Cr is a spurious cause of Es iff it is a prima facie cause of Es and for some
q < r < s there is an event Dq so that Pr[Es |Cr , Dq ] = Pr[Es |Dq ] and
Pr[Es |Cr , Dq ] ≥ Pr[Es |Cr ].
• Event Cr is a genuine cause of Es iff it is a prima facie but not a spurious
cause.
This is quite different than Rubin’s analysis. Suppes concentrates on the causes of a
given effect, not the effects of a given cause. Suppes has a Popperian falsificationist
view: a hypothesis is good if one cannot falsify it, while Holland has the depth-realist
view which says that the empirical is only a small part of reality, and which looks at
the underlying mechanisms.
In the present chapter, the only distributional assumptions are that means and
variances exist. (From this follows that also the covariances exist).
> >
εi = y i −µ, and define the vectors y = y 1 y 2 · · · y n , ε = ε1 ε2 ··· εn ,
>
and ι = 1 1 · · · 1 . Then one can write the model in the form
(18.1.1) y = ιµ + ε ε ∼ (o, σ 2 I)
The notation ε ∼ (o, σ 2 I) is shorthand for E [ε ε] = σ 2 I
ε] = o (the null vector) and V [ε
2
(σ times the identity matrix, which has 1’s in the diagonal and 0’s elsewhere). µ is
the deterministic part of all the y i , and εi is the random part.
Model 2 is “simple regression” in which the deterministic part µ is not constant
but is a function of the nonrandom variable x. The assumption here is that this
function is differentiable and can, in the range of the variation of the data, be ap-
proximated by a linear function [Tin51, pp. 19–20]. I.e., each element of y is a
constant α plus a constant multiple of the corresponding element of the nonrandom
vector x plus a random error term: y t = α + xt β + εt , t = 1, . . . , n. This can be
written as
y1 1 x1 ε1 1 x1 ε1
(18.1.2)
.. ..
= α +
..
β +
.. ..
= .. α + ..
. . . . . . β .
yn 1 xn εn 1 xn εn
or
(18.1.3) y = Xβ + ε ε ∼ (o, σ 2 I)
18.1. THREE VERSIONS OF THE LINEAR MODEL 483
If the systematic part of y depends on more than one variable, then one needs
multiple regression, model 3. Mathematically, multiple regression has the same form
(18.1.3), but this time X is arbitrary (except for the restriction that all its columns
are linearly independent). Model 3 has Models 1 and 2 as special cases.
Multiple regression is also used to “correct for” disturbing influences. Let me
explain. A functional relationship, which makes the systematic part of y dependent
on some other variable x will usually only hold if other relevant influences are kept
constant. If those other influences vary, then they may affect the form of this func-
tional relation. For instance, the marginal propensity to consume may be affected
by the interest rate, or the unemployment rate. This is why some econometricians
484 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
(Hendry) advocate that one should start with an “encompassing” model with many
explanatory variables and then narrow the specification down by hypothesis tests.
Milton Friedman, by contrast, is very suspicious about multiple regressions, and
argues in [FS91, pp. 48/9] against the encompassing approach.
Friedman does not give a theoretical argument but argues by an example from
Chemistry. Perhaps one can say that the variations in the other influences may have
more serious implications than just modifying the form of the functional relation:
they may destroy this functional relation altogether, i.e., prevent any systematic or
predictable behavior.
observed unobserved
random y ε
nonrandom X β, σ 2
Problem 184 shows that in model 1, this principle yields the arithmetic mean.
18.2. ORDINARY LEAST SQUARES 485
We will solve this minimization problem using the first-order conditions in vector
notation. As a preparation, you should read the beginning of Appendix C about
matrix differentiation and the connection between matrix differentiation and the
Jacobian matrix of a vector function. All you need at this point is the two equations
(C.1.6) and (C.1.7). The chain rule (C.1.23) is enlightening but not strictly necessary
for the present derivation.
The matrix differentiation rules (C.1.6) and (C.1.7) allow us to differentiate
(18.2.1) to get
(18.2.2) ∂SSE/∂β > = −2y > X + 2β > X > X.
Transpose it (because it is notationally simpler to have a relationship between column
vectors), set it zero while at the same time replacing β by β̂, and divide by 2, to get
the “normal equation”
(18.2.3) X > y = X > X β̂.
486 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
Due to our assumption that all columns of X are linearly independent, X > X has
an inverse and one can premultiply both sides of (18.2.3) by (X > X)−1 :
(18.2.4) β̂ = (X > X)−1 X > y.
If the columns of X are not linearly independent, then (18.2.3) has more than one
solution, and the normal equation is also in this case a necessary and sufficient
condition for β̂ to minimize the SSE (proof in Problem 232).
Problem 230. 4 points Using the matrix differentiation rules
(18.2.5) ∂w> x/∂x> = w>
(18.2.6) ∂x> M x/∂x> = 2x> M
for symmetric M , compute the least-squares estimate β̂ which minimizes
(18.2.7) SSE = (y − Xβ)> (y − Xβ)
You are allowed to assume that X > X has an inverse.
Answer. First you have to multiply out
(18.2.8) (y − Xβ)> (y − Xβ) = y > y − 2y > Xβ + β > X > Xβ.
The matrix differentiation rules (18.2.5) and (18.2.6) allow us to differentiate (18.2.8) to get
(18.2.9) ∂SSE/∂β > = −2y > X + 2β > X > X.
18.2. ORDINARY LEAST SQUARES 487
Transpose it (because it is notationally simpler to have a relationship between column vectors), set
it zero while at the same time replacing β by β̂, and divide by 2, to get the “normal equation”
(18.2.10) X > y = X > X β̂.
Since X > X has an inverse, one can premultiply both sides of (18.2.10) by (X > X)−1 :
(18.2.11) β̂ = (X > X)−1 X > y.
Problem 231. 2 points Show the following: if the columns of X are linearly
independent, then X > X has an inverse. (X itself is not necessarily square.) In your
proof you may use the following criteria: the columns of X are linearly independent
(this is also called: X has full column rank) if and only if Xa = o implies a = o.
And a square matrix has an inverse if and only if its columns are linearly independent.
Answer. We have to show that any a which satisfies X > Xa = o is itself the null vector.
From X > Xa = o follows a> X > Xa = 0 which can also be written kXak2 = 0. Therefore Xa = o,
and since the columns of X are linearly independent, this implies a = o.
Problem 232. 3 points In this Problem we do not assume that X has full column
rank, it may be arbitrary.
• a. The normal equation (18.2.3) has always at least one solution. Hint: you
are allowed to use, without proof, equation (A.3.3) in the mathematical appendix.
488 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
• c. Conclude from this that the normal equation is a necessary and sufficient
condition characterizing the values β̂ minimizing the sum of squared errors (18.2.12).
Answer. (18.2.12) shows that the normal equations are sufficient. For necessity of the normal
equations let β̂ be an arbitrary solution of the normal equation, we have seen that there is always
at least one. Given β̂, it follows from (18.2.12) that for any solution β ∗ of the minimization,
X > X(β ∗ − β̂) = o. Use (18.2.3) to replace (X > X)β̂ by X > y to get X > Xβ ∗ = X > y.
It is customary to use the notation X β̂ = ŷ for the so-called fitted values, which
are the estimates of the vector of means η = Xβ. Geometrically, ŷ is the orthogonal
projection of y on the space spanned by the columns of X. See Theorem A.6.1 about
projection matrices.
The vector of differences between the actual and the fitted values is called the
vector of “residuals” ε̂ = y − ŷ. The residuals are “predictors” of the actual (but
18.2. ORDINARY LEAST SQUARES 489
Problem 234. Assume X has full column rank. Define M = I−X(X > X)−1 X > .
• a. 1 point Show that the space M projects on is the space orthogonal to all
columns in X, i.e., M q = q if and only if X > q = o.
490 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
Answer. X > q = o clearly implies M q = q. Conversely, M q = q implies X(X > X)−1 X > q =
o. Premultiply this by X > to get X > q = o.
• b. 1 point Show that a vector q lies in the range space of X, i.e., the space
spanned by the columns of X, if and only if M q = o. In other words, {q : q = Xa
for some a} = {q : M q = o}.
Answer. First assume M q = o. This means q = X(X > X)−1 X > q = Xa with a =
(X > X)−1 X > q. Conversely, if q = Xa then M q = M Xa = Oa = o.
Problem 235. In 2-dimensional space, write down the projection matrix on the
diagonal line y = x (call it E), and compute Ez for the three vectors a = [ 21 ],
b = [ 22 ], and c = [ 32 ]. Draw these vectors and their projections.
Assume we have a dependent variable y and two regressors x1 and x2 , each with
15 observations. Then one can visualize the data either as 15 points in 3-dimensional
space (a 3-dimensional scatter plot), or 3 points in 15-dimensional space. In the
first case, each point corresponds to an observation, in the second case, each point
corresponds to a variable. In this latter case the points are usually represented
as vectors. You only have 3 vectors, but each of these vectors is a vector in 15-
dimensional space. But you do not have to draw a 15-dimensional space to draw
these vectors; these 3 vectors span a 3-dimensional subspace, and ŷ is the projection
of the vector y on the space spanned by the two regressors not only in the original
18.2. ORDINARY LEAST SQUARES 491
Problem 236. “Simple regression” is regression with an intercept and one ex-
planatory variable only, i.e.,
(18.2.15) y t = α + βxt + εt
>
Here X = ι x and β = α β . Evaluate (18.2.4) to get the following formulas
>
for β̂ = α̂ β̂ :
x2t
P P P P
y t − xt xt y t
(18.2.16) α̂ =
n x2t − ( xt )2
P P
P P P
n xt y t − xt y t
(18.2.17) β̂ =
n x2t − ( xt )2
P P
Answer.
P
ι> ι> ι ι> x Pn P x2t
(18.2.18) X>X = ι x = =
x> x> ι x> x xt xt
492 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
P P
1 2 −
(18.2.19) X > X −1 = P P Pxt xt
n x2t − ( xt )2 − xt n
P
>ι> y P yt
(18.2.20) X y= =
x> y xi y t
>
Therefore (X X)−1 X > y gives equations (18.2.16) and (18.2.17).
Problem 238. Show that (18.2.17) and (18.2.16) can also be written as follows:
P
(xt − x̄)(y t − ȳ)
(18.2.22) β̂ = P
(xt − x̄)2
(18.2.23) α̂ = ȳ − β̂ x̄
18.2. ORDINARY LEAST SQUARES 493
P P
Answer. Using xi = nx̄ and y i = nȳ in (18.2.17), it can be written as
P
x y − nx̄ȳ
(18.2.24) β̂ = P t 2t 2
xt − nx̄
Now apply Problem 237 to the numerator of (18.2.24), and Problem 237 with y = x to the denom-
inator, to get (18.2.22).
To prove equation (18.2.23) for α̂, let us work backwards and plug (18.2.24) into the righthand
side of (18.2.23):
P P
ȳ x2t − ȳnx̄2 − x̄ xt y t + nx̄x̄ȳ
(18.2.25) ȳ − x̄β̂ = P
x2t − nx̄2
The second and the fourth term in the numerator cancel out, and what remains can be shown to
be equal to (18.2.16).
Problem 239. 3 points Show that in the simple regression model, the fitted
regression line can be written in the form
(18.2.26) ŷ t = ȳ + β̂(xt − x̄).
From this follows in particular that the fitted regression line always goes through the
point x̄, ȳ.
494 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
Answer. Follows immediately if one plugs (18.2.23) into the defining equation ŷ t = α̂ +
β̂xt .
Formulas (18.2.22) and (18.2.23) are interesting because they express the regres-
sion coefficients in terms of the sample means and covariances. Problem 240 derives
the properties of the population equivalents of these formulas:
Problem 240. Given two random variables x and y with finite variances, and
var[x] > 0. You know the expected values, variances and covariance of x and y, and
you observe x, but y is unobserved. This question explores the properties of the Best
Linear Unbiased Predictor (BLUP) of y in this situation.
• a. 4 points Give a direct proof of the following, which is a special case of theorem
27.1.1: If you want to predict y by an affine expression of the form a+bx, you will get
the lowest mean squared error MSE with b = cov[x, y]/ var[x] and a = E[y] − b E[x].
Answer. The MSE is variance plus squared bias (see e.g. problem 193), therefore
Therefore we choose a so that the second term is zero, and then you only have to minimize the first
term with respect to b. Since
• b. 2 points For the first-order conditions you needed the partial derivatives
∂ ∂
∂a E[(y − a − bx)2 ] and ∂b E[(y − a − bx)2 ]. It is also possible, and probably shorter,
h to
∂
interchange taking expected value and partial derivative, i.e., to compute E ∂a (y −
i h i
∂
a − bx)2 and E ∂b (y − a − bx)2 and set those zero. Do the above proof in this
alternative fashion.
h i
∂
Answer. E ∂a
(y −a−bx)2 = −2 E[y −a−bx] = −2(E[y]−a−b E[x]). Setting this zero gives
h i
∂
the formula for a. Now E ∂b
(y − a − bx)2 = −2 E[x(y − a − bx)] = −2(E[xy] − a E[x] − b E[x2 ]).
Setting this zero gives E[xy] − a E[x] − b E[x2 ] = 0. Plug in formula for a and solve for b:
E[xy] − E[x] E[y] cov[x, y]
(18.2.30) b= = .
E[x2 ] − (E[x])2 var[x]
Answer. If one plugs the optimal a into (18.2.27), this just annulls the last term of (18.2.27)
so that the MSE is given by (18.2.28). If one plugs the optimal b = cov[x, y]/ var[x] into (18.2.28),
one gets
cov[x, y] 2 (cov[x, y])
(18.2.31) MSE = var[x] − 2 cov[x, y] + var[x]
var[x] var[x]
(cov[x, y])2
(18.2.32) = var[y] − .
var[x]
• d. 2 points Show that the prediction error is uncorrelated with the observed x.
Answer.
(18.2.33) cov[x, y − a − bx] = cov[x, y] − a cov[x, x] = 0
• e. 4 points If var[x] = 0, the quotient cov[x, y]/ var[x] can no longer be formed,
but if you replace the inverse by the g-inverse, so that the above formula becomes
(18.2.34) b = cov[x, y](var[x])−
then it always gives the minimum MSE predictor, whether or not var[x] = 0, and
regardless of which g-inverse you use (in case there are more than one). To prove this,
you need to answer the following four questions: (a) what is the BLUP if var[x] = 0?
18.2. ORDINARY LEAST SQUARES 497
(b) what is the g-inverse of a nonzero scalar? (c) what is the g-inverse of the scalar
number 0? (d) if var[x] = 0, what do we know about cov[x, y]?
Answer. (a) If var[x] = 0 then x = µ almost surely, therefore the observation of x does not
give us any new information. The BLUP of y is ν in this case, i.e., the above formula holds with
b = 0.
(b) The g-inverse of a nonzero scalar is simply its inverse.
(c) Every scalar is a g-inverse of the scalar 0.
(d) if var[x] = 0, then cov[x, y] = 0.
Therefore pick a g-inverse 0, an arbitrary number will do, call it c. Then formula (18.2.34)
says b = 0 · c = 0.
Problem 241. 3 points Carefully state the specifications of the random variables
involved in the linear regression model. How does the model in Problem 240 differ
from the linear regression model? What do they have in common?
Answer. In the regression model, you have several observations, in the other model only one.
In the regression model, the xi are nonrandom, only the y i are random, in the other model both
x and y are random. In the regression model, the expected value of the y i are not fully known,
in the other model the expected values of both x and y are fully known. Both models have in
common that the second moments are known only up to an unknown factor. Both models have in
common that only first and second moments need to be known, and that they restrict themselves
to linear estimators, and that the criterion function is the MSE (the regression model minimaxes
it, but the other model minimizes it since there is no unknown parameter whose value one has to
498 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
minimax over. But this I cannot say right now, for this we need the Gauss-Markov theorem. Also
the Gauss-Markov is valid in both cases!)
Answer. Premultiply the normal equation by a> to get ι> y − ι> X β̂ = 0. Premultiply by
1/n to get the result.
Problem 243. The fitted values ŷ and the residuals ε̂ are “orthogonal” in two
different ways.
• a. 2 points Show that the inner product ŷ > ε̂ = 0. Why should you expect this
from the geometric intuition of Least Squares?
Answer. Use ε̂ = M y and ŷ = (I −M )y: ŷ > ε̂ = y > (I −M )M y = 0 because M (I −M ) = O.
This is a consequence of the more general result given in problem ??.
18.3. THE COEFFICIENT OF DETERMINATION 499
based on the sums of squared residuals from the two models. This is particularly
appropriate for nls(), which minimizes a sum of squares.
The treatment which follows here is a little more complete than most. Some
textbooks, such as [DM93], never even give the leftmost term in formula (18.3.6)
according to which R2 is the sample correlation coefficient. Other textbooks, such
that [JHG+ 88] and [Gre97], do give this formula, but it remains a surprise: there
is no explanation why the same quantity R2 can be expressed mathematically in
two quite different ways, each of which has a different interpretation. The present
treatment explains this.
If the regression has a constant term, then the OLS estimate β̂ has a third
optimality property (in addition to minimizing the SSE and being the BLUE): no
other linear combination of the explanatory variables has a higher squared sample
correlation with y than ŷ = X β̂.
In the proof of this optimality property we will use the symmetric and idempotent
projection matrix D = I − n1 ιι> . Applied to any vector z, D gives Dz = z − ιz̄,
which is z with the mean taken out. Taking out the mean is therefore a projection,
on the space orthogonal to ι. See Problem 189.
Answer. Dx2 is og, the dark blue line starting at the origin, and Dy is cy, the red line
starting on x1 and going up to the peak.
In order to prove that ŷ has the highest squared sample correlation, take any
vector c and look at ỹ = Xc. We will show that the sample correlation of y with
ỹ cannot be higher than that of y with ŷ. For this let us first compute the sample
covariance. By (12.3.17), n times the sample covariance between ỹ and y is
(18.3.3) n times sample covariance(ỹ, y) = ỹ > Dy = c> X > D(ŷ + ε̂
ε).
By Problem 246, Dε̂ ε, hence X > Dε̂
ε = ε̂ ε = X >ε̂ ε = o (this last equality is
equivalent to the Normal Equation (18.2.3)), therefore (18.3.3) becomes ỹ > Dy =
502 18. MEAN-VARIANCE ANALYSIS IN THE LINEAR MODEL
Answer. Since X has a constant term, a vector a exists such that Xa = ι, therefore ι>ε̂
ε=
a> X >ε̂
ε = a> o = 0. From ι>ε̂
ε = 0 follows Dε̂
ε = ε̂
ε.
Problem 247. 1 point Show that, if X has a constant term, then ŷ¯ = ȳ
Answer. Follows from 0 = ι>ε̂ ε = ι> y − ι> ŷ. In the visualization, this is equivalent with the
fact that both ocb and ocy are right angles.
Answer. ε̂ is the by, the green line going up to the peak, and SSE is the squared length of
it. SST is the squared length of y − ιȳ. Sincer ιȳ is the projection of y on x1 , i.e., it is oc, the part
of x1 that is red, one sees that SST is the squared length of cy. SSR is the squared length of cb.
The analysis of variance identity follows because cby is a right angle. R2 = cos2 α where α is the
angle between bcy in this same triangle.
is an orthogonal decomposition (all three vectors on the righthand side are orthogonal
to each other), therefore in particular
Geometrically this follows from the fact that y − ŷ is orthogonal to the column space
of X, while ŷ − ιȳ lies in that column space.
Answer. From y take the green line down to b, then the light blue line to c, then the red line
to the origin.
18.3. THE COEFFICIENT OF DETERMINATION 505
Problem 251. 5 points Show that the “analysis of variance” identity SST =
SSE + SSR holds in a regression with intercept, i.e., prove one of the two following
equations:
ε> (X β̂−ι n
1 >
P P
and then show that the cross product term (yt −ŷt )(ŷt −ȳ) = ε̂t (ŷt −ȳ) = ε̂ ι y) = 0
> > >
ε X = o and in particular, since a constant term is included, ε̂
since ε̂ ε ι = 0.
From the so-called “analysis of variance” identity (18.3.12), together with (18.3.6),
one obtains the following three alternative expressions for the maximum possible cor-
relation, which is called R2 and which is routinely used as a measure of the “fit” of
the regression:
¯ j − ȳ) 2
P
2 (ŷj − ŷ)(y SSR SST − SSE
(18.3.16) R = P ¯ 2 (yj − ȳ)2 = SST =
P
(ŷj − ŷ) SST
The first of these three expressions is the squared sample correlation coefficient be-
tween ŷ and y, hence the notation R2 . The usual interpretation of the middle
expression is the following: SST can be decomposed into a part SSR which is “ex-
plained” by the regression, and a part SSE which remains “unexplained,” and R2
measures that fraction of SST which can be “explained” by the regression. [Gre97,
pp. 250–253] and also [JHG+ 88, pp. 211/212] try to make this notion plausible.
Instead of using the vague notions “explained” and “unexplained,” I prefer the fol-
lowing reading, which is based on the third expression for R2 in (18.3.16): ιȳ is the
vector of fitted values if one regresses y on a constant term only, and SST is the SSE
in this “restricted” regression. R2 measures therefore the proportionate reduction in
18.3. THE COEFFICIENT OF DETERMINATION 507
the SSE if one adds the nonconstant regressors to the regression. From this latter
formula one can also see that R2 = cos2 α where α is the angle between y − ιȳ and
ŷ − ιȳ.
Problem 252. Given two data series x and y. Show that the regression of y
on x has the same R2 as the regression of x on y. (Both regressions are assumed to
include a constant term.) Easy, but you have to think!
Answer. The symmetry comes from the fact that, in this particular case, R2 is the squared
sample correlation coefficient between x and y. Proof: ŷ is an affine transformation of x, and
correlation coefficients are invariant under affine transformations (compare Problem 254).
Problem 253. This Problem derives some relationships which are valid in simple
regression yt = α + βxt + εt but their generalization to multiple regression is not
obvious.
Answer. From ŷt = α̂ + β̂xt and ȳ = α̂ + β̂ x̄ follows ŷt − ȳ = β̂(xt − x̄). Therefore
P P
(ŷ − ȳ)2 (x − x̄)2
(18.3.18) 2
R = P t 2
= β̂
2
P t 2
(yt − ȳ) (yt − ȳ)
• c. 1 point Finally show that R2 = β̂xy β̂yx , i.e., it is the product of the two
slope coefficients one gets if one regresses y on x and x on y.
18.4. THE ADJUSTED R-SQUARE 509
If the regression does not have a constant term, but a vector a exists with
ι = Xa, then the above mathematics remains valid. If a does not exist, then
the identity SST = SSR + SSE no longer holds, and (18.3.16) is no longer valid.
−SSE
The fraction SSTSST can assume negative values. Also the sample correlation
coefficient between ŷ and y loses its motivation, since there will usually be other
linear combinations of the columns of X that have higher sample correlation with y
than the fitted values ŷ.
Equation (18.3.16) is still puzzling at this point: why do two quite different simple
concepts, the sample correlation and the proportionate reduction of the SSE, give
the same numerical result? To explain this, we will take a short digression about
correlation coefficients, in which it will be shown that correlation coefficients always
denote proportionate reductions in the MSE. Since the SSE is (up to a constant
factor) the sample equivalent of the MSE of the prediction of y by ŷ, this shows
that (18.3.16) is simply the sample equivalent of a general fact about correlation
coefficients.
But first let us take a brief look at the Adjusted R2 .
the distribution function of R2 depends on both the unknown error variance and the
values taken by the explanatory variables; therefore the R2 belonging to different
regressions cannot be compared.
A further drawback is that inclusion of more regressors always increases the
R2 . The adjusted R̄2 is designed to remedy this. Starting from the formula R2 =
1 − SSE/SST , the “adjustment” consists in dividing both SSE and SST by their
degrees of freedom:
SSE/(n − k) n−1
(18.4.1) R̄2 = 1 − = 1 − (1 − R2 ) .
SST /(n − 1) n−k
For given SST , i.e., when one looks at alternative regressions with the same depen-
dent variable, R̄2 is therefore a declining function of s2 , the unbiased estimator of
σ 2 . Choosing the regression with the highest R̄2 amounts therefore to selecting that
regression which yields the lowest value for s2 .
R̄2 has the following interesting property: (which we note here only for reference,
because we have not yet discussed the F -test:) Assume one adds i more regressors:
then R̄2 increases only if the F statistic for these additional regressors has a value
greater than one. One can also say: s2 decreases only if F > 1. To see this, write
18.4. THE ADJUSTED R-SQUARE 511
this F statistic as
(SSE k − SSE k+i )/i n − k − i SSE k
(18.4.2) F = = −1
SSE k+i /(n − k − i) i SSE k+i
n−k−i (n − k)s2k
(18.4.3) = 2 −1
i (n − k − i)sk+i
(n − k)s2k n−k
(18.4.4) = 2 − +1
isk+i i
(n − k) s2k
(18.4.5) = 2 −1 +1
i sk+i
From this the statement follows.
Minimizing the adjusted R̄2 is equivalent to minimizing the unbiased variance
estimator s2 ; it still does not penalize the loss of degrees of freedom heavily enough,
i.e., it still admits too many variables into the model.
Alternatives minimize Amemiya’s prediction criterion or Akaike’s information
criterion, which minimize functions of the estimated variances and n and k. Akaike’s
information criterion minimizes an estimate of the Kullback-Leibler discrepancy,
which was discussed on p. 370.
CHAPTER 19
cov[x, y]
(19.1.1) ρxy = p p .
var[x] var[y]
513
514 19. DIGRESSION ABOUT CORRELATION COEFFICIENTS
Problem 254. Given the constant scalars a 6= 0 and c 6= 0 and b and d arbitrary.
Show that corr[x, y] = ± corr[ax + b, cy + d], with the + sign being valid if a and c
have the same sign, and the − sign otherwise.
Besides the simple correlation coefficient ρxy between two scalar variables y and
x, one can also define the squared multiple correlation coefficient ρ2y(x) between one
scalar variable y and a whole vector of variables x, and the partial correlation coef-
ficient ρ12.x between two scalar variables y 1 and y 2 , with a vector of other variables
x “partialled out.” The multiple correlation coefficient measures the strength of
a linear association between y and all components of x together, and the partial
correlation coefficient measures the strength of that part of the linear association
between y 1 and y 2 which cannot be attributed to their joint association with x. One
can also define partial multiple correlation coefficients. If one wants to measure the
linear association between two vectors, then one number is no longer enough, but
one needs several numbers, the “canonical correlations.”
The multiple or partial correlation coefficients are usually defined as simple cor-
relation coefficients involving the best linear predictor or its residual. But all these
19.1. A UNIFIED DEFINITION OF CORRELATION COEFFICIENTS 515
correlation coefficients share the property that they indicate a proportionate reduc-
tion in the MSE. See e.g. [Rao73, pp. 268–70]. Problem 255 makes this point for
the simple correlation coefficient:
Problem 255. 4 points Show that the proportionate reduction in the MSE of
the best predictor of y, if one goes from predictors of the form y ∗ = a to predictors
of the form y ∗ = a + bx, is equal to the squared correlation coefficient between y and
x. You are allowed to use the results of Problems 229 and 240. To set notation, call
the minimum MSE in the first prediction (Problem 229) MSE[constant term; y], and
the minimum MSE in the second prediction (Problem 240) MSE[constant term and
x; y]. Show that
(19.1.2)
MSE[constant term; y] − MSE[constant term and x; y] (cov[y, x])2
= = ρ2yx .
MSE[constant term; y] var[y] var[x]
Answer. The minimum MSE with only a constant is var[y] and (18.2.32) says that MSE[constant
term and x; y] = var[y]−(cov[x, y])2 / var[x]. Therefore the difference in MSE’s is (cov[x, y])2 / var[x],
and if one divides by var[y] to get the relative difference, one gets exactly the squared correlation
coefficient.
516 19. DIGRESSION ABOUT CORRELATION COEFFICIENTS
By theorem ??, the best linear predictor of y based on x has the formula
−
(19.1.4) y ∗ = ν + ω xy
>
Ω xx (x − µ)
y ∗ has the following additional extremal value property: no linear combination b> x
has a higher squared correlation with y than y ∗ . This maximal value of the squared
correlation is called the squared multiple correlation coefficient
−
2
ω>
xy Ω xxω xy
(19.1.5) ρy(x) =
ωyy
The multiple correlation coefficient itself is the positive square root, i.e., it is always
nonnegative, while some other correlation coefficients may take on negative values.
The squared multiple correlation coefficient can also defined in terms of propor-
tionate reduction in MSE. It is equal to the proportionate reduction in the MSE of
the best predictor of y if one goes from predictors of the form y ∗ = a to predictors
19.1. A UNIFIED DEFINITION OF CORRELATION COEFFICIENTS 517
There are therefore two natural definitions of the multiple correlation coefficient.
These two definitions correspond to the two formulas for R2 in (18.3.6).
>
Partial Correlation Coefficients. Now assume y = y 1 y 2 is a vector with
two elements and write
x µ Ω xx ω y1 ω y2
(19.1.7) y 1 ∼ ν1 , σ 2 ω >
y1 ω11 ω12 .
>
y2 ν2 ω y2 ω21 ω22
Let y ∗ be the best linear predictor of y based on x. The partial correlation coefficient
ρ12.x is defined to be the simple correlation between the residuals corr[(y 1 −y ∗1 ), (y 2 −
y ∗2 )]. This measures the correlation between y 1 and y 2 which is “local,” i.e., which
does not follow from their association with x. Assume for instance that both y 1 and
y 2 are highly correlated with x. Then they will also have a high correlation with
each other. Subtracting y ∗i from y i eliminates this dependency on x, therefore any
remaining correlation is “local.” Compare [Krz88, p. 475].
518 19. DIGRESSION ABOUT CORRELATION COEFFICIENTS
The partial correlation coefficient can be defined as the relative reduction in the
MSE if one adds y 2 to x as a predictor of y 1 :
(19.1.8)
MSE[constant term and x; y 2 ] − MSE[constant term, x, and y 1 ; y 2 ]
ρ212.x = .
MSE[constant term and x; y 2 ]
Problem 256. Using the definitions in terms of MSE’s, show that the following
relationship holds between the squares of multiple and partial correlation coefficients:
Mixed cases: One can also form multiple correlations coefficients with some of
the variables partialled out. The dot notation used here is due to Yule, [Yul07]. The
notation, definition, and formula for the squared correlation coefficient is
(19.1.12)
MSE[constant term and z; y] − MSE[constant term, z, and x; y]
ρ2y(x).z =
MSE[constant term and z; y]
>
ω xy.z Ω−
xx.z ω xy.z
(19.1.13) =
ωyy.z
where Ω xx and Ω yy are nonsingular, and let r be the rank of Ω xy . Then there exist
two separate transformations
(19.3.2) u = Lx, v = My
such that
u 2 Ip Λ
(19.3.3) [
V v ] = σ
Λ> Iq
522 19. DIGRESSION ABOUT CORRELATION COEFFICIENTS
q > q = 1. Then
(19.3.5)
> > >
o>
>
o>
l x p Lx p u p Ip Λ p o
V[ ] = V [ ] = V [ ] = σ2 > =σ
m> y q> M y o> q> v o q> Λ> Iq o q
Since the matrix at the righthand side has ones in the diagonal, it is the correlation
matrix, i.e., p> Λq = corr(l> x, m> y). Therefore (19.3.4) follows from Problem 258.
P 2 P 2 P
Problem 258. If pi = q i P = 1, and λi ≥ 0, show that | pi λi qi | ≤ max λi .
Hint: first get an upper bound for | pi λi qi | through a Cauchy-Schwartz-type argu-
ment.
P 2
P 2 P 2 2
Answer. ( pi λi qi ) ≤ pi λi qi λi ≤ (max λi ) .
Problem 259. Show that for every p-vector l and q-vector m such that l> x is
uncorrelated with l> > >
1 x, and m y is uncorrelated with m1 y,
(19.3.6) corr(l> x, m> y) ≤ λ2
p = (L−1 )> l and q = (M −1 )> m satisfy p> p = 1 and q > q = 1. Now write e1 for the first unit
vector, which has a 1 as first component and zeros everywhere else:
(19.3.7) cov[l> x, l> > > > >
1 x] = cov[p Lx, e1 Lx] = p Λe1 = p e1 λ1 .
This covariance is zero iff p1 = 0. Furthermore one also needs the following, directly from the proof
of Problem 257:
(19.3.8)
l> x p> Lx p> o> u 2 p
> o> Ip Λ p o 2 p> p
V[ ] = V [ ] = V [ ] = σ = σ
m> y q> M y o> q > v o> q > Λ> I q o q q > Λp
Since the matrix at the righthand side has ones in the diagonal, it is the correlation matrix, i.e.,
p> Λq = corr(l> x, m> y). Equation (19.3.6) follows from Problem 258 if one lets the subscript i
start at 2 instead of 1.
Problem 260. (Not eligible for in-class exams) Extra credit question for good
mathematicians: Reformulate the above treatment of canonical correlations without
the assumption that Ω xx and Ω yy are nonsingular.
the “residual maker” with respect to X. Then the squared partial sample correlation
is the squared simple correlation between the least squares residuals:
2 (z > M y)2
(19.4.1) rzy.X =
(z > M z)(y > M y)
Alternatively, one can define it as the proportionate reduction in the SSE. Although
X is assumed to incorporate a constant term, I am giving it here separately, in order
to show the analogy with (19.1.8):
(19.4.2)
2 SSE[constant term and X; y] − SSE[constant term, X, and z; y]
rzy.X = .
SSE[constant term and X; y]
[Gre97, p. 248] considers it unintuitive that this can be computed using t-statistics.
Our approach explains why this is so. First of all, note that the square of the t-
statistic is the F -statistic. Secondly, the formula for the F -statistic for the inclusion
of z into the regression is
(19.4.3)
SSE[constant term and X; y] − SSE[constant term, X, and z; y]
t2 = F =
SSE[constant term, X, and z; y]/(n − k − 1)
526 19. DIGRESSION ABOUT CORRELATION COEFFICIENTS
This is very similar to the formula for the squared partial correlation coefficient.
From (19.4.3) follows
SSE[constant term and X; y](n − k − 1)
(19.4.4) F +n−k−1=
SSE[constant term, X, and z; y]
and therefore
2 F
(19.4.5) rzy.X =
F +n−k−1
which is [Gre97, (6-29) on p. 248].
It should also be noted here that [Gre97, (6-36) on p. 254] is the sample equiv-
alent of (19.1.11).
CHAPTER 20
20.1. QR Decomposition
One precise and fairly efficient method to compute the Least Squares estimates
is the QR decomposition. It amounts to going over to an orthonormal basis in R[X].
It uses the following mathematical fact:
Every matrix X, which has full column rank, can be decomposed in the product
of two matrices QR, where Q has the same number of rows and columns as X, and
is “suborthogonal” or “incomplete orthogonal,” i.e., it satisfies Q> Q = I. The other
factor R is upper triangular and nonsingular.
To construct the least squares estimates, make a QR decomposition of the matrix
of explanatory variables X (which is assumed to have full column rank). With
527
528 20. NUMERICAL METHODS FOR COMPUTING OLS ESTIMATES
This last step can be made because R is nonsingular. (20.1.4) is a triangular system of
equations, which can be solved easily. Note that it is not necessary for this procedure
to compute the matrix X > X, which is a big advantage, since this computation is
numerically quite unstable.
>
Answer. X > X = R> Q> QR = R> R, its inverse is therefore R−1 R−1 .
20.1. QR DECOMPOSITION 529
1 5 −4
Answer.
1 −1 1 " #
1 3 −1
1 1 1 1
(20.1.6) Q= R=2 0 2 −2
2 1 −1 −1
0 0 1
1 1 −1
How to get it? We need a decomposition
" #
r11 r12 r13
(20.1.7) x1 x2 x3 = q 1 q2 q3 0 r22 r23
0 0 r33
where q > > > > > >
1 q 1 = q 2 q 2 = q 3 q 3 = 1 and q 1 q 2 = q 1 q 3 = q 2 q 3 = 0. First column: x1 = q 1 r11 and
q 1 must have unit length. This gives q > 1 = 1/2 1/2 1/2 1/2 and r11 = 2. Second column:
(20.1.8) x2 = q 1 r12 + q 2 r22
and q>
1 q2 = 0 and q>
2 q2 = 1. Premultiply (20.1.8) by q > >
1 to get q 1 x2 = r12 , i.e., r12 = 6.
>
Thus we know q 2 r22 = x2 − q 1 · 6 = −2 2 −2 2 . Now we have to normalize it, to get
530 20. NUMERICAL METHODS FOR COMPUTING OLS ESTIMATES
q 2 = −1/2 1/2 −1/2 1/2 and r22 = 4. The rest remains a homework problem. But I am
not sure if my numbers are right.
1 3 −1
Problem 263. 2 points Compute trace and determinant of 0 2 −2. Is
0 0 1
this matrix symmetric and, if so, is it nonnegative definite? Are its column vectors
linearly dependent? Compute the matrix product
1 −1 1
1 1 1 1 3 −1
(20.1.9) 1 −1 −1 0 2 −2
0 0 1
1 1 −1
For every matrix X one can find an orthogonal matrix Q such that Q> X has
zeros below the diagonal, call that matrix R. Alternatively one may say: every
matrix X can be written as the product of two matrices QR, where R is conformable
with X and has zeros below the diagonal, and Q is orthogonal.
To prove this, and also for the numerical procedure, we will build Q> as the
product of several orthogonal matrices, each converting one column of X into one
with zeros below the diagonal.
First note that for every vector v, the matrix I − v>2 v vv > is orthogonal. Given
X, let x be the first column
of √X. If x = o, then go on to the next column.
x11 + σ x> x
x21
Otherwise choose v = , where σ = 1 if x11 ≥ 0 and σ = −1
..
.
xn1
otherwise. (Mathematically, either σ − +1 or σ = −1 would do; but if one gives σ
the same sign as x11 , then the first element of v gets largest possible absolute value,
532 20. NUMERICAL METHODS FOR COMPUTING OLS ESTIMATES
the zeros into that matrix, it uses the “free” space to store the vector v. There is
almost enough room; the first nonzero element of v must be stored elsewhere. This
is why the QR decomposition in Splus has two main components: qr is a matrix
like a, and qraux is a vector of length ncols(a).
LINPACK
√ does not use or store exactly the same v as given here, but uses
u = v/(σ x> x) instead. The normalization does not affect the resulting orthogonal
transformation; its advantage is that the leading element of each vector, that which
is stored in qraux, is at the same time equal u> u/2. In other words, qraux doubles
up as the divisor in the construction of the orthogonal matrices.
In Splus type help(qr). At the end of the help file a program is given which
shows how the Q might be constructed from the fragments qr and qraux.
CHAPTER 21
About Computers
way to make GNU/Linux more and more user friendly. Windows, on the other hand,
has the following disadvantages:
• Microsoft Windows and the other commercial software are expensive.
• The philosophy of Microsoft Windows is to keep the user in the dark about
how the computer is working, i.e., turn the computer user into a passive
consumer. This severely limits the range of things you can do with your
computer. The source code of the programs you are using is usually unavail-
able, therefore you never know exactly what you are doing and you cannot
modify the program for your own uses. The unavailability of source code
also makes the programs more vulnerable to virus attacks and breakins.
In Linux, the user is the master of the computer and can exploit its full
potential.
• You spend too much time pointing and clicking. In GNU/Linux and other
unix systems, it is possible to set up menus too,m but everything that can
be done through a menu can also be done on the command line or through
a script.
• Windows and the commercial software based on it are very resource-hungry;
they require powerful computers. Computers which are no longer fast and
big enough to run the latest version of Windows are still very capable to
run Linux.
21.1. GENERAL STRATEGY 537
• It is becoming more and more apparent that free software is more stable
and of higher quality than commercial software. Free software is developed
by programmers throughout the world who want good tools for themselves.
• Most Linux distributions have excellent systems which allows the user to
automatically download always the latest versions of the software; this au-
tomates the tedious task of software maintenance, i.e., updating and fitting
together the updates.
Before taking the disk out you should give the command umount /floppy. You can
do this only if /floppy is not the current directory.
In order to remotely access X-windows from Microsoft-Windows, you have to go
through the following steps.
Something else: if I use the usual telnet program which comes with windows, in
order to telnet into a unix machine, and then I try to edit a file using emacs, it does
not work, it seems that some of the key sequences used by emacs make telnet hang.
Therefore I use a different telnet program, Teraterm Pro, with downloading instruc-
tions at http://www.egr.unlv.ecu/stock answers/remote access/install ttssh.
540 21. ABOUT COMPUTERS
not as powerful as Splus, but it is very similar, in the simple tasks almost identical.
There is also a GNU version of SPSS in preparation.
you invoked local will not accept any other commands, i.e., will be useless, until
you leave emacs again.
The emacs commands which you have to learn first are the help commands.
They all start with a C-h, i.e., control-h: type h while holding the control button
down. The first thing you may want to do at a quiet moment is go through the emacs
tutorial: get into emacs and then type C-h t and then follow instructions. Another
very powerful resource at your fingertip is emacs-info. To get into it type C-h i. It
has information pages for you to browse through, not only about emacs itself, but
also a variety of other subjects. The parts most important for you is the Emacs menu
item, which gives the whole Emacs-manual, and the ESS menu item, which explains
how to run Splus and SAS from inside emacs.
Another important emacs key is the “quit” command C-g. If you want to abort a
command, this will usually get you out. Also important command is the changing of
the buffer, C-x b. Usually you will have many buffers in emacs, and switch between
them if needed. The command C-x C-c terminates emacs.
Another thing I recommend you to learn is how to send and receive electronic
mail from inside emacs. To send mail, give the command C-x m. Then fill out address
and message field, and send it by typing C-c C-c. In order to receive mail, type M-x
rmail. There are a few one-letter commands which allow you to move around in
544 21. ABOUT COMPUTERS
browser it may not arrive in the right format. And the following SAS commands
deposit the data sets into your directory sasdata on your machine:
libname myec7800 ’mysasdata’;
proc cimport L=myec7800;
run;
21.5. INSTRUCTIONS FOR STATISTICS 5969, HANS EHRBAR’S SECTION 547
Problem 265. 6 points x <- 1:26; names(x) <- letters; vowels <- c("a",
"e", "i", "o", "u’’) Which R-expression returns the subvector of x correspond-
ing to all consonants?
Answer. x[-x[vowels]]
Problem 268. 2 points Use paste to get the character vector "1999:1" "1999:2"
"1999:3" "1999:4"
Answer. paste(1999, 1:4, sep=":")
550 21. ABOUT COMPUTERS
Problem 269. 5 points Do the exercise described on the middle of p. 17, i.e.,
compute the 95 percent confidence limits for the state mean incomes. You should be
getting the following intervals:
act nsw nt qld sa tas vic wa
63.56 68.41 112.68 65.00 63.72 66.85 70.56 60.71
25.44 46.25 -1.68 42.20 46.28 54.15 41.44 43.79
Answer. state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa", "qld", "vic"
"nsw", "vic", "qld", "qld", "sa", "tas", "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "
"sa", "act", "nsw", "vic", "vic", "act"); statef <- factor(state); incomes <- c(60, 49,
40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56, 61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48,
52, 46, 59, 46, 58, 43); incmeans <- tapply(incomes, statef, mean); stderr <- function(x
sqrt(var(x)/length(x)); incster <- tapply(incomes, statef, stderr); sampsize <- tapply(i
statef, length); Use 2-tail 5 percent, each tail has 2.5 percent: critval <- qt(0.975,sampsize-1);
conflow <- incmeans - critval * incster; confhigh <- incmeans + critval * incster; To prin
the confidence intervals use rbind(confhigh, conflow) which gives the following output:
21.5. INSTRUCTIONS FOR STATISTICS 5969, HANS EHRBAR’S SECTION 551
Problem 270. 4 points Use the cut function to generate a factor from the
variable ddpi in the data frame LifeCycleSavings. This factor should have the
three levels low for values ddpi ≤ 3, medium for values 3 < ddpi ≤ 6, and high for
the other values.
Monday June 18: graphical procedures, chapter 12. Please read this chapter
before coming to class, there will be a mini quiz again. For the following homework
it is helpful to do demo(graphics) and to watch closely which commands were used
there.
Problem 271. 5 points The data frame LifeCycleSavings has some egregious
outliers. Which plots allow you to identify those? Use those plots to determine which
of the data you consider outliers.
Answer. Do pairs(LifeCycleSavings) and look for panels which have isolated points. In
order to see which observation this is, do attach(LifeCycleSavings), then plot(sr,ddpi), then
identify(sr,ddpi). You see that 49 is clearly an outlier, and perhaps 47 and 23. Looking at some
other panels in the scatter plot matrix you will find that 49 always stands out, with also 47 and
44.
552 21. ABOUT COMPUTERS
Wednesday June 20: More language features, chapters 6–10, and the beginning
of statistical models, chapter 11. A Mini Quiz will check that you read chapters 6–10
before coming to class. Homework is an estimation problem.
Monday June 25: Mini Quiz about chapter 11. We will finish chapter 11. After
this session you will have a take-home final exam for this part of the class, using the
features of R. It will be due on Monday, July 2nd, at the beginning of class.
If you have installed wget in a location R can find it in (I think no longer
necessary).
21.5. INSTRUCTIONS FOR STATISTICS 5969, HANS EHRBAR’S SECTION 553
you writing the SAS code with the proper indentation. Say you have such a sas file
in your current buffer and you want to submit it to SAS. First do M-x SAS to start
SAS. This creates some other windows but your cursor should stay in the original
window with the sas-file. Then to C-c C-b to submit the whole buffer to SAS.
There are some shortcuts to switch between the buffers: C-c C-t switches you
into *SAS.lst* which lists the results of your computation.
For further work you may have to create a region in your buffer; go to the
beginning of the region and type C-@ (emacs will respond with the message in the
minibuffer: “mark set”), and then go to the end of the region. Before using the
region for editing, it is always good to do the command C-x C-x (which puts the
cursor where the mark was and the marker where the cursor was) to make sure the
region is what you want it to be. There is apparently a bug in many emacs versions
where the point jumps by a word when you do it the first time, but when you correct
it then it will stay. Emacs may also be configured in such a way that the region
becomes inactive if other editing is done before it is used; the command C-x C-x
re-activates the region. Then type C-c C-r to submit the region to the SAS process.
In order to make high resolution gs-plots, you have to put the following two lines
into your batch files. For interactive use on X-terminals you must comment them
out again (by putting /* in front and */ after them).
21.5. INSTRUCTIONS FOR STATISTICS 5969, HANS EHRBAR’S SECTION 555
S-mode in Emacs-info. They cannot be learned by trial and error, and they cannot
be learned in one or two sessions.
If you are sitting at the console, then you must give the command openwin()
to tell Splus to display high resolution graphs in a separate window. You will get a
postscript printout simply by clicking the mouse on the print button in this window.
If you are logged in over telnet and access Splus through emacs, then it is possible
to get some crude graphs on your screen after giving the command printer(width=79).
Your plotting commands will not generate a plot until you give the command show()
in order to tell Splus that now is the time to send a character-based plot to the screen.
Splus has a very convenient routine to translate SAS-datasets into Splus-datasets.
Assume there is a SAS dataset cobbdoug in the unix directory /home/econ/ehrbar/ec78
i.e., this dataset is located in a unix file by the name /home/econ/ehrbar/ec7800/sasda
Then the Splus-command mycobbdoug <- sas.get("/home/econ/ehrbar/ec7800/s
"cobbdoug") will create a Splus-dataframe with the same data in it.
In order to transfer Splus-files from one computer to another, use the data.dump
and data.restore commands.
To get out of Splus again, issue the command C-c C-q. It will ask you if you
want all temporary files and buffers deleted, and you should answer yes. This will
not delete the buffer with your Splus-commands in it. If you want a record of your
21.6. THE DATA STEP IN SAS 557
Splus-session, you should save this buffer in a file, by giving the command C-x C-s
(it will prompt you for a filename).
By the way, it is a good idea to do your unix commands through an emacs buffer
too. In this way you have a record of your session and you have easier facilities
to recall commands, which are usually the same as the commands you use in your
*S*-buffer. To do this you have to give the command M-x shell.
Books on Splus include the “Blue book” [BCW96] which unfortunately does
not discuss some of the features recently introduced into S, and the “White book”
[CH93] which covers what is new in the 1991 release of S. The files book.errata and
model.errata in the directory /usr/local/splus-3.1/doc/ specify known errors
in the Blue and White book.
Textbooks for using Splus include [VR99] which has an url www.stats.oz.ac.uk/pu
[Spe94], [Bur98] (downloadable for free from the internet), and [Eve94].
R has now a very convenient facility to automatically download and update
packages from CRAN. Look at the help page for update.packages.
Assume you have a dataset mydata which includes the variable year, and you
want to run a regression procedure only for the years 1950–59. This you can do by
including the following data step before running the regression:
data fifties;
set mydata;
if 1950 <= year <= 1959;
This works because the data step executes every command once for every obser-
vation. When it executes the set statement, It starts with the first observation and
includes every variable from the data set mydata into the new data set fifties; but
if the expression 1950 <= year <= 1959 is not true, then it throws this observation
out again.
Another example is: you want to transform some of the variables in your data
set. For instance you want to get aggregate capital stock, investment, and output
for all industries. Then you might issue the commands:
data aggregate;
set ec781.invconst;
kcon00=sum(of kcon20-kcon39);
icon00=sum(of icon20-icon39);
ocon00=sum(of ocon20-ocon39);
keep kcon00, icon00, ocon00, year;
21.6. THE DATA STEP IN SAS 559
The keep statement tells SAS to drop all the other variables, otherwise all variables
in ec781.invconst would also be in aggregate.
Assume you need some variables from ec781.invconst and some from ec781.invmi
Let us assume both have the same variable year. Then you can use the merge state-
ment:
data mydata;
merge ec781.invcost ec781.invmisc;
by year;
keep kcon20, icon20, ocon20, year, prate20, primeint;
For this step it is sometimes necessary to rename variables before merging. This can
be done by the rename option.
The by statement makes sure that the years in the different datasets do not get
mixed up. This allows you to use the merge statement also to get variables from the
Citybase, even if the starting end ending years are not the same as in our datasets.
An alternative, but not so good method would be to use two set statements:
data mydata;
set ec781.invcost;
set ec781.invmisc;
keep kcon20, icon20, ocon20, year, prate20, primeint;
560 21. ABOUT COMPUTERS
If the year variable is in both datasets, SAS will first take the year from invconst,
and overwrite it with the year data from invmisc, but it will not check whether the
years match. Since both datasets start and stop with the same year, the result will
still be correct.
If you use only one set statement with two datasets as arguments, the result
will not be what you want. The following is therefore wrong:
data mydata;
set ec781.invcost ec781.invmisc;
keep kcon20, icon20, ocon20, year, prate20, primeint;
Here SAS first reads all observations from the first dataset and then all observations
from the second dataset. Those variables in the first dataset which are not present
in the second dataset get missing values for the second dataset, and vice versa. So
you would end up with the variable year going twice from 1947 to 1985, and the
variables kcon20 having 39 missing values at the end, and prate having 39 missing
values at the beginning.
People who want to use some Citibase data should include the following options
on the proc citibase line: beginyr=47 endyr=85. If their data starts later, they
will add missing values at the beginning, but the data will still be lined up with your
data.
21.6. THE DATA STEP IN SAS 561
The retain statement tells SAS to retain the value of the variable from one loop
through the data step to the next (instead of re-initializing it as a missing value.)
The variable monthtot initially contains a missing value; if the data set does not
start with a January, then the total value for the first year will be a missing value,
since adding something to a missing value gives a missing value again. If the dataset
does not end with a December, then the (partial) sum of the months of the last year
will not be read into the new data set.
The variable date which comes with the citibase data is a special data type.
Internally it is the number of days since Jan 1, 1960, but it prints in several formats
directed by a format statement which is automatically given by the citibase proce-
dure. In order to get years, quarters, or months, use year(date), qtr(date), or
month(date). Therefore the conversion of monthly to yearly data would be now:
data annual;
set monthly;
retain monthtot;
if month(date)=1 then monthtot=0;
monthtot=monthtot+timeser;
if month(date)=12 then output;
yr=year(date);
keep yr monthtot;
CHAPTER 22
Specific Datasets
The assumption here is that output is the only random variable. The regression
model is based on the assumption that the dependent variables have more noise in
them than the independent variables. One can justify this by the argument that
any noise in the independent variables will be transferred to the dependent variable,
and also that variables which affect other variables have more steadiness in them
than variables which depend on others. This justification often has merit, but in the
specific case, there is much more measurement error in the labor and capital inputs
22.1. COBB DOUGLAS AGGREGATE PRODUCTION FUNCTION 565
than in the outputs. Therefore the assumption that only the output has an error
term is clearly wrong, and problem 275 below will look for possible alternatives.
Problem 274. Table 1 shows the data used by Cobb and Douglas in their original
article [CD28] introducing the production function which would bear their name.
output is “Day’s index of the physical volume of production (1899 = 100)” described
in [DP20], capital is the capital stock in manufacturing in millions of 1880 dollars
[CD28, p. 145], labor is the “probable average number of wage earners employed in
manufacturing” [CD28, p. 148], and wage is an index of the real wage (1899–1908
= 100).
• a. A text file with the data is available on the web at www.econ.utah.edu/
ehrbar/data/cobbdoug.txt, and a SDML file (XML for statistical data which can be
read by R, Matlab, and perhaps also SPSS) is available at www.econ.utah.edu/ehrbar/
data/cobbdoug.sdml. Load these data into your favorite statistics package.
Answer. In R, you can simply issue the command cobbdoug <- read.table("http://www.
econ.utah.edu/ehrbar/data/cobbdoug.txt", header=TRUE). If you run R on unix, you can also
do the following: download cobbdoug.sdml from the www, and then first issue the command
library(StatDataML) and then readSDML("cobbdoug.sdml"). When I tried this last, the XML pack-
age necessary for StatDataML was not available on windows, but chances are it will be when you
read this.
In SAS, you must issue the commands
566 22. SPECIFIC DATASETS
year 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910
output 100 101 112 122 124 122 143 152 151 126 155 159
capital 4449 4746 5061 5444 5806 6132 6626 7234 7832 8229 8820 9240
labor 4713 4968 5184 5554 5784 5468 5906 6251 6483 5714 6615 6807
year 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922
output 153 177 184 169 189 225 227 223 218 231 179 240
capital 9624 10067 10520 10873 11840 13242 14915 16265 17234 18118 18542 19192
labor 6855 7167 7277 7026 7269 8601 9218 9446 9096 9110 6947 7602
data cobbdoug;
infile ’cobbdoug.txt’;
input year output capital labor;
run;
But for this to work you must delete the first line in the file cobbdoug.txt which contains the
variable names. (Is it possible to tell SAS to skip the first line?) And you may have to tell SAS
22.1. COBB DOUGLAS AGGREGATE PRODUCTION FUNCTION 567
the full pathname of the text file with the data. If you want a permanent instead of a temporary
dataset, give it a two-part name, such as ecmet.cobbdoug.
Here are the instructions for SPSS: 1) Begin SPSS with a blank spreadsheet. 2) Open up a file
with the following commands and run:
SET
BLANKS=SYSMIS
UNDEFINED=WARN.
DATA LIST
FILE=’A:\Cbbunst.dat’ FIXED RECORDS=1 TABLE /1 year 1-4 output 5-9 capital
10-16 labor 17-22 wage 23-27 .
EXECUTE.
This files assume the data file to be on the same directory, and again the first line in the data file
with the variable names must be deleted. Once the data are entered into SPSS the procedures
(regression, etc.) are best run from the point and click environment.
• b. The next step is to look at the data. On [CD28, p. 150], Cobb and Douglas
plot capital, labor, and output on a logarithmic scale against time, all 3 series
normalized such that they start in 1899 at the same level =100. Reproduce this graph
using a modern statistics package.
568 22. SPECIFIC DATASETS
• c. Run both regressions (22.1.2) and (22.1.3) on Cobb and Douglas’s original
dataset. Compute 95% confidence intervals for the coefficients of capital and labor
in the unconstrained and the cconstrained models.
Answer. SAS does not allow you to transform the data on the fly, it insists that you first
go through a data step creating the transformed data, before you can run a regression on them.
Therefore the next set of commands creates a temporary dataset cdtmp. The data step data cdtmp
includes all the data from cobbdoug into cdtemp and then creates some transformed data as well.
Then one can run the regressions. Here are the commands; they are in the file cbbrgrss.sas in
your data disk:
data cdtmp;
set cobbdoug;
logcap = log(capital);
loglab = log(labor);
logout = log(output);
logcl = logcap-loglab;
logol = logout-loglab;
run;
proc reg data = cdtmp;
model logout = logcap loglab;
run;
proc reg data = cdtmp;
model logol = logcl;
run;
22.1. COBB DOUGLAS AGGREGATE PRODUCTION FUNCTION 569
1.0
@
0.9 @
0.8 @
@
0.7 @
0.6 @
@
0.5 @
0.4 .............. @
....... ..........................................
..... ...............
...... @ ...........
....... ...........
....... ..........
0.3 ........
........ @q ..........
.........
@q
........ ........................ .........
......... .... ....................... .........
......... ............... ........
........
0.2 .........
..........
..........
...........
........
.......
.......
@ ........... .......
.............. ......
0.1 @ .................
...................... ....
.........................
0.0
@
@
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
Figure 1. Coefficients of capital (vertical) and labor (horizon-
tal), dependent variable output, unconstrained and constrained,
1899–1922
22.1. COBB DOUGLAS AGGREGATE PRODUCTION FUNCTION 571
Problem 275. In this problem we will treat the Cobb-Douglas data as a dataset
with errors in all three variables. See chapter 53.4 and problem 476 about that.
• a. Run the three elementary regressions for the whole period, then choose at
least two subperiods and run it for those. Plot all regression coefficients as points
in a plane, using different colors for the different subperiods (you have to normalize
them in a special way that they all fit on the same plot).
Answer. Here are the results in R:
> outputlm<-lm(log(output)~log(capital)+log(labor),data=cobbdoug)
> capitallm<-lm(log(capital)~log(labor)+log(output),data=cobbdoug)
> laborlm<-lm(log(labor)~log(output)+log(capital),data=cobbdoug)
> coefficients(outputlm)
(Intercept) log(capital) log(labor)
-0.1773097 0.2330535 0.8072782
> coefficients(capitallm)
(Intercept) log(labor) log(output)
-2.72052726 -0.08695944 1.67579357
> coefficients(laborlm)
(Intercept) log(output) log(capital)
1.27424214 0.73812541 -0.01105754
Call:
lm(formula = log(output) ~ log(capital) + log(labor), data = cobbdoug)
Residuals:
Min 1Q Median 3Q Max
-0.075282 -0.035234 -0.006439 0.038782 0.142114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.17731 0.43429 -0.408 0.68721
log(capital) 0.23305 0.06353 3.668 0.00143 **
log(labor) 0.80728 0.14508 5.565 1.6e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Coefficients:
(Intercept) log(capital)
log(capital) 0.7243
log(labor) -0.9451 -0.9096
22.1. COBB DOUGLAS AGGREGATE PRODUCTION FUNCTION 573
[1] 3.4668
• b. The elementary regressions will give you three fitted equations of the form
(22.1.6) output = α̂1 + β̂12 labor + β̂13 capital + residual1
(22.1.7) labor = α̂2 + β̂21 output + β̂23 capital + residual2
(22.1.8) capital = α̂3 + β̂31 output + β̂32 labor + residual3 .
In order to compare the slope parameters in these regressions, first rearrange them
in the form
(22.1.9) −output + β̂12 labor + β̂13 capital + α̂1 + residual1 = 0
(22.1.10) β̂21 output − labor + β̂23 capital + α̂2 + residual2 = 0
(22.1.11) β̂31 output + β̂32 labor − capital + α̂3 + residual3 = 0
This gives the following table of coefficients:
574 22. SPECIFIC DATASETS
Problem 276. Given a univariate problem with three variables all of which have
zero mean, and a linear constraint that the coefficients of all variables sum to 0. (This
is the model apparently appropriate to the Cobb-Douglas data, with the assumption
of constant returns to scale, after taking out the means.) Call the observed variables
x, y, and z, with underlying systematic variables x∗ , y ∗ , and z ∗ , and errors u, v,
and w.
Answer.
"
−1
# x∗ = βy ∗ + (1 − β)z ∗
x∗ y∗ z∗ β =0 x = x∗ + u
(22.1.12) 1−β or
y = y∗ + v
x y z = x∗ y∗ z∗ + u v w z = z ∗ + w.
• b. The moment matrix of the systematic variables can be written fully in terms
of σy2∗ , σz2∗ , σy∗ z∗ , and the unknown parameter β. Write out the moment matrix and
therefore the Frisch decomposition.
Answer. The moment matrix is the middle matrix in the following Frisch decomposition:
σx2
" #
σxy σxz
(22.1.13) σxy σy2 σyz =
σxz σyz σz2
(22.1.14)
" 2 2
β σy∗ + 2β(1 − β)σy∗ z∗ + (1 − β)2 σz2∗ βσy2∗ + (1 − β)σy∗ z∗ βσy∗ z∗ + (1 − β)σz2∗
# " 2
σu
= βσy2∗ + (1 − β)σy∗ z∗ σy2∗ σy ∗ z ∗ + 0 σ
βσy∗ z∗ + (1 − β)σz2∗ σy2∗ σz2∗ 0
22.1. COBB DOUGLAS AGGREGATE PRODUCTION FUNCTION 577
• c. Show that the unknown parameters are not yet identified. However, if one
makes the additional assumption that one of the three error variances σu2 , σv2 , or σw
2
is zero, then the equations are identified. Since the quantity of output presumably
has less error than the other two variables, assume σu2 = 0. Under this assumption,
show that
σx2 − σxz
(22.1.15) β=
σxy − σxz
and this can be estimated by replacing the variances and covariances by their sample
counterparts. In a similar way, derive estimates of all other parameters of the model.
Answer. Solving (22.1.14) one gets from the yz element of the covariance matrix
σxy − (1 − β)σyz
(22.1.18) σy2∗ =
β
578 22. SPECIFIC DATASETS
Now plug (22.1.16), (22.1.17), and (22.1.18) into the equation for the xx element:
2
(22.1.20) = βσxy + (1 − β)σxz + σu
2 = 0 this last equation can be solved for β:
Since we are assuming σu
σx2 − σxz
(22.1.21) β=
σxy − σxz
If we replace the variances and covariances by the sample variances and covariances, this gives an
estimate of β.
• d. Evaluate these formulas numerically. In order to get the sample means and
the sample covariance matrix of the data, you may issue the SAS commands
proc corr cov nocorr data=cdtmp;
var logout loglab logcap;
run;
These commands are in the file cbbcovma.sas on the disk.
Answer. Mean vector and covariance matrix are
" # "5.07734# "0.0724870714 #
LOGOUT 0.0522115563 0.1169330807
(22.1.22) LOGLAB ∼ 4.96272 , 0.0522115563 0.0404318579 0.0839798588
LOGCAP 5.35648 0.1169330807 0.0839798588 0.2108441826
22.1. COBB DOUGLAS AGGREGATE PRODUCTION FUNCTION 579
0.0724870714 − 0.1169330807
(22.1.23) β̂ = = 0.686726861149148
0.0522115563 − 0.1169330807
In Figure 3, the point (β̂, 1− β̂) is exactly the intersection of the long dotted line with the constraint.
• e. The fact that all 3 points lie almost on the same line indicates that there may
be 2 linear relations: log labor is a certain coefficient times log output, and log capital
is a different coefficient times log output. I.e., y ∗ = δ1 + γ1 x∗ and z ∗ = δ2 + γ2 x∗ .
In other words, there is no substitution. What would be the two coefficients γ1 and
γ2 if this were the case?
σx2
" # " # " 2
#
σxy σxz 1 γ1 γ2 σu 0 0
(22.1.24) σxy σy2 σyz = σx2∗ γ1 γ12 γ1 γ 2 + 0 σv2 0 .
σxz σyz σz2 γ2 γ1 γ 2 γ22 0 0 2
σw
580 22. SPECIFIC DATASETS
Solving this gives (obtain γ1 by dividing the 32-element by the 31-element, γ2 by dividing the
32-element by the 12-element, σx2∗ by dividing the 21-element by γ1 , etc.
(22.1.25)
σyz 0.0839798588 2 σyx σxz
γ1 = = = 0.7181873452513939 σu = σx2 − = 0.072487071
σxy 0.1169330807 σyz
σyz 0.0839798588 σxy σyz
γ2 = = = 1.608453467992104 σv2 = σy2 −
σxz 0.0522115563 σxz
σyx σxz 0.0522115563 · 0.1169330807 2 σxz σzy
σx2∗ = = = 0.0726990758 σw = σz2 −
σyz 0.0839798588 σxy
This model is just barely rejected by the data since it leads to a slightly negative variance for U .
• f. The assumption that there are two linear relations is represented as the
light-blue line in Figure 3. What is the equation of this line?
Answer. If y = γ1 x and z = γ2 x then the equation x = β1 y + β2 z holds whenever β1 γ1 +
β2 γ2 = 1. This is a straight line in the β1 , β2 -plane, going through the points and (0, 1/γ2 ) =
(0, 0.0522115563
0.0839798588
0.1169330807
= 0.6217152189353289) and (1/γ1 , 0) = ( 0.0839798588 = 1.3923943475361023, 0).
This line is in the figure, and it is just a tiny bit on the wrong side of the dotted line connecting
the two estimates.
One example described there is the estimation of a demand function for electric-
ity [Hou51], which is the first multiple regression with several variables run on a
computer. In this exercise you are asked to do all steps in exercise 1 and 3 in chapter
7 of Berndt, and use the additional facilities of R to perform other steps of data
analysis which Berndt did not ask for, such as, for instance, explore the best subset
of regressors using leaps and the best nonlinear transformation using avas, do some
diagnostics, search for outliers or influential observations, and check the normality
of residuals by a probability plot.
Problem 277. 4 points The electricity demand date from [Hou51] are avail-
able on the web at www.econ.utah.edu/ehrbar/data/ukelec.txt. Import these
data into your favorite statistics package. For R you need the command ukelec <-
read.table("http://www.econ.utah.edu/ehrbar/data/ukelec.txt"). Make a
scatterplot matrix of these data using e.g. pairs(ukelec) and describe what you
see.
Answer. inc and cap are negatively correlated. cap is capacity of rented equipment and not
equipment owned. Apparently customers with higher income buy their equipment instead of renting
it.
gas6 and gas8 are very highly correlated. mc4, mc6, and mc8 are less hightly correlated, the
corrlation between mc6 and mc8 is higher than that between mc4 and mc6. It seem electicity prices
have been coming down.
582 22. SPECIFIC DATASETS
If you simply type ukelec in R, it will print the data on the screen. The variables
have the following meanings:
cust Average number of consumers with two-part tariffs for electricity in 1937–
38, in thousands. Two-part tariff means: they pay a fixed monthly sum plus a certain
“running charge” times the number of kilowatt hours they use.
inc Average income of two-part consumers, in pounds per year. (Note that one
pound had 240 pence at that time.)
22.2. HOUTHAKKER’S DATA 583
mc4 The running charge (marginal cost) on domestic two-part tariffs in 1933–34,
in pence per KWH. (The marginal costs are the costs that depend on the number of
kilowatt hours only, it is the cost of one additional kilowatt hour.
mc6 The running charge (marginal cost) on domestic two-part tariffs in 1935–36,
in pence per KWH
mc8 The running charge (marginal cost) on domestic two-part tariffs in 1937–38,
in pence per KWH
gas6 The marginal price of gas in 1935–36, in pence per therm
gas8 The marginal price of gas in 1937–38, in pence per therm
kwh Consumption on domestic two-part tariffs per consumer in 1937–38, in kilo-
watt hours
cap The average holdings (capacity) of heavy electric equipment bought on hire
purchase (leased) by domestic two-part consumers in 1937–38, in kilowatts
expen The average total expenditure on electricity by two-part consumers in
1937–38, in pounds
The function summary(ukelec) displays summary statistics about every vari-
able.
Since every data frame in R is a list, it is possible to access the variables in ukelec
by typing ukelec$mc4 etc. Try this; if you type this and then a return, you will get
a listing of mc4. In order to have all variables available as separate objects and save
584 22. SPECIFIC DATASETS
typing ukelec$ all the time, one has to “mount” the data frame by the command
attach(ukelec). After this, the individual data series can simply be printed on the
screen by typing the name of the variable, for instance mc4, and then the return key.
Problem 278. 2 points Make boxplots of mc4, mc6, and mc6 in the same graph
next to each other, and the same with gas6 and gas8.
Problem 279. 2 points How would you answer the question whether marginal
prices of gas vary more or less than those of electricity (say in the year 1936)?
Answer. Marginal gas prices vary a little more than electricity prices, although electricity
was the newer technology, and although gas prices are much more stable over time than the elec-
tricity prices. Compare sqrt(var(mc6))/mean(mc6) with sqrt(var(gas6))/mean(gas6). You get
0.176 versus 0.203. Another way would be to compute max(mc6)/min(mc6) and compare with
max(gas6)/min(gas6): you get 2.27 versus 2.62. In any case this is a lot of variation.
Problem 280. 2 points Make a plot of the (empirical) density function of mc6
and gas6 and interpret the results.
Problem 281. 2 points Is electricity a big share of total income? Which com-
mand is better: mean(expen/inc) or mean(expen)/mean(inc)? What other options
are there? Actually, there is a command which is clearly better than at least one of
the above, can you figure out what it is?
22.2. HOUTHAKKER’S DATA 585
Answer. The proportion is small, less than 1 percent. The two above commands give 0.89%
and 0.84%. The command sum(cust*expen) / sum(cust*inc) is better than mean(expen) / mean(inc
because each component in expen and inc is the mean over many households, the number of house-
holds given by cust. mean(expen) is therefore an average over averages over different popula-
tion sizes, not a good idea. sum(cust*expen) is total expenditure in all households involved, and
sum(cust*inc) is total income in all households involved. sum(cust*expen) / sum(cust*inc) gives
the value 0.92%. Another option is median(expen/inc) which gives 0.91%. A good way to answer
this question is to plot it: plot(expen,inc). You get the line where expenditure is 1 percent of
income by abline(0,0.01). For higher incomes expenditure for electricity levels off and becomes a
lower share of income.
Problem 282. Have your computer compute the sample correlation matrix of
the data. The R-command is cor(ukelec)
In the logarithmic data, cust has higher correlations than in the non-logarithmic data, and it
is also more nearly normally distributed.
inc has negative correlation with mc4 but positive correlation with mc6 and mc8. (If one looks
at the scatterplot matrix this seems just random variations in an essentially zero correlation).
mc6 and expen are positively correlated, and so are mc8 and expen. This is due to the one
outlier with high expen and high income and also high electricity prices.
The marginal prices of electricity are not strongly correlated with expen, and in 1934, they are
negatively correlated with income.
From the scatter plot of kwh versus cap it seems there are two datapoints whose removal
might turn the sign around. To find out which they are do plot(kwh,cap) and then use the identify
function: identify(kwh,cap,labels=row.names(ukelec)). The two outlying datapoints are Halifax
and Wallase. Wallase has the highest income of all towns, namely, 1422, while Halifax’s income of
352 is close to the minimum, which is 279. High income customers do not lease their equipment
but buy it.
• b. 3 points The correlation matrix says that kwh is negatively related with cap,
but the correlation of the logarithm gives the expected positive sign. Can you explain
this behavior?
Answer. If one plots the date using plot(cap,kwh) one sees that the negative correlation
comes from the two outliers. In a logarithmic scale, these two are no longer so strong outliers.
22.2. HOUTHAKKER’S DATA 587
After this preliminary look at the data, let us run the regressions.
Problem 284. 6 points Write up the main results from the regressions which in
R are run by the commands
houth.olsfit <- lm(formula = kwh ~ inc+I(1/mc6)+gas6+cap)
houth.glsfit <- lm(kwh ~ inc+I(1/mc6)+gas6+cap, weight=cust)
houth.olsloglogfit <- lm(log(kwh) ~
log(inc)+log(mc6)+log(gas6)+log(cap))
Instead of 1/mc6 you had to type I(1/mc6) because the slash has a special meaning
in formulas, creating a nested design, therefore it had to be “protected” by applying
the function I() to it.
If you then type houth.olsfit, a short summary of the regression results will be
displayed on the screen. There is also the command summary(houth.olsfit), which
gives you a more detailed summary. If you type plot(houth.olsfit) you will get a
series of graphics relevant for this regression.
Gas prices do not play a great role in determining electricity consumption, despite the “cook-
ers” Berndt talks about on p. 337. Especially the logarithmic regression makes gas prices highly
insignificant!
The weighted estimation has a higher R2 .
Problem 285. 2 points The output of the OLS regression gives as standard
error of inc the value of 0.18, while in the GLS regression it is 0.20. For the other
variables, the standard error as given in the GLS regression is lower than that in the
OLS regression. Does this mean that one should use for inc the OLS estimate and
for the other variables the GLS estimates?
Problem 286. 5 points Show, using the leaps procedure om R or some other
selection of regressors, that the variables Houthakker used in his GLS-regression are
the “best” among the following: inc, mc4, mc6, mc8, gas6, gas8, cap using ei-
ther the Cp statistic or the adjusted R2 . (At this stage, do not transform the variables
but just enter them into the regression untransformed, but do use the weights, which
are theoretically well justified).
To download the leaps package, use install.packages("leaps", lib="C:/Docu
and Settings/420lab.420LAB/My Documents") and to call it up, use library(leaps
lib.loc="C:/Documents and Settings/420lab.420LAB/My Documents"). If the
library ecmet is available, the command ecmet.script(houthsel) runs the follow-
ing script:
22.2. HOUTHAKKER’S DATA 589
library(leaps)
data(ukelec)
attach(ukelec)
houth.glsleaps<-leaps(x=cbind(inc,mc4,mc6,mc8,gas6,gas8,cap),
y=kwh, wt=cust, method="Cp",
nbest=5, strictly.compatible=F)
ecmet.prompt("Plot Mallow’s Cp against number of regressors:")
plot(houth.glsleaps$size, houth.glsleaps$Cp)
ecmet.prompt("Throw out all regressions with a Cp > 50 (big gap)")
plot(houth.glsleaps$size[houth.glsleaps$Cp<50],
houth.glsleaps$Cp[houth.glsleaps$Cp<50])
ecmet.prompt("Cp should be roughly equal the number of regressors")
abline(0,1)
cat("Does this mean the best regression is overfitted?")
ecmet.prompt("Click at the points to identify them, left click to quit")
## First construct the labels
lngth <- dim(houth.glsleaps$which)[1]
included <- as.list(1:lngth)
for (ii in 1:lngth)
included[[ii]] <- paste(
590 22. SPECIFIC DATASETS
colnames(houth.glsleaps$which)[houth.glsleaps$which[ii,]],
collapse=",")
identify(x=houth.glsleaps$size, y=houth.glsleaps$Cp, labels=included)
ecmet.prompt("Now use regsubsets instead of leaps")
houth.glsrss<- regsubsets.default(x=cbind(inc,mc4,mc6,mc8,gas6,gas8,cap)
y=kwh, weights=cust, method="exhaustive")
print(summary.regsubsets(houth.glsrss))
plot.regsubsets(houth.glsrss, scale="Cp")
ecmet.prompt("Now order the variables")
houth.glsrsord<- regsubsets.default(x=cbind(inc,mc6,cap,gas6,gas8,mc8,mc
y=kwh, weights=cust, method="exhaustive")
print(summary.regsubsets(houth.glsrsord))
plot.regsubsets(houth.glsrsord, scale="Cp")
the GLS-regression Houthakker actually ran is the “best” regression among the fol-
lowing variables: inc, 1/mc4, 1/mc6, 1/mc8, gas6, gas8, cap using either the
Cp statistic or the adjusted R2 .
Problem 289. Diagnostics, the identification of outliers or influential observa-
tions is something which we can do easily with R, although Berndt did not ask for it.
The command houth.glsinf<-lm.influence(houth.glsfit) gives you the build-
ing blocks for many of the regression disgnostics statistics. Its output is a list if three
objects: A matrix whose rows are all the the least squares estimates β̂(i) when the
ith observation is dropped, a vector with all the s(i), and a vector with all the hii .
A more extensive function is influence.measures(houth.glsfit), it has Cook’s
distance and others.
In order to look at the residuals, use the command plot(resid(houth.glsfit),
type="h") or plot(rstandard(houth.glsfit), type="h") or plot(rstudent(houth
type="h"). To add the axis do abline(0,0). If you wanted to check the residuals
for normality, you would use qqnorm(rstandard(houth.glsfit)).
Problem 290. Which commands do you need to plot the predictive residuals?
Problem 291. 4 points Although there is good theoretical justification for using
cust as weights, one might wonder if the data bear this out. How can you check this?
592 22. SPECIFIC DATASETS
Problem 292. The variable cap does not measure the capacity of all electrical
equipment owned by the households, but only those appliances which were leased from
the Electric Utility company. A plot shows that people with higher income do not
lease as much but presumably purchase their appliances outright. Does this mean the
variable should not be in the regression?
estimates and constructed investment from it. In the 1800s, only a few observations
were available, which were then interpolated. The capacity utilization ratio is equal
to the ratio of gnp2 to its trend, i.e., it may be negative.
Here are some possible commands for your R-session: data(uslt) makes the data
available; uslt.clean<-na.omit(uslt) removes missing values; this dataset starts
in 1869 (instead of 1805). attach(uslt.clean) makes the variables in this dataset
available. Now you can plot various series, for instance plot((nnp-hours*wage)/nnp,
type="l") plots the profit share, or plot(gnp/gnp2, kg/kg2, type="l") gives you
a scatter plot of the price level for capital goods versus that for gnp. The command
plot(r, kn2/hours, type="b") gives both points and dots; type = "o" will have
the dots overlaid the line. After the plot you may issue the command identify(r,
kn2/hours, label=1869:1989) and then click with the left mouse button on the
plot those data points for which you want to have the years printed.
If you want more than one timeseries on the same plot, you may do matplot(1869:1
cbind(kn2,kns2), type="l"). If you want the y-axis logarithmic, say matplot(1869:
cbind(gnp/gnp2,kns/kns2,kne/kne2), type="l", log="y").
Problem 293. Computer assignment: Make a number of such plots on the
screen, and import the most interesting ones into your wordprocessor. Each class
participant should write a short paper which shows the three most insteresting plots,
together with a written explanation why these plots seem interesting.
594 22. SPECIFIC DATASETS
To use pairs or xgobi, you should carefully select the variables you want to in-
clude, and then you need the following preparations: usltsplom <- cbind(gnp2=gnp2,
kn2=kn2, inv2=inv2, hours=hours, year=1869:1989) dimnames(usltsplom)[[1]
<- paste(1869:1989) The dimnames function adds the row labels to the matrix, so
that you can see which year it is. pairs(usltsplom) or library(xgobi) and then
xgobi(usltsplom)
You can also run regressions with commands of the following sort: lm.fit <-
lm(formula = gnp2 ~ hours + kne2 + kns2). You can also fit a “generalized ad-
ditive model” with the formula gam.fit <- gam(formula = gnp2 ~ s(hours) +
s(kne2) + s(kns2)). This is related to the avas command we talked about in
class. It is discussed in [CH93].
gender
race Female Male
Hisp 12 24
Nonwh 28 29
Other 167 290
> #Berndt also asked for the sample means of certain dummy variables;
> #This has no interest in its own right but was an intermediate
> #step in order to compute the numbers of cases as above.
> ##Exercise 1c (2 points) can be answered using tapply
> tapply(ed,gender,mean)
Female Male
12.76329 12.39942
> #now the standard deviation:
> sqrt(tapply(ed,gender,var))
Female Male
2.220165 3.052312
> #Women do not have less education than men; it is about equal,
> #but their standard deviation is smaller
> #Now the geometric mean of the wage rate:
> exp(tapply(lnwage,gender,mean))
22.5. WAGE DATA 599
Female Male
4.316358 6.128320
> #Now do the same with race
> ##Exercise 1d (4 points)
> detach()
> ##This used to be my old command:
> cps85 <- read.table("~/dpkg/ecmet/usr/share/ecmet/usr/lib/R/library/ec
> #But this should work for everyone (perhaps only on linux):
> cps85 <- readSDML("http://www.econ.utah.edu/ehrbar/data/cps85.sdml")
> attach(cps85)
> mean(exp(lnwage))
[1] 9.023947
> sqrt(var(lnwage))
[1] 0.5277335
> exp(mean(lnwage))
[1] 7.83955
> 2000*exp(mean(lnwage))
[1] 15679.1
> 2000*exp(mean(lnwage))/1.649
[1] 9508.248
600 22. SPECIFIC DATASETS
data: lnwage
D = 0.8754, p-value = < 2.2e-16
alternative hypothesis: two.sided
data: lnwage
D = 0.0426, p-value = 0.2879
alternative hypothesis: two.sided
data: wage
D = 0.1235, p-value = 1.668e-07
alternative hypothesis: two.sided
Call:
22.5. WAGE DATA 603
Residuals:
Min 1Q Median 3Q Max
-2.123168 -0.331368 -0.007296 0.319713 1.594445
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.030445 0.092704 11.115 < 2e-16 ***
ed 0.051894 0.007221 7.187 2.18e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #One year of education increases wages by 5 percent, but low R^2.
> #Mincer (5.18) had 7 percent for 1959
> #Now we need a 95 percent confidence interval for this coefficient
604 22. SPECIFIC DATASETS
Call:
lm(formula = lnwage ~ union + ed, data = cps78)
Residuals:
Min 1Q Median 3Q Max
-2.331754 -0.294114 0.001475 0.263843 1.678532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.859166 0.091630 9.376 < 2e-16 ***
unionUnion 0.305129 0.041800 7.300 1.02e-12 ***
ed 0.058122 0.006952 8.361 4.44e-16 ***
---
22.5. WAGE DATA 605
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> exp(0.058)
[1] 1.059715
> exp(0.305129)
[1] 1.3568
> # Union members have 36 percent higher wages
> # The test whether union and nonunion members have the same intercept
> # is the same as the test whether the union dummy is 0.
> # t-value = 7.300 which is highly significant,
> # i.e., they are different.
Nonun 0
Union 1
> #One sees it also if one runs
> model.matrix(lnwage ~ union + ed, data=cps78)
(Intercept) union ed
1 1 0 12
2 1 1 12
3 1 1 6
4 1 1 12
5 1 0 12
> #etc, rest of output flushed
> #and compares this with
> cps78$union[1:5]
[1] Nonun Union Union Union Nonun
Levels: Nonun Union
> #Consequently, the intercept for nonunion is 0.8592
> #and the intercept for union is 0.8592+0.3051=1.1643.
> #Can I have a different set of dummies constructed from this factor?
> #We will first do
> ##Exercise 2e (2 points)
22.5. WAGE DATA 607
> contrasts(union)<-matrix(c(1,0),nrow=2,ncol=1)
> #This generates a new contrast matrix
> #which covers up that in cps78
> #Note that I do not say "data=cps78" in the next command:
> summary(lm(lnwage ~ union + ed))
Call:
lm(formula = lnwage ~ union + ed)
Residuals:
Min 1Q Median 3Q Max
-2.331754 -0.294114 0.001475 0.263843 1.678532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.164295 0.090453 12.872 < 2e-16 ***
union1 -0.305129 0.041800 -7.300 1.02e-12 ***
ed 0.058122 0.006952 8.361 4.44e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
608 22. SPECIFIC DATASETS
Call:
lm(formula = lnwage ~ union + ed - 1)
Residuals:
Min 1Q Median 3Q Max
-2.331754 -0.294114 0.001475 0.263843 1.678532
22.5. WAGE DATA 609
Coefficients:
Estimate Std. Error t value Pr(>|t|)
union1 0.859166 0.091630 9.376 < 2e-16 ***
union2 1.164295 0.090453 12.872 < 2e-16 ***
ed 0.058122 0.006952 8.361 4.44e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = lnwage ~ union + ed - 1, data = cps85)
610 22. SPECIFIC DATASETS
Coefficients:
unionNonunion unionUnion ed
0.9926 1.2909 0.0778
Call:
lm(formula = lnwage ~ gender * marr + ed + ex + I(ex^2), data = cps78)
Residuals:
Min 1Q Median 3Q Max
-2.45524 -0.24566 0.01969 0.23102 1.42437
Coefficients:
612 22. SPECIFIC DATASETS
> #Being married raises the wage for men by 13% but lowers it for women
> ###Exercise 4a (5 points):
> summary(lm(lnwage ~ union + gender + race + ed + ex + I(ex^2), data=cp
Call:
22.5. WAGE DATA 613
Residuals:
Min 1Q Median 3Q Max
-2.41914 -0.23674 0.01682 0.21821 1.31584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1549723 0.1068589 1.450 0.14757
unionUnion 0.2071429 0.0368503 5.621 3.04e-08 ***
genderMale 0.3060477 0.0344415 8.886 < 2e-16 ***
raceNonwh -0.1301175 0.0830156 -1.567 0.11761
raceOther 0.0271477 0.0688277 0.394 0.69342
ed 0.0746097 0.0066521 11.216 < 2e-16 ***
ex 0.0261914 0.0047174 5.552 4.43e-08 ***
I(ex^2) -0.0003082 0.0001015 -3.035 0.00252 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
614 22. SPECIFIC DATASETS
> exp(-0.1301175)
[1] 0.8779923
> #Being Hispanic lowers wages by 2.7%, byut being black lowers them
> #by 12.2 %
> 0.0261914/(2*0.0003082)
[1] 42.49091
Call:
lm(formula = lnwage ~ gender + union + race + ed + ex + I(ex^2) +
I(ed * ex), data = cps78)
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0396495 0.1789073 0.222 0.824693
genderMale 0.3042639 0.0345241 8.813 < 2e-16 ***
unionUnion 0.2074045 0.0368638 5.626 2.96e-08 ***
616 22. SPECIFIC DATASETS
Call:
lm(formula = lnwage ~ gender + union + race + ed + ex + I(ex^2),
data = cps78)
Residuals:
Min 1Q Median 3Q Max
-2.41914 -0.23674 0.01682 0.21821 1.31584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1549723 0.1068589 1.450 0.14757
genderMale 0.3060477 0.0344415 8.886 < 2e-16 ***
unionUnion 0.2071429 0.0368503 5.621 3.04e-08 ***
raceNonwh -0.1301175 0.0830156 -1.567 0.11761
raceOther 0.0271477 0.0688277 0.394 0.69342
ed 0.0746097 0.0066521 11.216 < 2e-16 ***
ex 0.0261914 0.0047174 5.552 4.43e-08 ***
I(ex^2) -0.0003082 0.0001015 -3.035 0.00252 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
618 22. SPECIFIC DATASETS
Call:
lm(formula = lnwage ~ gender + race + ed + ex + I(ex^2), data = cps78,
subset = union == "Union")
620 22. SPECIFIC DATASETS
Residuals:
Min 1Q Median 3Q Max
-2.3307 -0.1853 0.0160 0.2199 1.1992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9261456 0.2321964 3.989 0.000101 ***
genderMale 0.2239370 0.0684894 3.270 0.001317 **
raceNonwh -0.3066717 0.1742287 -1.760 0.080278 .
raceOther -0.0741660 0.1562131 -0.475 0.635591
ed 0.0399500 0.0138311 2.888 0.004405 **
ex 0.0313820 0.0098938 3.172 0.001814 **
I(ex^2) -0.0004526 0.0002022 -2.239 0.026535 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = lnwage ~ gender + race + ed + ex + I(ex^2), data = cps78,
subset = union == "Nonun")
Residuals:
Min 1Q Median 3Q Max
-1.39107 -0.23775 0.01040 0.23337 1.29073
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0095668 0.1193399 -0.080 0.9361
genderMale 0.3257661 0.0397961 8.186 4.22e-15 ***
raceNonwh -0.0652018 0.0960570 -0.679 0.4977
raceOther 0.0444133 0.0761628 0.583 0.5602
ed 0.0852212 0.0075554 11.279 < 2e-16 ***
ex 0.0253813 0.0053710 4.726 3.25e-06 ***
I(ex^2) -0.0002841 0.0001187 -2.392 0.0172 *
622 22. SPECIFIC DATASETS
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 0.8519796-0.0348465
[1] 0.8171331
>
Call:
lm(formula = lnwage ~ gender + union + race + ed + ex + I(ex^2))
Residuals:
Min 1Q Median 3Q Max
-2.41914 -0.23674 0.01682 0.21821 1.31584
Coefficients:
624 22. SPECIFIC DATASETS
> #To test whether Nonwh and Hisp have same intercept
> #one might generate a contrast matrix which collapses those
> #two and then run it and make an F-test
> #or make a contrast matrix which has this difference as one of
22.5. WAGE DATA 625
> #the dummies and use the t-test for that dummy
> ##Exercise 6b (2 points)
> table(race)
race
Hisp Nonwh Other
36 57 457
> tapply(lnwage, race, mean)
Hisp Nonwh Other
1.529647 1.513404 1.713829
> tapply(lnwage, race, ed)
Error in get(x, envir, mode, inherits) : variable "ed" was not found
> tapply(ed, race, mean)
Hisp Nonwh Other
10.30556 11.71930 12.81400
> table(gender, race)
race
gender Hisp Nonwh Other
Female 12 28 167
Male 24 29 290
> #Blacks, almost as many women than men, hispanic twice as many men,
626 22. SPECIFIC DATASETS
>
> #Additional stuff:
> #There are two outliers in cps78 with wages of less than $1 per hour,
> #Both service workers, perhaps waitresses who did not report her tips?
> #What are the commands for extracting certain observations
> #by certain criteria and just print them? The split command.
>
> #Interesting to do
> loess(lnwage ~ ed + ex, data=cps78)
> #loess is appropriate here because there are strong interation terms
> #How can one do loess after taking out the effects of gender for insta
> #Try the following, but I did not try it out yet:
> gam(lnwage ~ lo(ed,ex) + gender, data=cps78)
> #I should put more plotting commands in!
22.5. WAGE DATA 627
@
@
@
@
@
@
@
q capital
@
@
@
@
@
@
@
@qoutput no error, crs
@ c Cobb Douglas’s original result
@ q
output
@
@
@
@
@ qlabor -
1.0
@
0.9 @
0.8 @
@
0.7 @
0.6 q
.......... capital all errors
..........
..........
..........
@
..........
.......... @
..........
0.5 ..........
.........
........... @
..........
..........
..........
0.4 @
...........
..........
..........
@ q .......... output no error, crs
..........
..........
0.3 ..........
.........
@q
@ ......output
.......... all errors
.........
...........
0.2 ..........
..........
..........
@ ..........
..........
..........
0.1 @ .........
...........
..........
0.0
@
@ q .......... labor
..........
........
all errors
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
Figure 3. Coefficient of capital (vertical) and labor (horizontal)
in the elementary regressions, dependent variable output, 1899–1922
CHAPTER 23
The question how “close” two random variables are to each other is a central
concern in statistics. The goal of statistics is to find observed random variables which
are “close” to the unobserved parameters or random outcomes of interest. These ob-
served random variables are usually called “estimators” if the unobserved magnitude
is nonrandom, and “predictors” if it is random. For scalar random variables we will
use the mean squared error as a criterion for closeness. Its definition is MSE[φ̂; φ]
(read it: mean squared error of φ̂ as an estimator or predictor, whatever the case
may be, of φ):
(23.0.1) MSE[φ̂; φ] = E[(φ̂ − φ)2 ]
629
630 23. THE MEAN SQUARED ERROR AS AN INITIAL CRITERION OF PRECISION
For our purposes, therefore, the estimator (or predictor) φ̂ of the unknown parameter
(or unobserved random variable) φ is no worse than the alternative φ̃ if MSE[φ̂; φ] ≤
MSE[φ̃; φ]. This is a criterion which can be applied before any observations are
collected and actual estimations are made; it is an “initial” criterion regarding the
expected average performance in a series of future trials (even though, in economics,
usually only one trial is made).
projection on the axes), φ̂ is closer to the origin than φ̃. But in the projection on
the diagonal, φ̃ is closer to the origin than φ̂.
0
Answer. In the simplest counterexample, all variables involved are constants: φ = 0 ,
1 −2
φ̂ = 1 , and φ̃ = 2 .
One can only then say unambiguously that the vector φ̂ is a no worse estimator
than φ̃ if its MSE is smaller or equal for every linear combination. Theorem 23.1.1
will show that this is the case if and only if the MSE-matrix of φ̂ is smaller, by a
nonnegative definite matrix, than that of φ̃. If this is so, then theorem 23.1.1 says
that not only the MSE of all linear transformations, but also all other nonnegative
definite quadratic loss functions involving these vectors (such as the trace of the
MSE-matrix, which is an often-used criterion) are minimized. In order to formulate
and prove this, we first need a formal definition of the MSE-matrix. We write MSE
for the matrix and MSE for the scalar mean squared error. The MSE-matrix of φ̂
as an estimator of φ is defined as
Don’t assume the scalar result but make a proof that is good for vectors and scalars.
Theorem 23.1.1. Assume φ̂ and φ̃ are two estimators of the parameter φ (which
is allowed to be random itself ). Then conditions (23.1.3), (23.1.4), and (23.1.5) are
23.1. COMPARISON OF TWO VECTOR ESTIMATORS 633
equivalent:
(23.1.3) For every constant vector t, MSE[t> φ̂; t> φ] ≤ MSE[t> φ̃; t> φ]
(23.1.4) MSE[φ̃; φ] − MSE[φ̂; φ] is a nonnegative definite matrix
For every nnd Θ, E (φ̂ − φ)> Θ(φ̂ − φ) ≤ E (φ̃ − φ)> Θ(φ̃ − φ) .
(23.1.5)
Proof. Call MSE[φ̃; φ] = σ 2 Ξ and MSE[φ̂; φ] = σ 2Ω . To show that (23.1.3)
implies (23.1.4), simply note that MSE[t> φ̂; t> φ] = σ 2 t>Ωt and likewise MSE[t> φ̃; t> φ
σ 2 t> Ξt. Therefore (23.1.3) is equivalent to t> (Ξ − Ω )t ≥ 0 for all t, which is the
defining property making Ξ − Ω nonnegative definite.
Here is the proof that (23.1.4) implies (23.1.5):
E[(φ̂ − φ)> Θ(φ̂ − φ)] = E[tr (φ̂ − φ)> Θ(φ̂ − φ) ] =
To complete the proof, (23.1.5) has (23.1.3) as a special case if one sets Θ =
tt> .
Problem 296. Show that if Θ and Σ are symmetric and nonnegative definite,
Σ) ≥ 0. You are allowed to use that tr(AB) = tr(BA), that the trace of a
then tr(ΘΣ
nonnegative definite matrix is ≥ 0, and Problem 129 (which is trivial).
Note that both MSE-matrices are singular, i.e., both estimators allow an error-free look at certain
linear combinations of the parameter vector.
• b. 1 point Give two vectors g = [ gg12 ] and h = hh12 satisfying MSE[g > φ̂; g > φ] <
MSE[g > φ̃; g > φ] and MSE[h> φ̂; h> φ] > MSE[h> φ̃; h> φ] (g and h are not unique;
there are many possibilities).
1 1 > > >
Answer. With g = −1 and h = 1 for instance we get g φ̂ − g φ = 0, g φ̃ −
g > φ = 4, h> φ̂; h> φ = 2, h> φ̃; h> φ = 0, therefore MSE[g > φ̂; g > φ] = 0, MSE[g > φ̃; g > φ] = 16,
636 23. THE MEAN SQUARED ERROR AS AN INITIAL CRITERION OF PRECISION
MSE[h> φ̂; h> φ] = 4, MSE[h> φ̃; h> φ] = 0. An alternative way to compute this is e.g.
> >
4 −4 1
MSE[h φ̃; h φ] = 1 −1 = 16
−4 4 −1
The estimator β̂ was derived from a geometric argument, and everything which
we showed so far are what [DM93, p. 3] calls its numerical as opposed to its statistical
properties. But β̂ has also nice statistical or sampling properties. We are assuming
right now the specification given in (18.1.3), in which X is an arbitrary matrix of full
column rank, and we are not assuming that the errors must be Normally distributed.
The assumption that X is nonrandom means that repeated samples are taken with
the same X-matrix. This is often true for experimental data, but not in econometrics.
The sampling properties which we are really interested in are those where also the X-
matrix is random; we will derive those later. For this later derivation, the properties
637
638 24. SAMPLING PROPERTIES OF THE LEAST SQUARES ESTIMATOR
with fixed X-matrix, which we are going to discuss presently, will be needed as an
intermediate step. The assumption of fixed X is therefore a preliminary technical
assumption, to be dropped later.
In order to know how good the estimator β̂ is, one needs the statistical properties
of its “sampling error” β̂ − β. This sampling error has the following formula:
From (24.0.7) follows immediately that β̂ is unbiased, since E [(X > X)−1 X >ε ] = o.
Unbiasedness does not make an estimator better, but many good estimators are
unbiased, and it simplifies the math.
We will use the MSE-matrix as a criterion for how good an estimator of a vector
of unobserved parameters is. Chapter 23 gave some reasons why this is a sensible
criterion (compare [DM93, Chapter 5.5]).
24.1. THE GAUSS MARKOV THEOREM 639
ε = Xβ−ŷ = X(β−β̂). This allows to use MSE[ε̂; ε ] = X MSE[β̂; β]X > = σ 2 X(X > X)−1 X > .
ŷ−ε
Problem 300. 2 points Show that β̂ and ε̂ are uncorrelated, i.e., cov[β̂ i , ε̂j ] =
0 for all i, j. Defining the covariance matrix C [β̂, ε̂] as that matrix whose (i, j)
element is cov[β̂ i , ε̂j ], this can also be written as C [β̂, ε̂] = O. Hint: The covariance
matrix satisfies the rules C [Ay, Bz] = A C [y, z]B > and C [y, y] = V [y]. (Other rules
for the covariance matrix, which will not be needed here, are C [z, y] = (C [y, z])> ,
C [x + y, z] = C [x, z] + C [y, z], C [x, y + z] = C [x, y] + C [x, z], and C [y, c] = O if c is
a vector of constants.)
Answer. A = (X > X)−1 X > and B = I−X(X > X)−1 X > , therefore C [β̂, ε̂] = σ 2 (X > X)−1 X >
X(X > X)−1 X > ) = O.
(here consisting of one row only) that contains all the covariances
(24.1.2) C [ȳ, β̂] ≡ cov[ȳ, β̂ 1 ] cov[ȳ, β̂ 2 ] · · · cov[ȳ, β̂ k ]
2
has the following form: C [ȳ, β̂] = σn 1 0 · · · 0 where n is the number of ob-
servations. Hint: That the regression has an intercept term as first column of the
X-matrix means that Xe(1) = ι, where e(1) is the unit vector having 1 in the first
place and zeros elsewhere, and ι is the vector which has ones everywhere.
Answer. Write both ȳ and β̂ in terms of y, i.e., ȳ = 1 >
n
ι y and β̂ = (X > X)−1 X > y. Therefore
(24.1.3)
1 > σ2 > σ 2 (1) > > σ 2 (1) >
C [ȳ, β̂] = ι V [y]X(X > X)−1 = ι X(X > X)−1 = e X X(X > X)−1 = e .
n n n n
MSE[φ̃; φ] = E[(φ̃ − φ)2 ] = E[ t> (X > X)−1 X > + c> εε > X(X > X)−1 t + c ] =
= σ 2 t> (X > X)−1 X > + c> X(X > X)−1 t + c = σ 2 t> (X > X)−1 t + σ 2 c> c,
Here we needed again c> X = o> . Clearly, this is minimized if c = o, in which case
φ̃ = t> β̂.
MSE[β̃; β] = V [β̃] = σ 2 (X > X)−1 X > + C X(X > X)−1 + C > = σ 2 (X > X)−1 + σ 2 CC > , i.e.,
it exceeds the MSE-matrix of β̂ by a nonnegative definite matrix.
minimax estimator of the scalar φ = t> β with respect to the MSE. I.e., for every
other linear estimator φ̃ = a> y of φ one can find a value β = β 0 for which φ̃ has a
larger MSE than the largest possible MSE of t> β̂.
Proof: as in the proof of Theorem 24.1.1, write the alternative linear estimator
φ̃ in the form φ̃ = t> (X > X)−1 X > + c> y, so that the sampling error is given by
= σ 2 t> (X > X)−1 X > + c> X(X > X)−1 t + c + c> Xββ > X > c
(24.2.3)
Now there are two cases: if c> X = o> , then MSE[φ̃; φ] = σ 2 t> (X > X)−1 t + σ 2 c> c.
This does not depend on β and if c 6= o then this MSE is larger than that for c = o.
If c> X 6= o> , then MSE[φ̃; φ] is unbounded, i.e., for any finite number ω one one
can always find a β 0 for which MSE[φ̃; φ] > ω. Since MSE[φ̂; φ] is bounded, a β 0
can be found that satisfies (24.2.1).
24.3. MISCELLANEOUS PROPERTIES OF THE BLUE 645
Answer.
P1 2
X
= E[xi εi ]2 since the εi ’s are uncorrelated, i.e., cov[εi , εj ] = 0 for i 6= j
(
xi )2
1 X
= P 2 σ2 x2i since all εi have equal variance σ 2
( xi ) 2
2
= Pσ .
x2i
650 24. SAMPLING PROPERTIES OF THE LEAST SQUARES ESTIMATOR
Problem 305. We still assume (24.3.5) is the true model. Consider an alter-
native estimator:
P
(xi − x̄)(y i − ȳ)
(24.3.7) β̂ = P
(xi − x̄)2
i.e., the estimator which would be the best linear unbiased estimator if the true model
were (18.2.15).
Answer. One can argue it: β̂ is unbiased for model (18.2.15) whatever the value of α or β,
therefore also when α = 0, i.e., when the model is (24.3.5). But here is the pedestrian way:
P P
(xi − x̄)(y i − ȳ) (x − x̄)y i X
β̂ = P = P i since (xi − x̄)ȳ = 0
(xi − x̄)2 2
(xi − x̄)
P
(xi − x̄)(βxi + εi )
= P since y i = βxi + εi
(xi − x̄)2
P P
(xi − x̄)xi (xi − x̄)εi
=β P + P
(xi − x̄)2 (xi − x̄)2
P
(xi − x̄)εi X X
=β+ P since (xi − x̄)xi = (xi − x̄)2
(xi − x̄) 2
P
(xi − x̄)εi
E β̂ = E β + E P
(xi − x̄)2
P
(xi − x̄) E εi
=β+ P =β since E εi = 0 for all i, i.e., β̂ is unbiased.
(xi − x̄) 2
Answer. One can again argue it: since the formula for var β̂ does not depend on what the
true value of α is, it is the same formula.
P
(xi − x̄)εi
(24.3.8) var β̂ = var β + P
(xi − x̄)2
P
(xi − x̄)εi
(24.3.9) = var P
(xi − x̄)2
P
(x − x̄)2 var εi
(24.3.10) = Pi 2 2
since cov[εi εj ] = 0
( (xi − x̄) )
σ2
(24.3.11) = P .
(xi − x̄)2
• c. 1 point Still assuming (24.3.5) is the true model, would you prefer β̂ or the
β̃ from Problem 304 as an estimator of β?
Answer. Since β̃ and β̂ are both unbiased estimators, if (24.3.5) is the true model, the pre-
ferred estimator is the one with the smaller variance. As I will show, var β̃ ≤ var β̂ and, therefore,
β̃ is preferred to β̂. To show
σ2 2
(24.3.12) var β̂ = P ≥ Pσ = var β̃
(xi − x̄)2 x2i
24.3. MISCELLANEOUS PROPERTIES OF THE BLUE 653
X X
(24.3.13) (xi − x̄)2 ≤ x2i
which is a simple consequence of (12.1.1). Thus var β̂ ≥ var β̃; the variances are equal only if x̄ = 0,
i.e., if β̃ = β̂.
Problem 306. Suppose the true model is (18.2.15) and the basic assumptions
are satisfied.
P
xi y
• a. 2 points In this situation, β̃ = P x2 i is generally a biased estimator of β.
i
Show that its bias is
nx̄
(24.3.14) E[β̃ − β] = α P 2
xi
654 24. SAMPLING PROPERTIES OF THE LEAST SQUARES ESTIMATOR
Answer. In situations like this it is always worth while to get a nice simple expression for the
sampling error:
P
xy
(24.3.15) β̃ − β = P i2i − β
x
P i
xi (α + βxi + εi )
(24.3.16) = P 2 −β since y i = α + βxi + εi
xi
P P 2 P
xi x xi εi
(24.3.17) = α P 2 + β P i2 + P 2 − β
xi xi xi
P P
xi xi εi
(24.3.18) = αP 2 + P 2
x xi
Pi P
xi xi εi
(24.3.19) E[β̃ − β] = E α P 2 + E P 2
xi xi
P P
xi xi E εi
(24.3.20) = αP 2 + P 2
xi xi
P
xi nx̄
(24.3.21) = αP 2 + 0 = αP 2
xi xi
This is 6= 0 unless x̄ = 0 or α = 0.
24.3. MISCELLANEOUS PROPERTIES OF THE BLUE 655
σ2
(24.3.22) P
(xi − x̄)2
• c. 5 points Show that the MSE of β̃ is smaller than that of the OLS estimator
if and only if the unknown true parameters α and σ 2 satisfy the equation
α2
(24.3.28)
2
<1
1
σ2 n + P(xx̄ −x̄)2
i
Answer. This implies some tedious algebra. Here it is important to set it up right.
2 2
σ2
Pσ αnx̄
MSE[β̃; β] = + P ≤ P
x2i 2
xi (xi − x̄)2
P 2 P
αnx̄
2
σ2 2 σ2 xi − (xi − x̄)2
Pσ 2 =
P ≤ P − P P 2
x2i (xi − x̄)2 xi 2
(xi − x̄) xi
σ 2 nx̄2
= P P
(xi − x̄)2 x2i
α2 n α2 σ2
P = 1
P ≤ P
x2i n
(xi − x̄)2 + x̄2 (xi − x̄)2
α2
≤1
x̄2
1
σ2 n
+ P
(xi −x̄)2
Now look at this lefthand side; it is amazing and surprising that it is exactly the population
equivalent of the F -test for testing α = 0 in the regression with intercept. It can be estimated by
replacing α2 with α̂2 and σ 2 with s2 (in the regression with intercept). Let’s look at this statistic.
24.3. MISCELLANEOUS PROPERTIES OF THE BLUE 657
From the Gauss-Markov theorem follows that for every nonrandom matrix R,
the BLUE of φ = Rβ is φ̂ = Rβ̂. Furthermore, the best linear unbiased predictor
(BLUP) of ε = y − Xβ is the vector of residuals ε̂ = y − X β̂.
• c. How does this best predictor relate to the OLS estimator β̂?
Problem 308. This is a vector generalization of problem 198. Let β̂ the BLUE
of β and β̃ an arbitrary linear unbiased estimator of β.
Problem 310. 3 points The model is y = Xβ + ε but all rows of the X-matrix
are exactly equal. What can you do? Can you estimate β? If not, are there any linear
combinations of the components of β which you can estimate? Can you estimate σ 2 ?
Answer. If all rows are equal, then each column is a multiple of ι. Therefore, if there are more
than one column, none of the individual components of β can be estimated. But you can estimate
x> β (if x is one of the row vectors of X) and you can estimate σ 2 .
Problem 311. This is [JHG+ 88, 5.3.32]: Consider the log-linear statistical
model
(24.3.29) y t = αxβt exp εt = zt exp εt
24.3. MISCELLANEOUS PROPERTIES OF THE BLUE 661
• b. 1 point Show that the elasticity of the functional relationship between xt and
zt
∂zt /zt
(24.3.31) η=
∂xt /xt
does not depend on t, i.e., it is the same for all observations. Many authors talk
about the elasticity of y t with respect to xt , but one should really only talk about the
elasticity of zt with respect to xt , where zt is the systematic part of yt which can be
estimated by ŷt .
Answer. The systematic functional relationship is log zt = log α + β log xt ; therefore
∂ log zt 1
(24.3.32) =
∂zt zt
which can be rewritten as
∂zt
(24.3.33) = ∂ log zt ;
zt
The same can be done with xt ; therefore
∂zt /zt ∂ log zt
(24.3.34) = =β
∂xt /xt ∂ log xt
What we just did was a tricky way to take a derivative. A less tricky way is:
∂zt
(24.3.35) = αβxβ−1
t = βzt /xt
∂xt
662 24. SAMPLING PROPERTIES OF THE LEAST SQUARES ESTIMATOR
Therefore
∂zt xt
(24.3.36) =β
∂xt zt
Problem 312.
• a. 2 points What is the elasticity in the simple regression y t = α + βxt + εt ?
Answer.
∂z t /z t ∂z t xt βxt βxt
(24.3.37) ηt = = = =
∂xt /xt ∂xt z t zt α + βxt
This depends on the observation, and if one wants one number, a good way is to evaluate it at
x̄.
β̂ x̄
• b. Show that an estimate of this elasticity evaluated at x̄ is h = ȳ .
Answer. This comes from the fact that the fitted regression line goes through the point x̄, ȳ.
If one uses the other definition of elasticity, which Greene uses on p. 227 but no longer on p. 280,
and which I think does not make much sense, one gets the same formula:
∂y t /y t ∂y t xt βxt
(24.3.38) ηt = = =
∂xt /xt ∂xt y t yt
This is different than (24.3.37), but if one evaluates it at the sample mean, both formulas give the
β̂ x̄
same result ȳ
.
24.3. MISCELLANEOUS PROPERTIES OF THE BLUE 663
i 1 −1 " −h
#
2
h
−h x̄(1−h) x̄ ȳ
(24.3.40) s ȳ ȳ x̄ x¯2 x̄(1−h)
ȳ
Now say v = log(y) and ui = log(xi ), and the values of f and its derivatives at o are
the coefficients to be estimated:
X 1X
(24.3.45) log(y) = α + βi log xi + γij log xi log xj + ε
2 i,j
Here usually one of the columns of X is the time subscript t itself; [Gre97, p. 227]
writes it as
where δ is the autonomous growth rate. The logistic functional form is appropriate
for adoption rates 0 ≤ y t ≤ 1: the rate of adoption is slow at first, then rapid as the
innovation gains popularity, then slow again as the market becomes saturated:
exp(x>
t β + tδ + εt )
(24.3.48) yt =
1 + exp(x>
t β + tδ + εt )
random parameter.) If yes, explain how you would estimate it, and if not, what is
the best you can do?
Answer. Call εt = αt − µ, then the equation reads y t = µ + βxt + εt , with well behaved
disturbances. Therefore one can estimate all the unknown parameters, and predict αt by µ̂ + εt .
E[ε̂> ε̂] = E[tr ε > Mε εε > ] = σ 2 tr M = σ 2 tr(I − X(X > X)− X > ) =
ε] = E[tr Mε
2 > − > 2
σ (n − tr(X X) X X) = σ (n − q).
Problem 314.
• a. 2 points Show that
(24.4.2) SSE = ε > Mε
ε where M = I − X(X > X)− X >
Answer. SSE = ε̂> ε̂, where ε̂ = y − X β̂ = y − X(X > X)− X > y = M y where M =
I − X(X > X)− X > . From M X = O follows ε̂ = M (Xβ + ε ) = Mε
ε. Since M is idempotent and
symmetric, it follows ε̂> ε̂ = ε > Mε
ε.
(24.4.3) E[SSE] = σ 2 (n − k)
Answer. E[ε̂> ε̂] = E[tr ε > Mε εε > ] = σ 2 tr M = σ 2 tr(I − X(X > X)− X > ) =
ε] = E[tr Mε
σ 2 (n − tr(X > X)− X > X) = σ 2 (n − k).
• b. 1 point Formula (24.5.1) for the MSE matrix depends on the unknown σ 2
and η and is therefore useless for estimation. If one cannot get an estimate of the
whole MSE matrix, an often-used second best choice is its trace. Show that
Hint: use equation (9.2.1). If one does not have an unbiased estimator s2 of σ 2 , one
usually gets such an estimator by regressing y on an X matrix which is so large that
one can assume that it contains the true regressors.
670 24. SAMPLING PROPERTIES OF THE LEAST SQUARES ESTIMATOR
The statistic
SSE
(24.5.4) Cp = + 2q − n
s2
If one therefore has several regressions and tries to decide which is the right one,
it is recommended to plot Cp versus q for all regressions, and choose one for which
this value is small and lies close to the diagonal. An example of this is given in
problem 286.
(Here γ1 is the skewness and γ2 the kurtosis of εi .) This is for instance satisfied
whenever all εi are independent drawings from the same population with E[εi ] = 0
and equal variances var[εi ] = σ 2 .
If either γ2 = 0 (which is for instance the case when ε is normally distributed),
or if X is such that all diagonal elements of X(X > X)− X > are equal, then the
minimum MSE estimator in the class of unbiased estimators of σ 2 , whatever the
672 24. SAMPLING PROPERTIES OF THE LEAST SQUARES ESTIMATOR
mean or dispersion matrix of β or its covariance matrix with ε may be, and which
can be written in the form y > Ay with a nonnegative definite A, is
1
(24.6.5) s2 = y > (I − X(X > X)− X > )y,
n−q
and
2 γ2
(24.6.6) E[(s2 − σ 2 )2 ] = var[s2 ] = σ 4 + .
n−q n
Proof: For notational convenience we will look at unbiased estimators of the
form y > Ay of (n − q)σ 2 , and at the end divide by n − q. If β is nonrandom, then
E[y > Ay] = σ 2 tr A + β > X > AXβ, and the estimator is unbiased iff X > AX = O
and tr A = n−q. Since A is assumed nonnegative definite, X > AX = O is equivalent
to AX = O. From AX = O follows y > Ay = ε > Aε ε, therefore the distribution of
β no longer matters and (9.2.27) simplifies to
(24.6.7) var[ε ε] = σ 4 (γ2 a> a + 2 tr(A2 ))
ε> Aε
Now take an arbitrary nonnegative definite A with AX = O and tr A = n−q. Write
it in the form A = M + D, where M = I − X(X > X)− X > , and correspondingly
a = m + d. The condition AX = O is equivalent to AM = A or, expressed in D
instead of A, (M + D)M = M + D, which simplifies to DM = D. Furthermore,
24.6. OPTIMALITY OF VARIANCE ESTIMATORS 673
1 0 1
Answer. Counterexample: A = and x = .
0 −1 1
CHAPTER 25
(25.0.12) y = Xβ + ε ,
675
676 25. VARIANCE ESTIMATION: UNBIASEDNESS?
is
n
1 1 X 2
(25.0.13) s2 = y> M y = ε̂
n−r n − r i=1 i
where M = I − X(X > X)− X > and ε̂ = M y. If X has full rank, then ε̂ = y − X β̂,
where β̂ is the least squares estimator of β. Just as β̂ is the best (minimum mean
square error) linear unbiased estimator of β, it has been shown in [Ati62], see also
[Seb77, pp. 52/3], that under certain additional assumptions, s2 is the best unbiased
estimator of σ 2 which can be written in the form y > Ay with a nonnegative definite
A. A precise formulation of these additional assumptions will be given below; they
are, for instance, satisfied if X = ι, the vector of ones, and the εi are i.i.d. But they
are also satisfied for arbitrary X if ε is normally distributed. (In this last case, s2 is
best in the larger class of all unbiased estimators.)
This suggests an analogy between linear and quadratic estimation which is, how-
ever, by no means perfect. The results just cited pose the following puzzles:
• Why is s2 not best nonnegative quadratic unbiased for arbitrary X-matrix
whenever the εi are i.i.d. with zero mean? What is the common logic behind
25. VARIANCE ESTIMATION: UNBIASEDNESS? 677
best bounded MSE linear estimator β̂ of β is (a) unbiased and (b) does not depend
on the nuisance parameter σ 2 , the best quadratic bounded MSE estimator of σ 2
is (a) biased and (b) depends on a fourth-order nuisance parameter, the kurtosis
of the disturbances. This, again, helps to dispel the false suggestiveness of puzzle
(1). The main assumption is distributional. If the kurtosis is known, then the best
nonnegative quadratic unbiased estimator exists. However it is uninteresting, since
the (biased) best bounded MSE quadratic estimator is better. The class of unbiased
estimators only then becomes interesting when the kurtosis is not known: for certain
X-matrices, the best nonnegative quadratic unbiased estimator does not depend on
the kurtosis.
However even if the kurtosis is not known, this paper proposes to use as estimate
of σ 2 the maximum value which one gets when one applies the best bonded mean
squared error estimator for all possible values of the kurtosis.
powerful building of least squares theory seems to rest on such a flimsy assumption
as unbiasedness.
G. A. Barnard, in [Bar63], noted this and proposed to replace unbiasedness by
bounded MSE, a requirement which can be justified by the researcher following an
“insurance strategy”: no bad surprises regarding the MSE of the estimator, what-
ever the value of the true β. Barnard’s suggestion has not found entrance into the
textbooks—and indeed, since linear estimators in model (25.0.12) are unbiased if and
only if they have bounded MSE, it might be considered an academic question.
It is usually not recognized that even in the linear case, the assumption of
bounded MSE serves to unify the theory. Christensen’s monograph [Chr87] treats,
as we do here in chapter 27, best linear prediction on the basis of known first and
second moments in parallel with the regression model. Both models have much in
common, but there is one result which seems to set them apart: best linear predic-
tors exist in one, but only best linear unbiased predictors in the other [Chr87, p.
226]. If one considers bounded MSE to be one of the basic assumptions, this seeming
irregularity is easily explained: If the first and second moments are known, then ev-
ery linear predictor has bounded MSE, while in the regression model only unbiased
linear estimators do.
One might still argue that no real harm is done with the assumption of unbiased-
ness, because in the linear case, the best bounded MSE estimators or predictors turn
680 25. VARIANCE ESTIMATION: UNBIASEDNESS?
out to be unbiased. This last defense of unbiasedness falls if one goes from linear to
quadratic estimation. We will show that the best bounded MSE quadratic estimator
is biased.
As in the the linear case, it is possible to derive these results without fully specify-
ing the distributions involved. In order to compute the MSE of linear estimators, one
needs to know the first and second moments of the disturbances, which is reflected
in the usual assumption ε ∼ (o, σ 2 I). For the MSE of quadratic estimators, one also
needs information about the third and fourth moments. We will therefore derive
optimal quadratic estimators of σ 2 based on the following assumptions regarding the
first four moments, which are satisfied whenever the εi are independently identically
distributed:
25.1. SETTING THE FRAMEWORK STRAIGHT 681
Here γ1 is the skewness and γ2 the kurtosis of εi . They are allowed to range within
their natural limits
(25.1.5) 0 ≤ γ12 ≤ γ2 + 2.
Problem 318. Show that the condition (25.1.5), γ12 ≤ γ2 + 2, always holds.
682 25. VARIANCE ESTIMATION: UNBIASEDNESS?
Answer.
2
(25.1.6) (σ 3 γ1 )2 = (E[ε3 ])2 = cov[ε, ε2 ] ≤ var[ε] var[ε2 ] = σ 6 (γ2 + 2)
The concept of bounded MSE which is appropriate here requires the bound to be
independent of the true value of β, but it may depend on the “nuisance parameters”
σ 2 , γ1 , and γ2 :
Definition 25.1.1. The mean square error E[(θ̂ − θ)2 ] of the estimator θ̂ of a
scalar parameter θ in the linear model (25.0.12) will be said to be bounded (with
respect to β) if a finite number b exists with E[(θ̂−θ)2 ] ≤ b regardless of the true value
of β. This bound b may depend on the known nonstochastic X and the distribution
of ε , but not on β.
This formula can be found e.g. in [Seb77, pp. 14–16 and 52]. If AX 6= O, then
a vector δ exists with δ > X > A2 Xδ > 0; therefore, for the sequence β = jδ, the
variance is a quadratic polynomial in j, which is unbounded as j → ∞.
The following ingredients are needed for the best bounded MSE quadratic esti-
mator of σ 2 :
Theorem 25.2.2. We will use the letter τ to denote the vector whose ith com-
ponent is the square of the ith residual τ i = ε̂2i . Then
2
(25.2.2) E [τ ] = σ m
where m is the diagonal vector of M = I − X(X > X)− X > . Furthermore,
(25.2.3) 4
V [τ ] = σ Ω where Ω = γ2 Q2 + 2Q + mm> ,
684 25. VARIANCE ESTIMATION: UNBIASEDNESS?
Q is the matrix with qij = m2ij , i.e., its elements are the squares of the elements of
M , and γ2 is the kurtosis.
∆ M M
" ε̂ ε̂ # " ε ε #
(25.2.4) E =E =
ε̂ ε̂ ε ε
25.2. DERIVATION OF THE BEST BOUNDED MSE QUADRATIC ESTIMATOR OF THE VARIANC
68
∆ ∆ ∆ ∆
M M M M M M M M
= σ4 + σ4 + σ4 + γ2 σ 4 ∆
M M M M M M M M
∆ ∆ ∆ ∆
These m and Ω play an important and, to me, surprising role in the estimator of σ 2 :
(25.2.5) σ̂ 2 = m>Ω − τ
where m and Ω are defined as in Theorem 25.2.2. Other ways to write it are
X
(25.2.6) σ̂ 2 y > M ΛM y = λi ε̂2i
i
686 25. VARIANCE ESTIMATION: UNBIASEDNESS?
it follows, using (25.2.1) and E[y > Ay] = σ 2 tr A + β > X > AXβ, that
Define Σ = (γ2 +2)Q2 +2(Q−Q2 ). It is the sum of two nonnegative definite matrices:
γ2 + 2 ≥ 0 by (25.1.5), and Q − Q2 is nonnegative definite because λ> (Q − Q2 )λ
is the sum of the squares of the offdiagonal elements of M ΛM . Therefore Σ is
Σ + mm> )(Σ
nonnegative definite and it follows m = (Σ Σ + mm> )− m. (To see this,
> > > −
take any P with Σ = P P and apply the identity T =T T (T T ) T , proof e.g.
in [Rao73, p. 26], to the partitioned matrix T = P m .)
Writing Ω = Σ + mm> , one verifies therefore
1
(25.2.12) MSE = (λ − Ω − m)>Ω (λ − Ω − m) − m>Ω − m + 1 + 2 tr(D 2 ).
σ4
Clearly, this is minimized by D = O and any λ with Ω (λ − Ω − m) = o, which gives
(25.2.7).
But why should the data analyst be particulary interested in estimates whose
MSE is independent of β? The research following up on Hsu tried to get rid of this
assumption again. C. R. Rao, in [Rao52], replaced independence of the MSE by the
assumption that A be nonnegative definite. We argue that this was unfortunate, for
the following two reasons:
• Although one can often read that it is “natural” to require A to be non-
negative definite (see for instance [Ati62, p. 84]), we disagree. Of course,
one should expect the best estimator to be nonnegative, but is perplexing
that one should have to assume it. We already noted this in puzzle (3) at
the beginning.
• In the light of theorem 25.2.1, Hsu’s additional condition is equivalent to
the requirement of bounded MSE. It is therefore not as poorly motivated as
it was at first assumed to be. Barnard’s article [Bar63], arguing that this
assumption is even in the linear case more meaningful than unbiasedness,
appeared eleven years after Rao’s [Rao52]. If one wanted to improve on
Hsu’s result, one should therefore discard the condition of unbiasedness,
not that of bounded MSE.
Even the mathematical proof based on unbiasedness and nonnegative definiteness
suggests that the condition AX = O, i.e., bounded MSE, is the more fundamental
assumption. Nonnegative definitenes of A is used only once, in order to get from
690 25. VARIANCE ESTIMATION: UNBIASEDNESS?
is
X
(25.3.1) ˆ 2 = y > M ΘM y =
σ̂ θi ε̂2i
i
where Θ is a diagonal matrix whose diagonal vector θ satisfies the two conditions
that
(25.3.2) Ωθ is proportional to m,
and that
(25.3.3) m> θ = 1
(for instance one may use θ = λ m1> λ .) M , m, Ω , and λ are the same as in theorem
25.2.3. The MSE of this estimator is
(25.3.4) ˆ 2 − σ 2 )2 ] = σ 4 ( 1 − 1).
E[(σ̂
m> λ
We omit the proof, which is very similar to that of theorem 25.2.3. In the general
case, estimator (25.3.1) depends on the kurtosis, just as estimator (25.2.6) does. But
if X is such that all diagonal elements of M are equal, a condition which Atiqullah
in [Ati62] called “quadratically balanced,” then it does not! Since tr M = n − r,
equality of the diagonal elements implies m = n−r n ι. And since m = Qι, any vector
proportional to ι satisfies (25.3.2), i.e., one can find solutions of (25.3.2) without
692 25. VARIANCE ESTIMATION: UNBIASEDNESS?
1
knowing the kurtosis. (25.3.3) gives θ = ι n−r , i.e., the resulting estimator is none
2
other than the unbiased s defined in (25.0.13).
The property of unbiasedness which makes it so popular in the classroom—it is
easy to check—gains here objective relevance. For the best nonnegative quadratic
unbiased estimator one needs to know Ω only up to a scalar factor, and in some
special cases the unknown kurtosis merges into this arbitrary multiplicator.
25.4. Summary
If one replaces the requirement of unbiasedness by that of bounded MSE, one
can not only unify some known results in linear estimation and prediction, but one
also obtains a far-reaching analogy between linear estimation of β and quadratic
estimation of σ 2 . The most important dissimilarity is that, whereas one does not
have to know the nuisance parameter σ 2 in order to write down the best linear
bounded MSE estimator of β, the best quadratic bounded MSE estimator of σ 2
depends on an additional fourth order nuisance parameter, namely, the kurtosis.
In situations in which the kurtosis is known, one should consider the best quadratic
bounded MSE estimator (25.2.6) of σ 2 to be the quadratic analog of the least squares
estimator β̂. It is a linear combination of the squared residuals, and if the kurtosis is
zero, it specializes to the Theil-Schweitzer estimator (25.0.14). Regression computer
25.4. SUMMARY 693
packages, which require normality for large parts of their output, should therefore
provide the Theil-Schweitzer estimate as a matter of course.
If the kurtosis is not known, one can always resort to s2 . It is unbiased and
consistent, but does not have any optimality properties in the general case. If the
design matrix is “quadratically balanced,” s2 can be justified better: in this case s2
has minimum MSE in the class of nonnegative quadratic unbiased estimators (which
is a subclass of all bounded MSE quadratic estimators).
The requirement of unbiasedness for the variance estimator in model (25.0.12)
is therefore not as natural as is often assumed. Its main justification is that it may
help to navigate around the unknown nuisance parameter “kurtosis.”
CHAPTER 26
(26.0.1) P y = P Xβ + P ε P ε ∼ (o, σ 2 I)
695
696 26. NONSPHERICAL COVARIANCE MATRIX
(26.0.2) β̂ = (X > P > P X)−1 X > P > P y = (X > Ψ−1 X)−1 X > Ψ−1 y.
This β̂ is the BLUE of β in model (26.0.1), and since estimators which are linear
in P y are also linear in y and vice versa, β̂ is also the BLUE in the original GLS
model.
and derive from this that β̂ is unbiased and that MSE[β̂; β] = σ 2 (X > Ψ−1 X)−1 .
If X has full rank, then X > Ψ−1 X is nonsingular, and the unique β̂ minimizing
(26.0.4) is
(26.0.6) β̂ = (X > Ψ−1 X)−1 X > Ψ−1 y
Problem 320. [Seb77, p. 386, 5] Show that if Ψ is positive definite and X has
full rank, then also X > Ψ−1 X is positive definite. You are allowed to use, without
proof, that the inverse of a positive definite matrix is also positive definite.
Answer. From X > Ψ−1 Xa = o follows a> X > Ψ−1 Xa = 0, and since Ψ−1 is positive defi-
nite, it follows Xa = o, and since X has full column rank, this implies a = o.
Problem 321. Show that (26.0.5) has always at least one solution, and that the
general solution can be written as
(26.0.7) β̂ = (X > Ψ−1 X)− X > Ψ−1 y + U γ
where X ⊥ U and γ is an arbitrary vector. Show furthermore that, if β̂ is a solution
of (26.0.5), and β is an arbitrary vector, then
(26.0.8)
(y − Xβ)> Ψ−1 (y − Xβ) = (y − X β̂)> Ψ−1 (y − X β̂) + (β − β̂)> X > Ψ−1 X(β − β̂).
Conclude from this that (26.0.5) is a necessary and sufficient condition characterizing
the values β̂ minimizing (26.0.4).
698 26. NONSPHERICAL COVARIANCE MATRIX
Answer. One possible solution of (26.0.5) is β̂ = (X > Ψ−1 X)− X > Ψ−1 y. Since the normal
equations are consistent, (26.0.7) can be obtained from equation (A.4.1), using Problem 574. To
>
prove (26.0.8), write (26.0.4) as (y − X β̂) − X(β − β̂) Ψ−1 (y − X β̂) − X(β − β̂) ; since
β̂ satisfies (26.0.5), the cross product terms disappear. Necessity of the normal equations: for
any solution β of the minimization, X > Ψ−1 X(β − β̂) = o. This together with (26.0.5) gives
X > Ψ−1 Xβ = X > Ψ−1 y.
which has different properties now since we do not assume ε ∼ (o, σ 2 I) but ε ∼
(o, σ 2 Ψ).
• b. 2 points Show that, still under the assumption ε ∼ (o, σ 2 Ψ), V [β̂ OLS ] −
V [β̂] = V [β̂ OLS − β̂]. (Write down the formulas for the left hand side and the right
hand side and then show by matrix algebra that they are equal.) (This is what one
should expect after Problem 198.) Since due to unbiasedness the covariance matrices
are the MSE-matrices, this shows that MSE[β̂ OLS ; β] − MSE[β̂; β] is nonnegative
definite.
26. NONSPHERICAL COVARIANCE MATRIX 701
Answer. Verify equality of the following two expressions for the differences in MSE matrices:
V [β̂ OLS ] − V [β̂] = σ
2
(X > X)−1 X > ΨX(X > X)−1 − (X > Ψ−1 X)−1 =
= σ 2 (X > X)−1 X > − (X > Ψ−1 X)−1 X > Ψ−1 Ψ X(X > X)−1 − Ψ−1 X(X > Ψ−1 X)−1
Best Linear Prediction is the second basic building block for the linear model,
in addition to the OLS model. Instead of estimating a nonrandom parameter β
about which no prior information is available, in the present situation one predicts
a random variable z whose mean and covariance matrix are known. Most models to
be discussed below are somewhere between these two extremes.
Christensen’s [Chr87] is one of the few textbooks which treat best linear predic-
tion on the basis of known first and second moments in parallel with the regression
model. The two models have indeed so much in common that they should be treated
together.
703
704 27. BEST LINEAR PREDICTION
(27.1.5) B ∗Ω yy = Ω zy ,
−
(27.1.6) z ∗ = ν + Ω zy Ω yy (y − µ)
Theorem 27.1.1. In situation (27.1.1), the predictor (27.1.6) has, among all
predictors of z which are affine functions of y, the smallest MSE matrix. Its MSE
matrix is
−
(27.1.7) MSE[z ∗ ; z] = E [(z ∗ − z)(z ∗ − z)> ] = σ 2 (Ω
Ωzz − Ω zy Ω yy Ω yz ) = σ 2Ω zz.y .
706 27. BEST LINEAR PREDICTION
Proof. Look at any predictor of the form z̃ = B̃y + b̃. Its bias is d̃ = E [z̃ −z] =
B̃µ + b̃ − ν, and by (23.1.2) one can write
> >
(27.1.8) E [(z̃ − z)(z̃ − z) ] = V [(z̃ − z)] + d̃d̃
h y i >
(27.1.9) = V B̃ −I + d̃d̃
z
" #
Ω yy Ω yz B̃ >
2
>
(27.1.10) = σ B̃ −I + d̃d̃ .
Ω zy Ω zz −I
Therefore
" #
Ω yy Ω yz B ∗ > + D̃ >
>
MSE[z̃; z] = σ B ∗ + D̃
2
−I + d̃d̃
Ω zy Ω zz −I
>
" #
Ω yy D̃ >
= σ 2 B ∗ + D̃
(27.1.11) −I > + d̃d̃
−Ω
Ωzz.y + Ω zy D̃
> >
(27.1.12) = σ 2 (Ω
Ωzz.y + D̃Ω
Ωyy D̃ ) + d̃d̃ .
The MSE matrix is therefore minimized (with minimum value σ 2Ω zz.y ) if and only
Ωyy = O which means that B̃, along with B ∗ , satisfies (27.1.5).
if d̃ = o and D̃Ω
Problem 324. Show that the solution of this minimum MSE problem is unique
in the following sense: if B ∗1 and B ∗2 are two different solutions of (27.1.5) and y
is any feasible observed value y, plugged into equations (27.1.3) they will lead to the
same predicted value z ∗ .
Answer. Comes from the fact that every feasible observed value of y can be written in the
form y = µ + Ω yy q for some q, therefore B ∗i y = B ∗i Ω yy q = Ω zy q.
708 27. BEST LINEAR PREDICTION
The matrix B ∗ is also called the regression matrix of z on y, and the unscaled
covariance matrix has the form
Ωyy X >
Ωyy Ωyz Ωyy
(27.1.13) Ω= =
Ω zy Ω zz XΩ Ωyy XΩ Ωyy X > + Ω zz.y
Where we wrote here B ∗ = X in order to make the analogy with regression clearer.
A g-inverse is
−
Ω yy + X >Ω − −X >Ω −
− zz.y X zz.y
(27.1.14) Ω =
−X >Ω −zz.y
−
Ω zz.y
and every g-inverse of the covariance matrix has a g-inverse of Ω zz.y as its zz-
partition. (Proof
in Problem
592.)
Ω yy Ω yz
If Ω = is nonsingular, 27.1.5 is also solved by B ∗ = −(Ω Ωzz )−Ω zy
Ω zy Ω zz
where Ω zz and Ω zy are the corresponding partitions of the inverse Ω −1 . See Problem
592 for a proof. Therefore instead of 27.1.6 the predictor can also be written
−1 zy
(27.1.15) z ∗ = ν − Ω zz Ω (y − µ)
(note the minus sign) or
(27.1.16) z ∗ = ν − Ωzz.y Ωzy (y − µ).
27.1. MINIMUM MEAN SQUARED ERROR, UNBIASEDNESS NOT REQUIRED 709
Problem 325. This problem utilizes the concept of a bounded risk estimator,
which is not yet explained very well in these notes. Assume y, z, µ, and ν are
jointly distributed random vectors. First assume ν and µ are observed, but y and z
are not. Assume we know that in this case, the best linear bounded MSE predictor
of y and z is µ and ν, with prediction errors distributed as follows:
y−µ o Ω Ωyz
(27.1.17) ∼ , σ 2 yy .
z−ν o Ωzy Ωzz
• a. Give special cases of this specification in which µ and ν are constant and y
and z random, and one in which µ and ν and y are random and z is constant, and
one in which µ and ν are random and y and z are constant.
710 27. BEST LINEAR PREDICTION
Answer. Ifµ and ν are constant, they are written µ and ν. From this follows µ = E [y] and
2 Ω yy Ω yz y
ν = E [z] and σ = V[ ] and every linear predictor has bounded MSE. Then the
Ω zy Ω zz rx
proof is as given earlier in this chapter. But an example in which µ and ν are not known constants
but are observed random variables, and y is also a random variable but z is constant, is (28.0.26).
Another example, in which y and z both are constants and µ and ν random, is constrained least
squares (29.4.3).
Answer. From independence follows E [z ∗ − z|y] = E [z ∗ − z], and by the law of iterated
expectations E [z ∗ − z] = o. Rewrite this as E [z|y] = E [z ∗ |y]. But since z ∗ is a function of y,
E [z ∗ |y] = z ∗ . Now the proof that the conditional dispersion matrix is the MSE matrix:
> ∗ ∗ >
V [z|y] = E [(z − E [z|y])(z − E [z|y]) |y] = E [(z − z )(z − z ) |y]
(27.1.19)
= E [(z − z ∗ )(z − z ∗ )> ] = MSE[z ∗ ; z].
Problem 327. Assume the expected values of x, y and z are known, and their
joint covariance matrix is known up to an unknown scalar factor σ 2 > 0.
x λ Ω xx Ω xy Ω xz
>
y ∼ µ , σ 2 Ω xy
(27.1.20) Ω yy Ω yz .
z ν Ω>xz Ω>yz Ω zz
• b. 5 points Show that the best linear predictor of z on the basis of the obser-
vations of x and y has the form
(27.1.21) z ∗∗ = z ∗ + Ω > − ∗
yz.xΩ yy.x (y − y )
This is an important formula. All you need to compute z ∗∗ is the best estimate
z ∗ before the new information y became available, the best estimate y ∗ of that new
27.1. MINIMUM MEAN SQUARED ERROR, UNBIASEDNESS NOT REQUIRED 713
information itself, and the joint MSE matrix of the two. The original data x and
the covariance matrix (27.1.20) do not enter this formula.
Answer. Follows from
−
Ω xx Ω xy x−λ
z ∗∗ = ν + Ω xz
>
Ω> =
yz Ω>xy Ω yy y−µ
• b. 1 point The next three examples are from [CW99, pp. 264/5]: Assume
E[z|x, y] = 1 + 2x + 3y, x and y are independent, and E[y] = 2. Compute E[z|x].
Answer. According to the formula, E[z|x] = 1 + 2x + 3E[y|x], but since x and y are indepen-
dent, E[y|x] = E[y] = 2; therefore E[z|x] = 7 + 2x. I.e., the slope is the same, but the intercept
changes.
• c. 1 point Assume again E[z|x, y] = 1 + 2x + 3y, but this time x and y are not
independent but E[y|x] = 2 − x. Compute E[z|x].
Answer. E[z|x] = 1 + 2x + 3(2 − x) = 7 − x. In this situation, both slope and intercept change,
but it is still a linear relationship.
• d. 1 point Again E[z|x, y] = 1 + 2x + 3y, and this time the relationship between
x and y is nonlinear: E[y|x] = 2 − ex . Compute E[z|x].
Answer. E[z|x] = 1 + 2x + 3(2 − ex ) = 7 + 2x − 3ex . This time the marginal relationship
between x and y is no longer linear. This is so despite the fact that, if all the variables are included,
i.e., if both x and y are included, then the relationship is linear.
27.1. MINIMUM MEAN SQUARED ERROR, UNBIASEDNESS NOT REQUIRED 715
• d. Now let us extend the model a little: assume x1 , x2 , and ε are Normally
distributed and independent of each other, and E[ε] = 0. Define y = α + β1 x1 +
β2 x2 + ε. Again express β1 and β2 in terms of variances and covariances of x1 , x2 ,
and y.
Answer. Since x1 and x2 are independent, one gets the same formulas as in the univariate
cov[x ,y]
case: from cov[x1 , y] = β1 var[x1 ] and cov[x2 , y] = β2 var[x2 ] follows β1 = var[x1 ] and β2 =
1
cov[x2 ,y]
var[x2 ]
.
27.2. THE ASSOCIATED LEAST SQUARES PROBLEM 717
• e. Since x1 and y are jointly normal, they can also be written x1 = γ1 +δ1 y+ω 1 ,
where ω 1 is independent of y. Likewise, x2 = γ2 + δ2 y + ω 2 , where ω 2 is independent
of y. Express δ1 and δ2 in terms of the variances and covariances of x1 , x2 , and y,
and show that
δ1 var[x1 ] 0 β1
(27.1.23) var[y] =
δ2 0 var[x2 ] β2
This is (27.1.22) in the present situation.
cov[x1 ,y] cov[x2 ,y]
Answer. δ1 = var[y]
and δ2 = var[y]
.
If one plugs z = z ∗ into this objective function, one obtains a very simple expression:
(27.2.3)
Ω− − − −
Ω− −
yy + Ω yy Ω yz Ω zz.y Ω zy Ω yy −Ω yy Ω yz Ω zz.y I
>
Ω−
(y−µ) I yy Ω yz (y−
−Ω − −
Ωzz.y Ω zy Ω yy −
Ω zz.y Ω zy Ω −
yy
(27.2.4) = (y − µ)>Ω −
yy (y − µ).
27.2. THE ASSOCIATED LEAST SQUARES PROBLEM 719
Now take any z of the form z = ν + Ω zz q for some q and write it in the form
z = z ∗ + Ω zz d, i.e.,
y−µ y−µ o
= ∗ + .
z−ν z −ν Ω zz d
(27.2.5)
Ω− − − − −
Ω yz Ω −
> > yy + Ω yy Ω yz Ω zz.y Ω zy Ω yy −Ω
Ωyy zz.y I
o d Ω zz − − − − (y−µ) =
−Ω
Ωzz.y Ω zy Ω yy Ω zz.y Ω zy Ω yy
−
Ωyy
= o> d>Ω zz
(y − µ) = 0
O
From (27.2.1) follows that z ∗ is the mode of the normal density function, and
since the mode is the mean, this is an alternative proof, in the case of nonsingular
covariance matrix, when the density exists, that z ∗ is the normal conditional mean.
720 27. BEST LINEAR PREDICTION
one sees that E[y ∗0 − y 0 ] = o, i.e., it is an unbiased predictor. And since ε and ε 0
are uncorrelated, one obtains
Problem 331 shows that this is the Best Linear Unbiased Predictor (BLUP) of y 0 on
the basis of y.
27.3. PREDICTION OF FUTURE OBSERVATIONS IN THE REGRESSION MODEL 721
Problem 331. The prediction problem in the Ordinary Least Squares model can
be formulated as follows:
y X ε ε o ε I O
(27.3.4) = β+ E [ ]= V [ ] = σ2 .
y0 X0 ε0 ε0 o ε0 O I
X and X 0 are known, y is observed, y 0 is not observed.
• a. 4 points Show that y ∗0 = X 0 β̂ is the Best Linear Unbiased Predictor (BLUP)
of y 0 on the basis of y, where β̂ is the OLS estimate in the model y = Xβ + ε .
Answer. Take any other predictor ỹ 0 = B̃y and write B̃ = X 0 (X > X)−1 X > + D. Unbiased-
ness means E [ỹ 0 − y 0 ] = X 0 (X > X)−1 X > Xβ + DXβ − X 0 β = o, from which follows DX = O.
Because of unbiasedness we know MSE[ỹ 0 ; y 0 ] = V[ỹ 0 − y 0 ]. Since the prediction error can be
y
written ỹ 0 − y = X 0 (X > X)−1 X > + D −I , one obtains
y0
>
y X(X > X)−1 X >
0 +D
>
V [ỹ 0 − y 0 ] = X 0 (X X)−1 X > +D −I V [ ]
y0 −I
2
>
X(X > X)−1 X >
0 +D
>
=σ X 0 (X X)−1 X > +D −I
−I
>
= σ 2 X 0 (X > X)−1 X > + D X 0 (X > X)−1 X > + D + σ2 I
> −1
=σ 2
X 0 (X X) X>
0 + DD >
+I .
722 27. BEST LINEAR PREDICTION
It differs from the prediction MSE matrix by σ 2 I, which is the uncertainty about the value of the
new disturbance ε 0 about which the data have no information.
[Gre97, p. 369] has an enlightening formula showing how the prediction intervals
increase if one goes away from the center of the data.
Now let us look at the prediction problem in the Generalized Least Squares
model
y X ε ε o ε 2 Ψ C
(27.3.5) = β+ E ε = V ε = σ .
y0 X0 ε0 0 o 0 C > Ψ0
27.3. PREDICTION OF FUTURE OBSERVATIONS IN THE REGRESSION MODEL 723
where β̂ is the generalized least squares estimator of β, and that its MSE-matrix
MSE[y ∗0 ; y 0 ] is
(27.3.7) σ 2 Ψ0 −C > Ψ−1 C +(X 0 −C > Ψ−1 X)(X > Ψ−1 X)−1 (X > > −1
0 −X Ψ C) .
Problem 332. Derive the formula for the MSE matrix from the formula of
the predictor, and compute the joint MSE matrix for the predicted values and the
parameter vector.
724 27. BEST LINEAR PREDICTION
and the joint MSE matrix with the sampling error of the parameter vector β̂ − β is
C > Ψ−1 + (X 0 − C > Ψ−1 X)(X > Ψ−1 X)−1 X > Ψ−1 −I
(27.3.13) σ2
(X > Ψ−1 X)−1 X > Ψ−1 O
Ψ C Ψ−1 C + Ψ−1 X(X > Ψ−1 X)−1 (X > > −1
0 −X Ψ C) Ψ−1 X(X > Ψ−1 X)−1
=
C> Ψ0 −I O
27.3. PREDICTION OF FUTURE OBSERVATIONS IN THE REGRESSION MODEL 725
C > Ψ−1 + (X 0 − C > Ψ−1 X)(X > Ψ−1 X)−1 X > Ψ−1 −I
(27.3.14) = σ2
(X > Ψ−1 X)−1 X > Ψ−1 O
X(X > Ψ−1 X)−1 (X >0 −X Ψ
> −1
C) X(X > Ψ−1 X)−1
C > Ψ−1 C + C > Ψ−1 X(X > Ψ−1 X)−1 (X > > −1
0 −X Ψ C) − Ψ0 C > Ψ−1 X(X > Ψ−1 X)−1
The strategy of the proof given in ITPE is similar to the strategy used to obtain
the GLS results, namely, to transform the data in such a way that the disturbances
are well behaved. Both data vectors y and y 0 will be transformed, but this trans-
formation must have the following additional property: the transformed y must be
a function of y alone, not of y 0 . Once such a transformation is found, it is easy to
predict the transformed y 0 on the basis of the transformed y, and from this one also
obtains a prediction of y 0 on the basis of y.
726 27. BEST LINEAR PREDICTION
Answer. What is predicted is a random variable, therefore the MSE matrix is the covariance
matrix of the prediction error. The prediction error is (X 0 − C > Ψ−1 )(β̂ − β), its covariance matrix
is therefore σ 2 (X 0 − C > Ψ−1 X)(X > Ψ−1 X)−1 (X > 0 −X Ψ
> −1
C).
27.3. PREDICTION OF FUTURE OBSERVATIONS IN THE REGRESSION MODEL 727
Problem 334. In the following we work with partitioned matrices. Given the
model
y X ε ε o ε 2 Ψ C
(27.3.18) = β+ E[ ]= V[ ε ] = σ > .
y0 X0 ε0 ε0 o 0 C Ψ0
X has full rank. y is observed, y 0 is not observed. C is not the null matrix.
Answer.
Now one must use B = X 0 (X > Ψ−1 X)−1 X > Ψ−1 . One ends up with
(27.3.23)
MSE[X 0 β̂; y 0 ] = σ 2 X 0 (X > Ψ−1 X)−1 X > > −1
0 −C Ψ X(X > Ψ−1 X)−1 X > > −1
0 −X 0 (X Ψ X)−1 X >
i.e., it exceeds the minimum MSE matrix by C > (Ψ−1 − Ψ−1 X(X > Ψ−1 X)−1 X > Ψ−1 )C. This
is nnd because the matrix in parentheses is M = M ΨM , refer here to Problem 322.
CHAPTER 28
The theory of the linear model often deals with pairs of models which are nested
in each other, one model either having more data or more stringent parameter re-
strictions than the other. We will discuss such nested models in three forms: in
the remainder of the present chapter 28 we will see how estimates must be updated
when more observations become available, in chapter 29 how the imposition of a
linear constraint affects the parameter estimates, and in chapter 30 what happens if
one adds more regressors.
731
732 28. ADDITIONAL OBSERVATIONS
Assume you have already computed the BLUE β̂ on the basis of the observations
y = Xβ +ε ε, and afterwards additional data y 0 = X 0 β +ε ε0 become available. Then
β̂ can be updated using the following principles:
Before the new observations became available, the information given in the orig-
inal dataset not only allowed to estimate β by β̂, but also yielded a prediction
y ∗0 = X 0 β̂ of the additional data. The estimation error β̂ − β and the prediction
error y ∗0 − y 0 are unobserved, but we know their expected values (the zero vectors),
and we also know their joint covariance matrix up to the unknown factor σ 2 . After
the additional data have become available, we can compute the actual value of the
prediction error y ∗0 −y 0 . This allows us to also get a better idea of the actual value of
the estimation error, and therefore we can get a better estimator of β. The following
steps are involved:
(1) Make the best prediction y ∗0 of the new data y 0 based on y.
(2) Compute the joint covariance matrix of the prediction error y ∗0 − y 0 of the
new data by the old (which is observed) and the sampling error in the old regression
β̂ − β (which is unobserved).
(3) Use the formula for best linear prediction (??) to get a predictor z ∗ of β̂−β.
ˆ
(4) Then β̂ = β̂ − z ∗ is the BLUE of β based on the joint observations y and
y0 .
28. ADDITIONAL OBSERVATIONS 733
(5) The sum of squared errors of the updated model minus that of the basic
model is the standardized prediction error SSE ∗ − SSE = (y ∗0 − y 0 )>Ω −1 (y ∗0 − y 0 )
ˆ ˆ
where SSE ∗ = (y − X β̂)> (y − X β̂) V [y ∗0 − y 0 ] = σ 2Ω .
In the case of one additional observation and spherical covariance matrix, this
procedure yields the following formulas:
Problem 335. Assume β̂ is the BLUE on the basis of the observation y =
Xβ + ε , and a new observation y 0 = x>
0 β + ε0 becomes available. Show that the
updated estimator has the form
ˆ y 0 − x>0 β̂
(28.0.25) β̂ = β̂ + (X > X)−1 x0 >
.
1 + x>
0 (X X)−1 x0
• a. Show that the residual ε̂ˆ0 from the full regression is the following nonrandom
multiple of the “predictive” residual y 0 − x>0 β̂:
ˆ 1
(28.0.29) ε̂ˆ0 = y 0 − x>
0 β̂ = >
(y 0 − x>
0 β̂)
1 + x>
0 (X X)−1 x
0
Interestingly, this is the predictive residual divided by its relative variance (to stan-
dardize it one would have to divide it by its relative standard deviation). Compare
this with (31.2.9).
Answer. (28.0.29) can either be derived from (28.0.25), or from the following alternative
application of the updating principle: All the information which the old observations have for the
estimate of x> >
0 β is contained in ŷ 0 = x0 β̂. The information which the updated regression, which
includes the additional observation, has about x> 0 β can therefore be represented by the following
28. ADDITIONAL OBSERVATIONS 735
two “observations”:
>
ŷ 0 1 > δ δ1 0 x>
0 (X X)
−1 x 0
(28.0.30) = x β+ 1 ∼ , σ2 0
y0 1 0 δ2 δ2 0 0 1
This is a regression model with two observations and one unknown parameter, x>0 β, which has a
nonspherical error covariance matrix. The formula for the BLUE of x>
0 β in model (28.0.30) is
(28.0.31)
−1 −1
> −1 >
x> −1 x x> −1 x
ˆ =
0 (X X) 0 0 1
0 (X X) 0 0 ŷ 0
ŷ 0 1 1 1 1
0 1 1 0 1 y0
1 ŷ 0
(28.0.32) = 1 >
+ y0
1+ x>
0 (X X)
−1 x
0
x>
0
(X > X)−1 x0
1
(28.0.33) = >
(ŷ 0 + x> >
0 (X X)
−1
x0 y 0 ).
1 + x>
0 (X X)
−1 x
0
Later, in (32.4.1), one will see that it can also be written in the form
ˆ
(28.0.35) β̂ = β̂ + (Z > Z)−1 x0 (y 0 − x>
0 β̂)
X
where Z = .
x>
0
Problem 336. Show the following fact which is point (5) in the above updating
principle in this special case: If one takes the squares of the standardized predictive
residuals, one gets the difference of the SSE for the regression with and without the
additional observation y 0
(y 0 − x>
0 β̂)
2
(28.0.36) SSE ∗ − SSE = >
1 + x>0 (X X)
−1 x
0
Answer. The sum of squared errors in the old regression is SSE = (y − X β̂)> (y − X β̂);
ˆ > (y − X β̂)
the sum of squared errors in the updated regression is SSE ∗ = (y − X β̂) ˆ + ε̂ˆ 2 . From
0
(28.0.34) follows
ˆ = y − X β̂ − X(X > X)−1 x ε̂ˆ .
y − X β̂
(28.0.37) 0 0
ˆ > (y − X β̂)
ˆ = (y − X β̂)> (y − X β̂) +
If one squares this, the cross product terms fall away: (y − X β̂)
ε̂ˆ0 x> (X > X)−1 x0 ε̂ˆ0 . Adding ε̂ˆ0 2 to both sides gives SSE ∗ = SSE + ε̂ˆ0 2 (1 + x> (X > X)−1 x0 ).
0 0
Now use (28.0.29) to get (28.0.36).
CHAPTER 29
One of the assumptions for the linear model was that nothing is known about
the true value of β. Any k-vector γ is a possible candidate for the value of β. We
used this assumption e.g. when we concluded that an unbiased estimator B̃y of β
must satisfy B̃X = I. Now we will modify this assumption and assume we know
that the true value β satisfies the linear constraint Rβ = u. To fix notation, assume
y be a n × 1 vector, u a i × 1 vector, X a n × k matrix, and R a i × k matrix.
In addition to our usual assumption that all columns of X are linearly independent
(i.e., X has full column rank) we will also make the assumption that all rows of R
are linearly independent (which is called: R has full row rank). In other words, the
matrix of constraints R does not include “redundant” constraints which are linear
combinations of the other constraints.
737
738 29. CONSTRAINED LEAST SQUARES
• d. 2 points Now do the same thing with the modified regression from part b
which incorporates the constraint β + γ = 1: include the original z as an additional
regressor and determine the meaning of the coefficient of z.
What Problem 337 suggests is true in general: every constrained Least Squares
problem can be reduced to an equivalent unconstrained Least Squares problem with
fewer explanatory variables. Indeed, one can consider every least squares problem to
be “constrained” because the assumption E [y] = Xβ for some β is equivalent to a
linear constraint on E [y]. The decision not to include certain explanatory variables
in the regression can be considered the decision to set certain elements of β zero,
which is the imposition of a constraint. If one writes a certain regression model as
a constrained version of some other regression model, this simply means that one is
interested in the relationship between two nested regressions.
Problem 273 is another example here.
740 29. CONSTRAINED LEAST SQUARES
Plug (29.2.3) into (29.2.2) and rearrange to get a regression which is equivalent to
the constrained regression:
(29.2.4) y − X 1 R−1 −1
1 u = (X 2 − X 1 R1 R2 )β 2 + ε
29.2. CONVERSION OF AN ARBITRARY CONSTRAINT INTO A ZERO CONSTRAINT 741
or
(29.2.5) y∗ = Z 2 β2 + ε
One more thing is noteworthy here: if we add X 1 as additional regressors into
(29.2.5), we get a regression that is equivalent to (29.2.2). To see this, define the
difference between the left hand side and right hand side of (29.2.3) as γ 1 = β 1 −
R−1 −1
1 u + R1 R2 β 2 ; then the constraint (29.2.1) is equivalent to the “zero constraint”
γ 1 = o, and the regression
(29.2.6) y − X 1 R−1 −1 −1 −1
1 u = (X 2 − X 1 R1 R2 )β 2 + X 1 (β 1 − R1 u + R1 R2 β 2 ) + ε
This can be accomplished with β = δ and α = γ + R−1 1 R2 δ. The other side is even more trivial:
given α and β, multiplying out the right side of (29.2.10) gives X 1 α + X 2 β − X 1 R−1
1 R2 β, i.e.,
δ = β and γ = α − R−1
1 R2 β.
The constrained least squares estimator is that k × 1 vector β = β̂ˆ which mini-
>
mizes SSE = (y − Xβ) (y − Xβ) subject to the linear constraint Rβ = u.
Again, we assume that X has full column and R full row rank.
The Lagrange approach to constrained least squares, which we follow here, is
given in [Gre97, Section 7.3 on pp. 341/2], also [DM93, pp. 90/1]:
The Constrained Least Squares problem can be solved with the help of the
“Lagrange function,” which is a function of the k × 1 vector β and an additional i × 1
vector λ of “Lagrange multipliers”:
(29.3.1) L(β, λ) = (y − Xβ)> (y − Xβ) + (Rβ − u)> λ
λ can be considered a vector of “penalties” for violating the constraint. For every
possible value of λ one computes that β = β̃ which minimizes L for that λ (This is
an unconstrained minimization problem.) It will turn out that for one of the values
29.3. LAGRANGE APPROACH TO CONSTRAINED LEAST SQUARES 743
ˆ
for all β̃. Since by assumption, β̂ also satisfies the constraint, this simplifies to:
ˆ > (y − X β̂).
ˆ
(29.3.4) (y − X β̃)> (y − X β̃) + (Rβ̃ − u)> λ∗ ≥ (y − X β̂)
This is still true for all β̃. If we only look at those β̃ which satisfy the constraint, we get
Instead of imposing the constraint itself, one imposes a penalty function which
has such a form that the agents will “voluntarily” heed the constraint. This is
a familiar principle in neoclassical economics: instead of restricting pollution to a
certain level, tax the polluters so much that they will voluntarily stay within the
desired level.
The proof which follows now not only derives the formula for β̂ ˆ but also shows
∗ ˆ ˆ
that there is always a λ for which β̂ satisfies Rβ̂ = u.
Problem 340. 2 points Use the simple matrix differentiation rules ∂(w> β)/∂β > =
w> and ∂(β > M β)/∂β > = 2β > M to compute ∂L/∂β > where
(29.3.6) L(β) = (y − Xβ)> (y − Xβ) + (Rβ − u)> λ
Answer. Write the objective function as y > y − 2y > Xβ + β > X > Xβ + λ> Rβ − λ> u to get
(29.3.7).
Some textbook treatments have an extra factor 2 in front of λ∗ , which makes the
math slightly smoother, but which has the disadvantage that the Lagrange multiplier
can no longer be interpreted as the “shadow price” for violating the constraint.
ˆ to get that β̂
Solve (29.3.8) for β̂ ˆ which minimizes L for any given λ∗ :
(29.3.10) ˆ = (X > X)−1 X > y − 1 (X > X)−1 R> λ∗ = β̂ − 1 (X > X)−1 R> λ∗
β̂
2 2
Here β̂ on the right hand side is the unconstrained OLS estimate. Plug this formula
ˆ into (29.3.9) in order to determine that value of λ∗ for which the corresponding
for β̂
746 29. CONSTRAINED LEAST SQUARES
(29.3.13) ˆ = β̂ − (X > X)−1 R> R(X > X)−1 R> −1 (Rβ̂ − u).
β̂
Problem 341. If R has full row rank and X full column rank, show that
R(X > X)−1 R> has an inverse.
Answer. Since it is nonnegative definite we have to show that it is positive definite. b> R(X > X)−
0 implies b> R = o> because (X > X)−1 is positive definite, and this implies b = o because R has
full row rank.
(29.3.14) ˆ = β̂ − (X > Ψ−1 X)−1 R> R(X > Ψ−1 X)−1 R> −1 (Rβ̂ − u)
β̂
where β̂ = (X > Ψ−1 X)−1 X > Ψ−1 y. This formula is given in [JHG+ 88, (11.2.38)
on p. 457]. Remark, which you are not asked to prove: this is the best linear unbiased
estimator if ε ∼ (o, σ 2 Ψ) among all linear estimators which are unbiased whenever
the true β satisfies the constraint Rβ = u.)
Answer. Lagrange function is
L(β, λ) = (y − Xβ)> Ψ−1 (y − Xβ) + (Rβ − u)> λ
= y > y − 2y > Ψ−1 Xβ + β > X > Ψ−1 Xβ + λ> Rβ − λ> u
Jacobian is
Answer.
u Rβ̂ R(X > X)−1 R> R(X > X)−1
(29.4.1) ∼ , σ2 .
β β̂ (X > X)−1 R> (X > X)−1
• b. 1 point Look at the formula for the predictor you just derived. Have you
seen this formula before? Describe the situation in which this formula is valid as a
BLUE-formula, and compare the situation with the situation here.
Answer. Of course, constrained least squares. But in contrained least squares, β is nonrandom
and β̂ is random, while here it is the other way round.
In the unconstrained OLS model, i.e., before the “observation” of u = Rβ, the
best bounded MSE estimators of u and β are Rβ̂ and β̂, with the sampling errors
having the following means and variances:
After the observation of u we can therefore apply (27.1.18) to get exactly equation
ˆ This is probably the easiest way to derive this equation, but it derives
(29.3.13) for β̂.
constrained least squares by the minimization of the MSE-matrix, not by the least
squares problem.
ˆ − (β̂ − β̂)
(β − β̂)> X > X(β − β̂) = β − β̂ ˆ > X > X β − β̂
ˆ − (β̂ − β̂)
ˆ
ˆ ˆ
= (β − β̂)> X > X(β − β̂)
ˆ > X > X(X > X)−1 R> R(X > X)−1 R> −1 (Rβ̂ −
− 2(β − β̂)
ˆ > X > X(β̂ − β̂).
+ (β̂ − β̂) ˆ
−1
The cross product terms can be simplified to −2(Rβ−u)> R(X > X)−1 R> (Rβ̂−
> > −1 > −1
u), and the last term is (Rβ̂ − u) R(X X) R (Rβ̂ − u). Therefore the
objective function for an arbitrary β can be written as
The first and last terms do not depend on β at all; the third term is zero whenever
ˆ in which
β satisfies Rβ = u; and the second term is minimized if and only if β = β̂,
case it also takes the value zero.
The last term is zero if β satisfies the constraint. Now use (24.0.7) twice to get
ˆ −1
(29.6.2) β̂ − β = W X >ε −(X > X)−1 R> R(X > X)−1 R> (Rβ − u)
29.6. SAMPLING PROPERTIES OF CONSTRAINED LEAST SQUARES 753
where
−1
(29.6.3) W = (X > X)−1 − (X > X)−1 R> R(X > X)−1 R> R(X > X)−1 .
ˆ − β = W X >ε . In this case,
If β satisfies the constraint, (29.6.2) simplifies to β̂
ˆ is unbiased and MSE[β̂;
therefore, β̂ ˆ β] = σ 2 W (Problem 344). Since (X > X)−1 −
(29.6.5) ˆ β] = σ 2 W +
MSE[β̂;
−1
+ (X > X)−1 R> R(X > X)−1 R> (Rβ − u) ·
−1
· (Rβ − u) >
R(X X)−1 R>
>
R(X > X)−1 .
Even if the true parameter does not satisfy the constraint, it is still possible
that the constrained least squares estimator has a better MSE matrix than the
unconstrained one. This is the case if and only if the true parameter values β and
σ 2 satisfy
(29.6.6) (Rβ − u)> R(X > X)−1 R> )−1 (Rβ − u) ≤ σ 2 .
This equation, which is the same as [Gre97, (8-27) on p. 406], is an interesting
result, because the obvious estimate of the lefthand side in (29.6.6) is i times the
value of the F -test statistic for the hypothesis Rβ = u. To test for this, one has to
use the noncentral F -test with parameters i, n − k, and 1/2.
29.7. ESTIMATION OF THE VARIANCE IN CONSTRAINED OLS 755
is nonnegative definite. Since Ω = σ 2 R(X > X)−1 R> has an inverse, theorem A.5.9 immediately
leads to (29.6.6).
(29.7.1) ˆ = ε̂ + X(X > X)−1 R> R(X > X)−1 R> −1 (Rβ̂ − u).
ε̂ˆ = y − X β̂
756 29. CONSTRAINED LEAST SQUARES
Now note that E [Rβ̂ − u] = Rβ − uand V [Rβ̂ − u] = σ 2 R(X > X)−1 R> . Therefore
−1
use (9.2.1) in theorem 9.2.1 and tr R(X > X)−1 R> R(X > X)−1 R>
= i to
get
−1
(29.7.3) E[(Rβ̂ − u)> R(X > X)−1 R> (Rβ̂ − u)] =
−1
= σ 2 i+(Rβ − u)> R(X > X)−1 R> (Rβ − u)
ˆ
Problem 346. 3 points Assume β̂ is the constrained least squares estimate, and
β 0 is any vector satisfying Rβ 0 = u. Show that in the decomposition
(29.7.6) ˆ − β ) + ε̂ˆ
y − Xβ 0 = X(β̂ 0
ˆ ŷ − ŷ,
• b. 4 points Show that in (29.7.7) the three vectors ŷ, ˆ and ε̂ are orthog-
onal. You are allowed to use, without proof, formula (29.3.13):
Answer. One has to verify that the scalar products of the three vectors on the right hand side
ˆ > ε̂ = β̂
of (29.7.7) are zero. ŷ
ˆ> X > ˆ> >
ˆ > ε̂ = (β̂ − β̂)
ε̂ = 0 and (ŷ − ŷ) X ε̂ = 0 follow from X > ε̂ = o;
ˆ
geometrically on can simply say that ŷ and ŷ are in the space spanned by the columns of X, and
ˆ
ε̂ is orthogonal to that space. Finally, using (29.3.13) for β̂ − β̂,
Problem 348.
• b. 1 point Can you think of a practical situation in which this model might be
appropriate?
Answer. This can occur if one measures data which theoretically add to zero, and the mea-
surement errors are independent and have equal standard deviations.
• c. 2 points Check your results against a SAS printout (or do it in any other
statistical package) with the data vector y > = [ −1 0 1 2 ]. Here are the sas commands:
data zeromean;
input y x1 x2 x3 x4;
cards;
-1 1 0 0 0
0 0 1 0 0
1 0 0 1 0
2 0 0 0 1
29.7. ESTIMATION OF THE VARIANCE IN CONSTRAINED OLS 761
;
proc reg;
model y= x1 x2 x3 x4 /
noint;
restrict x1+x2+x3+x4=0;
output out=zerout
residual=ehat;
run;
proc print data=zerout;
run;
Additional Regressors
A good detailed explanation of the topics covered in this chapter is [DM93, pp.
19–24]. [DM93] use the addition of variables as their main paradigm for going from
a more restrictive to a less restrictive model.
In this chapter, the usual regression model is given in the form
β1
+ ε = Xβ + ε , ε ∼ (o, σ 2 I)
(30.0.1) y = X 1 β 1 + X 2 β 2 + ε = X 1 X 2
β2
β
where X = X 1 X 2 has full column rank, and the coefficient vector is β = 1 .
β2
We take a sequential approach to this regression. First we regress y on X 1
alone, which gives the regression coefficient β̂ ˆ . This by itself is an inconsistent
1
765
766 30. ADDITIONAL REGRESSORS
estimator of β 1 , but we will use it as a stepping stone towards the full regression.
We make use of the information gained by the regression on X 1 in our computation
of the full regression. Such a sequential approach may be appropriate in the following
situations:
• If regression on X 1 is much simpler than the combined regression, for in-
stance if X 1 contains dummy or trend variables, and the dataset is large.
Example: model (64.3.4).
• If we want to fit the regressors in X 2 by graphical methods and those in
X 1 by analytical methods (added variable plots).
• If we need an estimate of β 2 but are not interested in an estimate of β 1 .
• If we want to test the joint significance of the regressors in X 2 , while X 1
consists of regressors not being tested.
ˆ + ε̂.
If one regresses y on X 1 , one gets y = X 1 β̂ ˆ is an inconsistent
ˆ Of course, β̂
1 1
estimator of β 1 , since some explanatory variables are left out. And ε̂ˆ is orthogonal
to X 1 but not to X 2 .
The iterative “backfitting” method proceeds from here as follows: it regresses ε̂ˆ
on X 2 , which gives another residual, which is again orthogonal on X 2 but no longer
orthogonal on X 1 . Then this new residual is regressed on X 1 again, etc.
30. ADDITIONAL REGRESSORS 767
Problem 350. The purpose of this Problem is to get a graphical intuition of the
issues in sequential regression. Make sure the stand-alone program xgobi is installed
on your computer (in Debian GNU-Linux do apt-get install xgobi), and the R-
interface xgobi is installed (the R-command is simply install.packages("xgobi"),
or, on a Debian system the preferred argument is install.packages("xgobi", lib
= "/usr/lib/R/library")). You have to give the commands library(xgobi) and
then reggeom(). This produces a graph in the XGobi window which looks like [DM93,
Figure 3b on p. 22]. If you switch from the XYPlot view to the Rotation view, you
will see the same lines rotating 3-dimensionally, and you can interact with this graph.
You will see that this graph shows the dependent variable y, the regression of y on
x1 , and the regression of y on x1 and x2 .
• a. 1 point In order to show that you have correctly identified which line is y,
please answer the following two questions: Which color is y: red, yellow, light blue,
dark blue, green, purple, or white? If it is yellow, also answer the question: Is it that
yellow line which is in part covered by a red line, or is it the other one? If it is red,
green, or dark blue, also answer the question: Does it start at the origin or not?
are drawn in dark blue, and the quickly improving approximations to the fitted value
are connected by a red zig-zag line.
• h. 1 point The diagram contains the points for two more backfitting steps.
Identify the endpoints of both residuals.
• j. 1 point Of the lines cp, pq, qr, and rs, two are parallel to x1 , and two
parallel to x2 . Which two are parallel to x1 ?
• l. 3 points Which two variables are plotted against each other in an added-
variable plot for x2 ?
x1 x2 y ŷ ŷ ˆ
5 -1 3 3 3
0 4 3 3 0
0 0 4 0 0
In the dataset which R submits to XGobi, all coordinates are multiplied by 1156,
which has the effect that all the points included in the animation have integer coor-
dinates.
h3i h3i
Problem 351. 2 points How do you know that the decomposition 3 = 0 +
h0i h i h i 4 0
ˆ + ε̂ˆ in the regression of y = 33 on x1 = 50 ?
3 is y = ŷ
4 4 0
Answer. Besides the equation y = ŷ ˆ + ε̂ˆ we have to check two things: (1) ŷ ˆ is a linear
ˆ
combination of all the explanatory variables (here: is a multiple of x1 ), and (2) ε̂ is orthogonal to
all explanatory variables. Compare Problem ??.
h3i
Problem 352. 3 points In the same way, check that the decomposition 3 =
h3i h0i h3i h5i 4 i
h −1
3 + 0 is y = ŷ + ε in the regression of y = 3 on x1 = 0 and x2 = 4 .
0 4 4 0 0
Answer. Besides the equation y = ŷ ˆ + ε̂ˆ we have to check two things: (1) ŷ
ˆ is a linear
combination of all the explanatory variables. Since both x1 and x2 have zero as third coordinate,
and they are linearly independent, they span the whole plane, therefore ŷ, which also has the
30. ADDITIONAL REGRESSORS 771
third coordinate zero, is their linear combination. (2) ε̂ˆ is orthogonal to both explanatory variables
because its only nonzero coordinate is the third.
h i h i h i
The residuals ε̂ˆ in the regression on x1 are y − ŷˆ = 33 − 30 = 03 . This
4 h 0 0i 4 h
−1
h3i i
vector is clearly orthogonal to x1 = 0 . Now let us regress ε̂ˆ = 3 on x2 = 4 .
0 4 0
Say h is the vector of fitted values and k the residual vector in this regression. We
saw in problem 350 that this is the next step in backfitting, but k is not the same
as the residual vector ε̂ in the multiple regression, because k is not orthogonal to
x1 . In order to get the correct residual in the joint regression and also the correct
coefficient of x2 , one must regress ε̂ˆ only on that part of x2 which is orthogonal to
x1 . This regressor is the dark blue line starting at the origin.
In formulas: One gets the correct ε̂ and β̂ 2 by regressingx ε̂ˆ = M 1 y not on X 2
but on M 1 X 2 , where M 1 = I − X 1 (X > 1 X 1)
−1
X>
1 is the matrix which forms the
residuals under the regression on X 1 . In other words, one has to remove the influence
of X 1 not only from the dependent but also the independent variables. Instead of
regressing the residuals ε̂ˆ = M 1 y on X 2 , one has to regress them on what is new
about X 2 after we know X 1 , i.e., on what remains of X 2 after taking out the effect
of X 1 , which is M 1 X 2 . The regression which gets the correct β̂ 2 is therefore
(30.0.2) M 1 y = M 1 X 2 β̂ 2 + ε̂
772 30. ADDITIONAL REGRESSORS
This regression also yields the correct covariance matrix. (The only thing which
is not right is the number of degrees of freedom). The regression is therefore fully
representative of the additional effect of x2 , and the plot of ε̂ˆ against M 1 X 2 with
the fitted line drawn (which has the correct slope β̂ 2 ) is called the “added variable
plot” for X 2 . [CW99, pp. 244–246] has a good discussion of added variable plots.
Problem 353. 2 points Show that in the model (30.0.1), the estimator β̂ 2 =
(X >
2 M 1X 2)
−1
X>
2 M 1 y is unbiased. Compute MSE[β̂ 2 ; β 2 ].
In order to get an estimate of β̂ 1 , one can again do what seems intuitive, namely,
regress y − X 2 β̂ 2 on X 1 . This gives
(30.0.4) β̂ 1 = (X >
1 X 1)
−1
X>
1 (y − X 2 β̂ 2 ).
This regression also gives the right residuals, but not the right estimates of the
covariance matrix.
30. ADDITIONAL REGRESSORS 773
Problem 354. The three Figures in [DM93, p. 22] can be seen in XGobi if
you use the instructions in Problem 350. The purple line represents the dependent
variable y, and the two yellow lines the explanatory variables x1 and x2 . (x1 is the
one which is in part red.) The two green lines represent the unconstrained regression
ˆ + ε̂ˆ where y is
y = ŷ + ε̂, and the two red lines the constrained regression y = ŷ
only regressed on x1 . The two dark blue lines, barely visible against the dark blue
background, represent the regression of x2 on x1 .
• a. The first diagram which XGobi shows on startup is [DM93, diagram (b)
on p. 22]. Go into the Rotation view and rotate the diagram in such a way that the
view is [DM93, Figure (a)]. You may want to delete the two white lines, since they
are not shown in Figure (a).
ˆ=
• b. Make a geometric argument that the light blue line, which represents ŷ − ŷ
ˆ
X(β̂ − β̂), is orthogonal on the green line ε̂ (this is the green line which ends at the
point y, i.e., not the green line which starts at the origin).
Answer. The light blue line lies in the plane spanned by x1 and x2 , and ε̂ is orthogonal to
this plane.
• c. Make a geometric argument that the light blue line is also orthogonal to the
ˆ emanating from the origin.
red line ŷ
774 30. ADDITIONAL REGRESSORS
Answer. This is a little trickier. The red line ε̂ˆ is orthogonal to x1 , and the green line ε̂ is
also orthogonal to x1 . Together, ε̂ and ε̂ˆ span therefore the plane orthogonal to x1 . Since the light
ˆ it is orthogonal to x1 .
blue line lies in the plane spanned by ε̂ and ε̂,
Problem 355. 4 points This is a simplified version of question 593. Show the
following, by multiplying X > X with its alleged inverse: If X = X 1 X 2 has full
where M 1 = I − X 1 (X >
1 X 1)
−1
X> >
1 and K 1 = X 1 (X 1 X 1 )
−1
.
776 30. ADDITIONAL REGRESSORS
From (30.0.5) one sees that the covariance matrix in regression (30.0.3) is the
lower left partition of the covariance matrix in the full regression (30.0.1).
Problem 356. 6 points Use the usual formula β̂ = (X > X)−1 X > y together
with (30.0.5) to prove (30.0.3) and (30.0.4).
Since M 1 = I − K 1 X >
1 , one can simplify
(30.0.7)
β̂ 2 = −(X >
2 M 1X2)
−1 >
X2 K1X> >
1 y + (X 2 M 1 X 2 )
−1 >
X2 y
(30.0.8)
= (X >
2 M 1X2)
−1 >
X2 M y
(30.0.9)
β̂ 1 = (X >
1 X1)
−1 >
X1 y + K> >
1 X 2 (X 2 M 1 X 2 )
−1 >
X2 K1X> > >
1 y − K 1 X 2 (X 2 M 1 X 2 )
−1 >
X2 y
(30.0.10)
= K> > >
1 y − K 1 X 2 (X 2 M 1 X 2 )
−1 >
X 2 (I − K 1 X >
1 )y
(30.0.11)
= K> > >
1 y − K 1 X 2 (X 2 M 1 X 2 )
−1 >
X2 M 1y
(30.0.12)
= K>
1 (y − X 2 β̂ 2 )
[Gre97, pp. 245–7] follow a different proof strategy: he solves the partitioned
normal equations
>
X1 X1 X>
>
1 X2 β̂ 1 X 1 y
(30.0.13)
X> 2 X 1 X >
2 X 2 β̂ 2 X >2y
778 30. ADDITIONAL REGRESSORS
directly, without going through the inverse. A third proof strategy, used by [Seb77,
pp. 65–72], is followed in Problems 358 and 359.
Problem 357. 5 points [Gre97, problem 18 on p. 326]. The following matrix
gives the slope in the simple regression of the column variable on the row variable:
y x1 x2
1 0.03 0.36 y
(30.0.14)
0.4 1 0.3 x1
1.2 0.075 1 x2
For example, if y is regressed on x1 , the slope is 0.4, but if x1 is regressed on y, the
slope is 0.03. All variables have zero means, so the constant terms in all regressions
are zero. What are the two slope coefficients in the multiple regression of y on x1
and x2 ? Hint: Use the partitioned normal equation as given in [Gre97, p. 245] in
the special case when each of the partitions of X has only one colum.
Answer.
x>
1 x1 x>
1 x2 β̂1 x>
1 y
(30.0.15) =
x>
2 x1 x>
2 x2 β̂2 x>
2 y
which is the upper line of [Gre97, (6.24) on p, 245], and in our numbers this is β̂1 = 0.4 − 0.3β̂2 .
The second row reads
(30.0.17) (x>
2 x2 )
−1 >
x2 x1 β̂1 + β̂2 = (x>
2 x2 )
−1 >
x2 y
or in our numbers 0.075β̂2 + β̂2 = 1.2. Plugging in the formula for β̂1 gives 0.075 · 0.4 − 0.075 ·
0.3β̂2 + β̂2 = 1.2. This gives β̂2 = 1.17/0.9775 = 1.196931 = 1.2 roughly, and β̂1 = 0.4 − 0.36 =
0.0409207 = 0.041 roughly.
Problem 358. Derive (30.0.3) and (30.0.4) from the first order conditions for
minimizing
(30.0.18) (y − X 1 β 1 − X 2 β 2 )> (y − X 1 β 1 − X 2 β 2 ).
Answer. Start by writing down the OLS objective function for the full model. Perhaps we
can use the more sophisticated matrix differentiation rules?
(30.0.19)
(y−X 1 β 1 −X 2 β 2 )> (y−X 1 β 1 −X 2 β 2 ) = y > y+β > > > > > >
1 X 1 X 1 β 1 +β 2 X 2 X 2 β 2 −2y X 1 β 1 −2y X 2 β 2 +
(30.0.20)
2β > > > > >
1 X 1 X 1 − 2y X 1 + 2β 2 X 2 X 1 or, transposed 2X > > >
1 X 1 β 1 − 2X 1 y + 2X 1 X 2 β 2
(30.0.21)
2β > > > > > >
2 X 2 X 2 − 2y X 2 + 2β 1 X 1 X 2 or, transposed 2X > > >
2 X 2 β 2 − 2X 2 y + 2X 2 X 1 β 1
780 30. ADDITIONAL REGRESSORS
(30.0.24) X 1 β̂ 1 = X 1 (X >
1 X1)
−1 >
X 1 (y − X 2 β̂ 2 ).
(30.0.25) X> > >
2 X 2 β̂ 2 = X 2 y − X 1 (X 1 X 1 )
−1 >
X 1 y + X 1 (X >
1 X1)
−1 >
X 1 X 2 β̂ 2
Problem 359. Using (30.0.3) and (30.0.4) show that the residuals in regression
(30.0.1) are identical to those in the regression of M 1 y on M 1 X 2 .
30. ADDITIONAL REGRESSORS 781
Answer.
(30.0.27) ε̂ = y − X 1 β̂ 1 − X 2 β̂ 2
(30.0.28) = y − X 1 (X >
1 X1)
−1 >
X 1 (y − X 2 β̂ 2 ) − X 2 β̂ 2
(30.0.29) = M 1 y − M 1 X 2 β̂ 2 .
Problem 360. The following problem derives one of the main formulas for
adding regressors, following [DM93, pp. 19–24]. We are working in model (30.0.1).
• a. 1 point Show that, if X has full column rank, then X > X, X >1 X 1 , and
X>
2 X2 are nonsingular. Hint: A matrix X has full column rank if Xa = o implies
a = o.
Answer. From X > Xa = o follows a> X > Xa = 0 which can also be written kXak = 0.
Therefore Xa = o, and since the columns are linearly independent, it follows a = o. X >
1 X 1 and
X>
2 X 2 are nonsingular because, along with X, also X 1 and X 2 have full column rank.
• g. 2 points Prove that (30.0.31) is the fit which one gets if one regresses M 1 y
on M 1 X 2 . In other words, if one runs OLS with dependent variable M 1 y and
explanatory variables M 1 X 2 , one gets the same β̂ 2 and ε̂ as in (30.0.31), which are
the same β̂ 2 and ε̂ as in the complete regression (30.0.30).
Answer. According to Problem ?? we have to check X > > >
2 M 1 ε̂ = X 2 M 1 M y = X 2 M y =
Oy = o.
784 30. ADDITIONAL REGRESSORS
for the BLUE is not unique, since one can add any C with CM 1 C > = O or equivalently CM 1 = O
or C = AX for some A. However such a C applied to a dependent variable of the form M 1 y will
give the null vector, therefore the values of the BLUE for those values of y which are possible are
indeed unique.
• j. 1 point Once β̂ 2 is known, one can move it to the left hand side in (30.0.30)
to get
(30.0.33) y − X 2 β̂ 2 = X 1 β̂ 1 + ε̂
Prove that one gets the right values of β̂ 1 and of ε̂ if one regresses y − X 2 β̂ 2 on X 1 .
Answer. The simplest answer just observes that X > 1 ε̂ = o. Or: The normal equation for this
pseudo-regression is X >
1 y − X >
1 X 2 β̂ 2 = X >
1 X 1 β̂ 1 , which holds due to the normal equation for the
full model.
• k. 1 point Does (30.0.33) also give the right covariance matrix for β̂ 1 ?
Answer. No, since y − X 2 β̂ 2 has a different covariance matrix than σ 2 I.
This following Problems gives some applications of the results in Problem 360.
You are allowed to use the results of Problem 360 without proof.
Problem 361. Assume your regression involves an intercept, i.e., the matrix of
regressors is ι X , where X is the matrix of the “true” explanatory variables with
786 30. ADDITIONAL REGRESSORS
no vector of ones built in, and ι the vector of ones. The regression can therefore be
written
(30.0.34) y = ια + Xβ + ε .
• a. 1 point Show that the OLS estimate of the slope parameters β can be obtained
by regressing y on X without intercept, where y and X are the variables with their
means taken out, i.e., y = Dy and X = DX, with D = I − n1 ιι> .
Answer. This is called the “sweeping out of means.” It follows immediately from (30.0.3).
This is the usual procedure to do regression with a constant term: in simple regression y i =
α + βxi + εi , (30.0.3) is equation (18.2.22):
P
(xi − x̄)(y i − ȳ)
(30.0.35) β̂ = P .
(xi − x̄)2
> >
• b. Show that the OLS estimate of the intercept is α̂ = ȳ − x̄ β̂ where x̄ is
the row vector of column means of X, i.e., x̄> = n1 ι> X.
Answer. This is exactly (30.0.4). Here is a more specific argument: The intercept α̂ is ob-
tained by regressing y − X β̂ on ι. The normal equation for this second regression is ι> y − ι> X β̂ =
ι> ια̂. If ȳ is the mean of y, and x̄> the row vector consisting of means of the colums of X, then this
gives ȳ = x̄> β̂ + α̂. In the case of simple regression, this was derived earlier as formula (18.2.23).
30. ADDITIONAL REGRESSORS 787
• c. 2 points Show that MSE[β̂; β] = σ 2 (X > X)−1 . (Use the formula for β̂.)
Answer. Since
ι> n nx̄>
(30.0.36) ι X = ,
X> x̄n X>X
In other words, one simply does as if the actual regressors had been the data with their means
removed, and then takes the inverse of that design matrix. The only place where on has to be
careful is the number of degrees of freedom. See also Seber [Seb77, section 11.7] about centering
and scaling the data.
Answer.
• f. 3 points Now, split once more X = X 1 x2 where the second partition
x2 consists of one column only, and X is, as above, the X matrix with the column
β̂
means taken out. Conformably, β̂ = 1 . Show that
β̂ 2
σ2 1
(30.0.39) var[β̂ 2 ] = 2 )
x> x (1 − R2·
2
where R2· is the R2 in the regression of x2 on all other variables in X. This is in
[Gre97, (9.3) on p. 421]. Hint: you should first show that var[β̂ 2 ] = σ 2 /x> 2 M 1 x2
where M 1 = I−X 1 (X > 1 X 1)
−1
X>1 . Here is an interpretation of (30.0.39) which you
don’t have to prove: σ 2 /x> x is the variance in a simple regression with a constant
2
term and x2 as the only explanatory variable, and 1/(1 − R2· ) is called the variance
inflation factor.
30.1. SELECTION OF REGRESSORS 789
Answer. Note that we are not talking about the variance of the constant term but that of all
the other terms.
> −1 X > x
x>
2 X 1 (X 1 X 1 )
(30.0.40) x> > > >
2 M 1 x2 = x2 x2 + x2 X 1 (X 1 X 1 )
−1 >
X 1 x2 = x>
2 x2 1 +
1 2
x>
2 x2
2 , i.e., it is the R2 in the regression of x on all other variables in X, we
and since the fraction is R2· 2
get the result.
variables have a too large SSE compared with the good regressions, one does not
have to examine the subsets of these bad regressions. The precise implementation is
more a piece of engineering than mathematics. Let’s just go through their example.
Stage 0 must always be done. In it, the following regressions are performed:
Note that in the product traverse the procedure is lexicographical, i.e., the lowest
placed regressors are introduced first, since they promise to have the lowest SSE. In
the inverse traverse, the regressions on all four-variable sets which include the fifth
variable are generated. Our excerpt of table 2 does not show how these regressions
were computed; from the full table one can see that the two regressions shown in each
row are generated by a sweep on the same pivot index. In the “product traverse,” the
source matrix of each sweep operation is the result of the previous regression. For
the inverse traverse, the source is in each of these four regressions the same, namely,
the inverse of the SSCP matrix, but different regressors are eliminated by the sweep.
792 30. ADDITIONAL REGRESSORS
Now we are at the beginning of stage 1. Is it necessary to perform the sweep which
generates regression 124? No other regression will be derived from 124, therefore we
only have to look at regression 124 itself, not any subsets of these three variables. It
would not necessary to perform this regression if the regression with variables 1245
(596) had a higher SSE than the best three-variable regression run so far, which is
123, whose SSE is 612. Since 596 is not higher than 612, we must run the regression.
It gives
Product Traverse Inverse Traverse
Regressors SSE Regressors SSE
124 615 125 597
(x>
i is the ith row of X), or the “predictive” residuals, which are the residuals
computed using the OLS estimate of β gained from all the other data except the
data point where the residual is taken. If one writes β̂(i) for the OLS estimate
without the ith observation, the defining equation for the ith predictive residual,
795
796 31. RESIDUALS
The second decision is whether to standardize the residuals or not, i.e., whether
to divide them by their estimated standard deviations or not. Since ε̂ = M y, the
variance of the ith ordinary residual is
(31.1.3) var[ε̂i ] = σ 2 mii = σ 2 (1 − hii ),
and regarding the predictive residuals it will be shown below, see (31.2.9), that
σ2 σ2
(31.1.4) var[ε̂i (i)] = = .
mii 1 − hii
Here
>
(31.1.5) hii = x>
i (X X)
−1
xi .
(Note that xi is the ith row of X written as a column vector.) hii is the ith diagonal
element of the “hat matrix” H = X(X > X)−1 X > , the projector on the column
space of X. This projector is called “hat matrix” because ŷ = Hy, i.e., H puts the
“hat” on y.
31.1. THREE DECISIONS ABOUT PLOTTING RESIDUALS 797
Problem 362. 2 points Show that the ith diagonal element of the “hat matrix”
H = X(X > X)−1 X > is x> >
i (X X)
−1
xi where xi is the ith row of X written as a
column vector.
Answer. In terms of ei , the n-vector with 1 on the ith place and 0 everywhere else, xi =
X > ei , and the ith diagonal element of the hat matrix is e> > >
i Hei = ei X i (X X)
−1 X > e =
i
> > −1
xi (X X) xi .
Problem 363. 2 points The variance of the ith disturbance is σ 2 . Is the variance
of the ith residual bigger than σ 2 , smaller than σ 2 , or equal to σ 2 ? (Before doing the
math, first argue in words what you would expect it to be.) What about the variance
of the predictive residual? Prove your answers mathematically. You are allowed to
use (31.2.9) without proof.
Answer. Here is only the math part of the answer: ε̂ = M y. Since M = I − H is idempotent
and symmetric, we get V [M y] = σ 2 M , in particular this means var[ε̂i ] = σ 2 mii where mii is the
ith diagonal elements of M . Then mii = 1 − hii . Since all diagonal elements of projection matrices
are between 0 and 1, the answer is: the variances of the ordinary residuals cannot be bigger than
σ 2 . Regarding predictive residuals, if we plug mii = 1 − hii into (31.2.9) it becomes
1 1 2 σ2
(31.1.6) ε̂i (i) = ε̂i therefore var[ε̂i (i)] = 2
σ mii =
mii mii mii
which is bigger than σ 2 .
798 31. RESIDUALS
Problem 364. Decide in the following situations whether you want predictive
residuals or ordinary residuals, and whether you want them standardized or not.
• a. 1 point You are looking at the residuals in order to check whether the asso-
ciated data points are outliers and do perhaps not belong into the model.
Answer. Here one should use the predictive residuals. If the ith observation is an outlier
which should not be in the regression, then one should not use it when running the regression. Its
inclusion may have a strong influence on the regression result, and therefore the residual may not
be as conspicuous. One should standardize them.
• b. 1 point You are looking at the residuals in order to assess whether there is
heteroskedasticity.
Answer. Here you want them standardized, but there is no reason to use the predictive
residuals. Ordinary residuals are a little more precise than predictive residuals because they are
based on more observations.
• c. 1 point You are looking at the residuals in order to assess whether the
disturbances are autocorrelated.
Answer. Same answer as for b.
• d. 1 point You are looking at the residuals in order to assess whether the
disturbances are normally distributed.
31.1. THREE DECISIONS ABOUT PLOTTING RESIDUALS 799
Answer. In my view, one should make a normal QQ-plot of standardized residuals, but one
should not use the predictive residuals. To see why,√let us first look at the distribution of the
standardized residuals before division by s. Each ε̂i / 1 − hii is normally distributed with mean
zero and standard deviation σ. (But different such residuals are not independent.) If one takes a
QQ-plot of those residuals against the normal distribution, one will get in the limit a straight line
with slope σ. If one divides every residual by s, the slope will be close to 1, but one will again get
something approximating a straight line. The fact that s is random does not affect the relation
of the residuals to each other, and this relation is what determines whether or not the QQ-plot
approximates a straight line.
But Belsley, Kuh, and Welsch on [BKW80, p. 43] draw a normal probability plot of the
studentized, not the standardized, residuals. They give no justification for their choice. I think it
is the wrong choice.
• e. 1 point Is there any situation in which you do not want to standardize the
residuals?
The third decision is how to plot the residuals. Never do it against y. Either
do it against the predicted ŷ, or make several plots against all the columns of the
X-matrix.
In time series, also a plot of the residuals against time is called for.
Another option are the partial residual plots, see about this also (30.0.2). Say
β̂[h] is the estimated parameter vector, which is estimated with the full model, but
after estimation we drop the h-th parameter, and X[h] is the X-matrix without
the hth column, and xh is the hth column of the X-matrix. Then by (30.0.4), the
estimate of the hth slope parameter is the same as that in the simple regression of
y − X[h]β̂[h] on xh . The plot of y − X[h]β̂[h] against xh is called the hth partial
residual plot.
To understand this better, start out with a regression y i = α + βxi + γzi + εi ;
which gives you the fitted values y i = α̂+ β̂xi +γ̂zi + ε̂i . Now if you regress y i − α̂− β̂xi
on xi and zi then the intercept will be zero and the estimated coefficient of xi will
be zero, and the estimated coefficient of zi will be γ̂, and the residuals will be ε̂i .
The plot of y i − α̂ − β̂xi versus zi is the partial residuals plot for z.
out. We will show now that there is a very simple mathematical relationship between
the ith predictive residual and the ith ordinary residual, namely, equation (31.2.9).
(It is therefore not necessary to run n different regressions to get the n predictive
residuals.)
We will write y(i) for the y vector with the ith element deleted, and X(i) is the
matrix X with the ith row deleted.
Answer. Write (31.2.2) as X > y = X(i)> y(i) + xi yi , and observe that with our definition of
xi as column vectors representing the rows of X, X > = x1 · · · xn . Therefore
y1
.. = x1 y1 + · · · + xn yn .
>
(31.2.3) X y = x1 ... xn .
yn
802 31. RESIDUALS
1
(31.2.7) (X(i)> X(i))−1 xi = (X > X)−1 xi ,
1 − hii
and using (31.2.7) show that hii (i) is related to hii by the equation
1
(31.2.8) 1 + hii (i) =
1 − hii
1
(31.2.9) ε̂i (i) = ε̂i
1 − hii
Answer. For this we have to apply the above mathematical tools. With the help of (31.2.7)
(transpose it!) and (31.2.2), (31.1.2) becomes
y i , we get
ŷ i (i) 1 δ δi 0 h (i) 0
(31.2.11) = ηi + i ∼ , σ 2 ii
yi 1 εi εi 0 0 1
This is a regression model similar to model (18.1.1), but this time with a nonspherical
covariance matrix.
Answer. As shown in problem 206, the BLUE in this situation is the weighted average of the
observations with the weights proportional to the inverses of the variances. I.e., the first observation
has weight
1/hii (i) 1
(31.2.13) = = 1 − hii .
1/hii (i) + 1 1 + hii (i)
Since the sum of the weights must be 1, the weight of the second observation is hii .
806 31. RESIDUALS
Here is an alternative solution, using formula (26.0.2) for the BLUE, which reads here
hii
−1 hii
−1
−1
1−hii
0 1
1−hii
0 ŷ i (i)
ŷ i = 1 1 1 1 =
0 1 1 0 1 yi
1−h
ii
hii
0 ŷ i (i)
= hii 1 1 = (1 − hii )ŷ i (i) + hii y i .
0 1 yi
Now subtract this last formula from y i to get y i − ŷ i = (1 − hii )(y i − ŷ i (i)), which is (31.2.9).
31.3. Standardization
In this section we will show that the standardized predictive residual is what is
sometimes called the “studentized” residual. It is recommended not to use the term
“studentized residual” but say “standardized predictive residual” instead.
The standardization of the ordinary
√ residuals has two steps: every ε̂i is divided
by its “relative” standard deviation 1 − hii , and then by s, an estimate of σ, the
standard deviation of the true disturbances. In formulas,
ε̂i
(31.3.1) the ith standardized ordinary residual = √ .
s 1 − hii
Standardization of the ith predictive residual has the same two steps: first divide
the predictive residual (31.2.9) by the relative standard deviation, and then divide by
31.3. STANDARDIZATION 807
s(i). But a look at formula (31.2.9) shows that the ordinary and the predictive resid-
ual differ only by a nonrandom factor. Therefore the first step of the standardization
yields exactly the same result whether one starts with an ordinary or a predictive
residual. Standardized predictive residuals differ therefore from standardized ordi-
nary residuals only in the second step:
ε̂i
(31.3.2) the ith standardized predictive residual = √ .
s(i) 1 − hii
Note that equation (31.3.2) writes the standardized predictive residual as a function
of the ordinary residual, not the predictive residual. The standardized predictive
residual is sometimes called the “studentized” residual.
Problem 371. 3 points The ith predictive residual has the formula
1
(31.3.3) ε̂i (i) = ε̂i
1 − hii
You do not have to prove this formula, but you are asked to derive the standard
deviation of ε̂i (i), and to derive from it a formula for the standardized ith predictive
residual.
808 31. RESIDUALS
This similarity between these two formulas has lead to widespread confusion.
Even [BKW80] seem to have been unaware of the significance of “studentization”;
they do not work with the concept of predictive residuals at all.
The standardized predictive residuals have a t-distribution, because they are
a normally distributed variable divided by an independent χ2 over its degrees of
freedom. (But note that the joint distribution of all standardized predictive residuals
is not a multivariate t.) Therefore one can use the quantiles of the t-distribution to
judge, from the size of these residuals, whether one has an extreme observation or
not.
Problem 372. Following [DM93, p. 34], we will use (30.0.3) and the other
formulas regarding additional regressors to prove the following: If you add a dummy
variable which has the value 1 for the ith observation and the value 0 for all other
observations to your regression, then the coefficient estimate of this dummy is the ith
predictive residual, and the coefficient estimate of the other parameters after inclusion
of this dummy is equal to β̂(i). To fix notation (and without loss of generality),
assume the ith observation is the last observation, i.e., i = n, and put the dummy
variable first in the regression:
y(n) o X(n) α ε̂(i) α
(31.3.4) = + or y = en X +ε
yn 1 x>n β ε̂n β
31.3. STANDARDIZATION 809
o
• a. 2 points With the definition X 1 = en = , write M 1 = I−X 1 (X >
1 X 1)
−1
X
1
as a 2 × 2 partitioned matrix.
Answer.
I o o I o I o z(i) z(i)
(31.3.5) M1 = − o> 1 = ; =
o> 1 1 o> 0 o> 0 zi 0
in other words, the estimate of β is indeed β̂(i), and the first n − 1 elements of the residual are
indeed the residuals one gets in the regression without the ith observation. This is so ugly because
the singularity shows here in the zeros of the last row, usually it does not show so much. But this
way one also sees that it gives zero as the last residual, and this is what one needs to know!
810 31. RESIDUALS
To have a mathematical proof that the last row with zeros does not affect the estimate, evaluate
(30.0.3)
β̂ 2 = (X >
2 M 1X2)
−1 >
X2 M 1y
−1
I o X(n) I o y(n)
= X(n)> xn X(n)> xn
o> 0 x>
n o> 0 yn
• d. 2 points Use (30.0.3) with X 1 and X 2 interchanged to get a formula for α̂.
Answer. α̂ = (X >
1 M X1)
−1 X > M y =
1
1
ε̂
mnn n
= 1
ε̂ ,
1−hnn n
here M = I − X(X > X)−1 X > .
31.3. STANDARDIZATION 811
Regression Diagnostics
32.3.1. The “Leverage”. The ith diagonal element hii of the “hat matrix”
is called the “leverage” of the ith observation. The leverage satisfies the following
identity
(32.3.1) ŷ i = (1 − hii )ŷ i (i) + hii y i
hii is therefore is the weight which y i has in the least squares estimate ŷ i of ηi = x>
i β,
compared with all other observations, which contribute to ŷ i through ŷ i (i). The
larger this weight, the more strongly this one observation will influence the estimate
816 32. REGRESSION DIAGNOSTICS
Problem 374. 3 points Explain the meanings of all the terms in equation (32.3.1)
and use that equation to explain why hii is called the “leverage” of the ith observa-
tion. Is every observation with high leverage also “influential” (in the sense that its
removal would greatly change the regression estimates)?
Answer. ŷ i is the fitted value for the ith observation, i.e., it is the BLUE of ηi , of the expected
value of the ith observation. It is a weighted average of two quantities: the actual observation y i
(which has ηi as expected value), and ŷ i (i), which is the BLUE of ηi based on all the other
observations except the ith. The weight of the ith observation in this weighted average is called the
“leverage” of the ith observation. The sum of all leverages is always k, the number of parameters
in the regression. If the leverage of one individual point is much greater than k/n, then this point
has much more influence on its own fitted value than one should expect just based on the number
of observations,
Leverage is not the same as influence; if an observation has high leverage, but by accident
the observed value y i is very close to ŷ i (i), then removal of this observation will not change the
regression results much. Leverage is potential influence. Leverage does not depend on any of the
observations, one only needs the X matrix to compute it.
Those observations whose x-values are away from the other observations have
“leverage” and can therefore potentially influence the regression results more than the
32.3. INFLUENTIAL OBSERVATIONS AND OUTLIERS 817
others. hii serves as a measure of this distance. Note that hii only depends on the X-
matrix, not on y, i.e., points may have a high leverage but not be influential, because
the associated y i blends well into the fit established by the other data. However,
regardless of the observed value of y, observations with high leverage always affect
the covariance matrix of β̂.
det(X > X) − det(X(i)> X(i))
(32.3.2) hii = ,
det(X > X)
where X(i) is the X-matrix without the ith observation.
Problem 375. Prove equation (32.3.2).
Answer. Since X > (i)X(i) = X > X − xi x> >
i , use theorem A.7.3 with W = X X, α = −1,
and d = xi .
Problem 376. Prove the following facts about the diagonal elements of the so-
called “hat matrix” H = X(X > X)−1 X > , which has its name because Hy = ŷ,
i.e., it puts the hat on y.
• a. 1 point H is a projection matrix, i.e., it is symmetric and idempotent.
Answer. Symmetry follows from the laws for the transposes of products: H > = (ABC)> =
C B > A> = H where A = X, B = (X > X)−1 which is symmetric, and C = X > . Idempotency
>
X(X > X)−1 X > X(X > X)−1 X > = X(X > X)−1 X > .
818 32. REGRESSION DIAGNOSTICS
Answer. If ι, the vector of ones, is one of the columns of X (or a linear combination
of these columns), this means there is a vector a with ι = Xa. From this follows Hιι> =
X(X > X)−1 X > Xaι> = Xaι> = ιι> . One can use this to show that H − n 1 >
ιι is idempotent:
1 > 1 > 1 > 1 > 1 >1 > 1 > 1 > 1 >
(H − n ιι )(H − n ιι ) = HH − H n ιι − n ιι H + n ιι n ιι = H − n ιι − n ιι + n ιι =
1 >
H−n ιι .
• g. 1 point Show: If the regression has a constant term, then one can sharpen
inequality (32.3.3) to 1/n ≤ hii ≤ 1.
Answer. H − ιι> /n is a projection matrix, therefore nonnegative definite, therefore its diag-
onal elements hii − 1/n are nonnegative.
• h. 3 points Why is hii called the “leverage” of the ith observation? To get full
points, you must give a really good verbal explanation.
Answer. Use equation (31.2.12). Effect on any other linear combination of β̂ is less than the
effect on ŷ i . Distinguish from influence. Leverage depends only on X matrix, not on y.
hii is closely related to the test statistic testing whether the xi comes from the
same multivariate normal distribution as the other rows of the X-matrix. Belsley,
Kuh, and Welsch [BKW80, p. 17] say those observations i with hii > 2k/n, i.e.,
more than twice the average, should be considered as “leverage points” which might
deserve some attention.
820 32. REGRESSION DIAGNOSTICS
Problem 377. Show (32.4.1) by methods very similar to the proof of (31.2.9)
Answer. Here is this brute-force proof, I think from [BKW80]: Let y(i) be the y vector with
the ith observation deleted. As shown in Problem 365, X > (i)y(i) = X > y − xi y i . Therefore by
32.4. SENSITIVITY OF ESTIMATES TO OMISSION OF ONE OBSERVATION 821
(31.2.6)
To understand (32.4.1), note the following fact which is interesting in its own
right: β̂(i), which is defined as the OLS estimator if one drops the ith observation,
can also be obtained as the OLS estimator if one replaces the ith observation by the
prediction of the ith observation on the basis of all other observations, i.e., by ŷ i (i).
Writing y((i)) for the vector y whose ith observation has been replaced in this way,
one obtains
(32.4.2) β̂ = (X > X)−1 X > y; β̂(i) = (X > X)−1 X > y((i)).
(g > x)2
(32.4.4) max = x>Ω −1 x.
g g >Ω g
32.4. SENSITIVITY OF ESTIMATES TO OMISSION OF ONE OBSERVATION 823
If the denominator in the fraction on the lefthand side is zero, then g = o and
therefore the numerator is necessarily zero as well. In this case, the fraction itself
should be considered zero.
Proof: As in the derivation of the BLUE with nonsperical covariance matrix, pick
a nonsingular Q with Ω = QQ> , and define P = Q−1 . Then it follows P ΩP > = I.
Define y = P x and h = Q> g. Then h> y = g > x, h> h = g >Ω g, and y > y =
x>Ω −1 x. Therefore (32.4.4) follows from the Cauchy-Schwartz inequality (h> y)2 ≤
(h> h)(y > y).
Using Theorem 32.4.1 and equation (32.4.1) one obtains
hii
(32.4.6) ŷ i − ŷ i (i) = x> >
i β̂ − xi β̂(i) = ε̂i = hii ε̂i (i)
1 − hii
824 32. REGRESSION DIAGNOSTICS
If one divides (32.4.6) by the standard deviation of ŷ i , i.e., if one applies the con-
struction (32.4.3), one obtains
√ √
ŷ i − ŷ i (i) hii hii
(32.4.7) √ = ε̂i (i) = ε̂i
σ hii σ σ(1 − hii )
If ŷ i changes only little (compared with the standard deviation of ŷ i ) if the ith
observation is removed, then no other linear combination of the elements of β̂ will
be affected much by the omission of this observation either.
The righthand side of (32.4.7), with σ estimated by s(i), is called by [BKW80]
and many others DFFITS (which stands for DiFference in FIT, Standardized). If
one takes its square, divides it by k, and estimates σ 2 by s2 (which is more consistent
than using s2 (i), since one standardizes by the standard deviation of t> β̂ and not
by that of t> β̂(i)), one obtains Cook’s distance [Coo77]. (32.4.5) gives an equation
for Cook’s distance in terms of β̂ − β̂(i):
(32.4.8)
(β̂ − β̂(i))> X > X(β̂ − β̂(i)) hii hii
Cook’s distance = = 2 ε̂2i (i) = 2 ε̂2
ks2 ks ks (1 − hii )2 i
Problem 378. Can you think of a situation in which an observation has a small
residual but a large “influence” as measured by Cook’s distance?
32.4. SENSITIVITY OF ESTIMATES TO OMISSION OF ONE OBSERVATION 825
Answer. Assume “all observations are clustered near each other while the solitary odd ob-
servation lies a way out” as Kmenta wrote in [Kme86, p. 426]. If the observation happens to lie
on the regression line, then it can be discovered by its influence on the variance-covariance matrix
(32.3.2), i.e., in this case only the hii count.
Problem 379. The following is the example given in [Coo77]. In R, the com-
mand data(longley) makes the data frame longley available, which has the famous
Longley-data, a standard example for a highly multicollinear dataset. These data
are also available on the web at www.econ.utah.edu/ehrbar/data/longley.txt.
attach(longley) makes the individual variables available as R-objects.
• a. 3 points Look at the data in a scatterplot matrix and explain what you
see. Later we will see that one of the observations is in the regression much more
influential than the rest. Can you see from the scatterplot matrix which observation
that might be?
Answer. In linux, you first have to give the command x11() in order to make the graphics win-
dow available. In windows, this is not necessary. It is important to display the data in a reasonable
order, therefore instead of pairs(longley) you should do something like attach(longley) and then
pairs(cbind(Year, Population, Employed, Unemployed, Armed.Forces, GNP, GNP.deflator)). Pu
Year first, so that all variables are plotted against Year on the horizontal axis.
Population vs. year is a very smooth line.
Population vs GNP also quite smooth.
826 32. REGRESSION DIAGNOSTICS
You see the huge increase in the armed forced in 1951 due to the Korean War, which led to a
(temporary) drop in unemployment and a (not so temporary) jump in the GNP deflator.
Otherwise the unemployed show the stop-and-go scenario of the fifties.
unemployed is not correlated with anything.
One should expect a strong negative correlation between employed and unemployed, but this
is not the case.
• c. 3 points Make plots of the ordinary residuals and the standardized residuals
against time. How do they differ? In R, the commands are plot(Year, residuals(lon
type="h", ylab="Ordinary Residuals in Longley Regression"). In order to
get the next plot in a different graphics window, so that you can compare them,
do now either x11() in linux or windows() in windows, and then plot(Year,
rstandard(longley.fit), type="h", ylab="Standardized Residuals in Longle
Regression").
32.4. SENSITIVITY OF ESTIMATES TO OMISSION OF ONE OBSERVATION 827
Answer. You see that the standardized residuals at the edge of the dataset are bigger than
the ordinary residuals. The datapoints at the edge are better able to attract the regression plane
than those in the middle, therefore the ordinary residuals are “too small.” Standardization corrects
for this.
• e. 3 points Make a plot of the leverage, i.e., the hii -values, using plot(Year,
hatvalues(longley.fit), type="h", ylab="Leverage in Longley Regression")
and explain what leverage means.
• f. 3 points One observation is much more influential than the others; which
is it? First look at the plots for the residuals, then look also at the plot for leverage,
and try to guess which is the most influential observation. Then do it the right way.
Can you give reasons based on your prior knowledge about the time period involved
why an observation in that year might be influential?
Answer. The “right” way is to use Cook’s distance: plot(Year, cooks.distance(longley.fit),
type="h", ylab="Cook’s Distance in Longley Regression")
One sees that 1951 towers above all others. It does not have highest leverage, but it has
second-highest, and a bigger residual than the point with the highest leverage.
1951 has the largest distance of .61. The second largest is the last observation in the dataset,
1962, with a distance of .47, and the others have .24 or less. Cook says: removal of 1951 point will
move the least squares estimate to the edge of a 35% confidence region around β̂. This point is
probably so influential because 1951 was the first full year of the Korean war. One would not be
able to detect this point from the ordinary residuals, standardized or not! The predictive residuals
are a little better; their maximum is at 1951, but several other residuals are almost as large. 1951
is so influential because it has an extremely high hat-value, and one of the highest values for the
ordinary residuals!
ε̂2i
(32.4.9) SSE − SSE(i) =
1 − hii
Problem 380. Use (32.4.9) to derive the following formula for s2 (i):
1 ε̂2i
(32.4.10) s2 (i) = (n − k)s2 −
n−k−1 1 − hii
Answer. This merely involves re-writing SSE and SSE(i) in terms of s2 and s2 (i).
SSE(i) 1 ε̂2i
(32.4.11) s2 (i) = = SSE −
n−1−k n−k−1 1 − hii
830 32. REGRESSION DIAGNOSTICS
In the last line the Pfirst term is SSE. The second term is zero because H ε̂ = o.
Furthermore, hii = j h2ji because H is symmetric and idempotent, therefore the
sum of the last two items is −ε̂2i /(1 − hii ).
Note that every single relationship we have derived so far is a function of ε̂i and
hii .
Problem 381. 3 points What are the main concepts used in modern “Regression
Diagnostics”? Can it be characterized to be a careful look at the residuals, or does it
have elements which cannot be inferred from the residuals alone?
32.4. SENSITIVITY OF ESTIMATES TO OMISSION OF ONE OBSERVATION 831
• b. How are the concepts of leverage and influence affected by sample size?
• c. What steps would you take when alerted to the presence of an influential
observation?
Answer. Make sure you know whether the results you rely on are affected if that influential
observation is dropped. Try to find out why this observation is influential (e.g. in the Longley data
the observations in the year when the Korean War started are influential).
• e. Discuss situations in which one would want to deal with the “predictive”
residuals rather than the ordinary residuals, and situations in which one would want
residuals standardized versus situations in which it would be preferable to have the
unstandardized residuals.
Problem 383. 6 points Describe what you would do to ascertain that a regression you ran is correctly specified.
Answer. Check the economic theory behind the regression and the size and sign of the coefficients; plot the residuals against the predicted values, against time, and against every independent variable; run all tests: F-test, t-tests, R², DW, portmanteau test, forecasting, multicollinearity, influence statistics; overfit to see if other variables are significant; try to defeat the result by using alternative variables; divide the time period into subperiods in order to see if the parameters are constant over time; pre-test the specification assumptions.
CHAPTER 33
Regression Graphics
The “regression” referred to in the title of this chapter is not necessarily linear
regression. The population regression can be defined as follows: The random scalar
y and the random vector x have a joint distribution, and we want to know how
the conditional distribution of y|x = x depends on the value x. The distributions
themselves are not known, but we have datasets and we use graphical means to
estimate the distributions from these datasets.
Problem 384. Someone said on an email list about statistics: if you cannot see an effect in the data, then there is no use trying to estimate it. Right or wrong?
Answer. One argument one might give is the curse of dimensionality. Also higher moments of the distribution, kurtosis etc., cannot be seen very clearly with the naked eye.
[Coo98, Figure 2.8 on p. 29] shows a scatterplot matrix of the “horse mussel” data, originally from [Cam89]. This graph is also available at www.stat.umn.edu/RegGraph/graphics/Figure 2.8.gif. Horse mussels (Atrinia) were sampled from the Marlborough Sounds. The five variables are L = Shell length in mm, W = Shell width in mm, H = Shell height in mm, S = Shell mass in g, and M = Muscle mass in g. M is the part of the mussel that is edible.
Problem 386. 3 points In the mussel data set, M is the “response” (according
to [Coo98]). Is it justified to call this variable the “response” and the other variables
the explanatory variables, and if so, how would you argue for it?
Answer. This is one of the issues which is not sufficiently discussed in the literature. It would
be justified if the dimensions and weight of the shell were exogenous to the weight of the edible part
of the mussel. I.e., if the mussel first grows the shell, and then it fills this shell with muscle, and
the dimensions of the shell affect how big the muscle can grow, but the muscle itself does not have
an influence on the dimensions of the shell. If this is the case, then it makes sense to look at the
distribution of M conditionally on the other variables, i.e., ask the question: given certain weights
and dimensions of the shell, what is the nature of the mechanism by which the muscle grows inside
this shell. But if muscle and shell grow together, both affected by the same variables (temperature,
nutrition, daylight, etc.), then the conditional distribution is not informative. In this case, the joint
distribution is of interest.
In order to get this dataset into R, you simply say data(mussels), after having
said library(ecmet). Then you need the command pairs(mussels) to get
the scatterplot matrix. Also interesting is pairs(log(mussels)), especially since
the log transformation is appropriate if one explains volume and weight by length,
height, and width.
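For convenience, the whole sequence in one R session (assuming the ecmet package mentioned in the Preface is installed):

library(ecmet)          # provides the mussels data
data(mussels)
pairs(mussels)          # scatterplot matrix in levels
pairs(log(mussels))     # the same in logs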
The scatter plot of M versus H shows a clear curvature; but one should not jump
to the conclusion that the regression is not linear. Cook brings another example with
constructed data, in which the regression function is clearly linear, without error
term, and in which nevertheless the scatter plot of the response versus one of the
predictors shows a similar curvature as in the mussel data.
Answer. You need the commands data(reggra29) and then pairs(reggra29) to get the scatterplot matrix. Before you can access xgobi from R, you must give the command library(xgobi). Then xgobi(reggra29). The dependency is y = 3 + x_1 + x_2/2.
Problem 388. 2 points Why can the scatter plot of the dependent variable
against one of the independent variables be so misleading?
Answer. Because the included independent variable becomes a proxy for the excluded vari-
able. The effect of the excluded variable is mistaken to come from the included variable. Now if the
included and the excluded variable are independent of each other, then the omission of the excluded
variable increases the noise, but does not have a systematic effect. But if there is an empirical
relationship between the included and the excluded variable, then this translates into a spurious
relationship between included and dependent variables. The mathematics of this is discussed in
Problem 328.
33.3. Spinning
An obvious method to explore a more than two-dimensional structure graphically is to look at plots of y against various linear combinations of x. Many statistical software packages have the ability to do so, but one of the most powerful ones is XGobi. Documentation about xgobi, which is more detailed than the help(xgobi) in R/Splus, can be obtained by typing man xgobi while in unix. A nice brief documentation is [Rip96]. The official manual is [SCB91] and [BCS96].
XGobi can be used as a stand-alone program or it can be invoked from inside R or Splus. In R, you must give the command library(xgobi) in order to make the function xgobi accessible.
The search for “interesting” projections of the data into one-, two-, or three-dimensional spaces has been automated in projection pursuit regression programs. The basic reference is [FS81], but there is also the much older [FT74].
The most obvious graphical regression method consists in slicing or binning the
data, and taking the mean of the data in each bin. But if you have too many
explanatory variables, this local averaging becomes infeasible, because of the “curse
of dimensionality.” Consider a dataset with 1000 observations and 10 variables, all
between 0 and 1. In order to see whether the data are uniformly distributed or
whether they have some structure, you may consider splitting up the 10-dimensional
unit cube into smaller cubes and counting the number of datapoints in each of these
subcubes. The problem here is: if one makes those subcubes large enough that they
contain more than 0 or 1 observations, then their coordinate lengths are not much
smaller than the unit hypercube itself. Even with a side length of 1/2, which would
be the largest reasonable side length, one needs 1024 subcubes to fill the hypercube,
therefore the average number of data points is a little less than 1. By projecting
instead of taking subspaces, projection pursuit regression does not have this problem
of data scarcity.
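The subcube count just described is easy to simulate; the following sketch (not in the text) bins 1000 uniform points in the 10-dimensional unit cube into the 1024 subcubes of side length 1/2:

set.seed(1)
x <- matrix(runif(1000 * 10), ncol = 10)
cell <- apply(x > 0.5, 1, function(b) sum(b * 2^(0:9)))  # subcube index, 0..1023
counts <- tabulate(cell + 1, nbins = 1024)
mean(counts)        # about 0.98 points per subcube, a little less than 1
mean(counts == 0)   # a large share of the subcubes is empty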
Projection pursuit regression searches for an interesting and informative projec-
tion of the data by maximizing a criterion function. A logical candidate would for
instance be the variance ratio as defined in (8.6.7), but there are many others.
About grand tours, projection pursuit guided tours, and manual tours see [CBCH9] and [CB97].
Problem 390. If you run XGobi from the menu in Debian GNU/Linux, it uses
prim7, which is a 7-dimensional particle physics data set used as an example in
[FT74].
The following is from the help page for this dataset: There are 500 observations
taken from a high energy particle physics scattering experiment which yields four
particles. The reaction can be described completely by 7 independent measurements.
The important features of the data are short-lived intermediate reaction stages which appear as protuberant “arms” in the point cloud.
The projection pursuit guided tour is the tool to use to understand this data set. Using all 7 variables, turn on projection pursuit and optimize with the Holes index until a view is found that has a triangle and two arms crossing each other off one edge (this is very clear once you see it, but the Holes index has a tendency to get stuck in another local maximum which doesn’t have much structure). Brush the arms with separate colours and glyphs. Change to the Central Mass index and optimize. As new arms are revealed, brush them and continue. When you have either run out of colours or time, turn off projection pursuit and watch the data touring. Then it becomes clear that the underlying structure is a triangle with 5 or 6 arms (some appear to be 1-dimensional, some 2-dimensional) extending from the vertices.
with β_1 and β_2 linearly independent, then the structural dimension is 2, since one needs 2 different linear combinations of x to characterize the distribution of y. If

(33.4.3)    y|x = \lVert x \rVert^2 + ε

then this is a very simple relationship between x and y, but the structural dimension is k, the number of dimensions of x, since the relationship is not intrinsically linear.
Problem 391. [FS81, p. 818] Show that the regression function consisting of the interaction term between x_1 and x_2 only, φ(x) = x_1 x_2, has structural dimension 2, i.e., it can be written in the form φ(x) = \sum_{m=1}^{2} s_m(α_m^⊤ x) where the s_m are smooth functions of one variable.

Answer.

(33.4.4)    α_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad α_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad s_1(z) = \frac{z^2}{2}, \quad s_2(z) = -\frac{z^2}{2}

Indeed, s_1(α_1^⊤ x) + s_2(α_2^⊤ x) = (x_1+x_2)^2/4 - (x_1-x_2)^2/4 = x_1 x_2.
Problem 392. [Coo98, p. 62] In the rubber data, mnr is the dependent variable y, and temp and dp form the two explanatory variables x_1 and x_2. Look at the data using XGobi or some other spin program. What is the structural dimension of the data set?
The rubber data are from [Woo72], and they are also discussed in [Ric, p. 506].
mnr is modulus of natural rubber, temp the temperature in degrees Celsius, and dp
Dicumyl Peroxide in percent.
Answer. The data are a plane that has been distorted by twisting and stretching. Since one
needs a different view to get the best fit of the points in the upper-right corner than for the points
in the lower-left corner, the structural dimension must be 2.
If one looks at the scatter plots of y against all linear combinations of components
of x, and none of them show a relationship (either linear or nonlinear), then the
structural dimension is zero.
Here are the instructions for doing graphical regression on the mussel data. Select the load menu (take the cursor down a little until it is black, then go back up), then press the Check Data Dir box, then double click ARCG so that it jumps into the big box. Then Update/Open File will give you a long list of selections, where you will find mussels.lsp. Double click on this so that it jumps into the big box, and then press on the Update/Open File box. Now for the Box-Cox transformation I first have to go to scatterplot matrices, then click on transformations, then on find normalizing transformations. If you just select the 4 predictors and then press the OK button, there will be an error message; apparently the starting values were not good enough. Try again, using marginal Box-Cox Starting Values. This will succeed, and the LR test that all transformations are logs has a p-value of .14. Therefore choose the log transform for all the predictors. (If we include all 5 variables, the LR test for all transformations to be log transformations has a p-value of 0.000.) Therefore transform the 4 predictor variables only to logs. There you see the very linear relationship between the predictors, and you see that all the scatter plots with the response are very similar. This is a sign that the structural dimension is 1 according to [CW99, pp. 435/6]. If that is the case, then a plot of the actual against the fitted values is a sufficient summary plot. For this, run the Fit Linear LS menu option, and then plot the dependent variable against the fitted value. Now the next question might be: what transformation will linearize this? A log curve seems to fit well.
CHAPTER 34

Asymptotic Properties of OLS

A much more detailed treatment of the contents of this chapter can be found in [DM93, Chapters 4 and 5].
Here we are concerned with the consistency of the OLS estimator for large sam-
ples. In other words, we assume that our regression model can be extended to
encompass an arbitrary number of observations. First we assume that the regressors
are nonstochastic, and we will make the following assumption:
(34.0.5)    Q = \lim_{n\to\infty} \frac{1}{n} X^⊤X \text{ exists and is nonsingular.}
Two examples where this is not the case. Look at the model y_t = α + βt + ε_t. Here

X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & n \end{bmatrix}. \quad\text{Therefore}\quad X^⊤X = \begin{bmatrix} 1+1+\cdots+1 & 1+2+3+\cdots+n \\ 1+2+3+\cdots+n & 1+4+9+\cdots+n^2 \end{bmatrix} = \begin{bmatrix} n & n(n+1)/2 \\ n(n+1)/2 & n(n+1)(2n+1)/6 \end{bmatrix},

and \frac{1}{n} X^⊤X \to \begin{bmatrix} 1 & \infty \\ \infty & \infty \end{bmatrix}. Here the assumption (34.0.5) does not hold, but one can still prove consistency and asymptotic normality; the estimators converge even faster than in the usual case.
The other example is the model y_t = α + βλ^t + ε_t with a known λ with -1 < λ < 1. Here

X^⊤X = \begin{bmatrix} 1+1+\cdots+1 & λ+λ^2+\cdots+λ^n \\ λ+λ^2+\cdots+λ^n & λ^2+λ^4+\cdots+λ^{2n} \end{bmatrix} = \begin{bmatrix} n & (λ-λ^{n+1})/(1-λ) \\ (λ-λ^{n+1})/(1-λ) & (λ^2-λ^{2n+2})/(1-λ^2) \end{bmatrix}.
Therefore \frac{1}{n} X^⊤X \to \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, which is singular. In this case, a consistent estimate of λ does not exist: future observations depend on λ so little that even with infinitely many observations there is not enough information to get the precise value of λ.
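A small simulation (a sketch, not from the text) makes this lack of information visible: in the model y_t = α + βλ^t + ε_t the sampling variance of β̂ barely shrinks as n grows, in contrast to the usual 1/n rate:

set.seed(3)
lambda <- 0.5
for (n in c(50, 2000)) {
  X <- cbind(1, lambda^(1:n))
  betahat <- replicate(500, {
    y <- X %*% c(1, 2) + rnorm(n)       # alpha = 1, beta = 2
    solve(t(X) %*% X, t(X) %*% y)[2]
  })
  cat("n =", n, " sd(beta-hat) =", sd(betahat), "\n")
}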
We will show that under assumption (34.0.5), β̂ and s² are consistent. However this assumption is really too strong for consistency. A weaker set of assumptions is the Grenander conditions, see [Gre97, p. 275]. To write down the Grenander conditions, remember that presently X depends on n (in that we only look at the first n elements of y and first n rows of X), therefore the column vectors x_j also depend on n (although we are not indicating this here). Therefore x_j^⊤x_j depends on n as well, and we will make this dependency explicit by writing x_j^⊤x_j = d_{nj}^2. Then the first Grenander condition is \lim_{n\to\infty} d_{nj}^2 = +\infty for all j. Second: for all j, \lim_{n\to\infty} \max_{i=1,\dots,n} x_{ij}^2/d_{nj}^2 = 0 (there is a typo in Greene; he leaves the max out). Third: the sample correlation matrix of the columns of X, excluding the constant term, converges to a nonsingular matrix.
Consistency means that the probability limit of the estimates converges towards the true value. For β̂ this can be written as \operatorname{plim}_{n\to\infty} β̂_n = β. This means by definition that for all ε > 0, \lim_{n\to\infty} \Pr[|β̂_n - β| \le ε] = 1.
The probability limit is one of several concepts of limits used in probability
theory. We will need the following properties of the plim here:
(1) For nonrandom magnitudes, the probability limit is equal to the ordinary
limit.
(2) It satisfies the Slutsky theorem, that for a continuous function g,
(34.0.6) plim g(z) = g(plim(z)).
(3) If the MSE-matrix of an estimator converges towards the null matrix, then
the estimator is consistent.
(4) Khinchine’s theorem: the sample mean of an i.i.d. distribution is a consistent estimate of the population mean, even if the distribution does not have a population variance.
vectors by dividing them by \sqrt{n}, then they do not get longer but converge towards a finite length. And the result (34.1.1), \operatorname{plim} \frac{1}{n} x^⊤ε = 0, means now that with this normalization, ε/\sqrt{n} becomes more and more orthogonal to x/\sqrt{n}. I.e., if n is large enough, asymptotically, not only ε̂ but also the true ε is orthogonal to x, and this means that asymptotically β̂ converges towards the true β.
For the proof of consistency of s² we need, among others, that \operatorname{plim} \frac{ε^⊤ε}{n} = σ², which is a consequence of Khinchine’s theorem. Since ε̂^⊤ε̂ = ε^⊤Mε it follows

Theorem, is asymptotically normal: \frac{1}{\sqrt{n}} X^⊤ε_n \to N(o, σ²Q). (Here the convergence is convergence in distribution.)
We can write \sqrt{n}(β̂_n - β) = \big(\frac{1}{n} X^⊤X\big)^{-1} \big(\frac{1}{\sqrt{n}} X^⊤ε_n\big). Therefore its limiting covariance matrix is Q^{-1} σ² Q Q^{-1} = σ² Q^{-1}, therefore \sqrt{n}(β̂_n - β) \to N(o, σ²Q^{-1}) in distribution. One can also say: the asymptotic distribution of β̂ is N(β, σ²(X^⊤X)^{-1}).
From this follows \sqrt{n}(Rβ̂_n - Rβ) \to N(o, σ² R Q^{-1} R^⊤), and therefore

(34.2.1)    n(Rβ̂_n - Rβ)^⊤ \big(R Q^{-1} R^⊤\big)^{-1} (Rβ̂_n - Rβ) \to σ² χ²_i.

Divide by s² and replace in the limiting case Q by X^⊤X/n and s² by σ² to get

(34.2.2)    \frac{(Rβ̂_n - Rβ)^⊤ \big(R (X^⊤X)^{-1} R^⊤\big)^{-1} (Rβ̂_n - Rβ)}{s²} \to χ²_i
in distribution. All this is not a proof; the point is that in the denominator, the
distribution is divided by the increasingly bigger number n − k, while in the numer-
ator, it is divided by the constant i; therefore asymptotically the denominator can
be considered 1.
The central limit theorems only say that for n → ∞ these converge towards the
χ2 , which is asymptotically equal to the F distribution. It is easily possible that
before one gets to the limit, the F -distribution is better.
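One can make this concrete with critical values (a sketch, not from the text): the χ² critical value divided by the numerator degrees of freedom is the limit of the F critical value as the denominator degrees of freedom go to infinity:

qf(0.95, df1 = 3, df2 = 20)       # small-sample F critical value, about 3.10
qf(0.95, df1 = 3, df2 = 10000)    # nearly the limit
qchisq(0.95, df = 3) / 3          # the asymptotic value, about 2.61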
Now these results also go through if one has stochastic regressors. [Gre97, 6.7.7]
shows that the above condition (34.0.5) with the lim replaced by plim holds if xi
and ε i are an i.i.d. sequence of random variables.
Problem 394. 2 points In the regression model with random regressors y = Xβ + ε, you only know that \operatorname{plim} \frac{1}{n} X^⊤X = Q is a nonsingular matrix, and \operatorname{plim} \frac{1}{n} X^⊤ε = o. Using these two conditions, show that the OLS estimate is consistent.

Answer. β̂ = (X^⊤X)^{-1} X^⊤y = β + (X^⊤X)^{-1} X^⊤ε due to (24.0.7), and

\operatorname{plim} (X^⊤X)^{-1} X^⊤ε = \operatorname{plim} \Big(\frac{X^⊤X}{n}\Big)^{-1} \operatorname{plim} \frac{X^⊤ε}{n} = Q^{-1} o = o.
CHAPTER 35

Least Squares as the Normal Maximum Likelihood Estimate
Now assume ε is multivariate normal. We will show that in this case the OLS estimator β̂ is at the same time the Maximum Likelihood Estimator. For this we need to write down the density function of y. First look at one y_t, which is y_t \sim N(x_t^⊤β, σ²), where

X = \begin{bmatrix} x_1^⊤ \\ \vdots \\ x_n^⊤ \end{bmatrix},

i.e., x_t is the t-th row of X. It is written as a column vector, since we follow the “column vector convention.” The (marginal)
If we replace β in the log likelihood function (35.0.5) by β̂, we get what is called the log likelihood function with β “concentrated out”:

(35.0.6)    \log f_y(y; β = β̂, σ²) = -\frac{n}{2}\log 2π - \frac{n}{2}\log σ² - \frac{1}{2σ²}(y - Xβ̂)^⊤(y - Xβ̂).

One gets the maximum likelihood estimate of σ² by maximizing this “concentrated” log likelihood function. Taking the derivative with respect to σ² (consider σ² the name of a variable, not the square of another variable), one gets

(35.0.7)    \frac{∂}{∂σ²} \log f_y(y; β̂) = -\frac{n}{2}\frac{1}{σ²} + \frac{1}{2σ⁴}(y - Xβ̂)^⊤(y - Xβ̂)

Setting this to zero gives

(35.0.8)    \tilde{σ}² = \frac{(y - Xβ̂)^⊤(y - Xβ̂)}{n} = \frac{ε̂^⊤ε̂}{n}.

This is a scalar multiple of the unbiased estimate s² = ε̂^⊤ε̂/(n - k) which we had earlier.
Let’s look at the distribution of s2 (from which that of its scalar multiples follows
easily). It is a quadratic form in a normal variable. Such quadratic forms very often
have χ2 distributions.
Now recall equation 10.4.9 characterizing all the quadratic forms of multivariate normal variables that are χ²’s. Here it is again: Assume y is a multivariate normal vector random variable with mean vector µ and covariance matrix σ²Ψ, and Ω is a symmetric nonnegative definite matrix. Then (y - µ)^⊤Ω(y - µ) \sim σ²χ²_k iff

(35.0.9)    ΨΩΨΩΨ = ΨΩΨ,
Answer. We showed in question 300 that β̂ and ε̂ are uncorrelated, therefore in the normal
case independent, therefore β̂ is also independent of any function of ε̂, such as σ̂ 2 .
• a. In the same plot, plot the density function of the Theil-Schweitzer estimate. Can one see from the comparison of these density functions why the Theil-Schweitzer estimator has a better MSE?

Answer. Start with the Theil-Schweitzer plot, because it is higher.

x <- seq(from = 0, to = 6, by = 0.01)
Density <- (19/2)*dchisq((19/2)*x, df=17)
plot(x, Density, type="l", lty=2)
lines(x, (17/2)*dchisq((17/2)*x, df=17))
title(main = "Unbiased versus Theil-Schweitzer Variance Estimate, 17 d.f.")
Now let us derive the maximum likelihood estimator in the case of nonspherical but positive definite covariance matrix. I.e., the model is y = Xβ + ε, ε \sim N(o, σ²Ψ). The density function is

(35.0.11)    f_y(y) = (2πσ²)^{-n/2} |\det Ψ|^{-1/2} \exp\Big(-\frac{1}{2σ²}(y - Xβ)^⊤ Ψ^{-1} (y - Xβ)\Big).
Problem 397. Derive (35.0.11) as follows: Take a matrix P with the property
that P ε has covariance matrix σ 2 I. Write down the joint density function of P ε .
Since y is a linear transformation of ε , one can apply the rule for the density function
of a transformed random variable.
Answer. Write Ψ = QQ^⊤ with Q nonsingular and define P = Q^{-1} and v = Pε. Then V[v] = σ²PQQ^⊤P^⊤ = σ²I, therefore

(35.0.12)    f_v(v) = (2πσ²)^{-n/2} \exp\Big(-\frac{1}{2σ²} v^⊤v\Big).

For the transformation rule, write v, whose density function you know, as a function of y, whose density function you want to know. v = P(y - Xβ); therefore the Jacobian matrix is ∂v/∂y^⊤ = ∂(Py - PXβ)/∂y^⊤ = P, or one can see it also element by element

(35.0.13)    \begin{bmatrix} \frac{∂v_1}{∂y_1} & \cdots & \frac{∂v_1}{∂y_n} \\ \vdots & \ddots & \vdots \\ \frac{∂v_n}{∂y_1} & \cdots & \frac{∂v_n}{∂y_n} \end{bmatrix} = P,

therefore one has to do two things: first, substitute P(y - Xβ) for v in formula (35.0.12), and secondly multiply by the absolute value of the determinant of the Jacobian. Here is how to express the determinant of the Jacobian in terms of Ψ: From Ψ^{-1} = (QQ^⊤)^{-1} = (Q^⊤)^{-1}Q^{-1} = (Q^{-1})^⊤Q^{-1} = P^⊤P follows (\det P)² = (\det Ψ)^{-1}, hence |\det P| = (\det Ψ)^{-1/2}.
This objective function has to be maximized with respect to β and the parameters
entering Ψ. If Ψ is known, then this is clearly maximized by the β̂ minimizing
(26.0.9), therefore the GLS estimator is also the maximum likelihood estimator.
If Ψ depends on unknown parameters, it is interesting to compare the maximum likelihood estimator with the nonlinear least squares estimator. The objective function minimized by nonlinear least squares is (y - Xβ)^⊤Ψ^{-1}(y - Xβ), which is the sum of squares of the innovation parts of the residuals. These two objective functions therefore differ by the factor (\det Ψ)^{1/n}, which only matters if there are unknown parameters in Ψ. Asymptotically, the objective functions are identical.
Using the factorization theorem for sufficient statistics, one also sees easily that
σ̂ 2 and β̂ together form sufficient statistics for σ 2 and β. For this use the identity
Therefore the observation y enters the likelihood function only through the two
statistics β̂ and s2 . The factorization of the likelihood function is therefore the trivial
factorization in which that part which does not depend on the unknown parameters
but only on the data is unity.
Problem 398. 12 points The log likelihood function in the linear model is given by (35.0.5). Show that the inverse of the information matrix is

(35.0.19)    \begin{bmatrix} σ²(X^⊤X)^{-1} & o \\ o^⊤ & 2σ⁴/n \end{bmatrix}

The information matrix can be obtained in two different ways. Its typical element has the following two forms:

(35.0.20)    \operatorname{E}\Big[\frac{∂\ln\ell}{∂θ_i}\,\frac{∂\ln\ell}{∂θ_k}\Big] = -\operatorname{E}\Big[\frac{∂²\ln\ell}{∂θ_i\,∂θ_k}\Big],

or written as matrix derivatives

(35.0.21)    \operatorname{E}\Big[\frac{∂\ln\ell}{∂θ}\,\frac{∂\ln\ell}{∂θ^⊤}\Big] = -\operatorname{E}\Big[\frac{∂²\ln\ell}{∂θ\,∂θ^⊤}\Big].

In our case θ = \begin{bmatrix} β \\ σ² \end{bmatrix}. The expectation is taken under the assumption that the parameter values are the true values. Compute it both ways.
The first derivatives were already computed for the maximum likelihood estimators:

(35.0.23)    \frac{∂}{∂β^⊤} \ln\ell = -\frac{1}{2σ²}(-2y^⊤X + 2β^⊤X^⊤X) = \frac{1}{σ²}(y - Xβ)^⊤X = \frac{1}{σ²}ε^⊤X

(35.0.24)    \frac{∂}{∂σ²} \ln\ell = -\frac{n}{2σ²} + \frac{1}{2σ⁴}(y - Xβ)^⊤(y - Xβ) = -\frac{n}{2σ²} + \frac{1}{2σ⁴}ε^⊤ε
By the way, one sees that each of these has expected value zero, which is a fact that is needed to
prove consistency of the maximum likelihood estimator.
The formula with only one partial derivative will be given first, although it is more tedious: multiplying the vector of first derivatives by its transpose and taking expected values gives a symmetric 2 × 2 partitioned matrix with the diagonal elements

(35.0.25)    \operatorname{E}\Big[\frac{1}{σ⁴} X^⊤εε^⊤X\Big] = \frac{1}{σ²} X^⊤X

and

(35.0.26)    \operatorname{E}\Big[\Big(-\frac{n}{2σ²} + \frac{1}{2σ⁴}ε^⊤ε\Big)²\Big] = \operatorname{var}\Big[-\frac{n}{2σ²} + \frac{1}{2σ⁴}ε^⊤ε\Big] = \operatorname{var}\Big[\frac{1}{2σ⁴}ε^⊤ε\Big] = \frac{1}{4σ⁸}\,2nσ⁴ = \frac{n}{2σ⁴}
One of the off-diagonal elements is \operatorname{E}\big[\big(-\frac{n}{2σ⁴} + \frac{1}{2σ⁶}ε^⊤ε\big)ε^⊤X\big]. Its expected value is zero: \operatorname{E}[ε] = o, and also \operatorname{E}[ε\,ε^⊤ε] = o, since its ith component is \operatorname{E}[ε_i \sum_j ε_j²] = \sum_j \operatorname{E}[ε_i ε_j²]. If i ≠ j, then ε_i is independent of ε_j², therefore \operatorname{E}[ε_i ε_j²] = 0 · σ² = 0. If i = j, we get \operatorname{E}[ε_i³] = 0 since ε_i has a symmetric distribution.
Now assume that β and σ² are the true values, take expected values, and reverse the sign. This gives the information matrix

(35.0.31)    \begin{bmatrix} σ^{-2} X^⊤X & o \\ o^⊤ & n/(2σ⁴) \end{bmatrix}

For the lower right-hand corner we need that \operatorname{E}[(y - Xβ)^⊤(y - Xβ)] = \operatorname{E}[ε^⊤ε] = nσ².
Taking inverses gives (35.0.19), which is a lower bound for the covariance matrix; we see that
s2 with var[s2 ] = 2σ 4 /(n − k) does not attain the bound. However one can show with other means
that it is nevertheless efficient.
CHAPTER 36

Bayesian Estimation in the Linear Model
The model is y = Xβ + ε with ε ∼ N (o, σ 2 I). Both y and β are random. The
distribution of β, called the “prior information,” is β ∼ N (ν, τ 2 A−1 ). (Bayesians
work with the precision matrix, which is the inverse of the covariance matrix). Fur-
thermore β and ε are assumed independent. Define κ2 = σ 2 /τ 2 . To simplify matters,
we assume that κ2 is known.
Whether or not the probability is subjective, this specification implies that y
and β are jointly Normal and
We can use theorem ?? to compute the best linear predictor \hat{\hat{β}}(y) of β on the basis of an observation of y. Due to Normality, \hat{\hat{β}} is at the same time the conditional mean or “posterior mean” \hat{\hat{β}} = \operatorname{E}[β|y], and the MSE-matrix is at the same time the variance of the posterior distribution of β given y: \operatorname{MSE}[\hat{\hat{β}}; β] = \operatorname{V}[β|y]. A proof is given as answer to Question ??. Since one knows mean and variance of the posterior distribution, and since the posterior distribution is normal, the posterior distribution of β given y is known. This distribution is what the Bayesians are after. The posterior distribution combines all the information, prior information and sample information, about β.
According to (??), this posterior mean can be written as
(36.0.33)    \hat{\hat{β}} = ν + B^*(y - Xν)
where B ∗ is the solution of the “normal equation” (??) which reads here
The obvious solution of (36.0.34) is B^* = A^{-1}X^⊤(XA^{-1}X^⊤ + κ²I)^{-1}, and according to (??), the MSE-matrix of the predictor is
These formulas are correct, but the Bayesians use mathematically equivalent formulas
which have a simpler and more intuitive form. The solution of (36.0.34) can also be
written as
where β̂ = (X^⊤X)^{-1}X^⊤y is the OLS estimate. Bayesians are interested in \hat{\hat{β}} because this is the posterior mean. The MSE-matrix, which is the posterior covariance matrix, can also be written as
(36.0.38)    \operatorname{MSE}[\hat{\hat{β}}; β] = σ²(X^⊤X + κ²A)^{-1}

(36.0.39)    (X^⊤X + κ²A)^{-1} X^⊤(XA^{-1}X^⊤ + κ²I) = (X^⊤X + κ²A)^{-1}(X^⊤XA^{-1}X^⊤ + κ²X^⊤) = (X^⊤X + κ²A)^{-1}(X^⊤X + κ²A)A^{-1}X^⊤ = A^{-1}X^⊤.
Now the solution formula:
For the formula of the MSE matrix one has to check that the earlier expression for the MSE-matrix times the inverse of (36.0.38) is the identity matrix, or that

(36.0.43)    \big(A^{-1} - A^{-1}X^⊤(XA^{-1}X^⊤ + κ²I)^{-1}XA^{-1}\big)\big(X^⊤X + κ²A\big) = κ²I

(36.0.44)    A^{-1}X^⊤X + κ²I - A^{-1}X^⊤(XA^{-1}X^⊤ + κ²I)^{-1}XA^{-1}X^⊤X - κ²A^{-1}X^⊤(XA^{-1}X^⊤ + κ²I)^{-1}X = A^{-1}X^⊤X + κ²I - A^{-1}X^⊤(XA^{-1}X^⊤ + κ²I)^{-1}(XA^{-1}X^⊤ + κ²I)X = κ²I
The formula (36.0.37) can be given two interpretations, neither of which is necessarily Bayesian. First interpretation: It is a matrix weighted average of the OLS estimate and ν, with the weights being the respective precision matrices. If ν = o, then the matrix weighted average reduces to \hat{\hat{β}} = (X^⊤X + κ²A)^{-1}X^⊤y, which has been called a “shrinkage estimator” (Ridge regression), since the “denominator” is bigger: instead of “dividing by” X^⊤X (strictly speaking, multiplying by (X^⊤X)^{-1}), one “divides” by X^⊤X + κ²A. If ν ≠ o then the OLS estimate β̂ is “shrunk” not in direction of the origin but in direction of ν.
Second interpretation: It is as if, in addition to the data y = Xβ + ε, also an independent observation ν = β + δ with δ \sim N(o, τ²A^{-1}) was available, i.e., as if the model was

(36.0.45)    \begin{bmatrix} y \\ ν \end{bmatrix} = \begin{bmatrix} X \\ I \end{bmatrix} β + \begin{bmatrix} ε \\ δ \end{bmatrix} \quad\text{with}\quad \begin{bmatrix} ε \\ δ \end{bmatrix} \sim \Big(\begin{bmatrix} o \\ o \end{bmatrix}, \begin{bmatrix} σ²I & O \\ O & τ²A^{-1} \end{bmatrix}\Big).

The Least Squares objective function minimized by the GLS estimator β = \hat{\hat{β}} in (36.0.45) is:

(36.0.46)    (y - Xβ)^⊤(y - Xβ) + κ²(β - ν)^⊤A(β - ν).

In other words, \hat{\hat{β}} is chosen such that at the same time X\hat{\hat{β}} is close to y and \hat{\hat{β}} is close to ν.
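A small sketch with made-up data shows the equivalence of the direct posterior-mean formula and GLS on the augmented system (36.0.45); here A = I, so the augmented system is an ordinary weighted regression:

set.seed(7)
n <- 50; k <- 3
X <- matrix(rnorm(n * k), n, k)
y <- X %*% c(1, -1, 0.5) + rnorm(n)      # sigma^2 = 1
nu <- rep(0, k); kappa2 <- 2             # prior mean nu, kappa^2 = sigma^2/tau^2
b.direct <- solve(t(X) %*% X + kappa2 * diag(k),
                  t(X) %*% y + kappa2 * nu)
Xaug <- rbind(X, diag(k)); yaug <- c(y, nu)
w <- c(rep(1, n), rep(kappa2, k))        # weights = inverse relative variances
b.gls <- coef(lm(yaug ~ Xaug - 1, weights = w))
all.equal(drop(b.direct), unname(b.gls)) # TRUE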
Problem 400. Show that the objective function (36.0.46) is, up to a constant
factor, the natural logarithm of the product of the prior density and the likelihood
function. (Assume σ² and τ² known). Note: if z \sim N(θ, σ²Σ) with nonsingular covariance matrix σ²Σ, then its density function is

(36.0.47)    f_z(z) = (2πσ²)^{-n/2} |\det Σ|^{-1/2} \exp\Big(-\frac{1}{2σ²}(z - θ)^⊤Σ^{-1}(z - θ)\Big).
Answer. Prior density (2πτ²)^{-k/2} |\det A|^{1/2} \exp\big(-\frac{(β-ν)^⊤A(β-ν)}{2τ²}\big); likelihood function (2πσ²)^{-n/2} \exp\big(-\frac{(y-Xβ)^⊤(y-Xβ)}{2σ²}\big); the posterior density is then proportional to the product of the two:

(36.0.48)    \text{posterior} \propto \exp\Big(-\frac{(y - Xβ)^⊤(y - Xβ) + κ²(β - ν)^⊤A(β - ν)}{2σ²}\Big).
Problem 401. As in Problem 274, we will work with the Cobb-Douglas production function, which relates output Q to the inputs of labor L and capital K as follows:

(36.0.49)    Q_t = µ\,K_t^β\,L_t^γ\,\exp(ε_t).
Setting y_t = \log Q_t, x_t = \log K_t, z_t = \log L_t, and α = \log µ, one obtains the linear regression

(36.0.50)    y_t = α + βx_t + γz_t + ε_t
Assume that the prior information about β, γ, and the returns to scale β + γ is such that

(36.0.51)    E[β] = E[γ] = 0.5,  E[β + γ] = 1.0

(36.0.53)    \Pr[0.2 < β < 0.8] = \Pr[0.2 < γ < 0.8] = 0.9

About α assume that the prior information is such that

(36.0.54)    E[α] = 5.0,  \Pr[-10 < α < 20] = 0.9
and that our prior knowledge about α is not affected by (is independent of) our prior knowledge concerning β and γ. Assume that σ² is known and that it has the value σ² = 0.09. Furthermore, assume that our prior views about α, β, and γ can be adequately represented by a normal distribution. Compute from this the prior distribution of the vector β = \begin{bmatrix} α & β & γ \end{bmatrix}^⊤.
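The computation the problem asks for starts from translating the interval statements into standard deviations; here is a sketch of the first steps in R (using qnorm(0.95) ≈ 1.645 for the 90% central interval):

sd.beta  <- (0.8 - 0.5) / qnorm(0.95)   # about 0.182, so var[beta] about 0.0333
sd.alpha <- (20 - 5)    / qnorm(0.95)   # about 9.12 from Pr[-10 < alpha < 20] = 0.9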
Here is my personal opinion what to think of this. I always get uneasy when I see graphs like [JHG+88, Figure 7.2 on p. 283]. The prior information was specified on pp. 277/8: the marginal propensity to consume is with high probability between 0.75 and 0.95, and there is a 50-50 chance that it lies above or below 0.85. The least squares estimate of the MPC is 0.9, with a reasonable confidence interval. There is no multicollinearity involved, since there is only one explanatory variable. I see no reason whatsoever to take issue with the least squares regression result; it matches my prior information perfectly. However the textbook tells me that as a Bayesian I have to modify what the data tell me and take the MPC to be 0.88. This is only because of the assumption that the prior information is normal.
I think the Bayesian procedure is inappropriate here because the situation is so
simple. Bayesian procedures have the advantage that they are coherent, and therefore
can serve as a guide in complex estimation situations, when the researcher is tempted
A Bayesian considers the posterior density the full representation of the information provided by sample and prior information. Frequentists have discovered that one can interpret the parameters of this density as estimators of the key unknown parameters, and that these estimators have good sampling properties. Therefore they have tried to re-derive the Bayesian formulas from frequentist principles.
CHAPTER 37

OLS with Random Constraint

If β satisfies the constraint Rβ = u only approximately or with uncertainty, it has therefore become customary to specify
Both interpretations are possible here: either u is a constant, which means nec-
essarily that β is random, or β is as usual a constant and u is random, coming from
whoever happened to do the research (this is why it is called “mixed estimation”).
It is the correct procedure in this situation to do GLS on the model
(37.0.56)    \begin{bmatrix} y \\ u \end{bmatrix} = \begin{bmatrix} X \\ R \end{bmatrix} β + \begin{bmatrix} ε \\ -η \end{bmatrix} \quad\text{with}\quad \begin{bmatrix} ε \\ -η \end{bmatrix} \sim \Big(\begin{bmatrix} o \\ o \end{bmatrix}, σ²\begin{bmatrix} I & O \\ O & \frac{1}{κ²}I \end{bmatrix}\Big).
Therefore
Whatever the true values of β and σ 2 , there is always a κ2 > 0 for which (37.0.59)
or (37.0.58) holds. The corresponding statement for the trace of the MSE-matrix
has been one of the main justifications for ridge regression in [HK70b] and [HK70a],
and much of the literature about ridge regression has been inspired by the hope that
one can estimate κ2 in such a way that the MSE is better everywhere. This is indeed
done by the Stein-rule.
Ridge regression is reputed to be a good estimator when there is multicollinearity.
Problem 403. (Not eligible for in-class exams) Assume E[y] = µ, var(y) = σ², and you make n independent observations y_i. Then the best linear unbiased estimator of µ on the basis of these observations is the sample mean ȳ. For which range of values of α is MSE[αȳ; µ] < MSE[ȳ; µ]? Unfortunately, this range depends on µ and can therefore not be used to improve the estimate.
Answer.

(37.0.60)    \operatorname{MSE}[αȳ; µ] = \operatorname{E}[(αȳ - µ)²] = \operatorname{E}[(αȳ - αµ + αµ - µ)²] = α²σ²/n + (1 - α)²µ²

The condition \operatorname{MSE}[αȳ; µ] < \operatorname{MSE}[ȳ; µ] = \operatorname{var}[ȳ] = σ²/n therefore reads

(37.0.61)    α²σ²/n + (1 - α)²µ² < σ²/n

Now simplify it:

(37.0.62)    (1 - α)²µ² < (1 - α²)σ²/n = (1 - α)(1 + α)σ²/n

This cannot be true for α ≥ 1, because for α = 1 one has equality, and for α > 1, the right-hand side is negative. Therefore we are allowed to assume α < 1, and can divide by 1 - α without disturbing the inequality:
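The trade-off can be illustrated numerically (a sketch, not part of the original answer): shrinking helps when µ is close to zero and hurts when it is far away.

sigma2 <- 1; n <- 10; alpha <- 0.9
for (mu in c(0.1, 2)) {
  mse.shrunk <- alpha^2 * sigma2 / n + (1 - alpha)^2 * mu^2
  cat("mu =", mu, ": MSE[alpha*ybar] =", mse.shrunk,
      " vs MSE[ybar] =", sigma2 / n, "\n")
}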
CHAPTER 38

Stein Rule Estimators
• b. 0 points Assume one has Bayesian prior knowledge that β \sim N(o, τ²I), and β independent of ε. In the general case, if prior information is β \sim N(ν, τ²A^{-1}), the Bayesian posterior mean is β̂_M = (X^⊤X + κ²A)^{-1}(X^⊤y + κ²Aν) where κ² = σ²/τ². Show that in the present case β̂_M is proportional to the OLS estimate β̂ with proportionality factor (1 - \frac{σ²}{τ² + σ²}), i.e.,

(38.0.71)    β̂_M = β̂\Big(1 - \frac{σ²}{τ² + σ²}\Big).

Answer. The formula given is (36.0.36), and in the present case, A^{-1} = I. One can also view it as a regression with a random constraint Rβ \sim (o, τ²I) where R = I, which is mathematically the same as considering the known mean vector, i.e., the null vector, as additional observations. In either case one gets

(38.0.72)    β̂_M = (X^⊤X + κ²A)^{-1}X^⊤y = (X^⊤X + κ²R^⊤R)^{-1}X^⊤y = (I + κ²I)^{-1}X^⊤y = β̂\Big(1 - \frac{σ²}{τ² + σ²}\Big),

i.e., it shrinks the OLS β̂ = X^⊤y.
• c. 0 points Formula (38.0.71) can only be used for estimation if the ratio σ²/(τ² + σ²) is known. This is usually not the case, but it is possible to estimate both σ² and τ² + σ² from the data. The use of such estimates instead of the actual values of σ² and τ² in the Bayesian formulas is sometimes called “empirical Bayes.” Show that E[β̂^⊤β̂] = k(τ² + σ²), and that E[y^⊤y - β̂^⊤β̂] = (n - k)σ², where n is the number of observations and k is the number of regressors.
Answer. Since y = Xβ + ε \sim N(o, τ²XX^⊤ + σ²I), it follows β̂ = X^⊤y \sim N(o, (σ² + τ²)I) (where we now have a k-dimensional identity matrix), therefore E[β̂^⊤β̂] = k(σ² + τ²). Furthermore, since My = Mε regardless of whether β is random or not, σ² can be estimated in the usual manner from the SSE: (n - k)σ² = E[ε̂^⊤ε̂] = E[y^⊤My] = E[y^⊤y - β̂^⊤β̂] because M = I - XX^⊤.
• d. 0 points If one plugs the unbiased estimates of σ² and τ² + σ² from part (c) into (38.0.71), one obtains a version of the so-called “James and Stein” estimator

(38.0.73)    β̂_{JS} = β̂\Big(1 - c\,\frac{y^⊤y - β̂^⊤β̂}{β̂^⊤β̂}\Big).

What is the value of the constant c if one follows the above instructions? (This estimator has become famous because for k ≥ 3 and c any number between 0 and 2(n - k)/(n - k + 2) the estimator (38.0.73) has a uniformly lower MSE than the OLS β̂, where the MSE is measured as the trace of the MSE-matrix.)

Answer. c = k/(n - k). I would need a proof that this is in the bounds.
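Here is a sketch of the whole construction with simulated data, in the orthonormal-regressor setting assumed above (X^⊤X = I); a positive-part safeguard (compare (38.0.75) below) is included:

set.seed(1)
n <- 100; k <- 5
X <- qr.Q(qr(matrix(rnorm(n * k), n, k)))   # columns orthonormal, X'X = I
beta <- rnorm(k, sd = 0.5)                  # one draw of the random beta
y <- X %*% beta + rnorm(n)
bhat <- drop(t(X) %*% y)                    # OLS, since X'X = I
c.eb <- k / (n - k)                         # the empirical-Bayes c from part (d)
shrink <- 1 - c.eb * (sum(y^2) - sum(bhat^2)) / sum(bhat^2)
bJS <- bhat * max(shrink, 0)                # positive-part Stein-rule estimate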
• e. 0 points The existence of the James and Stein estimator proves that the
OLS estimator is “inadmissible.” What does this mean? Can you explain why the
OLS estimator turns out to be deficient exactly where it ostensibly tries to be strong?
What are the practical implications of this?
The properties of this estimator were first discussed in James and Stein [JS61],
extending the work of Stein [Ste56].
Stein himself did not introduce the estimator as an “empirical Bayes” estimator,
and it is not certain that this is indeed the right way to look at it. Especially this
approach does not explain why the OLS cannot be uniformly improved upon if k ≤ 2.
But it is a possible and interesting way to look at it. If one pretends one has prior
information, but does not really have it but “steals” it from the data, this “fraud”
can still be successful.
Another interpretation is that these estimators are shrunk versions of unbiased
estimators, and unbiased estimators always get better if one shrinks them a little.
The only problem is that one cannot shrink them too much, and in the case of the
normal distribution, the amount by which one has to shrink them depends on the
unknown parameters. If one estimates the shrinkage factor, one usually does not
know if the noise introduced by this estimated factor is greater or smaller than the
savings. But in the case of the Stein rule, the noise is smaller than the savings.
no prior information about β), show that the F-statistic for the hypothesis β = o is

F = \frac{β̂^⊤β̂/k}{(y^⊤y - β̂^⊤β̂)/(n - k)}.

Answer. SSE_r = y^⊤y, SSE_u = y^⊤y - β̂^⊤β̂ as shown above, and the number of constraints is k. Use equation . . . for the test statistic.
for c coming from the empirical Bayes approach)? Which estimator would you expect
to be better, and why?
Answer. This modified pre-test estimator has the form

(38.0.75)    β̂_{JS+} = \begin{cases} o & \text{if } 1 - c\,\frac{y^⊤y - β̂^⊤β̂}{β̂^⊤β̂} < 0 \\ β̂\Big(1 - c\,\frac{y^⊤y - β̂^⊤β̂}{β̂^⊤β̂}\Big) & \text{otherwise} \end{cases}

It is equal to the Stein-rule estimator (38.0.73) when the estimated shrinkage factor (1 - c\,\frac{y^⊤y - β̂^⊤β̂}{β̂^⊤β̂}) is positive, but the shrinkage factor is set to 0 instead of turning negative. This is why it is commonly called the “positive part” Stein-rule estimator. Stein conjectured early on, and Baranchik [Bar64] showed, that it is uniformly better than the Stein rule estimator:
• b. 0 points Which lessons can one draw about pre-test estimators in general
from this exercise?
Stein rule estimators have not been used very much: they are not equivariant, and the shrinkage seems arbitrary. Discussing them here brings out two things: the formulas for random constraints etc. are a pattern according to which one can build good operational estimators. And some widely used but seemingly ad-hoc procedures like pre-testing may have deeper foundations and better properties than the halfway sophisticated researcher may think.
Problem 407. 6 points Why was it somewhat of a sensation when Charles Stein came up with an estimator which is uniformly better than the OLS? Discuss the Stein rule estimator as an empirical Bayes estimator and as a shrinkage estimator, and discuss the “positive part” Stein rule estimator as a modified pretest estimator.
CHAPTER 39
Random Regressors
Until now we always assumed that X was nonrandom, i.e., the hypothetical
repetitions of the experiment used the same X matrix. In the nonexperimental
sciences, such as economics, this assumption is clearly inappropriate. It is only
justified because most results valid for nonrandom regressors can be generalized to
the case of random regressors. To indicate that the regressors are random, we will
write them as X.
Problem 409. 2 points Assume the regressors X are random, and the classical assumptions hold conditionally on X, i.e., \operatorname{E}[ε|X] = o and \operatorname{V}[ε|X] = σ²I. Show that s² is an unbiased estimate of σ².

Answer. From the theory with nonrandom explanatory variables follows \operatorname{E}[s²|X] = σ². Therefore \operatorname{E}[s²] = \operatorname{E}\big[\operatorname{E}[s²|X]\big] = \operatorname{E}[σ²] = σ². In words: if the expectation conditional on X does not depend on X, then it is also the unconditional expectation.
The law of iterated expectations can also be used to compute the unconditional
MSE matrix of β̂:
Problem 410. 2 points Show that s²(X^⊤X)^{-1} is an unbiased estimator of \operatorname{MSE}[β̂; β].

Answer.

(39.1.8)    \operatorname{E}[s²(X^⊤X)^{-1}] = \operatorname{E}\big[\operatorname{E}[s²(X^⊤X)^{-1}|X]\big]
(39.1.9)    = \operatorname{E}[σ²(X^⊤X)^{-1}]
(39.1.10)   = σ²\operatorname{E}[(X^⊤X)^{-1}]
(39.1.11)   = \operatorname{MSE}[β̂; β] \quad\text{by (39.1.7).}
this is the linear regression model discussed above. But the expected value might
also depend on X in a nonlinear fashion (nonlinear least squares), and the variance
may not be constant—in which case the intuition that y is some function of X plus
some error term may no longer be appropriate; y may for instance be the outcome
of a binary choice, the probability of which depends on X (see chapter 69.2; the
generalized linear model).
whether the sample correlation coefficients between the residuals and the explanatory variables are significantly different from zero or not. Is this an appropriate statistic?

Answer. No. The sample correlation coefficients are always zero!
CHAPTER 40

The Mahalanobis Distance

Everything in this chapter is unpublished work, presently still in draft form. The aim is to give a motivation for the least squares objective function in terms of an initial measure of precision. The case of prediction is mathematically simpler than that of estimation, therefore this chapter will only discuss prediction. We assume that the joint distribution of y and z has the form

(40.0.1)    \begin{bmatrix} y \\ z \end{bmatrix} \sim \Big(\begin{bmatrix} X \\ W \end{bmatrix} β, \; σ²\begin{bmatrix} Ω_{yy} & Ω_{yz} \\ Ω_{zy} & Ω_{zz} \end{bmatrix}\Big), \qquad σ² > 0, \text{ otherwise unknown; } β \text{ unknown as well.}

y is observed but z is not and has to be predicted. But assume we are not interested in the MSE since we do the experiment only once. We want to predict z in such a
way that, whatever the true value of β, the predicted value z ∗ “blends in” best with
the given data y.
There is an important conceptual difference between this criterion and the one
based on the MSE. The present criterion cannot be applied until after the data are
known, therefore it is called a “final” criterion as opposed to the “initial” criterion
of the MSE. See Barnett [Bar82, pp. 157–159] for a good discussion of these issues.
How do we measure the degree to which a given data set “blends in,” i.e., is not an outlier for a given distribution? Hypothesis testing uses this criterion. The most often-used testing principle is: reject the null hypothesis if the observed value of a certain statistic is too much an outlier for the distribution which this statistic would have under the null hypothesis. If the statistic is a scalar, and if under the null hypothesis this statistic has expected value µ and standard deviation σ, then one often uses an estimate of |x - µ|/σ, the number of standard deviations the observed value is away from the mean, to measure the “distance” of the observed value x from the distribution (µ, σ²). The Mahalanobis distance generalizes this concept to the case that the test statistic is a vector random variable.
motivate the Mahalanobis distance. How could one generalize the squared scalar distance (y - µ)²/σ² for the distance of a vector value y from the distribution of the vector random variable y \sim (µ, σ²Ω)? If all y_i have the same variance σ², i.e., if Ω = I, one might measure the squared distance of y from the distribution (µ, σ²Ω) by \frac{1}{σ²}\max_i (y_i - µ_i)², but since the maximum from two trials is bigger than the value from one trial only, one should divide this perhaps by the expected value of such a maximum. If the variances are different, say σ_i², one might want to look at the number of standard deviations which the “worst” component of y is away from what would be its mean if y were an observation of y, i.e., the squared distance of the observed vector from the distribution would be \max_i \frac{(y_i - µ_i)²}{σ_i²}, again normalized by its expected value.
The principle actually used by the Mahalanobis distance goes only a small step
further than the examples just cited. It is coordinate-free, i.e., any linear combi-
nations of the elements of y are considered on equal footing with these elements
themselves. In other words, it does not distinguish between variates and variables.
The distance of a given vector value from a certain multivariate distribution is defined
to be the distance of the “worst” linear combination of the elements of this vector
from the univariate distribution of this linear combination, normalized in such a way
that the expected value of this distance is 1.
Definition 40.1.1. Given a random n-vector y which has an expected value and a nonsingular covariance matrix. The squared “Mahalanobis distance” or “statistical distance” of the observed value y from the distribution of y is defined to be

(40.1.1)    \operatorname{MHD}[y; y] = \frac{1}{n}\max_g \frac{\big(g^⊤y - \operatorname{E}[g^⊤y]\big)²}{\operatorname{var}[g^⊤y]}.

If the denominator \operatorname{var}[g^⊤y] is zero, then g = o, therefore the numerator is zero as well. In this case the fraction is defined to be zero.
Theorem 40.1.2. Let y be a vector random variable with \operatorname{E}[y] = µ and \operatorname{V}[y] = σ²Ω, σ² > 0 and Ω positive definite. The squared Mahalanobis distance of the value y from the distribution of y is equal to

(40.1.2)    \operatorname{MHD}[y; y] = \frac{1}{nσ²}(y - µ)^⊤Ω^{-1}(y - µ)

Proof. (40.1.2) is a simple consequence of (32.4.4). It is also somewhat intuitive since the right-hand side of (40.1.2) can be considered a division of the square of y - µ by the covariance matrix of y.
The Mahalanobis distance is an asymmetric measure; a large value indicates a
bad fit of the hypothetical population to the observation, while a value of, say, 0.1
does not necessarily indicate a better fit than a value of 1.
Problem 413. Let y be a random n-vector with expected value µ and nonsingular covariance matrix σ²Ω. Show that the expected value of the Mahalanobis distance of the observations of y from the distribution of y is 1, i.e.,

(40.1.3)    \operatorname{E}\big[\operatorname{MHD}[y; y]\big] = 1

Answer.

(40.1.4)    \operatorname{E}\Big[\frac{1}{nσ²}(y - µ)^⊤Ω^{-1}(y - µ)\Big] = \frac{1}{nσ²}\operatorname{E}\big[\operatorname{tr}\big(Ω^{-1}(y - µ)(y - µ)^⊤\big)\big] = \frac{1}{nσ²}\operatorname{tr}(Ω^{-1}σ²Ω) = \frac{1}{n}\operatorname{tr}(I) = 1.
(40.1.2) is, up to a constant factor, the quadratic form in the exponent of the
normal density function of y. For a normally distributed y, therefore, all observations
located on the same density contour have equal distance from the distribution.
The Mahalanobis distance is also defined if the covariance matrix of y is singular.
In this case, certain nonzero linear combinations of the elements of y are known with
certainty. Certain vectors can therefore not possibly be realizations of y, i.e., the set
of realizations of y does not fill the whole Rn .
Problem 414. 2 points The random vector y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} has mean \begin{bmatrix} 1 \\ 2 \\ -3 \end{bmatrix} and covariance matrix \begin{bmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{bmatrix}. Is this covariance matrix singular? If so, give a linear combination of the elements of y which is known with certainty. And give a value which can never be a realization of y. Prove everything you state.

Answer. Yes, it is singular;

(40.1.5)    \begin{bmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

I.e., y_1 + y_2 + y_3 = 0, because its variance is 0 and its mean is zero as well, since \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ -3 \end{bmatrix} = 0. Therefore any vector whose components do not sum to zero can never be a realization of y.
Definition 40.1.3. Given a vector random variable y which has a mean and a covariance matrix. A value y has infinite statistical distance from this random variable, i.e., it cannot possibly be a realization of this random variable, if a vector of coefficients g exists such that \operatorname{var}[g^⊤y] = 0 but g^⊤y ≠ g^⊤\operatorname{E}[y]. If such a g does not exist, then the squared Mahalanobis distance of y from y is defined as in (40.1.1), with n replaced by \operatorname{rank}[Ω]. If the denominator in (40.1.1) is zero, then it no longer necessarily follows that g = o but it nevertheless follows that the numerator is zero, and the fraction should in this case again be considered zero.
If Ω is singular, then the inverse Ω −1 in formula (40.1.2) must be replaced by a
“g-inverse.” A g-inverse of a matrix A is any matrix A− which satisfies AA− A = A.
G-inverses always exist, but they are usually not unique.
Problem 415. a is a scalar. What is its g-inverse a− ?
their distribution. But if one works with the relative increase in the Mahalanobis
distance if z is added to y, then σ 2 cancels out. In order to measure how well the
conjectured value z fits together with the observed y we will therefore divide the
Mahalanobis distance of the vector composed of y and z from its distribution by the
Mahalanobis distance of y alone from its distribution:
(40.3.2)    \frac{\frac{1}{rσ²}\begin{bmatrix} y - µ \\ z - ν \end{bmatrix}^⊤ \begin{bmatrix} Ω_{yy} & Ω_{yz} \\ Ω_{zy} & Ω_{zz} \end{bmatrix}^{-} \begin{bmatrix} y - µ \\ z - ν \end{bmatrix}}{\frac{1}{pσ²}(y - µ)^⊤Ω_{yy}^{-}(y - µ)}.

This relative measure is independent of σ², and if y is observed but z is not, one can predict z by that value z^* which minimizes this relative contribution.
An equivalent criterion which leads to mathematically simpler formulas is to divide the conditional Mahalanobis distance of z given y by the Mahalanobis distance of y from y:

(40.3.3)    \frac{\frac{1}{(r-p)σ²}\Big(\begin{bmatrix} y - µ \\ z - ν \end{bmatrix}^⊤ \begin{bmatrix} Ω_{yy} & Ω_{yz} \\ Ω_{zy} & Ω_{zz} \end{bmatrix}^{-} \begin{bmatrix} y - µ \\ z - ν \end{bmatrix} - (y - µ)^⊤Ω_{yy}^{-}(y - µ)\Big)}{\frac{1}{pσ²}(y - µ)^⊤Ω_{yy}^{-}(y - µ)}.
We already solved this minimization problem in chapter ??. By (??), the mini-
mum value of this relative contribution is zero, and the value of z which minimizes
906 40. THE MAHALANOBIS DISTANCE
this relative contribution is the same as the value of the best linear predictor of z,
i.e., the value assumed by the linear predictor which minimizes the MSE among all
linear predictors.
(40.4.1)    q = \frac{(µ - y_{n+1})²}{(ιµ - y)^⊤(ιµ - y)/n}
We will show that the prediction y n+1 = ȳ is the solution to this minimax problem,
and that the minimax value of q is q = 1.
We will show that (1) for y n+1 = ȳ, q ≤ 1 for all values of µ, but one can find µ
for which q is arbitrarily close to 1, and (2) if y n+1 6= ȳ, then q > 1 for certain values
of µ.
In the proof of step (1), the case y_1 = y_2 = \cdots = y_n must be treated separately. If this condition holds (which is always the case if n = 1, but otherwise it is a special case occurring with zero probability), and someone predicts y_{n+1} by a value different from the value taken by all previous realizations of y, i.e., if y_1 = y_2 = \cdots = y_n ≠ y_{n+1}, then q is unbounded and becomes infinite if µ takes the same value as the observed y_i. If, on the other hand, y_1 = y_2 = \cdots = y_{n+1}, then q = 1 if µ does not take the same value as the y_i, and q is an undefined 0/0 value otherwise, but by continuity we can say q = 1 for all µ. Therefore y_{n+1} = ȳ is the best predictor in this special case.
Now turn to the regular case in which not all observed y i are equal. Re-write q
as
(40.4.2)    q = \frac{(µ - y_{n+1})²}{(ιȳ - y)^⊤(ιȳ - y)/n + (µ - ȳ)²}
or equivalently
(40.5.2)    \operatorname{MHD}\Big[\begin{bmatrix} y \\ y_{n+1} \end{bmatrix}; \begin{bmatrix} y \\ y_{n+1} \end{bmatrix}\Big] = \operatorname{MHD}\Big[\begin{bmatrix} y \\ y_{n+1} \end{bmatrix}; \begin{bmatrix} X \\ x_{n+1}^⊤ \end{bmatrix} β, σ²I\Big] =

(40.5.3)    = \frac{1}{(n+1)σ²}\Big((y - Xβ)^⊤(y - Xβ) + (y_{n+1} - x_{n+1}^⊤β)²\Big)

(40.5.4)    \operatorname{MHD}[y_{n+1}; y_{n+1}|y] = \frac{1}{σ²}(y_{n+1} - x_{n+1}^⊤β)²

Both Mahalanobis distances (40.5.1) and (40.5.4) are unknown since they depend on the unknown parameters. However we will show here that whatever the true value of the parameters, the ratio of the conditional divided by the original Mahalanobis
Both Mahalanobis distances (40.5.1) and (40.5.4) are unknown since they depend on
the unknown parameters. However we will show here that whatever the true value
of the parameters, the ratio of the conditional divided by the original Mahalanobis
We will show that the OLS prediction y_{n+1} = x_{n+1}^⊤β̂ is the solution to this minimax problem, and that the minimax value of q is q = x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}.

This proof will proceed in two steps. (1) For y_{n+1} = x_{n+1}^⊤β̂, q ≤ x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1} for all values of β, and whatever g-inverse was used in (40.5.6), but one can find β for which q is arbitrarily close to x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}. (2) If y_{n+1} ≠ x_{n+1}^⊤β̂, then q > x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1} for certain values of β, and again independent of the choice of g-inverse in (40.5.6).
In the proof of step (1), the case y = Xβ̃ for some β̃ must be treated separately. If this condition holds (which is always the case if rank X = n, but otherwise it only occurs with zero probability), and someone predicts y_{n+1} by a value different from x_{n+1}^⊤β̃, then q is unbounded as the true β approaches β̃.

If, on the other hand, y_{n+1} is predicted by y_{n+1} = x_{n+1}^⊤β̃, then

(40.5.7)    q = \frac{\big(x_{n+1}^⊤(β̃ - β)\big)²}{(β̃ - β)^⊤X^⊤X(β̃ - β)} ≤ x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}

if β ≠ β̃ (with equality holding if β̃ - β = λ(X^⊤X)^{-1}x_{n+1} for some λ ≠ 0), and q = 0 if β = β̃.
Now turn to the regular case in which the vector y cannot be written in the form y = Xβ̃ for any β̃. Re-write (40.5.6) as

(40.5.8)    q = \frac{(y_{n+1} - x_{n+1}^⊤β)²}{(y - Xβ̂)^⊤(y - Xβ̂) + (β̂ - β)^⊤X^⊤X(β̂ - β)}
If y_{n+1} = x_{n+1}^⊤β̂, then

(40.5.9)    q = \frac{\big(x_{n+1}^⊤(β̂ - β)\big)²}{(y - Xβ̂)^⊤(y - Xβ̂) + (β̂ - β)^⊤X^⊤X(β̂ - β)} ≤ \frac{\big(x_{n+1}^⊤(β̂ - β)\big)²}{(β̂ - β)^⊤X^⊤X(β̂ - β)} ≤ x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}

One gets arbitrarily close to x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1} for β̂ - β = λ(X^⊤X)^{-1}x_{n+1} and λ sufficiently large.
For step (2) of the proof we have to show: if y_{n+1} is not equal to x_{n+1}^⊤β̂, then q can assume values larger than x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}. To show this, we will find for a given y and y_{n+1} that parameter value β̃ for which this relative increase is highest, and the value of the highest relative increase.
Here is the rough draft for the continuation of this proof. Here I am solving the
first order condition and I am not 100 percent sure whether it is a global maximum.
For the derivation which follows this won’t matter, but I am on the lookout for a
proof that it is. After I am done with this, this need not even be in the proof, all
that needs to be in the proof is that this highest value (wherever I get it from) gives
a value of q that is greater than n\,x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}.
(40.5.10)    q = n\,\frac{(ζ - a)²}{b² + ξ²} = n\,\frac{u}{v}

where ζ = x_{n+1}^⊤β, a = y_{n+1}, b² = (y - Xβ̂)^⊤(y - Xβ̂), and ξ² = (β̂ - β)^⊤X^⊤X(β̂ - β). Hence \frac{∂ξ²}{∂β^⊤} = -2(β̂ - β)^⊤X^⊤X =: -2α^⊤. To get the highest relative increase, we need the roots of

(40.5.11)    \Big(\frac{u}{v}\Big)' = \frac{u'v - v'u}{v²}

Here, in a provisional notation which will be replaced by matrix differentiation eventually, the prime represents the derivative with respect to β_i. I will call -2α_i = \frac{∂ξ²}{∂β_i}. Since the denominator is always positive, we only have to look for the roots of u'v - v'u = 2x_{n+1,i}(ζ - a)(b² + ξ²) + 2α_i(ζ - a)². Here it is in matrix differentiation:

(40.5.12)    \frac{∂u}{∂β}v - \frac{∂v}{∂β}u = 2(ζ - a)(b² + ξ²)\frac{∂ζ}{∂β} - (ζ - a)²\frac{∂ξ²}{∂β} = 0
or equivalently
(40.5.19)    γ(x_{n+1}^⊤β̂ - y_{n+1}) + γ²x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1} = b² + γ²x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}

(40.5.20)    γ = \frac{b²}{x_{n+1}^⊤β̂ - y_{n+1}}.
The worst β is

(40.5.21)    β = β̂ - \frac{(y - Xβ̂)^⊤(y - Xβ̂)}{y_{n+1} - x_{n+1}^⊤β̂}\,(X^⊤X)^{-1}x_{n+1}.
Now I should start from here and prove it anew, and also do the proof that all other values of β give a better q.
Plugging (40.5.16) and (40.5.17) into (40.5.10) gives

(40.5.22)    q = \frac{\big(x_{n+1}^⊤β̂ - y_{n+1} + γx_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}\big)²}{b² + γ²x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}},
(40.5.23)    q = \frac{1}{γ}\,\frac{\big(x_{n+1}^⊤β̂ - y_{n+1} + γx_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}\big)²}{x_{n+1}^⊤β̂ - y_{n+1} + γx_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}} = \frac{x_{n+1}^⊤β̂ - y_{n+1}}{γ} + x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1} = \frac{(x_{n+1}^⊤β̂ - y_{n+1})²}{b²} + x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}

This is clearly bigger than x_{n+1}^⊤(X^⊤X)^{-1}x_{n+1}. This is what we had to show.
By the way, the excess of this q over the minimum value is proportional to the F-statistic for the test whether the (n+1)st observation comes from the same value β. (The only difference is that numerator and denominator are not divided by their degrees of freedom.) This gives a new interpretation of the F-test, and also of the F-confidence regions (which will be more striking if we predict more than one observation). F-confidence regions are conjectured observations for which the minimax value of the Mahalanobis ratio stays below a certain bound.
Now the proof that it is the worst β. Without loss of generality we can write any β as follows in terms of a δ:

(40.5.24)    β = β̂ - γ(X^⊤X)^{-1}(x_{n+1} + δ) \quad\text{where}\quad γ = \frac{(y - Xβ̂)^⊤(y - Xβ̂)}{y_{n+1} - x_{n+1}^⊤β̂}
918 40. THE MAHALANOBIS DISTANCE
(x>
n+1 β̂−y n+1 )
2
γ2 b2 = b2 and γ(x> 2
n+1 β̂ − y n+1 ) = b we get
(40.5.30)
>
(x> 2 2 >
n+1 β̂ − y n+1 ) + b xn+1 (X X)
−1
xn+1 + b2 (xn+1 + δ)> (X > X)−1 (xn+1 + δ)+
+ γ 2 (xn+1 + δ)> (X > X)−1 (xn+1 + δ)x> >
n+1 (X X)
−1
xn+1 ≥
> >
2
≥ (y n+1 −x> 2 2 >
n+1 β̂) +2b xn+1 (X X)
−1
(xn+1 +δ)+γ 2 x>
n+1 (X X)
−1
(xn+1 +δ)
or
>
(40.5.31) b2 x>
n+1 (X X)
−1
xn+1 + b2 (xn+1 + δ)> (X > X)−1 (xn+1 + δ)+
+ γ 2 (xn+1 + δ)> (X > X)−1 (xn+1 + δ)x> >
n+1 (X X)
−1
xn+1 −
> >
2
− 2b2 x>
n+1 (X X)
−1
(xn+1 + δ) − γ 2 x>
n+1 (X X)
−1
(xn+1 + δ) ≥ 0
Collecting terms we get
(40.5.32)
b2 δ > (X > X)−1 δ+γ 2 δ > (X > X)−1 δx> > −1
xn+1 −(x> > −1 2
n+1 (X X) n+1 (X X) δ) ≥ 0
which certainly holds. These steps can be reversed, which concludes the proof.
CHAPTER 41
Interval Estimation
We will first show how the least squares principle can be used to construct
confidence regions, and then we will derive the properties of these confidence regions.
• For every other vector β̃ one can define the sum of squared errors associ-
ated with that vector as SSE β̃ = (y − X β̃)> (y − X β̃). Draw the level
hypersurfaces (if k = 2: level lines) of this function. These are ellipsoids
centered on β̂.
• Each of these ellipsoids is a confidence region for β. Different confidence
regions differ by their coverage probabilities.
• If one is only interested in certain coordinates of β and not in the others, or
in some other linear transformation β, then the corresponding confidence
regions are the corresponding transformations of this ellipse. Geometrically
this can best be seen if this transformation is an orthogonal projection; then
the confidence ellipse of the transformed vector Rβ is also a projection or
“shadow” of the confidence region for the whole vector. Projections of the
same confidence region have the same confidence level, independent of the
direction in which this projection goes.
The confidence regions for β with coverage probability π will be written here
as B β;π or, if we want to make its dependence on the observation vector y explicit,
Bβ;π (y). These confidence regions are level lines of the SSE, and mathematically,
it is advantageous to define these level lines by their level relative to the minimum
level, i.e., as as the set of all β̃ for which the quotient of the attained SSE β̃ =
41.1. CONSTRUCTION OF CONFIDENCE REGIONS 923
Answer. Because the projection is a many-to-one mapping, and vectors which are not in the
original ellipsoid may still end up in the projection.
Again let us illustrate this with the 2-dimensional case in which the confidence
region for β is an ellipse, as drawn in Figure 1, called Bβ;π (y). Starting with this
ellipse, the above criterion defines individual confidence intervals for linear combina-
tions u = r > β by the rule: ũ ∈ Br> β;π (y) iff a β̃ ∈ Bβ (y) exists with r > β̃ = ũ. For
r = [ 10 ], this interval is simply the projection of the ellipse on the horizontal axis,
and for r = [ 01 ] it is the projection on the vertical axis.
The same argument applies for all vectors r with r > r = 1. The inner product
of two vectors is the length of the first vector times the length of the projection
of the second vector on the first. If r > r = 1, therefore, r > β̃ is simply the length
of the orthogonal projection of β̃ on the line generated by the vector r. Therefore
41.1. CONSTRUCTION OF CONFIDENCE REGIONS 925
the confidence interval for r > β is simply the projection of the ellipse on the line
generated by r. (This projection is sometimes called the “shadow” of the ellipse.)
The confidence region for Rβ can also be defined as follows: ũ lies in this
ˆ ˆ
confidence region if and only if the “best” β̂ which satisfies Rβ̂ = ũ lies in the
ˆ being, of course, the constrained least squares
confidence region (41.1.1), this best β̂
estimate subject to the constraint Rβ = ũ, whose formula is given by (29.3.13).
The confidence region for Rβ consists therefore of all ũ for which the constrained
ˆ −1
least squares estimate β̂ = β̂ − (X > X)−1 R> R(X > X)−1 R> (Rβ̂ − ũ) satisfies
condition (41.1.1):
ˆ > (y − X β̂)
(y − X β̂) ˆ
(41.1.3) ũ ∈ BRβ (y) ⇐⇒ ≤ cπ;n−k,i
(y − X β̂)> (y − X β̂)
i.e., those ũ are in the confidence region which, if imposed as a constraint on the
regression, will not make the SSE too much bigger.
926 41. INTERVAL ESTIMATION
−2 −1 0 1 2
−3 ..................
.....
..............................................................
.......................
...............
−3
..... ...................................................................................................................
... ...... ................... ............
.... .... ................. ...........
............. ..........
.... ... ........... ..........
.....
...... ....... .......... ..........
.......... .........
...... .
....... ........... ......... .........
....... ........ ......... ........
........ .........
....... ....... ........ .......
........ ....... ..
........ ......... ....... ........
........ ........ ....... .......
−4 ......... .........
......... .........
......... ..........
.......... ...........
....... ......
...... .......
.....
.... ...........
. −4
.......... ........... ... ....
........... ................ ... ....
........... .................. . ...
............... ...................... ... . ...
.................
................... ...................................................................... ..
....................... .....
................................
....................................................
−5 −5
−2 −1 0 1 2
Figure 1. Confidence Ellipse with “Shadows”
Problem 418. You have run a regression with intercept, but you are not inter-
ested in the intercept per se but need a joint confidence region for all slope parameters.
Using the notation of Problem 361, show that this confidence region has the form
I.e., we are sweeping the means out of both regressors and dependent variables, and
then we act as if the regression never had an intercept and use the formula for the
full parameter vector (41.1.6) for these transformed data (except that the number of
degrees of freedom n−k still reflects the intercept as one of the explanatory variables).
α
Answer. Write the full parameter vector as and R = o I . Use (41.1.5) but instead
β
of ũ write β̃. The only tricky part is the following which uses (30.0.37):
(41.1.8)
1/n + x̄> (X > X)−1 x̄ −x̄> (X > X)−1 o>
R(X > X)−1 R> = o I = (X > X)−1
−(X > X)−1 x̄ (X > X)−1 I
The denominator is (y − ια̂ − X β̂)> (y − ια̂ − X β̂), but since α̂ = ȳ − x̄> β̂, see problem 242, this
denominator can be rewritten as (y − X β̂)> (y − X β̂).
Therefore, instead of , the condition deciding whether a given vector ũ lies in the
confidence region for Rβ with confidence level π = 1 − α is formulated as follows:
(41.3.1)
(SSE constrained − SSE unconstrained )/number of constraints
≤ F(i,n−k;α)
SSE unconstr. /(numb. of obs. − numb. of coeff. in unconstr. model)
Here the constrained SSE is the SSE in the model estimated with the constraint
Rβ = ũ imposed, and F(i,n−k;α) is the upper α quantile of the F distribution
with i and n − k degrees of freedom, i.e., it is that scalar c for which a random
variable F which has a F distribution with i and n − k degrees of freedom satisfies
Pr[F ≥ c] = α.
Therefore from (41.1.5) one gets the following alternative formula for the joint con-
fidence region B(y) for the vector parameter u = Rβ for confidence level π = 1 − α:
(41.4.2)
1 −1
ũ ∈ BRβ;1−α (y) ⇐⇒ (Rβ̂ − ũ)> R(X > X)−1 R> (Rβ̂ − ũ) ≤ iF(i,n−k;α)
s2
Here β̂ is the least squares estimator of β, and s2 = (y − X β̂)> (y − X β̂)/(n − k) the
unbiased estimator of σ 2 . Therefore Σ̂ = s2 (X > X)−1 is the estimated covariance
matrix as available in the regression printout. Therefore V̂ = s2 R(X > X)−1 R>
is the estimate of the covariance matrix of Rβ̂. Another way to write (41.4.2) is
therefore
−1
(41.4.3) B(y) = {ũ ∈ Ri : (Rβ̂ − ũ)> V̂ (Rβ̂ − ũ) ≤ iF(i,n−k;α) }.
This formula allows a suggestive interpretation. whether ũ lies in the confidence
region or not depends on the Mahalanobis distance of the actual value of Rβ̂ would
have from the distribution which Rβ̂ would have if the true parameter vector were
to satisfy the constraint Rβ = ũ. It is not the Mahalanobis distance itself but only
an estimate of it because σ 2 is replaced by its unbiased estimate s2 .
These formulas are also useful for drawing the confidence ellipses. The p r which
you need in equation (10.3.22) in order to draw the confidence ellipse is r = iF(i,n−k;α)
934 41. INTERVAL ESTIMATION
This is the same as the local variable mult in the following S-function to draw this
ellipse: its arguments are the center point (a 2-vector d), the estimated covariance
matrix (a 2 × 2 matrix C), the degrees of freedom in the denominator of the F -
distribution (the scalar df), and the confidence level (the scalar level between 0
and 1 which defaults to 0.95 if not specified).
confelli <-
function(b, C, df, level = 0.95, xlab = "", ylab = "", add=T, prec=51)
{
d <- sqrt(diag(C))
dfvec <- c(2, df)
phase <- acos(C[1, 2]/(d[1] * d[2]))
angles <- seq( - (PI), PI, len = prec)
mult <- sqrt(dfvec[1] * qf(level, dfvec[1], dfvec[2]))
xpts <- b[1] + d[1] * mult * cos(angles)
ypts <- b[2] + d[2] * mult * cos(angles + phase)
if(add) lines(xpts, ypts)
else plot(xpts, ypts, type = "l", xlab = xlab, ylab = ylab)
}
The mathematics why this works is in Problem 166.
Answer.
(41.4.5) Pr[B(y) 3 Rβ] = Pr[(Rβ̂ − Rβ)> (R(X > X)−1 R> )−1 (Rβ̂ − Rβ) ≤ iF(i,n−k;α) s2 ] =
This interpretation with the Mahalanobis distance is commonly used for the
construction of t-Intervals. A t-interval is a special case of the above confidence
region for the case i = 1. The confidence interval with confidence level 1 − α for the
scalar parameter u = r > β, where r 6= o is a vector of constant coefficients, can be
written as
(41.4.7) B(y) = {u ∈ R : |u − r > β̂| ≤ t(n−k;α/2) sr> β̂ }.
Problem 421. Which element(s) on the right hand side of (41.4.7) depend(s)
on y?
Let us verify that the coverage probability, i.e., the probability that the confi-
dence interval constructed using formula (41.4.7) contains the true value r > β, is, as
938 41. INTERVAL ESTIMATION
claimed, 1 − α:
(41.4.10)
Pr[B(y) 3 r > β] = Pr[r > β − r > β̂ ≤ t(n−k;α/2) sr> β̂ ]
h q i
= Pr r > (X > X)−1 X >ε ≤ t(n−k;α/2) s r > (X > X)−1 r
(41.4.11)
> > −1 >
r (X X) X ε
(41.4.12) = Pr[ q ≤ t(n−k;α/2) ]
s r > (X > X)−1 r
> > −1 > .
r (X X) X ε s
(41.4.13) = Pr[ q ≤ t(n−k;α/2) ] = 1 − α,
σ r > (X > X)−1 r σ
This last equality holds because the expression left of the big slash is a standard
normal, and the expression on the right of the big slash is the square root of an
independent χ2n−k divided by n − k. The random variable between the absolute signs
has therefore a t-distribution, and (41.4.13) follows from (41.4.8).
In R, one obtains t(n−k;α/2) by giving the command qt(1-alpha/2,n-p). Here
qt stands for t-quantile [BCW96, p. 48]. One needs 1-alpha/2 instead of alpha/2
41.4. INTERPRETATION IN TERMS OF STUDENTIZED MAHALANOBIS DISTANCE 939
because it is the usual convention for quantiles (or cumulative distribution functions)
to be defined as lower quantiles, i.e., as the probabilities of a random variable being
≤ a given number, while test statistics are usually designed in such a way that the
significant values are the high values, i.e., for testing one needs the upper quantiles.
There is a basic duality between confidence intervals and hypothesis tests. Chap-
ter 42 is therefore a discussion of the same subject under a slightly different angle:
CHAPTER 42
941
942 42. THREE PRINCIPLES FOR TESTING A LINEAR CONSTRAINT
the model with the constraint imposed has a much worse fit than the model without
the constraint.
(3) (“Lagrange Multiplier Criterion”) This third criterion is based on the con-
strained estimator only. It has two variants. In its “score test” variant, one rejects
the null hypothesis if the vector of derivatives of the unconstrained least squares
ˆ is too far away from o.
objective function, evaluated at the constrained estimate β̂,
In the variant which has given this Criterion its name, one rejects if the vector of
Lagrange multipliers needed for imposing the constraint is too far away from o.
Many textbooks inadvertently and implicitly distinguish between (1) and (2)
as follows: they introduce the t-test for one parameter by principle (1), and the
F -test for several parameters by principle (2). Later, the student is surprised to
find out that the t-test and the F -test in one dimension are equivalent, i.e., that
the difference between t-test and F -test has nothing to do with the dimension of
the parameter vector to be tested. Some textbooks make the distinction between
(1) and (2) explicit. For instance [Chr87, p. 29ff] distinguishes between “testing
linear parametric functions” and “testing models.” However the distinction between
all 3 principles has been introduced into the linear model only after the discovery
that these three principles give different but asymptotically equivalent tests in the
Maximum Likelihood estimation. Compare [DM93, Chapter 3.6] about this.
42.1. MATHEMATICAL DETAIL OF THE THREE APPROACHES 943
If the constraint holds, the SSE’s divided by their respective degrees of freedom
should give roughly equal numbers. According to this, a feasible test statistic would
be
SSE r /(n + i − k)
(42.1.2)
SSE u /(n − k)
and one would reject if this is too much > 1. The following variation of this is more
convenient, since its distribution does not depend on n, k and i separately, but only
through n − k and i.
(SSE r − SSE u )/i
(42.1.3)
SSE u /(n − k)
It still has the property that the numerator is an unbiased estimator of σ 2 if the con-
straint holds and biased upwards if the constraint does not hold, and the denominator
is always an unbiased estimator. Furthermore, in this variation, the numerator and
denominator are independent random variables. If this test statistic is much larger
than 1, then the constraints are incompatible with the data and the null hypothesis
must be rejected. The statistic (42.1.3) can also be written as
(42.1.4)
(SSE constrained − SSE unconstrained )/number of constraints
SSE unconstrained /(numb. of observations − numb. of coefficients in unconstr. model)
42.1. MATHEMATICAL DETAIL OF THE THREE APPROACHES 945
The Lagrange multiplier statistic is based on the restricted estimator alone. If one
wanted to take this principle seriously one would have to to replace σ 2 by the unbiased
estimate from the restricted model to get the “score form” of the Lagrange Multiplier
Test statistic. But in the linear model this leads to it that the denominator in the
test statistic is no longer independent of the numerator, and since the test statistic as
a function of the ratio of the constrained and unconstrained estimates of σ 2 anyway,
one will only get yet another monotonic transformation of the same test statistic.
If one were to use the unbiased estimate from the unrestricted model, one would
exactly get the Wald statistic back, as one can verify using (29.3.13).
This same statistic can also be motivated in terms of the Lagrange multipliers,
and this is where this testing principle has its name from, although the applications
usually use the score form. According to (29.3.12), the Lagrange multiplier is λ =
−1
2 R(X > X)−1 R> (Rβ̂ − u). If the constraint holds, then E [λ] = o, and V [λ] =
> −1 > −1
2
4σ R(X X) R . The Mahalanobis distance of the observed value from this
distribution is
1 >
(42.1.8) λ> (V [λ])−1 λ = λ R(X > X)−1 R> λ
4σ 2
Using (29.7.1) one can verify that this is the same as (42.1.7).
Problem 422. Show that (42.1.7) is equal to the righthand side of (42.1.8).
42.1. MATHEMATICAL DETAIL OF THE THREE APPROACHES 947
>
Problem 423. 10 points Prove that ε̂ˆ ε̂ˆ − ε̂> ε̂ can be written alternatively in
the following five ways:
> ˆ ˆ
(42.1.9) ε̂ˆ ε̂ˆ − ε̂> ε̂ = (β̂ − β̂)> X > X(β̂ − β̂)
(42.1.10) = (Rβ̂ − u)> (R(X > X)−1 R> )−1 (Rβ̂ − u)
1
(42.1.11) = λ> R(X > X)−1 R> λ
4
>
(42.1.12) = ε̂ X(X > X)−1 X > ε̂ˆ
ˆ
Answer.
(42.1.19) ε ˆ = X β̂ + ε̂
ˆ = y − X β̂ ε − X β̂ ˆ + ε̂
ˆ = X(β̂ − β̂) ε,
ε̂
and since X >ε̂ε = o, the righthand decomposition is an orthogonal decomposition. This gives
(42.1.9) above:
(42.1.20) ˆ>ε̂
ε
ε̂ ˆ = (β̂ − β̂)
ε ˆ + ε̂
ˆ > X > X(β̂ − β̂) ε>ε̂
ε,
ˆ = σ 2 (X > X)−1 R> R(X > X)−1 R> −1 R(X > X)−1 . This is
Using (29.3.13) one obtains V [β̂ − β̂]
a singular matrix, and one verifies immediately that σ12 X > X is a g-inverse of it.
To obtain (42.1.10), which is (29.7.2), one has to plug (29.3.13) into (42.1.20). Clearly, V [Rβ̂ −
u] = σ 2 R(X > X)−1 R> .
For (42.1.11) one needs the formula for the Lagrange multiplier (29.3.12).
• E(SSE u ) = E(ε̂> ε̂) = σ 2 (n −k), which holds whether or not the constraint
is true. Furthermore it was shown earlier that
(42.1.21) E(SSE r − SSE u ) = σ 2 i + (Rβ − u)> (R(X > X)−1 R> )−1 (Rβ − u),
i.e., this expected value is equal to σ 2 i if the constraint is true, and larger
otherwise. If one divides SSE u and SSE r − SSE u by their respective
degrees of freedom, as is done in (42.1.4), one obtains therefore: the de-
nominator is always an unbiased estimator of σ 2 , regardless of whether the
null hypothesis is true or not. The numerator is an unbiased estimator of
σ 2 when the null hypothesis is correct, and has a positive bias otherwise.
• If the distribution of ε is normal, then numerator and denominator are
independent. The numerator is a function of β̂ and the denominator one
of ε̂, and β̂ and ε̂ are independent.
• Again under assumption of normality, numerator and denominator are dis-
tributed as σ 2 χ2 with i and n − k degrees of freedom, divided by their
respective degrees of freedom. If one divides them, the common factor σ 2
cancels out, and the ratio has a F distribution. Since both numerator and
denominator have the same expected value σ 2 , the value of this F distri-
bution should be in the order of magnitude of 1. If it is much larger than
that, the null hypothesis is to be rejected. (Precise values in the F -tables).
950 42. THREE PRINCIPLES FOR TESTING A LINEAR CONSTRAINT
(β̂ j − u)2
(42.2.1) ∼ F1,n−k when H is true.
s2 djj
This is the square of a random variable which has a t-distribution:
β̂ j − u
(42.2.2) p ∼ tn−k when H is true.
s djj
This latter test statistic is simply β̂ j − u divided by the estimated standard deviation
of β̂ j .
If one wants to test that a certain linear combination of the parameter values is
equal to (or bigger than or smaller than) a given value, say r > β = u, one can use a
t-test as well. The test statistic is, again, simply r > β̂ − u divided by the estimated
42.2. EXAMPLES OF TESTS OF LINEAR HYPOTHESES 951
r > β̂ − u
(42.2.3) q ∼ tn−k when H is true.
s r > (X > X)−1 r
By this one can for instance also test whether the sum of certain regression coefficients
is equal to 1, or whether two regression coefficients are equal to each other (but not
the hypothesis that three coefficients are equal to each other).
Many textbooks use the Wald criterion to derive the t-test, and the Likelihood-
Ratio criterion to derive the F -test. Our approach showed that the Wald criterion
can be used for simultaneous testing of several hypotheses as well. The t-test is
equivalent to an F -test if only one hypothesis is tested, i.e., if R is a row vector.
The only difference is that with the t-test one can test one-sided hypotheses, with
the F -test one cannot.
Next let us discuss the test for the existence of a relationship, “the” F -test
which every statistics package performs automatically whenever the regression has
a constant term: it is the test whether all the slope parameters are zero, such that
only the intercept may take a nonzero value.
952 42. THREE PRINCIPLES FOR TESTING A LINEAR CONSTRAINT
Problem 424. 4 points In the model y = Xβ + ε with intercept, show that the
test statistic for testing whether all the slope parameters are zero is
(y > X β̂ − nȳ 2 )/(k − 1)
(42.2.4)
(y > y − y > X β̂)/(n − k)
This is [Seb77, equation (4.26) on p. 110]. What is the distribution of this test
statistic if the null hypothesis is true (i.e., if all the slope parameters are zero)?
Answer. The distribution is ∼ F k−1,n−k . (42.2.4) is most conveniently derived from (42.1.4).
In the constrained model, which has only a constant term and no other explanatory variables, i.e.,
y = ιµ + ε , the BLUE is µ̂ = ȳ. Therefore the constrained residual sum of squares SSE const. is
what is commonly called SST (“total” or, more precisely, “corrected total” sum of squares):
(42.2.5)
SSE const. = SST = (y − ιȳ)> (y − ιȳ) = y > (y − ιȳ) = y > y − nȳ 2
while the unconstrained residual sum of squares is what is usually called SSE:
(42.2.6)
SSE unconst. = SSE = (y − X β̂)> (y − X β̂) = y > (y − X β̂) = y > y − y > X β̂.
This last equation because X > (y − X β̂) = X > ε̂ = o. A more elegant way is perhaps
(42.2.7)
SSE unconst. = SSE = ε̂> ε̂ = y > M > M y = y > M y = y > y − y > X(X > X)−1 X > y = y > y − y > X β̂
42.2. EXAMPLES OF TESTS OF LINEAR HYPOTHESES 953
According to (18.3.12) we can write SSR = SST − SSE, therefore the F -statistic is
Problem 425. 2 points Can one compute the value of the F -statistic testing for
the existence of a relationship if one only knows the coefficient of determination R2 =
SSR/SST , the number of observations n, and the number of regressors (counting the
constant term as one of the regressors) k?
Answer.
SSR/(k − 1) n−k SSR n − k R2
(42.2.9) F = = = .
SSE/(n − k) k − 1 SST − SSR k − 1 1 − R2
Other, similar F -tests are: the F -test that all among a number of additional
variables have the coefficient zero, the F -test that three or more coefficients are
equal. One can use the t-test for testing whether two coefficients are equal, but not
for three. It may be possible that the t-test for β1 = β2 does not reject and the t-test
for β2 = β3 does not reject either, but the t-test for β1 = β3 does reject!
954 42. THREE PRINCIPLES FOR TESTING A LINEAR CONSTRAINT
Answer. In this problem, the “unconstrained” model for the purposes of testing is already
> β = 0. The “constrained” model has the additional
constrained, it is subject to the constraint ι
β1
.
constraint Rβ = 1 0 −1 0 · · · 0 .. = 0. In Problem 348 we computed the “uncon-
βk
strained” estimates β̂ = y − ιȳ and s2 = nȳ 2 = (y 1 + · · · + y n )2 /n. You are allowed to use this
without proving it again. Therefore Rβ̂ = y 1 − y 3 ; its variance is 2σ 2 , and the F test statistic
n(y −y )2
1 3
is 2(y +···+y 2 ∼ F1,1 . The “unconstrained” model had 4 parameters subject to one constraint,
1 n)
therefore it had 3 free parameters, i.e.,k = 3, n = 4, and j = 1.
Another important F -test is the “Chow test” named by its popularizer Chow
[Cho60]: it tests whether two regressions have equal coefficients (assuming that
the disturbance variances are equal). For this one has to run three regressions. If
the first regression has n1 observations and sum of squared error SSE 1 , and the
second regression n2 observations and SSE 2 , and the combined regression (i.e., the
42.2. EXAMPLES OF TESTS OF LINEAR HYPOTHESES 955
Answer.
u ι µ + ε1 ι1 o µ1 ε
(42.2.12) = 1 1 = + 1 .
v ι2 µ 2 + ε 2 o ι2 µ2 ε2
Answer.
> ι> o> ι1 o n1 0
(42.2.13) X X = 1> =
o ι>
2 o ι2 0 n2
1
> −1 n1
0
(42.2.14) (X X) = 1
0 n2
Answer.
Pn1
ι> o> u ui
(42.2.15) X>y = 1 = Pni=1
o> ι>
2 v 2
j=1
vj
1
Pn1
0 ui ū
(42.2.16) β̂ = (X > X)−1 X > y = n1
1
Pni=1
2 =
0 n2 j=1
vj v̄
One can also see this without matrix algebra. var[ū = σ 2 n1 , var[v̄ = σ 2 n1 , and since ū and v̄ are
1 2
independent, the variance of the difference is the sum of the variances.
Answer. The test statistic is ū − v̄ divided by its estimated standard deviation, i.e.,
ū − v̄
(42.2.22) q ∼ tn1 +n2 −2 when H is true.
1 1
s n1
+ n2
For the denominator in the t-statistic you need the s2 from the unconstrained regression, which
is
n
1 X
(42.2.23) s2 = (y j − ȳ)2
n−1
j=1
What happened to the (n + 1)st observation here? It always has a zero residual. And the factor
1/(n − 1) should really be written 1/(n + 1 − 2): there are n + 1 observations and 2 parameters.
Divide ȳ − y n+1 by its standard deviation and replace σ by s (the square root of s2 ) to get the
t statistic
ȳ − y n+1
(42.2.24) p 1
s 1+ n
• b. 2 points One can interpret this same formula also differently (and this is
why this test is sometimes called the “predictive” Chow test). Compute the Best
Linear Unbiased Predictor of y n+1 on the basis of the first n observations, call it
ˆ + 1)n+1 . Show that the predictive residual y ˆ
ŷ(n n+1 − ŷ(n + 1)n+1 , divided by the
ˆ
square root of MSE[ŷ(n + 1)n+1 ; y n+1 ], with σ replaced by s (based on the first n
observations only), is equal to the above t statistic.
Answer. BLUP of y n+1 based on first n observations is ȳ again. Since it is unbiased,
MSE[ȳ; y n+1 ] = var[ȳ − y n+1 ] = σ 2 (n + 1)/n. From now on everything is as in part a.
42.2. EXAMPLES OF TESTS OF LINEAR HYPOTHESES 961
• c. 6 points Next you should show that the above two formulas are identical
to the statistic based on comparing the SSEs of the constrained and unconstrained
models (the likelihood ratio principle). Give a formula for the constrained SSE r , the
unconstrained SSE u , and the F -statistic.
Answer. According to the Likelihood Ratio principle, one has to compare the residual sums of
squares in the regressions under the assumption that the mean did not change with that under the
assumption that the mean changed. If the mean did not change (constrained model), then ȳ¯ is the
OLS of µ. In order to make it easier to derive the difference between constrained and unconstrained
SSE, we will write the constrained SSE as follows:
If one allows the mean to change (unconstrained model), then ȳ is the BLUE of µ, and yn+1 is the
BLUE of ν.
n n
X X
SSE u = (yj − ȳ)2 + (yn+1 − yn+1 )2 = yj2 − nȳ 2 .
j=1 j=1
962 42. THREE PRINCIPLES FOR TESTING A LINEAR CONSTRAINT
Now subtract:
2 1
SSE r − SSE u = yn+1 + nȳ 2 − (nȳ + yn+1 )2
n+1
2 1
= yn+1 + nȳ 2 − (n2 ȳ 2 + 2nȳyn+1 + yn+1
2
)
n+1
1 n2 n
= (1 − )y 2 + (n − )ȳ 2 − 2ȳyn+1
n + 1 n+1 n+1 n+1
n
= (yn+1 − ȳ)2 .
n+1
Interestingly, this depends on the first n observations only through ȳ.
Since the unconstrained model has n + 1 observations and 2 parameters, the test statistic is
n
SSE r − SSE u (yn+1 − ȳ)2 (y − ȳ)2 n(n − 1)
(42.2.25) = Pnn+1 = Pn+1
n ∼ F1,n−1
SSE u /(n + 1 − 2) (y − ȳ)2 /(n − 1) (y − ȳ)2 (n + 1)
1 j 1 j
Problem 429. [Seb77, pp. 117–119] Given a regression model with k indepen-
dent variables. There are n observations of the vector of independent variables, and
for each of these n values there is not one but r > 1 different replicated observations
42.2. EXAMPLES OF TESTS OF LINEAR HYPOTHESES 963
This unconstrained model does not have enough information to estimate any of the
individual coefficients βmj . Explain how it is nevertheless still possible to compute
SSE u .
964 42. THREE PRINCIPLES FOR TESTING A LINEAR CONSTRAINT
Answer. Even though the individual coefficients βmj are not identified, their linear combina-
Pk
tion ηm = x> m βm = x β
j=1 mj mj
is identified; one unbiased estimator, although by far not the
best one, is any individual observation y mq . This linear combination is all one needs to compute
SSE u , the sum of squared errors in the unconstrained model.
• c. 1 point The sum of squared errors associated with this least squares estimate
is the unconstrained sum of squared errors SSE u . How would you set up a regression
with dummy variables which would give you this SSE u ?
42.2. EXAMPLES OF TESTS OF LINEAR HYPOTHESES 965
Answer. The unconstrained model should be regressed in the form y mq = ηm + εmq . I.e.,
string out the matrix Y as a vector and for each column of Y introduce a dummy variable which
is = 1 if the given observation was originally in this colum.
• d. 2 points Next turn to the constrained model (42.2.26). If X has full column
rank, then it is fully identified. Writing β̃ j for your estimates of βj , give a formula
for the sum of squared errors of this estimate. By taking the first order conditions,
show that the estimate β̂ is the same as the estimate in the model without replicated
observations
X k
(42.2.30) zm = xmj βj + εm ,
j=1
• f. 3 points Write down the formula of the F -test in terms of SSE u and SSE c
with a correct accounting of the degrees of freedom, and give this formula also in
terms of SSE u and SSE b .
966 42. THREE PRINCIPLES FOR TESTING A LINEAR CONSTRAINT
Answer. Unconstrained model has n parameters, and constrained model has k parameters;
the number of additional “constraints” is therefore n − k. This gives the F -statistic
(SSE c − SSE u )/(n − k) rSSE b /(n − k)
(42.2.31) =
SSE u /n(r − 1) SSE u /n(r − 1)
where i is the number of restrictions. In this exercise we are proving that the F -test
in the linear model is equivalent to the generalized likelihood ratio test. (You should
assume here that both β and σ 2 are unknown.) All this is in [Gre97, p. 304].
• a. 1 point Since we only have constraints on β and not on σ 2 , it makes sense
to first compute the concentrated likelihood function with σ 2 concentrated out. Derive
the formula for this concentrated likelihood function which is given in [Gre97, just
above (6.88)].
Answer.
n 1
(42.3.3) Concentrated log `(y; β) = − 1 + log 2π + log (y − Xβ)> (y − Xβ)
2 n
and test the null hypothesis that the coefficient of this predicted value is zero. If
Model 1 is right, then this additional regressor leaves all other estimators unbiased,
and the true coefficient of the additional regressor is 0. If Model 2 is right, then
asymptotically, this additional regressor should be the only regressor in the combined
model with a nonzero coefficient (its coefficient is = 1 asymptotically, and all the
other regressors should have coefficient zero.) Whenever nonnested hypotheses are
tested, is is possible that both hypotheses are rejected, or that neither hypothesis is
rejected by this criterion.
CHAPTER 43
Due to the isomorphism of tests and confidence intervals, we will keep this whole
discussion in terms of confidence intervals.
joint confidence region to have confidence level 95%, then the individual confidence
intervals must have a confidence level higher than 95%, i.e., they must be be wider.
There are two main approaches for compute the confidence levels of the indi-
vidual intervals, one very simple one which is widely applicable but which is only
approximate, and one more specialized one which is precise in some situations and
can be taken as an approximation in others.
43.1.1. Bonferroni Intervals. To derive the first method, the Bonferroni in-
tervals, assume you have individual confidence intervals Ri for parameter φi. Inorder
φ1
..
to make simultaneous inferences about the whole parameter vector φ = . you
φi
φ1
take the Cartesian product R1 ×R2 ×· · ·×Ri ; it is defined by ... ∈ R1 ×R2 ×· · ·×Ri
φi
if and only if φi ∈ Ri for all i.
Usually it is difficult to compute the precise confidence level of such a rectan-
gular set. If one cannot be precise, it is safer to understate the confidence level.
The following inequality from elementary probability theory, called the Bonferroni
43.1. RECTANGULAR CONFIDENCE REGIONS 973
inequality, gives a lower bound for the confidence level T of this Cartesian
P prod-
uct:T Given i events E i with Pr[E i ] = 1 − αi ; then Pr[ E i ] ≥ 1 − α i . Proof:
Pr[ Ei ] = 1 − Pr[ Ei0 ] ≥ 1 − Pr[Ei0 ]. The so-called Bonferroni bounds therefore
S P
have the individual
P levels 1 − α/i. Instead of γj = α/i one can also take any other
γi ≥ 0 with γi = α. For small α and small i this is an amazingly precise method.
Problem 432. Show that the correlation coefficient between ti and tj is ρij But
give a verbal argument that the ti are not independent, even if the ρij = 0, i.e. z i
are independent. (This means, one cannot get the quantiles of their maxima from
individual quantiles.)
Answer. First we have E[tj ] = E[z j ] E[ 1s ] = 0, since z j and s are independent. Therefore
(43.1.2)
1 1 1 σ2
cov[ti , tj ] = E[ti tj ] = E[E[ti tj ]|s] = E[E[ 2 z i z j |s]] = E[ 2 E[z i z j |s]] = E[ 2 E[z i z j ]] = E[ 2 ]ρij .
s s s s
2
In particular, var[ti ] = E[ σs2 ], and the statement follows.
If one needs only two joint confidence intervals, i.e., if i = 2, then there are only
two off-diagonal elements in the dispersion matrix, which must be equal by symmetry.
A 2 × 2 dispersion matrix is therefore always “equicorrelated.” The values of the
uα
2,n−k,ρ can therefore be used to compute simultaneous confidence intervals for any
two parameters in the regression model. For ρ one must use the actual correlation
coefficient between the OLS estimates of the respective parameters, which is known
precisely.
Problem 433. In the model y = Xβ + ε , with ε ∼ (o, σ 2 I), give a formula
for the correlation coefficient between g > β̂ and h> β̂, where g and h are arbitrary
constant vectors.
Answer. This is in Seber, [Seb77, equation (5.7) on p. 128].
p
(43.1.4) ρ = g > (X > X)−1 h/ (g > (X > X)−1 g)(h> (X > X)−1 h)
But in certain situations, those equicorrelated quantiles can also be applied for
testing more than two parameters. The most basic situation in which this is the
case is the following: you have n × m observations y ij = µi + εij , and the ε ij ∼
NID(0, σ 2 ). Then the equicorrelated t quantiles allow you to compute precise joint
confidence intervals for all µi . Define s2 = i,j (y ij − ȳ i· )2 /(n(m − 1)), and define z
P
976 43. MULTIPLE COMPARISONS IN THE LINEAR MODEL
√
by z i = (ȳ i· − µi ) m. These z i are normal with mean zero and dispersion matrix
σ 2 I, and they are independent of s2 . Therefore one gets confidence intervals
√
(43.1.5) µi ∈ ȳ i· ± uα
n,n(m−1),0 s/ m.
1
P
where ȳ ·· is the grand sample mean and µ its population counterpart. Since ȳ ·· = ȳ i· . one
n
1 σ2 1 2 1 n−1
obtains cov[ȳ i· , ȳ ·· ] = n
var[ȳ i· ] = mn
. Therefore var[ȳ i· − ȳ ·· ] = σ 2 m
− mn + mn = σ2 mn
.
And the correlation coefficient is 1/(n − 1).
σ2
Answer. Write z = 1
√ Aȳ, therefore V [z] = 2m
AA> where
2
(43.1.7)
1 −1 0 0 2 1 1 −1 −1 0
1 0 −1 0 1 2 1 1 0 −1
1 0 0 −1 1 1
1 1 2 0 1 1
A= V [z] = AA> =
0 1 −1 0 2m 2m −1 1 0 2 1 −1
−1
0 1 0
−1 0 1 1 2 1
0 0 1 −1 0 −1 1 −1 1 2
What are situations in which one would want to obtain a F -confidence region in
order to get information about many different linear combinations of the parameters
at the same time?
For instance, one examines a regression output and looks at all parameters and
computes linear combinations of parameters of interest, and believes they are sig-
nificant if their t-tests reject. This whole procedure is sometimes considered as a
misuse of statistics, “data-snooping,” but Scheffé argued it was justified if one raises
the significance level to that of the F test implied by the infinitely many t tests of
all linear combinations of β.
Or one looks at only certain kinds of linear combinations, for instance, at all
contrasts, i.e., linear combinations whose coefficients sum to zero. This is a very
thorough way to ascertain that all parameters are equal.
Or if one wants to draw a confidence band around the whole regression line.
Problem 436. Someone fits a regression with 18 observations, one explanatory
variable and a constant term, and then draws around each point of the regression line
a standard 95% t interval. What is the probability that the band created in this way
covers the true regression line over its entire length? Note: the Splus commands
qf(1-alpha,df1,df2) and qt(1-alpha/2,df) give quantiles, and the commands
pf(critval,df1,df2) and pt(critval,df) give the cumulative distribution func-
tion of F and t distributions.
980 43. MULTIPLE COMPARISONS IN THE LINEAR MODEL
Problem 437. 6 points Which options do you have if you want to test more
than one hypothesis at the same time? Describe situations in which one F -test is
better than two t-tests (i.e., in which an elliptical confidence region is better than
a rectangular one). Are there also situations in which you might want two t-tests
instead of one F -test?
In the one-dimensional case this confidence region is identical to the t-interval.
But if one draws for i = 2 the confidence ellipse generated by the F -test and the two
intervals generated by the t-tests into the same diagram, one obtains the picture as
in figure 5.1 of Seber [Seb77], p. 131. In terms of hypothesis testing this means:
there are values for which the F test does not reject but one or both t tests reject,
and there are values for which one or both t-tests fail to reject but the F -test rejects.
The reason for this confusing situation is that one should not compare t tests and F
43.2. RELATION BETWEEN F-TEST AND T-TESTS. 981
tests at the same confidence level. The relationship between those testing procedures
becomes clear if one compares the F test at a given confidence level to t tests at a
certain higher confidence level.
We need the following math for this. For a positive definite Ψ and arbitrary x
it follows from (A.5.6) that
(g > x)2
(43.2.4) x> Ψ−1 x = max .
g: g6=o g > Ψg
Now the maximum of a set is smaller or equal to iF(i,n−q;α) s2 if and only if each
element of this set is smaller or equal. Therefore the F -confidence region (41.4.3)
982 43. MULTIPLE COMPARISONS IN THE LINEAR MODEL
It is sufficient to take the intersection over all g with unit length. What does each
of these regions intersected look like? First note that the i × 1 vector u lies in that
region if and only if g > u lies p
in a t-interval for g > Rβ, whose confidence level is no
longer α but is γ = Pr[|t| ≤ iF(i,n−q;α) ], where t is distributed as a t with n − q
degrees of freedom. Geometrically, in Seber [Seb77]’s figure 5.1, these confidence
regions can be represented by all the bands tangent to the ellipse.
43.3. LARGE-SAMPLE SIMULTANEOUS CONFIDENCE REGIONS 983
Taking only the vertical and the horizontal band tangent to the ellipse, one has
now the following picture: if one of the t-tests rejects, then the F -test rejects too.
But it may be possible that the F -test rejects but neither of the two t-tests rejects.
In this case, there must be some other linear combination of the two variables for
which the t test rejects.
Another example for simultaneous t-tests, this time derived from Hotelling’s T 2 ,
is given in Johnson and Wichern [JW88, chapter 5]. It is very similar to the above;
we will do here only the large-sample development:
q p
(43.3.1) g> µ ∈ g> y ± χ2q (α) g >Σ g
(α)
where χ2q is the upper α-quantile of the χ2 distribution with q degrees of freedom,
(α)
i.e., Pr[χ2q ≥ χ2q ] = α.
Proof: For those g with var[g > y] = 0, i.e., g >Σ g = 0, the confidence interval
has 100 percent coverage probability (despite its zero length); therefore we only have
43.3. LARGE-SAMPLE SIMULTANEOUS CONFIDENCE REGIONS 985
there are 355 respondents in these five categories. It is assumed that the rows of Y
are independent, which presupposes sampling with replacement, i.e., the sampling is
done in such a way that theoretically the same people might be asked twice (or the
sample is small compared with the population). The probability distribution of each
of these rows, say here the ith row, is the multinomial distribution whose parameters
form the p-vector p (of nonnegative elements adding up to 1). Its means, variances,
and covariances can be computed according to the rules for discrete distributions:
(43.3.7) E[y ij ] = 1(pj ) + 0(1 − pj ) = pj
(43.3.8) var[y ij ] = E[y 2ij ] − (E[y ij ])2 = pj − p2j = pj (1 − pj ) because y 2ij = y ij
(43.3.9)
cov[y ij , y ik ] = E[y ij y ik ] − E[y ij ] E[y ik ] = −pj pk because y ij y ik = 0
The pi can be estimated by the ith sample means. From these sample means one
also obtains an estimate S of the dispersion matrix of the rows of Y . This estimate
is singular (as is the true dispersion matrix), it has rank r − 1, since every row of the
Y -matrix adds up to 1. Provided n − r is large, which means here that np̂k ≥ 20
for each category k, one can use the normal asymptotics, and gets as simultaneous
confidence interval for all linear combinations
r
g > Sg
q
> > 2 (α)
(43.3.10) g p ∈ g p̂ ± χr−1
n
43.3. LARGE-SAMPLE SIMULTANEOUS CONFIDENCE REGIONS 987
Table 1 is the output of a SAS run. The dependent variable is the y variable,
here it has the name wagerate. “Analysis” is the same as “decomposition,” and
“variance” is here the sample variance or, say better, the sum of squares.
Pn “Analysis of
variance” is a decomposition of the “corrected total” sum of squares j=1 (yj − ȳ)2 =
Pn
8321.91046 into its “explained” part j=1 (ŷj − ȳ)2 = 1553.90611, the sum of squares
whose “source”
Pn is the “model,” and its “unexplained” part, the sum of squared
“errors” j=1 (yj − ŷj )2 , which add up to 6768.00436 here. The “degrees of freedom”
are a dimensionality constant; the d.f. of the corrected total sum of squares (SST)
is the number of observations minus 1, while the d.f. of the SSE is the number of
observations minus the number of parameters (intercept and slope parameters) in the
989
990 44. SAMPLE SAS REGRESSION OUTPUT
regression. The d.f. of the sum of squares due to the model consists in the number
of slope parameters (not counting the intercept) in the model.
The “mean squares” are the corresponding sum of squares divided by their de-
grees of freedom. This “mean sum of squares due to error” should not be con-
fused with the “mean squared error” of an estimator θ̂ of θ, defined as MSE[θ̂; θ] =
E[(θ̂ − θ)2 ]. One can think of the mean sum of squares due to error as the sample
analog of the MSE[ŷ; y]; it is as the same time an unbiased estimate of the distur-
bance variance σ 2 . The mean sum of squares explained by the model is an unbiased
estimate of σ 2 if all slope coefficients are zero, and is larger otherwise. The F value
is the mean sum of squares explained by the model divided by the mean sume of
squares due to error, this is the value of the test statistic for the F -test that all
slope parameters are zero. The p-value prob > F gives the probability of getting an
even larger F -value when the null hypothesis is true, i.e., when all parameters are
indeed zero. To reject the null hypothesis at significance level α, this p-value must
be smaller than α.
The root mse is the square root of the mean sum of squares due to error, it is an
estimate of σ. The dependent mean is simply ȳ. The “coefficient of variation” (c.v.)
is 100 times root mse divided by dependent mean. The R-square is ss(model)
divided by ss(c. total), and the adjusted R-square is R̄2 = 1 − SSE/(n−k) SST /(n−1) .
44. SAMPLE SAS REGRESSION OUTPUT 991
For every parameter, including the intercept, the estimated value is printed, and
next to it the estimate of its standard deviation. The next column has the t-value,
which is the estimated value divided by its estimated standard deviation. This is
the test statistic for the null hypothesis that this parameter is zero. The prob>|t|
value indicates the significance for the two-sided test.
Problem 438. What does the c stand for in c total in Table 1?
Problem 439. 4 points Using the sample SAS regression output in Table 1,
test at the 5% significance level that the coefficient of gender is −1.0, against the
alternative that it is < −1.0.
Answer. -1.73266849-(-1)=-.73266849 must be divided by 0.41844140, which gives -1.7509465
and then we must look it up in the t-table or, since there are so many observations, the normal ta-
ble. It is a one-sided test, therefore the critical value is -1.645, therefore reject the null hypothesis.
Problem 440. Here is part of the analysis of variance table printed out by the
SAS regression procedure:
√ If you don’t have a calculator, simply give the answers in
expressions like 1529.68955/7 etc.
• e. 3 points Make an F test of the null hypothesis that the slope parameters
of all the explanatory variables are zero, at the 5% significance level. The observed
Answer. Pr[F4,∞ > 3.32] = 0.01; Pr[F4,∞ > 2.37] = 0.05. The observed F value is
382.42/12.372=30.910, which is significant up to the level 0.0001, therefore reject H0 .
• f. Here is the printout of the analysis of variance table after additional explana-
tory variables were included in the above regression (i.e., the dependent variable is
the same, and the set of explanatory variables contains all variables used in the above
regression, plus some additional ones).
44. SAMPLE SAS REGRESSION OUTPUT 993
had slope coefficients zero. The value of the test statistic is (show how you
sum of mean
source df squares square F value prob>F
parameter estimates
parameter standard t for H0:
variable df estimate error parameter=0 prob>|t|
SUM OF MEAN
SOURCE DF SQUARES SQUARE
So far we have assumed that the mean of the dependent variable is a linear func-
tion of the explanatory variables. In this chaper, this assumption will be relaxed. We
first discuss the case where the explanatory variables are categorical variables. For
categorical variables (gender, nationality, occupations, etc.), the concept of linearity
does not make sense, and indeed, it is customary to fit arbitrary numerical functions
of these categorical variables. One can do this also if one has numerical variables
which assume only a limited number of values (such as the number of people in a
household). As long as there are repeated observations for each level of these vari-
ables, it is possible to introduce a different dummy variables for every level, and in
this way also allow arbitrary functions. Linear restrictions between the coefficients
997
998 45. FLEXIBLE FUNCTIONAL FORM
• b. 1 point What would the estimated equation have been if, instead of wt ,
they had used a variable pt with the values pt = 0 during the war years, and pt = 1
otherwise? (Hint: the coefficient for pt will be negative, because the intercept in peace
times is below the intercept in war times).
Answer. Now the intercept of the whole equation is the intercept of the war regression line,
which is .36, and the coefficient of pt is the difference between peace and war intercepts, which is
1000 45. FLEXIBLE FUNCTIONAL FORM
-.23.
(45.1.2) bt = .36 + .068yt − .23pt + ε̂t .
• c. 1 point What would the estimated equation have been if they had thrown in
both wt and pt , but left out the intercept term?
Answer. Now the coefficient of wt is the intercept in the war years, which is .36, and the
coefficient of pt is the intercept in the peace years, which is .13.
(45.1.3) bt = .36wt + .13pt + .068yt + ε̂t ?
• d. 2 points What would the estimated equation have been, if bond sales and
income had been measured in millions of dollars instead of billions of dollars? (1
billion = 1000 million.)
Answer. From bt = 0.13 + .068yt + 0.23wt + ε̂t follows 1000bt = 130 + .068 · 1000yt + 230wt +
1000ε̂t , or
(m) (m) (m)
(45.1.4) bt = 130 + .068yt + 230wt + ε̂t ,
(m) (m) (m)
where bt is bond sales in millions (i.e., bt = 1000bt ), and yt is national income in millions
(m)
(i.e., yt = 1000yt ).
45.1. CATEGORICAL VARIABLES: REGRESSION WITH DUMMIES AND FACTORS 1001
There are various ways to set it up. Threshold effects might be represented by
the following dummies:
ι o o o
ι ι o o
(45.1.5)
ι ι ι o
ι ι ι ι
In the example in Problem 441, the slope of the numerical variables does not
change with the levels of the categorical variables, in other words, there is no in-
teraction between those variables, but each variable makes a separate contribution
to the response variable. The presence of interaction can be modeled by including
products of the dummy variables with the response variable with whom interaction
exists.
How do you know the interpretation of the coefficients of a given set of dummies?
Write the equation for every category separately. E.g. [Gre97, p. 383]: Winter
1002 45. FLEXIBLE FUNCTIONAL FORM
To fix notation, assume for now that only one explanatory variable x is given and
you want to estimate the model y = f (x) + ε with the usual assumption ε ∼ o, σ 2 I.
But whereas the regression model specified that f is an affine function, we allow f
to be an element of an appropriate larger function space. The size of this space is
characterized by a so-called smoothing parameter.
For higher degree polynomials don’t use the “power basis” 1, x, x2 , . . . , xm−1 , but
there are two reasonable choices. Either one can use Legendre polynomials [Eub88,
(3.10) and (3.11) on p. 54], which are obtained from the power basis by Gram-
Schmidt orthonormalization over the interval [a, b]. This does not make the design
matrix orthogonal, but at least one should expect it not to be too ill-conditioned,
and the roots and the general shape of Legendre polynomials is well-understood. As
the second main choice one may also select polynomials that make the design-matrix
itself exactly orthonormal. The Splus-function poly does that.
The jth Legendre polynomial has exactly j real roots in the interval [Dav75,
Chapter X], [Sze59, Chapter III]. The orthogonal polynomials probably have a sim-
ilar property. This gives another justification for using polynomial regession, which
is similar to the justification one sometimes reads for using Fourier-series: The data
have high-frequency and low-frequency components, and one wants to filter out the
low-frequency components.
In practice, polynomials do not always give a good fit. There are better alterna-
tives available, which will be discussed in turn.
[DM93, p. 484] have a plot with the curves for λ = 1.5, 1, 0.5, 0, −0.5, and −1.
They point out some serious disadvantage of this transformation: if λ 6= 0, B(x, λ)
is bounded eihter from below or above. For λ < 0, B(x, λ) cannot be greater than
−1/λ, and for λ > 0, it cannot be less than −1/λ.
About the Box-Cox transformation read [Gre97, 10.4]
of bandwidth, see the plots on p. 21: using our data we can do plots of the sort
plot(locfit(r~year,data=uslt,alpha=0.1,deg=3),get.data=T) and then vary
alpha and deg.
Problem 443. What kind of smoothing would be best for the time series of the
variable r (profit rate) in dataset uslt?
Problem 444. Locally constant smooths are not good at the edges, and also not
at the maxima and minima of the data. Why not?
The kernel estimator can be considered a local fit of a constant. Straight lines
are better, and cubic parabolas even better. Quadratic ones not as good.
The birth rate data which require smoothing with a varying bandwidth are
interesting, see Simonoff p. 157, description in the text on p. 158.
45.2.4. Regression Splines. About the word “spline,” [Wah90, p. vii] writes:
“The mechanical spline is a thin reedlike strip that was used to draw curves needed
in the fabrication of cross sections of ships’ hulls. Ducks or weights were placed on
the strip to force it to go through given points, and the free portion of the strip
would assume a position in space that minimized the bending energy.”
One of the drawbacks of polynomial regression is that its fit is global. One
method to provide for local fits is to fit a piecewise polynomial. A spline is a piecewise
45.2. FLEXIBLE FUNCTIONAL FORM FOR NUMERICAL VARIABLES 1007
For m = 4, cubic splines, arrange the knots so that they are close to inflexion
points in the data and not more than one extreme point (maximum or minimum)
and one inflection point occurs between any two knots.
It is also possible to determine the number of knots and select their location so
as to optimize the fit. But this is a hairy minimization problem; [Eub88, p. 362]
gives some shortcuts.
Extensions: Sometime one wants knots which are not so smooth, this can be
obtained by letting several knots coincide. Or one wants polynomials of different
degrees in the different segments.
[Gre97, pp. 389/90] has a nice example for a linear spline. Each of 3 different
age groups has a different slope and a different intercept: t < t∗ , t∗ ≤ t < t∗∗ , and
t∗∗ ≤ t. These age groups are coded by the matrix D h consistingi of two dummy
(1)
variables, one for t ≥ t∗ and one for t ≥ t∗∗ . I.e, D = d(1) d(2) where dj = 1
(2)
if age tj ≥ t∗ and dj = 1 if tj ≥ t∗∗ . Throwing D into the regression allows for
different intercepts in these different age groups.
In order to allow for the slopes with respect to t to vary too, we need a matrix
E, again consisting of 2 columns, so that ej1 = tj if tj ≥ t∗ and 0 otherwise; and
ej2 = tj if if tj ≥ t∗∗ , and 0 otherwise. Each column h of E is the corresponding
i
column of D element-wise multiplied with t, i.e., E = d(1) ∗ t d(2) ∗ t .
45.2. FLEXIBLE FUNCTIONAL FORM FOR NUMERICAL VARIABLES 1009
An observation at the year t∗ has, according to the formula for ≥ t∗ , the form
(45.2.5) y ∗ = β1 + β2 t∗ + β3 x∗ + γ1 + δ1 t∗ + ε∗
but had the formula for < t∗ still applied, the equation would have been
(45.2.6) y ∗ = β1 + β2 t∗ + β3 x∗ + ε∗
For these two equations to be equal, which means that the two regression lines
intersect at x∗ , we have to impose the constraint γ1 + δ1 t∗ = 0
Similarly, an observation at the year t∗∗ has, according to the formula for ≥ t∗∗ ,
the form
but had the formula for < t∗∗ still applied, it would have been
k regressors have the form fi (Z) where the functions fi are linearly independent.
>
For instance f1 (Z) = z11 z21 . . . may pick out the first column of Z, and
2 2
>
f2 (Z) = z11 z21 . . . the square of the first column. The functions g and fi
define the relationship between the given economic variables and the variables in the
regression. [Gre97, Definition 81 on p. 396] says something about the relationship
between the parameters of interest and the regression coefficients: if the k regression
coefficients β1 , . . . , βk can be written as k one-to-one possibly nonlinear functions of
the k underlying parameters θ1 , . . . , θk , then the model is intrinsically linear in θ.
[Gre97, p. 391/2] brings the example of a regression with an interaction term:
maximum likelihood is far better. Greene asks why and answers: least squares does
not use one of the sufficient statistics.
[Gre97, example 8.5 on p. 397/8] starts with a CES production function, then
makes a Taylor development, and this Taylor development is an intrinsically linear
regression of the 4 parameters involved. Greene computes the Jacobian matrix nec-
essary to get the variances. He compares that with doing nonlinear least squares on
the production function directly, and gets widely divergent parameter estimates.
45.2.5. Smoothing Splines. This seems the most promising approach. If one
estimates a function by a polynomial of order m or degree m − 1, then this means
that one sets the mth derivative zero. An approximation to a polynomial would
be a function whose mth derivative is small. We will no longer assume that the
fitting functions are themselves polynomials, but we will assume that f ∈ W m [a, b]
which means f itself and its derivatives up to and including the m − 1st derivative
are absolutely continuous over a closed and bounded interval [a, b], and the mth
derivative is square integrable over [a, b].
If we allow such a general f , then the estimation criterion can no longer be
the minimization of the sum of squared errors, because in this case one could simply
choose an interpolant of the data, i.e., a f which satisfies f (xi ) = y i for all i. Instead,
the estimation criterion must be a constrained or penalized least squares criterion
(analogous to OLS with an exact or random linear constraint) which has a penalty for
1014 45. FLEXIBLE FUNCTIONAL FORM
the mth order derivative. The idea of smoothing splines is to minimize the objective
function
Z b
> 2
f (m) (x) dx
(45.2.12) y − f (x) y − f (x) + λ
a
Of course, only the values which f takes on the observed xi are relevant; but for
each sequence of observations there is one polynomial which minimizes this objective
function, and this is a natural spline with the observed values as breakpoints.
45.2.6. Local regression, Kernel Operators. A different approach is to run
locally weighted regressions. Here the response surface at a given value of the in-
dependent variable is estimated by a linear regressions which only encompasses the
points in the neighborhood of the independent variable. Splus command loess.
If this local regression only has an intercept, it is also known as a “kernel
smoother.” But locally linear smoothers perform better at the borders of the sample
than locally constant ones.
But in many real-life situations one has to do with several additive effects without
interaction. This is much easier to estimate and to interpret.
One procedure here is “projection pursuit regression” [FS81]. Denoting the ith
row of X with xi , the model is
k
X
(45.3.1) yi = fj (α>
j xi ) + ε i ,
j=1
Here one estimates k arbitrary functions of certain linear combinations of the ex-
planatory variables α>j xi along with the linear combinations themselves. This is
implemented in Splus in the function ppreg.
The matter will be easier if one already knows that the columns of the X-matrix
are the relevant variables and only their transformation has to be estimated. This
gives the additive model
k
X
(45.3.3) y= fj (xj ) + ε ,
j=1
1016 45. FLEXIBLE FUNCTIONAL FORM
where xj is the jth column of X. The beauty is that one can specify here different
univariate smoothing techniques for the individual variables and then combine it all
into a joint fit by the method of back-substitution. Back-substitution is an iterative
procedure by which one obtains the joint fit by an iteration only involving fits on one
(0)
explanatory variable each. One starts with some initial set of functions P fi and then,
cycling through j = 1, . . . , k, 1, . . . , k, . . . one fits the residual y − k6=j fk (xk ) as a
function of xj . This looks like a crude heuristic device, but it has a deep theoretical
justification.
If the fitting procedure, with respect to the jth explanatory variable, can be
written as ŷ j = S j xj (but the more common notation is to write it f j = S j xj ),
then this backfitting works because the joint fit y = f 1 + · · · + f k + ε̂ is a solution
of the equation
I S1 S1 · · · S1 f1 S1y
S 2 I S 2 · · · S 2 f 2 S 2 y
S 3 S 3 I · · · S 3 f 3 S 3 y
(45.3.4) = ,
.. .. .. .. .. .. ..
. . . . . . .
Sk Sk Sk · · · I fk Sky
and the iteration is known as a numerical iteration procedure to solve this system of
equations, called the Gauss-Seidel algorithm.
45.3. MORE THAN ONE EXPLANATORY VARIABLE: BACKFITTING 1017
1019
1020 46. TRANSFORMATION OF THE RESPONSE VARIABLE
46.1.1. ACE with one response and just one predictor variable. As-
sume x and y are two random variables (but they may also be categorical variables
or random vectors). Their maximal correlation corr∗ [x, y] is the maximal value of
corr[φ(x), θ(y)], where φ and θ are two real-valued mappings of the space in which
x and y are defined, with 0 < var[φ(x)] < ∞ and 0 < var[θ(y)] < ∞. The maximal
correlation has the following three properties:
• 0 ≤ corr∗ [x, y] ≤ 1. (Note that the usual correlation coefficient is between
−1 and 1.)
46.1. ALTERNATING LEAST SQUARES AND ALTERNATING CONDITIONAL EXPECTATIONS
1021
of each of the two categories so that the resulting real-valued random variables have
maximal correlation? This can be solved by an eigenvalue problem. This is discussed
in [KS79, sections 33.47–49].
How can one find the optimal θ and ψ in the continuous case? Since correlation
coefficients are invariant under affine transformations, such optimal transformations
are unique only up to a constant coefficient and an intercept. Here without proof the
following procedure, called “alternative conditional expectations:” let φ1 and θ1 be
the identical functions φ1 (x) = x and θ1 (y) = y. Then do recursively for i = 2, . . .
the following: φi (x) = E[θi−1 (y)|x] and θi (y) = E[φi (x)|y]. Remember that E[y|x]
is a function of x, and this function will be φ2 (x). In order to prevent this recursion
to become an increasingly steep or flat line, one does not exactly use this recursion
but rescales one of the variables, say θ, after each step so that it has zero mean and
unit variance.
46.1.2. ACE with more than 2 Variables. How can that be generalized to
a multivariate situation? Let us look at the case where y remains a scalar but x
is a k-vector. One can immediately speak of their maximal correlation again if one
maximizes over functions φ of one variable and θ of k variables. In the case of joint
normality, the above result generalizes to the following: the optimal φ can be chosen
to be the identity, and the optimal θ is linear; it can be cvhosen to be the best linear
predictor.
46.1. ALTERNATING LEAST SQUARES AND ALTERNATING CONDITIONAL EXPECTATIONS
1023
In the case of several variables, one can also ask for second-best and third-best
etc. solutions, which are required to be uncorrelated with the better solutions and
maximize the correlation subject to this constraint. They can in principle already be
defined if both variables are univariate, but in this case they are usually just simple
polynomials in the best solutions. In the multivariate case, these next-best solutions
may be of interest of their own. Not only the optimal but also these next-best
transformations give rise to linear regressions (Buja and Kass, Comment to [BF85],
p. 602).
46.1.3. Restrictions of the Functions over which to Maximize. If one
looks at several variables, this procedure of maximizing the correlation is also inter-
esting if one restricts the classes of functions to maximize.
The linear (or, to be more precise, affine) counterpart of maximal correlation
is already known to us. The best linear predictor can be characterized as that
linear combination of the components of x which has maximal correlation with y.
The maximum value of this correlation coefficient is called the multiple correlation
coefficient.
An in-between step between the set of all functions and the set of all linear
functions is the one realized in the ace procedure in Splus. It uses all those functions
of k variables with can be written as linear combinations or, without loss of generality,
as sums of functions of one variable each. Therefore one wants functions φ1 , . . . , φk
1024 46. TRANSFORMATION OF THE RESPONSE VARIABLE
and θ which maximize the correlation of φ1 (x1 ) + · · · + φk (xk ) with θ(y). This can
again be done by “backfitting,” which is a simple recursive algorithm using only
bivariate conditional expectations at every step. Each step does the following: for
φ1 , . . . , φi−1 , φi+1 , · · · , φk and θ one gets the best estimate
the given best estimates of P
of φi as φi (xi ) = E[θ(y) − j : j6=i φj (xj )|xi ].
If one does not know the joint distributions but has samples then one can replace
the conditonal expectations by a “Smoother” using the datapoints. One such pro-
cedure is the function supsmu in Splus, described in [HT90, p. 70]. Functions will
not be given in closed form but one gets their graph by plotting the untransformed
against the transformed variables.
46.1.4. Cautions About the ace Procedure. There are certain features
which one should be aware of before using this procedure.
First, this is a procedure which treats both variables symmetrically. The regres-
sion model between the variables is not a fixed point. If the variables satisfy the
regression specification y = β > x + ε with ε independent of the vector x, then the
optimal transformations will not be the simple multiples of the components of β,
although they will usually be close to them. This symmetry makes ace more appro-
priate for general multivariate models, like correlation analysis, than for regression.
The avas procedure, which will be discussed next, is a modification of ace which
seems to work better for regression.
46.1. ALTERNATING LEAST SQUARES AND ALTERNATING CONDITIONAL EXPECTATIONS
1025
Secondly, there are situations in which the functions of x and y which have
highest correlation are not very interesting functions.
Here is an example in which the function with the highest correlation may be
uninteresting. If y and one of the xi change sign together, one gets correlation of 1
by predicting the sign of y by the sign of xi and ignoring all other components of x.
Here is another example in which the function with the highest correlation may
>
be uninteresting. Let x y be a mixture consisting of
h i>
> x0 y 0 with probability 1 − α
(46.1.1) x y = h i>
x00 y 00
with probability α
where x0 and y 0 are independent random variables which have density functions,
while x00 and y 00 are discrete, and let Dx00 and Dy00 be the supports of them, i.e., the
finite sets of values which these variables are able to assume. One would expect the
maximal correlation to converge toward zero if α → 0, but in reality, as long as α > 0
the maximum correlation is always equal to one, even if x00 and y 00 are independent
of each other. The functions which achieve this are the indicator functions φ = I[x ∈
Dx00 ] and θ = I[y ∈ Dy00 ]. In other words, the functions which have the highest
correlations may be uninteresting. But in this case it is clear that one should also
look for the second and third choices. This is one of the remedies proposed in [BF85].
1026 46. TRANSFORMATION OF THE RESPONSE VARIABLE
Another potential source of trouble is that the optimal functions are not always
uniquely determined. Or sometimes, the eigenvalues of optimal and next-best so-
lutions cross each other, i.e., in a continuous modification of the data one will get
abrupt changes in the optimal transformations. All this is alleviated if one not only
looks at the optimal functions, but also the second-best solutions.
An automated procedure such as ace may lead to strange results due to errors
in setting up the problem, errors which one would easily catch if one had to do it
by hand. This is not a particularity of ace itself but a danger of any automated
procedure. (E.g., people run regressions without looking at the data.) The example
Prebibon and Vardi in their comment to [BF85] on p. 600 is interesting: If the
plot consists of two parallel regression lines, one would, if one did it by hand, never
dream of applying a transformation, but one would look for the additional variable
distinguishing the two regimes. An automatic application of ace gives a zig-zag line,
see figure 2 on p. 600.
Of course, ace makes all significance levels in the ensuing regression invalid.
Tradeoff between parametric and nonparametric methods.
46.2. ADDITIVITY AND VARIANCE STABILIZING TRANSFORMATIONS (AVAS) 1027
We want to show that this transformation indeed stabilizes the variance. Let us
first see how one can (asymptotically) obtain the variance of h(z): let u = E[z],
make Taylor development of h around u: h(z) = h(u) + h0 (u)(z − u), therefore
asymptotically var[h(z)] = (h0 (u))2 var[z].
To apply this for our purposes, pick a certain value of u. Make a Taylor devel-
opment of θi (y) = hi (θi−1 (y)) around E[θi−1 (y)|φi (x) = u] = u, which reads θi (y) =
hi (u)+h0i (u)(θi−1 (y)−u). Therefore var[θi (y)|φi (x) = u] = (h0i (u))2 var[θi−1 (y)|φi (x) =
u] = vi 1(u) vi (u) = 1. This asymptotic expression for the variance is independent of
the u chosen.
In this procedure, therefore, only the transformations of the independent vari-
ables are designed to achieve linearity. Those of the dependent variables are designed
46.3. COMPARING ACE AND AVAS 1029
to equalize the variance. This is a rule of thumb one should always consider in se-
lecting transformations: It makes sense to use transformations of the y axis to get
homoskedasticity, and then transformations of the x axis for straightening out the
regression line.
In the case of several predictor variables, the same “backfitting” procedure is
used which ace uses.
Again this is not exactly the iterative procedure chosen; in order to avoid am-
biguity in the result, the avas procedure normalizes at each step the function θ so
that it has zero mean and unit variance.
Density Estimation
2
and the expected value of its square is the MSE at u E fˆ(u) − f (u) . This is a
h = 3.491σn−1/3
This is often used also for non-Normal distributions, but if these distributions are
bimodal, then one needs narrower bins. The R/S-function dpih (which stands for Di-
rect Plug In Histogram) in the library KernSmooth uses more sophisticated methods
to select an optimal bin width.
Also the anchor positions can have a big impact on the appearance of a histogram.
To demonstrate this, cd /usr/share/ecmet/xlispstat/anchor-position then do
xlispstat, then (load "fde"), then (fde-demo), and pick animate anchor-moving.
Regarding the labeling on the vertical axis of a histogram there is a naive and a
more sophisticated approach. The naive approach gives the number of data points in
each bin. The preferred, more sophisticated approach is to divide the total number
of points in each bin by the overall size of the dataset and by the bin width. In
this way one gets the relative frequency density. With this normalization, the total
area under the histogram is 1 and the histogram is directly comparable with other
estimates of the probability density function.
1034 47. DENSITY ESTIMATION
>
Problem 446. If u 7→ k(u) is the kernel, and x = x1 · · · xn the data
Pn
vector, then fˆ(u) = n1 i=1 k(u − xi ) is the kernel estimate of the density at u.
• a. 3 points Compute the mean of the kernel estimator at u.
Pn
Answer. E[fˆ(u)] = n
1
i=1
E[k(u − xi )] but since all xi are assumed to come from the same
R +∞
distribution, it follows E[fˆ(u)] = E[k(u − x)] = k(u − x)f (x) dx.
x=−∞
Orthogonal Series Methods: project the data on an orthogonal base and only
use the first few terms. Advantage: here one actually knows the functional form of
the estimated density. See [BA97, pp. 19–21].
Problem 447. Write a function that translates the latitude and longitude data
of the magrem dataset into a 3-dimensional dataset which can be loaded into xgobi.
theoretical QQ-plot is a step function; and if both distribution functions are step
functions, then the theoretical QQ-plot consists of isolated points.
Here is a practical instruction how to construct a QQ plot from the given cumu-
lative distribution functions: Draw the cumulative distribution functions of the two
distributions which you want to compare into the same diagram. Then, for every
value p between 0 and 1 plot the abscisse of the intersection of the horizontal line
with height p with the first cumulative distribution function against the abscisse of
its intersection with the second. If there are horizontal line segments in these dis-
tribution functions, then the suprema of these line segments should be used. If the
cumulative distribution functions is a step function stepping over p, then the value
at which the step occurs should be used.
If the QQ-plot is a straight line, then the two distributions are either identical, or
the underlying random variables differ only by a scale factor. The plots have special
sensitivity regarding differences in the tail areas of the two distributions.
Problem 448. Let F1 be the cumulative distribution function of random variable
x1 , and F2 that of the variable x2 whose distribution is the same as that of αx1 , where
α is a positive constant. Show that the theoretical QQ plot of these two distributions
is contained in the straight line q2 = αq1 .
Answer. (x1 , x2 ) ∈ QQ-plot ⇐⇒ a p exists with x1 = F1−1 (p) = inf{u : Pr[x1 ≤ u] ≥ p}
and x2 = F2−1 (p) = inf{u : Pr[x2 ≤ u] ≥ p} = inf{u : Pr[αx1 ≤ u] ≥ p}. Write v = u/α, i.e.,
1040 47. DENSITY ESTIMATION
In other words, if one makes a QQ plot of a normal with mean zero and variance 2
on the vertical axis against a normal with mean zero and variance 1 on the horizontal
axis, one gets a straight line with slope 2. This makes such plots so valuable, since
visual inspection can easily discriminate whether a curve is a straight line or not. To
repeat, QQ plots have the great advantage that one only needs to know the correct
distribution up to a scale factor!
QQ-plots can not only be used to compare two probability measures, but an
important application is to decide whether a given sample comes from a given distri-
bution by plotting the quantile function of the empirical distribution of the sample,
compare (3.4.17). against the quantile function of the given cumulative distribution
function. Since empirical cumulative distribution functions and quantile functions
are step functions, the resulting theoretical QQ plot is also a step function.
In order to make it easier to compare this QQ plot with a straight line, one
usually does not draw the full step function but one chooses one point on the face of
each step, so that the plot contains one point per observation. This is like plotting
the given sample against a very regular sample from the given distribution. Where
on the face of each step should one choose these points? One wants to choose that
47.10. QUANTILE-QUANTILE PLOTS 1041
ordinate where the first step in an empirical cumulative distribution function should
usually be.
It is a mathematically complicated problem to compute for instance the “usual
location” (say, the “expected value”) of the smallest of 50 normally distributed vari-
ables. But there is one simple method which gives roughly the right locations inde-
pendently of the distribution used. Draw the cumulative distribution function (cdf)
which you want to test against, and then draw between the zero line and the line
p = 1 n parallel lines which divide the unit strip into n + 1 equidistant strips. The
intersection points of these n lines with the cdf will roughly give the locations where
the smallest, second smallest, etc., of a sample of n normally distributed observations
should be found.
For a mathematical justification of this, make the following thought experiment.
Assume you have n observations from a uniform distribution on the unit interval.
Where should you expect the smallest observation to be? The answer is given by the
simple result that the expected value of the smallest observation is 1/(n + 1), the
expected value of the second-smallest observation is 2/(n + 1), etc. In other words,
in the average, the n observations, cut the unit interval into n + 1 intervals of equal
distance.
Therefore we do know where the first step of an empirical cumulative distribution
function of a uniform random variable should be, and it is a very simple formula.
1042 47. DENSITY ESTIMATION
But this can be transferred to the general case by the following fact: if one plugs any
random variable into its cumulative distribution, one obtains a uniform distribution!
These locations will therefore give, strictly speaking, the usual values of the smallest,
second smallest etc. observation of Fx (x), but the usual values for x itself cannot be
far from this.
If one plots the data on the vertical axis versus the standard normal on the
horizontal axis (the default for the R-function qqnorm), then an S-shaped plot in-
dicates a light-tailed distribution, an inverse S says that the distribution is heavy-
tailed (try qqnorm(rt(25,df=1)) as an example), a C is left-skewed, and an inverse
C, a J, is right-skewed. A right-skewed, or positively skewed, distribution is one
which has a long right tail, like the lognormal qqnorm(rlnorm(25)) or chisquare
qqnorm(rchisq(25,df=3)).
The classic reference which everyone has read and which explains it all is [Gum58,
pp. 28–34 and 46/47]. Also [WG68] is useful, many examples.
1043
1044 48. MEASURING ECONOMIC INEQUALITY
more clearly what is happening in the middle income ranges. But perhaps
it is not so readily apparent what is happening in the upper tail. This
can be remedied by taking the distribution of the logarithm of income, i.e.,
arrange the income markers on the fence in such a way that equal physical
distances mean equal income ratios.
• Lorenz curve: Line up everybody in ascending order and let them parade
by. You have a big “cake” representing the overall sum of incomes. As each
person passes, hand him or her his or her share of the cake, i.e., a piece of
cake representing the proportion of income that person receives. Make a
diagram indicating how much of the cake has been handed out, versus the
number of people that have passed by. This gives the Lorenz curve. The
derivative of the Lorenz curve is Pen’s parade. The mean income is that
point at which the slope is parallel to the diagonal. A straight line means
total equality.
Parade curve and this horizontal line, divided by the total area under the
Parade curve.
• Gini coefficient: the area between the Lorenz curve and the diagonal line,
times 2 (so that a Gini coefficient of 100% would mean: one person owns
everything, and a Gini of 0 means total equality.
• Theil’s entropy measure: Say xi is person i’s income, x̄ is the average
income, and n the population count. Then the person’s income share is
xi
si = nx̄ . The entropy of this income distribution, according to (3.11.2),
but with natural logarithms instead of base 2, is
n
X
(48.3.1) si ln s1i
i=1
n
X
1
(48.3.2) n ln n = ln n
i=1
48.3. QUANTITATIVE MEASURES OF INCOME INEQUALITY 1047
Subtract the actual entropy of the income distribution from this maximal
entropy to get Theil’s measure
n
1 X xi xi
(48.3.3) ln
n i=1 x̄ x̄
i.e., the difference of the overall measure from the smallest possible measure
is at the same time the weighted average of the differences of h(si ) from
h( n1 ).
Problem 449. Show that (48.3.3) is the difference between (48.3.1) and (48.3.2)
Answer.
n n
X xi nx̄ 1 X xi x̄
ln n − ln = ln n − ln n + ln
nx̄ xi n x̄ xi
i=1 i=1
n n
1 X xi 1 X xi x̄
(48.3.7) = ln n − ln n − ln
n x̄ n x̄ xi
i=1 i=1
n
1 X xi xi
= ln
n x̄ x̄
i=1
Problem 450. 7 points Show that if one takes a small amount of income share
ds from person 2 and adds it to person 1, then the inequality measure defined in
48.3. QUANTITATIVE MEASURES OF INCOME INEQUALITY 1049
(48.3.5) changes by h(s2 ) − h(s1 ) ds. Hint: if β 6= 0,
∂I ∂ 1
(48.3.8) = si h(si ) = sβi = −h(si ).
∂si ∂si β
If one therefore takes ds away from 2 and gives it to 1, I changes by
∂I ∂I
(48.3.9) dI = − + ds = h(s2 ) − h(s1 ) ds
∂s2 ∂s1
If β = 0, only small modifications apply.
Answer. If β = 0, then
n
X
1
(48.3.10) I = ln( n )− si ln(si )
i=1
Interpretation: if h(s2 ) − h(s1 ) = h(s4 ) − h(s3 ) then for the purposes of this
inequality measure, the distance between 2 and 1 is the same as the distance between
1050 48. MEASURING ECONOMIC INEQUALITY
4 and 3. These inequality measures are therefore based on very specific notions of
what constitutes inequality.
Distributed Lags
Two problems: lag length often not known, and X matrix often highly multi-
collinear.
How to determine lag length? Sometimes it is done by the adjusted R̄2 . [Mad88,
p. 357] says this will lead to too long lags and proposes remedies.
Assume we know for sure that lag length is not greater than M . [JHG+ 88,
pp. 723–727] recommends the following “general-to-specific” specification procedure
for finding the lag length: First run the regression with M lags; if the t-test for
the parameter of the M th lag is significant, we say the lag length is M . If it is
insignificant, run the regression with M −1 lags and test again for the last coefficient:
If the t-test for the parameter of the M − 1st coefficient is significant, we say the lag
length is M − 1, etc.
The significance level of this test depends on M and on the true lag length. Since
we never know the true lag length for sure, we will never know the true significance
level for sure. The calculation which follows now allows us to compute this signif-
icance level under the assumption that the N given by the test is the correct N .
Furthermore this calculation only gives us the one-sided significance level: the null
hypothesis is not that the true lag length is = N , but that the true lag length is
≤ N.
Assume the null hypothesis is true, i.e., that the true lag length is ≤ N . Since
we assume we know for sure that the true lag length is ≤ M , the null hypothesis
49. DISTRIBUTED LAGS 1053
Problem 451. Here are excerpts from SAS outputs, estimating a consumption
function. The dependent variable is always the same, GCN72, the quarterly personal
consumption expenditure for nondurable goods, in 1972 constant dollars, 1948–1985.
The explanatory variable is GYD72, personal income in 1972 constant dollars (deflated
by the price deflator for nondurable goods), lagged 0–8 quarters.
• a. 3 points Make a sequential test how long you would like to have the lag
length.
49. DISTRIBUTED LAGS 1055
Answer. If all tests are made at 5% significance level, reject that there are 8 or 7 lags, and
go with 6 lags.
• b. 5 points What is the probability of type I error of the test you just described?
Answer. For this use the fact that the t-statistics are independent. There is a 5% probability
of incorrectly rejecting the first t-test and also a 5% probability of incorrectly rejecting the second
1056 49. DISTRIBUTED LAGS
t-test. The probability of incorrectly rejecting at least one of the two tests is therefore 0.05 + 0.05 −
0.05 · 0.05 = 0.1 − 0.0025 = 0.0975. For 1% it is (for two tests) 0.01 + 0.01 − 0.01 · 0.01 = 0.0199,
but three tests will be necessary!
Usually this is done by the imposition of linear constraints. One might explicitly
write it as linear constraints of the form Rβ = o, since polynomials of dth order are
characterized by the fact that the dth differences of the coefficients are constant, or
their d + 1st differences zero. (This gives one linear constraint for every position in
β for which the dth difference can be computed.)
But here it is more convenient to incorporate these restrictions into the regression
equation and in this way end up with a regression with fewer explanatory variables.
1060 49. DISTRIBUTED LAGS
Any β with a polynomial lag structure has the form β = Hα for the (d + 1) × 1
vector α, where the columns of H simply are polynomials:
β0 1 0 0 0
β1 1 α0
1 1 1
α1
(49.0.3) β2 = 1 2 4 8
α2
β3 1 3 9 27
α3
β4 1 4 16 64
More examples for such H-matrices are in [JHG+ 88, p. 730]. Then the specification
y = Xβ + ε becomes y = XHα + ε . I.e., one estimates the coefficients of α by an
ordinary regression again, and even in the presence of polynomial distributed lags
one can use the ordinary F -test, impose other linear constraints, do “GLS” in the
usual way, etc. (SAS allows for an autoregressive error structure in addition to the
lags). The pdlreg procedure in SAS also uses a H whose first column contains a
zero order polynomial, the second a first order polynomial, etc. But it does not use
these exact polynomials shown above but chooses the polynomials in such a way that
they are orthogonal to each other. The elements of α are called X**0 (coefficient of
the zero order polynomial), X**1, etc.
49. DISTRIBUTED LAGS 1061
In order to determine the degree of the polynomial one might use the same
procedure on this reparametrized regression which one used before to determine the
lag length.
About endpoint restrictions: The polynomial determines the coefficients β0
through βM , with the other βj being zero. Endpoint restrictions (the SAS op-
tions last, first, or both) determine that either the polynomial is such that its
formula also gives βM +1 = 0 or β−1 = 0 or both. This may prevent, for instance,
the last lagged coefficient from becomeing negative if all the others are positive. But
experience shows that in many cases such endpoint restrictions are not a good idea.
Alternative specifications of the lag coefficients: Shiller lag: In 1973, long before
smoothing splines became popular, Shiller in [Shi73] proposed a joint minimization
of SSE and k times the squared sum of d + 1st differences on lag coefficients. He used
a Bayesian approach; Maddala classical method. This is the BLUE if one replaces
the exact linear constraint by a random linear constraint.
Problem 452. Which problems does one face if one estimates a regression with
lags in the explanatory variables? How can these problems be overcome?
1062 49. DISTRIBUTED LAGS
Here the second line is written in a somewhat funny way in order to make the
wt = (1 − λ)λt , the weights with which β is distributed over the lags, sum to one.
Here it is tempting to do the following Koyck-transformation: lag this equation by
one and premultipy by λ to get
(49.1.3)
λy t−1 = λα + β(1 − λ)λxt−1 + β(1 − λ)λ2 xt−2 + β(1 − λ)λ3 xt−3 + · · · + λεt−1 .
Now subtract:
(49.1.4)
y t = α(1 − λ) + λy t−1 + β(1 − λ)xt + εt − λεt−1 .
This has a lagged dependent variable. This is not an accident, as the follwing dis-
cussion suggests.
49.2. AUTOREGRESSIVE DISTRIBUTED LAG MODELS 1063
We will discuss two models which give rise to such a lag structure: either with
the desired level achieved incompletely as the dependent variable (Partial Adjustment
models), or with an adaptively formed expected level as the explanatory variable. In
the first case, OLS on the Koyck transformation is consistent, in the other case it is
not, but alternative methods are available.
Partial Adjustment. Here the model is
(49.2.3) y ∗t = α + βxt + εt ,
where y ∗t is not the actual but the desired level of y t . These y ∗t are not observed,
but the assumption is made that the actual values of y t adjust to the desired levels
as follows:
Solving (49.2.4) for y t gives y t = λy t−1 + (1 − λ)y ∗t . If one substitutes (49.2.3) into
this, one gets
If one were to repeatedly lag this equation, premultipy by λ, and reinsert, one would
get
Here x∗t is the economic agents’ perceptions of the permanent level of xt . Usually the
x∗t are not directly observed. In order to link x∗t to the observed actual (as opposed
to permanent) values xt , assume that in every time period t the agents modify their
perception of the permanent level based on their current experience xt as follows:
(49.2.8) x∗t − x∗t−1 = (1 − λ)(xt − x∗t−1 ).
I.e., the adjustment which they apply to their perception of the permanent level in
period t, x∗t − x∗t−1 , depends on by how much last period’s permanent level differs
from the present period’s actual level; more precisely, it is 1 − λ times this difference.
Here 1 − λ represents some number between zero and one, which does not change
over time. We are using 1 − λ instead of λ in order to make the formulas below a
little simpler and to have the notation consistent with the partial adjustment model.
• a. 1 point Show that (49.2.8) is equivalent to
(49.2.9) x∗t = λx∗t−1 + (1 − λ)xt
• b. 2 points Derive the following regression equation from (49.2.7) and (49.2.9):
Now use (49.2.9) in the form x∗t − λx∗t−1 = (1 − λ)xt to get (49.2.10). The new disturbances are
η t = εt − λεt−1 .
(49.2.13) y t = α0 + β0 xt + λy t−1 + η t
α̂0 β̂0
(49.2.14) α̂ = and β̂ = .
1 − λ̂ 1 − λ̂
Answer. OLS is inconsistent because y t−1 and εt−1 , therefore also y t−1 and η t are correlated.
(It is also true that η t−1 and η t are correlated, but this is not the reason of the inconsistency).
1068 49. DISTRIBUTED LAGS
How many regressors are in this equation? Which are the unknown parameters? De-
scribe exactly how you get these parameters from the coefficients of these regressors.
2 t−1
t
Answer. Three regressors: intercept, (1 − λ) xt + λxt−1 + λ xt−2 + · · · + λ x1 , and λ . In
the last term, λt is the explanatory variable. A regression gives estimates of α, β, and a “prediction”
49.2. AUTOREGRESSIVE DISTRIBUTED LAG MODELS 1069
of x∗0 . Note that the sum whose coefficient is β has many elements for high t, and few elements for
low t. Also note that the λt -term becomes very small, i.e., only the first few observations of this
“variable” count. This is why the estimate of x∗0 is not consistent, i.e., increasing the sample size
will not get an arbitrarily precise estimate of this value. Will the estimate of σ 2 be consistent?
Here is R-code to compute the regressors in (49.2.19), and to search for the best
λ.
"geizel.regressors" <- function(x, lambda)
{ lngth <- length(x)
lampow <- z <- vector(mode="numeric",length=lngth)
lampow[[1]] <- lambda
z[[1]] <- x[[1]]
1070 49. DISTRIBUTED LAGS
the “Koyck transformation” leads to consistent estimates, in the other it does not.
Explain.
Problem 455. 7 points Zvi Griliches in [Gri67] considers the problem of dis-
tinguishing between the following two models: A partial adjustment model
where v t obeys all classical assumptions, and a simple regression model with an au-
toregressive disturbance
with well-behaved disturbances. Clearly, (49.2.21) follows from (49.2.23) by setting γ = α and
δ = 0. To see that (49.2.22) follows as well, write it as
(49.2.24) y t = α + βxt + ρεt−1 + v t
and insert εt−1 = y t−1 − α − βxt−1 . You get (49.2.23) with γ = α(1 − ρ) and δ = −βρ. The two
models make therefore two different statements about δ, the coefficient of xt−1 . (49.2.21) has the
constraint δ = 0, while (49.2.22) has δ = −βρ.
To test whether the data reject the first hypothesis, run a simple t-test for δ = 0. To test
whether the data reject the second hypothesis, test the nonlinear constraint βρ + δ = 0. The
likelihood-ratio test is the neatest way. If the data reject one test and accept the other, then one
is lucky. If the data accept both, then one can argue that there is not enough information to
discriminate between the two models. If the data reject both, then one has exceptionally bad data
(assuming the umbrella hypothesis is correct).
Other alternatives:
Schmidt’s polynomial geometric lag: not necessary to decide over maximum
length of the lag.
What is desired is usually a hump, and this can be
modeled according to density
functions: Pascal lag: the weights are wi = 1+r−1 1 (1 − λ)r i
λ ; estimate by MLE.
Gamma-lag wi = is−1 e−i , s > 0, integer; does not add up to one! Not recommonded
because: w0 = 0 for s > 1, and w1 is always the same. Modified Gamma-lag:
wi = (i + 1)α/(1−α) λi ; 0 ≤ α < 1.
CHAPTER 50
Investment Models
(50.1.1) ∆K = a∆Q.
But it does not fit, the estimated a is much too small for a reasonable capital-output
ratio.
Problem 456. Plot the capital stock of your industry against value added, and
also plot the first differences against each other. Interpret your results.
1073
1074 50. INVESTMENT MODELS
Now the flexible accelerator has the following two basic equations:
This can either be used to generate a relation between capital stock and output, or a
relation between investment and output. For the relation between capital stock and
output write (50.1.3) as
This is a convenient form for estimation, i.e., one has to regress Kt on Qt and Kt−1 .
But one may also eliminate Kt−1 on the righthand side of (50.1.5) by using the
lagged version of (50.1.5):
The other alternative is to get a relation between output and investment. This
is convenient when no capital stock data are available. The figure for investment
usually refers to both replacement investment and net investment, i.e.,
(50.1.8) It = Kt − Kt−1 + Dt ,
where Dt is depreciation.
The usual treatment of depreciation is to set Dt = δKt−1 , with δ either estimated
or obtained from additional information.
Therefore
(50.1.9) It = Kt − (1 − δ)Kt−1
In order to eliminate the term with Kt−1 on the righthand side, use the following
trick:
The last term on the righthand side is equal to (1 − γ − δ)It−1 , and by setting
Kt∗ = aQt one obtains
te assumption that firms seek to maximize the present discounted value of their cash
flow, he makes such strong supplementary assumptions that this is equivalent to the
firms maximizing their net revenue at every instant. In other words, their desired
capital input and labor input are such that FL = w(t)/p(t) and FK = c(t)/p(t),
where c(t) = (δ + r)q(t) − q̇(t) is the user cost of capital (q(t) is the price index
for capital goods). However firms are not at the desired path, and in order to
reach this path they pursue the following strategy: they hire enough labor to satisfy
the marginal product condition for labor given their actual capital stock, i.e., they
produce the profit maximizing amount given this capital stock. The capital stock
which they consider their desired capital stock at time t is that amount of capital
which would be optimal for producing the output they are actually producing at
time t, i.e., Kt∗ = α·Qctt ·pt .
As they approach this capital stock, they also hire more labor to fulfill the
marginal product condition for labor, therefore their output rises, and therefore their
desired capital stock will rise also. What the firms therefore consider their desired
capital stock is not yet the optimal path, but as long as they have not reached this
optimal path, they see a discrepancy between their actual and desired capital stock.
In other words, although Jorgenson claims to be modelling a very neoclassical
forward-looking optimizing behavior, he ands up estimating an equation in which
firms start with the situation they are in and go from there.
1078 50. INVESTMENT MODELS
∞
X Qp
(50.2.3) It = α µi ∆ + δKt−1 .
i=0
c t−i
If one has annual data, one might use this for estimation,
P∞ assuming there are at
most 3 or 4 lags. The coefficient α is identified because i=0 µi = 1.
Problem 458. Run the Jorgonson equation (50.2.3) for some finite number of
lags, and then run the same equation leaving out the cost-of-capital adjustment terms,
i.e., just looking at it as a simple accelerator model. Do you get better results?
Jorgenson, who works with quarterly data, makes the following assumption about
the lag structure µ0 , µ1 , . . .. Using the lag operator, the regression to be estimated
is
∞
X Qp
(50.2.4) It − δKt−1 = α µi Li ∆ .
i=0
c
P i
P i γi L
Jorgenson assumes a rational lag, i.e., the delivery lags are i µi L =
P
ω Li
, where
i
both sums are rather short, i.e., the only nonzero coefficients γi and ωi may be P
ω0 = 1,
ω1 , ω2 , and γ3 , γ4 , and γ5 . Multiplying the regression equation through by ωi Li
gives
5
X Qp
(50.2.5) It − δKt−1 + ω1 (It−1 − δKt−2 ) + ω2 (It−2 − δKt−3 ) = α γi Li ∆ .
i=3
c
1080 50. INVESTMENT MODELS
Therefore It − δKt−1 is the dependent variable, and the other variables the indepen-
dent variables. α is identified here because ω0 + ω1 + ω2 = γ3 + γ4 + γ5 .
Jorgenson’s results are that α = 0.01, which is very small, it would indicate that
only 1% of the sales revenues goes to the owners of capital, the rest goes to the
laborers.
Problem 459. Use the given data for an estimation of the investment function
along the lines suggested by Jorgenson.
50.3. INVESTMENT FUNCTION PROJECT 1081
The main data are collected in two datafiles: ec781.invcur has data about fixed non-
residential private investment, capital stock (net of capital consumption allowance),
and gross national product by industry in current dollars. The file ec781.invcon has
the corresponding data in constant 1982 dollars (has missing values for some data
for industry 27, that is why industry has a double star). Furthermore, the dataset
1082 50. INVESTMENT MODELS
ec781.invmisc has additional data which might be interesting. Among those are
capacity utilization data for the industries 20, 22, 26, 28, 29, 30, 32, 33, 34, 35,
36, 37, and 38 (all industries for which there are no capacity utilization data have
at least one star) and profit rates for all industries. The profit rate data are con-
structed as follows from current dollar data: numerator= corporate profits before tax
+ corporate inventory valuation adjustment + noncorporate income + noncorporate
inventory valuation adjustment + government subsidies + net interest. Denomina-
tor: capital stock + inventories (the inventories come in part from the NIPAs, in
part from the census). It also has the prime rate (short term lending interest rate),
and the 10 year treasury note interest rate, and the consumer price index. Note that
the interest rates are in percent, for most applications you will have to divide them
by 100. The profit rate is not in percent, it is a decimal fraction.
All three datasets have the year as one of the variable, and they go from 1947–85,
with often some of the data for the beginning and the end of that period missing.
These datasets will be available on your d-disk, SAS should find them if you just call
them up by the name given here.
CHAPTER 51
1083
1084 51. DETERMINISTIC CHAOS AND RANDOMNESS
and generally, for ε = (1/3)m one gets N (ε) = 2m . Therefore the dimension is
2m
limm→∞ log log 2
log 3m = log 3 = 0.63.
A concept related to the Hausdorff dimension is the correlation dimension. To
compute this one needs C(ε), the fraction of the total number of points that are
within the Euclidian distance ε of a given point. (This C(ε) is a quotient of two
infinite numbers, but in finite samples it is a quotient of two large but finite numbers,
this is why it is more tractable than the Hausdorff dimension.) Example again with
straight line and area, using sup norm: line: C(ε) = 2ε/L, area: C(ε) = 4ε2 /S.
Then the correlation dimension is limε→0 loglogC(ε) ε , again indicating how this count
varies with the distance.
To compute it, use log CM (ε), which is the sample analog of log C(ε) for a sample
of size M , and plot it against log ε. To get this sample analog, look at all pairs of
different points, and count those which are less than ε apart, and divide by total
number of pairs of different points N (N − 1)/2.
Clearly, if ε is too small, it falls through between the points, and if it is too large,
it extends beyond the boundaries of the set. Therefore one cannot look at the slope
in the origin but must look at the slope of a straight line segment near the origin.
Another reason for not looking at too small ε is that there may be a measurement
error.)
1086 51. DETERMINISTIC CHAOS AND RANDOMNESS
It seems the correlation dimension is close to and cannot exceed the Hausdorff
dimension. What one really wants is apparently the Hausdorff dimension, but the
correlation dimension is a numerically convenient surrogate.
Importance of fractal dimensions: If an attractor has a fractal dimension, then
it is likely to be a strange attractor (although strictly speaking it is neither necessary
nor sufficient). E.g. it seems to me the precise Hausdorff dimension of the Lorentz
attractor is not known, but the correlation dimension is around 2.05.
m > 2n − 1 then the embedding is topologically equivalent to the original time series.
In particular this means that it has the same correlation dimension.
This has important implications: if a time series is part of a deterministic system
also including other time series, then one can draw certain conclusions about the
attractor without knowing the other time series.
Next point: the correlation dimension of this embedding is limε→0 log log C(ε,m)
ε ,
where the embedding dimension m is added as second argument into the function C.
If the system is deterministic, the correlation dimension settles to a stationary value
as the embedding dimension m increases; for a random system it keeps increasing, in
the i.i.d. case it is m. (In the special case that this i.i.d. distribution is the uniform
one, the m-histories are uniformly distributed on the m-dimensional unit cube, and it
follows immediately, like our examples above.) Therefore the Grassberger-Procaccia
plots show for each m one curve, plotting log C(ε, m) against log ε.
For ε small, i.e., log ε going towards −∞, the plots of the true C’s become
asymptotically a straight line emanating from the origin with a given slope which
indicates the dimension. Now one cannot make ε very small for two reasons: (1)
there are only finitely many data points, and (2) there is also a measurement error
whose effect disappears if ε becomes bigger than a few standard deviations of this
measurement error. Therefore one looks at the slope for values of ε that are not too
small.
1088 51. DETERMINISTIC CHAOS AND RANDOMNESS
Instrumental Variables
Compare here [DM93, chapter 7] and [Gre97, Section 6.7.8]. Greene first intro-
duces the simple instrumental variables estimator and then shows that the general-
ized one picks out the best linear combinations for forming simple instruments. I will
follow [DM93] and first introduce the generalized instrumental variables estimator,
and then go down to the simple one.
In this chapter, we will discuss a sequence of models y n = X n β + ε n , where
ε n ∼ (on , σ 2 I n ), and X n are n × k-matrices of random regressors, and the number
of observations n → ∞. We do not make the assumption plim n1 X > n ε n = o which
would ensure consistency of the OLS estimator (compare Problem 394). Instead, a
sequence of n × m matrices of (random or nonrandom) “instrumental variables” W n
1089
1090 52. INSTRUMENTAL VARIABLES
1 >
is consistent. Hint: Write β̃ n − β = B n · nW ε and show that the sequence of
matrices B n has a plim.
Answer. Write it as
−1
β̃ n = X > W (W > W )−1 W > X X > W (W > W )−1 W > (Xβ + ε )
−1
= β + X > W (W > W )−1 W > X X > W (W > W )−1 W >ε
−1
1 1 1 1 1 1
= β + ( X > W )( W > W )−1 ( W > X) ( X > W )( W > W )−1 W > ε,
n n n n n n
1092 52. INSTRUMENTAL VARIABLES
Problem 461. Assume plim n1 X > X exists, and plim n1 X >ε exists. (We only
need the existence, not that the first is nonsingular and the second zero). Show that
σ 2 can be estimated consistently by s2 = n1 (y − X β̃)> (y − X β̃).
Answer. y − X β̃ = Xβ + ε − X β̃ = ε − X(β̃ − β). Therefore
1 1 2 1 >
(y − X β̃)> (y − X β̃) = ε >ε − ε > X(β̃ − β) + (β̃ − β)> X X (β̃ − β).
n n n n
All summands have plims, the plim of the first is σ 2 and those of the other two are zero.
Problem 462. In the situation of Problem 460, add the stronger assumption
√
√1 W >ε
n
→ N (o, σ 2 Q), and show that n(β̃ n − β) → N (o, σ 2 (D > Q−1 D)−1 )
1 √
Answer. β̃ n − β = B n n W>
n ε n , therefore n(β̃ n − β) = B n n−1/2 W > 2
n ε n → BN (o, σ Q) =
2 > > −1 −1 > −1
N (o, σ BQB ). Since B = (D Q D) D Q , the result follows.
52. INSTRUMENTAL VARIABLES 1093
case the ε -vector is not orthogonal to x. (Draw ε vertically, and make x long enough
that β < 1.) We assume n is large enough so that the asymptotic results hold for
the sample already (or, perhaps better, that the difference between the sample and
its plim is only infinitesimal). Therefore the OLS regression, with estimates β by
x> y/x> x, is inconsistent. Let O be the origin, A the point on the x-vector where
ε branches off (i.e., the end of xβ), furthermore let B be the point on the x-vector
where the orthogonal projection of y comes down, and C the end of the x-vector.
Then x> y = OC ¯ OB¯ and x> x = OC ¯ 2 , therefore x> y/x> x = OB/
¯ OC,¯ which would
be the β if the errors were orthogonal. Now introduce a new variable w which is
orthogonal to the errors. (Since ε is vertical, w is on the horizontal axis.) Call D the
projection of y on w, which is the prolongation of the vector ε , and call E the end of
the w-vector, and call F the projection of x on w. Then w> y = OE ¯ OD,
¯ and w> x =
¯ ¯ > > ¯ ¯ ¯ ¯
OE OF . Therefore w y/w x = (OE OD)(OE OF ) = OD/OF = OA/ ¯ ¯ ¯ OC ¯ = β.
Or geometrically it is obvious that the regression of y on the projection of x on w
will give the right β̂. One also sees here why the s2 based on this second regression
is inconsistent.
If I allow two instruments, the two instruments must be in the horizontal plane
perpendicular to the vector ε which is assumed still vertical. Here we project x on
this horizontal plane and then regress the y, which stays where it is, on this x. In
this way the residuals have the right direction!
52. INSTRUMENTAL VARIABLES 1095
What if there is one instrument, but it does not not lie in the same plane as
x and y? This is the most general case as long as there is only one regressor and
one instrument. This instrument w must lie somewhere in the horizontal plane. We
have to project x on it, and then regress y on this projection. Look at it this way:
take the plane orthogonal to w which goes through point C. The projection of x
on w is the intersection of the ray generated by w with this plane. Now move this
plane parallel until it intersects point A. Then the intersection with the w-ray is the
projection of y on w. But this latter plane contains ε , since ε is orthogonal to w.
This makes sure that the regression gives the right results.
Problem 463. 4 points The asymptotic MSE matrix of the instrumental vari-
−
ables estimator with W as matrix of instruments is σ 2 plim X > W (W > W )−1 W > X
Show that if one adds more instruments, then this asymptotic MSE-matrix can only
decrease. It is sufficient
to
show that the inequality holds before going over to the
plim, i.e., if W = U V , then
−1 −1
(52.0.7) X > U (U > U )−1 U > X − X > W (W > W )−1 W > X
is nonnegative definite. Hints: (1) Use theorem A.5.5 in the Appendix (proof is
not required). (2) Note that U = W G for some G. Can you write this G in
1096 52. INSTRUMENTAL VARIABLES
partitioned matrix form? (3) Show that, whatever W and G, W (W > W )−1 W > −
W G(G> W > W G)−1 G> W > is idempotent.
Answer.
I I
(52.0.8) U = U V = WG where G= .
O O
Problem 464. 2 points Show: if a matrix D has full column rank and is square,
then it has an inverse.
Answer. Here you need that column rank is row rank: if D has full column rank it also
has full row rank. And to make the proof complete you need: if A has a left inverse L and a
right inverse R, then L is the only left inverse and R the only right inverse and L = R. Proof:
L = L(AR) = (LA)R = R.
Problem 465. 2 points If W > X is square and has full column rank, then it is
nonsingular. Show that in this case (52.0.4) simplifies to the “simple” instrumental
variables estimator:
(52.0.9) β̃ = (W > X)−1 W > y
52. INSTRUMENTAL VARIABLES 1097
Answer. In this case the big inverse can be split into three:
−1
(52.0.10) β̃ = X > W (W > W )−1 W > X X > W (W > W )−1 W > y =
(52.0.11) = (W > X)−1 W > W (X > W )−1 X > W (W > W )−1 W > y
Problem 466. We only have one regressor with intercept, i.e., X = ι x , and
we have one instrument
w for x (while the constant term is its own instrument),
i.e., W = ι w . Show that the instrumental variables estimators for slope and
intercept are
P
(wt − w̄)(y t − ȳ)
(52.0.12) β̃ = P
(wt − w̄)(xt − x̄)
(52.0.13) α̃ = ȳ − β̃x̄
Hint: the math is identical to that in question 238.
Problem 467. 2 points Show that, if there are as many instruments as there are
observations, then the instrumental variables estimator (52.0.4) becomes identical to
OLS.
1098 52. INSTRUMENTAL VARIABLES
Answer. In this case W has an inverse, therefore the projection on R[W ] is the identity.
Staying in the algebraic paradigm, (W > W )−1 = W −1 (W > )−1 .
An implication of Problem 467 is that one must be careful not to include too
many instruments if one has a small sample. Asymptotically it is better to have more
instruments, but for n = m, the instrumental variables estimator is equal to OLS, i.e.,
the sequence of instrumental variables estimators starts at the (inconsistent) OLS.
If one uses fewer instruments, then the asymptotic MSE matrix is not so good, but
one may get a sequence of estimators which moves away from the inconsistent OLS
more quickly.
CHAPTER 53
Errors in Variables
(53.1.1) y = α + x∗ β + v.
1099
1100 53. ERRORS IN VARIABLES
If n observations of the variables y and x∗ are available, one can obtain estimates
of α and β and predicted values of the disturbances by running a regression of the
vector of observations y on x∗ :
(53.1.2) y = ια + x∗ β + v.
But now let us assume that x∗ can only be observed with a random error. I.e., we
observe x = x∗ +u. The error u is assumed to have zero mean, and to be independent
of x∗ and v. Therefore we have the model with the “latent” variable x∗ :
(53.1.3) y = α + x∗ β + v
(53.1.4) x = x∗ + u
This model is sometimes called “regression with both variables subject to error.” It
is symmetric between the dependent and the explanatory variable, because one can
also write it as
(53.1.5) y ∗ = α + x∗ β
(53.1.6) x = x∗ + u
(53.1.7) y = y∗ + v
and, as long as β 6= 0, y ∗ = α + x∗ β is equivalent to x∗ = −α/β + y ∗ /β.
53.1. THE SIMPLEST ERRORS-IN-VARIABLES MODEL 1101
What happens if this is the true model and one regresses y on x? Plug x∗ = x−u
into (53.1.2):
(53.1.8) y = ια + xβ + (v − uβ)
| {z }
ε
The problem is that the disturbance term ε is correlated with the explanatory vari-
able:
(53.1.9) cov[x, ε] = cov[x∗ + u, v − uβ] = −β var[u].
Therefore OLS will give inconsistent estimates of α and β:
P
(yi − ȳ)(xi − x̄)
(53.1.10) β̂ OLS = P
(xi − x̄)2
cov[y, x] var[u]
(53.1.11) plim β̂ OLS = =β 1− .
var[x] var[x]
Since var[u] ≤ var[x], β̂ OLS will have the right sign in the plim, but its absolute value
will understimate the true β.
Problem 468. 1 point [SM86, A3.2/3] Assume the variance of the measurement
error σu2 is 10% of the variance of the unobserved exogenous variable σx2∗ . By how
1102 53. ERRORS IN VARIABLES
many percent will then the OLS estimator β̂ OLS asymptotically underestimate the
absolute value of the true parameter β?
Answer. 1 − var[u]/ var[x] = 1 − 0.1/1.1 = 0.90909, which is 9.09% below 1.
We know five moments of the observed variables: µy , µx , σy2 , σxy , and σx2 ; but there
are six independent parameters of the model: α, β, µx∗ , σx2∗ , σv2 , σu2 . It is therefore
no wonder that the parameters cannot be determined uniquely from the knowledge of
means and variances of the observed variables, as shown by counterexample (53.1.13).
However α and β cannot be chosen arbitrarily either. The above equations imply
three constraints on these parameters.
The first restriction on the parameters comes from equation (53.1.16) for the
means: From µy∗ = α + βµx∗ follows, since µy = µy∗ and µx = µx∗ , that
(53.1.18) µy = α + βµx ,
the observations has coefficient zero, they are noisy observations of two linearly
unrelated variables.
In the regular case σxy 6= 0, condition (53.1.17) for the dispersion matrices gives
two more restrictions on the parameter vectors. From σxy = βσx2∗ follows the second
restriction on the parameters:
(53.1.19) β must have the same sign as σxy .
And here is a derivation of the third restriction (53.1.23): from
(53.1.20) 0 ≤ σu2 = σx2 − σx2∗ and 0 ≤ σv2 = σy2 − β 2 σx2∗
follows
(53.1.21) σx2∗ ≤ σx2 and β 2 σx2∗ ≤ σy2 .
Multiply the first inequality by |β| and substitute in both inequalities σxy for βσx2∗ :
The lower bound is the absolute value of the plim of the regression coefficient if
one regresses the observations of y on those of x, and the reciprocal of the upper
bound is the absolute value of the plim of the regression coefficient if one regresses
the observed values of x on those of y.
Problem 472. We have seen that the data generated by the two processes (53.1.13)
do not determine the underlying relationship completely. What restrictions do these
data impose on the parameters α and β of the underlying relation y ∗ = α + βx∗ ?
• a. 3 points What does the information about y and x given in equation (53.1.24)
imply about α and β?
Answer. (53.1.18) gives α − β = 1, (53.1.19) gives β ≤ 0, and (53.1.23) 2/3 ≤ |β| ≤ 3.
1108 53. ERRORS IN VARIABLES
• b. 3 points Give the plims of the OLS estimates of α and β in the regression
of y on x.
Answer. plim β̂ = cov[x, y]/ var[x] = − 23 , plim α̂ = E[y] − E[x]plim β̂ = 1
3
.
• c. 3 points Now assume it is known that α = 0. What can you say now about
β, σu2 , and σv2 ? If β is identified, how would you estimate it?
Answer. From y = (x − u)β + v follows, by taking expectations, E[y] = E[x]β (i.e., the true
relationship still goes through the means), therefore β = −1, and a consistent estimate would be ȳ/x̄.
Now if one knows β one gets var[x∗ ] from cov[x, y] = cov[x∗ +u, βx∗ +v] = β var[x∗ ], i.e., var[x∗ ] = 2.
Then one can get var[u] = var[x] − var[x∗ ] = 3 − 2 = 1, and var[v] = var[y] − var[y ∗ ] = 6 − 2 = 4.
Luckily, those variances came out to be positive; otherwise the restriction α = 0 would not be
compatible with (53.1.24).
are independent observations from the same joint distribution, this means that first
and second order moments exist. If the systematic variables are nonrandom, the
plim becomes the ordinary limit. These two special cases are called the “structural
variant” and the “functional variant.”
U is the n × k matrix of the values of the unobserved “errors” or “statistical
disturbances.” These errors are assumed to be random; they have zero expectations,
the rows of U are independent and identically distributed with covariance matrix Q. e
If the systematic variables are random as well, then we assume that the errors are
independent of them.
The letter B in (53.2.2) is an upper-case Greek β, the columns of B will therefore
be written β i . Every such column constitutes a linear relation between the systematic
variables. B is assumed to be exhaustive in the sense that for any vector γ which
satisfies X ∗ γ = o there is a vector q so that γ = Bq. The rank of B is denoted by
q. If q = 1, then only one linear relation holds, and the model is called a univariate
EV model, otherwise it is a multivariate EV model.
In this specification, there are therefore one or several exact linear relations
between the true variables X ∗ , but X ∗ can only be observed with error. The task of
getting an estimate of these linear relations has been appropriately called by Kalman
“identification of linear relations from noisy data” [Kal83, p. 119], compare also the
title of [Kal82]. One can say, among the columns of X there is both a stochastic
1110 53. ERRORS IN VARIABLES
relationship and a linear relationship, and one wants to extract the linear relationship
from this mixture.
The above are the minimum assumptions which we will make of each of the
models below. From these assumptions follows that
1 ∗>
(53.2.3) plim X U =O
n→∞ n
Proof: First we prove it for the case that the systematic variables are nonran-
dom, in which case we write them X ∗ . Since the expected value E[U ] = O, also
E[ n1 U > X ∗ ] = O. For (53.2.3) it is sufficient to show that the variances of these
arithmetic means converge to zero: if their expected value is zero and their vari-
>
ance
Pn converges to zero, their plim is zero as well. The i, k element of X ∗ U is
∗ ∗ >
j=1 x ji ujk , or it can also be written as x i uk , it is the scalar product of the ith
∗
column of X with the kth column of U . Since all the elements in the kth column of
U have same variance, namely, var[ujk ] = q̃kk , and since ujk is independent of umk
for j 6= m, it follows
1 X ∗ q̃ii 1 X ∗ 2
(53.2.4) var x ji ujk = x ji .
n j n n n
Given this special case, the general case follows by an argument of the form:
since this plim exists and is the same conditionally on any realization of X ∗ , it also
exists unconditionally.
Other assumptions, made frequently, are: the covariance matrix of the errors Q e
is p.d. and/or diagonal.
There may also be linear restrictions on B, and restrictions on the elements of
Q.
e
An important extension is the following: If the columns of X ∗ and U are re-
alizations of (weakly) stationary stochastic processes, and/or X ∗ contains lagged
variables, then one speaks of a dynamic EV model. Here the rows of U are no longer
independent.
Y ∗ = X ∗B
∗ −I
Y X∗ =O
(53.3.1) Y =Y∗+V i.e., B
Y X = Y∗ X∗ + V
∗
X =X +U U
The OLS model is a special case of the errors in variables model. Using the
definition y ∗ = X ∗ β, i.e., y ∗ is the vector which ŷ estimates, one can write the
regression model in the form
y∗ = X ∗ β
∗ ∗
−1
y X =o
y = y∗ + v or in the symmetric form β
∗
X∗ + v
∗
X=X y X = y O .
If there is a single “bad” variable, say it is x, and we will call the matrix of the
“good” variables Z, then the univariate EV model has the form
y ∗ = x∗ β + Zγ
−1
x = x∗ + u
∗
x∗ Z ∗ β = o
y
or γ
y = y∗ + v
∗ ∗ ∗
∗
53.3. PARTICULAR FORMS OF EV MODELS 1113
Some well-known models, which are not usually considered EV models, are in
fact special cases of the above specification.
A Simultaneous Equations System, as used often in econometrics, has the form
(53.3.4) Y Γ = X ∗B + E
where Y (the endogenous variables) and X ∗ (the exogenous variables) are observed.
E is independent of X ∗ (it characterizes the exogenous variables that they are inde-
pendent of the errors). B and Γ are matrices of nonrandom but unknown parameter
vectors, Γ is assumed to be nonsingular. Defining Y ∗ = X ∗ BΓ−1 and V = EΓ−1 ,
1114 53. ERRORS IN VARIABLES
This is an antisymmetric relation between the two matrices in the sense that
from C ⊥ B follows B > ⊥ C > . C ⊥ B means therefore also that for all R with
CR = O there is a Y with R = BY . We can therefore also say that B is a right
deficiency matrix of C.
C ⊥ B simply means that the row vectors of C span the vector space orthogonal
to the vector space spanned by the column vectors of B. If therefore B is k × q and
has rank q, then C can be chosen (k − q) × k with rank k − q.
Start with an EV model
(53.3.8) X ∗B = O
(53.3.9) X = X∗ + U
A model which violates one of the assumptions of the EV model is the Berkson
model. Let us discuss the simplest case
(53.3.10) y ∗ = α + x∗ β
(53.3.11) y = y∗ + v
(53.3.12) x = x∗ + u.
While usually u is indepenent of x∗ , now we assume u is indepenent of x. If this is
the case, then the regression of y on x is unbiased and efficient:
(53.3.13) y = y ∗ + v = x∗ β + v = xβ + v − uβ
Here the error terms are independent of the explanatory variable.
How can this happen? Use the example with y ∗ is voltage in volts, x∗ is current in
amperes, and β is the resistance in Ohm. Here is a circuit diagram: the experimenter
adjusts the current until the ampere meter shows, for instance, one ampere, and then
he reads the voltage in volts, which are his estimate of the resistance.
“three restrictions on the parameters” (53.1.18), (53.1.19), and (53.1.23) from the
simple regression with errors in dependent and independent variables to the general
EV model.
(53.4.7) Q = Q∗ + Q
e
This is a surprisingly difficult problem which has not yet been resolved in general.
Here is one partial result:
Theorem (“Elementary Regression Theorem”): Assume the limit moment matrix
of the observations, Q, has an inverse Q−1 all elements of which are positive. Then
the EV problem is necessarily univariate, and β is a solution if and only if it can be
written in the form β = Q−1 γ where γ > o.
Interpretation of the result: The ith column of Q−1 is proportional to the re-
gression coefficients of the ith “elementary regression,” in which the observations of
the ith variable xi are regressed on all the other variables. Therefore this theorem
is a direct generalization of the result obtained in two dimensions, but it is only
valid if all elementary regressions give positive parameters or can be made to give
positive parameters by sign changes. If this is the case, the feasible parameter vec-
tors are located in the convex set spanned by the plims of all elementary regression
coefficients.
Proof of Theorem: Assume Q is positive definite, Q e is diagonal and positive
definite, and Q − Q nonnegative definite and singular. Singularity means that there
e
exists a vector β, which is not the null vector, with (Q − Q)β
e = o. This can also be
−1 e
expressed as: β is eigenvector of Q Q with 1 as eigenvalue.
First we will take any such eigenvalue and show that it can be written in the
form as required. For this we will show first that every eigenvector α of Q−1 Q, e
1120 53. ERRORS IN VARIABLES
(53.4.9) Qα
e = Qαλ
(53.4.10) Qα − Qα
e = Qα(1 − λ)
This completes the proof of the theorem. But we still need to prove its interpre-
tation. If one regresses the first variable on all others, i.e., estimates the equation
(53.4.12) x1 = x2 ··· xK β + ε =: Zβ + ε ,
the solution is β̂ = (Z > Z)−1 Z > x1 . Note that the elements from which β̂ is formed
are partitions of Q. (Note that Q is not the dispersion matrix but the matrix of
uncentered moments of X.)
1 x> x>
1 x1 1Z
(53.4.13) Q= .
n Z > x1 Z > Z
1
Postmultiplication of Q by gives therefore
−β̂
(53.4.14)
1 x> x> 1 x> > > −1 >
1 x1 1Z 1 1 x1 − x1 Z(Z Z) Z x1
= .
n Z > x1 Z > Z −(Z > Z)−1 Z > x1 n o
1
In other words, is proportional to the first column of Q−1 .
−β̂
Once the mathematical tools are better developed, it will be feasible to take the
following approach to estimation: first solve the Frisch problem in order to get an
1122 53. ERRORS IN VARIABLES
estimate of the feasible parameter region compatible with the data, and then use
additional information, not coming from the data, to narrow down this region to a
single point. The emphasis on the Frisch problem is due to Kalman, see [Kal82].
Also look at [HM89].
∗ B
∗
(53.4.15) X Z =O
Γ
∗
Z∗ + U
(53.4.16) X Z = X O .
53.4. THE IDENTIFICATION PROBLEM 1123
The moment matrices associated with this satisfy the following Frisch decompo-
sition:
QXX QXZ QXX QXZ O O
(53.4.17) = ∗ +
QZX QZZ QZX QZZ O Q e ZZ
Reformulate this: Now how does one get variables whose moment matrix is the
one in (53.4.20)? By regressing every variable in Z on all variables in X and taking
the residuals in this regression. Write the estimated regression equation and the
residuals as
−Q−1
β XX QXZ γ
(53.4.21) 7→ γ 7→
γ γ
QXX QXZ
is a bijection between vectors that annul , which is the moment
QZX Q∗ZZ
matrix of the systematic variables in (53.4.17), and vectors that annul Q∗ZZ −
QZX Q−1 XX QXZ , which is the moment matrix of the systematic variables in (53.4.20).
53.4. THE IDENTIFICATION PROBLEM 1125
Since the nonsingularity of Q implies the nonsingularity of QXX , the first equa-
tion implies β = −Q−1XX QXZ γ, and plugging this into the second equation gives
∗ −1
QZZ − QZX QXX QXZ γ = o. On the other hand, starting with a β annulling the
systematic moment matrix of the compacted problem (Q∗ZZ −QZX Q−1 XX QXZ )β = o,
this implies
−Q−1
QXX QXZ XX QXZ β = o
(53.4.23)
QZX Q∗ZZ β o
If X = ι, then going over to the residuals simply means that one has to take
deviations from the means. In this case, (53.4.20) is the decomposition of the co-
variance matrix of the variables. In other words, if there is a constant term in the
regressions, then Q and Q e should not be considered to be the moments of the ob-
served and systematic variables about the origin, but their covariance matrices. We
will use this rule extensively in the following examples.
1126 53. ERRORS IN VARIABLES
1 1
(53.5.4) bOLS − β = ( X > X)−1 X > E
n n
1 > −1 1 > 1 1
(53.5.5) = −( X X) X U β + ( X > X)−1 X > v.
n n n n
Since X > U = X ∗ > U + U > U and plim n1 X ∗ > U = O, this becomes in the plim
This, under the additional assumption that v and U are in the plim uncorrelated,
i.e., that σ U v = o, is [Gre97, (9.28) on p. 439]. Greene says “this is a mixture of
all the parameters in the model,” implying that it is hopeless to get information
about these parameters out of this. However, if one looks for inequalities instead of
equalities, some information is available.
1128 53. ERRORS IN VARIABLES
For instance one can show that the sample variance of the residuals remains
between the variance of ε and the variance of v, i.e.,
1 >
(53.5.7) σv2 ≤ plim e e ≤ σε2 .
n
For this start with
therefore
1 > 1 1 1
(53.5.9) plim e e = σε2 − plim ε> X( X > X)−1 X >ε.
n n n n
Since X > X −1 is nonnegative definite, this shows the second half of (53.5.7). For
the first half use
(53.5.10) ε = −U β + v, hence
(53.5.11) ε >ε = β > U > U β − 2β > U > v + v > v and
>e
(53.5.12) σε2 = β Qβ + σv2
53.5. ORDINARY LEAST SQUARES IN THE EV MODEL 1129
and
1 > 1
(53.5.13) plim X ε = plim (X ∗ + U )> (−U β + v) = −Qβ. e
n n
Plugging (53.5.12) and (53.5.13) into (53.5.9) gives
1
(53.5.14) plim e> e = σv2 + β > (Q e −1 Q)β.
e − QQ e
n
Since Q e −1 Q
e − QQ e is nonnegative definite, this proves the first half of the inequality.
Let xi > , x∗i > , and ui > be the ith rows of X, X ∗ and U . Assume they are dis-
tributed x∗i ∼ NID(o, Q∗ ), ui ∼ NID(o, Q), e and v i ∼ NID(0, σ 2 ), and all three are
independent of each other. Define Q = Q∗ + Q. e Therefore
Q∗ β
xi
o Q
(53.5.23) ∼N , > ∗
yi 0 β Q σ 2 + β > Q∗ β
Compute E[y i |xi ] and var[y i |xi ]. (Since y i is a linear function of xi , you can use
the formulas for best linear predictors here.)
Answer.
(53.5.24) E[y i |xi ] = β > Q∗ Q−1 xi = β > (Q − Q
e )Q−1 xi = β> xi − β> Q
e Q−1 xi
(53.5.25) var[y i |xi ] = σ 2 + β > Q∗ β − β > Q
e Q−1 Q
eβ
53.5. ORDINARY LEAST SQUARES IN THE EV MODEL 1131
with X and v independent (and again for simplicity all variables having zero mean),
which gives the same joint distribution of X and y as the above specification. Com-
pute γ, V [xi ], and var[y i ] in terms of the structural data of the above specification.
Answer.
(53.5.27) y ∗ = X ∗ β − Q−1 Q
eβ
(53.5.28) X = X∗
(53.5.29) y = y∗ + v
where x∗ i = xi ∼ N (o, Q) and var[v i ] = σ 2 +β > Q∗ β −β > Q e Q−1 Q e β. As one sees, γ = β −Q−1 Qe β.
In this latter model, OLS is appropriate, and these are therefore the plims of the OLS estimates.
Note that C [xi , y i ] = C [xi , xi > γ + v i ] = C [xi , γ > xi ] = V [xi ]γ = Q(β − Q−1 Q e β) = Q∗ β, and
∗ > ∗ > > > e −1 −1 e > > e Q−1 Q
e b + b> Q
var[y ] = var[γ x ] = γ Qγ = (β − β QQ )Q(β − Q Qβ) = b Qb − 2b Q e b.
Adding var[v i ] = σ + β Q β − β QQ Qβ to this gives, if everything is right, σ + β > Q∗ β.
2 > ∗ > e −1 e 2
About the form of the alternative coefficient vector compare their theorem 4.1, and about the
residual variance compare their theorem 4.3.
1132 53. ERRORS IN VARIABLES
This dataset is also discussed in [Mad88, pp. 239, 382] and [CP77, pp. 152, 164],
but the following exercise follows [Kal84]. If you have the R-library ecmet installed,
then the data can be made available by the command data(malvaud). You can also
download them from www.econ.utah.edu/ehrbar/data/malvaud.txt.
• a. Run the three elementary regressions for the whole period, then choose at
least two subperiods and run them for those. Plot all regression coefficients as points
in a plane, using different colors for the different subperiods (you have to normalize
them in a special way that they all fit on the same plot).
Answer. Assume you have downloaded the data and put them into the SAS dataset malvaud.
The command for one of the regressions over the whole period is
For regression over subperiods you must first form a dataset which only contains the subperiod:
data fifties;
set ec781.malvaud;
if 1950<=year<=1959;
run;
proc reg data=fifties;
model imports=hhconsum;
53.6. KALMAN’S CRITIQUE OF MALINVAUD 1135
run;
You can run several regressions at once by including several model statements with different models.
• b. The elementary regressions give you three fitted equations of the form
(53.6.1) imports = α̂1 + β̂12 gdp + β̂13 hhconsum + residual1
(53.6.2) gdp = α̂2 + β̂21 imports + β̂23 hhconsum + residual2
(53.6.3) hhconsum = α̂3 + β̂31 imports + β̂32 gdp + residual3 .
In order to compare the slope parameters of the second regression to the ones obtained
in the first, solve (53.6.2) for imports,
α̂2 1 β̂23 residual2
(53.6.4) imports = − + gdp − hhconsum −
β̂21 β̂21 β̂21 β̂21
and compare β̂12 with 1/β̂21 and β̂13 with −β̂23 /β̂21 . In the same way compare the
results of the third regression with the ones of the first. This comparison is conve-
niently done in table 1. Fill in the values for the whole period and also for several
sample subperiods. Make a scatter plot of the contents of this table, i.e., represent
each regression result as a point in a plane, using different colors for different sample
periods.
1136 53. ERRORS IN VARIABLES
• c. You will probably find that these points form a very narrow but often quite
long triangle. The triangles for different subperiods lie on the same stable line. This
indicates that the data should be modeled as observations with errors of systematic
data which satisfy two linear relationships at once. Using the plots of the differ-
ent regression coefficients, compute approximately the coefficients of these two linear
relationships.
53.6. KALMAN’S CRITIQUE OF MALINVAUD 1137
• -
•
• -
• • -
•
•
• -
•
•• -
•
••
-
•
•• -
•
• -
• -
•
(53.7.1) y ∗ = x∗ β
(53.7.2) y = y∗ + v
(53.7.3) x = x∗ + u
53.7. ESTIMATION IF THE EV MODEL IS IDENTIFIED 1147
This has one parameter less than the previous simple regression model, since α = 0.
Therefore the parameters are determined by (53.1.16) and (53.1.17).
µy
(53.7.4) β=
µx
σxy σxy µx
(53.7.5) σx2∗ = =
β µy
σxy µx
(53.7.6) σu2 = σx2 − σx2∗ = σx2 −
µy
σxy µy
(53.7.7) σv2 = σy2 − β 2 σx2∗ 2
= σy − .
µx
Replacing these by sample moments gives consistent estimates.
Here is an example of a bivariate EV model that is identified. Assume one has
three different measurement instruments all of which measure the same quantity x∗ .
Then the readings of these instruments, which we will denote x, y, and z, are usually
modeled to be noisy linear transformations of the true value x∗ :
(53.7.8) x = x∗ + u
(53.7.9) y = α + βx∗ + v
(53.7.10) z = γ + δx∗ + w.
1148 53. ERRORS IN VARIABLES
The measurement errors u, v, and w are assumed independent of each other. The
first instrument is called the “standard instrument” since the origin and scale of the
true variable x∗ are assumed to be identical to the origin and scale of this instrument;
the other two instruments have different origins and scales.
Here are the formulas for the method of moments estimates:
syz
(53.7.11) β̂ =
sxz
syz
(53.7.12) δ̂ =
sxy
(53.7.13) α̂ = ȳ − β̂x̄
(53.7.14) γ̂ = z̄ − γ̂x
sxy sxz
(53.7.15) σ̂u2 = s2x −
syz
2 2 sxy syz
(53.7.16) σ̂v = sy −
sxz
2 2 sxz syz
(53.7.17) σ̂w = sx − .
sxy
53.7. ESTIMATION IF THE EV MODEL IS IDENTIFIED 1149
Here is another example related with the permanent income hypothesis. If one
has several categories of consumption Cj , such as food, housing, education and en-
tertainment, etc., then the permanent income hypothesis says
(53.7.18) Y =Yp+Yt
(53.7.19) Cj = αj + βj Y p + Cjt
If all the Cjt are independent of each other, then this system is identified.
Problem 477. Given a bivariate problem with three variables all of which have
zero mean. (This is the model apparently appropriate to the malvaud data after taking
out the means.) Call the observed variables x, y, and z, with underlying systematic
variables x∗ , y ∗ , and z ∗ , and error variables u, v, and w. Write this model in the
form (53.3.1).
Answer.
" # x∗ = βz ∗
−1 0 y ∗ = γz ∗
x∗ y∗ z∗ 0 −1 =O
(53.7.20) β γ or x = x∗ + u
y = y∗ + v
x y z = x∗ y∗ z∗ + u v w
z = z ∗ + w.
1150 53. ERRORS IN VARIABLES
• a. The moment matrix of the systematic variables can be written fully in terms
of σz2∗ and the unknown parameters. Write out the moment matrix and therefore the
Frisch decomposition.
Answer.
σx2
" # " # " #
σxy σxz β2 βγ β σu2 0 0
(53.7.21) σxy σy2 σyz = σz2∗ βγ γ2 γ + 0 σv2 0 .
σxz σyz σz2 β γ 1 0 0 2
σw
• b. Show that the unknown parameters are identified, and derive estimates of
all parameters of the model.
Answer. Solving the Frisch equations one gets
σxy 2 σxz σxy
β= σu = σx2 −
σyz σyz
σxy 2 2 σxy σyz
(53.7.22) γ= σv = σy −
σxz σxz
σxz σyz σyz σxz
σz2∗ = 2
σw = σz2 −
σxy σxy
If you replace the true moments by the sample moments, you see that β and γ are estimated by
instrumental variables.
53.7. ESTIMATION IF THE EV MODEL IS IDENTIFIED 1151
• c. Compare these estimates with OLS estimates. Derive equations for the bias
of OLS.
Answer.
σzz σxz
σxz σxy σyz σxy
(53.7.23) plim β̂OLS − β = − = .
σzz σyz σzz σyz
• e. 3 points Now run regressions with only one explanatory variable. Are the
results close to the relations which you would expect from the result of the previous
step?
53.8. P-Estimation
A type of prior information that can be handled well mathematically is that Q e
is known except for a constant multiple, i.e., one knows a Λ so that there is a κ 6= 0
with Qe = κΛ.
In the simple EV model in which the errors of the x and y variables are inde-
pendent this means that one knows a κ with
(53.8.1) σv2 = κσu2 .
This equation, together with the Frisch equations (after elimination of the constant
term α)
allows identification of all parameters as follows: In (53.8.2) and (53.8.3), replace σx2∗
by σxy /β and σv2 by κσu2 , and put σu2 on the lefthand side:
σxy
(53.8.5) σu2 = σx2 −
β
1
σu2 = σy2 − βσxy .
(53.8.6)
κ
Setting those equal and multiplying by βκ gives the quadratic equation
(53.8.7) β 2 σxy + β(κσx2 − σy2 ) − κσxy = 0,
which has the solutions
s
σy2 − κσx2 σ 2 − κσ 2 2
y x
(53.8.8) β1|2 = ± κ+ .
2σxy 2σxy
Since β must have the same sign as σxy , only one of these solutions is valid, which
can be written as
1 h 2 q i
(53.8.9) β= σy − κσx2 + 4κσxy 2 + (σ 2 − κσ 2 )2 .
y x
2σxy
By replacing the true moments of the observed variables by the sample moments one
obtains an estimate which we will denote with β̂ P . The Frisch equations will then
also yield estimates of the other parameters.
1154 53. ERRORS IN VARIABLES
Problem 478. [SM86, A 3.3/11] Show that (53.8.9) can also be written as
2σxy
(53.8.10) β= q where b = σx2 − σy2 /κ.
b+ 2 /κ
4σxy + b2
1 −4κσ 2
h i
(53.8.13) = p xy .
2σxy a − 4κσxy
2 + a2
where β̂ ROLS is the parameter obtained by reversed OLS (i.e., by regressing x on y).
Answer. For this one needs (53.8.9) and (53.8.10).
53.8. P-ESTIMATION 1155
Now we will show that the same estimate can be obtained by minimizing the
weighted sum
1 X 1 X
(53.8.15) 2
(xi − x∗ i )2 + 2 (y i − α − βx∗ i )2
σu σv
with respect to α, β, and x∗ i .
This minimization is done in three steps. In the first step, we ask: given α and β,
what are the best x∗ i ? (Here we see that this alternative approach to P -estimation
also gives us predictions of the systematic variables.) Since each x∗ i occurs only in
one summand of each of the two sums, we can minimize these individual summands
separately. For the ith summands we minimize
1 1
(53.8.16) (xi − x∗ i )2 + 2 (y i − α − βx∗ i )2
σu2 σv
with respect to x∗ i . The partial derivative is
2 2
(53.8.17) (xi − x∗ i )(−1) + 2 (y i − α − βx∗ i )(−β)
σu2 σv
Setting this zero gives
(53.8.18) σv2 (xi − x∗ i ) + βσu2 (y i − α − βx∗ i ) = 0
1156 53. ERRORS IN VARIABLES
or
σv2 xi + βσu2 (y i − α)
(53.8.19) x∗ i = .
β 2 σu2 + σv2
If one plugs these x∗ i into the objective function one ends up with a surprisingly
simple form:
y i − α − βxi
(53.8.20) xi − x∗ i = −βσu2
β 2 σu2 + σv2
y − α − βxi
(53.8.21) y i − α − βx∗ i = σv2 i 2 2
β σu + σv2
(xi − x∗ i )2 (y − α − βx∗ i )2 (y − α − βxi )2
(53.8.22) 2
+ i 2
= i 2 2 .
σu σv β σu + σv2
The objective function is the sum of this over i:
1 X
(53.8.23) (y − α − βxi )2
β 2 σu2 + σv2 i i
Note: with the true values of α and β, the numerator in (53.8.23) is i c2i where
P
ci = y i − α − βxi , and the denominator σc2 . This form of the objective function can
53.8. P-ESTIMATION 1157
x∗ i
- σu
P
Setting this zero gives (y i − α − βxi ) = 0 or α = ȳ − βx̄. Plugging this α into the
objective function gives
P 2
P ∗ ∗ 2
i (y i − ȳ − βxi + βx̄) i (y i − βxi )
(53.8.28) =: .
β 2 σu2 + σv2 β 2 σu2 + σv2
53.8. P-ESTIMATION 1159
The final step minimizes this with respect to β. Use (u/v)0 = (u0 v − uv 0 )/v 2 to
get the partial with respect to β:
2 i (y ∗i − βx∗i )(−x∗i )(β 2 σu2 + σv2 ) − i (y ∗i − βx∗i )2 2βσu2
P P
(53.8.29) .
(β 2 σu2 + σv2 )2
Setting this zero gives
X X
(53.8.30) (y ∗i − βx∗i )x∗i (β 2 σu2 + σv2 ) + (y ∗i − βx∗i )2 βσu2 = 0.
i i
1 ∗ ∗
P
Using the sample moments S xy = n i xi y i etc., one obtains
(53.8.31) (S xy − βS 2x )(β 2 σu2 + σv2 ) + (S 2y − 2βS xy + β 2 S 2x )βσu2 = 0
or
(53.8.32) β 2 σu2 S xy + β(S 2y σu2 − S 2x σv2 ) + σv2 S xy = 0.
Dividing by σx2 and using κ = σv2 /σu2 one obtains exactly (53.8.7) with the true
moments of the observed variables replaced by their sample moments.
singular). This κ always exists and is uniquely determined: it is the smallest κ for
which Q − κΛ is singular, i.e., the smallest root of the equation det(Q − κΛ) = 0.
Proof: This is true when Q is the identity matrix and Λ is diagonal, then κ is
the inverse of the largest diagonal element of Λ. The general case can always be
transformed into this by a nonsingular transformation (see Rao, Linear Statistical
Inference and Its Applications, p. 41): given a positive definite symmetric Q and a
symmetric Λ, there is always a nonsingular R and a diagonal Γ so that Q = R> R
and Λ = R> ΓR. Therefore Q − κΛ = R> (I − κΓ)R, which is positive semidefinite
if and only if I − κΓ is. Once one has κ, it is no problem to get all those vectors
that annul Q∗ .
Here is an equivalent procedure which gets β ∗ and κ simultaneously. We will
use the following mathematical fact: Given a symmetric positive definite Q and a
symmetric nonnebagive definite matrix Λ. Then the vector γ ∗ annulls Q − κΛ iff it
is a scalar multiple of a β ∗ which has the minimum property that
(53.8.33)
β = β∗ minimizes β > Qβ s. t. β > Λβ = 1,
53.8. P-ESTIMATION 1161
and κ is the minimum value in this minimization problem. Alternatively one can say
that γ ∗ itself has the following maximum property:
(53.8.34)
γ > Λγ
γ = γ∗ maximizes γ > Qγ
minimum problem. This proves the “only if” part, and at the same time shows that
the minimum value in (53.8.33) is κ.
>
For the “if” part assume that β ∗ solves (53.8.33), and β ∗ Qβ ∗ = κ. Then
>
β ∗ (Q − κΛ)β ∗ = 0. To show that Q − κΛ is nonnegative definite, we will assume
there is a γ with γ > (Q − κΛ)γ < 0 and derive a contradiction from this. By the
same argument as above one can construct a scalar multiple γ ∗ with γ ∗ > Λγ ∗ = 1
which still satisfies γ ∗ > (Q − κΛ)γ ∗ < 0. Hense γ ∗ > Qγ ∗ < κ, which contradicts β ∗
being a minimum value.
1162 53. ERRORS IN VARIABLES
If Λ has rank one, i.e., there exists a u with Λ = uu> , then the constraint
in (53.8.33) reads (u> β)2 = 1, or u> β = ±1. Since we are looking for all scalar
multiples of the solution, we can restrict ourselves to u> β = 1, i.e., we are back to
a linearly constrained problem. For Q positive definite I get the solution formula
β ∗ = Q−1 u(u> Q−1 u)−1 . and the minimum value is 1/(u> Q−1 u). This can be
written in a much neater and simpler form for problem (53.8.34); its solution is any
γ ∗ that is a scalar multiple of Q−1 u, and the maximum value is u> Q−1 u.
If one applies this construction to the sample moments instead of the true limiting
moments, one obtains estimates of κ and β, and also of Q∗ and Q. e The estimate
∗
of Q is positive semidefinite by construction, and since κ > 0 (otherwise Q − κΛ
would not be singular) also the estimate of Q e = κΛ is nonnegative definite. In P -
estimation, therefore, the estimates cannot lead to negative variance estimation as
in V -estimation.
53.8.2. The P-Estimator as MLE. One can show that the P -estimator is
MLE in the structural as well as in the functional variant. We will give the proofs
only for the simple EV model.
In the structural variant, in which x∗ i and y ∗ i are independent observations
from a jointly normal distribution, the same argument applies that we used for the
V -estimator: (53.8.9) expresses β as a function of the true moments of the observed
53.8. P-ESTIMATION 1163
variables which are jointly normal; the MLE of these true moments are therefore
the sample moments, and the MLE of a function of these true moments is the same
function of the sample moments. In this case, not only β̂P but also the estimates for
2
σu etc. derived from the Frisch equations are MLE.
In the functional variant, the x∗ i and y ∗ i are nonstochastic and must be max-
imized over (“incidental parameters”) together with the structural parameters of
2
interest α, β, σu , and σv2 . We will discuss this functional variant in the slightly more
general case of heteroskedastic errors, which is a violation of the assumptions made
in the beginning, but which occurs frequently, especially with replicated observations
which we will discuss next. In the functional variant, we have
1 ∗ 2 2
(53.8.35) xi ∼ N (x∗ i , σu
2
i
) fxi (xi ) = p 2
e−(xi −x i ) /2σui ;
2πσui
1 ∗ 2 2
(53.8.36) y i ∼ N (α + βx∗ i , σv2 i ) fyi (Y i ) = p 2
e−(Y i −α−βx i ) /2σvi .
2πσvi
Since the parameters x∗ i , α, and β only appear in the exponents, their maximum
likelihood estimates can be obtained by minimizing the exponent only (and for this,
2
the σu i
and σv2 i must be known only to a joint multiplicative factor). If these
variances do not depend on i, i.e., in the case of homoskedasticity, one is back to the
weighted least squares discussed above.
Also in the case of heteroskedasticity, it is convenient to use the three steps
outlined above. Step 1 always goes through, and one ends up with an objective
function of the form
X (Y i − α − βxi )2 X 1
(53.8.38) = gi (Y i − α − βxi )2 with gi := .
i
β 2 σu
2
i
+ σv2 i i
β 2 σu
2 + σ2
i vi
Step 2 establishes an identity involving α̂, β, and the weighted means of the obser-
vations:
P P
gi xi gi y
(53.8.39) ¯ ¯ ¯
α̂ = ȳ − β x̄ where x̄ := P ȳ := P i .
¯
gi gi
In the general case, step 3 leads to very complicated formulas for β̂ because x̄¯ and ȳ
¯
depend on β through the gi . But there is one situation in which this is not the case:
1
this is when one knows a λ so that σv2 i = λσu 2
i
for all i. Then gi = (β 2 +λ)σ 2 and
ui
53.9. ESTIMATION WHEN THE ERROR COVARIANCE MATRIX IS EXACTLY KNOWN1165
53.9.1. Examples. Age determination by Beta decay of Rb87 into Sr87 [CCCM8
This decay follows the equation
dRb87
(53.9.1) = −λRb87 λ = 1.42 · 10−11 yr−1
dt
1166 53. ERRORS IN VARIABLES
−λt
which has the solution Rb87 = Rb87
0 e . The amount of Sr87 is the amount present
at time 0 plus the amount created by the decay of Sr87 since then:
=Rb87 eλt
z }| {
(53.9.2) Sr87 = Sr087 + Rb87
0 (1 − e−λt )
(53.9.3) = Sr087 + Rb87 (eλt − 1).
If one divides by the stable reference isotope Sr86 which does not change over time:
(53.9.5) y ∗ = α + x∗ β.
The observations of the Sr87 /Sr86 and Rb87 /Sr86 ratios do therefore lie on a straight
line which is called an isochrone.
53.9. ESTIMATION WHEN THE ERROR COVARIANCE MATRIX IS EXACTLY KNOWN1167
This chapter draws on the monograph [WH97]. The authors are Bayesians, but
they attribute the randomness of the parameters not only to the ignorance of the
observer, but also to shifts of the true underlying parameters.
and it is an important estimation issue to distinguish the observation noise from the
random shifts of the parameters.
We will write the model equations observation by observation. As usual, y is the
n-vector of dependent variables, and X the n × k-matrix of independent variables.
But the coefficient vector is different in every period. For all t with 1 ≤ t ≤ n,
the unobserved underlying β t and the observation y t obey the familiar regression
relationship (“observation equation”):
(54.1.1) y t = x>
t β t + εt εt ∼ (0, σ 2 ut )
Here x>t is the tth row of X. The “system equation” models the evolution over time
of the underlying β t :
Finally, the model can also handle the following initial information
(54.1.3) β 0 ∼ (b0 , τ 2 Ψ0 )
but it can also be estimated if no prior information is given (“reference model”). The
scalar disturbance terms εt and the disturbance vectors ω t are mutually independent.
We know the values of all ut and Ξt and κ2 = σ 2 /τ 2 (which can be considered the
54.1. RECURSIVE SOLUTION 1171
y X k B ε
(54.1.4) = ∆ +
n n n
k B k G k B k ω
L
(54.1.5) = +
∆
∞ ∞ ∞
Notation: If y i are observed for i = 1, . . . , t, then we will use the symbols bt for
the best linear predictor of β t based on this information, and τ 2 Ψt for its MSE-
matrix.
The model with prior information is mathematically easier than that without,
because the formulas for bt+1 and Ψt+1 can be derived from those for bt and Ψt
using the following four steps:
(1) The best linear predictor of β t+1 still with the old information, i.e., the y i
are observed only until i = 1, . . . , t, is simply Gt+1 bt . This predictor is unbiased and
its MSE-matrix is
where we use the abbreviation Rt = Gt Ψt−1 G> t + Ξt . This formula encapsules the
prior information about β t+1 .
(2) The best linear predictor of y t+1 , i.e., the one-step-ahead forecast, is ŷ t+1 =
xt+1 Gt+1 bt . This predictor is again unbiased and its MSE is
(54.1.7)
MSE[ŷ t+1 ; y t+1 ] = τ 2 xt+1 (Gt+1 Ψt G> > 2 2 >
t+1 +Ξt+1 )xt+1 +σ ut = τ xt+1 Rt+1 xt+1 +σ ut
2
(3) The joint MSE-matrix of Gt+1 bt and x> t+1 Gt+1 bt as best linear predictors
of β t+1 and y t+1 based on all observations up to and including time t is
x>
>
+ κ2 u t x>
t+1 Gt+1 bt y x R x t+1 Rt+1
; t+1 = τ 2 t+1 t+1 t+1
(54.1.8) MSE
Gt+1 bt β t+1 Rt+1 xt+1 Rt+1 .
(4) Now we are in the situation of Problem 327. After observation of y t+1 we
can use the best linear prediction formula (27.1.15) to get the “posterior” predictor
of β t+1 as
−1
2 −1 2 −1
(54.1.10) bt+1 = xt+1 u−1
t x >
t+1 + κ R t+1 x u
t+1 t
−1
y t+1 + κ R G b
t+1 t+1 t
1 1 −1 −1
(54.1.12) bt+1 = 2
xt+1 x>t+1 + 2 Rt+1
σ ut τ
1 1 −1
> > −
xt+1 x t+1 (x t+1 xt+1 ) xt+1 y t+1 + R t+1 Gt+1 bt
σ 2 ut τ2
All εs and all ω t and β 0 are mutually independent. κ2 = σ 2 /τ 2 is known but σ 2 and
τ 2 separately are not.
General k βt xt b0 bt Ψ0 Ψt Gt ωt Ξt ut
Locally const. 1 βt 1 b0 bt ψ0 ψt 1 ωt 1 1
We can use the general solution formulas derived earlier inserting the specific
values listed in Table 1, but it is more instructive to derive these formulas from
scratch.
1176 54. DYNAMIC LINEAR MODELS
Problem 480. The BLUE of β t+1 based on the observations y 1 , . . . , y t+1 is the
optimal combination of the following two unbiased estimators of β t+1 .
• a. 1 point The estimator is the BLUE of β t before y t+1 was available; call this
estimator bt . For the purposes of this recursion bt is known, it was computed in the
previous iteration, and MSE[bt ; β t ] = τ 2 ψt is known for the same reason. bt is not
only the BLUE of β t based on the observations y 1 , . . . , y t , but it is also the BLUE of
β t+1 based on the observations y 1 , . . . , y t . Compute MSE[bt ; β t+1 ] as a function of
τ 2 ψt .
Answer. Since β t+1 = β t + ω t+1 where ω t+1 ∼ (0, τ 2 ) is independent of bt , bt can also serve
as a predictor of β t+1 , with MSE[bt ; β t+1 ] = τ 2 (ψt + 1).
• b. 1 point The second unbiased estimator is the new observation y t+1 . What
is its MSE as a estimator of β t+1 ?
Answer. Since y t+1 = β t+1 + εt+1 where εt+1 ∼ (0, σ 2 ), clearly MSE[y t+1 ; β t+1 ] = σ 2 .
• c. 3 points The estimation errors of the two unbiased estimators, y t+1 and bt
are independent of each other. Therefore use problem 206 to compute the best linear
combination of these estimators and the MSE of this combination?
54.2. LOCALLY CONSTANT MODEL 1177
Answer. We have to take their weighted average, with weights proportional to the inverses of
the MSE’s.
1
τ 2 (ψt +1) t
b + σ12 y t+1 κ2 bt + (ψt + 1)y t+1
(54.2.4) bt+1 = 1
=
τ 2 (ψt +1)
+ σ12 κ2 + ψ t + 1
(the second formula is obtained from the first by multiplying numerator and denominator by σ 2 (ψt +
1)). The MSE of this pooled estimator is MSE[bt+1 ; β t+1 ] = τ 2 ψt+1 where
κ2 (ψt + 1)
(54.2.5) ψt+1 = .
κ2 + ψt + 1
(54.2.4) and (54.2.5) are recursive formulas, which allow to compute ψt+1 from ψt , and bt+1 from
bt and ψt .
-
In every recursive step we first compute ψt and use this to get the weights of
y t+1 and bt+1 in their weighted average. These two steps can be combined since the
weight of y t+1 is exactly at+1 = ψt /κ2 , which is called the “adaptive coefficient”:
(??) can be written as
for (i in 2:lngth)
{avec[[i]] <- (avec[[i-1]]+kappinv)/(avec[[i-1]]+kappinv+1);
bvec[[i]] <- avec[[i]]*y[[i]]+(1-avec[[i]])*bvec[[i-1]];
}
##For the computation of the one-step-ahead prediction mse
##note that y[-1] is vector y with first observation dropped
##and bvec[-lngth] is bvec with last component dropped
##value returned:
list(coefficients=bvec,
adaptive=avec,
residuals=y-bvec,
mse=sum((y[-1]-bvec[-lngth])^2)/(lngth-1),
discount=1-sqrt(kappinv*(1+0.25*kappinv))+0.5*kappinv)
}
a + κ12
(54.2.9) a=
a + κ12 + 1
1180 54. DYNAMIC LINEAR MODELS
i.e., it depends on κ2 alone. This quadratic equation has one nonnegative solution
r
1 1 1
(54.2.10) a= 2
+ 4− 2
κ 4κ 2κ
Problem 482. Solve the quadratic equation (54.2.9).
gives a = 0.
The pre-limit values also depend on the initial value a1 and can be written (here
d = 1 − a is the “discount factor”)
where K and k are defined in (54.3.1). The MSE-matrix of b is the inverse of the
lower right partition of the inverse covariance matrix (54.3.5), which is τ 2 K −1 .
For the proof of equations (54.3.2) and (54.3.3) note first that
(54.3.11)
MSE[Gbt ; β t+1 ] = E (Gbt − β t+1 )(Gbt − β t+1 )>
Answer. Multiply the matrix with its alleged inverse and see whether you get I:
GK −1 Ξ−1 −1 −1
Gt+1 Ξ−1
>
> −1 >
t G + Ξt+1 t+1 − Ξt+1 Gt+1 (Gt+1 Ξt+1 Gt+1 + K t ) t+1 =
= I + GK −1 −1 > −1 > −1
t − GK t G Ξt+1 Gt+1 (Gt+1 Ξt+1 Gt+1 + K t )
−1
−
−1 −1
− Gt+1 (G>
t+1 Ξt+1 Gt+1 + K t )
−1
G>
t+1 Ξt+1 =
= I + GK −1
t I − G> Ξ−1 > −1
t+1 Gt+1 (Gt+1 Ξt+1 Gt+1 + K t )
−1
−
−1 −1
− K t (G>
t+1 Ξt+1 Gt+1 + K t )
−1
G>
t+1 Ξt+1 = I
Now let us see what this looks like in the simplest case where X = x has only
one column, and Ξt = 1 and ut = 1. For clarity I am using here capital letters for
54.3. THE REFERENCE MODEL 1185
certain scalars
xt y t
(54.3.16) k t = ht +
κ2
x2
(54.3.17) Kt = Ht + t2
κ
kt
(54.3.18) ht+1 = (1 + Kt )−1 k t =
1 + Kt
Kt
(54.3.19) Ht+1 = 1 − (1 + Kt )−1 =
1 + Kt
Starting values are h1 = 0 and H1 = 0; then
x1 y 1 x21
k1 = K1 =
κ2 κ2
k1 K1
h2 = H2 =
1 + K1 1 + K1
etc.
Problem 485. Write a program for the dynamic regression line through the
origin, reference model, in the programming language of your choice, or write a
macro in the spreadsheet of your choice.
1186 54. DYNAMIC LINEAR MODELS
Answer. The R-code is in Table 2. If argument x is missing, the locally constant model will be
estimated. Note that y[-1] is vector y with first observation dropped, and since lngth is the length
of the vectors, bvec[-lngth] is bvec with last component dropped. The last line is the expression
returned by the function.
.........
.... . ......
...... ........ .............
.
.
........... . ...... .... .............. ........... .....
.............
... ....... .......... ... . . ..... ......
..... . ...
... ..... ............. ......
.
.. ..... .. ...
. .. ............ .... .....
....
...... . ...... ... . ..
..... .............. .
.
..
..
.. .. .. . .... . ..... ..... ..... ..
.
... ......... ....... . ............... .......
... .
. . ...............
. .. ..... . ....... .... .. .. ..
....
....... ...... ..
.. ..
. ......... ............. ............. .......
.
.
. .. .... ............... ..
.
.......... .. . .. . . ..
. ............... ..... .. . ........ ...... . ... ............. .... .... . . ...
...... .... ...... ..... ......... ............. ........... .. ...... ... ..................... ..
.. ........ ... ........ ............. . ... . .. ... ............................. ..........................
..
................. ....... ..... .... .......
. ........ .... .. ............................................ .......... .. ..... ..
.
. . . .. . .
. .
.. . .. . .
.. . .
..
..
.. .......... . ........... . .... ....................
........... ............ ..... ....
.... .............. ....
...... .........
.... ......
....... .....
.....
.....
.
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02
Answer. Figure 1 plots the daily levels of the exchange rate. It has a lot of detail which one
can only see if one magnifies the plot on the pdf-reader. Similar graphs are in [WH97, p. 67] and
[Gut94, Figure 14.1 on p. 370].
The following description of what you see in an exchange rate graph borrows heavily from
[Gut94, p. 369].
Secular trend: In long run exchange rates reflect a country’s competitiveness in the interna-
tional hierarchy of nations. When a country manages to strengthen its competitive position in the
world market, its external accounts improve and its currency appreciates (e.g. Germany, Japan).
The reverse happens if a country faces gradual erosion (Great Britain, United States). Therefore
you see steady runs of gradual linear advances or declines over many years.
Business cycle: Woven around this secular trend are cycles of 4–7 years. “This pattern suggests
that exchange-rate movements trigger counteracting adjustments in goods and assets markets. But
these effects take time to unfold, and in the meantime foreign-exchange markets overshoot. The
overshooting sets the stage for the next phase of the cycle, when it has finally begun to turn around
such economic fundamentals as inflation and the direction of macroeconomic policy.”
Is there an even shorter cycle due to inventories and the time it takes to find new suppliers?
Shorter-term exchange rate fluctuations lasting a few weeks or months are due to expectations
and speculation: “At times expectational biases are widely differentiated, and the markets move
sideways. But most of the time we can see pronounced price movements in one direction reflecting
widely shared market sentiments. These speculative “runs” usually last a few weeks or months
before being temporarily interrupted by even shorter countermovements. Because runs outweigh
corrections, they reinforce whatever phase of the currency cycle we are in.”
Daily variability: Despite these regularities, exchange rates are very volatile in the short run:
they often fluctuate 1–2% per day.
1190 54. DYNAMIC LINEAR MODELS
Then there are several complete changes in regime which are due to institutional changes in
the monetary system. In August 1971, when Nixon abolished the convertibility of the dollar, at the
beginning of 1993, with increasing European monetary integration, and at the beginning of 1999,
with the introduction of the euro.
that the modeller does not wish to anticipate the form of this longer term variation,
merely describing it as a purely stochastic process.”
Problem 487. Simulate and plot time series whose first differences follow a
locally constant model with various values of κ2 so that you become familiar with the
forms of behavior such series can display.
Problem 488. The dataset dolperpd has two variables; the first variable, dates,
is the date coded as an integer, namely the number of days since midnight January 1,
1970 (i.e., January 1, 1970 is day number 1). The second variable dolperpd is the
exchange rate of the British Pound in terms of the dollar at noon of that day in New
York City. The data go from 1971 until 1989, they are a subset of the data plotted
in figure 1. These data are included in R-library ecmet, and a text file with the data
is available on the web at www.econ.utah.edu/ehrbar/data/dolperpd.txt.
• a. Compute the dynamic linear reference model for various values of κ2 and
look at the average squared forecasting error. For which value of κ2 is it lowest?
Interpret your result.
Answer. If one takes the daily data, the SSE becomes lowest as κ2 → 0. This means, one gets
best forecasting performance if one treats the data as a random walk. The most recent observation
is the best predictor of the future, there is no such thing as an “underlying level.”
1192 54. DYNAMIC LINEAR MODELS
• b. Now take the weekly averages of these data, and the monthly averages, and
see which κ2 minimizes the SSE.
• c. Now take the first differences of the weekly and monthly averages, and see
which κ2 minimizes the forecasting error.
Answer. Now a nonzero κ2 minimizes the SSE. What does this mean? Instead of fluctuating
around slowly varying levels, the data fluctuate around slowly changing straight lines.
• d. Make several simulations of datasets which have the same length as the
datasets which you started out with, using the optimal κ2 . What differences do you
see between your original and the simulated datasets?
Answer. On the one hand, the original data are bounded; they move in a fixed range, while
the simulated data wander off randomly also in the long run. Therefore the plots can be misleading;
if the range is large, it is compressed and the data look much smoother than they are if the range
happens to be comparable to that of the economic data. On the other, the economic data have
different behavior if the series goes up than if it goes down.
54.5. COMPANY MARKET SHARE 1193
With β t being the company’s market share, this movement of the line can be
modeled as a dynamic zero-intercept regression model:
(54.5.1) y t = xt β t + εt εt ∼ NID(0, σ 2 )
(54.5.2) β t = β t−1 + ω t ω t ∼ NID(0, τ 2 )
(54.5.3) β 0 ∼ (b0 , τ 2 ψ0 )
and the MSE of this pooled estimator is MSE[bt+1 ; β t+1 ] = τ 2 ψt+1 where
κ2 (ψt + 1)
(54.5.5) ψt+1 = .
κ2 + ψ t + 1
Here it is convenient to define at = xt ψt /κ2 so that
(54.5.6) bt+1 = at+1 y t+1 + (1 − xt+1 at+1 )bt = bt + at+1 (y t+1 − xt+1 bt )
with the recursive relation
at 1
xt+1 xt + κ2
(54.5.7) at+1 = at 1
.
1 + x2t+1 xt + κ2
the estimate with each data point. In analogy with the dynamic model, we write it
1198 54. DYNAMIC LINEAR MODELS
as follows
(54.5.9) y t = xt β + εt εt ∼ NID(0, σ 2 )
(54.5.10) β 0 ∼ (b0 , σ 2 ψ0 ).
Here β 0 is the prior estimate of β before any data are available. All εs are assumed
independent of β 0 .
Let’s compute again recursively the best estimate of β given the observation of
y 1 , . . . , y t+1 . We have two pieces of information about β. On the one hand, bt is the
best estimator of β; it is unbiased with MSE[bt ; β t+1 ] = σ 2 ψt . On the other hand,
y 2
y t+1 = xt+1 β + εt+1 where εt+1 ∼ (0, σ 2 ), therefore MSE[ xt+1 t+1
; β] = xσ2 . These two
t+1
pieces of information are independent of each other. To combine them optimally,
take their weighted average, with weights proportional to the inverses of the MSE’s.
1 x2t+1 y t+1
σ 2 ψt bt +σ 2 xt+1 bt + xt+1 y t+1 ψt
(54.5.11) bt+1 = =
1 x2t+1 1 + x2t+1 ψt
σ 2 ψt + σ 2
54.5. COMPANY MARKET SHARE 1199
much closer to the observed one, although it still underpredicts. A κ2 = 10, 000
gives a better fit, but now there is very little smoothing going on, and there is still
underprediction. The problem is that the dynamic line through the origin allows the
line to move, but assumes that there will be zero movement even if the line has been
moving in the same direction for a long time. Apparently there is some momentum in
the movement of the market share: the product’s market share is trending upwards
and the predictions should take this into consideration. The linear growth model
(54.6.1) – (54.6.3) does exactly this, and if one looks at the one-step-ahead forecasts
now, finally there is no longer underprediction, but one sees a very clear seasonal
pattern which should be tackled next.
West and Harrison [WH97, Section 3.4.2 on pp. 84–91] use a prior not only for
the means but also for the variance and estimate the variance from the data too. We
are using a simpler model, therefore our results and theirs are a little different.
data(milkprod) makes these data available. These data can also be downloaded as
a text file from www.econ.utah.edu/ehrbar/data/milkprod.txt.
• a. Plot both milk and cows against time, plot them against each other, and
plot their ratio milk/market against time. What do you see?
• b. West and Harrison compare the forecasting performance of their dynamic
straight line through the origin with that of a static straight line, and say that the
dynamic model, although not perfect, is much to be preferred. One of the criteria of
a good model is its forecasting ability. Plot the one-step ahead forecasting errors in
both of these models into the same figure.
a
a a
a a a a a
aq aq a a a
q q q q q q q q q q
q
The plots just made show that the dynamic linear model is still not satisfactory.
A better fitting dynamic linear model with the same data is estimated in [NT92].
(54.6.1) y t = xt β t + εt εt ∼ (0, κ2 τ 2 )
(54.6.2) β t = β t−1 + δ t + ut ut ∼ (0, τ 2 )
(54.6.3) δ t = δ t−1 + v t v t ∼ (0, λ2 τ 2 )
This can be considered a linear growth curve model. β t is the annual milk output
per cow in year t, i.e., it measures productivity. This productivity increases from
year to year. The productivity increase between year t − 1 and year t is δ t + ut . The
first component is persistent; it does not change much between consecutive years
but follows a random walk. The second component is transitory with zero expected
value. I.e., the three error terms in this model have three different meanings: v t are
persistent random shocks in the yearly productivity increases, ut are transient annual
fluctuations in productivity, and εt represents the discrepancy between actual output
and productivity-determined normal output. All three are assumed independent. We
will use the notation var[v t ] = τ 2 var[ε] = σ 2 = κ2 τ 2 , and var[ut ] = λ2 τ 2 . There are
not enough data to estimate the relative variances of the different error terms as in
the exchange rate example; here prior information enters the model.
1204 54. DYNAMIC LINEAR MODELS
Problem 491. [NT92] This is an exercise about the growth model (54.6.1) –
(54.6.3).
• a. 3 points Describe the intuitive meaning of β t , δ t , the three disturbances, and
the two parameters κ and λ.
• b. 2 points Show how this model can be fitted in the framework of the dynamic
linear model as defined here. Note that in this framework the unobserved random pa-
rameters linearly depend on their lagged values, while in equation (54.6.2) β t depends
on δ t instead of δ t−1 . But there is a trick to get around this.
Answer. The trick is to replace δ t in equation (54.6.2) with δ t−1 + v t :
(54.6.4) β t = β t−1 + δ t−1 + v t + ut = β t−1 + δ t−1 + ω t
(54.6.5) y t = xt β t + εt
(54.6.6) β t = β t−1 + δ t−1 + ω t
(54.6.7) δ t = δ t−1 + v t
where ω t = v t + ut . The new disturbances ω t and v t are no longer independent. From the original
2 ut 0 1 0
(54.6.8) εt ∼ IID(0, σ ) ∼ IID( , τ2 )
vt 0 0 λ2
54.6. PRODUCTIVITY IN MILK PRODUCTION 1205
follows
ωt 0 1 + λ2 λ2
(54.6.9) εt ∼ IID(0, σ 2 ) ∼ IID( , τ2 )
vt 0 λ2 λ2
• c. Plug the matrices in Table 6 into the formulas for the reference estimator
(54.3.1), (54.3.2), and (54.3.3), and develop simple formulas without matrix notation,
which can then be programmed in a spreadsheet or other application.
CHAPTER 55
Numerical Minimization
1207
1208 55. NUMERICAL MINIMIZATION
(55.0.10) θ i+1 = θ i + αi di .
Here di , a vector, is the step direction, and αi , a scalar, is the step size. The choice of
the step direction is the main characteristic of the program. Most programs (notable
exception: simulated annealing) always choose directions at every step along which
the objective function slopes downward, so that one will get lower values of the
objective function for small increments in that direction. The step size is then chosen
such that the objective function actually decreases. In elaborate cases, the step size
is chosen to be that traveling distance in the step direction which gives the best
improvement in the objective function, but it is not always efficient to spend this
much time on the step size.
Let us take a closer look how to determine the step direction. If g > i = (g(θ i ))
>
is the Jacobian of f at θ i , i.e., the row vector consisting of the partial derivatives of
f , then the objective function will slope down along direction di if the scalar product
g>i di is negative. In determining the step direction, the following fact is useful: All
vectors di for which g >i di < 0 can be obtained by premultiplying the transpose of the
55. NUMERICAL MINIMIZATION 1209
Problem 492. 4 points Here is a proof for those who are interested in this issue:
Prove that g > d < 0 if and only if d = −Rg for some positive definite symmetric
matrix R. Hint: to prove the “only if ” part use R = I − gg > /(g > g) − dd> /(d> g).
This formula is from [Bar74, p. 86]. To prove that R is positive definite, note that
R = Q + S with both Q = I − gg > /(g > g) and S = −dd> /(d> g) nonnegative
definite. It is therefore sufficient to show that any x 6= o for which x> Qx = 0
satisfies x> Sx > 0.
Answer. If R is positive definite, then d = −Rg clearly satisfies d> g < 0. Conversely, for
any d satisfying d> g < 0, define R = I − gg > /(g > g) − dd> /(d> g). One immediately checks that
d = −Rg. To prove that R is positive definite, note that R is the sum of two nonnegative definite
matrices Q = I − gg > /(g > g) and S = −dd> /(d> g). It is therefore sufficient to show that any
x 6= o for which x> Qx = 0 satisfies x> Sx > 0. Indeed, if x> Qx = 0, then already Qx = o, which
gg > x
means x = g> g
. Therefore
1210 55. NUMERICAL MINIMIZATION
(55.0.12) θ i+1 = θ i − αi Ri g i
The most important ingredient here is the choice of Ri . We will discuss two “natural”
choices.
The choice which immediately comes to mind is to set Ri = I, i.e., di = −αi g i .
Since the gradient vector shows into the direction where the slope is steepest, this is
called the method of steepest descent. However this choice is not as natural as one
might first think. There is no benefit to finding the steepest direction, since one can
easily increase the step length. It is much more important to find a direction which
allows one to go down for a long time—and for this one should also consider how the
gradient is changing. The fact that the direction of steepest descent changes if one
changes the scaling of the variables, is another indication that selecting the steepest
descent is not a natural criterion.
The most “natural” choice for Ri is the inverse of the “Hessian matrix” G(θ i ),
which is the matrix of second partial derivatives of f , evaluated at θ i . This is called
the Newton-Raphson method. If the inverse Hessian is positive definite, the Newton
Raphson method amounts to making a Taylor development of f around the so far
55. NUMERICAL MINIMIZATION 1211
best point θ i , breaking this Taylor development off after the quadratic term (so
that one gets a quadratic function which at point θ i has the same first and second
derivatives as the given objective function), and choosing θ i+1 to be the minimum
point of this quadratic approximation to the objective function.
Here is a proof that one accomplishes all this if Ri is the inverse Hessian. The
quadratic approximation (second order Taylor development) of f around θ i is
> 1
(55.0.13) f (θ) ≈ f (θ i ) + g(θ i ) (θ − θ i ) + (θ − θ i )> G(θ i )(θ − θ i ).
2
By theorem 55.0.1, the minimum argument of this quadratic approximation is
−1
(55.0.14) θ i+1 = θ i − G(θ i ) g(θ i ),
which is the above procedure with step size 1 and Ri = (G(θ i ))−1 .
1
(55.0.15) q : z 7→ g > z + z > Gz is x = −G−1 g.
2
1212 55. NUMERICAL MINIMIZATION
There are many modifications of the Newton-Raphson method which get around
computing the Hessian and inverting it at every step and at the same time ensure
that the matrix Ri is always positive definite by using an updating formula for Ri ,
which turns Ri , after sufficiently many steps into the inverse Hessian. These are
probably the most often used methods. A popular one used by the gauss software
is the Davidson-Fletcher-Powell algorithm.
One drawback of all these methods using matrices is the fact that the size of
the matrix Ri increases with the square of the number of variables. For problems
with large numbers of variables, memory limitations in the computer make it nec-
essary to use methods which do without such a matrix. A method to do this is the
“conjugate gradient method.” If it is too difficult to compute the gradient vector,
the “conjugate direction method” may also compare favorably with computing the
gradient numerically.
CHAPTER 56
1215
1216 56. NONLINEAR LEAST SQUARES
If the errors are normally distributed, then nonlinear least squares is equal to the
maximum likelihood estimator. (But this is only true as long as the covariance matrix
is spherical as assumed here.)
56. NONLINEAR LEAST SQUARES 1217
(56.0.26) y 1 = η1 (β1 , β2 , · · · , βk ) + ε1
(56.0.27) y 2 = η2 (β1 , β2 , · · · , βk ) + ε2
.. ..
(56.0.28) . .
(56.0.29) y n = ηn (β1 , β2 , · · · , βk ) + εn
Usually there are other independent variables involved in η which are not shown here
explicitly because they are not needed for the results proved here.
1218 56. NONLINEAR LEAST SQUARES
Next we will derive the first-order conditions, and then describe how to run
the linearized Gauss-Newton regression. For this we need some notation. For an
arbitrary but fixed vector β i (below it will be the ith approximation to the nonlinear
least squares parameter estimate) we will denote the Jacobian matrix of the function
η evaluated at β i with the symbol X(β i ), i.e., X(β i ) = ∂η(β)/∂β > (β i ). X(β i ) is
called the matrix of pseudoregressors at β i . The mh-th element of X(β i ) is
∂ηm
(56.0.32) xmh (β i ) = (β ),
∂βh i
i.e., X(β i ) is the matrix of partial derivatives evaluated at β i
∂η1 ∂η1
∂β1 (β i ) · · · ∂βk (β i )
(56.0.33)
X(β i ) = ..
,
.
∂ηn ∂ηn
∂β1 (β i ) · · · ∂βk (β i )
but X(β i ) should first and foremost be thought of as the coefficient matrix of the
best linear approximation of the function η at the point β i . In other words, it is the
matrix which appears in the Taylor expansion of η(β) around β i :
This Jacobian is a row vector because the objective function is a scalar function. We
need the chain rule (C.1.23) to compute it. In the present situation it is useful to
break our function into three pieces and apply the chain rule for three steps:
(56.0.36)
∂SSE/∂β > = ∂SSE/∂ ε̂> ·∂ ε̂/∂η > ·∂η/∂β > = 2ε̂> ·(−I)·X(β) = −2(y−η(β))> X(β)
Problem 494. 3 points Compute the Jacobian of the nonlinear least squares
objective function
where η(β) is a vector function of a vector argument. Do not use matrix differentia-
tion but compute it element by element and then verify that it is the same as equation
(56.0.36).
56. NONLINEAR LEAST SQUARES 1221
Answer.
n
X
(56.0.38) SSE = (y t − ηt (β))2
t=1
n
∂SSE X ∂ηt
(56.0.39) = 2(y t − ηt (β)) · (− )
∂βh ∂βh
t=1
X ∂ηt
(56.0.40) = −2 (y t − ηt (β))
∂βh
t
∂η1 ∂ηn
(56.0.41) = −2 (y 1 − η1 (β)) + · · · + (y n − ηn (β))
∂βh ∂βh
∂η1
∂β. h
(56.0.42) = −2 y 1 − η1 (β) ··· y n − ηn (β) ..
∂ηn
∂βh
Therefore
∂η1 ∂η1
∂β1
··· ∂βk
∂SSE
··· ∂SSE
..
(56.0.43) = −2 y 1 − η1 (β) ··· y n − ηn (β)
.
∂β1 ∂βk .
∂ηn ∂ηn
∂β1
··· ∂βk
1222 56. NONLINEAR LEAST SQUARES
Problem 495. 6 points [DM93, p. 178], which is very similar to [Gre97, (10-2)
on p. 450]: You are estimating by nonlinear least squares the model
(56.0.48) y t = α + βxt + γztδ + εt or y = αι + βx + γz δ + ε
You are using the iterative Newton-Raphson algorithm.
• a. In the ith step you have obtained the vector of estimates
α̂
β̂
(56.0.49) β̂ i =
γ̂ .
δ̂
Write down the matrix X of pseudoregressors, the first order conditions, the Gauss-
Newton regression at the given parameter values, and the updated estimate β̂ i+1 .
Answer. The matrix of pseudoregressors is, column by column,
(56.0.50) X = ∂η/∂α ∂η/∂β ∂η/∂γ ∂η/∂δ
where η(α, β, γ, δ) == αι+βx+γz δ . From ∂ηt /∂α = 1 follows ∂η/∂α = ι; from ∂ηt /∂β = xt follows
∂η/∂β = x; from ∂ηt /∂γ = ztδ follows ∂η/∂γ = z δ (which is the vector taken to the δth power
∂ ∂
element by element). And from ∂ηt /∂δ = ∂δ γztδ = ∂δ γ exp(δ log(zt )) = γ log(zt ) exp(δ(log zt )) =
γ log(zt )ztδ follows ∂η/∂δ = γ log(z) ∗ z δ where ∗ denotes the Hadamard
product of two matrices
(their element-wise multiplication). Putting it together gives X = ι x z δ γ log(z) ∗ z δ .
1224 56. NONLINEAR LEAST SQUARES
Write the first order conditions (56.0.45) in the form X > (β)(y − η(β)) = o which gives here
ι>
x >
(56.0.51) (y − ια − xβ − z δ γ) = o
> δ
z
δ
γ log(z > ) ∗ z >
or, element by element,
X
(56.0.52) (y t − α − βxt − γztδ ) = 0
t
X
(56.0.53) xt (y t − α − βxt − γztδ ) = 0
t
X
(56.0.54) ztδ (y t − α − βxt − γztδ ) = 0
t
X
(56.0.55) γ log(zt )ztδ (y t − α − xt β − ztδ γ) = 0
t
(56.0.56) y t − α̂ − β̂xt − γ̂ztδ̂ = a + bxt + cztδ̂ + dγ̂ log(zt )ztδ̂ + error term
56. NONLINEAR LEAST SQUARES 1225
• b. How would you obtain the starting value for the Newton-Raphson algorithm?
Answer. One possible set of starting values would be to set δ̂ = 1 and to get α̂, β̂, and γ̂ from
the linear regression.
The Gauss-Newton algorithm runs this regression and uses the OLS estimate δ̂ i
of δ i to define β i+1 = β i + δ̂ i . The recursion formula is therefore
(56.0.59) β i+1 = β i + δ̂ i = β i + ((X(β i ))> X(β i ))−1 (X(β i ))> (y − η(β i )).
The notation (η(β))> = η > (β) and (X(β))> = X > (β) makes this perhaps a little
easier to read:
(56.0.60) β i+1 = β i + (X > (β i )X(β i ))−1 X > (β i )(y − η(β i )).
1226 56. NONLINEAR LEAST SQUARES
This is called the J test. A mathematical simplification, called the P-test, would be
ˆ
to get an estimate β̂ of β from the first model, and use the linearized version of η 0
ˆ
around β̂, i.e., replace η 0 in the above regression by
ˆ ˆ ˆ
(56.1.6) η 0 (β̂) + X 0 (β̂)(β − β̂).
If one does this, one gets the linear regression
(56.1.7) ˆ = Xδ + α(ŷ
y − ŷ ˆ − ŷ
ˆ )
0 1 0
ˆ
ˆ = η (β̂), ˆ
where ŷ 0 0 and δ = (1 − α)(β − β̂), and one simply has to test for α = 0.
Problem 496. Computer Assignment: The data in table 10.1 are in the file
/home/econ/ehrbar/ec781/consum.txt and they will also be sent to you by e-mail.
Here are the commands to enter them into SAS:
56.1. THE J TEST 1229
der.b1=exp(g1 * log(y));
der.g1=b1*(exp(g1*log(y)))*log(y);
run;
Some statisticians also believe that even under ideal circumstances the MLE attains
its asymptotic properties more slowly than NLS.
Problem 497. This is [DM93, pp. 243 and 284]. The model is
(56.2.2) y γt = α + βxt + εt
with ε ∼ N (o, σ 2 I). y i > 0.
• a. 1 point Why can this model not be estimated with nonlinear least squares?
Answer. If all the y’s are greater than unity, then the SSE can be made arbitrarily small by
letting γ tend to −∞ and setting α and β zero. γ = 1, α = 1, and β = 0 leads to a zero SSE as
well. The idea behind LS is to fit the curve to the data. If γ changes, the data points themselves
move. We already saw when we discussed the R2 that there is no good way to compare SSE’s for
different y’s. (How about the information matrix: is it block-diagonal? Are the Kmenta-Oberhofer
conditions applicable?)
Answer. This requires the transformation theorem for densities. εt = y γt − α − βxt ; therefore
∂εt /∂y t = γy γ−1
t and ∂εt /∂y s = 0 for s 6= t. TheQ
Jacobian has this in the Qdiagonal and 0 in the
off-diagonal, therefore the determinant is J = γ n ( y t )γ−1 and |J| = |γ|n ( y t )γ−1 . This gives
1232 56. NONLINEAR LEAST SQUARES
the above formula: which I assume is right, it is from [DM93], but somewhere [DM93] has a
typo.
1233
1234 57. APPLICATIONS WITH NONSPHERICAL COVARIANCE
Answer. A is square, since XA = ΨX, i.e., XA has has as many columns as X. Now assume
Ac = o. Then XAc = o or ΨXc = o, and since Ψ is nonsingular this gives Xc = o, and since X
has full column rank, this gives c = o.
• d. 2 points Show that in this case (X > Ψ−1 X)−1 X > Ψ−1 = (X > X)−1 X > ,
i.e., the OLS is BLUE (“Kruskal’s theorem”).
> −1 −1 > −1 −1 > >
−1 −1 > > > −1 > −1 >
Answer. (X Ψ X) X Ψ = (A ) X X (A ) X = (X X) A (A ) X
(X > X)−1 X >
of a random variable x with zero mean and finite fourth moments, it follows
plim n1
P 4
n x4 E[x4 ]
P
var[β̂ OLS ] x
(57.2.6) plim = plim P 2 2 = =
(plim n1 (E[x2 ])2
P
var[β̂] ( x ) x2 )2
This is the kurtosis (without subtracting the 3). Theoretically it can be anything
≥ 1, the Normal distribution has kurtosis 3, and the economics time series usually
have a kurtosis between 2 and 4.
• c. 3 points Using matrix identity (A.8.20) (for ordinary inverses, not for g-
inverses) show that the generalized least squares formula for the BLUE in this model
is equivalent to the ordinary least squares formula. In other words, show that the
sample mean ȳ is the BLUE of µ.
Answer. Setting γ = τ 2 /σ 2 , we want to show that
ιι> −1 −1 > ιι> −1 −1 > −1
(57.3.2) ι> (I + ) ι ι (I + ) y = ι> I −1 ι ι I y.
γ γ
This is even true for arbitrary h and A:
hh> −1 γ
(57.3.3) h> (A + ) = h> A−1 ;
γ γ + h> A−1 h
hh> −1 −1 γ + h> A−1 h 1 1
(57.3.4) h> (A + ) h = = > −1 + ;
γ γh> A−1 h h A h γ
1240 57. APPLICATIONS WITH NONSPHERICAL COVARIANCE
Now multiply the left sides and the righthand sides (use middle term in (57.3.4))
• d. 3 points [Gre97, Example 11.1 on pp. 499/500]: Show that var[ȳ] does not
converge to zero as n → ∞ while ρ remains constant.
Answer. By (57.3.4),
1 1 1−ρ τ2
(57.3.6) var[ȳ] = τ 2 ( + ) = σ2 ( + ρ) = + ω2
n γ n n
As n → ∞ this converges towards ω 2 , not to 0.
Problem 502. [Chr87, pp. 361–363] Assume there are 1000 families in a
1
P1000
certain town, and denote the income of family k by zk . Let µ = 1000 k=1 zk
be the population
P1000 average of all 1000 incomes in this finite population, and let
1
σ 2 = 1000 (z
k=1 k − µ)2
be the population variance of the incomes. For the pur-
poses of this question, the zk are nonrandom, therefore µ and σ 2 are nonrandom as
well.
You pick at random 20 families without replacement, ask them what their income
is, and you want to compute the BLUE of µ on the basis of this random sample. Call
57.3. EQUICORRELATED COVARIANCE MATRIX 1241
the incomes in the sample y 1 , . . . , y 20 . We are using the letters y i instead of zi for this
sample, because y 1 is not necessarily z1 , i.e., the income of family 1, but it may be,
e.g., z258 . The y i are random. The process of taking the sample of y i is represented
by a 20 × 1000 matrix of random variables q ik (i = 1, . . . , 20, k = 1, . . . , 1000) with:
q ik = 1 if family k has been picked as ith family in the sample, and 0 otherwise. In
P1000
other words, y i = k=1 q ik zk or y = Qz.
For these formulas you need the rules how to take expected values of discrete random
variables.
Answer. Since q ik is a zero-one variable, E[q ik ] = Pr[q ik = 1] = 1/1000. This is obvious if
i = 1, and one can use a symmetry argument that it should not depend on i. And since for a zero-
one variable, q 2ik = q ik , it follows E[q 2ik ] = 1/1000 too. Now for i 6= j, k 6= l, E[q ik q jl ] = Pr[q ik =
1 ∩ q jl = 1] = (1/1000)(1/999). Again this is obvious for i = 1 and j = 2, and can be extended by
symmetry to arbitrary pairs i 6= j. For i 6= j, E[q ik q jk ] = 0 since zk cannot be chosen twice, and
for k 6= l, E[q ik q il ] = 0 since only one zk can be chosen as the ith element in the sample.
P1000
• c. Since k=1 q ik = 1 for all i, one can write
1000
X
(57.3.8) yi = µ + q ik (zk − µ) = µ + εi
k=1
P1000
where εi = k=1 q ik (zk − µ). Show that
(57.3.9) E[εi ] = 0 var[εi ] = σ 2
cov[εi , εj ] = −σ 2 /999 for i 6= j
P1000
Hint: For the covariance note that from 0 = k=1 (zk − µ) follows
(57.3.10)
1000
X 1000
X X 1000
X X
0= (zk −µ) (zl −µ) = (zk −µ)(zl −µ)+ (zk −µ)2 = (zk −µ)(zl −µ)+1000
k=1 l=1 k6=l k=1 k6=l
57.3. EQUICORRELATED COVARIANCE MATRIX 1243
Answer.
1000 1000
X X zk − µ
(57.3.11) E[εi ] = (zk − µ) E[q ik ] = =0
1000
k=1 k=1
1000 1000
X X (zk − µ)2
(57.3.12) var[εi ] = E[ε2i ] = (zk − µ)(zl − µ) E[q ik q il ] = = σ2
1000
k,l=1 k=1
and for i 6= j follows, using the hint for the last equal-sign
(57.3.13)
1000
X X (zk − µ)(zl − µ)
cov[εi , εj ] = E[εi εj ] = (zk − µ)(zl − µ) E[q ik q jl ] = = −σ 2 /999.
1000 · 999
k,l=1 k6=l
With ι20 being the 20 × 1 column vector consisting of ones, one can therefore
write in matrix notation
y = ι20 µ + ε ε] = o
E[ε ε] = σ 2 Ψ
V [ε
1244 57. APPLICATIONS WITH NONSPHERICAL COVARIANCE
where
1 −1/999 · · · −1/999
−1/999 1 ··· −1/999
(57.3.14) Ψ= .
. .. ..
.. .. . .
−1/999 −1/999 · · · 1
From what we know about GLS with equicorrelated errors (question 501) follows
therefore that the sample mean ȳ is the BLUE of µ. (This last part was an explanation
of the relevance of the question, you are not required to prove it.)
CHAPTER 58
If Ψ depends on certain unknown parameters which are not, at the same time,
components of β or functions thereof, and if a consistent estimate of these parameters
is available, then GLS with this estimated covariance matrix, called “feasible GLS,”
is usually asymptotically efficient. This is an important result: one does not not
need an efficient estimate of the covariance matrix to get efficient estimates of β! In
this case, all the results are asymptotically valid, with Ψ̂ in the formulas instead of
Ψ. These estimates are sometimes even unbiased!
1245
1246 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
58.1. Heteroskedasticity
Heteroskedasticity means: error terms are independent, but do not have equal
variances. There are not enough data to get consistent estimates of all error variances,
therefore we need additional information.
The simplest kind of additional information is that the sample can be partitioned
into two different subsets, each subset corresponding to a different error variance,
with the relative variances known. Write the model as
2
y1 X1 ε ε κ I O
(58.1.1) = β+ 1 ; V [ 1 ] = σ2 1 = Φ.
y2 X2 ε2 ε2 O κ22 I
To make this formula operational, we have to replace the κ2i by estimates. The
simplest way (if each subset has at least k + 1 observations) is to use the unbiased
estimates s2i (i = 1, 2) from the OLS regressions on the two subsets separately.
Associated with this estimation is also an easy test, the Goldfeld Quandt test [Gre97,
551/2]. simply use an F -test on the ratio s22 /s21 ; but reject if it is too big or too
58.1. HETEROSKEDASTICITY 1247
small. If we don’t have the lower significance points, check s21 /s22 if it is > 1 and
s22 /s21 otherwise.
in which X 1 is a 10×5 and X 2 a 20×5 matrix, you run the two regressions separately
and you get s21 = 500 and s22 = 100. Can you reject at the 5% significance level that
these variances are equal? Can you reject it at the 1% level? The enclosed tables are
from [Sch59, pp. 424–33].
Answer. The distribution of the ratio of estimated variances is s22 /s21 ∼ F15,5 , but since its
observed value is smaller than 1, use instead s21 /s22 ∼ F5,15 . The upper significance points for 0.005%
F(5,15;0.005) = 5.37 (which gives a two-sided 1% significance level), for 1% it is F(5.15;0.01) = 4.56
(which gives a two-sided 2% significance level), for 2.5% F(5,15;0.025) = 3.58 (which gives a two-sided
5% significance level), and for 5% it is F(5,15;0.05) = 2.90 (which gives a two-sided 10% significance
level). A table can be found for instance in [Sch59, pp. 428/9]. To get the upper 2.5% point one
can also use the Splus-command qf(1-5/200,5,15). One can also get the lower significance points
simply by the command qf(5/200,5,15). The test is therefore significant at the 5% level but not
significant at the 1% level.
1248 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
Since the so-called Kmenta-Oberhofer conditions are satisfied, i.e., since Ψ does
not depend on β, the following iterative procedure converges to the maximum like-
lihood estimator:
(1) start with some initial estimate of κ21 and κ22 . [Gre97, p. 516] proposes to
start with the assumption of homoskedasticity, i.e., κ21 = κ22 = 1, but if each group
has enough observations to make separate estimates then I think a better starting
point would be the s2i of the separate regressions.
(2) Use those κ2i to get the feasible GLSE.
(3) use this feasible GLSE to get a new set κ2i = s2i (but divide by ni , not ni − k).
(4) Go back to (2).
Once the maximum likelihood estimates of β, σ 2 , and κ2i are computed (actually
σ and κ2i cannot be identified
2
separately, therefore one conventionally imposes a
condition like σ 2 = 1 or i κ2i = n to identify them), then it is easy to test for
P
homoskedasticity by the LR test. In order to get the maximum value of the likelihood
function it saves us some work to start with the concentrated likelihood functions,
therefore we start with (35.0.17):
(58.1.4)
n n 1
log fy (y; β, Ψ) = − (1 + ln 2π − ln n) − ln(y − Xβ)> Ψ−1 (y − Xβ) − ln det[Ψ]
2 2 2
58.1. HETEROSKEDASTICITY 1249
Since σ̂ 2 = 1
n (y − Xβ)> Ψ−1 (y − Xβ) and det[kΨ] = k n det[Ψ] one can rewrite
(35.0.17) as
n 1
(58.1.5) log fy (y; β, Ψ) = − (1 + ln 2π) − ln det[σ̂ 2 Ψ]
2 2
Now in the constrained case, with homoskedasticity assumed, Ψ = I and we will
write the OLS estimator as β̂ ˆ 2 = (ε̂ˆ> ε̂)/n.
ˆ and σ̂ ˆ ˆ 2 I] = n ln[σ̂
Then ln det[σ̂ ˆ 2 ]. Let
β̂ be the unconstrained MLE, and
2
σ̂1 I O
(58.1.6) Ψ̂ =
O σ̂22 I
there σ̂i2 = ε̂>
i ε̂i /ni . The LR statistic is therefore (compare [Gre97, p. 516])
X
(58.1.7) ˆ2 −
λ = 2(log fconstrained − log funconstrained ) = n ln σ̂ ni ln σ̂i2
In this particular case, the feasible GLSE is so simple that its finite sample
properties are known. Therefore [JHG+ 88] use it as a showcase example to study
the question: Should one use the feasible GLSE always, or should one use a pre-test
estimator, i.e., test whether the variances are equal, and use the feasible GLS only if
this test can be rejected, otherwise use OLS? [JHG+ 88, figure 9.2 on p. 364] gives
the trace of the MSE-matrix for several possibilities.
1250 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
z>n
of observations of m nonrandom explanatory variables which include the constant
“variable” ι. The variables in Z are often functions of certain variables in X, but
this is not necessary for the derivation that follows.
A special case of this specification is σt2 = σ 2 xpt or, after
taking logarithms,
ln σt = ln σ + p ln xt . Here Z = ι ln x and α> = ln σ 2 p .
2 2
This can be considered a regression equation with ln(ε2t /σt2 ) as the disturbance term.
The assumption is that var[ln(ε2t /σt2 )] does not depend on t, which is the case if the
58.1. HETEROSKEDASTICITY 1251
εt /σt are i.i.d. The lefthand side of (58.1.9) is not observed, but one can take the
OLS residuals ε̂t ; usually ln ε̂2t → ln ε2t in the probability limit.
There is only one hitch: the disturbances in regression (58.1.9) do not have
zero expected value. Their expected value is an unknown constant. If one ignores
that and runs a regression on (58.1.9), one gets an inconsistent estimate of the
element of α which is the coefficient of the constant term in Z. This estimate really
estimates the sum of the constant term plus the expected value of the disturbance.
As a consequence of this inconsistency, the vector exp(Zα) estimates the vector of
variances only up to a joint multiplicative constant. I.e., this inconsistency is such
that the plim of the variance estimates is not equal but nevertheless proportional
to the true variances. But proportionality is all one needs for GLS; the missing
multiplicative constant is then the s2 provided by the least squares formalism.
Therefore all one has to do is: run the regression (58.1.9) (if the F test does not
reject, then homoskedasticity cannot be rejected), get the (inconsistent but propor-
tional) estimates σ̂t2 = exp(z >
t α), divide the tth observation of the original regression
by σ̂t , and re-run the original regression on the transformed data. Consistent esti-
mates of σt2 are then the s2 from this transformed regression times the inconsistent
estimates σ̂t2 .
58.1.2. Testing for heteroskedasticity: One test is the F -test in the proce-
dure just described. Then there is the Goldfeld-Quandt test: if it is possible to order
1252 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
the observations in order of increasing error variance, run separate regressions on the
portion of the date with low variance and that with high variance, perhaps leaving
out some in the middle to increase power of the test, and then just making an F-test
SSE /d.f.
with SSEhigh
low /d.f.
.
Problem 504. Why does the Goldfeld-Quandt not use SSE high − SSE low in
the numerator?
58.1.3. Heteroskedasticity with Unknown Pattern. For consistency of
OLS one needs
1
(58.1.10) plim X >ε = o
n
1
(58.1.11) Q = plim X > X exists and is nonsingular
n
1
(58.1.12) Q∗ = plim X > ΨX exists and is nonsingular
n
Proof:
σ 2 1 > −1 1 > 1 −1
(58.1.13) V [β̂ OLS ] = X X X ΨX X > X
n n n n
σ 2 −1 ∗ −1
therefore plim V [β̂ OLS ] = n Q Q Q .
58.1. HETEROSKEDASTICITY 1253
dent observations of the random variables x and z with E[z 2 ] = 1 and cov[x2 , z 2 ] = 0.
In this case the naive regression output for the variance of β̂, which is s2N = s2 / x2 ,
P
is indeed a consistent estimate of the variance.
(58.1.15)
σ 2 x2 z 2 σ 2 n1 i x2i z 2i
P
E[x2 z 2 ] cov[x2 , z 2 ] + E[x
P
var[β̂ OLS ]
plim = plim = plim = =
s2N s2 n1 i x2i
P
s2 x2 E[x2 ] E[x2 ]
P
I.e., if one simply runs OLS in this model, then the regression printout is not mis-
leading. On the other hand, it is clear that always var[β̂ OLS ] > var[β̂]; therefore if z
is observed, then one can do better than this.
1254 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
Answer: This is a fallacy. In the above formula one does not need Ψ but X > ΨX,
which is a k ×k symmetric matrix, i.e., it has k(k +1)/2 different elements. And even
>
an inconsistent estimate of Ψ can lead to a consistent estimate
2 of X ΨX. Which
ε̂1 · · · 0
inconsistent estimate of Ψ shall we use? of course Ψ̂ = ... . . . .. . Now since
.
0 · · · ε̂2n
2 >
σ ··· 0 x1
>
.1 .. .. .. X 2
σi xi x>
(58.1.17) X ΨX = x1 ··· xn .. . . . = i
0 ··· σn2 x>
n
i
This estimator has become very fashionable, since one does not have to bother with
estimating the covariance structure, and since OLS is not too inefficient in these
situations.
It has been observed, however, that this estimator gives too small confidence
intervals in small samples. Therefore it is recommended in small samples to multiply
ε̂2
the estimated variance by the factor n/(n − k) or to use miii as the estimates of σi2 .
See [DM93, p. 554].
58.2. Autocorrelation
While heteroskedasticity is most often found with cross-sectional data, autocor-
relation is more common with time-series.
Properties of OLS in the presence of autocorrelation. If the correlation between
the observations dies off sufficiently rapidly as the observations become further apart
in time, OLS is consistent and asymptotically normal, but inefficient. There is one
important exception to this rule: if the regression includes lagged dependent variables
and there is autocorrelation, then OLS and also GLS is inconsistent.
1256 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
σ2
• b. 3 points Show that var[εt ] = ρ2t var[ε0 ] + (1 − ρ2t ) 1−ρ
v
2 . (Hint: use induc-
σv2
tion.) I.e., since |ρ| < 1, var[εt ] converges towards σε2 = 1−ρ2 .
σ2
Answer. Here is the induction step. Assume that var[εt−1 ] = ρ2(t−1) var[ε0 ]+(1−ρ2(t−1) ) 1−ρ v
2.
Since εt = ρεt−1 + v t and v t is independent of εt−1 , it follows
(58.2.3)
σv2 σv2
var[εt ] = ρ2 var[εt−1 ]+var[v t ] = ρ2t var[ε0 ]+ρ2 (1−ρ2(t−1) ) 2
+σv2 = ρ2t var[ε0 ]+(1−ρ2t ) .
1−ρ 1 − ρ2
58.2. AUTOCORRELATION 1257
• d. (d) 1 point Show that, if the process has had enough time to become sta-
tionary, it follows
ρ
(58.2.6) cov[εt , y t−1 ] = σ2
1 − ρβ ε
Answer. Do not yet compute var[εt−1 ] at this point, just call it σε2 . Assuming stationarity,
i.e., cov[εt , y t−1 ] = cov[εt−1 , y t−2 ], it follows
1 + βρ σε2
(58.2.9) var[y t ] = .
1 − βρ 1 − β 2
Answer.
(1 − β 2 )ρ
(58.2.13) plim β̂OLS = β +
1 + βρ
In analogy to White’s heteroskedastisicty-consistent estimator one can, in the
case of autocorrelation, use Newey and West’s robust, consistent estimator of the
58.2. AUTOCORRELATION 1259
MSE-matrix of OLS. This is discussed in [Gre97, p. 505–5 and 590–1]. The straight-
forward generalization of the White estimator would be
1 X
(58.2.14) Est.V ar[β̂ OLS ] = (X > X)−1 ( ε̂i ε̂j xi x> >
j )(X X)
−1
n i,j
but this estimator does not always give a positive definite matrix. The formula which
one should use is: first determine a maximum lag L beyond which the autocorrela-
tions are small enough to ignore, and then do
(58.2.15)
L X n
1 X j
Est.V ar[β̂ OLS ] = (X > X)−1 ( (1− )ε̂t ε̂t−j (xt x> > >
t−j +xt−j xt )(X X)
−1
n j=1 t=j+1
L + 1
• b. 2 points If |ρ| < 1, then the process generating the residuals converges toward
a stationary process. Assuming that this stationary state has been reached, show that
58.2. AUTOCORRELATION 1261
1
(58.2.20) var[εt ] = σ2
1 − ρ2 v
and also give a formula for cov[εt , εt−j ] in terms of σv2 , ρ, and j.
Answer. From the assumptions follows var[εt+1 ] = ρ2 var[εt ] + σv2 . Stationarity means
var[εt+1 ] = var[εt ] = σε2 , say. Therefore σε2 = ρ2 σε2 + σv2 , which gives σε2 = σv2 /(1 − ρ2 ). For the co-
variances one gets cov[εt , εt−1 ] = cov[ρεt−1 +v t , εt−1 ] = ρσε2 ; cov[εt , εt−2 ] = cov[ρεt−1 +v t , εt−2 ] =
cov[ρ2 εt−2 + ρv t−1 + v t , εt−2 ] = ρ2 σε2 , etc.
is σv2 /(1 − ρ2 ). In other words, we know that the disturbance in the first observation is independent
of all the later innovations, and its variance is by the pfactor 1/(1 − ρ2 ) higher than that of these
innovations. Therefore multiply first observation by 1 − ρ2 take this together with the other
differenced observations in order to get a well-behaved regression.
ρ2 ρn−1
1 ρ ···
ρ 1 ρ ··· ρn−2
ρ2 ρn−3
(58.2.22) Ψ=
ρ 1 ··· .
.. .. .. .. ..
. . . . .
58.2. AUTOCORRELATION 1263
Answer. V [P ε ] = σv2 I; σε2 P ΨP > = σv2 I; σε2 Ψ = σv2 P −1 (P > )−1 = σv2 (P > P > )−1 ; 1/(1 −
ρ2 )Ψ = (P > P )−1 ; (1 − ρ2 )Ψ−1 = P > P .
This is exactly the transformation which the procedure from Problem 507 leads to.
Problem 509. This question is formulated in such a way that you can do each
part of it independently of the others. Therefore if you get stuck, just go on to the
next part. We are working in the linear regression model y t = x>t β + εt , t = 1, . . . , n,
in which the following is known about the disturbances εt : For t = 2, . . . , n one can
write εt = ρεt−1 + v t with an unknown nonrandom ρ, and the v t are well behaved,
i.e., they are homoskedastic v t ∼ (0, σv2 ) and v s independent of v t for s 6= t. The
first disturbance ε1 has a finite variance and is independent of v 2 , . . . , v n .
• c. 1 point Assuming |ρ| < 1 and var[ε1 ] = σv2 /(1 − ρ2 ) =: σε2 , compute the
correlation matrix of the disturbance vector ε . Since ε is homoskedastic, this is
at the same time that matrix Ψ for which V [ε ε] = σε2 Ψ. What is a (covariance)
stationary process, and do the εt form one?
1266 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
Answer. cov[εt , εt−1 ] = cov[ρεt−1 + v t , εt−1 ] = ρσε2 ; cov[εt , εt−2 ] = cov[ρεt−1 + v t , εt−2 ] =
cov[ρ2 εt−2 + ρv t−1 + v t , εt−2 ] = ρ2 σε2 , etc. If we therefore write the covariance matrix of ε in the
ε] = σε2 Ψ, so that all elements in the diagonal of Ψ are = 1, which makes Ψ at the same
form V [ε
time the correlation matrix, we get
1 ρ ρ2 ··· ρn−1
ρ 1 ρ ··· ρn−2
(58.2.27) Ψ=
ρ2 ρ 1 ··· ρn−3 .
. . . .
. . . .. .
. . . . .
ρn−1 ρn−2 ρn−3 ··· 1
A process is covariance stationary if the expected value and the variance do not change over time,
and cov[εs , εt ] depends only on s − t, not on s or t separately. Yes it is a covariance stationary
process.
• d. 2 points Show that the matrix in equation (58.2.23) is the inverse of this
correlation matrix.
• e. 2 points Prove that the square matrix P satisfies V [P ε ] = σv2 I if and only
if P > P = (1 − ρ2 )Ψ−1 .
Answer. V [P ε ] = σv2 I; σε2 P ΨP > = σv2 I; σε2 Ψ = σv2 P −1 (P > )−1 = σv2 (P > P > )−1 ; 1/(1 −
ρ2 )Ψ = (P > P )−1 ; (1 − ρ2 )Ψ−1 = P > P .
p Since P is lower diagonal, its determinant is the product of the diagonal elements,
Answer.
which is 1 − ρ2 . Since Ψ−1 = 1−ρ
1
2P
>
P , it follows det[Ψ−1 ] = 1/(1 − ρ2 )n (det[P ])2 = 1/(1 −
ρ2 )n−1 , therefore det Ψ = (1 − ρ2 )n−1 .
• h. 3 points Show that the general formula for the log likelihood function (35.0.11)
reduces in our specific situation to
(58.2.28)
n
n 1 1 X
ln `(y; β, ρ, σv2 ) = constant− ln σv2 + ln(1−ρ2 )− 2 (1−ρ2 )ε21 + (εt −ρεt−1 )2
2 2 2σv t=2
where εt = y t − x> t β. You will need such an expression if you have to program
the likelihood function in a programming language which does not understand matrix
operations. As a check on your arithmetic I want you to keep track of the value of
the constant in this formula and report it. Hint: use P to evaluate the quadratic
form (y − Xβ)> Ψ−1 (y − Xβ).
Answer. The constant is − n 2
ln 2π. The next two terms are − n
2
ln σε2 − 12 ln |det Ψ| = − n
2
ln σv2 +
n 2 n−1 2 n 2 1 2 −1 1 >
2
ln(1−ρ )− 2 ln(1−ρ ) = − 2 ln σv + 2 ln(1−ρ ). And since Ψ = 1−ρ2 P P and σε (1−ρ2 ) =
2
σv , the last terms coincide too, because (y − Xβ) Ψ (y − Xβ) = (y − Xβ)> P > P (y − Xβ).
2 > −1
1268 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
• i. 4 points Show that, if one concentrates out σv2 , i.e., maximizes this likelihood
function with respect to σv2 , taking all other parameters as given, one obtains
n
1 n X
(58.2.29) ln `conc. = constant + ln(1 − ρ2 ) − ln (1 − ρ2 )ε21 + (εt − ρεt−1 )2
2 2 t=2
Again as a check on your arithmetic, I want you to give me a formula for the constant
in (58.2.29). If you did not figure out the constant in (58.2.28), you may give me
the constant in (58.2.29) as a function of the constant in (58.2.28).
Answer. It is better to do it from scratch than to use the general formula (35.0.17): First
order condition is
Pn
∂ 2 n 1 (1 − ρ2 )ε21 + (ε − ρεt−1 )2
t=2 t
(58.2.30) ln `(y; β, ρ, σ v ) = − + =0
∂σv2 2 σv2 2σv4
which gives
Pn
(1 − ρ2 )ε21 + t=2
(εt − ρεt−1 )2
(58.2.31) σv2 =
n
Plugging this into the likelihood function gives (58.2.29), but this time the constant is written out:
n
n n n 1 n
X
(58.2.32) ln `conc. = − − ln 2π + ln n + ln(1 − ρ2 ) − ln (1 − ρ2 )ε21 + (εt − ρεt−1 )2
2 2 2 2 2
t=2
58.2. AUTOCORRELATION 1269
As this Question shows, after concentrating out σv2 one can either concentrate
out ρ or β but not both, and [BM78] propose to alternate these concentrations until
it converges.
58.2.2. Prediction. To compute the BLUP for one step ahead simply predict
v n+1 by 0, i.e. ε∗n+1 = ρε̂n , hence
(58.2.33) y ∗n+1 = x>
n+1 β̂ + ρε̂n ;
where v > = [ρn , ρn−1 , . . . , ρ2 , ρ] and Ψ is as in (58.2.22). Equation (27.3.6) gives y ∗n+1 = x>
n+1 β̂ +
v > Ψ−1 (y − X β̂). Using Ψ−1 from (58.2.23) one can show that
1 −ρ 0 ··· 0 0
−ρ 1 + ρ2 −ρ ··· 0 0
1 n 0
−ρ 1 + ρ2 ··· 0 0
v > Ψ−1 = ρ ρn−1 ρn−2 ··· ρ2 ρ .
.. .. .. .. ..
1−ρ 2
.. . . . . .
··· 1 + ρ2 −ρ
0 0 0
0 0 0 ··· −ρ 1
= 0 0 0 ··· 0 ρ .
1 − α2
(58.2.34) σε2 = σ2 .
(1 + α2 ) (1 − α2 )2 − α12 v
1272 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
• b. 4 points Discuss how these matrices are mathematically related to each other
(perhaps one is the inverse of the other, etc.), and how they are related to the model
(could it be that one is the matrix of covariances between the error terms and the
explanatory variables?) A precise proof of your answer would imply tedious matrix
multiplications, but you should be able to give an answer simply by carefully looking
at the matrices.
Answer. A is the correlation matrix of the errors, i.e., σε2 A = V [ε ε])−1 , and
ε], B = σv2 (V [ε
C > C = B.
• d. 4 points In terms of these matrices, give the objective function (some matrix
weighted sum of squares) which the BLUE minimizes due to the Gauss-Markov theo-
rem, give the formula for the BLUE, and give the formula for the unbiased estimator
s2v .
58.2.4. The Autoreg Procedure in SAS. This is about the “autoreg” pro-
cedure in the SAS ETS manual.
1274 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
(58.2.38)
cov[εt , εt−1 ] = −α1 var[εt−1 ] − α2 cov[εt−2 , εt−1 ] − · · · − αp cov[εt−p , εt−1 ]
(58.2.39)
cov[εt , εt−2 ] = −α1 cov[εt−1 , εt−2 ] − α2 var[εt−2 ] − · · · − αp cov[εt−p , εt−2 ]
...
(58.2.40)
cov[εt , εt−p ] = −α1 cov[εt−1 , εt−p ] − α2 cov[εt−2 , εt−p ] · · · − αp var[εt−p ] − · · · − αp cov[
58.2. AUTOCORRELATION 1275
ρ1 1 ρ1 ρ2 ··· ρp−1
α1
ρ2
ρ1
1 ρ1 ··· ρp−2
α2
(58.2.41)
ρ3
= − ρ2
ρ1 1 ··· ρp−3
. .
.. .. .. .. .. .. ..
. . . . . .
αp
ρp ρp−1 ρp−2 ρp−3 ··· 1
Testing for ρ = 0; In the regression of ε̂t on ε̂t−1 , the formula for the
Pnvariance
2
(39.1.7) holds asymptotically, i.e., var[ρ̂] = σv2 / E[x> x] where x> x = t=2 ε̂t−1 .
> 2 2 2
Asymptotically, x x has expected value nσε = nσv /(1 − ρ ). Asymptotially, there-
2 √
fore, var[ρ̂] = 1−ρ
n . If ρ = 0, it is var[ρ̂] = 1/n; in this case, therefore, nρ̂ has
asymptotic N (0, 1) distribution.
But the most often used statistic for autoregression is the Durbin-Watson.
58.2.6. The Durbin-Watson Test Statistic. The Durbin Watson test [DW50,
DW51, DW71] tests εt and εt−1 are correlated in the linear model y = Xβ + ε in
which the conditions for hypothesis testing are satisfied (either normal disturbances
or so many observations that the central limit theorem leads to normality), and in
which the errors are homoskedastic with variance σ 2 , and cov[εt−1 , εt ] = ρσ 2 with
the same ρ for all t = 2, . . . , n.
The DW test does not test the higher autocorrelations. It was found to be
powerful if the overall process is an AR(1) process, but it cannot be powerful if the
autocorrelation is such that εt is not correlated with εt−1 but with higher lags. For
instance, for quarterly data, Wallis [Wal72] argued that one should expect εt to
be correlated with εt−4 and not with εt−1 , and he modified the DW test for this
situation.
1278 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
(where the residuals are taken from OLS without correction for autocorrelation).
This test statistic is a consistent estimator of 2 − 2ρ (but it has its particular form
so that the distribution can be calculated). The plim can be seen as follows:
Pn 2 2 Pn 2 Pn Pn 2
t=2 (ε̂t − 2ε̂t ε̂t−1 + ε̂t−1 ) t=2 ε̂t t=2 ε̂t ε̂t−1 t=2 ε̂t−1
(58.2.44) d = Pn 2 = Pn 2 − 2 P n 2 + P n 2 .
t=1 ε̂t t=1 ε̂t t=1 ε̂t t=1 ε̂t
For large n one can ignore that the sum in the numerator has one less element
than the one in the denominator. Therefore the first term converges towards 1,
the secondp towards 2 cov[εt , εt−1 ]/ var[εt ] = 2ρ (note that, due to homoskedasticity,
var[εt ] = var[εt−1 ] var[εt ] ), and the third term again towards 1. d is always
between 0 and 4, and is close to 2 if there is no autocorrelation, close to 0 for
positive autocorrelation, and close to 4 for negative autocorrelation.
d differs from many test statistics considered so far because its distribution de-
pends on the values taken by regressors X. It is a very special situation that the
distribution of the t statistic and F statistic do not depend on the X. Usually one
must expect that the values of X have an influence. Despite this dependence on X,
58.2. AUTOCORRELATION 1279
it is possible to give bounds for the critical values, which are tabulated as DL (lower
D) and DU (upper D). If the alternative hypothesis is positive autocorrelation, one
can reject the null hypothesis if d < DL for all possible configurations of the regres-
sors, cannot reject if d > DU , and otherwise the test is inconclusive, i.e., in this case
it depends on X whether to reject or not, and the computer is not taking the trouble
of checking which is the case.
The bounds that are usually published are calculated under the assumption that
the regression has a constant term, i.e., that there is a vector a so that Xa = ι.
Tables valid if there is no constant term are given in [Far80]. If these tables are
unavailable, [Kme86, p. 329/30] recommends to include the constant term into
the regression before running the test, so that the usual bounds can be used. But
[JHG+ 88, p. 399] says that the power of the DW test is low if there is no intercept.
On the other hand, [Kin81] has given sharper bounds which one can use if it is
known that the regression has a trend of seasonal dummies. The computer program
SHAZAM [Whi93] computes the exact confidence points using the available data.
An approximation to the exact critical values by Durbin and Watson themselves
[DW71] uses that affine combination of the upper bound which has the same mean
and variance as the exact test statistic. This is discussed in [Gre97, p. 593] and
Green says it is “quite accurate.”
1280 58. UNKNOWN PARAMETERS IN THE COVARIANCE MATRIX
Robustness: D-W will detect more than just first order autoregression [Bla73],
but not all kinds of serial correlation, e.g. not very powerful for 2nd order autore-
gression.
If lagged dependent variables and autoregression, then OLS is no longer con-
sistent, therefore also d no longer a consistent estimate of 2 − 2ρ but is closer to
2 than it should be! Then the D-W has low power, accepts more often than it
should. If lagged dependent variable, use Durbin’s h. This is an intuitive formula,
see [JHG+ 88, p. 401]. It cannot always be calculated, because of the square root
which may become negative, therefore an asymptotically equivalent test is Durbin’s
m-test, which implies: get the OLS residuals, and regress them on all explanatory
variables and the lagged residuals, and see if the coefficient on the lagged residuals is
significant. This can be extended to higher order autoregression by including higher
lags of the residuals [Kme86, p. 333]
p Answer. This is exactly the p same proof as in Part b. From (58.3.1) follows E[εt |εt−1 , εs ] =
α0 + α1 ε2t−1 E[ut |εt−1 , εs ] = α0 + α1 ε2t−1 E[ut ] = 0.
• e. Show that in general E[εt |εs ] = 0 for all s < t. Hint: You are allowed to
use, without proof, the following extension of the law of iterated expectations:
(58.3.2) E E[x|y, z]y = E[x|y].
Answer. By (58.3.2), E[εt |εs ] = E E[εt |εt−1 , εs ]εs = E[0|εs ] = 0.
• f. Show that cov[εt , εs ] = 0 for all s < t. Hint: Use Question 145.
Answer. cov[εs , εt ] = cov εs , E[εt |εs ] = 0
α0
Answer. By induction: assume it is true for t − 1, i.e., var[εt−1 ] = (1 − αt−1
1 ) 1−α +
1
αt−1
1 var[ε0 ]. Then, by g,
α0
(58.3.4) var[εt ] = α0 + α1 (1 − αt−1
1 ) + αt1 var[ε0 ]
1 − α1
α0 α0
(58.3.5) = (1 − α1 ) + (α1 − αt1 ) + αt1 var[ε0 ]
1 − α1 1 − α1
α0
(58.3.6) = (1 − αt1 ) + αt1 var[ε0 ].
1 − α1
(58.3.7)
n n
n X 1X (y t − x>t β
log `(y; α1 , α2 , β) = − log 2π− log(α0 +α1 (y t−1 −x>
t−1 β)2
)−
2 t=1
2 t=1
α0 + α1 (y t−1 −
58.3. AUTOREGRESSIVE CONDITIONAL HETEROSKEDASTICITY (ARCH) 1285
The first-order conditions are complicated, but this can be maximized by numerical
methods.
There is also a simpler feasible four-step estimation procedure available, see
[Gre97, p. 571]. A good discussion of the ARCH processes is in [End95, pp. 139–
165].
CHAPTER 59
This follows mainly [DM93, Chapter 17]. A good and accessible treatment
is [M9́9]. The textbook [Hay00] uses GMM as the organizing principle for all
estimation methods except maximum likelihood.
A moment µ of a random variable y is the expected value of some function of y.
Such a moment is therefore defined by the equation
(59.0.8) E[g(y) − µ] = 0.
The same parameter-defining function g(y) − µ defines the method of moments esti-
mator µ̂ of µ if one replaces the expected value in (59.0.8) with the sample mean of
the elements of an observation vector y consisting ofPindependent observations of y.
n
In other words, µ̂(y) is that value which satisfies n1 i=1 (g(y i ) − µ̂) = 0.
1287
1288 59. GENERALIZED METHOD OF MOMENTS ESTIMATORS
The generalized method of moments estimator extends this rule in several re-
spects: the y i no longer have to be i.i.d., the parameter-defining equations may be a
system of equations defining more than one paramter at a time, there may be more
parameter-defining functions than parameters (overidentification), and not only un-
conditional but also conditional moments are considered.
Under this definition, the OLS estimator is a GMM estimator. To show this,
we will write the linear model y = Xβ + ε row by row as y i = x> i β + εi , where
xi is, as in various earlier cases, the ith row of X written as a column vector. The
basic property which makes least squares consistent is that the following conditional
expectation is zero:
(59.0.9) E[y i − x>
i β|xi ] = 0.
This is more information than just knowing that the unconditional expectation is
zero. How can this additional information be used to define an estimator? From
(59.0.9) follows that the unconditional expectation of the product
>
(59.0.10) E [xi (y i − xi β)] = o.
Replacing the expected value by the sample mean gives
n
1X
(59.0.11) xi (y i − x>
i β̂) = o
n i=1
59. GENERALIZED METHOD OF MOMENTS ESTIMATORS 1289
These are exactly the OLS Normal Equations. This shows that OLS in the linear
model is a GMM estimator.
Note that the rows of the X-matrix play two different roles in this derivation:
they appear in the equation y i = x> i β + εi , and they are also the information set
based on which the conditional expectation in (59.0.9) is formed. If this latter role
is assumed by the rows of a different matrix of observations W then the GMM
estimator becomes the Instrumental Variables Estimator.
Most maximum likelihood estimators are also GMM estimators. As long as the
maxima are at the interior of the parameter region, the ML estimators solve the first
order conditions, i.e., the Jacobian of the log likelihood function evaluated at these
estimators is zero. But it follows from the theory of maximum likelihood estimation
that the expected value of the Jacobian of the log likelihood function is zero.
Here are the general definitions and theorems, and as example their applications
to the textbook example of the Gamma distribution in [Gre97, p. 518] and the
Instrumental Variables estimator.
1290 59. GENERALIZED METHOD OF MOMENTS ESTIMATORS
f>
1 (y 1 , θ)
(59.0.13) F (y, θ) =
..
.
f>
n (y n , θ)
Sometimes the f i have identical functional form and only differ by the values of some
exogenous variables, i.e., f i (y i , θ) = g(y i , xi , θ), but sometimes they have genuinely
different functional forms.
In the Gamma-function example M is the set of all Gamma distributions, θ =
>
r λ consists of the two parameters of the Gamma distribution, ` = k = 2, and
the parameter-defining function has the rows
y 1 − λr 1
y1
λ
− r−1
y i − λr
. ..
(59.0.14) f i (y i , θ) = ..
so that F (y i , θ) = .
1 λ
− r−1 .
yi
y −r 1
− λ
59. GENERALIZED METHOD OF MOMENTS ESTIMATORS 1291
then f i (y i , β) = wi (y i − x>
i β). This gives
(y 1 − x> >
1 β)w 1
..
(59.0.16) F (y, β) = = diag(y − Xβ)W .
.
(y n − x> >
n β)w n
(b) The vector functions f i (y i , θ) must be such that the true value of the pa-
rameter vector θ µ satisfies
(59.0.17) E [f i (y i , θ µ )] = o
for all i, while any other parameter vector θ 6= θ µ gives E [f i (y i , θ)] 6= o.
In the Gamma example (59.0.17) follows from the fact that the moments of the
Gamma distribution are E[y] = λr and E[ y1 ] = r−1λ
. It is also easy to see that r and
i
λ are characterized by these two relations; given E[y] = µ and E[ y1 ] = ν one can
i
µν ν
solve for r = µν−1 and λ = µν−1 .
1292 59. GENERALIZED METHOD OF MOMENTS ESTIMATORS
Assumption (c) for a parameter-defining function is that there is only one θ̂ satisfying
(59.0.18).
For IV,
(59.0.20) F > (y, β̃)ι = W > diag(y − X β̃)ι = W > (y − X β̃)
If there are as many instruments as explanatory variables, setting this zero gives the
normal equation for the simple IV estimator W > (y − X β̃) = o.
59. GENERALIZED METHOD OF MOMENTS ESTIMATORS 1293
In the case ` > k, (59.0.17) still holds, but the system of equations (59.0.18) no
longer has a solution: there are ` > k relationships for the k parameters. In order
to handle this situation, we need to specify what qualifies as a weighting matrix.
The symmetric positive definite ` × ` matrix A(y) is a weighting matrix if it has
a nonrandom positive definite plim, called A0 (y) = plimn→∞ A(y). Instead of
(59.0.18), now the following equation serves to define θ̂:
In this case, condition (c) for a parameter-defining equation reads that there is only
one θ̂ which minimizes this criterion function.
For IV, A(y) does not depend on y but is n1 (W > W )−1 . Therefore A0 =
plim( n1 W > W )−1 , and (59.0.21) becomes β̃ = argmin(y−X > β)> W (W > W )−1 W > (y
X > β), which is indeed the quadratic form minimized by the generalkized instrumen-
tal variables estimator.
In order to convert the Gamma-function example into an overidentified system,
we add a third relation:
y 21 − r(r+1)
y 1 − λr y1 − r−1λ
λ2
1
. .. ..
(59.0.22) ..
F (y i , θ) = .
. .
r(r+1)
y −r 1
− λ y2 −
1294 59. GENERALIZED METHOD OF MOMENTS ESTIMATORS
In this case here is possible to compute the asymptotic covariance; but in real-life
situations this covariance matrix is estimated using a preliminary consistent estima-
tor of the parameters, as [Gre97] does it. Most GMM estimators depend on such a
consistent pre-estimator.
The GMM estimator θ̂ defined in this way is a particular kind of a M -estimator,
and many of its properties follow from the general theory of M -estimators. We need
some more definitions. Define the plim of the Jacobian of the parameter-defining
mapping D = plim n1 ∂F > ι/∂θ > and the plim of the covariance matrix of √1n F > ι is
Ψ = plim n1 F > F .
>
For IV, D = plim n1 ∂W (y−Xβ)
∂β >
= − plimn→∞ n1 W > X, and
1 1
Ψ = plim W > diag(y − Xβ) diag(y − Xβ)W = plim W >Ω W
n n
where Ω is the diagonal matrix with typical element E[(y i − x> 2 ε].
i β) ], i.e., Ω = V [ε
With this notation the theory of M -estimators gives us the following result: The
asymptotic MSE-matrix of the GMM is
√
This gives the following expression for the plim of n times the sampling error
of the IV estimator:
(59.0.24)
1 1 1 1 1 1 1
plim( X > W ( W > W )−1 W > X)−1 X > W ( W > W )−1 W >Ω W ( W > W )−1
n n n n n n n
(59.0.25)
= plim n(X > W (W > W )−1 W > X)−1 X > W (W > W )−1 W >Ω W (W > W )−1 W > X(X
The asymptotic MSE matrix can be obtained fom this by dividing by n. An estimate
of the asymptotic covariance matrix is therefore
(59.0.26)
(X > W (W > W )−1 W > X)−1 X > W (W > W )−1 W >ΩW (W > W )−1 W > X(X > W (W
(59.0.29) β̃ = (X > W (W >Ω W )−1 W > X)−1 X > W (W >Ω W )−1 W > y
(59.0.30) (D > A0 D)−1 D > A0 ΨA0 D(D > A0 D)−1 − (D > A0 D)−1 =
(59.0.31) = (D > A0 D)−1 D > A0 Ψ − D(D > A0 D)−1 D > A0 D(D > A0 D)−1
59. GENERALIZED METHOD OF MOMENTS ESTIMATORS 1297
Now the middle matrix can be written as P I − QD(D > Q> QD)−1 D > Q> P >
which is nonnegative definite because the matrix in the middle is idempotent.
The advantage of the GMM is that it is valid for many different DGP’s. In this
respect it is the opposite of the maximum likelihood estimator, which needs a very
specific DGP. The more broadly the DGP can be defined, the better the chances
are that the GMM etimator is efficient, i.e., in large samples as good as maximum
likelihood.
CHAPTER 60
Bootstrap Estimators
distribution function has been called the nonparametric maximum likelihood esti-
mate of F . And your estimate of the distribution of θ(x) is that distribution which
derives from this empirical distribution function. Just like the maximum likelihood
principle, this principle is deceptively simple but has some deep probability theoretic
foundations.
In simple cases, this is a widely used principle; the sample mean, for instance, is
the expected value of the empirical distribution, the same is true about the sample
variance (divisor is n) or sample median etc. But as soon as θ becomes a little more
complicated, and one wants more complex measures of its distribution, such as the
standard deviation of a complicated function of x, or some confidence intervals, an
analytical expression for this bootstrap estimate is prohibitively complex.
But with the availability of modern computing power, an alternative to the
analytical evaluation is feasible: draw a large random sample from the empirical
distribution, evaluate θ(x) for each x in this artificially generated random sample,
and use these datapoints to construct the distribution function of θ(x). A random
sample from the empirical distribution is merely a random drawing from the given
values with replacement. This requires computing power, usually one has to re-
sample between 1,000 and 10,000 times to get accurate results, but one does not
need to do complicated math, and these so-called nonparametric bootstrap results
are very close to the theoretical results wherever those are available.
60. BOOTSTRAP ESTIMATORS 1301
So far we have been discussing the situation that all observations come from
the same population. In the regression context this is not the case. In the OLS
model with i.i.d. disturbances, the observations of the independent variable y t have
different expected values, i.e., they do not come from the same population. On the
other hand, the disturbances come from the same population. Unfortunately, they
are not observed, but it turns out that one can successfully apply bootstrap methods
here by first computing the OLS residuals and then drawing from these residuals to
get pseudo-datapoints and to run the regression on those. This is a surprising and
strong result; but one has to be careful here that the OLS model is correctly specified.
For instance, if there is heteroskedasticity which is not corrected for, then the re-
sampling would no longer be uniform, and the bootstrap least squares estimates are
inconsistent.
The jackknife is a much more complicated concept; it was originally invented
and is often still introduced as a device to reduce bias, but [Efr82, p. 10] claims that
this motivation is mistaken. It is an alternative to the bootstrap, in which random
sampling is replaced by a symmetric systematic “sampling” of datasets which are by 1
observation smaller than the original one: namely, n drawings with one observation
left out in each. In certain situations this is as good as bootstrapping, but much
cheaper. I third concept is cross-validation.
1302 60. BOOTSTRAP ESTIMATORS
There is a new book out, [ET93], for which the authors also have written boot-
strap and jackknife functions for Splus, to be found if one does attach("/home/econ/eh
CHAPTER 61
Random Coefficients
1303
1304 61. RANDOM COEFFICIENTS
= ∆ ; = ι +
t t t t t
Estimation under the assumption Σ is known: To estimate β̄ one can use the het-
ˆ . The
eroskedastic model with error variances τ 2 xt >Σ xt , call the resulting estimate β̄
formula for the best linear unbiased predictor of β t itself can be derived (heuristi-
cally) as follows: Assume for a moment that β̄ is known: then the model can be
written as y t − xt > β̄ = xt > v t . Then we can use the formula for the Best Linear
Predictor, equation (??), applied to the situation
> > >
xt v t 0 2 xt Σ xt xt Σ
(61.0.33) ∼ ,τ
vt o Σ xt Σ
where xt > v t is observed, its value is y t − xt > β̄, but v t is not. Note that here we
predict a whole vector on the basis of one linear combination of its elements only.
61. RANDOM COEFFICIENTS 1305
This predictor is
(61.0.34) v ∗t = Σ xt (xt >Σ xt )−1 (y t − xt > β̄)
If one adds β̄ to both sides, one obtains
(61.0.35) β ∗t = β̄ + Σ xt (xt >Σ xt )−1 (y t − xt > β̄)
If one now replaces β̄ by β̄ˆ , one obtains the formula for the predictor given in
+
[JHG 88, p. 438]:
(61.0.36) β ∗ = β̄ˆ + Σ x (x >Σ x )−1 (y − x > β̄ ˆ ).
t t t t t t
Using this notation and defining, as usual, M = I − X(X > X)−1 X > , and
writing mt for the tth column vector of M , and furthermore writing Q for the
matrix whose elements are the squares of the elements of M , and writing δ t for the
vector that has 1 in the tth place and 0 elsewhere, one can derive:
(61.0.42) = m>
t diag(γ)mt = mt1 γ1 m1t + · · · + mtn γn mnt
= ∆ + ; =
n n n t
Problem 513. Let y i be the ith column of Y . The random coefficients model
as discussed in [Gre97, p. 669–674] specifies y i = X i β i + ε i with ε i ∼ (o, σi2 I)
and ε i uncorrelated with ε j for i 6= j. Furthermore also β i is random, write it as
β i = β + v i , with v i ∼ (o, τ 2 Γ) with a positive definite Γ, and again v i uncorrelated
with v j for i 6= j. Furthermore, all v i are uncorrelated with all ε j .
1 −1 >
(61.0.46) X>
i (V [w i ])
−1
= Γ (X i X i + κ2i Γ−1 )−1 X >
i
τ2
1308 61. RANDOM COEFFICIENTS
where κ2i = σi2 /τ 2 . You are allowed to use, without proof, formula (A.8.13), which
reads for inverses, not generalized inverses:
−1
(61.0.47) A + BD −1 C = A−1 − A−1 B(D + CA−1 B)−1 CA−1
(61.0.50)
1 1 1 1
= 2 X> − 2 X> + 2 κ2i Γ−1 (X > 2 −1 −1 >
i X i + κi Γ ) X i = 2 Γ−1 (X > 2 −1 −1 >
i X i + κi Γ ) Xi .
σi i σi i σi τ
• c. 2 points Show that from (61.0.46) also follows that The GLS of each column
of Y separately is the OLS β̂ i = (X >
i X i)
−1
X>
i yi .
Problem 514. 5 points Describe in words how the “Random Coefficient Model”
differs from an ordinary regression model, how it can be estimated, and describe
situations in which it may be appropriate. Use your own words instead of excerpting
the notes, don’t give unnecessary detail but give an overview which will allow one to
decide whether this is a good model for a given situation.
(61.0.54) y t = α + β t xt + γx2t
(no separate disturbance term), where α and γ are constants, and β t is the tth element
of a random vector β ∼ (ιµ, τ 2 I). Explain how you would estimate α, γ, µ, and τ 2 .
This is regression with a heteroskedastic disturbance term. Therefore one has to specify weights=1/x2t ,
if one does that, one gets
yt α
(61.0.56) = + µ + γxt + v t
xt xt
the coefficient estimates are the obvious ones, and the variance estimate in this regression is an
unbiased estimate of τ 2 .
CHAPTER 62
Multivariate Regression
t Y t X k B t E
(62.1.1) = +
p p p
The most common application of these kinds of models are Vector Autoregressive
Time Series models. If one adds the requirements that all coefficient vectors satisfy
1314 62. MULTIVARIATE REGRESSION
the same kind of linear constraint, one gets a model which is sometimes called a
growth curve models. These models will be discussed in the remainder of this chapter.
In a second basic model, the explanatory variables are different, but the coeffi-
cient vector is the same. In tiles:
t Y t X k β t E
(62.1.2) = +
p p p
These models are used for pooling cross-sectional and timeseries data. They will be
discussed in chapter 64.
In the third basic model, both explanatory variables and coefficient vectors are
different.
t Y t X k B t E
(62.1.3) = ∆ +
p p p
These models are known under the name “seemingly unrelated” or “disturbance
related” regression models. They will be discussed in chapter 65.
62.2. MULTIVARIATE REGRESSION WITH EQUAL REGRESSORS 1315
62.2.1. Least Squares Property. The least squares principle can be applied
here in the following form: given a matrix of observations Y , estimate B by that
value B̂ for which
in the matrix sense, i.e., (Y − X B̂)> (Y − X B̂) is by a nnd matrix smaller than
any other (Y − XB)> (Y − XB). And an unbiased estimator of Σ is Σ̂ = n−k 1
(Y −
>
X B̂) (Y − X B̂).
Any B̂ which satisfies the normal equation
is a solution. There is always at least one such solution, and if X has full rank, then
the solution is uniquely determined.
62.2. MULTIVARIATE REGRESSION WITH EQUAL REGRESSORS 1317
Proof: This is Problem 232. Due to the normal equations, the cross product
disappears:
Note that the normal equation (62.2.3) simply reduces to the OLS normal equation
for each column β i of B, with the corresponding column y i of Y as dependent
variable. In other words, for the estimation of β i , only the ith column y i is used.
−1
(62.2.6) vec(B̂) = (I ⊗ X)> (ΣΣ ⊗ I)−1 (I ⊗ X) (I ⊗ X)> (ΣΣ ⊗ I)−1 vec(Y )
−1
(62.2.7) Σ−1 ⊗ I)(I ⊗ X)
= (I ⊗ X > )(Σ Σ−1 ⊗ I) vec(Y )
(I ⊗ X > )(Σ
−1
(62.2.8) = Σ −1 ⊗ X > X Σ−1 ⊗ X > ) vec(Y )
(Σ
(62.2.9) = I ⊗ (X > X)−1 X > vec(Y )
From this vectorization one can also derive the dispersion matrix V [vec(B̂)] = Σ ⊗
(X > X)−1 . In other words, C [β̂ i , β̂ j ] = σij (X > X)−1 , which can be estimated by
σ̂ ij (X > X)−1 .
62.2. MULTIVARIATE REGRESSION WITH EQUAL REGRESSORS 1319
>
Assuming normality, the ith row vector is y > >
i ∼ N (xi B, Σ ), or y i ∼ N (B xi , Σ ).
Since all rows are independent, the likelihood function is
(62.2.12)
n
Y 1
fY (Y ) = (2π)−r/2 (det Σ )−1/2 exp − (y > >
Σ−1 (y i − B > xi )
i − xi B)Σ
i=1
2
1X >
= (2π)−nr/2 (det Σ )−n/2 exp − (y i − x> Σ−1 (y i − B > xi ) .
(62.2.13) i B)Σ
2 i
1320 62. MULTIVARIATE REGRESSION
n
X n
X
(y > >
Σ−1 (y i − B > xi ) =
i − xi B)Σ tr(y > >
Σ−1 (y i − B > xi )
i − xi B)Σ
i=1 i=1
n
X
= tr Σ−1 (y i − B > xi )(y > >
i − xi B)
i=1
n
X
= tr Σ −1 (y > > > > >
i − xi B) (y i − xi B)
i=1
> > >
y 1 − x> y 1 − x>
1B 1B
= tr Σ −1
.. ..
. .
y> >
n − xn B y> >
n − xn B
The first step is obvious: using (62.2.4), the quadratic form in the exponent
becomes:
(Y − XB)> (Y − XB) = tr Σ −1 (Y − X B̂)> (Y − X B̂)
Σ−1 (X B̂ − XB)> .
+ tr(X B̂ − XB)Σ
The argument which minimizes this is B = B̂, regardless of the value of Σ . Therefore
the concentrated likelihood function becomes, using the notation Ê = (Y − X B̂):
1 >
(62.2.14) (2π)−nr/2 (det Σ )−n/2 exp − tr Σ −1 Ê Ê .
2
In order to find the value of Σ which maximizes this we will use (A.8.21) in Theorem
A.8.3 in the Mathematical Appendix. From (A.8.21) follows
n
(62.2.15) (det A)n/2 e− 2 tr A
≤ e−rn/2 ,
> 1/2 −1 >
We want to apply (62.2.15). Set A = n1 (Ê Ê Σ (Ê Ê)1/2 ; then exp − n2 tr A =
> >
exp − 12 tr Σ −1 Ê Ê , and det A = det( n1 Ê Ê / det Σ ; therefore, using (62.2.15),
> > −n/2
(2π)−nr/2 (det Σ )−n/2 exp − 21 tr Σ −1 Ê Ê ≤ 2πe−nr/2 det( n1 Ê Ê ,
1 >
with equality holding when A = I, i.e., for the value Σ̂ = n Ê Ê.
1322 62. MULTIVARIATE REGRESSION
(62.2.14) is the concentrated likelihood function even if one has prior knowledge
about Σ ; in this case, the maximization is more difficult.
The OLS estimate of µ is ȳ, which one gets by taking the column means of Y .
The dispersion matrix of this estimate is Σ /n. The Mahalanobis distance of this
estimate from µ0 is therefore n(ȳ − µ)>Σ −1 (ȳ − µ), and replacing Σ by its unbiased
S = W /(n − 1), one gets the following test statistic: T 2n−1 = n(ȳ − µ)> S −1 (ȳ − µ).
Here use the following definition: if z ∼ N (o, Σ ) is a r-vector, and W ∼ W (r, Σ )
independent of z with the same Σ , so that S = W /r is an unbiased estimate of Σ ,
then
(62.2.19) T 2r = z > S −1 z
is called a Hotelling T 2r,r with r and r degrees of freedom.
One sees easily that the distribution of T 2r,r is independent of Σ . It can be
written in the form
(62.2.20) T 2r = z >Σ −1/2 (Σ
Σ−1/2 SΣ
Σ−1/2 )−1Σ −1/2 z
Here Σ −1/2 z ∼ N (o, I) and from W = Y > Y where each row of Y is a N (o, Σ ),
then Σ −1/2 SΣ
Σ−1/2 = U > U where each row of U is a N (o, I).
From the interpretation of the Mahalanobis distance as the number of standard
deviations the “worst” linear combination is away from its mean, Hotelling’s T 2 -test
can again be interpreted as: make t-tests for all possible linear combinations of the
components of µ0 at an appropriately less stringent significance level, and reject the
62.2. MULTIVARIATE REGRESSION WITH EQUAL REGRESSORS 1325
hypothesis if at least one of these t-tests rejects. This principle of constructing tests
for multivariate hypotheses from those of simple hypotheses is called the “union-
intersection principle” in multivariate statistics.
Since the usual F -statistic in univariate regression can also be considered the
estimate of a Mahalanobis distance, it might be worth while to point out the dif-
ference. The difference is that in the case of the F -statistic, the dispersion matrix
was known up to a factor σ 2 , and only this factor had to be estimated. In the case
of the Hotelling T 2 , the whole dispersion matrix is unknown and all of it must be
estimated (but one has also multivariate rather than univariate observations). Just
as the distribution of the F statistic does not depend on the true value of σ 2 , the
distribution of Hotelling’s T 2 does not depend on Σ. Indeed, its distribution can be
expressed in terms of the F -distribution. This is a deep result which we will not
prove here:
If Σ is a r × r nonsingular matrix, then the distribution of Hotelling’s T 2r,r with r
and r degrees of freedom can be expressed in terms of the F -distribution as follows:
r − r + 1 T 2r,r
(62.2.21) ∼ F r,r−r+1
r r
This apparatus with Hotelling’s T 2 has been developed only for a very specific
kind of hypothesis, namely, a hypothesis of the form r > B = u> . Now let us turn
1326 62. MULTIVARIATE REGRESSION
to the more general hypothesis RB = U , where R has rank i, and apply the F -
test principle. For this one runs the constrained and the unconstrained multivariate
>
regression, calling the attained error sum of squares and products matrices Ê 1 Ê 1
>
(for the constrained) and Ê Ê (for the unconstrained model). Then one fills in the
following table: Just as in the univariate case one shows that the S.P. matrices in
Source D. F. S. P. Matrix
> >
Deviation from Hypothesis k−i Ê 1 Ê 1 − Ê Ê
>
Error n−k Ê Ê
>
(Restricted) Total n−i Ê 1 Ê 1
the first two rows are independent Wishart matrices, the first being central if the
hypothesis is correct, and noncentral otherwise.
In the univariate case one has scalars instead of the S.P. matrices; then one
divides each of these sum of squares by its degrees of freedom, and then takes the
relation of the “Deviation from hypothesis” mean square error by the error mean
square error. In this way one gets, for the error sum of squares an unbiased estimate
of σ 2 . If the hypothesis is true, the mean squared sum of errors explained by the
62.2. MULTIVARIATE REGRESSION WITH EQUAL REGRESSORS 1327
|W 1 |
(62.2.22) ∼ Λ(r, k1 , k2 )
|W 1 + W 2 |
Answer. If Ψ is known, then the BLUE can be obtained as follows: Use (B.5.19) to write the
equation Y = XΘH + E in vectorized form as
(62.3.4) vec(Y ) = (H > ⊗ X) vec(Θ) + vec(E)
and now apply (B.5.19) again to transform this back into matrix notation
Here is one scenario how such a model may arise: assume you have n plants, you
group those plants into two different groups, the first group going from plant 1 until
plant m, and the second from plant m+1 until plant n. These groups obtain different
treatments. At r different time points you measure the same character on each of
these plants. These measurements give the rows of your Y -matrix. You assume the
following: the dispersion matrix between these measurements are identical for all
62.3. GROWTH CURVE MODELS 1331
plants, call it Ψ, and the expected values of these measurements evolves over time
following two different quadratic polynomials, one for each treatment.
This can be expressed mathematically as follows (omitting the matrix of error
terms):
(62.3.9)
y 11 y 12 ··· y 1r 1 0
.. .. .. .. .. ..
. . . .
. .
1 1 ··· 1
y m1 y m2 · · · y mr
1 0 θ
10
θ 11 θ 12
y m+1,1 y m+1,2 · · · y m+1,r = 0 1 θ20 θ21 θ22 t12 t22 · · · tp2
. ..
.. .. ..
t1 t 2 · · · t p
.. ..
. . . . .
y n1 y n2 ··· y nr 0 1
This gives the desired result y 11 = θ10 + θ11 t1 + θ12 t21 plus an error term, etc.
If one does not know Ψ, then one has to estimate it.
CHAPTER 63
This Chapter discusses a model that is a special case of the model in Chapter
62.2, but it goes into more depth towards the end.
We will choose an alternative notation, which is also found in the literature, and
write the matrix as a n × r matrix Y . As before, each column represents a variable,
and each row a usually independent observation.
Decompose Y into its row vectors as follows:
>
y1
..
(63.1.1) Y = . .
y>
n
Each row (written as a column vector) y i has mean µ and dispersion matrix Σ , and
different rows are independent of each other. In other words, E [Y ] = ιµ> . V [Y ]
is an array of rank 4, not a matrix. In terms of Kronecker products one can write
V [vec Y ] = Σ ⊗ I.
One can form the following descriptive statistics: ȳ = n1 y i is the vector of sample
means, W = i (y i − ȳ)(y i − ȳ)> is matrix of (corrected) squares and cross products,
P
the sample covariance matrix is S (n) = n1 W with divisor n, and R is the matrix of
sample correlation coefficients.
Notation: the ith sample variance is called sii (not s2i , as one might perhaps
expect).
The sample means indicate location, the sample standard deviations dispersion,
and the sample correlation coefficients linear relationship.
63.1. NOTATION AND BASIC STATISTICS 1335
How do we get these descriptive statistics from the data Y through a matrix
>
manipulation? ȳ > = n1 ι> Y ; now Y −ιȳ > = (I − ιιn )Y is the matrix of observations
with the appropriate sample mean taken out of each element, therefore
(y 1 − ȳ)>
(63.1.2) W = y 1 − ȳ · · · y n − ȳ
..
=
.
(y n − ȳ)>
ιι> > ιι> ιι>
= Y > (I − ) (I − )Y = Y > (I − )Y .
n n n
Then S (n) = n1 W , and in order to get the sample correlation matrix R, use
s11 0 · · · 0
0 s22 · · · 0
(63.1.3) D (n) = diag(S (n) ) = .
. . ..
.. .. .. .
0 0 · · · snn
and then R = (D (n) )−1/2 S (n) (D (n) )−1/2 .
In analogy to the formulas for variances and covariances of linear transformations
of a vector, one has the following formula for sample variances and covariances of
linear combinations Y a and Y b: est.cov[Y a, Y b] = a> S (n) b.
1336 63. INDEPENDENT OBSERVATIONS FROM SAME POPULATION
Problem 517. Show that E [ȳ] = µ and V [ȳ] = n1 Σ . (The latter identity can
be shown in two ways: once using the Kronecker product of matrices, and once by
partitioning Y into its rows.)
1 1 1
Answer. E [ȳ] = E [ n Y > ι] = ( [Y
n E
])> ι = n
µι> ι = µ. Using Kronecker products, one
> 1 >
obtains from ȳ = n ι Y that
1
(63.1.4) ȳ = vec(ȳ > ) = (I ⊗ ι> ) vec Y ;
n
therefore
1 1 1
(63.1.5) V [ȳ] = (I ⊗ ι> )(Σ Σ ⊗ ι> ι) = Σ
Σ ⊗ I)(I ⊗ ι) = 2 (Σ
n2 n n
63.2. TWO GEOMETRIES 1337
space, the “scatterplot geometry.” If r = 2, this is the scatter plot of the two variables
against each other.
In this geometry, the sample mean is the center of balance or center of gravity.
The dispersion of the observations around their mean defines a distance measure in
this geometry.
The book introduces this distance by suggesting with its illustrations that the
data are clustered in hyperellipsoids. The right way to introduce this distance would
be to say: we are not only interested in the r coordinates separately but also in any
linear combinations, then use our treatment of the Mahalanobis distance for a given
population, and then transfer it to the empirical distribution given by the sample.
In the other geometry, all observations of a given random variable form one point,
here called “vector.” I.e., the basic entities are the columns of Y . In this so-called
“vector geometry,” x̄ is the projection on the diagonal vector ι, and the correlation
coefficient is the cosine of the angle between the deviation vectors.
Generalized sample variance is defined as determinant of S. Its geometric intu-
ition: in the scatter plot geometry it is proportional to the square of the volume of
the hyperellipsoids, (see J&W, p. 103), and in the geometry in which the observations
of each variabe form a vector it is
One sees, therefore, that the density function depends on the observation only
through ȳ and S (n) , which means that ȳ and S (n) are sufficient statistics.
Now we compute the maximum likelihood estimators: taking the maximum for
µ is simply µ̂ = ȳ. This leaves the concentrated likelihood function
n
(63.3.5) Σ−1 S (n) ) .
max fY (Y ) = (2π)−nr/2 (det Σ )−n/2 exp − tr(Σ
µ 2
63.4. EM-ALGORITHM FOR MISSING OBSERVATIONS 1341
n
(2π)−nr/2 (det Σ )−n/2 exp − Σ−1 S (n) ) ≤ (2πe)−rn/2 (det S (n) )−n/2
(63.3.6) tr(Σ
2
Let’s follow Johnson and Wichern’s example on their p. 199. The matrix is
− 0 3
7 2 6
(63.4.1) Y = 5 1 2
− − 5
>
It
is not so important how one gets the initial estimates of µ and Σ : say µ̃ =
6 1 4 , and to get Σ̃ Σ take deviations from the mean, putting zeros in for the
missing values (which will of course underestimate the variances), and divide by
the number of observations. (Since we are talking maximum likelihood, there is no
adjustment for degrees of freedom.)
(63.4.2)
0 −1 −1
1/2 1/4 1
1 1 1 2
Σ = Y > Y where Y =
Σ̃ −1 0 −2 , i.e., Σ̃ Σ = 1/4 1/2 3/4 .
4
1 3/4 5/2
0 0 1
Given these estimates, the prediction step is next. The likelihood function de-
pends on sample mean and sample dispersion matrix only. These, in turn, are simple
functions of the vector of column sums Y > ι and the matrix of (uncentered) sums of
squares and crossproducts Y > Y , which are complete sufficient statistics. To predict
63.4. EM-ALGORITHM FOR MISSING OBSERVATIONS 1343
Furthermore,
(63.4.6)
∗ ∗ > −1
Σ] = E [(y 1 −y ∗1 )(y 1 −y ∗1 )> ] = MSE[y ∗1 ; y 1 ] = Σ̃
E [(y 1 −y 1 )(y 1 −y 1 ) |y 2 ; µ̃, Σ̃ Σ11 −Σ̃
Σ12Σ̃
Σ22
For the cross products with the observed values one can apply the linearity of
the (conditional) expectations operator:
> ∗ >
(63.4.9) E [y 1 y 2 |y 2 ; µ̃, Σ̃
Σ] = (y 1 )y 2
Now switch back to the more usual notation, in whichPy i is the ith row vector
of Y and ȳ the vector of column means. Since S (n) = n1 yi y> >
i − ȳ ȳ , one can
obtain from the above the value of
(n)
(63.4.16) E [S | all observed values in Y ; µ̃, Σ̃
Σ].
1346 63. INDEPENDENT OBSERVATIONS FROM SAME POPULATION
(63.4.17) Σ].
E [ȳ| all observed values in Y ; µ̃, Σ̃
(63.4.18)
1 1
y 11 7 5 y 41 5.73 7 5 6.4 24.13
> 1 1
E [Y ι| · · · ] = E 0
[ 2 1 y 42
1 | · · · ] = 0 2 1 1.3
1 = 4.30
3 6 2 5 3 6 2 5 16.00
1 1
(63.4.19)
y 0 3
7 5 y 41 11
y 11 148.05 27.27 101.18
> 7 2 6
E [Y Y | · · · ] = E [ 0
2 1 y 42
5 1
] = 27.27
2
6.97 20.50
3 6 2 5 101.18 20.50 74.00
y 41 y 42 5
The next step is to plug those estimated values of Y > ι and Y > Y into the
likelihood function and get the maximum likelihood estimates of µ and Σ , in other
words, set mean and dispersion matrix equal to the sample mean vector and sample
63.5. WISHART DISTRIBUTION 1347
z>
r
Pr >
j=1 z j z j is called a (central) Wishart distribution, notation Z > Z ∼ W (r, Σ ). r
1348 63. INDEPENDENT OBSERVATIONS FROM SAME POPULATION
Xc ∼ N (o, I) (the first vector having n and the second r components). Therefore
c> Z > P Zc is distributed as a χ2 , therefore we can use the necessity condition in
theorem 10.4.3 to show that P is idempotent.
As an application it follows from (63.1.2) that S (n) ∼ W (n − 1, Σ ).
63.6. SAMPLE CORRELATION COEFFICIENTS 1349
One can also show the following generalization of Craig’s theorem: If Z as above,
then Z > P Z is independent of Z > QZ if and only if P Q = O.
Given m cross-sectional units, each of which has been observed for t time periods.
The dependent variable for cross sectional unit i at time s is y si . There are also
k independent variables, and the value of the jth independent variable for cross
sectional unit i at time s is xsij . I.e., instead of a vector, the dependent variable is a
matrix, and instead of a matrix, the independent variables form a 3-way array. We
will discuss three different models here which assign equal slope parameters to the
different cross-sectional units but which differ in their treatment of the intercept.
1353
1354 64. POOLING OF CROSS SECTION AND TIME SERIES DATA
where the error terms are uncorrelated and have equal variance σε2 .
In tile notation:
t Y t ι µ ι t X k β t E
(64.1.2) = + +
m m m m
X 1 β · · · X m β represents a matrix obtained by the multiplication of a 3-way
array with a vector. We assume vec E ∼ o, σ 2 I.
If one vectorizes this one gets
(64.1.4)
ι X1 ι X1
ι X 2 ι X2
µ
vec(Y ) = . + vec(E) or vec(Y ) = . µ + . β + vec(E)
.
.. .. β .. ..
ι Xm ι Xm
Using the abbreviation
X1
(64.1.5) Z = ...
Xm
this can also be written
µ
(64.1.6) vec(Y ) = ιµ + Zβ + vec(E) = ι Z + .
β
Problem 520. 1 point Show that vec( X 1 β · · · X m β ) = Zβ with Z as
just defined.
1356 64. POOLING OF CROSS SECTION AND TIME SERIES DATA
Answer.
X1β X1
. .
(64.1.7) vec( X 1 β ··· Xmβ ) = .
. = .
. β = Zβ
Xmβ Xm
One gets the paramater estimates by regressing running OLS on (64.1.4), i.e.,
regressing vec Y on Z with an intercept.
ε.
If one transposes this one obtains ȳ = ιµ + X̄β + ε̄
64.3. DUMMY VARIABLE MODEL (FIXED EFFECTS) 1357
ι/t t Y
(64.2.2) =
m
µ ι ι/t t X k β ι/t t E
= + +
m m m
If one runs this regression one will get estimates of µ and β which are less efficient
than those from the full regression. But these regressions are consistent even if the
error terms in the same column are correlated (as they are in the Random Effects
model).
model is now
k
X
(64.3.1) y si = αi + xsij βj + εsi s = 1, . . . , t, i = 1, . . . , m,
j=1
where the error terms are uncorrelated and have equal variance σε2 . In tile notation
this is
(64.3.2)
t Y t ι α t X k β t E
= + +
m m m m
Y = ια> + X 1 β
(64.3.3) ··· X mβ + E
where Y = y 1 · · · y m is t×m, each of the X i is t×k, ι is the t-vector of ones, α
is the m-vector collecting
all the intercept terms, β the k-vector of slope coefficients,
E = ε 1 · · · ε m the matrix of disturbances. We assume vec E ∼ o, σ 2 I.
64.3. DUMMY VARIABLE MODEL (FIXED EFFECTS) 1359
Using the K defined in Problem 521 and the Z defined in (64.1.5), (64.3.4) can
also be written as
(64.3.5) vec(Y ) = Kα + Zβ + vec(E)
1360 64. POOLING OF CROSS SECTION AND TIME SERIES DATA
[JHG+ 88] give a good example how such a model can arise: s is years, i is firms,
y si is costs, and there is only one xsi for every firm (i.e. k = 1), which is sales. These
firms would have equal marginal costs but different fixed overhead charges.
In principle (64.3.4) presents no estimation problems, it is OLS with lots of
dummy variables (if there are lots of cross-sectional units). But often it is advanta-
geous to use the following sequential procedure: (1) in oder to get β̂ regress
Dy 1 DX 1
.. ..
(64.3.6) . = . β̂ + residuals
Dy m DX m
without a constant term (but if you leave the constant term in, this does not matter
either, its coefficient will be exactly zero). Here D is the matrix which takes the
mean out. I.e., take the mean out of every y individually and out of every X before
running the regression. (2) Then you get each α̂i by the following equation:
(64.3.7) α̂i = ȳ i − x̄>
i β̂
Answer. Equation (64.3.4) has the form of (30.0.1). Define D = I − ιι> /t and W = I −
K(K > K)−1 K > = I ⊗ D. According to (30.0.3) and (30.0.4), β̂ and the vector of residuals can be
obtained by regressing W vec(Y ) on W Z, and if one plugs this estimate β̂ back into the formula,
then one obtains an estimate of α.
Without using the Kronecker product, this procedure can be described as follows: one gets the
right β̂ if one estimates (64.3.3) premultiplied by D. Since Dι = o, this premultiplication removes
the first parameter vector α from the regression, so that only
(64.3.8) DY = DX 1 β ··· DX m β + DE
remains—or, in vectorized form,
Dy 1 DX 1 Dεε1
(64.3.9) .. = .. β + ..
. . .
Dy m DX m εm
Dε
Although vec(DE) is no longer spherically distributed, it can be shown that in the present situation
the OLS of β is the BLUE.
After having obtained β̂, one obtains α̂ by plugging this estimated β̂ into (64.3.3), which gives
(64.3.10) Y − X 1 β̂ ··· X m β̂ = ια> + E
Here each column of Y is independent of all the others, they no longer share common parameters,
therefore one can run this regression column by column:
(64.3.11) y i − X i β̂ = ιαi + ε i i = 1, . . . , m
1362 64. POOLING OF CROSS SECTION AND TIME SERIES DATA
Since the regressor is the column of ones, one can write down the result immediately:
(64.3.12) α̂i = ȳ i − x̄>
i β̂
where ȳ i is the mean of y i , and x̄>
i is the row vector consisting of the column means of X i .
2 2
To get the unbiased estimate of σ , one can almost take the s from the regression
(64.3.9), one only has to adjust it for the numbers of degrees of freedom.
Problem 523. We are working in the dummy-variable model for pooled data,
which can be written as
Y = ια> + X 1 β · · · X m β + E
(64.3.13)
where Y = y 1 · · · y m is t × m, each of the X i is t × k, ι is the t-vector of
ones, E is a t × m matrix of identically distributed independent error terms with zero
mean, and α is a m-vector and β a k-vector of unknown nonrandom parameters.
• a. 3 points Describe in words the characteristics of this model and how it can
come about.
Answer. Each of the m units has a different intercept, slope is the same. Equal marginal
costs but different fixed costs.
• b. 4 points Describe the issues in estimating this model and how it should be
estimated.
64.3. DUMMY VARIABLE MODEL (FIXED EFFECTS) 1363
Answer. After vectorization OLS is fine, but design matrix very big. One can derive formulas
that are easier to evaluate numerically because they involve smaller matrices, by exploiting the
structure of the overall design matrix. First estimate the slope parameters by sweeping out the
means, then the intercepts.
Answer. The unrestricted regression is the dummy variables regression which was described
here: first form DY and all the DX i , then run regression (64.3.9) without intercept, which is
already enough to get the SSE r .
Number of constraints is m − 1, number of observations is tm, and number of coefficients in
the unrestricted model is k + m. The test statistic is given in [JHG+ 88, (11.4.25) on p. 475]:
1364 64. POOLING OF CROSS SECTION AND TIME SERIES DATA
Answer. If one believes that variances are similar, and if one is not interested in those par-
ticular firms in the sample, but in all firms.
Answer. Both models involve different cross-sectional units in overlapping time intervals. In
the SUR model, the different equations are related through the disturbances only, while in the
dummy variable model, no relationship at all is going through the disturbances, all the errors are
independent! But in the dummy variable model, the equations are strongly related since all slope
coefficients are equal in the different equations, only the intercepts may differ. In the SUR model,
there is no relationship between the parameters in the different equations, the parameter vectors
may even be of differant lengths. Unlike [JHG+ 88], I would not call the dummy variable model a
special case of the SUR modiel, since I would no longer call it a SUR model if there are cross-equation
restrictions.
64.5. VARIANCE COMPONENTS MODEL (RANDOM EFFECTS) 1365
>
(64.5.5) V [vec(ιδ
2
+ E)] = σα I m ⊗ ιι> + σε2 I m ⊗ I t = I m ⊗ (σα
2 >
ιι + σε2 I t ) = I m ⊗ V
• a. 3 points Show that the BLUE in this model based on all observations is
ˆ = X X >Σ −1 X −1 X X >Σ −1 y
(64.5.6) β̂ i i i i i i
i i
Since the columns of ιδ > +E are independent and have equal covariance matrices,
it is possible to transform ιδ > + E into a matrix of uncorrelated and homoskedastic
64.5. VARIANCE COMPONENTS MODEL (RANDOM EFFECTS) 1369
The connection between the dummy variable model and the error components
model becomes most apparent if we scale P such that P V P > = σε2 I, i.e., P is the
square root of the inverse of V /σε2 . In Problem 527 we must therefore set Ω = ιι> /t,
ν = 1, and ω = tσα2 /σε2 . The matrix which diagonalizes the error covariance matrix
is therefore
s
ιι> σε2 /t
(64.5.10) P =I −γ where γ =1−
t σε /t + σα2
2
σ 2 /t
2
Answer. Since P ι = ι(1−γ), P wi = ιδ i (1−γ)+P ε i and V [P wi ] = ισα ε
σ 2 /t+σ 2
ι> +σε2 P P > .
ε α
2
−2γ >
Now P P > = I + ι γ t
ι , and
−σα 2
(64.5.11) γ 2 − 2γ = (1 − γ)2 − 1 =
σε2 /t + σα
2
Therefore
σα2 σ 2 /t
(64.5.12) σε2 P P > = σε2 I + ισε2 /t(γ 2 − 2γ)ι> = σε2 I − ι ε
ι>
σε2 /t + σα 2
and V [P wi ] = σε2 .
Problem 529. 1 point Show that P ι = ι(1 − γ) and that the other eigenvectors
of P are exactly the vectors the elements of which sum to 0, with the eigenvalues 1.
Derive from this the determinant of P .
>
Answer. P ι = (I − γ ιιt )ι = ι(1 − γ). Now if a vector a satisfies ι> a = 0, then P a = a.
Since there are t − 1 independent such vectors, this gives all eigenvectors. det(P ) = 1 − γ (the
product of all eigenvalues).
Problem 530. 3 points Now write down this likelihood function, see [Gre97,
exercise 4 on p. 643].
Answer. Assuming normality, the ith column vector is y i ∼ N (ιµ + X i β, V ) and different
columns are independent. Since V [P wi ] = P V P > = σε2 I it follows det(V ) = σε2t (det P )−2 .
64.5. VARIANCE COMPONENTS MODEL (RANDOM EFFECTS) 1371
Comparing this P with the D which we used to transform the dummy variable
model, we see: instead of subtracting the mean from every column, we subtract γ
times the mean from every column. This factor γ approaches 1 as t increases and as
σα2 increases. If one premultiplies (64.5.3) by P one gets
P y1 ι P X1
(64.5.16)
.. .. .. (1 − γ)µ + spherical disturbances,
. = . . β
P yt ι P Xm
To sum up, if one knows γ, one can construct P and has to apply P to Y and all
X i and then run a regression with an intercept. The estimate of µ is this estimated
intercept divided by 1 − γ.
How can we estimate the variances? There is a rich literature about estimation
of the variances in variance component models. ITPE gives a very primitive but
intuitive estimator. An estimate of σε2 can be obtained from the dummy variable
model, since the projection operator in (64.3.9) removes α together with its error
term.
Information about σα2 can be obtained from the variance from the “between”-
regression which one gets by premultiplying (64.5.3) by 1t ι> . Defining ȳ > = 1t ι> Y ,
i.e., ȳ > is the row vector consisting of the column means, and in the same way
64.5. VARIANCE COMPONENTS MODEL (RANDOM EFFECTS) 1373
x̄> 1 >
ε> = 1t ι> E, one obtains
i = t ι X i and ε̄
(64.5.17)
x̄>
1
ȳ > = µι> + x̄> X̄ = ...
> >
x̄> ε = µι> +(X̄β)> +δ > +ε̄
ε>
1β ··· m β +δ +ε̄ where
x̄>
m
no longer guaranteed. Individual variances can obtain negative estimates when the formula for the
variance contains several parameters which are estimated separately. In the variance components
model, the variance is estimated as the difference between two other variances, which are estimated
separately so that there is no guarantee that their difference is nonnegative. If the estimated ρ in
an AR1-process comes out to be greater than 1, then the estimated covariance matrix is no longer
nnd, and the formula var[ε] = var[v]/(1 − ρ2 ) yields negative variances.
64.5.1. Testing. The variance component model relies on one assumption which
is often not satisfied: the errors in α and the errors in E must be uncorrelated. If
this is not the case, then the variance components estimator suffers from omitted
variables bias. Hausman used this as a basis for a test: if the errors are uncorrelated,
then the GLS is the BLUE, and the Dummy variable estimator is consistent, but
not efficient. If the errors are correlated, then the Dummy variables estimator is still
consistent, but the GLS is no longer BLUE. I.e., one should expect that the differ-
ence between these estimators is much greater when the error terms are correlated.
And under the null hypothesis that the error terms are orthogonal, there is an easy
way to get the covariance matris of the estimators: since the GLS is BLUE and the
other estimator is unbiased, the dispersion matrix of the difference of the estimators
is the difference of their dispersion matrices. For more detail see [Gre97, 14.4.4].
CHAPTER 65
1375
1376 65. SEEMINGLY UNRELATED
tm Y tm X km B tm E
(65.0.18) = ∆ +
m m m
y1 X1 O ··· O β1 ε1
y2 O X2 ··· O β2 ε 2
(65.1.1) .. = .. .. + ..
.. .. ..
. . . . . . .
65.1. THE SUPERMATRIX REPRESENTATION 1377
The covariance matrix of the disturbance term in (65.1.1) has the following “striped”
form:
ε1 σ11 I 11 σ12 I 12 · · · σ1m I 1m
ε2 σ21 I 21 σ22 I 22 · · · σ2m I 2m
(65.1.2) V [ .. ] =
.. .. .. ..
. . . . .
εm σm1 I m1 σm2 I m2 · · · σmm I mm
Here I ij is the ti × tj matrix which has zeros everywhere except at the intersections
of rows and columns denoting the same time period.
In the special case that all time periods
are identical,
i.e., all ti = t, one can
define the matrices Y = y 1 · · · y m and E = ε 1 · · · ε m , and write the
equations in matrix form as follows:
(65.1.3) Y = X 1 β1 . . . X m β m + E = H(B) + E
The vector of dependent variables and the vector of disturbances in the supermatrix
representation (65.1.1) can in this special case be written in terms of the vector-
ization operator as vec Y and vec E. And the covariance matrix can be written as
a Kronecker product: V [vec E] = Σ ⊗ I, since all I ij in (65.1.2) are t × t identity
1378 65. SEEMINGLY UNRELATED
If in addition all regressions have the same number of regressors, one can combine
the coefficients into a matrix B and can write the system as
(65.1.4) vec Y = Z vec B + vec E vec E ∼ (o, Σ ⊗ I),
65.1. THE SUPERMATRIX REPRESENTATION 1379
the numbers of observations in the different regressions are unequal, then the formula
for the GLSE is no longer so simple. It is given in [JHG+ 88, (11.2.59) on p. 464].
Answer.
a>
a> a> a>
" #
1
1 Ω a1 1 Ω a2 ··· 1 Ω at
> .
A Ω A = .. Ω a1 ... at = a>
2 Ω a1 a>
2 Ω a2 ··· a>
2 Ω at
a> a>
t Ω a1 a>
t Ω a2 ··· a>
t Ω at
t
To derive the likelihood function, define the matrix function H(B) as follows:
H(B) is a t × m matrix the ith column of which is X
i β i , i.e., H(B) as a column-
partitioned matrix is H(B) = X 1 β 1 · · · X m β m . In tiles,
tm X km B
(65.2.1) H(B) = ∆
The above notation follows [DM93, 315–318]. [Gre97, p. 683 top] writes this
same H as the matrix product
(65.2.2) H(B) = ZΠ(B)
where Z has all the different regressors in the different regressions as columns (it is
Z = X 1 · · · X n with duplicate columns deleted), and the ith column of Π has
zeros for those regressors which are not in the ith equation, and elements of B for
those regressors which are in the ith equation.
Using H, the model is simply, as in (65.0.18),
(65.2.3) Y = H(B) + E, vec(E) ∼ N (o, Σ ⊗ I)
1382 65. SEEMINGLY UNRELATED
η>
t (B)
is
t
Y 1
fY (Y ) = (2π)−m/2 (det Σ )−1/2 exp − (y s − η s (B))>Σ −1 (y s − η s (B))
s=1
2
1X
= (2π)−mt/2 (det Σ )−t/2 exp − (y s − η s (B))>Σ −1 (y s − η s (B))
2 s
1
= (2π)−mt/2 (det Σ )−t/2 exp − tr(Y − H(B))Σ Σ−1 (Y − H(B))>
2
1
−mt/2 −t/2
exp − tr(Y − H(B))> (Y − H(B))Σ Σ−1 .
(65.2.5) = (2π) (det Σ )
2
Problem 533. Expain exactly the step in the derivation of (65.2.5) in which the
trace enters.
1384 65. SEEMINGLY UNRELATED
In order to concentrate out Σ it is simpler to take the partial derivatives with respect
to Σ −1 than those with respect to Σ itself. Using the matrix differentiation rules
(C.1.24) and (C.1.16) and noting that −t/2 log det Σ = t/2 log det Σ −1 one gets:
∂` t 1
(65.2.12) = Σ − (Y − H(B))> (Y − H(B)),
Σ−1
∂Σ 2 2
and if we set this zero we get
1
(65.2.13) Σ̂(B) = (Y − H(B))> (Y − H(B)).
t
Written row vector by row vector this is
t
1X
(65.2.14) Σ̂ = (y − η s (B))(y s − η s (B))>
t s=1 s
The maximum likelihood estimator of Σ is therefore simply the sample covariance
matrix of the residuals taken with the maximum likelihood estimates of B.
We know therefore what the maximum likelihood estimator of Σ is if B is known:
it is the sample covariance matrix of the residuals. And we know what the maximum
likelihood estimator of B is if Σ is known: it is given by equation (65.1.6). In such a
situation, one good numerical method is to iterate: start with an initial estimate of
Σ (perhaps from the OLS residuals), get from this an estimate of B, then use this
1386 65. SEEMINGLY UNRELATED
to get a second estimate of Σ , etc., until it converges. This iterative scheme is called
iterated Zellner or iterated SUR. See [Ruu00, p. 706], the original article is [Zel62].
Here is a derivation of this using tile notation. We use the notation Ê = Y − H(B)
for the matrix of residuals, and apply the chain rule to get the derivatives:
∂`c ∂`c ∂ Σ̂ ∂ Ê
(65.3.4) >
= > > >
∂Π ∂ Σ̂ ∂ Ê ∂Π
The product here is not a matrix product but the concatenation of a matrix with
three arrays of rank 4. In tile notation, the first term in this product is
∂`c t −1
(65.3.5) >
=∂ `c /∂ Σ̂ = Σ̂
∂ Σ̂ 2
1388 65. SEEMINGLY UNRELATED
This is an array of rank 2, i.e., a matrix, but the other factors are arrays of rank 4:
Using (C.1.22) we get
Ê .
∂ Σ̂ . 1
>
= ∂ Σ̂ ∂ Ê = ∂ ∂ Ê =
∂ Ê t
Ê
X X
1 1
= +
t t
Finally, by (C.1.18),
Z . Z
∂ Ê
= ∂ ∂ Π =
∂Π> Π
65.4. SITUATIONS IN WHICH OLS IS BEST 1389
Putting it all together, using the symmetry of the first term (65.3.5) (which has the
effect that the term with the crossing arms is the same as the straight one), gives
∂`c Ê Z
=∂ `c / ∂ Π = Σ̂−1
∂Π>
(65.4.1) y i = Xβ i + ε i i = 1, . . . m
in which all X i are equal to X, note that equation (65.4.1) has no subscript at the
matrices of explanatory variables.
1390 65. SEEMINGLY UNRELATED
y 1 · · · y m , B = β 1 · · · β m and E =
• a. 1 point Defining Y =
ε 1 · · · ε m , show that the m equations (65.4.1) can be combined into the single
matrix equation
(65.4.2) Y = XB + E.
Answer.
The only step
needed to show this is that XB, column by column, can be written
XB = Xβ 1 . . . Xβ m .
−1
vec(B̂) = (I ⊗ X)> (Σ
Σ ⊗ I)−1 (I ⊗ X) (I ⊗ X)> (Σ
Σ ⊗ I)−1 vec(Y )
−1
= (I ⊗ X > )(Σ
Σ−1 ⊗ I)(I ⊗ X) (I ⊗ X > )(Σ
Σ−1 ⊗ I) vec(Y )
−1
= Σ −1 ⊗ X > X Σ−1 ⊗ X > ) vec(Y )
(Σ
= I ⊗ (X > X)−1 X > vec(Y )
Answer. Look at the derivation of (65.4.3) again. The Σ −1 in numerator and denominator
cancel out since they commute with Z. defining Ω = Σ ⊗ I, this “commuting” is the formula
1392 65. SEEMINGLY UNRELATED
Note that the I on the lefthand side are m × m, and those on the right are k × k. This “commuting”
allows us to apply Kruskal’s theorem.
Joint estimation has therefore the greatest efficiency gains over OLS if the cor-
relations between the errors are high and the correlations between the explanatory
variables are low.
Problem 535. Are following statements true or false?
• a. 1 point In a seemingly unrelated regression framework, joint estimation of
the whole model is much better than estimation of each equation singly if the errors
are highly correlated. True or false?
Answer. True
Assume I have two equations whose disturbances are correlated, and the second
has all variables that the first has, plus some additional ones. Then the inclusion
of the second equation does not give additional information for the first; however,
including the first gives additional information for the second!
1394 65. SEEMINGLY UNRELATED
What is the rationale for this? Since the first equation has fewer variables than
the second, I know the disturbances better. For instance, if the equation would
not have any variables, then I would know the disturbances exactly. But if I know
these disturbances, and know that they are correlated with the disturbances of the
second equation, then I can also say something about the disturbances of the second
equation, and therefore estimate the parameters of the second equation better.
where all σij are known, and the set of explanatory variables in X 1 is a subset of
those in X 2 . One of the following two statements is correct, the other is false. Which
is correct? (a) in order to estimate β 1 , OLS on the first equation singly is as good
as SUR. (b) in order to estimate β 2 , OLS on the second equation singly is as good
as SUR. Which of these two is true?
Answer. The first is true. One cannot obtain a more efficient estimator of β 1 by considering
the whole system. This is [JGH+ 85, p. 469].
65.5. UNKNOWN COVARIANCE MATRIX 1395
Problem 537. 4 points Explain how to do iterated EGLS (i.e., GLS with an
estimated covariance matrix) in a model with first-order autoregression, and in a
seemingly unrelated regression model. Will you end up with the (normal) maximum
likelihood estimator if you iterate until convergence?
Answer. You will only get the Maximum Likelihood estimator in the SUR case, not in the
AR1 case, because the determinant term will never come in by iteration, and in the AR1 case, EGLS
is known to underestimate the ρ. Of course, iterated EGLS is in both situations asymtotically as
good as Maximum Likelihood, but the question was whether it is in small samples already equal to
the ML. You can have asymptotically equivalent estimates which differ greatly in small samples.
1396 65. SEEMINGLY UNRELATED
66.1. Examples
+
[JHG 88, 14.1 Introduction] gives examples. The first example is clearly not
identified, indeed it has no exogenous variables. But the idea of a simultaneous
equations system is not dependent on this:
(66.1.1) y d = ια + pβ + ε 1
(66.1.2) y s = ιγ + pδ + ε 2
(66.1.3) yd = ys
1397
1398 66. SIMULTANEOUS EQUATIONS SYSTEMS
y d , y s , and p are the jointly determined endogenous variables. The first equation
describes the behavior of the consumers, the second the behavior of producers.
Problem 539. [Gre97, p. 709 ff]. Here is a demand and supply curve with q
quantity, p price, y income, and ι is the vector of ones. All vectors are t-vectors.
ε d and ε s are independent of y, but amongst each other they are contemporaneously
correlated, with their covariance constant over time:
(
0 if t 6= u
(66.1.6) cov[εdt , εsu ] =
σds if t = u
Answer. p and q are called jointly dependent or endogenous. y is determined outside the
system or exogenous.
66.1. EXAMPLES 1399
β0 ι + β1 p + ε s = α 0 ι + α 1 p + α 2 y + ε d
(β1 − α1 )p = (α0 − β0 )ι + α2 y + ε d − ε s ,
hence (66.1.7). To get the reduced form equation for q, plug that for p into the supply function
(one might also plug it into the demand function but the math would be more complicated):
β1 (α0 − β0 ) β1 α 2 εd − ε s )
β1 (ε
q = β 0 ι + β 1 p + ε s = β0 ι + ι+ y+ + εs
β1 − α 1 β1 − α 1 β1 − α 1
Combining the first two and the last two terms gives (66.1.8).
• c. 2 points Show that one will in general not get consistent estimates of the
supply equation parameters if one regresses q on p (with an intercept).
1400 66. SIMULTANEOUS EQUATIONS SYSTEMS
εdt −εst
Answer. By (66.1.7) (the reduced form equation for p), cov[εst , pt ] = cov[εst , β1 −α1
] =
2
σsd −σs
β1 −α1
. This is generally 6= 0, therefore inconsistency.
Divide to get
P
(y − ȳ)(q i − q̄)
β1 estimated by P i
(y i − ȳ)(pi − p̄)
• f. 1 point Since the error terms in the reduced form equations are contem-
poraneously correlated, wouldn’t one get more precise estimates if one estimates the
reduced form equations as a seemingly unrelated system, instead of OLS?
Answer. Not as long as one does not impose any constraints on the reduced form equations,
since all regressors are the same.
• g. 2 points We have shown above that the regression of q on p does not give
a consistent estimator of β1 . However one does get a consistent estimator of β1 if
one regresses q on the predicted values of p from the reduced form equation. (This
is 2SLS.) Show that this estimator is also the same as above.
P
(q i −q̄)(p̂i −p̄)
Answer. This gives β̃ 1 = P 2
. Now use p̂i − p̄ = π̂ 1 (y i − ȳ) where π̂ 1 =
P (p̂i −p̄) P
P (pi −p̄)(yi −ȳ) (q i −q̄)(y i −ȳ) (y i −ȳ)(q i −q̄)
P 2
. Therefore β̃ = π̂ 1 2
P 2
= P again.
(y i −ȳ) π̂ 1 (y i −ȳ) (y i −ȳ)(pi −p̄)
1402 66. SIMULTANEOUS EQUATIONS SYSTEMS
Answer. You can’t. The supply function can be estimated because it stays put while the
demand function shifts around, therefore the observed intersection points lie on the same supply
function but different demand functions. The demand function itself cannot be estimated, it is
underidentified in this system.
(66.1.9) c = α + βy + ε
(66.1.10) y =c+i
Exogenous means: determined outside the system. By definition this always means:
it is independent of all the disturbance terms in the equations (here there is just one
disturbance term). Then the first claim is: y is correlated with ε, because y and c are
determined simultaneously once i and ε is given, and both depend on i and ε. Let
us do that in more detail and write the reduced form equation for y. That means,
let us express y in terms of the exogenous variable and the disturbances only. Plug
66.1. EXAMPLES 1403
(66.1.11) y − i = α + βy + ε
(66.1.12) or y(1 − β) = α + i + ε
α 1 1
(66.1.13) y= + i+ ε
1−β 1−β 1−β
α β 1
(66.1.14) and c=y−i= + i+ ε
1−β 1−β 1−β
1 σ2
(66.1.15) cov(y, ε) = 0 + 0 + cov(ε, ε) =
1−β 1−β
Problem 540. 4 points Show that OLS applied to equation (66.1.9) gives an
estimate which is in the plim larger than the true β.
1404 66. SIMULTANEOUS EQUATIONS SYSTEMS
Answer.
β 1
cov[y, c] (1−β)2
var[i] + (1−β)2
var[ε]
(66.1.16) plim β̂ = = 1 1
=
var[y] var[i] + var[ε]
(1−β)2 (1−β)2
β var[i] + var[ε] (1 − β) var[ε]
= =β+ >β
var[i] + var[ε] var[i] + var[ε]
One way out is to estimate with instrumental variables. The model itself provides
an instrument for y, namely, the exogenous variable i.
As an alternative one might also estimate the reduced form equation (66.1.13)
and then get the structural parameters from that. I.e., let â and b̂ be the regression
coefficients of (66.1.13). Then one can set, for the slope parameter β,
1 b̂ − 1
(66.1.17) b̂ = or β̂ = .
1 − β̂ b̂
This estimation method is called ILS, indirect least squares, because the estimates
were obtained indirectly, by estimating the reduced form equations.
66.2. GENERAL MATHEMATICAL FORM 1405
Which of these two estimation methods is better? It turns out that they are
exactly the same. Proof: from b̂ = cd
ov(y, i)/var(i)
c follows
b̂ − 1 ov(y, i) − var(i)
cd c ov(c, i)
cd
(66.1.18) β̂ = = = .
b̂ cd
ov(y, i) cdov(y, i)
• how many equations (structural equations) there should be and how the
system should be “closed”
• algebraic form of the equations, also the question in which scales (logarith-
mic scale, prices or inverse prices, etc.) the variables are to be measured.
• distribution of the random errors
A general mathematical form for a simultaneous equations system is
(66.2.1) Y Γ = XB + E
If one splits Y , X, and E into their columns one gets
γ11 . . . γ1m
y 1 . . . y m ... .. .. =
. .
γM 1 . . . γmm
β11 . . . β1m
= x1 . . . xk ... .. .. + ε
. . 1 . . . εm
βK1 . . . βkm
The standard assumptions are that E [E|X] = O and V [vec E|X] = Σ ⊗ I with
an unknown nonsingular Σ . Γ is assumed nonsingular as well. Furthermore it is
assumed that plim 1t X > X exists and is nonsingular, and that plim n1 X >ε = o.
66.2. GENERAL MATHEMATICAL FORM 1407
Problem 541. 1 point If V [vec E] = Σ ⊗ I, this means (check the true answer
or answers) that
• different rows of E are uncorrelated, and every row has the same covariance
matrix, or
• different columns of E are uncorrelated, and every column has the same
covariance matrix, or
• all εij are uncorrelated.
Answer. The first answer is right.
Now the reduced form equations: postmultiplying by Γ−1 and setting Π = BΓ−1
and V = EΓ−1 one obtains Y = XΠ + V .
Problem 542. If V [vec E] = Σ ⊗ I and V = EΓ−1 , show that V [vec V ] =
(Γ ) Σ Γ−1 ⊗ I.
−1 >
Σ
Answer. First use (B.5.19) to develop vec V = − vec(IEΓ−1 ) = − (Γ−1 )> ⊗ I vec E,
therefore
(66.2.2) V [vec V ] = (Γ−1 )> ⊗ I V [vec E] Γ−1 ⊗ I = (Γ−1 )> ΣΓ−1 ⊗ I.
1408 66. SIMULTANEOUS EQUATIONS SYSTEMS
Here is an example, inspired by, but not exactly identical to, [JHG+ 88, pp.
607–9]. The structural equations are:
(66.2.3) y 1 = −y 2 γ21 + x2 β21 + ε 1
(66.2.4) y 2 = −y 1 γ12 + x1 β12 + x3 β32 + ε 2
This is the form in which structural equations usually arise naturally: one of the
endogenous variables is on the left of each of the structural equations. There are as
many structural equations as there are endogenous variables. The notation for the
unknown parameters and minus sign in front of γ12 and γ21 come from the fact that
these parameters are elements of the matrices Γ and B.
In matrix notation, this system of structural equations becomes
0 β12
1 γ12
(66.2.5) y1 y2 = x1 x2 x3 β21 0 + ε1 ε2
γ21 1
0 β32
Note that normalization conventions and exclusion restrictions are built directly into
Γ and B. In general it is not necessary that each structural equation has a different
endogenous variable on the left. Often the same endogenous variable may be on the
lefthand side of more than one structural equation. In this case, Γ in (66.2.5) does
not have 1 in the diagonal but has a 1 somewhere in every column.
66.2. GENERAL MATHEMATICAL FORM 1409
In the present hypothetical exercise we are playing God and therefore know the
true parameter values −γ21 = 1, −γ12 = 2, β21 = 2, β12 = 3, and β32 = 1. And
while the earthly researcher only knows that the following two matrices exist and are
nonsingular, we know their precise values:
1 ε> 1 ε> >
1 ε1 ε1 ε2 5 1
1
plim ε 1 ε 2 = plim =
n ε>2 n ε> >
2 ε1 ε2 ε2 1 1
> > > >
x1 x1 x1 x1 x2 x1 x3 1 1 0
1 x1 x2 x3 = plim 1 x>
plim x> x> x>
2 x1 2 x2 2 x3 = 1 2 0
2
n > n
x3 x3 x1 x3 x2 x>
> >
x
3 3 0 0 1
Besides, the assumption is always (and this is known to the earthly researcher too):
> >
x ε x>
x 1 ε2 0 0
1 1> 1 1> 1
x2 ε 1 x>
plim x2 ε 1 ε 2 = plim 2 ε2 = 0 0
n > n
x3 x>3 ε 1 x>
3 ε 2 0 0
First let us compute the true values of the reduced form parameters. Insert the
known parameter values into (66.2.5):
0 3
1 −2
(66.2.6) y1 y2 = x1 x2 x3 2 0 + ε 1 ε 2
−1 1
0 1
1410 66. SIMULTANEOUS EQUATIONS SYSTEMS
Using
−1 0 3 3 3
1 −2 1 2 1 2
=− and 2 0 = 2 4 ,
−1 1 1 1 1 1
0 1 1 1
we can solve as follows:
3 3
1 2
(66.2.7) y1 y2 = − x1 x2 x3 2 4 − ε 1 ε2
1 1
1 1
I.e., thetrue parameter
matrix Π in the reduced form equation Y = XΠ + EΓ−1 is
3 3
Π = − 2 4. If we postmultiply (66.2.7) by [ 10 ] and [ 01 ] we get the reduced-form
1 1
equation written column by column:
3
1
y 1 = − x1 x2 x3 2 − ε 1 ε 2 = −3x1 − 2x2 − x3 − ε 1 − ε 2
1
1
3
2
y 2 = − x1 x2 x3 4 − ε 1 ε 2 = −3x1 − 4x2 − x3 − 2ε ε1 − ε 2
1
1
66.2. GENERAL MATHEMATICAL FORM 1411
Problem 543. Show that the plims of the OLS estimates in equation (66.2.3)
are plim γ̂ 21;OLS = −0.6393 6= −1 and plim β̂ 21;OLS = 0.0164 6= 2, i.e., OLS is
inconsistent. Do these plims depend on the covariance matrix of the disturbances?
−γ21
(66.2.8) y1 = y2 x2 + ε1 = Z 1 δ1 + ε1
β21
1 y> −1
−γ̂ 21;OLS 2 y2 y>
2 x2
1 y>
2 y1
(66.2.9) =
β̂ 21;OLS n x>
2 y2 x>
2 x2 n x>
2 y1
The plims of the squares and cross products of the xi and y i can be computed from those of
the xi and ε i which we know since we are playing God. Here are those relevant for running OLS
1412 66. SIMULTANEOUS EQUATIONS SYSTEMS
Also
x> x>
" # " # " #
1 3 1
1
y>
2 y1 = 3 4 1 x>
2 x1 x2 x3 2 + 3 4 1 x>
2 ε1 ε2 +
1
x>
3 1 x>
3
" #
3
ε>
1
ε>
1
1
+ 2 1 x1 x2 x3 2 + 2 1 ε1 ε2
ε>
2 ε>
2 1
1
" #" #
1 1 0 3
1 5 1 1
plim y >
2 y1 = 3 4 1 1 2 0 2 + 2 1 = 44 + 14 = 58
n 1 1 1
0 0 1 1
Finally
" #
3
1
x>
2 y1 =− x>
2 x1 x>
2 x2 x>
2 x3 2 − x>
2 ε1 x>
2 ε2 1
1
" #
3
1
plim x> y =− 1 2 0 2 = −7
n 2 1
1
One sees that the covariance matrix of the disturbance terms enters some of these results.
Putting it all together gives
−1
−γ̂ 21 91 −11 58 1 2 11 58 1 39 0.6393 1
plim = = = = 6=
β̂ 21 −11 2 −7 61 11 91 −7 61 1 0.0164 2
1414 66. SIMULTANEOUS EQUATIONS SYSTEMS
The coefficients of the first structural equation are in the first columns of Γ and B.
Let us write these first columns separately:
π11 − π12 γ21 =
π11 π12 0
1 π21 − π22 γ21 =
π21 π22 = β21 or (66.3.2)
−γ21
π31 π32 0 π31 − π32 γ21 =
One sees that there are two ways to get γ21 from the elements of Π: γ21 = π11 /π12
or γ21 = π31 /π32 . The ILS principle gives us therefore two different consistent
estimates of γ21 , but no obvious way to combine them. This is called: the first
structural equation is “overidentified.” If one looks at the true values one sees that
indeed π11 /π12 = π31 /π32 . The estimation of the reduced form equations does not
take advantage of all the information given in the structural equations: they should
have been estimated as a constrained estimate, not with a linear constraint but a
bilinear constraint of the form π11 π32 = π31 π12 . ILS is therefore not the most efficient
estimation method for the first structural equation.
How about the second structural equation?
−π11 γ12 + π12 =
π11 π12 β12
−γ12 −π21 γ12 + π22 =
π21 π22 = 0 or (66.3.3)
1
π31 π32 β32 −π31 γ12 + π32 =
1416 66. SIMULTANEOUS EQUATIONS SYSTEMS
This can be solved uniquely: γ12 = π22 /π21 , β12 = π12 − π11 π22 /π21 , β32 = π32 −
π31 π22 /π21 . Therefore one says that the second equaton is exactly identified.
It is also possible that an equation is not identified. This identification status is
not a property of ILS, but a property of the model.
Remember how we got the ILS estimates Γ̃ and B̃: First we ran the regression
on the unrestricted reduced form to get Π̂ = (X > X)−1 X > Y , and then we solved
the equation Π̂Γ̃ = B̃ where Γ̃ and B̃ have the zeros and the normalization ones
inserted at the right places, see (66.3.3).
In the case of the 2nd equation this becomes
(66.4.1) Π̂γ̃ 2 = β̃ 2
or
> >
x1
γ̃ 12 x1 β̃ 12
(66.4.3) x>
2
y1 y2 = x>
2
x1 x2 x3 0
> 1 >
x3 x3 β̃ 32
This simplifies
> >
x1 x1
(66.4.4) x>
2
(y γ̃
1 12 + y 2 ) = x>
2
(x1 β̃ 12 + x3 β̃ 32 )
>
x3 x>
3
1418 66. SIMULTANEOUS EQUATIONS SYSTEMS
Now rearrange
> > >
x1 x1 x1 β̃ 12
(66.4.5) x> >
x>
2
y 2 = x2 (x β̃
1 12 −y γ̃
1 12 +x β̃
3 32 ) = 2
x1 y1 x3 −γ̃ 12
> >
x3 x3 x>
3 β̃ 32
I will show that (66.4.5) is exactly the normal equation for the IV estimator.
Write the second structural equation as
β12
(66.4.6) y 2 = x1 y 1 x3 −γ12 + ε 2
β32
The matrix of instruments is W = x1 x2 x3 , i.e., x1 and x3 are instruments for
themselves, and x2 is an instrument for y 1 . Now remember the IV normal equation
in this simplified case: instead of X > X β̂ = X > y one has W > X β̃ = W > y. In our
situation this gives
> >
x1 x1 β̃ 12
x> y 2 = x>
(66.4.7) 2 2
x1 y 1 x3 −γ̃ 12
x>3 x >
3 β̃ 32
which is, as claimed, the same as (66.4.5).
66.5. IDENTIFICATION 1419
Now in the overidentified case, ILS does not have a good method to offer. There
are more than one ways to get the reduced form estimates from the structural es-
timates, and the ILS principle says that one could use either one, but there is no
easy way to combine them. The estimation approach by Instrumental Variables, on
the other hand, has an obvious way to take advantage of overidentification: one will
do Instrumental Variables in the generalized case in which there are “too many”
instruments. This is exactly 2SLS.
Problem 544. 1 point Describe the two “stages” in the two stages least squares
estimation of a structural equation which is part of a simultaneous equations system.
66.5. Identification
How can one tell by looking at the structural equations whether the equation
is exactly identified or underidentified or overidentified? If one just has one system,
solving the reduced form equations by hand is legitimate.
The so-called “order condition” is not sufficient but necessary for identification.
One possible formulation of it is: each equation must have at least m − 1 exclusions.
One can also say, and this is the formulation which I prefer: for each endogenous
variable on the righthand side of the structural equation, at least one exogenous
variable must be excluded from this equation.
1420 66. SIMULTANEOUS EQUATIONS SYSTEMS
Problem 545. This example is adapted from [JHG+ 88, (14.5.8) on p. 617]:
• a. 2 points Use the order condition to decide which of the following equations
are exactly identified, overidentified, not identified.
Answer. (66.5.4) is exactly identified since there are no endogenous variable on the right hand
side, but all exogenous variables are on the right hand side. (66.5.3) is not identified, it has 3 y’s on
the right hand side but only excludes two x’s. (66.5.2) overfulfils the order condition, overidentified.
(66.5.1) is exactly identified.
66.5. IDENTIFICATION 1421
• b. 1 point Write down the matrices Γ and B (indicating where there are zeros
and ones) in the matrix representation of this system, which has the form
γ11 γ12 γ13 γ14
γ21 γ22 γ23 γ24
(66.5.5) y1 y2 y3 y4 γ31 γ32 γ33 γ34 =
Criteria which are necessary and sufficient for identification are called “rank
conditions.” There are various equivalent forms for it. We will pick out here one
1422 66. SIMULTANEOUS EQUATIONS SYSTEMS
of these equivalent formulations, that which is preferred by ITPE, and give a recipe
how to apply it. We will give no proofs.
First of all, define the matrix
Γ
(66.5.6) ∆=
B
∆ contains in its ith column the coefficients of the ith structural equation. In our
example if is
γ11 γ12 γ13 0
γ21 γ22 γ23 0
0 0 γ33 0
γ41 0 γ43 γ44
(66.5.7) ∆= β11 β12
β13 β14
0 β22 0 β24
0 0 0 β34
β41 0 β43 β44
Each column of ∆ is subject to a different set of exclusion restrictions, say the ith
column of ∆ is δ i and it satisfies Ri δ i = o. For instance in the first equation (66.2.3)
66.5. IDENTIFICATION 1423
Answer.
" # " # " # " #
0 γ33 0 0
(66.5.11) β22 α1 + 0 α2 + β24 α3 = 0
0 0 β34 0
γ33 α2 = 0, therefore α2 = 0. It also implies γ34 α3 = 0, therefore also α3 = 0. It remains
β22 α1 + β24 α3 = 0, but since we already know α3 = 0 this means that also α1 = 0.
normal, which however also have good properties if this is not the case; see here
[DM93, p. 641]) and estimators based on instrumental variables.
Single-equation estimators are simpler to compute and they are also more robust:
if only one of the equations is mis-specified, then a systems estimator is inconsistent,
but single-equations estimators of the other equations may still be consistent. Sys-
tems estimators are more efficient: they exploit the correlation between the residuals
of the different equations, they allow exclusion restrictions in one equation to benefit
another equation, and they also allow cross-equation restrictions on the parameters
which cannot be handled by single-equations systems.
Maximum likelihood estimation of the whole model (FIML) requires numerical
methods and is a demanding task. We assume X nonrandom, or we condition on
X = X, therefore we write
(66.6.1) Y Γ = XB + E
In (??) we split Y , X, and E into their columns; now we will split them into their
rows:
> > >
y1 x1 ε1
.. .. ..
(66.6.2) . Γ = . B + .
y>
t x>
t ε>
t
1426 66. SIMULTANEOUS EQUATIONS SYSTEMS
If one compares this with the log likelihood function (65.2.5) for simultenous
equations systems, one sees many similarities: the last item is a function of the
residuals, Σ enters in exactly the same way, the only difference is the term t log |det Γ|.
Therefore the next steps here are parallel to our development in Chapter 65. However
one can see already now the following shortcut if the system is a recursive system, i.e.,
if Γ is lower diagonal with 1s in the diagonal. Then det Γ = 1, and in this case one
can just use the formalism developed for seemingy unrelated systems, simply ignoring
the fact that some of the explanatory variables are endogenous, i.e., treating them
in the same way as the exogonous variables.
But now let us go in with the general casae. In order to concentrate out Σ it is
simpler to take the partial derivatives with respect to Σ −1 than those with respect
to Σ itself. Using the matrix differentiation rules (C.1.24) and (C.1.16) and noting
that −t/2 log det Σ = t/2 log det Σ −1 one gets:
t 1
(66.6.4) Σ−1 = Σ − (Y Γ − XB)> (Y Γ − XB)
∂`/∂Σ
2 2
and if one sets this zero one gets Σ̂ = 1t (Y Γ − XB)> (Y Γ − XB). Plugging this in
gives the concentrated log likelihood function log f (Y )c =
mt t 1 mt
=− log(2π) + t log |det Γ| − log det (Y Γ − XB)> (Y Γ − XB) − .
2 2 t 2
1428 66. SIMULTANEOUS EQUATIONS SYSTEMS
This is not just a minimization of the SSE because of the t log |det Γ| term. This
term makes things very complicated, since the information matrix is no longer block
diagonal, see [Ruu00, p. 724] for more detail. One sees here that Simultaneous
Equations is the SUR system of reduced form equations with nonlinear restrictions.
Must be maximized subject to exclusion restrictions; difficult but can be done. Ref-
erences in [DM93, 640]. Since maximization routine will usually not cross the loci
with det Γ = 0, careful selection of the starting value is important.
Here is an alternative derivation of the same result, using (65.3.2):
> −1
where w> s = εs Γ or ws = (Γ−1 )>ε s , therefore ws ∼ N (o, (Γ−1 )>Σ Γ−1 ). Ac-
cording to (65.3.2) the concentrated likelihood function is
mt t
`c = − (1 + log 2π) − log det(Y − XBΓ−1 )> (Y − XBΓ−1 )
2 2
mt t
= − (1 + log 2π) − log det((Γ−1 )> (Y Γ − XB)> (Y Γ − XB)Γ−1 )
2 2
mt t
= − (1 + log 2π) + t log |det Γ| − log det(Y Γ − XB)> (Y Γ − XB)
2 2
This must be maximized subject to the bilinear constraints imposed by the overi-
dentifying restrictions.
Since FIML is so difficult and expensive, researchers often omit specification
tests. [DM93] recommend to make these tests with the unrestricted reduced form.
This is based on the assumption that most of these mis-specifications already show
up on the unrestricted reduced form: serial correlation or heteroskedasticity of the
error terms, test whether parameters change over the sample period.
Another specification test is also a test of the overidentifying restrictions: a LR
test comparing the attained level of the likelihood function of the FIML estimator
with that of the unrestricted reduced form estimator. Twice the difference between
the restricted and unrestricted value of the log likelihood function ∼ χ2 where the
number of degrees of freedom is the number of the overidentifying restrictions.
1430 66. SIMULTANEOUS EQUATIONS SYSTEMS
γmi βKi
Some of the γgi and βhi must be zero, and one of the γgi is 1. Rearrange the columns
of Y and X such that γ1i = 1, and that the zero coefficients come last:
1
β
(66.6.7) y1 Y 2 Y 3 γ = X 1 X 2
+ εi
o
o
Now write the reduced form equations conformably:
π 11 Π12 Π13
(66.6.8) y1 Y 2 Y 3 = X 1 X 2 + v1 V2 V3
π 21 Π22 Π23
Then LIML for the ith equation is maximum likelihood on the following system:
(66.6.9) y1 + Y 2 γ = X 1 β + εi
(66.6.10) Y 2 = X 1 Π12 + X 2 Π22 + V 2
66.6. OTHER ESTIMATION METHODS 1431
I.e., it includes the ith structural equation and the unrestricted reduced form equa-
tions for all the endogenous variables on the righthand side of the ith structural
equation. Written as one partitioned matrix equation:
o>
1 β Π21
(66.6.11) y1 Y2 = X1 X2 + εi V2
γ I o Π22
1 o>
Since Γ = is lower triangular, its determinant is the product of the diagonal
γ I
elements, i.e., it is = 1. Therefore the Jacobian term in the likelihood function is = 1,
and therefore the likelihood function is the same as that of a seemingly unrelated
regression model. One can therefore compute the LIML estimator from (66.6.11)
using the software for seemingly unrelated regressions, disregarding the difference
between endogenous and exogenous variables. But there are other ways to compute
this estimator which are simpler. They will not be discussed here. They either
amount to (1) an eigenvalue problem. or (2) a “least variance ratio” estimator, or (3)
a “k-class” estimator. See [DM93, pp. 645–647]. Although LIML is used less often
than 2SLS, it has certain advantages: (1) it is invariant under reparametrization,
and (2) 2SLS can be severly biased in small samples.
1432 66. SIMULTANEOUS EQUATIONS SYSTEMS
Here the Z i contain endogenous and exogenous variables, therefore OLS is inconsis-
tent. But if we do 2SLS, i.e., if we take Ẑ i = X(X > X)−1 X > Z i as regressors, we
get consistent estimates:
(66.6.13)
X(X > X)−1 X > Z 1
y1 O ··· O δ1
> −1 >
y2 O X(X X) X Z 2 · · · O δ2
.. =
.. .. .. .. ..
. . . . . .
ym O O ··· X(X > X)−1 X > Z m δm
In the case of SUR we know that OLS singly is not efficient, but GLS is. We use
this same method here: (1) estimate σij —not from the residuals in (66.6.13) but as
σ̂ij = 1t ε̂
εiε̂ εi = y i − Z i δ̂ i;2SLS . With this estimated covariance matrix do
εj where ε̂
66.6. OTHER ESTIMATION METHODS 1433
GLS
> −1 >
(66.6.14) vec(B̂)3SLS = Ẑ (Σ̂ ⊗ I)−1 Ẑ Ẑ (Σ̂ ⊗ I)−1 vec(Y ),
which can also be written as
> −1 −1 > −1
(66.6.15) vec(B̂)3SLS = Ẑ (Σ̂ ⊗ I)Z Ẑ (Σ̂ ⊗ I) vec(Y ).
It is instrumental variables with a nonspherical covariance matrix (and can be de-
rived as a GMM estimator). This is much easier to estimate than FIML, but it is
nevertheless asymptotically as good as FIML.
Problem 547.
• a. 6 points Give an overview over the main issues in the estimation of a
simultaneous equations system, and discuss the estimation principles involved.
• b. 2 points How would you test whether a simultaneous equations system is
correctly specified?
CHAPTER 67
Timeseries Analysis
1435
1436 67. TIMESERIES ANALYSIS
I.e., the means do not depend on s, and the covariances only depend on the distances
and not on s. A covariance stationary time series is characterized by the expected
value of each observation µ, the variance of each observation σ 2 , and the “auto-
correlation function” ρk for k ≥ 1 or, alternatively, by µ and the “autocovariance
function” γk for k ≥ 0. The autocovariance and autocorrelation functions are vectors
containing the unique elements of the covariance and correlation matrices.
The simplest time series has all y t ∼ IID(µ, σ 2 ), i.e., all covariances between
different elements are zero. If µ = 0 this is called “white noise.”
A covariance-stationary process y t (t = 1, . . . , n) with expected value µ = E[y i ]
is said to be ergodic for the mean if
n
1X
(67.1.4) plim y t = µ.
n→∞ n
t=1
Problem 548. [Ham94, pp. 46/7] Give a simple example for a stationary
time series process which is not ergodic for the mean.
Answer. White noise plus a mean which is drawn once and for all from a N (0, τ 2 ) independent
of the white noise.
67.1. COVARIANCE STATIONARY TIMESERIES 1437
every MA(1) process could have been generated by a process in which |β| < 1. This
process is called the invertible form or the fundamental representation of the time
series.
Problem 550. What are the implications for estimation of the fact that a MA-
process can have different data-generating processes?
Answer. Besides looking how the timeseries fits the data, the econometrician should also look
whether the disturbances are plausible values in light of the actual history of the process, in order
to ascertain that one is using the right representation.
The fundamental representation of the time series is needed for forecasting. Let
us first look at the simplest situation: the time series at hand is generated by the
process (67.1.5) with |β| < 1, the parameters µ and β are known, and one wants to
forecast y t+1 on the basis of all past and present observations. Clearly, the past and
present has no information about εt+1 , therefore the best we can hope to do is to
forecast y t+1 by µ + βεt .
But do we know εt ? If a time series is generated by an invertible process, then
someone who knows µ, β, and the current and all past values of y can use this to
67.1. COVARIANCE STATIONARY TIMESERIES 1439
reconstruct the value of the current disturbance. One sees this as follows:
(67.1.7) y t = µ + εt + βεt−1
(67.1.8) εt = y t − µ − βεt−1
(67.1.9) εt−1 = y t−1 − µ − βεt−2
(67.1.10) εt = y t − µ − β(y t−1 − µ − βεt−2 )
(67.1.11) = −µ(1 − β) + y t − βy t−1 + β 2 εt−2
εt = −µ 1 − β + β 2 − · · · + (−β)t−1
(67.1.13)
(67.1.14) + y t − βy t−1 + β 2 y t−2 − · · · + (−β)t−1 y 1 + (−β)t ε0
t−1
1 + (−β)t X
(67.1.15) = −µ + (−β)i y t−i + (−β)t ε0
1+β i=0
1440 67. TIMESERIES ANALYSIS
If |β| < 1, the last term of the right hand side, which depends on the unobservable
ε0 , becomes less and less important. Therefore, if µ and β are known, and all past
values of y t are known, this is enough information to compute the value of the
present disturbance εt . Equation (67.1.15) can be considered the “inversion” of the
MA1-process, i.e., its representation as an infinite autoregressive process.
The disturbance in the invertible process is called the “fundamental innova-
tion” because every y t is composed of a part which is determined by the history
y t−1 , y t−2 , . . . plus εt which is new to the present period.
The invertible representation can therefore be used for forecasting: the best
predictor of y t+1 is µ + βεt .
Even if a time series was actually generated by a non-invertible process, the
formula based on the invertible process is still the best formula for prediction, but
now it must be given a different interpretation.
All this can be generalized for higher order MA processes. [Ham94, pp. 64–68]
says: for any noninvertible MA process (which is not borderline in the sense that
|β| = 1) there is an invertible MA process which has same means, variances, and
autocorrelations. It is called the “fundamental representation” of this process.
The fundamental representation of a process is the one which leads to very sim-
ple equations for forecasting. It used to be a matter of course to assume at the
67.1. COVARIANCE STATIONARY TIMESERIES 1441
same time that also the true process which generated the timeseries must be an in-
vertible process, although the reasons given to justify this assumption were usually
vague. The classic monograph [BJ76, p. 51] says, for instance: “The requirement
of invertibility is needed if we are interested in associating present events with past
happenings in a sensible manner.” [Dea92, p. 85] justifies the requirement of in-
vertibility as follows: “Without [invertibility] the consumer would have no way of
calculating the innovation from current and past values of income.”
But recently it has been discovered that certain economic models naturally lead
to non-invertible data generating processes, see problem 552. This is a process in
which the economic agents observe and act upon information which the econometri-
cian cannot observe.
If one goes over to infinite MA processes, then one gets all indeterministic sta-
tionary processes. According to the so-called Wold decomposition, every stationary
process can be represented as a (possibly infinite) moving average process plus a
“linearly deterministic” term, i.e., a term which can be linearly predicted without
error from its past. There is consensus that economic time series do not contain such
linearly deterministic terms.
The errors in the infinite Moving Average representation also have to do with
prediction: can be considered the errors in the best one-step ahead linear prediction
based on the infinite past [Rei93, p. 7].
1442 67. TIMESERIES ANALYSIS
A stationary process without a linear deterministic term has therfore the form
X∞
(67.1.16) yt = µ + ψj εt−j
j=0
where the timeseries εs is white noise, and B is the backshift operator satisfying
e> >
t B = et−1 (here et is the tth unit vector which picks out the tth element of the
time series). P 2
P The coefficients satiisfy ψi < ∞, and if they satisfy the stronger condition
|ψi | < ∞, then the process is called causal.
Problem 551. Show that without loss of generality ψ0 = 1 in (67.1.16).
Answer. If say ψk is the first nonzero ψ, then simply write η j = ψk ε j+k
Dually, one can also represent Pp each fully indeterministic stationary processs as
an infinite AR-process y t − µ + j=1 φi (y t−i − µ) = εt . This representation is called
P
invertible if it satisfies |θi | < ∞.
67.1. COVARIANCE STATIONARY TIMESERIES 1443
67.1.2. The Box Jenkins Approach. Now assume that the operator Ψ(B) =
P∞ j −1
j=0 ψj B can be written as the product Ψ = Φ Θ where each Φ and Θ are finite
polynomials in B. Again, without loss of generality, the leading coefficients in Ψ
and Θ can be assumed to be = 1. Then the time series can be written
Xp ∞
X
(67.1.18) yt − µ + φi (y t−i − µ) = εt + θj εt−j
j=1 j=1
67.1.3. Moving Average Processes. In order to see what order a finite mov-
ing average process is, one should look at the correlation coefficients. If the order is j,
then the theoretical correlation coefficients are zero for all values > j, and therefore
the estimates of these correlation coefficients, which have the form
Pn
(y − ȳ)(y t−k − ȳ)
(67.1.19) rk = t=k+1 Pn t 2
t=1 (y t − ȳ)
must be insignificant.
For estimation the preferred estimate is the maximum likelihood estimate. It
can not be represented in closed form, therefore we have to rely on numerical maxi-
mization procedures.
(67.1.20) y t = αy t−1 + εt
This process generates a stationary timeseries only if |α| < 1. Proof: var[y t ] =
var[y t−1 ] means var[y t ] = α2 var[y t ] + σ 2 and therefore var[y t ](1 − α2 ) = σ 2 , and
since σ 2 > 0 by assumption, it follows that 1 − α2 > 0.
Solution (i.e., Wold representation as a MA process) is
(67.1.21) y t = y 0 αt + (εt + αεt−1 + · · · + αt−1 ε1 )
As proof that this is a solution, write down αy t−1 and check that it is equal to y t −εt .
67.1.5. Difference Equations. Let’s make here a digression about nth order
linear difference equations with constant coefficients. Definition from [End95, p. 8]:
n
X
(67.1.22) y t = α0 + αi y t−i + xt
i=1
(3) Then the general solution is the sum of the particular solution and an arbi-
trary linear combination of all homogeneous solutions.
(4) Eliminate the arbitrary constant(s) by imposing the initial condition(s) on
the general solution.
Let us apply this to y t = αy t−1 + εt . The homogeneous equation is y t = αy t−1
and this has the general solution y t = βαt where β is an arbitrary
P∞ constant. If the
timeseries goes back to −∞, the particular solution is y t = i=0 αi εt−i , but if the
Pt−1
timeseries only exists for t ≥ 1 the particular solution is y t = i=0 αi εt−i . This
gives solution (67.1.21).
Now let us look at a second order process: y t = α1 y t−1 + α2 y t−2 + xt . In order
to get solutions of the homogeneous equation y t = α1 y t−1 + α2 y t−2 try y t = βγ t .
This gives the following condition for γ: γ t = α1 γ t−1 + α2 γ t−2 or γ 2 − α1 γ + α2 = 0.
The solution of this quadratic equation is
p
α1 ± α12 + 4α2
(67.1.23) γ=
2
If this equation has two real roots, then everything is fine. If it has only one real
root, i.e., if α2 = −α12 /4, then γ = α1 /2, i.e., y t = β1 (α1 /2)t is one solution. But
there is also a second solution, which is not obvious: y t = β2 t(α1 /2)t is a solution as
67.1. COVARIANCE STATIONARY TIMESERIES 1447
(67.1.25) y t = β1 rt cos(θt + β2 )
√
where r = −α2 and θ is defined by cos(θ) = α1 /2r. This formula is from [End95,
p. 29], and more explanations can be found there.
But in all these cases the roots of the characteristic equations determine the
character of the homogeneous solution. They also determine whether the difference
equation is stable, i.e., whether the homogeneous solutions die out over time or not.
For stability, all roots must lie in the unit circle.
In terms of the coefficients themselves, these stability conditions are much more
complicated. See [End95, pp. 31–33].
These stability conditions are also important for stochastic difference equations:
in order to have stationary solutions, it must be stable.
1448 67. TIMESERIES ANALYSIS
It is easy to estimate AR processes: simply regress the time series on its lags.
But before one can do this estimation one has to know the order of the autoregressive
process. A useful tool for this are the partial autocorrelation coefficients.
We discussed partial correlation coefficients in chapter 19. The kth partial auto-
correlation coefficient is the correlation between y t and y t−k with the influence of the
invervening lags partialled out. The kth sample partial autocorrelation coefficient is
the last coefficient in the regression of the timeseries on its first k lags. It is the effect
which the kth lag has which cannot be explained by earlier lagged values. In an
autoregressive process of order k, the “theoretical” partial autocorrelations are zero
for lags greater than k, therefore the estimated partial autocorrelation coefficients
should be insignificant for those lags. The asymptotic distribution of these estimates √
is normal√with zero mean and variance 1/T , therefore one often finds lines at 2/ T
and −2/ T in the plot of the estimated partial autocorrelation coefficients, which
give an indication which values are significant at the 95% level and which are not.
67.1.6. ARMA(p,q) Processes. Sometimes it is appropriate to estimate a
stationary process as having both autoregressive and moving average components
(ARMA) or, if they are not stationary, they may be autoregressive or moving average
after differencing them one or several times (ARIMA).
An ARM A(p, q) process is the solution of a pth order difference equation with
a M A(q) as driving process.
67.1. COVARIANCE STATIONARY TIMESERIES 1449
These models have been very successful. On the one hand, there is reason
to believe on theoretical grounds that many economic timeseries are ARM A(p, q).
[Gra89, p. 64] cites an interesting theorem which also contributes to the usefulness of
ARM A processes: the sum of two independent series, one of which is ARM A(p1 , q1 )
and the other ARM A(p2 , q2 ), is ARM A p1 + p2 , max(p1 + q2 , p2 + q1 ) .
Box and Jenkins recommend to use the autocorrelations and partial autocor-
relations for determining the order of the autoregressive or moving average parts,
although this more difficult for an ARMA process than for an MA or AR process.
The last step after what in the time series context is called “identification” (a
more generally used term might be “specification” or “model selection”) and estima-
tion is diagnostic checking, i.e., a check whether the results bear out the assumptions
made by the model. Such diagnostic checks are necessary because mis-specification
is possible if one follows this procedure. One way would be to see whether the resid-
uals resemble a white noise process, by looking at the autocorrelation coefficients of
the residuals. The so-called portmanteau test statistics test whether a given series is
white noise: there is either the Box-Pierce statistic which is the sum of the squared
sample autocorrelations
p
X
(67.1.26) Q=T rk2
k=1
1450 67. TIMESERIES ANALYSIS
which is asymptotically the same as the Box-Pierce statistic but seems to have better
small-sample properties.
A second way to check the model is to overfit the model and see if the additional
coefficients are zero. A third way would be to use the model for forecasting and to
see whether important features of the original timeseries are captured (whether it
can forecast turning points, etc.)
[Gre97, 839–841] gives an example. Eyeballing the timeseries does not give the
impression that it is a stationary process, but the statistics seem to suggest an AR-2
process.
series. Therefore we will first take a look at multivariate time series in general. A
good source here is [Rei93].
Covariance stationarity of multivariate time series is the obvious extension of the
univariate definition (67.1.1)–(67.1.2):
(67.2.5) E [y t ] = µ
(67.2.6) var[y mt ] < ∞
(67.2.7) C [y t , y t−h ] only depends on h.
One can write a VAR(j) process as
(67.2.8) y> > > > >
t = µ + (y t−1 − µ) Θ1 + · · · + (y t−n − µ) Θn + ε t
or equivalently
n
X
(67.2.9) (y t − µ)> − (y t−j − µ)> Θj = ε >
t
j=1
where Θ0 is lower diagonal and the covariance matrix of the disturbances is di-
agonal. For each permutation of the variables there is a unique lower diagonal Θ0
which makes the covariance matrix of the disturbances the identity matrix, here prior
knowledge about the order in which the variables depend on each other is necessary.
But if one has a representation like this, one can build an impulse response function.
Condition for a VAR(n) process to be stationary is, using (67.2.9):
(67.2.11) det[I − Θ1 z − Θ2 z 2 − · · · − Θn z n ]
has all its roots outside the unit circle. These are the same conditions as the stability
conditions.
Under general conditions, all stationary vector time series are V AR(P ) of a
possibly infinite degree.
Estimation: the reduced form is like a disturbance-related equation system with
all explanatory variables the same: therefore OLS is consistent, efficient, and asymp-
totically normal. But OLS is insensitive, since one has so many parameters to esti-
mate. Therefore one may introduce restrictions, not all lagged variables appear in
all equations, or one can use Bayesian methods (Minnesota prior, see [BLR99, pp.
269–72]).
1454 67. TIMESERIES ANALYSIS
Instead of using theory and prior knowledge to determine the number of lags,
we use statistical criteria. Minimize an adaptation of Akaike’s AIC criterion
2M 2 n
(67.2.12) AIC(n) = log det(Σ̃n ) +
T
M 2 n log T
(67.2.13) SC(n) = log det(Σ̃n ) +
T
where M = number of variables in the system, T = sample size, n = number of lags
ε̂> ε̂
included, and Σ̃ has elements σ̃ ij = iT j
Again, diagnostic checks necessary because mis-specification is possible.
What to do with the estimation once it is finished? (1) forecasting really easy,
the AR-framework gives natural forecasts. One-step ahead forecasts by simply using
present and past values of the timeseries and setting the future innovations zero, and
in order to get forecasts more than one step ahead, use the one-step etc. forecasts
for those date which have not yet been observed.
67.2.1. Granger Causality. Granger causality tests are tests whether cer-
tain autoregressive coefficients are zero. It makes more sense to speak of Granger-
noncausality: the time series x fails to Granger-cause y if y can be predicted as
well from its own past as from the past of x and y. An equivalent expression is:
in a regression of y t on its own lagged values y t−1 , y t−1 , . . . and the lagged values
67.2. VECTOR AUTOREGRESSIVE PROCESSES 1455
xt−1 , xt−2 , . . ., the coefficients of xt−1 , xt−2 , . . . are not significantly different from
zero.
Alternative test proposed by Sims: x fails to Granger-cause y if in a regression
of y t on lagged, current, and future xq , the coefficients of the future xq are zero.
I have this from [Mad88, 329/30]. Leamer says that this should be called
precedence, not causality, because all we are testing is precedence. I disagree; these
tests do have implications on whether the researcher would want to draw causal
inferences from his or her data, and the discussion of causality should be included in
statistics textbooks.
Innovation accounting or impulse response functions: make a moving average
representation, and then you can pick the timepath of the innovations: perhaps a
1-period shock, or a stepped increase, whatever is of economic interest. Then you
can see how these shocks are propagated through the system.
Caveats:
(1) do not make these experiments too dissimilar to what actually transpired in
the data from which the parameters were estimated.
(2) Innovations are correlated, and if you increase one without increasing another
which is highly correlated with it then you may get misleading results.
1456 67. TIMESERIES ANALYSIS
Way out would be: transform the innovations in such a way that their esti-
mated covariance matrix is diagonal, and only experiment with these diagonalized
innovations. But there are more than one way to do this.
If one has the variables ordered in a halfways sensible way, then one could use
the Cholesky decomposition, which diagonalizes this ordering of the variables.
Other approaches: forecast error (MSE) can be decomposed into a sum of contri-
butions coming from the different innovations: but this decomposition is not unique!
Then the MA-representation is the answer to: how can one make policy recom-
mendations with such a framework.
Here is an example how an economic model can lead to a non-invertible VARMA
process. It is from [AG97, p. 119], originally in [Qua90] and [BQ89]. Income at
time t is the sum of a permanent and a transitory component y t = y p t + y t t ; the
permanent follows a random walk y p t = y p t−1 + δ t while the transitory income is
white noise, i.e., y t t = εt . var[εt ] = var[δ t ] = σ 2 , and all disturbances are mutually
independent. Consumers know which part of their income is transitory and which
part is permanent; they have this information because they know their own par-
ticular circumstances, but this kind of information is not directly available to the
econometrician. Consumers act on their privileged information: their increase in
consumption is all of their increase in permanent income plus fraction β < 1 of their
67.2. VECTOR AUTOREGRESSIVE PROCESSES 1457
transitory income ct − ct−1 = δ t + βεt . One can combine all this into
(67.2.14) y t − y t−1 = δ t + εt − εt−1 δ i ∼ (0, σ 2 )
(67.2.15) ct − ct−1 = δ t + βεt εi ∼ (0, σ 2 )
This is a vector-moving-average process for the first differences
y t − y t−1 1 1 − L δt
(67.2.16) =
ct − ct−1 1 β εt
but it is not invertible. In other words, the econometrician cannot consistently esti-
mate the values of the present disturances from the past of this timeseries. who only
sees the timepaths of income and consumption, cannot reconstruct from this these
data the information which the agents themselves used to make their consumption
decision.
There is an invertible data generating process too, but it has the coefficients
y t − y t−1 1 1 − (1 − β)L 1 + β − βL ξ t
(67.2.17) =p
ct − ct−1 1 + β2 0 1 + β2 ζt
If the econometrician uses an estimation method which automatically generates the
invertible representation, he will get the wrong answer. He will think that the shocks
which have a permanent impact on y also have a delayed effect in the opposite
direction on next year’s income, but have no effect on consumption; and that the
1458 67. TIMESERIES ANALYSIS
shocks affecting consumption this period also have an effect on this period’s income
and an opposite effect on next period’s income. This is a quite different scenario,
and in many respects the opposite scenario, than that in equation (67.2.16).
Problem 552. It is the purpose of this question to show that the following two
vector moving averages are empirically indistinguishable:
ut 1 1 − L δt
(67.2.18) =
vt 1 β εt
and
ut 1 1 − (1 − β)L 1 + β − βL ξ t
(67.2.19) =p
vt 1 + β2 0 1 + β2 ζt
where all error terms δ, ε, ξ, and ζ are independent with equal variances σ 2 .
• b. Show also that the first representation has characteristic root 1 − β, and the
1
second has characteristic root 1−β . I.e., with β < 1, the first is not invertible but the
second is.
Answer. Replace the Lag operator L by the complex variable z, and compute the determinant:
1 1−z
(67.2.21) det = β − (1 − z)
1 β
setting this determinant zero gives z = 1 − β, i.e., the first representation has a root within the unit
circle, therefore it is not invertible. For the second representation we get
1 − (1 − β)z 1 + β − βz
(67.2.22) det = (1 − (1 − β)z)(1 + β 2 )
0 1 + β2
1
Setting this zero gives 1 − (1 − β)z = 0 or z = 1−β
, which is outside the unit circle. Therefore this
representation is invertible.
1460 67. TIMESERIES ANALYSIS
[AG97, p. 119] writes: “When the agents’ information set and the econome-
tricians’ information set do coincide, then the MA representation is fundamental.”
Non-fundamental representations for the observed variables are called-for when the
theoretical framework postulates that agents observe variables that the econometri-
cian cannot observe.
(67.3.1) y t = a0 + a1 t + a2 t2 + · · · + an tn + εt
67.3. NONSTATIONARY PROCESSES 1461
But it may also be the case that there is a stochastic trend. To study this look at
the random walk:
(67.3.2) y t = y t−1 + εt
I.e., the effects of the disturbances do not die out but they are permanent. The
MA-representation of this series is
t
X
(67.3.3) yt = y0 + εi
i=1
n-step-ahead forecasts at time t are y t .
Problem 553. Show that in a random walk process (67.3.3) (with y 0 non-
stochastic) var[y t ] = tσ 2 (i.e., it is nonstationary), cov[y t , y t−h ] = σ 2 (t − h), and
p
corr[y t , y t−h ] = (t − h)/t.
Answer. [End95, p. 168]: cov[y t , y t−h ] = cov[e1 + · · · + et , e1 + · · · + et−h . corr[y t , y t−h ] =
σ 2 (t−h)
√ √ .
σ 2 t σ 2 (t−h)
The significance of this last formula is: the autocorrelation functions of a non-
stationary random walk look similar to those of an autoregressive stationary process.
Then Enders discusses some variations: random walk plus drift y t = y t−1 +µ+εt
which is Enders’s (3.36), random walk plus noise y t = µt + ηt where µt = µt−1 + εt
1462 67. TIMESERIES ANALYSIS
with ηt and εt independent white noise processes is Enders’s (3.38–39). Both can be
combined in (3.41), and the so-called local linear trend model (3.45).
How to remove the trend? Random walk (with or without drift) is ARIMA(0,1,0),
i.e., its first difference is a constant plus white noise.
The random walk with noise (or with drift and noise) is ARIMA(0,1,1):
Problem 554. 3 points Show: If you difference a random walk with noise pro-
cess, you get a MA(1) process with a correlation that is between 0 and −1/2.
Answer. Let y t be a random walk with noise, i.e., y t = µt + ηt where µt = µt−1 + εt with
ηt and εt independent white noise processes. Since ∆µt = εt , it follows ∆y t = εt + ηt − ηt−1 .
Stationary. var[∆y t ] = σε2 + 2ση2 . cov[∆y t , ∆y t−1 ] = cov[εt + ηt − ηt−1 , εt−1 + ηt−1 − ηt−2 ] = −ση2 .
corr[∆y t , ∆y t−1 ] = −ση2 /(σε2 + 2ση2 ) between −1/2 and 0. Higher covariances are zero.
The local linear trend model is an example of a model which leads to a stationary
process after differencing twice: it is an ARIMA(0,2,2) model.
I.e., certain time series are such that differencing is the right thing to do. But
if a time series is the sum of a deterministic trend and white noise then differencing
is not called for: From y t = y 0 + αt + εt follows ∆y t = α + εt − εt−1 . This is not
an invertible process. The appropriate method of detrending here is to regress the
timeseries on t and take the residuals.
67.3. NONSTATIONARY PROCESSES 1463
diagram, which suggests a significant relationship (but the plot of the residuals shows
nonstationarity).
There are two ways around this, both connected with the names Dickey and
Fuller (DF-tests): Either one maintains the usual formula for the t-statistic but
different significance points which were obtained by Monte-Carlo experiments. These
are the so-called τ -tests. Based on the three above regressions [DM93, p. 703] calls
them τnc , τc , and τct (like: no constant, constant, and constant and√ trend). Or one
uses a different test statistic in which one divides by T instead of T (and it turns
out that one does not have to divide by the estimated standard deviation). This is
the so-called z-statistic.
67.4. Cointegration
Two timeseries y 0 and y 1 which are I(1) are called co-integrated if there is a
linear combination of them which is I(0). What this means is especially obvious
if this linear combination is their difference, see the graphs in [CD97, pp, 123/4].
Usually in economic applications this linear combination also depends on exogenous
variables; then the definition is that η1 y 1 + η2 y 2 = Xβ + ε whith a stationary ε .
These coefficients are determined only up to a multiplicative constant, therefore one
can normalize them by setting say η1 = 1.
67.4. COINTEGRATION 1465
• b. 2 points y depends on its own lagged values and some other explanatory
variables.
1466 67. TIMESERIES ANALYSIS
• c. 2 points The error terms of different time periods are correlated (but, for
simplicity, they are assumed to form a stationary process).
• d. 2 points What should be considered when more than one of the above situ-
ations occur?
CHAPTER 68
Seasonal Adjustment
Seasonal adjustment has been criticized in [Hyl92, p. 231] on the grounds that
it cannot be explained what the adjusted series is measuring. Signal extraction in
electrical engineering has the goal to restore the original signal which actually existed
before it was degraded by noise. But there is no actually existing “original signal”
which the de-seasonalized economic timeseries tries to measure. Someone concluded
from this, I have to find the quote and the exact wording again, “These adjusted
timeseries must be considered uncomplicated aids in decision making, without a real
counterpart.” Look at [Hyl92, p. 102]. Here it is necessary to make the depth-realist
distinction between the real and the actual. It is true that seasonally adjusted data
have no actual counterpart; they are counterfactual, but they do have a real basis,
namely, the underlying economic mechanisms which would also have been active in
the absence of the seasonal factors.
Natural scientists can investigate their subject under controlled experimental
conditions, shielded from non-essential influences. Economists cannot do this; they
cannot run the economy inside a building in which all seasonal variations of weather
and scheduling are eliminated, in order to see how the economy evolves in the ab-
sence of these disturbances and therefore to understand the character of economic
mechanisms better.
Seasonal adjustment of the data is an imperfect substitute for this. It exploits
the fact that phenomena which are generated by seasonal factors have a different
68. SEASONAL ADJUSTMENT 1469
empirical footprint than those generated by other factors, namely, their periodicity,
which is a very obvious feature of most economic timeseries. The removal of the
periodicity from the data is their attempt to infer what the economy would have
been like in the absence of the seasonal influences.
This is in principle no different than some of the methods by which we try to
eliminate other non-economic influences from the data: many statistical methods
make the the assumption that fast variations in the data are the result of random,
i.e., non-economic influences, because the economy does not move this fast.
These limitations of seasonal adjustment point to a basic methodological flaw of
this research method. The attempt to take data generated by an economy which is
subject to seasonal influences and submit them to a mathematical procedure in order
to see how the economy would have evolved in the absence of these influences really
commits the “fallacy of misplaced concreteness” [Col89, pp. 27?, 52?]: if two different
mechanism are at work, this does not mean that the events generated by them can
be divided into two groups, or that the data generated by these mechanisms can be
decomposed into two components: that generated by the first mechanism, and that
generated by the second. This is why it is recommended so often that the seasonality
should be incorporated in the model instead of adjusting the data. (In some simple
cases, as in Problem 559, these two procedures are equivalent, but usually these two
methods give different results.)
1470 68. SEASONAL ADJUSTMENT
Furthermore, [Hyl92, chapter 6, need reference by author and title] shows by a theo-
retical model that the empirical expressions of seasonal influences do not necessarily
move in synch with the seasons: optimal adjustment to seasonal demand leads to a
seasonally-induced component of economic activity which has its power not restricted
to the seasonal frequencies
Miron [Mir96, pp, 57–66] does not look at the frequency but at the amplitude
of the seasonal variations. He argues that the seasonal variations observed in the
economy are much stronger than the magnitude of the above external influences
might justify. He concludes that there is a “seasonal business cycle” which shares
many characteristics with the usual business cycle.
Despite these flaws, seasonal adjustment can have its uses.
68. SEASONAL ADJUSTMENT 1471
[BF91] apply various adjustment mechanisms to real (and one simulated) time-
series and ask whether the results have desirable properties. For instance, the X11-
procedure was apparently adopted because it gave good results on many different
time series. This shows that seasonal adjustment methods are selected not so much
on the basis of prior theory, but on the basis of what works.
components as follows:
1
1 X 0
(68.1.2) si = s0i − 2s
12 j−1 j
Problem 556. How would you modify this method at the ends of the sampling
period?
Here a large α1 gives a smoother series, a small α1 gives smaller residuals. A large
α2 gives a series whose seasonal behavior is fixed over time, while a small α2 gives a
more flexible seasonal pattern.
Problem 557. Show that vt is indeed the third difference.
Answer. Define the first differences dt = gt − gt−1 , the second differences et = dt − dt−1 ,
and the third difference ft = et − et−1 . Then you will see that ft = vt . Let’s go through this:
et = gt − gt−1 − (gt−1 − gt−2 ) = gt − 2gt−1 + gt−2 . Therefore ft = gt − 2gt−1 + gt−2 − (gt−1 −
2gt−2 + gt−3 ) = gt − 3gt−1 + 3gt−2 − gt−3 = vt .
BTW the smooth component gt is not being produced in the USA, but in Europe
[ESS97, p. 98], this book has a few articles arguing that the smooth component is
good for policy etc.
your model as
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
(68.2.1) y = Xβ + Cδ + ε , C = 1 0 0 0 .
0 1 0 0
0 0 1 0
0 0 0 1
.. .. .. ..
. . . .
1476 68. SEASONAL ADJUSTMENT
1 0 0 0
1 1 0 0
1 0 1 0
1 0 0 1
(68.2.2) y = ια + Xβ + Kδ + ε , ι = 1 K = 0 0 0
1 1 0 0
1 0 1 0
1 0 0 1
.. .. .. ..
. . . .
In R this is the default method to generate dummy variables from a seasonal factor
variable. (Splus has a different default.) This is also the procedure shown in [Gre97,
68.2. SEASONAL DUMMIES IN A REGRESSION 1477
In R one gets these dummy variables from a seasonal factor variable if one specifies
contrast="contr.sum".
3 points What is the meaning of the seasonal dummies δ1 , δ2 , δ3 , and of the
constant term α or the fourth seasonal dummy δ4 , in models (68.2.1), (68.2.2), and
(68.2.3)?
Answer. Clearly, in model (68.2.1), δi is the intercept in the ith season. For (68.2.2) and
(68.2.3), it is best to write the regression equation for each season separately, filling in the values
the dummies take for these seasons, in order to see the meaning of these dummies. Assuming X
1478 68. SEASONAL ADJUSTMENT
y1 1 x1 0 0 0 ε1
y2 1 x2 1 0 0 ε2
y3 1 x3 0 1 0 ε3
y4 1 x4 0 0 1 "δ # ε4
y 1 1
5 = α + 5 β + 0
x 0 0 δ + ε5
y6 1 x6 1 2
0 0 δ3
ε6
y7 1 x7 0 1 0 ε7
y 1 x 0 0 1
ε
8 8 8
.. .. .. .. .. .. ..
. . . . . . .
y 1 = 1 · α + x1 · β + 0 · δ1 + 0 · δ2 + 0 · δ3 + ε1 winter
y 2 = 1 · α + x2 · β + 1 · δ1 + 0 · δ2 + 0 · δ3 + ε2 spring
y 3 = 1 · α + x3 · β + 0 · δ1 + 1 · δ2 + 0 · δ3 + ε3 summer
y 4 = 1 · α + x4 · β + 0 · δ1 + 0 · δ2 + 1 · δ3 + ε4 autumn
therefore the overall intercept α is the intercept of the first quarter (winter); δ1 is the difference
between the spring intercept and the winter intercept, etc.
68.2. SEASONAL DUMMIES IN A REGRESSION 1479
(68.2.3) becomes
y1 1 x1 1 0 0 ε1
y2 1 x2 0 1 0 ε2
y3 1 x3 0 0 1 ε3
y4 1 x4 −1 −1 −1 "δ # ε4
y 1 1
5 = α + 5 β + 1
x 0 0 δ + ε5
y6 1 x6 0 2
1 0 δ3
ε6
y7 1 x7 0 0 1 ε7
y 1 x −1 −1 −1
ε
8 8 8
.. .. .. .. .. .. ..
. . . . . . .
y 1 = 1 · α + x1 · β + 1 · δ1 + 0 · δ2 + 0 · δ3 + ε1 winter
y 2 = 1 · α + x2 · β + 0 · δ1 + 1 · δ2 + 0 · δ3 + ε2 spring
y 3 = 1 · α + x3 · β + 0 · δ1 + 0 · δ2 + 1 · δ3 + ε3 summer
y 4 = 1 · α + x4 · β − 1 · δ1 − 1 · δ2 − 1 · δ3 + ε4 autumn
Here the winter intercept is α + δ1 , the spring intercept α + δ2 , summer α + δ3 , and autmumn
α − δ1 − δ2 − δ3 . Summing this and dividing by 4 shows that the constant term α is the arithmetic
mean of all intercepts, therefore δ1 is the difference between the winter intercept and the arithmetic
mean of all intercepts, etc.
1480 68. SEASONAL ADJUSTMENT
Problem 559. [DM93, pp. 23/4], [JGH+ 85, p. 260]. Your dependent variable
y and the explanatory variables X are quarterly timeseries data. Your regression
includes a constant term (not included in X). We also assume that your data set
spans m full years, i.e., the number of observations is 4m. The purpose of this
exercise is to show that the following two procedures are equivalent:
• a. 1 point You create a “seasonally adjusted” version of your data set, call
them y and X, by taking the seasonal mean out of every variable and adding the
overall mean back, and you regress y on X with a constant term. (The under-
lining does not denote taking out of the mean, but the taking out of the seasonal
means and adding back of the overall mean.) In the simple example where y =
>
1 3 8 4 5 3 2 6 , compute y. Hint: the solution vector contains the
numbers 7,3,6,4 in sequence.
Answer. Subtract the seasonal means, and add back the overall mean to get:
1 3 4 2
3 3 4 4
8 5 4 7
(68.2.4)
4 − 5 + 4 = 3
5 3 4 6
3 3 4 4
68.2. SEASONAL DUMMIES IN A REGRESSION 1481
• b. 2 points The alternative equivalent procedure is: You use the original data
y and X but you add three seasonal dummies to your regression, i.e., you write your
model in the form
1 0 0
0 1 0
0 0 1
−1 −1 −1
(68.2.5) y = ια + Xβ + Kδ + ε , K= 1 0 0
0 1 0
0 0 1
−1 −1 −1
.. .. ..
. . .
3 −1 −1
1 1 1
(68.2.6) (K > K)−1 = −1 3 −1 = (I − ιι> )
4m m 4
−1 −1 3
where I is the 3 × 3 identity matrix, and ι is a 3-vector of ones. Hint: this can
>
easily
be computed element
by element,
but
the most elegant way is to write K =
I −ι . . . I −ι where the I −ι group is repeated m times.
" #
2 1 1
(68.2.7) K>K = m 1 2 1 = m(I + ιι> ).
1 1 2
If it is written in the second way one can apply formula (31.2.6) to get the inverse. Of course, in
the present case, the inverse is already given, therefore one can simply multiply the matrix with its
inverse to verify that it is indeed the inverse.
68.2. SEASONAL DUMMIES IN A REGRESSION 1483
• e. 2 points Using the above equations, show that the OLS estimate β̂ in this
model is exactly the same as the OLS estimate in the regression of the seasonally
adjusted data y on X. Hint: All you have to show is that M 1 y = y, and M 1 X = X,
where M 1 = I − K(K > K)−1 K > .
68.2. SEASONAL DUMMIES IN A REGRESSION 1485
1487
1488 69. BINARY CHOICE MODELS
i.e., one obtains the step by regressing A−1 u on X with weighting matrix A.
69.2. BINARY DEPENDENT VARIABLE 1489
quickly and then the main appeal of OLS, its simplicity, is lost. This is a wrong-
headed approach, and any smart ideas which one may get when going down this road
are simply wasted.
The right way to do this is to set πi = E[y i ] = Pr[y i = 1] = h(x>
i β) where h is
some (necessarily nonlinear) function with values between 0 and 1.
p
Answer. exp y = 1−p , now multiply by 1 − p to get exp y − p exp y = p, collect terms
exp y = p(1 + exp y), now divide by 1 + exp y.
69.2. BINARY DEPENDENT VARIABLE 1491
Problem 561. Sometimes one finds the following alternative specification of the
>
logit model: πi = 1/(1+exi β ). What is the difference between it and our formulation
of the logit model? Are these two formulations equivalent?
Answer. It is simply a different parametrization. They get this because they come from index
number problem.
The logit function is also the canonical link function for the binomial distribution,
see Problem 113.
69.2.2. Probit Model. An important class of functions with values between 0
and 1 is the class of cumulative probability distribution functions. If h is a cumulative
distribution function, then one can give this specification an interesting interpretation
in terms of an unobserved “index variable.”
The index variable model specifies: there is a variable z i with the property that
y i = 1 if and only if z i > 0. For instance, the decision y i whether or not individual
i moves to a different location can be modeled by the calculation whether the net
benefit of moving, i.e., the wage differential minus the cost of relocation and finding
a new job, is positive or not. This moving example is worked out, with references,
in [Gre93, pp. 642/3].
The value of the variable z i is not observed, one only observes y i , i.e., the only
thing one knows about the value of z i is whether it is positive or not. But it is assumed
1492 69. BINARY CHOICE MODELS
that z i is the sum of a deterministic part which is specific to the individual and a
random part which has the same distribution for all individuals and is stochastically
independent between different individuals. The deterministic part specific to the
individual is assumed to depend linearly on individual i’s values of the covariates,
with coefficients which are common to all individuals. In other words, z i = x> i β + εi ,
where the εi are i.i.d. with cumulative distribution function Fε . Then it follows πi =
Pr[y i = 1] = Pr[z i > 0] = Pr[εi > −x> > >
i β] = 1 − Pr[εi ≤ −xi β] = 1 − Fε (−xi β).
I.e., in this case, h(η) = 1 − Fε (−η). If the distribution of εi is symmetric and has a
density, then one gets the simpler formula h(η) = Fε (η).
Which cumulative distribution function should be chosen?
• In practice, the probit model, in which z i is normal, is the only one used.
• The linear model, in which h is the line segment from (a, 0) to (b, 1), can also
be considered generated by an in index function z i which is here uniformly
distributed.
• An alternative possible specification with the Cauchy distribution is pro-
posed in [DM93, p. 516]. They say that curiously only logit and probit are
being used.
In practice, the probit model is very similar to the logit model, once one has rescaled
the variables to make the variances equal, but the logit model is easier to handle
mathematically.
69.2. BINARY DEPENDENT VARIABLE 1493
∂L y
i mi − y i
(69.2.4) ui = = − πi (1 − πi ) = y i − mi πi .
∂ηi πi 1 − πi
These are the elements of u in (69.1.1), and they have a very simple meaning: it is
just the observations minus their expected values. Therefore one obtains immediately
A = E [uu> ] is a diagonal matrix with mi πi (1 − πi ) in the diagonal.
Problem 562. 6 points Show that for the maximization of the likelihood func-
tion of the logit model, Fisher’s scoring method is equivalent to the Newton-Raphson
algorithm.
69.3. THE GENERALIZED LINEAR MODEL 1495
P P
Problem 563. Show that in the logistic model, mi π̂i = yi .
(69.3.1) g(µ) = Xβ
1496 69. BINARY CHOICE MODELS
where g() is some known monotonic function which acts pointwise on µ. Typically
g() is used to transform the µi to a scale on which they are unconstrained.
For
example we might use g(µ) = log(µ) if µi > 0 or g(µ) = log µ/(1 − µ) if 0 < µi < 1.
The same reasons which force us to abandon the linear model also force us to
abandon the assumption of normality. If y is bounded then the variance of y must
depend on its mean. Specifically if µ is close to a boundary for y then var(y) must be
small. For example, if y > 0, then we must have var(y) → 0 as µ → 0. For this reason
strictly positive data almost always shows increasing variability with increased size.
If 0 < y < 1, then var(y) → 0 as µ → 0 or µ → 1. For this reason, generalized linear
models assume that
(69.3.2) var(y i ) = φ · V (µi )
where φ is an unknown scale factor and V () is some known variance function appro-
priate for the data at hand.
We therefore estimate the nonlinear regression equation (69.3.1) weighting the
observations inversely according to the variance functions V (µi ). This weighting
procedure turns out to be exactly equivalent to maximum likelihood estimation when
the observations actually come from an exponential family distribution.
Problem 564. Describe estimation situations in which a linear model and Nor-
mal distribution are not appropriate.
69.3. THE GENERALIZED LINEAR MODEL 1497
Its random specification is such that var[y] depends on E[y] through a variance
function φ · V (where φ is a constant taking the place of σ 2 in the regression model:)
models do not require us to specify the whole distribution but can be derived on the
basis of the mean and variance functions alone.
CHAPTER 70
Discrete choice between three or more alternatives; came from choice of trans-
portation.
The outcomes of these choices should no longer be represented by a vector y, but
one needs a matrix Y with y ij = 1 if the ith individual chooses the jth alternative,
and 0 otherwise. Consider only three alternatives j = 1, 2, 3, and define Pr(y ij =
1) = πij .
Conditional Logit model is a model which makes all πij dependent on xi . It is
πi
very simple extension of binary choice. In binary choice we had log 1−π i
= x>
i β, log
πi2 > πi3 >
of odds ratio. Here this is generalized to log πi1 = xi β 2 , and log πi1 = xi β 3 . From
1499
1500 70. MULTIPLE CHOICE MODELS
this we obtain
> >
(70.0.5) πi1 = 1 − πi2 − πi3 = 1 − πi1 exi β2
− πi1 exi β3
,
or
1
(70.0.6) πi1 = > >β ,
1 + exi β2 + exi 3
>
exi β2
(70.0.7) πi2 = > >β ,
1 + exi β2 + exi 3
>
exi β3
(70.0.8) πi3 = > > .
1 + exi β2 + exi β3
αj +βj Xi
One can write this as πij = Pe eαk +βk Xi if one defines α1 = β1 = 0. The only
estimation method used is MLE.
Y y y y Y (ex> >
i β 2 )y i2 (exi β 3 )y i3
(70.0.9) L= πi1i1 πi2i2 Πi3i3 = > > .
1 + exi β2 + exi β3
Note: the odds are independent of all other alternatives. Therefore the alterna-
tives must be chosen such that this independence is a good assumption. The choice
between walking, car, red buses, and blue buses does not satisfy this. See [Cra91,
p. 47] for the best explanation of this which I found till now.
APPENDIX A
Matrix Formulas
In this Appendix, efforts are made to give some of the familiar matrix lemmas in
their most general form. The reader should be warned: the concept of a deficiency
matrix and the notation which uses a thick fraction line multiplication with a scalar
g-inverse are my own.
Problem 569. Use theorem A.1.1 to prove that every matrix has a g-inverse.
Answer. Simple: a null matrix has its transpose as g-inverse, and if A 6= O then RL is such
a g-inverse.
The g-inverse of a number is its inverse if the number is nonzero, and is arbitrary
otherwise. Scalar expressions written as fractions are in many cases the multiplication
by a g-inverse. We will use a fraction with a thick horizontal rule to indicate where
this is the case. In other words, by definition,
a a
(A.3.2) = b− a. Compare that with the ordinary fraction .
b b
This idiosyncratic notation allows to write certain theorems in a more concise form,
but it requires more work in the proofs, because one has to consider the additional
case that the denominator is zero. Theorems A.5.8 and A.8.2 are examples.
Theorem A.3.1. If B = AA− B holds for one g-inverse A− of A, then it holds
for all g-inverses. If A is symmetric and B = AA− B, then also B > = B > A− A.
If B = BA− A and C = AA− C then BA− C is independent of the choice of g-
inverses.
Proof. Assume the identity B = AA+ B holds for some fixed g-inverse A+
(which may be, as the notation suggests, the Moore Penrose g-inverse, but this is
1506 A. MATRIX FORMULAS
not necessary), and let A− be an different g-inverse. Then AA− B = AA− AA+ B =
AA+ B = B. For the second statement one merely has to take transposes and note
that a matrix is a g-inverse of a symmetric A if and only if its transpose is. For the
third statement: BA+ C = BA− AA+ AA− C = BA− AA− C = BA− C. Here +
signifies a different g-inverse; again, it is not necessarily the Moore-Penrose one.
Problem 570. Show that x satisfies x = Ba for some a if and only if x =
BB − x.
Theorem A.3.2. Both A> (AA> )− and (A> A)− A are g-inverses of A.
Proof. We have to show
(A.3.3) A = AA> (AA> )− A
which is [Rao73, (1b.5.5) on p. 26]. Define D = A − AA> (AA> )− A and show, by
multiplying out, that DD > = O.
Problem 573. Show that S ⊥ U if and only if S is a matrix with maximal rank
which satisfies SU = O. In other words, one cannot add linearly independent rows
to S in such a way that the new matrix still satisfies T U = O.
S O
Answer. First assume S ⊥ U and take any additional row t> so that U = . Then
t> o>
Q S Q
exists a such that > = S, i.e., SQ = S, and t> = r > S. But this last equation means
r t r
that t> is a linear combination of the rows of S with the ri as coefficients. Now
conversely,
assume
> S O
S is such that one cannot add a linearly independent row t such that > U = , and let
t o>
P U = O. Then all rows of P must be linear combinations of rows of S (otherwise one could add
A.4. DEFICIENCY MATRICES 1509
such a row to S and get the result which was just ruled out), therefore P = SS where A is the
matrix of coefficients of these linear combinations.
The deficiency matrix is not unique, but we will use the concept of a deficiency
matrix in a formula only then when this formula remains correct for every deficiency
matrix. One can make deficiency matrices unique if one requires them to be projec-
tion matrices.
Problem 574. Given X and a symmetric nonnegative definite Ω such that X =
Ω W for some W . Show that X ⊥ U if and only if X >Ω − X ⊥ U .
Answer. One has to show that XY = O is equivalent to X >Ω − XY = O. ⇒ clear; for
⇐ note that X >Ω − X = W >Ω W , therefore XY = Ω W Y = Ω W (W >Ω W )− W >Ω W Y =
Ω W (W >Ω W )− X >Ω − XY = O.
A matrix is said to have full column rank if all its columns are linearly indepen-
dent, and full row rank if its rows are linearly independent. The deficiency matrix
provides a “holistic” definition for which it is not necessary to look at single rows
and columns. X has full column rank if and only if X ⊥ O, and full row rank if and
only if O ⊥ X.
Problem 575. Show that the following three statements are equivalent: (1) X
has full column rank, (2) X > X is nonsingular, and (3) X has a left inverse.
1510 A. MATRIX FORMULAS
Answer. Here use X ⊥ O as the definition of “full column rank.” Then (1) ⇔ (2) is theorem
A.4.1. Now (1) ⇒ (3): Since IO = O, a P exists with I = P X. And (3) ⇒ (1): if a P exists with
I = P X, then any Q with QO = O can be factored over X, simply say Q = QP X.
Note that the usual solution of linear matrix equations with g-inverses involves
a deficiency matrix:
Theorem A.4.3. The solution of the consistent matrix equation T X = A is
(A.4.1) X = T −A + U W
where T ⊥ U and W is arbitrary.
Proof. Given consistency, i.e., the existence of at least one Z with T Z = A,
(A.4.1) defines indeed a solution, since T X = T T − T Z. Conversely, if Y satisfies
T Y = A, then T (Y − T − A) = O, therefore Y − T − A = U W for some W .
Theorem A.4.4. Let L ⊥ T ⊥ U and J ⊥ HU ⊥ R; then
L O T
⊥ ⊥ U R.
−J HT − J H
K K
(5) If T is any other right deficiency matrix of , i.e., if ⊥ T , then
P P
(A.4.4) T = ΩΩ − T
Using (A.4.4) one can show the hint: that any D satisfying Ξ = T DT > is a
g-inverse of T >Ω − T :
To complete the proof of (5) we have to show that the expression T (T >Ω − T )− T >
does not depend on the choice of the g-inverse of T >Ω − T . This follows from
T (T >Ω − T )− T > = ΩΩ − T (T >Ω − T )− T >Ω −Ω and theorem A.5.10.
Theorem A.4.6. Given two matrices T and U . Then T ⊥ U if and only if for
any D the following two statements are equivalent:
(A.4.6) TD = O
and
for arbitrary vectors a and g. Equality holds if and only if Ω g and Ω a are linearly
dependent, i.e., α and β exist, not both zero, such that Ω gα + Ω aβ = o.
Proof: First we will show that the condition for equality is sufficient. Therefore
assume Ω gα + Ω aβ = 0 for a certain α and β, which are not both zero. Without
loss of generality we can assume α 6= 0. Then we can solve a>Ω gα + a>Ω aβ = 0 to
get a>Ω g = −(β/α)a>Ω a, therefore the lefthand side of (A.5.1) is (β/α)2 (a>Ω a)2 .
Furthermore we can solve g >Ω gα + g >Ω aβ = 0 to get g >Ω g = −(β/α)g >Ω a =
(β/α)2 a>Ω a, therefore the righthand side of (A.5.1) is (β/α)2 (a>Ω a)2 as well—i.e.,
(A.5.1) holds with equality.
Secondly we will show that (A.5.1) holds in the general case and that, if it holds
with equality, Ω g and Ω a are linearly dependent. We will split this second half of
the proof into two substeps. First verify that (A.5.1) holds if g >Ω g = 0. If this is
the case, then already Ω g = o, therefore the Ω g and Ω a are linearly dependent and,
by the first part of the proof, (A.5.1) holds with equality.
1516 A. MATRIX FORMULAS
The second substep is the main part of the proof. Assume g >Ω g 6= 0. Since Ω
is nonnegative definite, it follows
(A.5.2)
g > Ω a > g >Ω a > (g >Ω a)2 (g >Ω a)2 > (g >Ω a
0 ≤ a−g > Ω a−g = a Ωa−2 + = a Ωa−
g Ωg g >Ω g g >Ω g g >Ω g g >Ω g
>
From this follows (A.5.1). If (A.5.2) is an equality, then already Ω a−g gg>Ω a
Ωg
=
o, which means that Ω g and Ω a are linearly dependent.
Theorem A.5.8. In the situation of theorem A.5.7, one can take g-inverses as
follows without disturbing the inequality
(g >Ωa)2
(A.5.3) ≤ a>Ω a.
g >Ω g
Equality holds if and only if a γ 6= 0 exists with Ω g = Ω aγ.
Problem 577. Show that if Ω is nonnegative definite, then its elements satisfy
2
(A.5.4) ωij ≤ ωii ωjj
Answer. Let a and b be the ith and jth unit vector. Then
(b>Ω a)2 (g >Ω a)2
(A.5.5) ≤ max = a>Ω a.
>Ω g g >Ω g
b Ωb
A.5. NONNEGATIVE DEFINITE SYMMETRIC MATRICES 1517
Answer. Since g = Ω a for some a, maximize over a instead of g. This reduces it to theorem
A.5.8:
(g > x)2 (a>Ω x)2
(A.5.10) max = max = x>Ω x
Ωa for some a
g : g=Ω g >Ω − g a a>Ω a
>
Answer. To see that (A.5.11)
is a special
case of (A.3.3), take any Q with Ω = QQ and P
Σ >
with = P P and define A = KQ P . The independence of the choice of g-inverses follows
from theorem A.3.1 together with (A.5.11).
The following was apparently first shown in [Alb69] for the special case of the
Moore-Penrose pseudoinverse:
1520 A. MATRIX FORMULAS
Ω yy Ω yz
Theorem A.5.11. The symmetric partitioned matrix Ω = is non-
Ω>yz Ωzz
negative definite if and only if the following conditions hold:
(A.5.12)
> −
Ω yy and Ω zz.y := Ω zz − Ω yz Ω yy Ω yz are both nonnegative definite, and
(A.5.13) Ω yz = Ω yy Ω −
yy Ω yz
Reminder: It follows from theorem A.3.1 that (A.5.13) holds for some g-inverse
if and only if it holds for all, and that, if it holds, Ω zz.y is independent of the choice
of the g-inverse.
Proof of theorem A.5.11: First we prove the necessity of the three conditions
in the theorem. If the symmetric partitioned matrix Ω is nonnegative definite,
>
Ω yy Ω yz
there exists a R with Ω = R R. Write R = Ry Rz to get =
Ω>yz Ω zz
>
Ry Ry Ry > Rz
. Ω yy is nonnegative definite because it is equal to Ry > Ry ,
Rz > Ry Rz > Rz
−
and (A.5.13) follows from (A.5.11): Ω yy Ω yy Ω yz = Ry > Ry (Ry > Ry )− Ry > Rz =
>
Ry Rz = Ω yz . To show that Ω zz.y is nonnegative definite, define S = (I −
Ry (Ry > Ry )− Ry > )Rz . Then S > S = Rz > I − Ry (Ry > Ry )− Ry > Rz = Ω zz.y .
A.5. NONNEGATIVE DEFINITE SYMMETRIC MATRICES 1521
To show sufficiency
of the three conditions of theorem A.5.11, assume the sym-
Ω yy Ω yz
metric satisfies them. Pick two matrices Q and S so that Ω yy = Q> Q
Ω>yz Ω zz
and Ω zz.y = S > S. Then
" #
Q> −
Ω yy Ω yz O Q Ωyy
QΩ Ω yz
= − > > ,
Ω>yz Ω zz Ω>
yz Ω yy Q S> O S
• a. Show that Q e −1 Q
e − QQ e is nonnegative definite.
−1 −1
Answer. We know that Q
e − Q∗−1 is nnd, therefore Q
eQe Q e Q∗−1 Q
e −Q e nnd.
Answer. We will write it in a symmetric form from which it is obvious that it is nonnegative
definite:
(A.5.14) Q∗ − Q∗ Q−1 Q∗ = Q∗ − Q∗ (Q
e + Q∗ )−1 Q∗
(A.5.15) = Q∗ (Q
e + Q∗ )−1 (Q
e + Q∗ − Q∗ ) = Q∗ (Q
e + Q∗ )−1 Q
e
−1
(A.5.16) =Q e + Q∗ )−1 (Q
e (Q e + Q∗ )Q
e Q∗ (Q
e + Q∗ )−1 Q
e
−1
(A.5.17) e Q−1 (Q∗ + Q∗ Q
=Q e Q∗ )Q−1 Q
e.
Problem 584. Given the vector h 6= o. For which values of the scalar γ is
>
the matrix I − hhγ singular, nonsingular, nonnegative definite, a projection matrix,
orthogonal?
Answer. It is nnd iff γ ≥ h> h, because of theorem A.5.9. One easily verifies that it is
orthogonal iff γ = h> h/2, and it is a projection matrix iff γ = h> h. Now let us prove that it is
singular iff γ = h> h: if this condition holds, then the matrix annuls h; now assume the condition
>
does not hold, i.e., γ 6= h> h, and take any x with (I − hhγ )x = o. It follows x = hα where
>
α = h> x/γ, therefore (I − hhγ )x = hα(1 − h> h/γ). Since h 6= o and 1 − h> h/γ 6= 0 this can
only be the null vector if α = 0.
A.6. PROJECTION MATRICES 1523
Answer. Idempotence requires theorem A.3.2, and symmetry the invariance under choice of
g-inverse. Furthermore one has to show X(X > X)− Xa = a holds if and only if a = Xb for some
b. ⇒ is clear, and ⇐ follows from theorem A.3.2.
Theorem A.6.1. Let P and Q be projection matrices, i.e., both are symmetric
and idempotent. Then the following five conditions are equivalent, each meaning that
the space on which P projects is a subspace of the space on which Q projects:
Answer. Instead of going in a circle it is more natural to show (A.6.1) ⇐⇒ (A.6.2) and
(A.6.3) ⇐⇒ (A.6.2) and then go in a circle for the remaining conditions: (A.6.2), (A.6.3) ⇒
(A.6.4) ⇒ (A.6.3) ⇒ (A.6.5).
(A.6.1) ⇒ (A.6.2): R[P ] ⊂ R[Q] means that for every c exists a d with P c = Qd. Therefore
far all c follows QP c = QQd = Qd = P c, i.e., QP = P .
(A.6.2) ⇒ (A.6.1): if P c = QP c for all c, then clearly R[P ] ⊂ R[Q].
(A.6.2) ⇒ (A.6.3) by symmetry of P and Q: If QP = P then P Q = P > Q> = (QP )> =
P> = P.
(A.6.3) ⇒ (A.6.2) follows in exactly the same way: If P Q = P then QP = Q> P > = (P Q)> =
P> = P.
A.6. PROJECTION MATRICES 1525
Problem 587. If Y = XA for some A, show that Y (Y > Y )− Y > X(X > X)− X >
Y (Y > Y )− Y > .
Therefore geometrically the statement follows from the fact shown in Problem 585 that the
above matrices are projection matrices on the columnn spaces. But it can also be shown alge-
braically: Y (Y > Y )− Y > X(X > X)− X > = Y (Y > Y )− A> X > X(X > X)− X > = Y (Y > Y )− Y > .
Problem 588. (Not eligible for in-class exams) Let Q be a projection matrix
(i.e., a symmetric and idempotent matrix) with the property that Q = XAX > for
1526 A. MATRIX FORMULAS
Only for the fourth term did we need the condition Q = XAX > :
(A.6.10) X > XAX > X(X > X)− X > XAX > X = X > XAX > XAX > X = X > QQX = X > X.
A.6. PROJECTION MATRICES 1527
>
(A.6.11) X(X > X)− X > − X̃(X > X)− X̃ = X(X > X)− X > − (I − Q)X(X > X)− X > (I − Q) =
(A.6.12)
= X(X > X)− X > −X(X > X)− X > +X(X > X)− X > Q+QX(X > X)− X > −QX(X > X)− X > Q = X
Problem 589. Given any projection matrix P . Show that its ith diagonal ele-
ment can be written
X
(A.6.13) pii = p2ij .
j
P
Answer. From idempotence P = P P follows pii = j
pij pji , now use symmetry to get
(A.6.13).
1528 A. MATRIX FORMULAS
A.7. Determinants
Theorem A.7.1. The determinant of a block-triangular matrix is the product of
the determinants of the blocks in the diagonal. In other words,
A B
(A.7.1) O D = |A| |D|
Answer.
A B A B I O A − BD − C B
(A.7.6) = = = A − BD − D |D| .
D −D − C
C D C I O D
1530 A. MATRIX FORMULAS
Problem 591. Show that whenever BC and CB are defined, it follows |I − BC| =
|I − CB|
Answer. Set A = I and D = I in (A.7.3) and (A.7.5).
A B
is nonnegative definite symmetric, but it also holds in the nonsymmetric
C D
case if A is nonsingular, which by theorem A.7.2 is the case if the whole partioned
matrix is nonsingular.) Define E = D − CA− B, F = A− B, and G = CA− .
Answer. This here is not the shortest proof because I was still wondering if it could be
formulated in a more general way. Multiply out but do not yet use the conditions B = AA− B and
C = CA− A:
A B A− + F E − G −F E − AA− − (I − AA− )BE − G (I − AA− )BE −
(A.8.3) =
C D −E − G E− (I − EE − )G EE −
1532 A. MATRIX FORMULAS
and
AA− − (I − AA− )BE − G (I − AA− )BE − A B
(A.8.4) =
(I − EE − )G EE − C D
A + (I − AA− )BE − C(I − A− A) B − (I − AA− )B(I − E − E)
=
C − (I − EE − )C(I − A− A) D
One sees that not only the conditions B = AA− B and C = CA− A, but also the conditions B =
AA− B and C = EE − C, or alternatively the conditions B = BE − E and C = CA− A imply the
statement. I think one can also work with the conditions AA− B = BD − D and DD − C = CA− A.
Note that the lower right partition is D no matter what.
U V A AF
• b. If is a g-inverse of , show that X is a g-
W X GA E + GAF
inverse of E.
Answer. The g-inverse condition means
A AF U V A AF A AF
(A.8.5) =
GA E + GAF W X GA E + GAF GA E + GAF
For the upper left partition this means AU A+AF W A+AV GA+AF XGA = A, and for the upper
right partition it means AU AF + AF W AF + AV E + AV GAF + AF XE + AF XGAF = AF .
Postmultiply the upper left equation by F and subtract from the upper right to get AV E +
AF XE = O. For the lower left we get GAU A + EW A + GAF W A + GAV GA + EXGA +
GAF XGA = GA. Premultiplication of the upper left equation by G and subtraction gives EW A+
EXGA = O. For the lower right corner we get GAU AF + EW AF + GAF W AF + GAV E +
EXE +GAF XE +GAV GAF +EXGAF +GAF XGAF = E +GAF . Since AV E +AF XE = O
and EW A + EXGA = O, this simplifies to GAU AF + GAF W AF + EXE + GAV GAF +
GAF XGAF = E + GAF . And if one premultiplies the upper right corner by G and postmultiplies
it by F and subtracts it from this one gets EXE = E.
−
(A.8.14) D + CA− B = D − − D − C(A + BD − C)− BD − .
Answer. Proof: Define E = D + CA− B. Then it follows from the assumptions that
(A.8.15)
Since AA− (A + BD − C ) = A + BD − C, we have to show that the second term on the rhs. annulls
(A + BD − C ). Indeed,
(A.8.17) BD − (I − EE − )CA− (A + BD − C ) =
(A.8.18) = BD CA A + BD − CA− BD − C − BD − EE − CA− A − BD − EE − CA− BD − C =
− −
(A.8.19)
= BD − (D + CA− B − EE − D − EE − CA− B)D − C = BD − (E − EE − E)D − C = O.
diagonal elements are nonzero and the others zero. If one removes those eigenvectors
from T which belong to the eigenvalue zero, and calls the remaining matrix P , one
gets the following:
Theorem A.9.1. If B is a symmetric n × n matrix of rank r, then a r × n
matrix P exists with P P > = I (any P satisfying this condition which is not a
square matrix is called incomplete orthogonal), and B = P > ΛP , where Λ is a r × r
diagonal matrix with all diagonal elements nonzero.
Proof. Let T be
an orthogonal matrix whose rows are eigenvectors of B, and
P
partition it T = where P consists of all eigenvectors with nonzero eigenvalue
Q
Λ O
(there are r of them). The eigenvalue property reads B P > Q> = P > Q>
O O
Λ O P
therefore by orthogonality T > T = I follows B = P > Q>
=
O O Q
P > I O
P > ΛP . Orthogonality also means T T > = I, i.e., Q> =
P ,
Q O I
therefore P P > = I.
Problem 599. If B is a n × n symmetric matrix of rank r and B 2 = B, i.e.,
B is a projection, then a r × n matrix P exists with B = P > P and P P > = I.
A.9. EIGENVALUES AND SINGULAR VALUE DECOMPOSITION 1539
A theorem similar to A.9.1 holds for arbitrary matrices. It is called the “singular
value decomposition”:
Theorem A.9.2. Let B be a m × n matrix of rank r. Then B can be expressed
as
(A.9.1) B = P > ΛQ
where Λ is a r × r diagonal matrix with positive diagonal elements, and P P > = I
as well as QQ> = I. The diagonal elements of Λ are called the singular values of
B.
Proof. If P > ΛQ is the svd of B then P > ΛQQ> ΛP = P > Λ2 Q is the eigen-
value decomposition of BB > . We will use this fact to construct P and Q, and then
verify condition (A.9.1). P and Q have r rows each, write them
> >
p1 q1
.. ..
(A.9.2) P = . and Q = . .
p>
r q>
r
1540 A. MATRIX FORMULAS
>
Problem 600. Show that the q i are orthonormal eigenvectors of B B corre-
sponding to the same eigenvalues λ2i .
Answer.
−1 > −1
(A.9.6) q> >
i q j = λi pi BB pj λj = λ−1 > 2 −1
i pi pj λj λj = δij Kronecker symbol
(A.9.7) > >
B Bq i = B BB >
pi λ−1
i
>
= B pi λi = q i λ2i
Answer. The second condition comes from the definition q i = B > pi λ−1
i , and premultiply
this definition by B to get Bq i = BB > pi λ−1
i = λ2 pi λ−1
i = λpi .
P Q
Let P 0 and Q0 be such that and are orthogonal. Then the singular
P0 Q0
value decomposition can also be written in the full form, in which the matrix in the
middle is m × n:
> >
Λ O Q
(A.9.8) B= P P0
O O Q0
Problem 602. Let λ1 be the biggest diagonal element of Λ, and let c and d be
two vectors with the properties that c> Bd is defined and c> c = 1 as well as d> d = 1.
Show that c> Bd ≤ λ1 . The other singular values maximize among those who are
orthogonal to the prior maximizers.
Answer. c> Bd = c> P > ΛQd = h> Λk where we call P c P = h and Qd P= k. 2 By Cauchy-
Schwartz (A.5.1), (h> Λk)2 ≤ (h> Λh)(k> Λk). Now (h> Λk) = λii h2i ≤ λ11 hi = λ11 h> h.
Now we only have to show that h> h ≤ 1: 1 − h> h = c> c − c> P > P c = c> (I − P > P )c =
c> (I − P > P )(I − P > P )c ≥ 0, here we used that P P > = I, therefore P > P idempotent, therefore
also I − P > P idempotent.
APPENDIX B
1543
1544 B. ARRAYS OF HIGHER RANK
(B.1.1) C >A = r C A n = r C m A n .
m
B.1. INFORMAL SURVEY OF THE NOTATION 1545
In the second representation, the tile representing C is turned by 180 degrees. Since
the white part of the frame of C is at the bottom, not on the top, one knows that
the West arm of C, not its East arm, is concatenated with the West arm of A. The
transpose of m C r is r C m , i.e., it is not a different entity but
the same entity in a different position. The order in which the elements are arranged
on the page (or in computer memory) is not a part of the definition of the array
itself. Likewise, there is no distinction between row vectors and column vectors.
Vectors are usually, but not necessarily, written in such a way that their arm
points West (column vector convention). If a and b are vectors, their
scalar product a> b is the concatenation a b which has no free arms, i.e.,
it is a scalar, and their outer product ab> is a b , which is a matrix.
Juxtaposition of tiles represents the outer product, i.e., the array consisting of all
the products of elements of the arrays represented by the tiles placed side by side.
The trace of a square matrix Q is the concatenation Q , which
is a scalar since no arms are sticking out. In general, concatenation of two arms of
the same tile represents contraction, i.e., summation over equal values of the indices
associated with these two arms. This notation makes it obvious that tr XY =
1546 B. ARRAYS OF HIGHER RANK
(1) Juxtapose the tiles for X and Y , i.e., form their outer product, which is
an array of rank 4 with typical element xmp yqn .
(2) Connect the East arm of X with the West arm of Y . This is a contrac-
tion, resulting
P in an array of rank 2, the matrix product XY , with typical
element p xmp ypn .
(3) Now connect the West arm of X with the East arm of P Y . The result of
this second contraction is a scalar, the trace tr XY = p,m xmp ypm .
The result is the same, the notation does not specify which of these alternative eval-
uation paths is meant, and a computer receiving commands based on this notation
can choose the most efficient evaluation path. Probably the most efficient evaluation
path is given by (B.2.8) below: take the element-by-element product of X with the
transpose of Y , and add all the elements of the resulting matrix.
If the user specifies tr(XY ), the computer is locked into one evaluation path: it
first has to compute the matrix product XY , even if X is a column vector and Y a
row vector and it would be much more efficient to compute it as tr(Y X), and then
form the trace, i.e., throw away all off-diagonal elements. If the trace is specified
evaluation paths transparently to the user. This advantage of the graphical notation
is of course even more important if the graphs are more complex.
There is also the “diagonal” array, which in the case of rank 3 can be written
n ∆ n n
(B.1.2) or ∆ n
n n
or similar configurations. It has 1’s down the main diagonal and 0’s elsewhere. It
can be used to construct the diagonal matrix diag(x) of a vector (the square matrix
1548 B. ARRAYS OF HIGHER RANK
n ∆ n
(B.1.3) diag(x) = ,
x
the diagonal vector of a square matrix (i.e., the vector containing its diagonal ele-
ments) as
(B.1.4) ∆ A ,
x
(B.1.5) x∗y= ∆ .
y
All these are natural operations involving vectors and matrices, but the usual matrix
notation cannot represent them and therefore ad-hoc notation must be invented for
B.2. AXIOMATIC DEVELOPMENT OF ARRAY OPERATIONS 1549
it. In our graphical representation, however, they all can be built up from a small
number of atomic operations, which will be enumerated in Section B.2.
Each such graph can be evaluated in a number of different ways, and all these
evaluations give the same result. In principle, each graph can be evaluated as follows:
form the outer product of all arrays involved, and then contract along all those pairs
of arms which are connected. For practical implementations it is more efficient to
develop functions which connect two arrays along one or several of their arms without
first forming outer products, and to perform the array concatenations recursively in
such a way that contractions are done as early as possible. A computer might be
programmed to decide on the most efficient construction path for any given array.
. . . ” (From (B.2.4) and other axioms below it will follow that each unit vector can
be represented as a m-vector with 1 as one of the components and 0 elsewhere.)
For every rank ≥ 1 and dimension n ≥ 1 there is a unique diagonal array denoted
by ∆. Their main properties are (B.2.1) and (B.2.2). (This and the other axioms
must be formulated in such a way that it will be possible to show that the diagonal
arrays of rank 1 are the “vectors of ones” ι which have 1 in every component; diagonal
arrays of rank 2 are the identity matrices; and for higher ranks, all arms of a diagonal
array have the same dimension, and their ijk · · · element is 1 if i = j = k = · · ·
and 0 otherwise.) Perhaps it makes sense to define the diagonal array of rank 0
and dimension n to be the scalar n, and to declare all arrays which are everywhere
0-dimensional to be diagonal.
There are only three operations of arrays: their outer product, represented by
writing them side by side, contraction, represented by the joining of arms, and the
direct sum, which will be defined now:
The direct sum is the operation by which a vector can be built up from scalars,
a matrix from its row or column vectors, an array of rank 3 from its layers, etc. The
direct sum of a set of r similar arrays (i.e., arrays which have the same number of
arms, and corresponding arms have the same dimensions) is an array which has one
additional arm, called the reference arm of the direct sum. If one “saturates” the
reference arm with the ith unit vector, one gets the ith original array back, and this
B.2. AXIOMATIC DEVELOPMENT OF ARRAY OPERATIONS 1551
m m m m
r
M
Ai n = r S n ⇒ i r S n = Ai n .
i=1
q q q q
It is impossible to tell which is the first summand and which the second, direct sum
is an operation defined on finite sets of arrays (where different elements of a set may
be equal to each other in every respect but still have different identities).
There is a broad rule of associativity: the order in which outer products and
contractions are performed does not matter, as long as the at the end, the right arms
are connected with each other. And there are distributive rules involving (contracted)
outer products and direct sums.
Additional rules apply for the special arrays. If two different diagonal arrays
join arms, the result is again a diagonal array. For instance, the following three
concatenations of diagonal three-way arrays are identical, and they all evaluate to
1552 B. ARRAYS OF HIGHER RANK
∆
∆ ∆
(B.2.1) = ∆ ∆ = = ∆
∆
The diagonal array of rank 2 is neutral under concatenation, i.e., it can be written
as
(B.2.2) n ∆ n = .
because attaching it to any array will not change this array. (B.2.1) and (B.2.2) make
it possible to represent diagonal arrays simply as the branching points of several arms.
This will make the array notation even simpler. However in the present introductory
article, all diagonal arrays will be shown explicitly, and the vector of ones will be
denoted m ι instead of m ∆ or perhaps m δ .
Unit vectors concatenate as follows:
(
1 if i = j
(B.2.3) i m j =
0 otherwise.
B.2. AXIOMATIC DEVELOPMENT OF ARRAY OPERATIONS 1553
and the direct sum of all unit vectors is the diagonal array of rank 2:
n
M
(B.2.4) i n = n ∆ n = .
i=1
I am sure there will be modifications if one works it all out in detail, but if done
right, the number of axioms should be fairly small. Element-by-element addition of
arrays is not an axiom because it can be derived: if one saturates the reference arm
of a direct sum with the vector of ones, one gets the element-by-element sum of the
arrays in this direct sum. Multiplication of an array by a scalar is also contained in
the above system of axioms: it is simply the outer product with an array of rank
zero.
Problem 603. Show that the saturation of an arm of a diagonal array with the
vector of ones is the same as dropping this arm.
Answer. Since the vector of ones is the diagonal array of rank 1, this is a special case of the
general concantenation rule for diagonal arrays.
1554 B. ARRAYS OF HIGHER RANK
Problem 604. Show that the diagonal matrix of the vector of ones is the identity
matrix, i.e.,
n ∆ n
(B.2.5) = .
ι
Answer. It is a special case of the direct sum: the direct sum of one array only, the only effect
of which is the addition of the reference arm.
From (B.2.4) and (B.2.2) follows that every array of rank k can be represented
as a direct sum of arrays of rank k − 1, and recursively, as iterated direct sums of
those scalars which one gets by saturating all arms with unit vectors. Hence the
following “extensionality property”: if the arrays A and B are such that for all
B.2. AXIOMATIC DEVELOPMENT OF ARRAY OPERATIONS 1555
κ3 κ4 κ5 κ3 κ4 κ5
(B.2.6) κ2 A κ6 = κ2 B κ6
κ1 κ8 κ7 κ1 κ8 κ7
then A = B. This is why the saturation of an array with unit vectors can be
considered one of its “elements,” i.e.,
κ3 κ4 κ5
(B.2.7) κ2 A κ6 = aκ1 κ2 κ3 κ4 κ5 κ6 κ7 κ8 .
κ1 κ8 κ7
From (B.2.3) and (B.2.4) follows that the concatenation of two arrays by joining
one or more pairs of arms consists in forming all possible products and summing over
1556 B. ARRAYS OF HIGHER RANK
those subscripts (arms) which are joined to each other. For instance, if
m A n B r = m C r ,
Pn
then cµρ = ν=1 aµν bνρ . This is one of the most basic facts if one thinks of arrays
as collections of elements. From this point of view, the proposed notation is simply
a graphical elaboration of Einstein’s summation convention. But in the holistic
approach taken by the proposed system of axioms, which is informed by category
theory, it is an implication; it comes at the end, not the beginning.
Instead of considering arrays as bags filled with elements, with the associated
false problem of specifying the order in which the elements are packed into the
bag, this notation and system of axioms consider each array as an abstract entity,
associated with a certain finite graph. These entities can be operated on as specified
in the axioms, but the only time they lose their abstract character is when they are
fully saturated, i.e., concatenated with each other in such a way that no free arms are
left: in this case they become scalars. An array of rank 1 is not the same as a vector,
although it can be represented as a vector—after an ordering of its elements has
been specified. This ordering is not part of the definition of the array itself. (Some
vectors, such as time series, have an intrinsic ordering, but I am speaking here of
the simplest case where they do not.) Also the ordering of the arms is not specified,
and the order in which a set of arrays is packed into its direct sum is not specified
B.2. AXIOMATIC DEVELOPMENT OF ARRAY OPERATIONS 1557
either. These axioms therefore make a strict distinction between the abstract entities
themselves (which the user is interested in) and their various representations (which
the computer worries about).
Maybe the following examples may clarify these points. If you specify a set of
colors as {red, green, blue}, then this representation has an ordering built in: red
comes first, then green, then blue. However this ordering is not part of the definition
of the set; {green, red, blue} is the same set. The two notations are two different rep-
resentations of the same set. Another example: mathematicians usually distinguish
between the outer products A ⊗ B and B ⊗ A; there is a “natural isomorphism”
between them but they are two different objects. In the system of axioms proposed
here these two notations are two different representations of the same object, as in
the set example. This object is represented by a graph which has A and B as nodes,
but it is not apparent from this graph which node comes first. Interesting conceptual
issues are involved here. The proposed axioms are quite different than e.g. [Mor73].
Problem 606. The trace of the product of two matrices can be written as
(B.2.8) tr(XY ) = ι> (X ∗ Y > )ι.
I.e., one forms the element-by-element product of X and Y > and takes the sum of
all the elements of the resulting matrix. Use tile notation to show that this gives
indeed tr(XY ).
1558 B. ARRAYS OF HIGHER RANK
Answer. In analogy with (B.1.5), the Hadamard product of the two matrices X and Z, i.e.,
their element by element multiplication, is
X
X∗Z= ∆ ∆
Z
X
X ∗Y> = ∆ ∆ .
Y
X X
ι> (X ∗ Y > )ι = ι ∆ ∆ ι = = tr(XY )
Y Y
B.3. AN ADDITIONAL NOTATIONAL DETAIL 1559
k m n
m
L n L k L
k L n
n m m k
n m k
m
k L n L L
n L k
m k m n
The black-and-white pattern at the edge of the tile indicates whether and how much
the tile has been turned and/or flipped over, so that one can keep track which arm
is which. In the above example, the arm with dimension k will always be called the
West arm, whatever position the tile is in.
B.4. EQUALITY OF ARRAYS AND EXTENDED SUBSTITUTION 1561
A = B or K = K
are not allowed. The arms on both sides of the equal sign must be parallel, in order
to make it clear which arm corresponds to which. A permissible way to write the
above expressions would therefore be
A = B and K = K
One additional benefit of this tile notation is the ability to substitute arrays with
different numbers of arms into an equation. This is also a necessity since the number
of possible arms is unbounded. This multiplicity can only be coped with because
1562 B. ARRAYS OF HIGHER RANK
each arm in an identity written in this notation can be replaced by a bundle of many
arms.
Extended substitution also makes it possible to extend definitions familiar from
matrices to higher arrays. For instance we want to be able to say that the array
Ω is symmetric if and only if Ω = Ω . This notion of symme-
try is not limited to arrays of rank 2. The arms of this array may symbolize not just
a single arm, but whole bundles of arms; for instance an array of the form Σ
every scalar. Also the notion of a nonnegative definite matrix, or of a matrix inverse
or generalized inverse, or of a projection matrix, can be extended to arrays in this
way.
notation is
A
(B.5.1)
B
Since this is an array of rank 4, there is no natural way to write its elements down
on a sheet of paper. This is where the Kronecker product steps in. The Kronecker
product of two matrices is their outer product written again as a matrix. Its definition
includes a protocol how to arrange the elements of an array of rank 4 as a matrix.
Alongside the Kronecker product, also the vectorization operator is useful, which is
a protocol how to arrange the elements of a matrix as a vector, and also the so-called
“commutation matrices” may become necessary. Here are the relevant definitions:
an
1564 B. ARRAYS OF HIGHER RANK
By the way, a better protocol for vectorizing would have been to assemble all
rows into one long row vector and then converting it into a column vector. In other
words
>
b1 b1
.. ..
if B = . then vec(B) should have been defined as . .
b>
m bm
The usual protocol of stacking the columns is inconsistent with the lexicograpical
ordering used in the Kronecker product. Using the alternative definition, equation
(B.5.19) which will be discussed below would be a little more intelligible; it would
read
vec(ABC) = (A ⊗ C > ) vec B with the alternative definition of vec
B.5. VECTORIZATION AND KRONECKER PRODUCT 1565
and also the definition of vectorization in tile notation would be a little less awkward;
instead of (B.5.24) one would have
m
mn vec A = mn Π A
n
But this is merely a side remark; we will use the conventional definition (B.5.2)
throughout.
Problem 608. [The71, pp. 303–306] Prove the following simple properties of
the Kronecker product:
If a is a 1 × 1 matrix, then
(B.5.16) a ⊗ B = B ⊗ a = aB
(B.5.17) det(A ⊗ B) = (det(A))n (det(B))k
where A is k × k and B is n × n.
Answer. For the determinant use the following facts: if a is an eigenvector of A with eigenvalue
α and b is an eigenvector of B with eigenvalue β, then a⊗b is an eigenvector of A⊗B with eigenvalue
αβ. The determinant is the product of all eigenvalues (multiple eigenvalues being counted several
times). Count how many there are.
An alternative approach would be to write A ⊗ B = (A ⊗ I)(I ⊗ B) and then to argue that
det(A ⊗ I) = (det(A))n and det(I ⊗ B) = (det(B))k .
The formula for the rank can be shown using rank(A) = tr(AA− ). compare Problem 566.
Problem 609. 2 points [JHG+ 88, pp. 962–4] Write down the Kronecker product
of
1 3 2 2 0
(B.5.18) A= and B= .
2 0 1 0 3
Show that A ⊗ B 6= B ⊗ A. Which other facts about the outer product do not carry
over to the Kronecker product?
1568 B. ARRAYS OF HIGHER RANK
Answer.
2 2 0 6 6 0 2 6 2 6 0 0
1 0 3 3 0 9 4 0 4 0 0 0
A⊗B = B⊗A=
4 4 0 0 0 0 1 3 0 0 3 9
2 0 6 0 0 0 2 0 0 0 6 0
a>
1
.
Answer. Assume A is k × m, B is m × n, and C is n × p. Write A = .. and B =
a>k
b1 ··· bn . Then (C > ⊗ A) vec B =
The main challenge in this automatic proof is to fit the many matrix rows, columns, and single
elements involved on the same sheet of paper. Among the shuffling of matrix entries, it is easy to
lose track of how the result comes about. Later, in equation (B.5.29), a compact and intelligible
proof will be given in tile notation.
B.5. VECTORIZATION AND KRONECKER PRODUCT 1571
The dispersion of a random matrix Y is often given as the matrix V [vec Y ], where
the vectorization is usually not made explicit, i.e., this matrix is denoted V [Y ].
Problem 612. 2 points If α and γ are vectors, then show that vec(αγ > ) =
γ ⊗ α.
Answer. One sees this by writing down the matrices, or one can use (B.5.19) with A = α,
B = 1, the 1 × 1 matrix, and C = γ > .
Answer.
B.5.3. The Commutation Matrix. Besides the Kronecker product and the
vectorization operator, also the “commutation matrix” [MN88, pp. 46/7], [Mag88,
p. 35] is needed for certain operations involving arrays of higher rank. Assume A
is m × n. Then the commutation matrix K (m,n) is the mn × mn matrix which
transforms vec A into vec(A> ):
The main property of the commutation matrix is that it allows to commute the
Kronecker product. For any m × n matrix A and r × q matrix B follows
Answer.
1 0 0 0 0 0
0 0 1 0 0 0
0 0 0 0 1 0
(B.5.22) K (2,3) =
0 1 0 0 0 0
0 0 0 1 0 0
0 0 0 0 0 1
m A n
(B.5.23) mr A⊗B nq = mr Π Π nq
r B q
Strictly speaking we should have written Π(m,r) and Π(n,q) for the two Π-arrays in
(B.5.23), but the superscripts can be inferred from the context: the first superscript
is the dimension of the Northeast arm, and the second that of the Southeast arm.
Vectorization uses a member of the same family Π(m,n) to convert the matrix
n A m into the vector
m
(B.5.24) mn vec A = mn Π A
n
This equation is a little awkward because the A is here a n × m matrix, while else-
where it is a m × n matrix. It would have been more consistent with the lexicograph-
ical ordering used in the Kronecker product to define vectorization as the stacking
of the row vectors; then some of the formulas would have looked more natural.
B.5. VECTORIZATION AND KRONECKER PRODUCT 1575
m
The array Π(m,n) = mn Π exists for every m ≥ 1 and n ≥ 1. The
n
dimension of the West arm is always the product of the dimensions of the two East
arms. The elements of Π(m,n) will be given in (B.5.30) below; but first I will list
three important properties of these arrays and give examples of their application.
First of all, each Π(m,n) satisfies
m m m m
(B.5.25) Π mn Π = .
n n n n
Let us discuss the meaning of (B.5.25) in detail. The lefthand side of (B.5.25) shows
the concatenation of two copies of the three-way array Π(m,n) in a certain way that
yields a 4-way array. Now look at the righthand side. The arm m m by itself
(which was bent only in order to remove any doubt about which arm to the left of
the equal sign corresponds to which arm to the right) represents the neutral element
under concatenation (i.e., the m × m identity matrix). Writing two arrays next to
1576 B. ARRAYS OF HIGHER RANK
each other without joining any arms represents their outer product, i.e., the array
whose rank is the sum of the ranks of the arrays involved, and whose elements are
all possible products of elements of the first array with elements of the second array.
The second identity satisfied by Π(m,n) is
m
(B.5.26) mn Π Π mn = mn mn .
n
Finally, there is also associativity:
m m
mnp Π n Π
(B.5.27) =
Π mnp Π n
p p
Here is the answer to Problem 607 in tile notation:
tr B > C = B C = B Π Π C =
Equation (B.5.25) was central for obtaining the result. The answer to Problem 610
also relies on equation (B.5.25):
C
>
C ⊗A vec B = Π Π Π B
A
C
= Π B
A
m µ (
(m,n) 1 if θ = (µ − 1)n + ν
(B.5.30) πθµν = θ mn Π =
0 otherwise.
n ν
1578 B. ARRAYS OF HIGHER RANK
(m,n)
Note that for every θ there is exactly one µ and one ν such that πθµν = 1; for all
(m,n)
other values of µ and ν, πθµν = 0.
Writing ν A µ = aνµ and θ vec A = cθ , (B.5.24) reads
(m,n)
X
(B.5.31) cθ = πθµν aνµ ,
µ,ν
(m,r) (n,q)
X
(B.5.32) cφθ = πφµρ aµν bρκ πθνκ .
µ,ν,ρ,κ
(m,r)
For 1 ≤ φ ≤ r one gets a nonzero πφµρ only for µ = 1 and ρ = φ, and for 1 ≤ θ ≤ q
(n,q)
one gets a nonzero πθνκ only for ν = 1 and κ = θ. Therefore cφθ = a11 bφθ for all
elements of matrix C with φ ≤ r and θ ≤ q. Etc.
B.5. VECTORIZATION AND KRONECKER PRODUCT 1579
The proof of (B.5.25) uses the fact that for every θ there is exactly one µ and
(m,n)
one ν such that πθµν 6= 0:
θ=mn
(
X (m,n) (m,n) 1 if µ = ω and ν = σ
(B.5.33) πθµν πθωσ =
θ=1
0 otherwise
Similarly, (B.5.26) and (B.5.27) can be shown by elementary but tedious proofs.
The best verification of these rules is their implementation in a computer language,
see Section ?? below.
m
(m,n)
(B.5.34) K = mn Π Π mn .
n
This should not be confused with the lefthand side of (B.5.26): K (m,n) is composed
of Π(m,n) on its West and Π(n,m) on its East side, while (B.5.26) contains Π(m,n)
twice. We will therefore use the following representation, mathematically equivalent
1580 B. ARRAYS OF HIGHER RANK
m
(B.5.35) K (m,n) = mn Π Π mn .
n
Problem 615. Using the definition (B.5.35) show that K (m,n) K (n,m) = I mn ,
the mn × mn identity matrix.
r m A n
rm Π Π rm Π Π
m r B q
nq =
n
B.5. VECTORIZATION AND KRONECKER PRODUCT 1581
r A n
= rm Π Π nq =
m B q
r B q
= rm Π Π nq .
m A n
APPENDIX C
Matrix Differentiation
dy
(C.1.1) = f 0 (x)
dx
Multiply through by dx to get dy = f 0 (x) dx. In order to see the meaning of this
equation, we must know the definition dy = f (x + dx) − f (x). Therefore one obtains
f (x + dx) = f (x) + f 0 (x) dx. If one holds x constant and only varies dx this formula
shows that in an infinitesimal neighborhood of x, the function f is an affine function
1583
1584 C. MATRIX DIFFERENTIATION
of dx, i.e., a linear function of dx with a constant term: f (x) is the intercept, i.e.,
the value for dx = 0, and f 0 (x) is the slope parameter.
Now let us transfer this argument to vector functions y = f (x). Here y is a
n-vector and x a m-vector, i.e., f is a n-tuple of functions of m variables each
y1 f1 (x1 , . . . , xm )
.. ..
(C.1.2) . =
.
yn fn (x1 , . . . , xm )
One may also say, f is a n-vector, each element of which depends on x. Again, under
certain differentiability conditions, it is possible to write this function infinitesimally
as an affine function, i.e., one can write
(C.1.3) f (x + dx) = f (x) + Adx.
Here the coefficient of dx is no longer a scalar but necessarily a matrix A (whose
elements again depend on x). A is called the Jacobian matrix of f . The Jacobian
matrix generalizes the concept of a derivative to vectors. Instead of a prime denoting
the derivative, as in f 0 (x), one writes A = Df .
Problem 617. 2 points If f is a scalar function of a vector argument x, is its
Jacobian matrix A a row vector or a column vector? Explain why this must be so.
C.1. FIRST DERIVATIVES 1585
The Jacobian A defined in this way turns out to have a very simple functional
form: its elements are the partial derivatives of all components of f with respect to
all components of x:
∂fi
(C.1.4) aij = .
∂xj
Since in this matrix f acts as column and x as a row vector, this matrix can be
written, using matrix differentiation notation, as A(x) = ∂f (x)/∂x> .
Strictly speaking, matrix notation can be used for matrix differentiation only if
we differentiate a column vector (or scalar) with respect to a row vector (or scalar),
or if we differentiate a scalar with respect to a matrix or a matrix with respect to a
scalar. If we want to differentiate matrices with respect to vectors or vectors with
respect to matrices or matrices with respect to each other, we need the tile notation
for arrays. A different, much less enlightening approach is to first “vectorize” the
matrices involved. Both of those methods will be discussed later.
If the dependence of y on x can be expressed in terms of matrix operations
or more general array concatenations, then some useful matrix differentiation rules
exist.
1586 C. MATRIX DIFFERENTIATION
wn xn
is
∂w> x ∂ ∂
= ∂x1 (w1 x1 + · · · + wn xn ) · · · ∂xn (w1 x1 + · · · + wn xn )
∂x>
= w1 · · · wn = w>
and take the partial derivative of this sum with respect to each of the xi . For instance,
differentiation with respect to x1 gives
Now split the upper diagonal element, writing it as m11 x1 + x1 m11 , to get
The sum of the elements in the first row is the first element of the column vector
M x, and the sum of the elements in the column underneath is the first element of
the row vector x> M . Overall this has to be arranged as a row vector, since we
differentiate with respect to ∂x> , therefore we get
This is true for arbitrary M , and for symmetric M , it simplifies to (C.1.7). The
formula for symmetric M is all we need, since a quadratic form with an unsymmetric
M is identical to that with the symmetric (M + M > )/2.
C.1. FIRST DERIVATIVES 1589
(C.1.10) A dx = dy ,
i.e.,
(C.1.11) ∂ y ∂ x dx = dy
and (C.1.8) is
x x
(C.1.13) ∂ M ∂ x = M + M .
x x
In (C.1.6) and (C.1.7), we took the derivatives of scalars with respect to vectors.
The simplest example of a derivative of a vector with respect to a vector is a linear
function. This gives the most basic matrix differentiation rule: If y = Ax is a linear
vector function, then its derivative is that same linear vector function:
(C.1.14) ∂Ax/∂x> = A,
or in tiles
(C.1.15) ∂ A x ∂ x = A
In tiles it reads
m
(C.1.17) ∂ A X ∂ X = A .
P
Answer. tr(AX) = i,j
aij xji i.e., the coefficient of xji is aij .
A
. A
(C.1.18) ∂ X ∂ X =
B
B
Equations (C.1.17) and (C.1.18) can be obtained from (C.1.12) and (C.1.15) by
extended substitution, since a bundle of several arms can always be considered as
one arm. For instance, (C.1.17) can be written
∂ A X ∂ X = A
and this is a special case of (C.1.12), since the two parallel arms can be treated as
one arm. With a better development of the logic underlying this notation, it will not
be necessary to formulate them as separate theorems; all matrix differentiation rules
given so far are trivial applications of (C.1.15).
C.1. FIRST DERIVATIVES 1593
∂x> Ay
Problem 619. As a special case of (C.1.18) show that ∂A>
= yx> .
Answer.
x
. x
(C.1.19) ∂ A ∂ A =
y
y
x
(C.1.20) y = A
x
1594 C. MATRIX DIFFERENTIATION
x . x
(C.1.21) ∂ A ∂ x = A + A
x x
Proof. yi = j,k aijk xj xk . For a given i, this has x2p in the term aipp x2p , and
P
it has xp in the terms aipk xp xk where P p 6= k, and in Paijp xj xp where j 6= p. The
derivatives of these terms are 2aipp xp + k6=p aipk xk + j6=p aijp xj , which simplifies
P P
to k aipk xk + j aijp xj . This is the i, p-element of the matrix on the rhs of (C.1.21).
sum of the two above: ∂yii /∂xli = ∂x2li /∂xli = 2xli . In tiles this is
i
l
X . X X
∂X > X
(C.1.22) = ∂ ∂ X = + .
∂X > X
m
k
This rule is helpful for differentiating the multivariate Normal likelihood function.
A computer implementation of this tile notation should contain algorithms to
automatically take the derivatives of these array concatenations.
Here are some more matrix differentiation rules:
Chain rule: If g = g(η) and η = η(β) are two vector functions, then
If A is nonsingular then
∂ log det A
(C.1.24) = A−1
∂A>
Proof in [Gre97, pp. 52/3].
Bibliography
1597
1598 BIBLIOGRAPHY
[BCW96] Richard A. Becker, John M. Chambers, and Allan R. Wilks. The New S Language: A
Programming Environment for Data Analysis and Graphics. Chapman and Hall, 1996.
Reprint of the 1988 Wadsworth edition. 422, 557, 938
[BD77] Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected
Topics. Holden-Day, San Francisco, 1977. 214, 398, 404, 434
[Ber91] Ernst R. Berndt. The Practice of Econometrics: Classic and Contemporary. Addison-
Wesley, Reading, Massachusetts, 1991. 580, 595, 596
[BF85] Leo Breiman and Jerome H. Friedman. Estimating optimal transformations for multiple
regression and correlation. JASA, 80(391):580–619, 1985. 1020, 1023, 1025, 1026
[BF91] F. A. G. den Butter and M. M. G. Fase. Seasonal Adjustment as a Practical Problem.
North-Holland, 1991. 1472
[Bha78] Roy Bhaskar. A Realist Theory of Science. Harvester Wheatsheaf, London and New
York, second edition, 1978. xxiv
[Bha93] Roy Bhaskar. Dialectic: The Pulse of Freedom. Verso, London, New York, 1993. xxiv
[BJ76] George E. P. Box and Gwilym M. Jenkins. Time Series Analysis: Forecasting and
Control. Holden-Day, San Francisco, revised edition, 1976. 1441
[BKW80] David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics: Identifying
Influential Data and Sources of Collinearity. Wiley, New York, 1980. 799, 808, 819,
821, 824
[Bla73] R. C. Blattberg. Evaluation of the power of the Durbin-Watson statistic for non-first
order serial correlation alternatives. Review of Economics and Statistics, 55:508–515,
August 1973. 1281
1600 BIBLIOGRAPHY
[BLR99] Luc Bauwens, Michel Lubrano, and Jean-Fran¸ois Richard. Bayesian Inference
in Dynamic Econometric Models. Oxford University Press, 1999. Data sets at
http://www.core.ucl.ac.be/econometrics/index.htm. 1453
[BM78] Charles M. Beach and James G. MacKinnon. A maximum likelihood procedure for
regression with autocorrelated errors. Econometrica, 46(1):51–58, 1978. 1269, 1276
[BQ89] Olivier Jean Blanchard and Danny Quah. The dynamic effect of demand and supply
disturbances. American Economic Review, 79:655–673, 1989. 1456
[Bra68] James Vandiver Bradley. Distribution-Free Statistical Tests. Prentice-Hall, Englewood
Cliffs, N.J., 1968. 440
[BT99] Kaye E. Basford and J. W. Tukey. Graphical Analysis of Multiresponse Data: Illus-
trated with a Plant Breeding Trial. Interdisciplinary Statistics. Chapman & Hall/CRC,
Boca Raton, Fla., 1999. 834
[Buj90] Andreas Buja. Remarks on functional canonical variates, alternating least squares
methods and ACE. The Annals of Statistics, 18(3):1032–1069, 1990. 1020, 1021
[Bur98] Patrick J. Burns. S poetry. www.seanet.com/~pburns/Spoetry, 1998. 557
[Cam89] Mike Camden. The Data Bundle. New Zealand Statistical Association, Wellington,
New Zealand, 1989. 835
[CB97] Dianne Cook and Andreas Buja. Manual controls for high-dimensional data projections.
Journal of Computational and Graphical Statistics, 1997. 840
[CBCH97] Dianne Cook, Andreas Buja, J. Cabrera, and H. Hurley. Grand tour and projection
pursuit. Journal of Computational and Graphical Statistics, 2(3):225–250, 1997. 840
[CCCM81] M. Cameron, K. D. Collerson, W. Compston, and R. Morton. The statistical anal-
ysis and interpretation of imperfectly-fitted Rb-Sr isochrons from polymetamorphic
terrains. Geochimica et Geophysica Acta, 45:1087–1097, 1981. 1165
BIBLIOGRAPHY 1601
[CD28] Charles W. Cobb and Paul H. Douglas. A theory of production. American Economic
Review, 18(1, Suppl.):139–165, 1928. J. 565, 567
[CD97] Wojciech W. Charemza and Derek F. Deadman. New Directions in Econometric Prac-
tice: General to Specific Modelling, Cointegration, and Vector Autoregression. Edward
Elgar, Cheltenham, UK; Lynne, NH, 2nd ed. edition, 1997. 481, 1464
[CH93] John M. Chambers and Trevor Hastie, editors. Statistical Models in S. Chapman and
Hall, 1993. 557, 594, 821
[Cha96] B. G. Charlton. Should epidemiologists be pragmatists, biostatisticians, or clinical sci-
entists? Epidemiology, 7(5):552–4, 1996. 447
[Cho60] G. C. Chow. Tests of equality between sets of coefficients in two linear regressions.
Econometrica, 28:591–605, July 1960. 954
[Chr87] Ronald Christensen. Plane Answers to Complex Questions; The Theory of Linear
Models. Springer-Verlag, New York, 1987. 679, 703, 942, 1240
[Coh50] A. C. Cohen. Estimating the mean and variance of normal populations from singly and
doubly truncated samples. Annals of Mathematical Statistics, pages 557–569, 1950.
169
[Col89] Andrew Collier. Scientific Realism and Socialist Thought. Harvester Wheatsheaf and
Lynne Rienner, Hertfordshire, U.K. and Boulder, Colorado, 1989. 1469
[Coo77] R. Dennis Cook. Detection of influential observations in linear regression. Technomet-
rics, 19(1):15–18, February 1977. 824, 825
[Coo98] R. Dennis Cook. Regression Graphics: Ideas for Studying Regressions through Graph-
ics. Series in Probability and Statistics. Wiley, New York, 1998. 223, 715, 835, 836,
837, 841, 843
1602 BIBLIOGRAPHY
[Cor69] J. Cornfield. The Bayesian outlook and its applications. Biometrics, 25:617–657, 1969.
406
[Cow77] Frank Alan Cowell. Measuring Inequality: Techniques for the Social Sciences. Wiley,
New York, 1977. 174, 1044
[CP77] Samprit Chatterjee and Bertram Price. Regression Analysis by Example. Wiley, New
York, 1977. 1134
[CR88] Raymond J. Carroll and David Ruppert. Transformation and Weighting in Regression.
Chapman and Hall, London and New York, 1988. 1230
[Cra43] A. T. Craig. Note on the independence of certain quadratic forms. Annals of Mathe-
matical Statistics, 14:195, 1943. 289
[Cra83] J. G. Cragg. More efficient estimation in the presence of heteroskedasticity of unknown
form. Econometrica, 51:751–63, 1983. 1296
[Cra91] Jan Salomon Cramer. An Introduction of the Logit Model for Economists. Edward
Arnold, London, 1991. 1500
[CT91] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Series in
Telecommunications. Wiley, New York, 1991. 104, 106
[CW99] R. Dennis Cook and Sanford Weisberg. Applied Regression Including Computing and
Graphics. Wiley, 1999. 714, 715, 772, 841, 845
[Dav75] P. J. Davis. Interpolation and Approximation. Dover Publications, New York, 1975.
1004
[Daw79a] A. P. Dawid. Conditional independence in statistical theory. JRSS(B), 41(1):1–31, 1979.
51, 55
[Daw79b] A. P. Dawid. Some misleading arguments involving conditional independence. JRSS(B),
41(2):249–252, 1979. 55
BIBLIOGRAPHY 1603
[DP20] R. E. Day and W. M. Persons. An index of the physical volume of production. Review
of Economic Statistsics, II:309–37, 361–67, 1920. 565
[DW50] J. Durbin and G. Watson. Testing for serial correlation in least squares regression—I.
Biometrika, 37:409–428, 1950. 1277
[DW51] J. Durbin and G. Watson. Testing for serial correlation in least squares regression—II.
Biometrika, 38:159–178, 1951. 1277
[DW71] J. Durbin and G. Watson. Testing for serial correlation in least squares regression—III.
Biometrika, 58:1–42, 1971. 1277, 1279
[Efr82] Bradley Efron. The Jackknife, the Bootstrap, and Other Resampling Plans. SIAM
(Society for Industrial and Applies Mathematics), Philadelphia, PA, 1982. 1301
[Ell95] Rebecca J. Elliott. Learning SAS in the Computer Lab. Cuxbury Press, Belmont, Cal-
ifornia, 1995. 545
[End95] Walter Enders. Applied Econometric Time Series. Wiley, New York, 1995. 1285, 1437,
1445, 1447, 1460, 1461, 1463
[ESS97] Klaus Edel, Karl August Schäffer, and Winfried Stier, editors. Analyse saisonaler
Zeitreihen. Number 134 in Wirtschaftwissenschaftliche Beiträge. Physica-Verlag, Hei-
delberg, 1997. 1473, 1474
[ET93] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman
and Hall, 1993. 1302
[Eub88] Randall L. Eubank. Spline Smoothing and Nonparametric Regression. Marcel Dekker,
New York, 1988. 1003, 1004, 1007, 1008
[Eve94] Brian Everitt. A Handbook of Statistical Analyses Using S-Plus. Chapman & Hall,
1994. 557
BIBLIOGRAPHY 1605
[Far80] R. W. Farebrother. The Durbin-Watson test for serial correlation when there is no
intercept in the regression. Econometrica, 48:1553–1563, September 1980. 1279
[Fis] R. A. Fisher. Theory of statistical estimation. Proceedings of the Cambridge Philosoph-
ical Society, 22. 366
[Fri57] Milton Friedman. A Theory of the Consumption Function. Princeton University Press,
1957. 315
[FS81] J. H. Friedman and W. Stuetzle. Projection pursuit regression. JASA, 76:817–23, 1981.
839, 843, 1015
[FS91] Milton Friedman and Anna J. Schwarz. Alternative approaches to analyzing economic
data. American Economic Review, 81(1):39–49, March 1991. 484
[FT74] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data
analysis. IEEE Transactions on Computers, C-23:881–90, 1974. 839, 840
[FW74] George M. Furnival and Jr. Wilson, Robert W. Regression by leaps and bounds. Tech-
nometrics, 16:499–511, 1974. 789
[Gas88] Joseph L. Gastwirth. Statistical Reasoning in Law and Public Policy. Statistical mod-
eling and decision science. Academic Press, Boston, 1988. 458, 462
[GC92] Jean Dickinson Gibbons and S. Chakraborti. Nonparametric Statistical Inference. Mar-
cel Dekker, 3rd edition, 1992. 440, 445
[GG95] Joseph L. Gastwirth and S. W. Greenhouse. Biostatistical concepts and methods in the
legal setting. Statistics in Medicine, 14:1641–53, 1995. 456
[GJM96] Amos Golan, George Judge, and Douglas Miller. Maximum Entropy Econometrics:
Robust Estimation with Limited Data. Wiley, Chichester, England, 1996. 120
[Gra76] Franklin A. Graybill. Theory and Application of the Linear Model. Duxbury Press,
North Sciutate, Mass., 1976. 518
1606 BIBLIOGRAPHY
[JL97] B. D. Javanovic and P. S. Levy. A look at the rule of three. American Statistician,
51(2):137–9, 1997. 447
[JS61] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–
379. University of California Press, Berkeley, 1961. 886
[JW88] Richard A. Johnson and Dean W. Wichern. Applied Multivariate Statistical Analysis.
Prentice Hall, 1988. 983, 987
[KA69] J. Koerts and A. P. J. Abramanse. On the Theory and Application of the General
Linear Model. Rotterdam University Press, Rotterdam, 1969. 509
[Kal82] R. E. Kalman. System identification from noisy data. In A. R. Bednarek and L. Cesari,
editors, Dynamical Systems, volume II, pages 135–164. Academic Press, New York,
1982. 1109, 1122
[Kal83] R. E. Kalman. Identifiability and modeling in econometrics. In P. R. Krisnaiah, editor,
Developments in Statistics, volume 4. Academic Press, New York, 1983. 1109
[Kal84] R. E. Kalman. We can do something about multicollinearity! Communications in
Statistics, Theory and Methods, 13(2):115–125, 1984. 1134
[Kap89] Jagat Narain Kapur. Maximum Entropy Models in Science and Engineering. Wiley,
1989. 116, 166
[KG80] William G. Kennedy and James E. Gentle. Statistical Computing. Dekker, New York,
1980. 1207, 1210, 1212
[Khi57] R. T. Khinchin. Mathematical Foundations of Information Theory. Dover Publications,
New York, 1957. 102
[Kim] Kim. Introduction of Factor Analysis. 1115
1610 BIBLIOGRAPHY
[Kin81] M. King. The Durbin-Watson test for serial correlation: Bounds for regressions with
trend and/or seasonal dummy variables. Econometrica, 49:1571–1581, 1981. 1279
[KM78] Jae-On Kim and Charles W. Mueller. Factor Analysis: Statistical Methods and Prac-
tical Issues. Sage, 1978. 1115
[Kme86] Jan Kmenta. Elements of Econometrics. Macmillan, New York, second edition, 1986.
825, 895, 1069, 1279, 1281
[Knu81] Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Pro-
gramming. Addison-Wesley, second edition, 1981. 10, 123, 129
[Knu98] Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Pro-
gramming. Addison-Wesley, third edition, 1998. 124
[Krz88] W. J. Krzanowski. Principles of Multivariate Analysis: A User’s Persective. Clarendon
Press, Oxford, 1988. 517
[KS79] Sir Maurice Kendall and Alan Stuart. The Advanced Theory of Statistics, volume 2.
Griffin, London, fourth edition, 1979. 347, 366, 1021, 1022
[Ksh19] Anant M. Kshirsagar. Multivariate Analysis. Marcel Dekker, New York and Basel, 19??
289
[Lan69] H. O. Lancaster. The Chi-Squared Distribution. Wiley, 1969. 289
[Lar82] Harold Larson. Introduction to Probability and Statistical Inference. Wiley, 1982. 38,
81, 148, 198, 199, 386
[Law89] Tony Lawson. Realism and instrumentalism in the development of econometrics. Oxford
Economic Papers, 41:236–258, 1989. Reprinted in [dMG89] and [HR97]. xxiv
[Lea75] Edward E. Leamer. A result on the sign of the restricted least squares estimator. Journal
of Econometrics, 3:387–390, 1975. 761
BIBLIOGRAPHY 1611
[Mor65] A. Q. Morton. The authorship of Greek prose (with discussion). Journal of the Royal
Statistical Society, Series A, 128:169–233, 1965. 59, 60
[Mor73] Trenchard More, Jr. Axioms and theorems for a theory of arrays. IBM Journal of
Research and Development, 17(2):135–175, March 1973. 1544, 1557
[Mor02] Jamie Morgan. The global power of orthodox economics. Journal of Critical Realism,
1(2):7–34, May 2002. 472
[MR91] Ieke Moerdijk and Gonzalo E. Reyes. Models for Smooth Infinitesimal Analysis.
Springer-Verlag, New York, 1991. 66
[MS86] Parry Hiram Moon and Domina Eberle Spencer. Theory of Holors; A Generalization
of Tensors. Cambridge University Press, 1986. 1544
[MT98] Allan D. R. McQuarrie and Chih-Ling Tsai. Regression and Time Series Model Selec-
tion. World Scientific, Singapore, 1998. 789
[Mul72] S. A. Mulaik. The Foundations of Factor Analysis. McGraw-Hill, New York, 1972.
1115
[NT92] Sharon-Lise Normand and David Tritchler. Parameter updating in a Bayes network.
JASA Journal of the American Statistical Association, 87(420):1109–1115, December
1992. 1203
[Qua90] D. Quah. Permanent and transitory movements in labor income: An explanation for
‘excess smoothness’ in consumption. Journal of Political Economy, 98:449–475, 1990.
1456
[Rao52] C. Radhakrishna Rao. Some theorems on minimum variance estimation. Sankhyā,
12:27–42, 1952. 689
BIBLIOGRAPHY 1613
[Ron02] Amit Ron. Regression analysis and the philosophy of social science: A critical realist
view. Journal of Critical Realism, 1(1):119–142, November 2002. 472
[Roy97] Richard M. Royall. Statistical evidence: A Likelihood Paradigm. Number 71 in Mono-
graphs on Statistics and Applied Probability. Chapman & Hall, London; New York,
1997. 45, 46, 410, 473
[Ruu00] Paul A. Ruud. An Introduction to Classical Econometric Theory. Oxford University
Press, Oxford and New York, 2000. 1386, 1428
[RZ78] L. S. Robertson and P. L. Zador. Driver education and fatal crash invovlement of
teenage drivers. American Journal of Public Health, 68:959–65, 1978. 86
[SAS85] SAS Institute Inc., Cary, NC. SAS User’s Guide: Statistics, version 5 edition edition,
1985. 1229
[SCB91] Deborah F. Swayne, Dianne Cook, and Andreas Buja. Xgobi: Interactive dynamic
graphics in the X windows system with a link to S. ASA Proceedings of the Section on
Statistical Graphics, pages 1–8, 1991. 839
[Sch59] Henry Scheffé. The Analysis of Variance. Wiley, New York, 1959. 1247
[Sch97] Manfred R. Schroeder. Number Theory in Science and Communication. Number 7 in
Information Sciences. Springer-Verlag, Berlin Heidelberg New York, 3rd edition, 1997.
134
[Scl68] Stanley L. Sclove. Inproved estimators for coefficients in linear regression. Journal of
the American Statistical Association, 63:595–606, 1968. 883
[Seb77] G. A. F. Seber. Linear Regression Analysis. Wiley, New York, 1977. 248, 251, 257,
289, 676, 683, 697, 778, 787, 952, 954, 959, 962, 974, 975, 977, 980, 982
[Sel58] H. C. Selvin. Durkheim’s suicide and problems of empirical research. American Journal
of Sociology, 63:607–619, 1958. 86
BIBLIOGRAPHY 1615
[SG85] John Skilling and S. F. Gull. Algorithms and applications. In C. Ray Smith and
Jr W. T. Grandy, editors, Maximum-Entropy and Bayesian Methods in Inverse Prob-
lems, pages 83–132. D. Reidel, Dordrecht, Boston, Lancaster, 1985. 120
[Shi73] R. Shiller. A distributed lag estimator derived from smoothness priors. Econometrica,
41:775–778, 1973. 1061
[Sim96] Jeffrey S. Simonoff. Smoothing Methods in Statistics. Springer Series in Statistics.
Springer, New York, 1996. 1031, 1032
[SM86] Hans Schneeweiß and Hans-Joachim Mittag. Lineare Modelle mit fehlerbehafteten
Daten. Physica Verlag, Heidelberg, Wien, 1986. 1101, 1154, 1521
[Spe94] Phil Spector. An Introduction to S and S-Plus. Duxbury Press, Belmont, California,
1994. 557
[Spr98] Peter Sprent. Data Driven Statistical Methods. Texts in statistical science. Chapman
& Hall, London; New York, 1st ed. edition, 1998. 87, 440, 456, 460, 462
[SS35] J. A. Schouten and Dirk J. Struik. Einführung in die neuren Methoden der Differen-
tialgeometrie, volume I. 1935. 1544
[Sta95] William. Stallings. Protect your Privacy: The PGP User’s Guide. Prentice Hall, En-
glewood Cliffs, N.J., 1995. 134
[Sta99] William Stallings. Cryptography and Network Security: Principles and Practice. Pren-
tice Hall, Upper Saddle River, N.J., 2nd edition, 1999. 136
[Ste56] Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, volume 1, pages 197–206. University of California Press,
Berkeley, 1956. 886
1616 BIBLIOGRAPHY
[SW76] Thomes J. Sargent and Neil Wallace. Rational expectations and the theory of economic
policy. Journal of Monetary Economics, 2:169–183, 1976. 322
[Sze59] G. Szegö. Orthogonal Polynomials. Number 23 in AMS Colloquium Publications. Amer-
ican Mathematical Society, 1959. 1004
[The71] Henri Theil. Principles of Econometrics. Wiley, New York, 1971. 1392, 1566
[Thi88] Ronald A. Thisted. Elements of Statistical Computing. Chapman and Hall, 1988. 1207,
1226
[Tib88] Robert Tibshirani. Estimating transformations for regression via additivity and vari-
ance stabilization. JASA, 83(402):394–405, 1988. 1027
[Tin51] J. Tinbergen. Econometrics. George Allen & Unwin Ltd., London, 1951. 482
[TS61] Henri Theil and A. Schweitzer. The best quadratic estimator of the residual variance
in regression analysis. Statistica Neerlandica, 15:19–23, 1961. 343, 668, 677
[Vas76] D. Vasicek. A test for normality based on sample entropy. JRSS (B), 38:54–9, 1976.
1042
[VdBDL83] E. Van der Burg and J. De Leeuw. Nonlinear canonical correlation. British J. Math.
Statist. Psychol., 36:54–80, 1983. 1020
[VR99] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S-Plus.
Springer-Verlag, New York, third edition, 1999. 557
[VU81] Hrishikesh D. Vinod and Aman Ullah. Recent Advances in Regression Methods. Dekker,
New York, 1981. 879
[Wah90] Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 1990. 1006
[Wal72] Kenneth Wallis. Testing for fourth order autocorrelation in quarterly regression equa-
tions. Econometrica, 40:617–36, 1972. 1277
BIBLIOGRAPHY 1617
[WG68] M. B. Wilk and R. Gnanadesikan. Probability plotting methods for the analysis of
data. Biometrica, 55:1–17, 1968. 1042
[WH82] B. A. Wichmann and I. D. Hill. Algorithm AS 183: An efficient and portable pseudo-
random number generator. Applied Statistics, 31:188–190, 1982. Correction in Wich-
mann/Hill:BS183. See also [?] and [Zei86]. 128
[WH97] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer-
Verlag, second edition, 1997. 1169, 1181, 1189, 1190, 1193, 1194, 1200
[Whi93] K. White. SHAZAM, Version 7. Department of Economics, University of British
Columbia, Vancouver, 1993. 1279
[Wit85] Uli Wittman. Das Konzept rationaler Preiserwartungen, volume 241 of Lecture Notes
in Economics and Mathematical Systems. Springer, 1985. 234
[WJ95] M. P. Wand and M. C. Jones. Kernel Smoothing, volume 60 of Monographs on Statistics
and Applied Probability. Chapman & Hall, London; New York, 1st ed. edition, 1995.
1031, 1036
[WM83] Howard Wainer and Samuel Messick, editors. Principals of Modern Psychologial Mea-
surement: A Festschrift for Frederic M. Lord. Associates,, Hillsdale, N.J.: L. Erlbaum,
1983. 473
[Woo72] L. A. Wood. Modulus of natural rubber cross-linked by dicumyl peroxide. i. experi-
mental observations. J Res. Nat. Bur. Stand, 76A:51–59, 1972. 843
[WW79] Ronald J. Wonnacott and Thomas H. Wonnacott. Econometrics. Wiley, New York,
second edition, 1979. 999, 1402
[Yul07] G. V. Yule. On the theory of correlation for any number of variables treated by a new
system of notation. Proc. Roy. Soc. London A, 79:182, 1907. 519
1618 BIBLIOGRAPHY
[Zei86] H. Zeisel. A remark on algorithm AS 183. an efficient and portable random number
generator. Applied Statistics, 35(1):89, 1986. 129, 1615
[Zel62] Arnold Zellner. An efficient method of estimating seemingly unrelated regressions
and tests for aggregation bias. Journal of the American Statistical Association,
57(298):348–368, 1962. 1386
[ZG70] Arnold Zellner and M. Geisel. Analysis of distributed lag models with application to
the consumption function. Econometrica, 38:865–888, 1970. 1069
[Zim95] Philip R. Zimmermann. The Official PGP User’s Guide. MIT Press, Cambridge, Mass.,
1995. 134