PCML Notes
Matthias Seeger
Probabilistic Machine Learning Laboratory
Ecole Polytechnique Fédérale de Lausanne
INR 112, Station 14, CH-1015 Lausanne
matthias.seeger@epfl.ch
Contents

1 Introduction
1.1 Learning Outcomes
1.2 How To Read These Notes
1.3 Mathematical Preliminaries
1.4 Recommended Machine Learning Textbooks
1.5 Thanks

2 Linear Classification
2.1 A First Example
2.1.1 Techniques: Vectors, Inner Products, Norms
2.2 Hyperplanes and Feature Spaces
2.3 Perceptron Classifiers
2.3.1 The Perceptron Convergence Theorem
2.3.2 Normalization of Feature Vectors
2.3.3 The Margin of a Dataset (*)
2.4 Error Function Minimization. Gradient Descent
2.4.1 Gradient Descent
2.4.2 Online and Batch Learning. Perceptron Algorithm as Gradient Descent
2.4.3 Techniques: Matrices and Vectors. Outer Product
Chapter 1

Introduction
In this chapter, some general comments are given on the subject of these notes, how to read them, what mathematical preliminaries are required, as well as some recommendations for further reading.

1.1 Learning Outcomes

After working through these notes, you should know:
• What the basic machine learning methods and techniques are, at least
when it comes to supervised machine learning (pattern classification, curve
fitting).
• How to apply these methods to real-world problems and datasets.
• Why/when these methods work, and why/when they do not.
• Relationships between methods, opportunities for recombinations. We will
learn about the basic “red threads” through this rapidly growing field.
• How to derive and safely implement machine learning methodology by
building on underlying standard technology.
1.3 Mathematical Preliminaries
• For linear algebra, the author’s favourite is the book by Strang [42]. Make
sure to check out the accompanying videos.
• For calculus, the book by Simmons [39] seems to be widely acclaimed for
its intuitive approach and focus on applications to real-world problems.
1.5 Thanks
The author would like to thank Hesam Setareh for creating many of the figures.
Chapter 2
Linear Classification
Machine learning is a big field, but let’s not be timid and jump right into the
middle. Consider the problem of classifying hand-written digits (Figure 2.1).
The US postal service (USPS) decides to commission a system to automate
the presorting of envelopes according to ZIP codes, and your company wins the
contract. What does such a system look like? Let us ignore hardware issues: a
scanned image of the envelope comes in, a ZIP code prediction comes out. The
first step is preprocessing: detecting the writing on the envelope, detecting the
ZIP code, segmenting out the separate digits, normalizing them by correcting for
simple transformations such as translation, rotation and scale, and quantizing
them onto a standard grid of equal size. Preprocessing is of central importance
for any real-world machine learning system. However, since preprocessing tends
to be domain-dependent, we will not discuss it any further during this course,
whose objective is to cover machine learning principles of wide or general appli-
cability.
Datasets of preprocessed handwritten digits are publicly available; the best known are MNIST (http://yann.lecun.com/exdb/mnist/) and USPS (http://www.gaussianprocess.org/gpml/data/). In the middle ages of machine learning, these datasets played a very important role, providing a unified testbed for all sorts of innovative ideas and algorithms, publicly available to anyone.

Figure 2.1: Patterns from the training dataset of the MNIST handwritten digits database. Shown are ten patterns per class, drawn at random.
What does the MNIST dataset look like? For much of this course, a dataset is
an unordered set of instances or cases. Each case decomposes into attributes in
the same way. MNIST is a classification dataset (more specifically, multi-way
classification): each case consists of an input point (or pattern) and a target
(or class label). We can formalize this by introducing variables x for the input
point attribute, t for the target, and (x, t) for the case. A dataset like MNIST
is simply a set of variable instances:
$$D = \{(x_i, t_i) \mid i = 1, \dots, n\}.$$
In other words, the dataset has n instances (or data points), and the i-th instance
is (x_i, t_i). Each attribute has a data type and comes from a defined value range.
For MNIST, t ∈ {0, . . . , 9}, the ten digits. The input point x is a 28 × 28
bitmap, each pixel quantized to {0, . . . , 255} (0 for no, 255 for full intensity). It
is common practice to represent bitmaps as vectors, stacking columns on top of
each other (from left to right):
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}, \qquad x_i = \begin{bmatrix} x_{i,1} \\ x_{i,2} \\ \vdots \\ x_{i,d} \end{bmatrix}.$$
Refresh your memory on vectors in Section 2.1.1. Given the MNIST specifica-
tion, the attribute space for x should be {0, . . . , 255}d , where d = 28 · 28 = 784.
Binary Classification
Unfortunately, the lookup table approach will never work. While the input domain is finite, it is extremely large (exponential in the dimensionality d), and any conceivable dataset would only ever cover a vanishing fraction of it. Essentially, f_lut(x_*) = "don't know" all the time. For people who like lookup tables and similar things, this is known as the "curse of dimensionality".
Learning starts with the insight that most attributes of the real world we care about are robust to some changes. If x represents an 8 written by hand, none of the following modifications will in general turn it into a 9: modifying a few components x_j by some small amount; translating the bitmap slightly; rotating the bitmap a bit. It is therefore a reasonable assumption on how the world works that, with respect to a sensible distance function between bitmaps, if x belongs to a class (say, 8), any x′ close to x in this distance is highly likely to belong to the same class. As x, x′ are vectors, let us use the standard Euclidean distance:
$$\|x - x'\| = \sqrt{(x - x')^T (x - x')} = \sqrt{\sum_{j=1}^d (x_j - x'_j)^2}.$$
$$f_{\text{NN}}(x) = t_i \;\Leftrightarrow\; \|x - x_i\| \le \|x - x_j\|, \quad j = 1, \dots, n.$$
This is the nearest neighbour (NN) classification rule. It is one of the oldest
machine learning algorithms, and variants of it are widely used in practice.
Moreover, the theoretical analysis of this rule, initiated by Cover and Hart, is a
cornerstone of learning theory. Some details about nearest neighbour methods
can be found in [5, Sect. 2.5.2], its theoretical analysis (with original citations)
is detailed in [11]. Despite the fact that variants of NN work well on our digits
problem (whether binary or multi-way), there are some substantial drawbacks:
• The whole dataset has to be kept around, as each single case (xi , ti ) po-
tentially takes part in a classification decision. For large datasets, this is
a costly liability.
• There is not much to be learned about the data from NN. In fact, some machine learning work aims to replace the Euclidean distance by one adapted to the job, but compared to model-based mechanisms for specifying domain knowledge, these possibilities are limited.
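To make the NN rule concrete, here is a minimal sketch in Python/NumPy (the names `nn_classify`, `X_train`, `t_train` are ours, for illustration only; no attempt is made at efficiency):

```python
import numpy as np

def nn_classify(x_star, X_train, t_train):
    """1-NN rule: return the label of the training point closest to x_star."""
    # Squared Euclidean distances to all n training points.
    dists = np.sum((X_train - x_star) ** 2, axis=1)
    # argmin breaks ties deterministically; the text breaks them at random.
    return t_train[np.argmin(dists)]
```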
In order to avoid having to store our whole MNIST dataset, we have to represent
the information relevant for classification in a far smaller number of parameters.
For example, we could try to find m ≪ n (m "much smaller than" n) prototype vectors w_c, c = 1, . . . , m, each associated with a target value t^w_c, then do NN classification w.r.t. the prototype set {(w_c, t^w_c) | c = 1, . . . , m} rather than the full dataset:
$$f^{(m)}(x) = f_{\text{NN}}\big(x;\, \{(w_c, t^w_c)\}\big).$$
Here, learning refers to the automatic choice of the parameters {(w_c, t^w_c)} from the data {(x_i, t_i)}, or to the modification of the parameters as new data comes
in. Having learned our classifier (i.e., fixed its parameters), we predict the label of
a pattern x∗ by NN on the prototype vectors. How can we learn automatically?
How to choose m (number of prototypes) given n (size of dataset)? If n increases,
as new data comes in, do we need to increase m or can we keep it fixed? These
are typical machine learning questions.
Taking this idea to the limit, let us pick exactly m = 2 prototype vectors w_{−1}, w_{+1}, one for each class, with t^w_c = c, c = −1, +1. The classification rule f^{(2)}(x_*) for a pattern x_* is simple: if ‖x_* − w_{+1}‖ < ‖x_* − w_{−1}‖, output +1, otherwise output −1.¹
1 If there are ties, we pick one of the nearest neighbour labels at random.
Let us rewrite this rule:
$$\begin{aligned}
\|x_* - w_{+1}\| < \|x_* - w_{-1}\| &\Leftrightarrow \|x_* - w_{+1}\|^2 < \|x_* - w_{-1}\|^2 \\
&\Leftrightarrow \|x_*\|^2 - 2 w_{+1}^T x_* + \|w_{+1}\|^2 < \|x_*\|^2 - 2 w_{-1}^T x_* + \|w_{-1}\|^2 \\
&\Leftrightarrow w_{+1}^T x_* - \|w_{+1}\|^2/2 > w_{-1}^T x_* - \|w_{-1}\|^2/2 \\
&\Leftrightarrow (w_{+1} - w_{-1})^T x_* + \tfrac{1}{2}\big(\|w_{-1}\|^2 - \|w_{+1}\|^2\big) > 0.
\end{aligned}$$
Problems with expanding ‖x_* − w_{+1}‖²? Have a look at Section 2.1.1. If we set w = w_{+1} − w_{−1}, b = (‖w_{−1}‖² − ‖w_{+1}‖²)/2, this is simply a linear inequality:
$$f^{(2)}(x_*) = \begin{cases} +1 & \text{if } w^T x_* + b > 0 \\ -1 & \text{otherwise} \end{cases} \;=\; \operatorname{sgn}\big(w^T x_* + b\big).$$
The sign function sgn(a) takes the value +1 for a > 0, −1 for a < 0. The definition² of sgn(0) typically does not matter; let us define sgn(0) = 0. We have come some way: from lookup tables via nearest neighbours all the way to linear classifiers.
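The conversion just derived is easy to check numerically. A sketch (the prototype vectors `w_plus`, `w_minus` are placeholders):

```python
import numpy as np

def prototype_to_linear(w_plus, w_minus):
    """Convert the two-prototype NN rule into the equivalent linear classifier."""
    w = w_plus - w_minus
    b = (w_minus @ w_minus - w_plus @ w_plus) / 2.0
    return lambda x: np.sign(w @ x + b)   # +1 iff x is closer to w_plus
```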
2.1.1 Techniques: Vectors, Inner Products, Norms

Note that we use columns, not rows. The set of all such columns a is denoted
by Ad : the d-fold direct product. This is a d-dimensional vector space if A is
a field. In this course, A = R, the real numbers. We will denote vectors a in
bold face, to distinguish them from scalars. In other books, you might encounter
notations like ~a or a, or simply a (no distinction between scalars and vectors).
The dimension of the space, d, is the maximum number of linearly independent
vectors it contains (read up on “linearly independent”, “basis” in [42]). You
can add vectors, multiply with scalars, it all works by doing the operations
on the coefficients separately. In fact, the general definition of vector space is
that of a set which is closed under addition and multiplication with scalars
(being “closed” means that if you perform any such operation on vectors, you
end up with another vector, not with something outside of your vector space).
The transpose operator converts a column into a row vector (and vice versa):
aT = [a1 , . . . , ad ] ∈ R1×d . We can identify Rd with Rd×1 : column and row
vectors are special cases of matrices (Section 2.4.3). Some texts use a0 to denote
transposition instead of aT .
² If sgn(w^T x + b) is a classification rule supposed to output 1 or −1 for every input x, we have to break the tie at sgn(0) by some arbitrary rule, for example always outputting +1.
We will often define vectors by giving an expression for their coefficients. For example, a = [f(x_i)]_i (or a = [f(x_i)] if the index i is obvious from the context)
means that the coefficients are defined by ai = f (xi ). A number of special
vectors will be used frequently in these notes. 0 = [0, . . . , 0]T is the vector
of all zeros. 1 = [1, . . . , 1]T is the vector of all ones. Their dimensionality is
typically clear from the context. We also need delta vectors δ k ∈ Rd , defined by
δ k = [I{j=k} ]j (recall that I{j=k} = 1 if j = k, 0 otherwise): the k-th coefficient
of δ k is 1, all others are 0. Again, the dimensionality is typically clear from the
context.
We can also combine two vectors. The most basic operation is the inner product:
$$a^T b = \sum_{j=1}^d a_j b_j.$$
Other notations you might encounter are a · b or (a, b) (or ha|bi if you want to
be a really cool quantum physicist), but not in these notes. There are obvious
properties, such as aT b = bT a (symmetry) or aT (b + c) = aT b + aT c (lin-
earity). By the way, what are ab or a^T b^T? Nothing, such operations are not defined (unless you are in R¹). What is a b^T? That works. More about these in Section 2.4.3.
The geometrical meaning of the inner product will be discussed in Section 2.3.
Here only two points. First, the square root of the inner product of a with itself
is the standard (or Euclidean) distance of a from the origin, its length or its
(Euclidean) norm:
$$\|a\| = \sqrt{a^T a} = \sqrt{\sum_{j=1}^d a_j^2}.$$
We will often use the squared norm ‖a‖² to get rid of the square root. The (Euclidean) distance between a and b is the norm of b − a (or a − b, since ‖−c‖ = ‖c‖), something that is obvious once you draw it. Convince yourself that if ‖a‖ = 0, then a must be the zero vector 0. If ‖a‖ ≠ 0, we can normalize
the vector: a → a/kak. The outcome is a unit vector (vector of length 1). We
will use unit vectors if we really only want to represent a direction. In fact, any
non-zero vector a can be represented by its length kak and its direction a/kak.
Second, sometimes it happens that aT b = 0: the inner product between a and b
is zero, such vectors are called orthogonal (or perpendicular). The angle between
orthogonal vectors is a right one (90 degrees). Note that a set of mutually
orthogonal vectors (“mutually” applied to a set means: “for each pair”) is also
linearly independent, so there can be no more than d such vectors (an example
of such a set is {δ 1 , . . . , δ d }).
Expanding the squared distance gives
$$\|a - b\|^2 = (a - b)^T (a - b) = \|a\|^2 - 2 a^T b + \|b\|^2.$$
First, we use the linearity, then the symmetry of the inner product. This equation relates the squared distance between a and b to the squared distances of a and b from the origin respectively. The term −2a^T b is sometimes called "cross-talk". It vanishes if a and b are orthogonal, so that ‖a − b‖² = ‖a‖² + ‖b‖². This
is simply the Pythagorean theorem from high school (Figure 2.4). I bet it took
more pains back then to understand it! As we move on, you will note that this
innocuous observation is ultimately the basis for orthogonal projection, condi-
tional expectation, least squares estimation, and bias-variance decompositions.
Test your understanding by working out ‖a + b‖² + ‖a − b‖² and ‖a + b‖² − ‖a − b‖². The first is called the parallelogram identity (draw it!). Anything still unclear? Then you should pick your favourite source from Section 1.3.
The set {x | y(x) = 0} is called the decision boundary, literally the region where the decisions made by f(x) switch between the classes. Classifier and discriminant function are equivalent concepts. However, it is usually much simpler to define and work with functions mapping to R than to {−1, +1}.
For the moment, let us accept the following definition, which we will slightly
generalize towards the end of this section. A binary classifier f (x), x ∈ Rd , is
called linear if it is defined in terms of a linear discriminant function y(x):
f (x) = sgn(y(x)), y(x) = wT x + b, w ∈ Rd \ {0}, b ∈ R.
Here, w is called weight vector (or also normal vector), and b is called bias
parameter.
How can we picture a linear discriminant function? Let us start with a linear
equation x2 = ax1 + b of the high school type. While this form suggests that
x1 is the input (or free), x2 the output (or dependent) variable, we can simply
treat (x1 , x2 ) as points in the Euclidean R2 coordinate system. Obviously, this
equation describes a line. a is the slope: for a = 0, the line is horizontal, for
a > 0 it is increasing. b is the offset: we cross the x2 axis at x2 = b, and the line
runs through the origin if and only if b = 0. Let us rewrite this equation:
$$x_2 = a x_1 + b \;\Leftrightarrow\; [a, -1] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + b = 0.$$
The set S_d = {a ∈ R^d | ‖a‖ = 1} is called the unit hypersphere, the set of all unit vectors in R^d. Another "hyperconcept": in R² this is the unit circle (draw it!), and in R³ the unit sphere.
How do we learn a classifier? Given a dataset of examples from the problem of
interest, D = {(xi , ti ) | xi ∈ X , ti ∈ T , i = 1, . . . , n}, we should pick a classifier
which does well on D. For example, if f (x) is a linear classifier with weight
vector w and bias parameter b, we should aim to adjust these parameters so
to achieve few errors on D. This process of fitting f to D is called training. A
dataset used to train a classifier is called training dataset.
Suppose we are to train a linear classifier f(x) = sgn(w^T x + b) on a training dataset D = {(x_i, t_i) | i = 1, . . . , n}. At least within this chapter, our goal will be to classify all training cases correctly: to find a hyperplane for which x_i ∈ H_{t_i} for all i = 1, . . . , n.
Figure 2.7: Left column: Datasets which are not linearly separable in R2 and R3
respectively. Right column: Linearly separable datasets.
Feature Maps
$$\phi(x) = \begin{bmatrix} x_1 \\ \vdots \\ x_d \\ x_1 x_1 \\ x_1 x_2 \\ \vdots \\ x_1 x_d \\ x_2 x_2 \\ \vdots \\ x_d x_d \end{bmatrix} = \begin{bmatrix} [x_j]_j \\ [x_j x_k]_{j \le k} \end{bmatrix} \in \mathbb{R}^{d(d+3)/2}.$$
You should work out as an exercise how the dimensionality d(d + 3)/2 comes about. What does a linear discriminant for this feature map look like? The weight vector w lives in R^{d(d+3)/2}. If we index its components by j for the linear, then jk (j ≤ k) for the quadratic features, then
$$y(x) = w^T \phi(x) + b = \sum_j w_j x_j + \sum_{j \le k} w_{jk} x_j x_k + b.$$
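A sketch of this feature map in Python (the ordering of the quadratic features is one possible convention):

```python
import numpy as np

def quadratic_features(x):
    """phi(x): the d linear features plus all products x_j * x_k with j <= k."""
    d = len(x)
    quad = [x[j] * x[k] for j in range(d) for k in range(j, d)]
    return np.concatenate([x, quad])   # dimension d + d*(d+1)/2 = d*(d+3)/2
```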
• Iterative: The algorithm cycles over the data points in some random or-
dering. Visiting (φi , ti ), it extracts some information, based on which w
may be updated.
• Local updates: Each update depends only on the pattern currently visited,
not on what has been encountered previously.
Figure 2.8: Perceptron algorithm on R² vectors. Blue crosses represent −1, red circles +1 patterns. In each iteration, the update pattern is marked by a black circle. Shown are the signed pattern as well as the old and new weight vector w.
In fact, this is all we need for the perceptron algorithm, which is given in Algo-
rithm 1.
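Algorithm 1 itself is not reproduced in this excerpt; the following Python sketch captures its structure (random sweeps and the maximum sweep count are our own practical choices):

```python
import numpy as np

def perceptron(Phi, t, max_sweeps=1000):
    """Perceptron algorithm. Phi: n x p matrix of feature vectors, t in {-1,+1}^n."""
    n, p = Phi.shape
    w = np.zeros(p)
    for _ in range(max_sweeps):
        mistakes = 0
        for i in np.random.permutation(n):
            if t[i] * (w @ Phi[i]) <= 0:   # pattern i is misclassified
                w = w + t[i] * Phi[i]      # update towards the signed pattern
                mistakes += 1
        if mistakes == 0:                  # a full sweep without mistakes: done
            return w
    return w   # may not have converged (D not separable, or gamma very small)
```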
In practice, we run random sweeps over all data points, terminating if there is
no more mistake during a complete sweep. One subtle point is that we normalize
all feature vectors φ(xi ) before starting the algorithm. Convince yourself that
this cannot do any harm: w is a separating hyperplane for D if and only if it is
one for the set obtained by normalizing each φ(xi ). We will see in Section 2.3.2
that normalization does not only simplify the analysis, but also leads to faster
convergence of the algorithm. A run of the algorithm is shown in Figure 2.8.
While an update decreases the angle between w and the currently visited pattern, these angles can at the same time increase for other patterns not currently visited (see Figure 2.8).
Beware that w_* determines just some separating hyperplane, not necessarily the one which is output by the algorithm. w_* has to exist because D is separable; it is needed only in order to define γ, and thus to bound the number of updates. γ is a magic number; we come back to it shortly.
Recall that w_* and all φ̃_i are unit vectors. Denote by w_k the weight vector after the k-th update of the algorithm, with w_0 = 0. w_k is not a unit vector; its norm will grow during the course of the algorithm. The proof is geometrically intuitive. We will monitor two numbers, the inner product w_*^T w_k and the length ‖w_k‖. Why these two? The larger w_*^T w_k the better, because we want to decrease the angle between w_* and w_k. However, the inner product could simply grow by w_k becoming longer, so we need to take care of its length ‖w_k‖ (while ‖w_*‖ = 1). Now, both numbers may increase with k, but the first does so much faster than the second. By Cauchy-Schwarz (or the definition of angle), w_*^T w_k ≤ ‖w_k‖, so this process cannot go on forever. Let us zoom into the update w_k → w_{k+1}, which happens (say) on pattern (φ̃_i, t_i). First,
$$\|w_{k+1}\|^2 = \|w_k + t_i \tilde{\phi}_i\|^2 = \|w_k\|^2 + \underbrace{2 t_i w_k^T \tilde{\phi}_i}_{\le 0 \text{ (mistake!)}} + \underbrace{\|t_i \tilde{\phi}_i\|^2}_{=1} \le \|w_k\|^2 + 1.$$
Second,
$$w_*^T w_{k+1} = w_*^T \big(w_k + t_i \tilde{\phi}_i\big) \ge w_*^T w_k + \gamma,$$
since t_i w_*^T φ̃_i ≥ γ by the definition of γ. After M updates we therefore have w_*^T w_M ≥ M γ, while ‖w_M‖ ≤ √M, so that Cauchy-Schwarz forces M γ ≤ √M. What does this mean? γ is a fixed number, and we must have M ≤ 1/γ², where M is the number of updates to the weight vector. Since the perceptron algorithm updates on misclassified patterns only, this situation arises at most 1/γ² times. After that, all patterns are classified correctly, and the algorithm stops. This concludes the proof.
The perceptron algorithm is probably the simplest procedure for linear classifi-
cation one can think of. Nevertheless, it provably works on every single linearly
separable dataset. However, there is always room for improvement. What happens if D is not linearly separable? In such a case, the perceptron algorithm runs forever. Moreover, if γ is very small but positive, it may run for many iterations. It is also difficult to generalize the perceptron algorithm to multi-way classification problems. Nevertheless, its simplicity and computational efficiency render the perceptron algorithm an attractive choice in practice.
2.3.2 Normalization of Feature Vectors

For unnormalized feature vectors, define
$$\gamma = \min_{i=1,\dots,n} \frac{t_i w_*^T \phi_i}{\|\phi_i\|}, \qquad \tilde{\gamma} = \min_{i=1,\dots,n} t_i w_*^T \phi_i, \qquad P = \max_{i=1,\dots,n} \|\phi_i\|.$$
The first step remains unchanged: w_*^T w_{k+1} ≥ w_*^T w_k + γ̃. For the second step, ‖w_{k+1}‖² ≤ ‖w_k‖² + P². Proceeding as above:
$$w_*^T w_k \le \|w_k\| \;\Rightarrow\; M \tilde{\gamma} \le \sqrt{M}\, P \;\Leftrightarrow\; M \le (P/\tilde{\gamma})^2.$$
Now,
$$\tilde{\gamma}/P = \min_{i=1,\dots,n} t_i w_*^T (\phi_i / P) \le \min_{i=1,\dots,n} t_i w_*^T (\phi_i / \|\phi_i\|) = \gamma,$$
using that P ≥ kφi k and ti wT∗ φi > 0 for all i. This means that the upper bound
on the number of updates in the perceptron algorithm never increases (and typ-
ically decreases) if we normalize the feature vectors. The message is: normalize
your feature vectors before running the perceptron algorithm. You should try it
out for yourself, seeing is believing. Run the algorithm on normalized vectors,
then rescale some of them and compare the number of updates the algorithm
needs until convergence.
2.3.3 The Margin of a Dataset (*)

This section can be skipped at a first reading. We will require the margin concept again in Chapter 9, when we introduce large margin classification and support vector machines.
Recall the magic number γ from Theorem 2.1. Intuitively, γ quantifies the difficulty of finding a separating hyperplane (the smaller, the more difficult). For general vectors, we have
$$\gamma_D(w) = \min_{i=1,\dots,n} \frac{t_i w^T \phi(x_i)}{\|w\| \|\phi(x_i)\|}.$$
Note that w is a separating hyperplane for D if and only if γ_D(w) > 0. Aiming for the best possible separation, we define the margin of D as
$$\gamma_D = \max_{w \in \mathbb{R}^p \setminus \{0\}} \gamma_D(w) = \max_{w \in \mathbb{R}^p \setminus \{0\}}\; \min_{i=1,\dots,n} \frac{t_i w^T \phi(x_i)}{\|w\| \|\phi(x_i)\|}.$$
³ The maximum is attained for some w. To see this, note that we may restrict the search to the unit hypersphere w ∈ S_d, which is compact, and the argument is continuous in w.
⁴ In the context of this statement, the signed angle between two hyperplanes is the angle between their weight (normal) vectors.
Error Functions
Recall that our problem is to find a linear classifier f (x) = sgn(y(x)), y(x) =
wT φ(x), which separates handwritten 8s from 9s. To this end, we use the
MNIST subset D = {(xi , ti ) | i = 1, . . . , n} as training dataset: if f (x) commits
few errors on D, it should work well for other patterns. Let us formalize this idea
in an optimization problem, by setting up an error function E(w) of the weight
vector, so that small E(w) translates to good performance on the dataset. We
could minimize the number of errors:
$$E_{0/1}(w) = \sum_{i=1}^n I_{\{f(x_i) \ne t_i\}} = \sum_{i=1}^n I_{\{t_i w^T \phi(x_i) \le 0\}}.$$
Recall that I{A} = 1 if A is true, I{A} = 0 otherwise. While this error func-
tion formalizes our objective well, its minimization is computationally6 difficult.
E0/1 (w) is piecewise constant in w, so that local information such as the gradi-
ent cannot be used. For a better choice, we turn to the oldest and most profound
concept of scientific computing, the one with which Gauss and Legendre started
it all: least squares estimation. Consider the squared error function
$$E(w) = \frac{1}{2} \sum_{i=1}^n (y(x_i) - t_i)^2 = \frac{1}{2} \sum_{i=1}^n \big(w^T \phi(x_i) - t_i\big)^2 = \frac{1}{2} \sum_{i=1}^n \Big( \sum_{j=1}^d w_j \phi_j(x_i) - t_i \Big)^2.$$
The significance of the prefactor 1/2 will become clear below. This error function is minimized by achieving t_i ≈ w^T φ(x_i) across all training patterns. Whenever (x_i, t_i) is misclassified, i.e. t_i y(x_i) < 0, then (y(x_i) − t_i)² > 1: errors are penalized even more strongly than in E_{0/1}(w). By pushing y(x_i) closer to t_i, the mistake is corrected. All in all, E(w) seems a reasonable criterion to minimize in order to learn a classifier on D.
The major advantage of E(w) is that its minimization is computationally
tractable and well understood in terms of properties and algorithms. Once
more, geometry helps our understanding. Build the vector of targets t = [ti ] ∈
5 As we proceed with this course, we will see that it is even more useful to think about
models of the domain of interest. These give rise, in an almost automatical fashion, to the
problems we should really solve, and then we can be clever about algorithms. But for now, let
us stick with the problems.
⁶ The problem min_w E_{0/1}(w) is NP-hard to solve in general.
$$\Phi = \begin{bmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{bmatrix} = \begin{bmatrix} \phi_1(x_1) & \dots & \phi_p(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_n) & \dots & \phi_p(x_n) \end{bmatrix} \in \mathbb{R}^{n \times p}.$$
Feeling rusty about matrices? Visit Section 2.4.3. Now, wT φ(xi ) = φ(xi )T w is
simply the i-th element of Φw ∈ Rn . In other words, if we define y = [y(xi )] ∈
Rn , meaning that yi = y(xi ) = wT φ(xi ), then y = Φw, therefore
$$E(w) = \frac{1}{2} \sum_{i=1}^n (y(x_i) - t_i)^2 = \frac{1}{2} \|y - t\|^2 = \frac{1}{2} \|\Phi w - t\|^2.$$
The squared error is simply half of the squared Euclidean distance between the
vector of targets t and the vector of predictions y, which in itself is a linear
transform of w, given by the design matrix Φ. We will frequently use this rep-
resentation of E(w) in the sequel. Converting expressions from familiar sums over i and j into vectors, matrices and squared norms may feel unfamiliar at first, but once you get used to it, you will appreciate the immense value of "vectorization". First, there are no more opportunities to get the indices wrong. Debugging is built in: if you make a mistake, you usually end up with matrix multiplications of incompatible dimensions. More importantly, we can use our geometrical intuition about lengths, orthogonality and projections. Finally, writing E(w) as a squared
about lengths, orthogonality and projections. Finally, writing E(w) as squared
norm immediately leads to powerful methods to solve the minimization prob-
lem minw E(w) with tools from (numerical) linear algebra, as we will see in
Chapter 4.
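For instance, a naive sum-over-cases implementation and its vectorized equivalent (function names are ours):

```python
import numpy as np

def squared_error_loop(w, Phi, t):
    """E(w) as an explicit sum over data cases."""
    return 0.5 * sum((Phi[i] @ w - t[i]) ** 2 for i in range(len(t)))

def squared_error_vec(w, Phi, t):
    """E(w) = 0.5 * ||Phi w - t||^2, vectorized."""
    r = Phi @ w - t        # residual vector y - t
    return 0.5 * (r @ r)
```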
2.4.1 Gradient Descent

In a single input dimension, the descent direction reduces to d = sgn(−E′(w)): the sign of the negative derivative. For w ∈ R^d, we use the multivariate Taylor theorem:
$$E(w + \varepsilon d) = E(w) + \varepsilon\, (\nabla_w E(w))^T d + O(\varepsilon^2).$$
The step size η is typically kept constant for many iterations, which corresponds
to scaling the steepest descent direction by the gradient size k∇w E(w)k. We
take large8 steps as long as much progress can be made, but slow down when
the terrain gets flatter. If ∇w E(w) = 0, we have reached a stationary point
of E(w), and gradient descent terminates. Whether or not this means that we
8 This does not always work well, and we will discuss improvements shortly.
2.4 Error Function Minimization. Gradient Descent 27
have found a minimum point of E(w) (a local or even a global one), depends
on characteristics of E(w) [2]. Only so much: for the squared error E(w) over
linear functions, we will have found a global minimum point in this case (see
Section 4.2).
Will this method converge (and how fast)? How do I choose the step size η? Shall
I modify η as I go? Answers to these questions depend ultimately on properties of
E(w). For the squared error, they are well understood. Many results have been
obtained in mathematical optimization and machine learning. A clear account
is given in [2].
What is the gradient for the squared error E(w)? Let us move in steps. First,
$$\frac{\partial E}{\partial y_i} = \frac{1}{2} \frac{\partial}{\partial y_i} (y_i - t_i)^2 = y_i - t_i \;\Rightarrow\; \nabla_y E = y - t = \Phi w - t.$$
Then, by the chain rule (recall y = Φw),
$$\nabla_w E = \Phi^T \nabla_y E = \Phi^T (\Phi w - t).$$
As more complicated settings lie ahead, you should train yourself to do deriva-
tives in little chunks, exploiting this fabulous divide and conquer rule. The ex-
pression for ∇w E is something we will see over and over again. y − t = Φw − t
is called residual vector, it is the difference between prediction and true target
for each data case. The gradient is obtained by mapping the residuals back to
the weights, multiplying by ΦT . In a concrete sense developed in Section 4.2,
we project that part of the residual error we can do anything about by changing
w.
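Putting the pieces together, one gradient descent update for the squared error might look as follows (the step size `eta` is a placeholder; its choice is discussed above):

```python
import numpy as np

def gradient_descent_step(w, Phi, t, eta=0.01):
    residual = Phi @ w - t       # nabla_y E = y - t
    grad = Phi.T @ residual      # nabla_w E = Phi^T (Phi w - t)
    return w - eta * grad        # step along the negative gradient
```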
• For the most part, machine learning is about mapping real-world clas-
sification, estimation or decision-making problems to mathematical opti-
mization problems, which can be solved by robust, well understood and
efficient algorithms.
• We separate three things: (a) the real-world problem motivating our ef-
forts, (b) the optimization problem we try to map it to, and (c) the method
for solving the optimization problem. In this section, (a) we would like to
classify handwritten 8s versus 9s. To this end, (b) we minimize the squared
error function E(w), and (c) we do so by gradient descent optimization.
The rationale should be obvious: the mapping from one to the other is not
one to one. If we mix up the problem and an algorithm to solve it, we will
miss out on other algorithms solving the same problem in a better way.
The goal remains unchanged: minimize E(w), a sum over all data points. Com-
pared to gradient descent, each update is about n times faster to compute. On
extremely large training sets, running an online algorithm may be the sole viable
option. On the other hand, stochastic gradient descent does not update along
the direction of steepest descent for E(w). An update may even increase the
error function value, and this will go unnoticed in general. Why would such an
algorithm ever work?
Some rough intuition goes as follows. The usual gradient is
$$\nabla_w E(w) = \frac{1}{n} \sum_{i=1}^n \nabla_w E_i(w),$$
since the right hand side is what is subtracted from w when visiting (φ̃_i, t_i). This means that E_{perc,i}(w) should be zero for t_i w^T φ̃_i > 0, and E_{perc,i}(w) = −t_i w^T φ̃_i for t_i w^T φ̃_i ≤ 0. Concisely,
$$E_{\text{perc},i}(w) = -t_i w^T \tilde{\phi}_i\, I_{\{t_i w^T \tilde{\phi}_i \le 0\}} = g\big(-t_i w^T \tilde{\phi}_i\big), \qquad g(z) = z\, I_{\{z \ge 0\}}.$$
2.4.3 Techniques: Matrices and Vectors. Outer Product

The new matrix C = BA ∈ R^{r×q} is the matrix-matrix product (or just matrix product) of B and A. The matrix product is associative and distributive, but not commutative: AB ≠ BA in general. If A and B have no special structure, computing C costs O(r p q). Compare this to O(p q + r p) for computing z from x by two matrix-vector products. These differences become even more pronounced if A and B have useful structure, which is lost in C.
A special matrix product is that between a column and a row vector:
$$x y^T = [y_1 x, \dots, y_q x] = \begin{bmatrix} x_1 y_1 & \dots & x_1 y_q \\ \vdots & \ddots & \vdots \\ x_p y_1 & \dots & x_p y_q \end{bmatrix} \in \mathbb{R}^{p \times q}, \qquad x \in \mathbb{R}^p,\ y \in \mathbb{R}^q.$$
This is called the outer product between x and y. Compare this to the inner
product (Section 2.1.1) xT y ∈ R, which works for x, y ∈ Rp only (same dimen-
sionality). The linear transformation of the outer product xy T maps any vector
z ∈ Rq to a scalar multiple of x:
xy T z = x y T z = αx, α = y T z.
(2.2)
It is an extreme case which shows you that even if a linear transform maps into
an output vector space Rp of p dimensions, it may not reach every vector in
that space. Test your understanding by the following column representations of
a matrix in terms of outer products:
$$A = \sum_{j=1}^q a_j \delta_j^T, \qquad I = \sum_{j=1}^q \delta_j \delta_j^T.$$
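These identities are easy to verify numerically; a small sketch:

```python
import numpy as np

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])
M = np.outer(x, y)                  # the 3 x 2 rank-one matrix x y^T
z = np.array([1.0, -1.0])
print(M @ z, (y @ z) * x)           # identical: x y^T z = (y^T z) x

A = np.random.randn(3, 4)
delta = np.eye(4)                   # columns are the delta vectors
B = sum(np.outer(A[:, j], delta[:, j]) for j in range(4))
print(np.allclose(A, B))            # True: A = sum_j a_j delta_j^T
```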
Please verify for yourself that ker A is indeed a linear subspace. The role of the
null space is as follows. Suppose that y = Ax0 . If you add any v ∈ ker A to
x0 , then x0 + v is still mapped to y. In fact, the set of all solutions x to the
linear system Ax = y is precisely the affine subspace
n o
x0 + ker A = x0 + v v ∈ ker A .
Test yourself: what is the kernel of the matrix wT ∈ R1×q ? Does this remind you
of something in Section 2.2? It is argued in [42] that you only really understand
matrices if you understand the four subspaces: ARq , AT Rp , ker A, and ker AT .
I highly recommend you study chapters 3 and 4 of [42] to refresh your memory
and gain a better perspective. One important result is that the range ARq
and the null space ker AT are orthogonal subspaces (Section 2.1.1). Namely, if
x = Av is from ARq , y ∈ ker AT , then
$$y^T x = y^T A v = \big(A^T y\big)^T v = 0^T v = 0.$$
AA−1 = I = A−1 A.
As we will see later during the course, it is not in general a good idea to compute
the inverse in practice, where other techniques are faster and more reliable.
Chapter 3

The Multi-Layer Perceptron
¹ There are many textbooks on multi-layer perceptrons or more general neural networks. The author's favourite is [4]. Then there is [22] and [35].
Figure 3.1: The XOR problem is not linearly separable in R2 . In order to solve it
without error, we need discriminant functions which are nonlinear in the input
space R2 .
Figure 3.2: Multi-layer perceptron with two layers and tanh(·) transfer function.
See text for explanations.
Let us work out an example and thereby fix concepts. Given inputs x ∈ Rd ,
define activations
$$a_q^{(1)} = \big(w_q^{(1)}\big)^T x + b_q^{(1)} = \sum_{j=1}^d w_{qj}^{(1)} x_j + b_q^{(1)}, \qquad q = 1, \dots, h_1.$$
We can collect the activations in a vector a^{(1)} = [a_q^{(1)}] ∈ R^{h_1}. Note that this is really a mapping of x. Next, we pass each activation through a transfer function g(·): z_q^{(1)} = g(a_q^{(1)}). h_1 is the size of the first layer. We say that the first layer has h_1 units, more specifically hidden units ("hidden" refers to the fact that activation values are not given as part of a training dataset). z_q^{(1)} is the output of the q-th unit in the first layer.
Note that this layer looks the same as the first, only that its inputs are hidden outputs z^{(1)} instead of observed x. We can use as many layers as we like, but the simplest choice above the linear class is to use two layers. The activations a_q^{(2)} of the uppermost layer are called outputs. They are taken as they are; no further transfer functions z_q^{(2)} are used. For our binary classification example, we would have h_2 = 1 (just one unit in the second layer), and a^{(2)}(x) corresponds to the nonlinear discriminant function y(x) (we drop the subscript "1" here, as there is a single activation only). The resulting classifier is f(x) = sgn(a^{(2)}(x)).
An example of a two-layer MLP is given in Figure 3.2. Its parameters are {(w_q^{(1)}, b_q^{(1)}) | q = 1, . . . , h_1} for the first layer, and the usual w^{(2)}, b^{(2)} for the second, linear layer. Altogether,
$$f(x) = \operatorname{sgn}\big(a^{(2)}(x)\big), \qquad a^{(2)}(x) = \sum_{q=1}^{h_1} w_q^{(2)}\, g\Big(\underbrace{\sum_{j=1}^d w_{qj}^{(1)} x_j + b_q^{(1)}}_{= a_q^{(1)}}\Big) + b^{(2)}.$$
Our comment at the beginning of the section is clear now. A two-layer MLP has the form of a linear model a^{(2)}(x) = (w^{(2)})^T φ(x; θ) + b^{(2)}, where the features
$$\phi_q(x; \theta) = g\big((w_q^{(1)})^T x + b_q^{(1)}\big)$$
are themselves parameterized, with θ collecting the first-layer weights and biases. Training such a model amounts to minimizing an error function
E(w), quantifying the mismatch between predictions (outputs) and training set
targets. Here, we collect all parameters of the network model in a single large
vector w. Test your understanding by writing down how w would look like
for the two-layer example above. You have to end up with (d + 1)h1 + h1 + 1
parameters. Let us use the squared error function:
$$E(w) = \frac{1}{2} \sum_{i=1}^n \big(a^{(2)}(x_i) - t_i\big)^2.$$
This has the same flavour as in the linear case. First, we compute residuals
a(2) (xi ) − ti for every training case. Then, we combine these with gradients of
the outputs in order to accumulate the gradient. For online learning (stochastic
gradient descent), we compute the gradient on a single case only. Then, we make
a short step along the negative gradient direction. There is of course one key
difference to the linear case: ∇w a(2) (xi ) is not just some fixed φ(xi ) anymore,
but looks daunting to compute. We will meet this challenge in Section 3.3.
A final “historical” note before we move on. A surprisingly large amount of work
has been spent on figuring out which functions can or cannot be represented by
an MLP with two layers (one hidden, one output). If you really want to know
about this, [4, ch. 4] gives a brief overview. For example, a two-layer network
with tanh transfer functions can approximate any continuous function on a
compact set arbitrarily closely, if only the number h1 of hidden units is large
enough. These results are of close to no practical relevance, not only because the
number h1 has to be very large, but also because what really counts is whether
we can learn complex functions from training data in a reliable and efficient
way.
Recall from Section 2.4.3 that part of our mission is to vectorize our procedures.
Vectorization does not only expose important structure most clearly and helps
with “debugging” expressions, it is an essential step towards elegant and efficient
implementation. Most machine learners use Matlab or Python for development
and prototyping, interpreted languages in which vectorization is a must. Let
us vectorize the MLP function evaluation. We already have vectors x ∈ R^d, a^{(1)} ∈ R^{h_1}. The weight matrix W^{(1)} ∈ R^{h_1×d} consists of rows (w_q^{(1)})^T, and the bias vector is b^{(1)} = [b_q^{(1)}] ∈ R^{h_1}. Then, a^{(1)} = W^{(1)} x + b^{(1)}. Moreover, we extend scalar functions to vectors by simply applying them component-wise; for example, f(v) = [f(v_j)], where f : R → R. Then, the vector of first layer unit outputs is z^{(1)} = g(a^{(1)}). The second layer works in the same way:
$$z^{(2)} = g\big(a^{(2)}\big), \qquad a^{(2)} = W^{(2)} z^{(1)} + b^{(2)}, \qquad W^{(2)} \in \mathbb{R}^{h_2 \times h_1}. \tag{3.1}$$
Note the clear separation between linear activation maps and nonlinear trans-
fers.
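A vectorized forward pass for the two-layer MLP, as a Python sketch (shapes follow the conventions above; tanh is used as the transfer function):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP forward pass; returns outputs and intermediate values."""
    a1 = W1 @ x + b1      # first layer activations, shape (h1,)
    z1 = np.tanh(a1)      # component-wise transfer function
    a2 = W2 @ z1 + b2     # output activations, shape (h2,)
    return a2, a1, z1
```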
$$r^{(2)} = \frac{\partial E_i}{\partial a^{(2)}}, \qquad r_q^{(1)} = \frac{\partial E_i}{\partial a_q^{(1)}}.$$
Note that for every activation variable, there is one error variable. Moreover, the error variables determine the gradient components:
$$\nabla_{w^{(2)}} E_i = \frac{\partial E_i}{\partial a^{(2)}} \nabla_{w^{(2)}} a^{(2)} = r^{(2)} z^{(1)}, \qquad \frac{\partial E_i}{\partial b^{(2)}} = r^{(2)}, \\
\nabla_{w_q^{(1)}} E_i = \frac{\partial E_i}{\partial a_q^{(1)}} \nabla_{w_q^{(1)}} a_q^{(1)} = r_q^{(1)} x, \qquad \frac{\partial E_i}{\partial b_q^{(1)}} = r_q^{(1)}. \tag{3.2}$$
All we need to do is to compute the errors. First, r^{(2)} = a^{(2)} − t_i. This is just the residual we already know from the linear case (Section 2.4.1). Finally, we
obtain
$$r_q^{(1)} = g'\big(a_q^{(1)}\big)\, w_q^{(2)}\, r^{(2)}.$$
This means we compute r_q^{(1)} in terms of r^{(2)}, multiplying it with the weight w_q^{(2)} linking the output and the q-th hidden unit, then by the derivative g′(a_q^{(1)}). Note the remarkable symmetry with the activation variables. For them, a^{(2)} is computed in terms of a_q^{(1)}, multiplied by w_q^{(2)} as well.
Figure 3.3: Forward pass (left) and backward pass (right) of the error backpropagation algorithm for a two-layer MLP. Notice the symmetry between the two passes, and how information g′(a_q^{(1)}) computed during the forward pass and stored locally at each node is recycled during the backward pass.
To see the backpropagation technique in its full glory, let us consider a network of at least three layers. The relationship between gradient terms and error variables remains the same. To work out r_q^{(1)}, note that the q-th unit in the first layer is now connected to h_2 second layer units. We have to use the chain rule with respect to the error variables for all of them:
$$r_q^{(1)} = \sum_{j=1}^{h_2} \frac{\partial E_i}{\partial a_j^{(2)}} \cdot \frac{\partial a_j^{(2)}}{\partial a_q^{(1)}} = \sum_{j=1}^{h_2} r_j^{(2)} w_{jq}^{(2)}\, g'\big(a_q^{(1)}\big) = g'\big(a_q^{(1)}\big) \sum_{j=1}^{h_2} w_{jq}^{(2)} r_j^{(2)}. \tag{3.3}$$
1. Forward pass: Propagate the activation variables forward through the network, from the inputs up to the output layer. At the j-th unit of layer l, maintain g′(a_j^{(l)}) (or alternatively a_j^{(l)}).
The computational cost is dominated by the linear mappings, both in the for-
ward and the backward pass, and the final gradient assembly. In each of the
passes, each weight is touched exactly once (the bias parameters are not used in
the backward pass, but their number is subdominant). Therefore, both passes
scale linearly in the number of weights, and so does the final gradient assembly.
This is the backpropagation technique applied to gradient computations for the
squared error function. Later during the course, in Chapter 8, we will consider
models with several output units and different error functions. Our mantra holds
true: one of the key aspects of backpropagation is that it is invariant to such
changes. You have to write code for it only once in order to serve a wide variety
of machine learning problems.
Forward propagation uses W^{(2)} (and b^{(2)}), backward propagation the transpose (W^{(2)})^T. For the gradient assembly, (3.2) becomes
$$\nabla_{W^{(2)}} E_i = r^{(2)} \big(z^{(1)}\big)^T, \qquad \nabla_{W^{(1)}} E_i = r^{(1)} x^T,$$
with the bias gradients given by the error vectors themselves.
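In code, forward and backward pass for a single case (x_i, t_i) might look as follows — a sketch for the squared error with a single linear output unit (h_2 = 1), using tanh′(a) = 1 − tanh(a)²:

```python
import numpy as np

def backprop_single_case(x, t, W1, b1, W2, b2):
    """Gradient of E_i = 0.5 * (a2 - t)^2 for a two-layer tanh MLP."""
    # Forward pass
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    a2 = W2 @ z1 + b2                    # shape (1,): single output unit
    # Backward pass
    r2 = a2 - t                          # output error (residual)
    r1 = (1.0 - z1 ** 2) * (W2.T @ r2)   # r1 = g'(a1) * W2^T r2
    # Gradient assembly: outer products of errors and layer inputs.
    # Returns (dE/dW1, dE/db1, dE/dW2, dE/db2).
    return np.outer(r1, x), r1, np.outer(r2, z1), r2
```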
good results. We will cover some of the most important ones. Finally, for batch
optimization, there are far better optimization methods than gradient descent.
Covering them in any depth is out of the scope of this course, but if you use
MLPs in practice, you need to know about them. Some pointers are provided
below.
Suppose our MLP for binary classification has at least two layers, one linear
output layer and at least one hidden layer. In (batch) gradient descent opti-
mization, we decrease E(w) along the negative gradient direction, until we find
w∗ so that ∇w∗ E = 0. This means that w∗ is a stationary point. Suppose we
take a tiny step in direction d, to w∗ + εd. The change in error function value
is
$$E(w_* + \varepsilon d) - E(w_*) = \varepsilon\, (\nabla_{w_*} E)^T d + O(\varepsilon^2) = O(\varepsilon^2).$$
No matter what direction d we pick (kdk = 1), the error function value will
not change (to first order in ε). Now, for some error functions of specific kind,
a stationary point is a global minimum point, so that minw E(w) is solved.
Alas, not for our MLP problem. First, in order to check whether w∗ is a local
minimum point in the strict sense of mathematical optimization, we have to use
a Taylor expansion of E(w∗ + εd) to second order:
$$E(w_* + \varepsilon d) - E(w_*) = \frac{1}{2} \varepsilon^2\, d^T \big(\nabla\nabla_{w_*} E\big)\, d + O(\varepsilon^3),$$
where we used that ∇_{w_*}E = 0, so the first-order term vanishes. Here,
$$\nabla\nabla_{w_*} E = \left[ \frac{\partial^2 E}{\partial w_{*,j}\, \partial w_{*,k}} \right]_{j,k}$$
is the Hessian. w_* is a local minimum in the strict sense if d^T (∇∇_{w_*}E) d > 0 for every unit norm direction d, which guarantees E(w_* + εd) > E(w_*) for small ε > 0. In other words, the Hessian has to be positive definite, a concept we will
revisit later during the course (Section 4.2.2). More details about conditions for
local minima can be found in [2, ch. 1].
There is a more serious problem. Suppose that w∗ is a local minimum point
in the strict sense. It will in general not be a global minimum point: E(w∗ ) >
minw E(w). And there are no tractable conditions for finding out how big the
difference is, or how far away w∗ is from a global minimum point. In Figure 3.4,
we illustrate the concepts of local and global minima, maxima, and station-
ary points. The squared error function for an MLP with ≥ 2 layers has many lo-
cal minimum points which are suboptimal. To see why that is, consider the high
degree of non-identifiability of the model class: for a particular weight vector w,
we can easily find very many different w0 so that E(w) = E(w0 ). Here are two
examples. First, the transfer function g(a) = tanh(a) is odd: g(−a) = −g(a). Pick one latent unit with activation a_q^{(l)} and flip the signs of all weights and bias parameters of connections feeding into the unit from below. This flips the sign of a_q^{(l)}, therefore of g(a_q^{(l)}) as well. Now, flip the signs of all weights on connections coming from the unit, which compensates the sign flip of g(a_q^{(l)}). All outputs remain the same, and so does the error function value. Second, we can simply interchange any two units in a hidden layer, leading to a permutation of the weight vector w. In short, the error function E(w) for an MLP has many equivalent, as well as many suboptimal, local minima.
Figure 3.4: A non-convex function, such as the squared error for a multi-layer
perceptron, can have local minima and maxima, as well as stationary points
which are neither of the two (called saddle points). A stationary point is defined
by ∇w E = 0.
Initialization (*)
We begin with the initialization issue. For model classes like MLPs, the rela-
tionship between weights w and outputs is highly non-transparent, and it is not
usually possible to select a good starting point w0 in a well-informed way. It is
therefore common practice to initialize w0 at random, drawing each component
of w0 independently from a zero-mean Gaussian distribution (see Section 6.3)
with component-dependent variances. In order to understand why the choice of
these variances makes a difference, and also why we should not just start with
w0 = 0, consider the following argument. A good starting point is one from
which training can proceed rapidly. Recall the transfer function g(a) = tanh(a)
from Section 3.2 (the argument holds for other transfer functions as well). In a nutshell, a good initial w_0 is chosen so that many or most activation values a_q^{(l)} lie in the area of largest curvature |g″(a)|. Let us see why. If a is far away from zero, g(a) is saturated at sgn(a), with g′(a) ≈ 0. Large coefficients of w_0 imply large (absolute) activation values, and the role of g′(a) in the error backpropagation formulae (3.2) implies small gradients and slow progress in general. On the other hand, for a ≈ 0, g(a) ≈ a is linear. If w_0 is chosen such that most activations are small (most extreme: w_0 = 0), the network behaves like a linear model, and many iterations are needed to step out of this linear regime.
The upshot of this argument is that we should sample w_0 in a way that most activations a_q^{(l)}(x_i) are of order unity, on average over the training sample {x_i} and w_0 (recall that n⁻¹E(w) is the average of individual errors over the train-
and w0 (recall that n−1 E(w) is the average of individual errors over the train-
ing set). Assume that the data has been preprocessed to zero mean and unit
covariance:
$$\frac{1}{n} \sum_{i=1}^n x_i = 0, \qquad \frac{1}{n} \sum_{i=1}^n x_i x_i^T = I.$$
This transformation, known as whitening, is a common step in preprocessing pipelines (see also Section 11.1.1). Consider the first layer activation a_q^{(1)}. Since E[w_0] = 0, we have that
$$\mathrm{E}\big[a_q^{(1)}\big] = \sum_{j=1}^d \mathrm{E}\big[w_{qj}^{(1)}\big]\, \mathrm{E}[x_{i,j}] + \mathrm{E}\big[b_q^{(1)}\big] = 0.$$
Moreover,
$$\mathrm{Var}\big[a_q^{(1)}\big] = \sum_{j=1}^d \mathrm{Var}\big[w_{qj}^{(1)}\big]\, \mathrm{Var}[x_{i,j}] + \mathrm{Var}\big[b_q^{(1)}\big] = \sum_{j=1}^d \mathrm{Var}\big[w_{qj}^{(1)}\big] + \mathrm{Var}\big[b_q^{(1)}\big].$$
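For whitened data, choosing Var[w_qj^{(1)}] = Var[b_q^{(1)}] = 1/(d + 1) therefore gives Var[a_q^{(1)}] = 1: activations of order unity. A sketch of such an initialization (the 1/(d + 1) scaling follows from the computation above; many fan-in based variants exist):

```python
import numpy as np

def init_first_layer(h1, d, seed=0):
    """Gaussian initialization giving unit-variance first-layer activations."""
    rng = np.random.default_rng(seed)
    std = 1.0 / np.sqrt(d + 1)    # Var[w_qj] = Var[b_q] = 1 / (d + 1)
    W1 = std * rng.standard_normal((h1, d))
    b1 = std * rng.standard_normal(h1)
    return W1, b1
```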
Choosing the learning rate ηk seems more of an art than a science. Ultimately,
the best choice is determined by the local curvature (second derivative), which
is typically not computed for MLPs (although it can be obtained at surprisingly
little extra effort, see [4, ch. 4.10] or [5, ch. 5.4]). In the context of online learn-
ing, the proposal ηk = 1/k is common. For batch gradient descent, a constant
learning rate often works well and is faster [4, ch. 7.5]. If assessing the Hessian is
too much for you, there is a wealth of learning rate adaptation techniques from
the old neural networks days [4, ch. 7.5.3].
Figure 3.5: Gradient descent optimization without (top) and with momentum term (middle, bottom). We update ∆w_k = −η(1 − µ)∇_{w_k}E + µ∆w_{k−1}, where η is selected by line minimization.
is highly curved2 around the current point wk , the steepest descent direction,
which ignores the curvature information (Section 3.4), is seriously misleading.
Subsequent steps along the negative gradient lead to erratic steps with slow
overall progress. Zig-zagging is illustrated in Figure 3.5, top panel. The idea
behind a momentum term is the observation that during a phase of zig-zagging,
it would be better to move along a direction averaged over several subsequent
steps, which would smooth out the erratic component, but leave useful system-
atic components in place. Denote by ∆w_k = w_{k+1} − w_k the update done in iteration k. Gradient descent with momentum works as follows:
$$\Delta w_k = -\eta (1 - \mu)\, \nabla_{w_k} E + \mu\, \Delta w_{k-1}, \qquad \mu \in [0, 1).$$
Note that we assume a constant learning rate η > 0 for simplicity. The new
update step is a convex combination of the steepest descent direction and the
previous update step. Effects of momentum terms of different strength are shown
in Figure 3.5. To understand what this is doing, we consider two regimes. First,
in a region of low curvature of E(w), the gradient will remain approximately constant over several updates. Solving ∆w = −η(1 − µ)∇_w E + µ∆w results in ∆w = −η∇_w E: in a low-curvature regime, the rule does full-size steepest descent updates. On the other hand, if E(w) has high curvature, then ∇_w E changes erratically. In this case, subsequent ∆w_k will tend to cancel each other out, and the effective learning rate reduces to η(1 − µ) ≪ η, or even to η(1 − µ)/(1 + µ), damping the oscillations. If you use momentum in practice,
you can afford to experiment with larger learning rates. A momentum term is
pretty obligatory with online learning by stochastic gradient descent. Stochas-
tic gradients have erratic behaviour built in, and a momentum term is often
successful in smoothing away much of the noise.
Here are details for the analysis of momentum learning in the two extreme
regimes just mentioned. In reality, the gradient will have both constant and
oscillatory components, in which case momentum will roughly leave the former
alone, but damp the latter. First, suppose that ∇_{w_k}E = ∇_w E over many updates. To see what the updates are doing, we equate ∆w_{k+1} and ∆w_k and solve the resulting fixed point equation, as done above.
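A sketch of the momentum update rule (the names `grad_fn`, `eta`, `mu` are ours):

```python
import numpy as np

def gd_with_momentum(w, grad_fn, eta=0.1, mu=0.9, num_iters=100):
    """Iterate Delta w_k = -eta*(1 - mu)*grad E(w_k) + mu*Delta w_{k-1}."""
    delta = np.zeros_like(w)
    for _ in range(num_iters):
        delta = -eta * (1.0 - mu) * grad_fn(w) + mu * delta
        w = w + delta
    return w
```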
Training an MLP is tricky enough, due to local minima issues and the fact
that we cannot develop a good “feeling” about what individual parameters in
the network stand for. But there is one point we can (and absolutely should)
ensure: that the backpropagation code for the gradient computation is bug-free.
To keep the presentation simple, we focus on an error function E(w) of a single
weight w ∈ R only. For the general case, the recipe is repeated for each gradient
component (but see below for an idea to save time).
Suppose we are at a point w, and g(w) = E′(w) = dE/dw is the gradient (or derivative) there. Gradient testing works by comparing the result for g(w) with finite differences, computed by evaluating the error function at other points w′ very close to w:
$$g(w) = E'(w) \approx \tilde{g}_{1;\varepsilon}(w) := \frac{E(w + \varepsilon) - E(w)}{\varepsilon}$$
for a very small ε > 0. This means that we can test our gradient code by
computing g̃_{1;ε}(w) (which costs one forward pass), then inspecting the relative error
$$\frac{|g(w) - \tilde{g}_{1;\varepsilon}(w)|}{|g(w)|}.$$
Since finite differences and derivatives are not exactly the same, we cannot
expect the error to be zero, but it should be small, in fact of the same order as ε if g(w) itself is away from zero. It is important to choose ε not too small, say ε = 10⁻⁸.
There is a more accurate symmetric finite-difference approximation of gradient
components, about twice as costly to evaluate. As debugging is not about speed,
we recommend this latter one:
$$g(w) = E'(w) \approx \tilde{g}_{2;\varepsilon}(w) := \frac{E(w + \varepsilon) - E(w - \varepsilon)}{2\varepsilon}.$$
This needs two forward passes to compute, instead of just one. Why does this
work better? Let us look at the Taylor expansion of the function E at the current
value w:
$$E(w \pm \varepsilon) = E(w) \pm E'(w)\, \varepsilon + \frac{1}{2} E''(w)\, \varepsilon^2 + O(\varepsilon^3).$$
Notice that we expand up to second order, and that the ε2 term does not depend
on the sign in ±ε. Subtracting one line from the other:
$$E(w + \varepsilon) - E(w - \varepsilon) = 2 E'(w)\, \varepsilon + O(\varepsilon^3) \;\Rightarrow\; \tilde{g}_{2;\varepsilon}(w) = E'(w) + O(\varepsilon^2),$$
where we divided by 2ε. The error between E′(w) and g̃_{2;ε}(w) is only O(ε²), whereas you can easily confirm that the error between E′(w) and g̃_{1;ε}(w) is O(ε), thus much larger for small ε.
If your network has lots of parameters, it can be tedious to test every gradient component separately. It is far better to run the following randomized test in order to spot problems (which are then analyzed component by component), testing the gradient at many different points w. The idea is to test directional derivatives along random directions³ d ∈ R^p, ‖d‖ = 1. Suppose we wish to test the gradient g = ∇_w E(w) at the point w. If f(t) := E(w + td), t ∈ R, the directional derivative of E at w along d is f′(0) = g^T d. Our finite difference approximation is
$$f'(0) = g^T d \approx \frac{E(w + \varepsilon d) - E(w - \varepsilon d)}{2\varepsilon}.$$
³ For example, by drawing each component i.i.d. from a Gaussian N(0, 1), then normalizing the resulting vector to unit length.
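A sketch of the randomized directional test (`E` and `grad_E` stand for your error function and gradient code):

```python
import numpy as np

def check_gradient(E, grad_E, w, eps=1e-6, num_dirs=10, seed=0):
    """Compare g^T d with symmetric finite differences along random unit directions."""
    rng = np.random.default_rng(seed)
    g = grad_E(w)
    for _ in range(num_dirs):
        d = rng.standard_normal(w.shape)
        d /= np.linalg.norm(d)          # random unit direction
        fd = (E(w + eps * d) - E(w - eps * d)) / (2 * eps)
        rel = abs(g @ d - fd) / max(abs(g @ d), 1e-12)
        print(f"relative error {rel:.2e}")   # should be of order eps^2
```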
Beyond variants of gradient descent, the most basic algorithm of this form is con-
jugate gradients, designed originally for minimizing quadratic functions. Exam-
ples for quadratic minimization are linear least squares problems (Section 2.4).
Gradient descent is a poor method for minimizing general quadratic functions,
and even a momentum term does not help much in general. In contrast, conju-
gate gradients is the optimal5 method for minimizing quadratics, given that one
gradient is evaluated per iteration. It is easily extended to non-quadratic func-
tions by incorporating a line search. Applied to MLP training, it tends to out-
perform gradient descent dramatically, at the same cost per iteration. A variant
called scaled conjugate gradients avoids line searches for most updates, which
can be expensive for MLPs. Methods advancing on conjugate gradients are all
motivated by approximating the gold-standard method: Newton-Raphson opti-
mization. Recall from high school that an update of this algorithm is based on
approximating E(w) by a quadratic q(w) locally at wk , then minimizing q(w)
as surrogate for E(w). Unfortunately, this needs computing the Hessian ∇∇wk E
and solving a linear system with it, which is out of scope6 of our recipe above.
However, quasi-Newton methods manage to build search directions over sev-
eral updates which share important properties with Newton directions. Among
them, limited memory quasi-Newton methods are most useful for large MLPs,
and are in general widely used for many other machine learning problems as
well. Finally, the Levenberg-Marquardt algorithm is specialized to minimizing
squared error functions, and is maybe the most widely used general technique
4 For serious applications, codes from numerical mathematicians should be preferred, such
⁵ For linear systems with a symmetric positive definite matrix (see Section 4.2.2), conjugate gradients is the gold-standard for iterative linear solvers as well.
⁶ However, note that there are surprisingly efficient methods for computing expressions involving the Hessian, such as Hessian-vector products.
¹ The squared error is often used for classification mainly because its minimization is so simple. We will see that classification is much better served by error functions whose minimization is not much harder to do, whether for linear methods (Chapter 2) or multi-layer perceptrons (Chapter 3).
Chapter 4

Linear Regression. Least Squares Estimation
Figure 4.1: Two different ways of fitting the data {(xi , ti )}. Left: Piece-wise
linear interpolation. Right: Linear least squares regression.
We would like to learn the underlying function x → y represented by this data. What does that mean? For example, we could just connect neighbouring points by straight lines and be done with it:
piece-wise linear interpolation2 (Figure 4.1, left). However, interpolation is not
what curve fitting is about. Rather, we aim to represent the data as the sum of
two functions: a systematic curve, smooth and simple, plus some random errors,
highly erratic but small. We will refine and clarify this notion in Chapter 8, but
for now this working definition will suffice. The rationale is not to get sidetracked
by errors made during the recording of the dataset.
Figure 4.2: Illustration of squared error criterion for linear regression. Each
triangle area corresponds to an error contribution of (y(xi ) − ti )2 /2.
The simplest curve fitting technique is linear regression (more correctly: linear
regression estimation, but the shorter term is commonly used). We assume a
² Interpolation differs from curve fitting in that all data points have to be represented exactly.
linear function y(x) = wx + b for the systematic part, then fit it to the data by
minimizing the squared error:
$$E(w, b) = \frac{1}{2} \sum_{i=1}^n (y(x_i) - t_i)^2 = \frac{1}{2} \sum_{i=1}^n (w x_i + b - t_i)^2. \tag{4.1}$$
This problem is known as least squares estimation. Its prediction on our data
above is shown in Figure 4.1, right. The squared error criterion is illustrated
in Figure 4.2. Note the rapid growth with |y(xi ) − ti |: large differences are not
tolerated. We solve this problem by setting the gradient w.r.t. [w, b]T to zero
and solving for w, b. You should do this as an exercise, the solution is given in
Section 4.1.1.
[Figure 4.3: Least squares polynomial fits to the same dataset for p = 1, 2, 4, 10; see the discussion below.]
Consider the feature map for polynomial regression:
$$\phi(x) = \begin{bmatrix} 1 \\ x \\ \vdots \\ x^{p-1} \end{bmatrix} = \big[x^{j-1}\big]_j, \qquad y(x) = w^T \phi(x) = w_0 + w_1 x + \dots + w_{p-1} x^{p-1}.$$
The function y(x) is a polynomial of maximum degree p − 1, whose coefficients
are fit by least squares estimation. The larger the number of features p, the more
complex functions we are able to represent. In fact, if p′ > p, then the class for p is strictly included in the class for p′: lines are special cases of quadratic functions. We might assume that the accuracy of curve fitting improves as p grows, since we run our method with more and more flexible function classes. Indeed, if w_*^{(p)} is a minimizer of the squared error E^{(p)}(w^{(p)}) (making the dimensionality p explicit in the notation for now), we have
$$E^{(p)}\big(w_*^{(p)}\big) = E^{(p')}\Big(\big[(w_*^{(p)})^T, 0, \dots, 0\big]^T\Big) \ge E^{(p')}\big(w_*^{(p')}\big), \qquad p < p'.$$
Our fit in terms of squared error can never get worse as p increases, and typi-
cally it improves. In Figure 4.3, we show least squares solutions using different
dimensionalities p. The dataset consists of 10 points, drawn from the smooth
green curve plus random errors. The fit is rather poor for p = 1 (constant) and
p = 2 (line), giving rise to large errors. It improves much for p = 4 (cubic), rep-
resenting the generating curve well. The fit for p = 10 provides a first example
of over-fitting. The fit to the data is even better than for p = 4: the polynomial
is exactly interpolating each data point. But away from the xi locations of the
training set, the prediction behaves highly erratically. It is not at all a useful
representation of the generating curve. We will return to this important issue
in Chapter 7.
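To make the effect tangible, here is a small numpy sketch (not part of the original notes; the generating curve sin(2πx) and the noise level are assumptions chosen for illustration). It fits polynomials of dimensionality p to n = 10 noisy points and prints the training error, which can only decrease with p:

import numpy as np

def poly_features(x, p):
    # Feature map phi(x) = [1, x, ..., x^(p-1)] for each entry of x.
    return np.vander(x, N=p, increasing=True)       # shape (n, p)

rng = np.random.default_rng(1)
n = 10
x = np.sort(rng.uniform(0.0, 1.0, size=n))
t = np.sin(2 * np.pi * x) + 0.15 * rng.standard_normal(n)   # assumed curve + noise

for p in (1, 2, 4, 10):
    Phi = poly_features(x, p)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # least squares fit
    train_err = 0.5 * np.sum((Phi @ w - t) ** 2)    # squared error (4.1)
    print(f"p = {p:2d}: training error = {train_err:.6f}")

For p = 10 = n the training error is (numerically) zero: the polynomial interpolates the data, which is exactly the over-fitting behaviour described above.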
Now,

\langle x^2 \rangle - \langle x \rangle^2 = n^{-1} \sum_{i=1}^n (x_i - \langle x \rangle)^2 = Var[x],
\langle t x \rangle - \langle t \rangle \langle x \rangle = n^{-1} \sum_{i=1}^n (x_i - \langle x \rangle)(t_i - \langle t \rangle) = Cov[x, t],

so that the solution can be written as w = Cov[x, t]/Var[x], b = \langle t \rangle - w \langle x \rangle.
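As a check on this derivation, a minimal numpy sketch (the synthetic data is an illustrative assumption) computing w and b from these empirical moments:

import numpy as np

def fit_univariate_least_squares(x, t):
    # Closed-form minimizer of (4.1): w = Cov[x, t] / Var[x], b = <t> - w <x>.
    x_mean, t_mean = x.mean(), t.mean()
    var_x = np.mean((x - x_mean) ** 2)                 # <x^2> - <x>^2
    cov_xt = np.mean((x - x_mean) * (t - t_mean))      # <t x> - <t><x>
    w = cov_xt / var_x
    b = t_mean - w * x_mean
    return w, b

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=20)
t = 0.8 * x + 1.5 + 0.1 * rng.standard_normal(20)      # noisy line
print(fit_univariate_least_squares(x, t))              # roughly (0.8, 1.5)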
Let us set ∇_w E = 0:

\Phi^T \Phi w = \Phi^T t.   (4.2)
These are the celebrated normal equations: a linear system we need to solve in
order to obtain w. Since Φ has full rank, it is easy to see that \Phi^T \Phi \in \mathbb{R}^{p \times p}
has full rank as well, and is therefore invertible. We can write the solution as

\hat{w} = \operatorname{argmin}_w E(w) = \big( \Phi^T \Phi \big)^{-1} \Phi^T t.   (4.3)
Beyond simple techniques like gradient descent, many modern machine learning
algorithms solve the normal equations inside, so it is important to understand
how to do this well. Some advice is given in Section 4.2.3.
As an exercise, you should convince yourself that the solution for univariate
linear regression of Section 4.1.1 is a special case of solving the normal equations.
The first point to note about (4.3) is that ŵ, and also ŷ = Φŵ, are linear
maps of the vector of targets t. Not only is the forward mapping from weights
to predictions linear, but so is the inverse, the estimator itself. What type of
linear mapping is t ↦ ŷ? The normal equations read Φ^T(t − ŷ) = 0. This means
that the residual vector t − ŷ is orthogonal to the subspace ΦRp of possible
prediction vectors (see Figure 4.4). Therefore,
t = (t - \hat{y}) + \hat{y}, \qquad t - \hat{y} \perp \Phi\mathbb{R}^p, \quad \hat{y} \in \Phi\mathbb{R}^p

(Section 2.1.1): ŷ is the orthogonal projection of t onto the model space ΦR^p.
Try to get a feeling for this result, by drawing in R2 . ŷ is the closest point to t
in ΦRp , so t − ŷ must be orthogonal to this subspace.
M is the matrix of the orthogonal projection. Its properties are clear from our
geometrical picture: M u = u for u ∈ U, and M v = 0 for v ∈ U ⊥ . In other
words, ker M = U ⊥ , and M has eigenvalues 1 (p dimensions) and 0 (n − p
dimensions) only (see Section 11.1.2).
If there is a direction d with α = d^T A d < 0, then along the ray x = t d we have

q(t\, d) = \frac{1}{2} t^2 \alpha - t\, b^T d + c \to -\infty \qquad (t \to \infty).
For the case α = 0, we pick d so that b^T d > 0 (ignoring the special case that
b^T d may be zero). This means that in general^3, we require that d^T A d > 0 for
all d ≠ 0. A symmetric matrix with this property is called positive definite. It is
these matrices which give rise to lower bounded quadratics, curving upwards like
a bowl in all directions (Figure 4.5). To solve our problem: ∇_x q(x) = A x − b =
0, so that x_* = A^{−1} b. A positive definite matrix A is also invertible (why?),
so x_* is the unique stationary point, which is a global minimum point.
3 For the meticulous: The precise condition is d^T A d ≥ 0 for all d ≠ 0, and if d^T A d = 0, then b^T d = 0.
The QR decomposition of Φ is Φ = QR, where Q ∈ R^{n×p} has orthonormal
columns (Q^T Q = I) and R ∈ R^{p×p} is upper triangular (r_{ij} = 0 for i > j) and
invertible [23, 42]. The QR decomposition is computed using Householder's
method. We do not have to care about this, as we simply use good numerical^4
code. Plugging this into the normal equations and using Q^T Q = I, you get
R^T R w = R^T Q^T t, or R w = Q^T t (as R is invertible). This is solved by back
substitution, exploiting that R is triangular.
so that the “cross-talk” vanishes (Section 2.1.1). The solution ŵ makes the
first term in (4.4) vanish, and the second is the remaining (minimum) squared
distance. Recalling Section 4.2.1,
\hat{y} = \Phi \hat{w} = Q R \hat{w} = Q Q^T t
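A minimal sketch of the QR-based solve in Python (scipy's triangular solver performs the back substitution; the random test data is an illustrative assumption, not the course's reference implementation):

import numpy as np
from scipy.linalg import solve_triangular

def least_squares_qr(Phi, t):
    # Solve min_w ||Phi w - t||^2 via the thin QR decomposition Phi = Q R.
    Q, R = np.linalg.qr(Phi)              # Householder-based; Q: (n, p), R: (p, p)
    return solve_triangular(R, Q.T @ t)   # back substitution for R w = Q^T t

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 5))
t = rng.standard_normal(50)
w_qr = least_squares_qr(Phi, t)
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # library cross-check
print(np.allclose(w_qr, w_ls))                   # True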
A discussion of iterative solvers is beyond the scope of this course. For systems
of the form (4.2), with ΦT Φ symmetric positive definite, the most commonly
used method is conjugate gradients [17]. I can recommend the intuitive deriva-
tion in [4, ch. 7.7]. However, specifically for the normal equations, the LSQR
algorithm [30] is to be preferred (code at http://www.stanford.edu/group/
SOL/software/lsqr.html).
4 Good numerical code is found on www.netlib.org. The standard package for matrix de-
compositions (QR, Cholesky, up to the singular value decomposition) is LAPACK. This code
is wrapped in Matlab, Octave, Python, or the GNU scientific library. In contrast, “Numerical
recipes for X” is not recommended.
Chapter 5
Probability. Decision Theory
In this chapter, we refresh our knowledge about probability, the language and
basic calculus behind decision making and machine learning. We also introduce
concepts from decision theory, setting the stage for further developments. Moving
up from technical details like binary classifiers, linear functions, or numerical
optimizers towards the big picture, we will understand how decision making
works in the optimal case, and which probabilistic aspects of a problem are
relevant. We will see that Bayesian computations, or inference, are the basis for
optimal decision making. In later chapters, we will see that learning is based on
inference as well.
Joint distribution P(F, B):

                  Box
Fruit        red       blue
apple        1/10      9/20
orange       3/10      3/20
We will use an example loosely based on [5, ch. 1.2]. We draw a fruit out of a box,
as illustrated and explained in Figure 5.1. When thinking about a probabilistic
setup, the following concepts have to be defined:
• The probability (or sample) space Ω: the set of all possible outcomes. In
our example, (F, B) can take values Ω = {(a, r), (o, r), (a, b), (o, b)}.
Note that both Ω and ∅ (the empty set) are events, with P(Ω) = 1 and P(∅) =
0 for any joint probability distribution P. Many rules of probability can be
understood by drawing Venn diagrams such as Figure 5.2. For example,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
For most probability experiments, the relevant events are conveniently described
by random variables. Formally, a random variable is a function from Ω into some
set, for example R, {0, 1}, or {a, o}. This notion is so natural that we have chosen
it implicitly in order to define the example in Figure 5.1: F is a random variable
mapping to {a, o}, and B is a random variable mapping to {r, b}. We can define
random variable mapping to {a, o}, and B is a random variable mapping to {r, b}. We can define
random variables in terms of others. For example, I_{\{F=a \text{ or } B=b\}} is a random
equal to one otherwise.
The joint probability distribution, P (F, B), is a complete description of the ex-
periment. All other distributions are derived from it. First, there are marginal
distributions. Suppose we want to predict which fruit will be drawn with which
probability, no matter what the box. Every time you hear “no matter what B”,
you translate “sum over all possible values of B”. The marginal distribution
P (F ) is
P(F) = \sum_{B=r,b} P(F, B) = \begin{cases} 11/20 & F = a, \\ 9/20 & F = o. \end{cases}
If you are given a joint distribution P(F, B), you can first work out the marginals
P(B) and P(F) by summing out the other variable respectively, then obtain the
conditional distributions

P(F|B) = \frac{P(F, B)}{P(B)}, \ P(B) \ne 0, \qquad P(B|F) = \frac{P(F, B)}{P(F)}, \ P(F) \ne 0.
For our Venn diagram example,

P(A|B) = P(\{F = a\} \mid \{B = r\}) = \frac{P(\{F = a, B = r\})}{P(\{B = r\})} = \frac{P(A \cap B)}{P(B)}.
Given that we are in B, the probability of being in A is obtained by the fractional
size of the intersection A ∩ B within B.
Note that we speak about the conditional distribution P (F |B), even though it
is really a set of distributions, P (F |B = r) and P (F |B = b). Storing it in a
table needs as much space as the joint distribution. Note that if, for a different
experiment, P (B = r) was zero, then P (F |B = r) would remain undefined: it
does not make sense to reason about F given B = r if B = r cannot happen.
To sum up, you only really need two rules, the sum rule and the product rule:

P(X) = \sum_Y P(X, Y), \qquad P(X, Y) = P(X|Y)\, P(Y).
These rules formalize two basic operations which drive probabilistic reasoning.
Whenever a variable Y is observed, so we know its value (for example, it is a
case in a dataset), we condition on it, consider conditional distributions given
Y . Whenever a variable X, whose precise value is uncertain, is not currently of
interest (not the aim of prediction), we marginalize over it, summing distributions
over all values of X and thereby eliminating it.
There is no sense in which the variables F and B are ordered, we could write
P (B, F ) instead of P (F, B). After all, the intersection of events (sets) is not
ordered either. In fact, in derivations, we will sometimes simply use P (B, F ) =
P (F, B): the argument names matter, not their ranking. Given that, let us apply
the product rule in both orderings: P(F|B)P(B) = P(F, B) = P(B|F)P(F), which yields

P(B|F) = \frac{P(F|B)\, P(B)}{P(F)}.

This is Bayes' formula (or Bayes' rule, or Bayes' theorem). This looks harmless,
but substitute “cause” for B, “effect” for F, and Bayes' formula provides the
ultimate solution to the inverse problem of reasoning. Suppose you show me
the fruit you got: F = o (an orange). What can I say about the box you picked?
It is twice as likely that your box was the red one. One point should become
clear even from this simple example. The fact that events happen at random
and we have to be uncertain about them, does not mean that they are entirely
unpredictable. It just means that we might not be able to predict them with
complete certainty. Most events around you are uncertain to some degree, almost
none are completely unpredictable. The key to decision making from uncertain
knowledge is to quantify your probabilistic belief1 in dependent variables, so that
if you observe some of them, you can predict others by probability computations.
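The fruit-box computation above is easily mechanized. A small Python sketch (the dictionary encoding of the table is an illustrative choice):

# Joint distribution P(F, B) from the table above.
joint = {("a", "r"): 1/10, ("o", "r"): 3/10,
         ("a", "b"): 9/20, ("o", "b"): 3/20}

# Observe F = o; compute P(B | F = o) by Bayes' formula.
evidence = sum(p for (f, b), p in joint.items() if f == "o")    # P(F = o) = 9/20
posterior = {b: joint[("o", b)] / evidence for b in ("r", "b")}
print(posterior)   # {'r': 0.666..., 'b': 0.333...}: red is twice as likely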
Variables X and Y with P(X, Y) = P(X)P(Y) are independent: the joint
distribution is the product of the marginals. Equivalently, the events A and B
are independent if P(A|B) = P(A),
1 Belief, or “subjective probability”, is another word for probability distribution.
or P (A ∩ B) = P (A)P (B). For example, if we pick a fruit Fr from the red box
and a fruit Fb from the blue box, then Fr and Fb are independent random
variables. Within a machine learning problem, independent variables are good
and bad. They are good, because we can treat them separately, thereby sav-
ing computing time and memory. Independence can simplify derivations a lot.
They are bad, because we only learn things from dependent variables. Independence
is often too strong a concept; we need something weaker. Consider
three random variables X, Y , Z with joint distribution P (X, Y, Z). X and Y
are conditionally independent, given Z if they are independent under the con-
ditional P (X, Y |Z), namely if P (X, Y |Z) = P (X|Z)P (Y |Z). Conditional inde-
pendence constraints are useful for learning. In general, X, Y , Z are still all
dependent, but the conditional independence structure can be used to simplify
derivations and computations. In Figure 5.1, draw a box B, then two fruits
F1 , F2 at random from B (with replacement: you put the fruits back). Then,
P(F_1, F_2|B) = P(F_1|B)\, P(F_2|B), but P(F_1, F_2) ≠ P(F_1)\, P(F_2) (check for yourself
that P(F_2 = o|F_1 = o) = 7/12 ≠ P(F_2 = o) = 9/20). Note that it can go
the other way around. Throw two dice X, Y ∈ {1, ..., 6} independently. If I tell
independent as such.
Figure 5.3: From a discrete distribution (histogram) on a finer and finer grid to
a continuous probability density.
Consider the probability of a small interval [x, x + ∆x), divided by its width:

p(x) = \lim_{\Delta x \to 0} \frac{P([x, x + \Delta x))}{\Delta x}.
Under technical conditions on P (·) and the random variable, this limit exists
for all x ∈ R: it is called the probability density of the random variable x. Just
as for P , we have that
p(x) \ge 0 \ \ \forall x \in \mathbb{R}, \qquad \int p(x)\, dx = 1.
A “graphical proof” for the latter is given in Figure 5.3. Careful: the value p(x)
of a density at some x ∈ R can be larger than 1, in fact some densities are
unbounded2 above. The density allows us to compute probabilities over x, for
example:
P(\{x \in [a, b]\}) = \int_a^b p(x)\, dx = \int I_{\{a \le x \le b\}}\, p(x)\, dx, \qquad P(E) = \int I_{\{x \in E\}}\, p(x)\, dx.
The cumulative distribution function is

F(x) = P(\{x \in (-\infty, x]\}) = \int_{-\infty}^x p(t)\, dt \ \Rightarrow\ p(x) = F'(x).
In this course, probability densities are for continuous variables what probability
distributions are for discrete ones. Sum and product rule work for densities as well:

p(x) = \int p(x, y)\, dy, \qquad p(x, y) = p(x|y)\, p(y),
and so does Bayes' rule, always replacing sums by integrals. At this point, you
will start to appreciate random variables. If x and y are random variables, so
are ex and x + y, and computing probabilities for the latter is merely an exercise
in integration.
This is about the level of formality we need in this course. However, you should
be aware that we are avoiding some difficulties here which do in fact become
relevant in a number of branches of machine learning, notably nonparametric
statistics and learning theory. For example, it is not possible to construct a
probability space on R so that every subset is an event. We have to restrict
ourselves to measurable (or Borel) sets. Also, not every function is allowed as
random variable, and not every distribution of a random variable has a prob-
ability density (a simple example: the constant variable x = 0 does not have
a density). Things also become difficult if we consider an infinite number of
random variables at the same time, for example in order to make statements
about limits of random variable sequences. None of these issues are within the scope
of this course. Good general expositions are given in [3, 18].
if x is continuous with density p(x) (always under the assumption that the sum or
integral exists). Note that our definition covers extensions such as

E[f(x)] = \int f(x)\, p(x)\, dx, \qquad E[x] = [E[x_j]] \in \mathbb{R}^d.
If x and y are independent, then E[xy] = E[x]E[y].
However, the reverse^3 is not true in general. Moreover, the conditional expectation is

E[x \mid y] = \sum_x x\, P(x|y)

for discrete x, or

E[x \mid y] = \int x\, p(x|y)\, dx

for continuous x.
The variance of x is

Var[x] = E\big[ (x - E[x])^2 \big],

the expected squared distance of x from its mean. Another formula for the
variance is Var[x] = E[x^2] - E[x]^2.
3 If E[f (x)g(y)] = E[f (x)]E[g(y)] for every pair of (measurable) functions f , g, then x and
y are independent.
Again,

Var[x \mid y] = E\big[ (x - E[x \mid y])^2 \mid y \big] = E[x^2 \mid y] - E[x \mid y]^2.
Covariance matrices transform as Cov[Ax] = A\, Cov[x]\, A^T: left-multiplication by A, right-multiplication by A^T.
Given a dataset D = {xi | i = 1, . . . , n}, empirical mean and empirical covari-
ance (or sample mean, sample covariance) are computed as
\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n x_i x_i^T - \hat{\mu}\, \hat{\mu}^T.
If the data points are drawn independently from a distribution with mean E[x],
covariance Cov[x], then μ̂ and Σ̂ converge^4 to E[x] and Cov[x] as n → ∞.
We will gain a more precise idea about these estimators in Chapter 6.
4 Convergence happens almost surely (with probability one), a notion of stochastic convergence.
Finally, be aware that not all continuous random variables with a density have a
variance, some do not even have a mean, since corresponding integrals diverge.
For example, the Cauchy distribution with density

p(x) = \frac{1}{\pi \big( 1 + (x - x_0)^2 \big)}
does not have mean or variance. The reason for this is that too much proba-
bility mass resides in the tails (−∞, −x0 ] ∪ [x0 , ∞), x0 large. Such heavy-tailed
distributions become important in modern statistics and machine learning, as
good models for data with outliers. They can be challenging to work with.
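A quick numerical illustration (a sketch; x_0 = 0 and the sample sizes are assumptions for illustration): sample means of Cauchy variables do not settle down as n grows, in contrast to distributions with a finite mean.

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1000, 100000):
    x = rng.standard_cauchy(size=n)   # Cauchy density with x0 = 0
    print(n, x.mean())                # keeps jumping around, never converges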
• Class prior probability distribution P (t): The distribution of the class label
t on its own. What is the fraction of healthy versus cancerous patients to
be exposed to our screening procedure?
P(t|x) = \frac{p(x|t)\, P(t)}{p(x)}, \qquad p(x) = \sum_t p(x|t)\, P(t).   (5.1)
We seek a classifier f(x) whose probability of error

R(f) = P\{f(x) \ne t\}

is as small as possible. What does such an optimal classifier look like? Consider
the example in Figure 5.4, where x ∈ R and t ∈ {1, 2}.
Figure 5.4: Joint distributions p(x, Ct ) = p(x|t)P (t) for a binary classification
problem with x ∈ R, T = {1, 2}. The classifier f (x) = 1 + I{x≥x̂} has decision
regions H2 = [x̂, ∞) and H1 = (−∞, x̂) (called R2 and R1 in the figure). Its
error probability R(f ) is visualized as the combined area of the blue, green and
red regions. The blue region stands for points x from class 1 being classified as
f (x) = 2, while the union of green and red regions stands for points x from class
2 being classified as f (x) = 1. If we move the decision threshold away from x̂,
the combined area of green and blue regions stays the same. It symbolizes the
unavoidable Bayes error R∗ . On the other hand, we can reduce the red area to
zero by setting x̂ = x0 , the point where the joint density functions intersect.
f ∗ (x) = 1 + I{x≥x0 } is the Bayes-optimal rule.
Figure from [5] (used with permission).
It becomes clear from this example that the error R(f ) can be decomposed into
several parts, some of which are intrinsic, while others can be avoided by a better
choice of f . From the figure, it seems that the optimal classifier f ∗ follows the
lower of the two curves p(x|t)P (t), t = 1, 2, and its error R∗ is the area under
72 5 Probability. Decision Theory
the curve min{p(x, t = 1), p(x, t = 2)}, the best we can do. In particular, this
minimum achievable error is not zero.
To cement our intuition, denote the space of label values by T : T = {0, 1} for
our cancer screening example, or T = {0, . . . , K − 1} for K-way classification.
A classifier f(x) is characterized by its decision regions

H_t = \{ x \mid f(x) = t \}.
For example, recall from Section 2.2 that the decision regions for a binary linear
classifier are half-spaces in feature space. First,

P\{f(x) \ne t\} = E\big[ I_{\{f(x) \ne t\}} \big],
The total error is the expectation of the class-conditional errors R(f |t = k) over
the class prior P (t). This is useful in order to understand the composition of
the error. Test your understanding by deriving the second equation (note that
1 − R(f ) is the probability of getting it right). Second,
R(f) = \int p(x) \Big( \sum_{k \in T} I_{\{f(x) \ne k\}}\, P(t = k|x) \Big) dx
     = \int p(x) \big( 1 - P(t = f(x)|x) \big) dx.
The second equation is due to the fact that in the definition (5.1) of the posterior
P (t|x), the denominator p(x) does not depend on t. In Figure 5.4, we plot
p(x|t = k)P (t = k) for two classes. The Bayes-optimal classifier picks t = k
whenever the curve for k lies above the curve for all other classes. Its decision
regions are

H_k^* = \big\{ x \mid p(x|t = k)\, P(t = k) > p(x|t = k')\, P(t = k') \ \ \forall k' \in T \setminus \{k\} \big\}.
The probability of error for the Bayes-optimal classifier is called Bayes error:

R^* = R(f^*) = \int p(x) \Big( 1 - \max_{k \in T} P(t = k|x) \Big) dx = 1 - E\Big[ \max_{k \in T} P(t = k|x) \Big].
In Figure 5.4, the Bayes error R∗ is the area under the curve min{p(x, t =
1), p(x, t = 2)}.
In the binary case, the optimal rule can be written in terms of a discriminant
function: decide for class 1 if

y^*(x) = \log \frac{p(x|t = 1)}{p(x|t = 0)} + \log \frac{P(t = 1)}{P(t = 0)} > 0.
in that f ∗ (x) = argmaxt∈T yt∗ (x). As in the binary case, we could get by with
K − 1 functions, say yk∗ (x) − y0∗ (x), k = 1, . . . , K − 1. However, this singles out
one class (t = 0) arbitrarily and creates more problems than it solves, so usually
K discriminant functions are employed.
The setup is illustrated in Figure 5.5. Recall from Section 5.1.3 that the Cauchy
distribution is peculiar in that mean and variance do not exist. If you sample
5 It does not matter to which basis the logarithm is taken, as long as we keep consistent.
Figure 5.5: Bayes-optimal classifier and Bayes error for two class-conditional
Cauchy distributions, centered at a_0 and a_1. The optimal rule thresholds at the
midpoint a = (a_0 + a_1)/2. Since the class prior is P(t = 0) = P(t = 1) = 1/2,
the Bayes error R^* is twice the yellow area. The right plot shows R^* as a function
of the separation parameter ∆ = (a_1 − a_0)/(2b). The slow decay of R^* is due to
the very heavy tails of the Cauchy distributions.
values from it, you may encounter very large values occasionally. Assume a1 > a0
and define the midpoint a = (a0 + a1 )/2. The Bayes-optimal rule is obvious by
symmetry: f^*(x) = I_{\{x > a\}}. Moreover,

R^* = \frac{1}{2} \sum_{k=0,1} \int_{H^*_{1-k}} p(x|t = k)\, dx.
The Bayes error R∗ is a function of the separation ∆ of the classes (Figure 5.5,
right). For ∆ = 0, the class-conditional densities are the same, so that R∗ = 1/2.
Also, R∗ → 0 as ∆ → ∞. However, the decay is very slow, which is a direct
consequence of the heavy tails of the p(x|t). Any x > a is classified as f ∗ (x) = 1,
but even an x ≫ a could still come from p(x|t = 0), whose probability mass far
to the right is considerable.
The Bayes-optimal classifier f ∗ (x) under loss L(y, t) minimizes the risk, and
R∗ = R(f ∗ ) is called Bayes risk. Since
R(f) = \int p(x) \Big( \sum_{k \in T} L(f(x), k)\, P(t = k|x) \Big) dx,

it is obtained by minimizing the expected loss pointwise at each x:
f^*(x) = \operatorname{argmin}_{y \in T} \sum_{k \in T} L(y, k)\, P(t = k|x).
You have probably noticed by now that minimizing the classification error is
just a special case under the zero-one loss L(y, t) = I_{\{t \ne y\}}: no loss for getting it
right, loss 1 for an error.
How does the optimal decision rule depend on the loss function values? Let us
work out the Bayes-optimal discriminant function y ∗ (x) for our cancer screening
example. It is positive (classifies 1) if
L(1, 0)\, P(t = 0|x) + L(1, 1)\, P(t = 1|x) < L(0, 0)\, P(t = 0|x) + L(0, 1)\, P(t = 1|x)
\Leftrightarrow (L(0, 1) - L(1, 1))\, P(t = 1|x) > (L(1, 0) - L(0, 0))\, P(t = 0|x)
\Leftrightarrow \log \frac{p(x|t = 1)}{p(x|t = 0)} + \log \frac{P(t = 1)}{P(t = 0)} > \log \frac{L(1, 0) - L(0, 0)}{L(0, 1) - L(1, 1)} = -\log \lambda
\Leftrightarrow y^*(x) = \log \frac{p(x|t = 1)}{p(x|t = 0)} + \log \frac{P(t = 1)}{P(t = 0)} + \log \lambda > 0.
The loss function values only shift the threshold of the optimal discriminant.
The larger λ, the more often the rule outputs 1 (human checkup).
[Figure 5.6: class-conditional densities p(x|C1), p(x|C2) (left) and posterior probabilities p(C1|x), p(C2|x) (right).]
Our insights from decision theory can be summed up in the statement that
optimal decision making is driven by probabilistic inference, the computation of
posterior probabilities. The standard approach to a problem is to model the joint
distribution p(x, t), compute the posterior P(t|x) by Bayes' formula, and base
the decision rule on this posterior.
However, we are in a comfortable position in this chapter: we know all of p(x, t),
and P (t|x) is easily computed. This is not so for most real-world situations.
The true, or even a very good model is unknown. We have to learn from data,
and computing posterior distributions can be very difficult. In the real world,
we have to find shortcuts to the standard approach. For example, if only the
posterior P (t|x) enters decision making, why don’t we model it directly, rather
than bothering with all of p(x, t)? It may well be that the precise shape of the
class-conditionals p(x|t) is irrelevant for optimal decision making (Figure 5.6),
in which case learning it from data is wasteful. This rationale underlies the
discriminative (as opposed to generative) approach to probabilistic modelling,
which we will learn about in Chapter 8. Even more economical, we could learn
discriminant functions directly, without taking the detour over the posterior
P (t|x). On the other hand, the posterior P (t|x) provides much more useful
information about a problem than any single discriminant function. For one, we
can evaluate our risk and act accordingly. Moreover, the combination of multiple
probabilistic predictors is easy to do by rules of probability.
Chapter 6
Probabilistic Models.
Maximum Likelihood
• The choice of ∆x is crucial. Too large, and we miss important details. Too
fine, and most bins will be empty. Histograms are smoothed out in kernel
density estimators [5, ch. 2.5.1].
• Histograms do not work in more than five dimensions or so. The number
of cells grows exponentially with the number of dimensions, and most cells
remain empty.
1 For the final grades data, responses are naturally quantized at bin width 0.5, so there is
a natural choice for ∆x.
Figure 6.1: Left: Final grades from Pattern Classification and Machine Learning
course 2010. Right: Histogram of data. Overlaid is the maximum likelihood fit
for a Gaussian distribution. For this data, the responses ti are quantized to
{0, 0.5, 1, . . . , 5.5, 6}, so the bin width is ∆x = 0.5.
• Histograms are often not versatile enough. There are no knobs we can
adjust in order to start analyzing the data, instead of just staring at it
from one angle.
At this point, we need to specify our assumptions about the data. Consider a
dataset D = {xi | i = 1, . . . , n}. The most commonly made assumption is that
the data points xi are independently and identically distributed (short: i.i.d.).
There exists a “true” distribution, from which the xi are drawn independently
(recall independence from Section 5.1.1).
2 We will not be concerned with multimodal densities in the present chapter, but the
techniques we develop here will be the major ingredient for Gaussian mixture models in
Section 12.2.
You throw it n = 100 times, thereby collecting data D = {x_1, ..., x_100}. We can
assume that this data is i.i.d. Now, if P\{x_1 = 1\} = p_1, the probability of
generating D is

P(D|p_1) = \prod_{i=1}^n p_1^{x_i} (1 - p_1)^{1 - x_i} = p_1^{n_1} (1 - p_1)^{n - n_1}, \qquad n_1 = \sum_{i=1}^n x_i.
n_1 is the number of times the thumbtack lands with point facing up. The distribution
of x_i is called Bernoulli distribution with parameter p_1. We can regard
P(D|p_1) as the probability of the data D under the model assumption. This has
nothing to do with the “true” probability of data, whatever that may be. After
all, for every value of p_1 ∈ [0, 1], we get a different probability P(D|p_1). This
model probability, as a function of the parameter p_1, is called likelihood function.
If our goal is to fit the model P\{x_1 = 1\} = p_1 with parameter p_1 to the
i.i.d. data D, it makes sense to maximize the likelihood. The maximum likelihood
estimator (MLE) for p_1 is the maximizer \hat{p}_1 = \operatorname{argmax}_{p_1 \in [0,1]} P(D|p_1).
Figure 6.3: Log likelihood functions for Bernoulli distributions (thumbtack ex-
ample). The sample size is n = 20 here. Note that the log likelihood function
log P (D|p1 ) depends on the data D only through p̂1 = n1 /n and n. Its mode is
at p1 = p̂1 , the maximum likelihood estimate.
Let us solve for p̂1 . Assume for now that n1 ∈ {1, . . . , n − 1}. Then, P (D|p1 ) > 0
for p1 ∈ (0, 1), P (D|p1 ) = 0 for p1 ∈ {0, 1}, so we can assume that p1 ∈
(0, 1). It is always simpler to maximize the log-likelihood log P (D|p1 ) instead.
The derivative is
d log P (D|p1 ) n1 n − n1 n1
= − =0 ⇔ n1 (1−p1 ) = p1 (n−n1 ) ⇔ p1 = .
dp1 p1 1 − p1 n
p̂_1 = n_1/n is indeed a maximum point, the unique maximizer, and this holds for
n_1 ∈ {0, n} as well. The maximum likelihood estimator (MLE) for p_1 = P\{x_1 = 1\} is

\hat{p}_1 = \frac{n_1}{n},
the fraction of throws in which the thumbtack lands point up. This is what we
would have estimated anyway. Indeed, maximum likelihood estimation often
coincides with common sense (ratios of counts) in simple situations. Bernoulli
log likelihood functions for the thumbtack setup are shown in Figure 6.3.
Is this “correct”? Remember that this question is ill-defined. Is it any good
then? Maximum likelihood estimation works well in many situations in prac-
tice. It comes with a well-understood asymptotic theory. Given that, we will see
that it can have shortcomings as a statistical estimation technique, in particular
if applied with small datasets, and we will study modifications of maximum like-
lihood in order to overcome these problems. For now, it is simply a surprisingly
straightforward and general principle for implementing density estimation.
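A small sketch of the thumbtack example (the counts n = 20, n_1 = 14 are assumptions for illustration): the grid maximizer of the log likelihood agrees with the closed form n_1/n.

import numpy as np

def bernoulli_log_likelihood(p1, n1, n):
    # log P(D | p1) = n1 log p1 + (n - n1) log(1 - p1)
    return n1 * np.log(p1) + (n - n1) * np.log(1 - p1)

n, n1 = 20, 14
grid = np.linspace(1e-3, 1 - 1e-3, 999)
ll = bernoulli_log_likelihood(grid, n1, n)
print("grid maximizer:", grid[np.argmax(ll)], " closed form:", n1 / n)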
The real importance of the Gaussian becomes apparent only for multivariate dis-
tributions. In fact, Gaussians are maybe the only distributions we can tractably
work with in high dimensions. The density of a Gaussian distribution of x ∈ Rd
is
N(x|\mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\big( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \big).   (6.2)
This looks daunting, but is nothing to worry about. |Σ| is the determinant of
Σ. Refresh your memory about determinants in Section 6.3.1. In Section 6.3.3,
we work out the form of this density from the univariate Gaussian, using trans-
formation rules which come in handy elsewhere as well. We also show there that
µ = E[x], Σ = Cov[x]. The parameters of the multivariate Gaussian are mean
and covariance matrix.
Not every matrix Σ ∈ Rd×d is allowed here. First, Σ must be invertible. Also,
− log N (x|µ, Σ) is a quadratic function. Recall our discussion of quadratic func-
tions in Section 4.2.2: Σ−1 , therefore Σ should be symmetric. It also must be a
valid covariance matrix. What does that mean? If x ∈ Rd is a random variable
with covariance Cov[x] (not necessarily Gaussian), then for any v ∈ Rd :
0 \le \operatorname{Var}[v^T x] = v^T \operatorname{Cov}[x]\, v.

A symmetric matrix A with v^T A v \ge 0 for all v \in \mathbb{R}^d is called positive
semidefinite: valid covariance matrices are precisely the symmetric positive
semidefinite matrices.
Figure 6.4: Left: Bivariate Gaussian density function. Right: Contours of Gaus-
sian density functions. Contours are spherical for isotropic Gaussians (no pre-
ferred direction), they are aligned with the standard coordinate axes for inde-
pendent Gaussians (diagonal covariance matrix). Each contour line is an ellipse,
whose major axes are given by the eigenvectors of the covariance matrix.
In this basic course, we will mainly be concerned with linear methods and mod-
els based on Gaussian distributions. However, it is increasingly appreciated in
statistical physics, statistics, machine learning, and elsewhere that many real-
world distributions of interest are not Gaussian at all. Relevant buzzwords are
“heavy-tailed” (or “fat tails”), “power law decay”, “scale-free”, “small world”,
etc. Real processes exhibit large jumps now and then, which are essentially
ruled out in Gaussian random noise. As Gaussian distributions are simple to
work with, these facts have been widely ignored until quite recently. In this
sense, parts of classical statistics, economics and machine learning are founded
on principles which do not fit the data.
Does this mean we waste our time learning about Gaussians? Absolutely not!
The most natural way to construct realistic heavy-tailed distributions with
power law decay is by mixing together Gaussians of different scales. Non-
Gaussian statistics is built on top of the Gaussian world. The bottom line is this.
Classical statistics use Gaussians and linear models to represent data directly.
Modern statistics employs non-Gaussian distributions and non-linear models,
which are built from Gaussians and linear mappings inside, mainly via the pow-
erful concept of latent variables (we will explore this idea in Chapter 12). Modern
methods behave non-Gaussian, but they are driven by Gaussian mathematics
and corresponding numerical linear algebra as major building blocks.
The multivariate Gaussian density (6.2) features a determinant |Σ| of the matrix
Σ. It is highly recommended that you study [42, ch. 5.1], even if you think you
know everything about determinants. The following is just a brief exposition,
mainly taken from there.
Every square matrix A ∈ R^{p×p} has a determinant |A|. Other texts use the
notation det A to avoid ambiguities, but |A| is most common and will be
used in this course. The number |A| ∈ R contains a lot of information about
the matrix A. First, |A| ≠ 0 if and only if A is invertible. If this is the case,
then |A^{-1}| = 1/|A|. You should memorize the 2 × 2 case:

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \ \Rightarrow\ |A| = ad - bc, \qquad A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.
Viewed as a function of its columns, F(a_1, \dots, a_p) = |[a_1, \dots, a_p]|, the
determinant is linear in each column separately. The product rule |AB| = |A||B|,
applied to AA^{-1} = I, gives |A||A^{-1}| = 1. The determinant of an upper triangular
matrix is the product of its diagonal entries.
The same holds for lower triangular matrices (and of course for diagonal
matrices).
So it all comes down to N (0, 1). Being an even function, its mean must be zero if
it exists (it may not, remember the Cauchy). Let us go for the variance directly:
if this exists, so does the mean (why?). We substitute r = t^2/2, so dr = t\, dt:

\operatorname{Var}[t] = 2 \int_0^\infty t^2 (2\pi)^{-1/2} e^{-t^2/2}\, dt = (2/\pi)^{1/2} \int_0^\infty (2r)^{1/2} e^{-r}\, dr
= 2\pi^{-1/2} \int_0^\infty r^{1/2} e^{-r}\, dr = 2\pi^{-1/2}\, \Gamma(3/2).

Since \Gamma(3/2) = \sqrt{\pi}/2, we get \operatorname{Var}[t] = 1, as required.
Here, ||L|| denotes the absolute value of the determinant |L|. This rule is easy
to understand. The differentials dx and dt are infinitesimal volume elements.
Picture dt as tiny hypercube. Since t → x involves multiplication with L, the
cube is transformed. What is its new volume? By our volume intuition about
the determinant, it must be dx = ||L||\, dt. Plugging this into (6.4):
e^{-\frac{1}{2}\|t\|^2} = e^{-\frac{1}{2}(x - \mu)^T L^{-T} L^{-1} (x - \mu)} = e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}.
Moreover, |L^T| = |L|, so that |\Sigma| = |L^T L| = |L|^2, and the prefactor becomes
||L||^{-1} = |\Sigma|^{-1/2}. All in all, the density of x is (6.2). Finally, we know how
mean and covariance transform (Section 5.1.3):
We have derived (6.2) from the univariate case and shown that µ, Σ are mean
and covariance respectively.
First,

\frac{\partial L}{\partial \mu} = \sum_{i=1}^n (\mu - x_i)/\sigma^2 = \frac{n}{\sigma^2} (\mu - \bar{x}), \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.
Therefore, µ̂ = x̄: the empirical mean. Note that this is a minimum point, since
the second derivative is positive. Plugging in µ = µ̂:
L(\sigma^2) = \frac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^2/\sigma^2 + \frac{n}{2} \log(2\pi\sigma^2) = \frac{n}{2} \big( S/\sigma^2 + \log(2\pi\sigma^2) \big), \qquad S = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2.
If \tau = \sigma^{-2}, then

L(\tau) = \frac{n}{2} \big( S\tau + \log(2\pi) - \log \tau \big), \qquad \frac{\partial L}{\partial \tau} = \frac{n}{2} \Big( S - \frac{1}{\tau} \Big).
Setting \partial L/\partial \tau = 0 gives \hat{\tau} = 1/S, hence \hat{\sigma}^2 = S. This is a minimum point, since

\frac{\partial^2 (2L/n)}{\partial \tau^2} = \frac{\partial}{\partial \tau} \big( S - 1/\tau \big) = \frac{1}{\tau^2} > 0.
The maximum likelihood estimator for mean and variance of a Gaussian (6.1)
is
\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2.
The Gaussian ML fit to our exam results data employs the empirical mean and
variance. For multivariate data D = {xi | i = 1, . . . , n}, xi ∈ Rd , the maximum
likelihood estimator is equally intuitive:
\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T,   (6.5)
sample mean and covariance. The latter holds only if the sample covariance has
full rank d (in particular, n ≥ d). Otherwise, the MLE for the covariance is
undefined. We derive these estimators in Section 6.4.3.
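A minimal numpy sketch of (6.5); note the 1/n normalization, which differs from the 1/(n−1) default of library routines such as np.cov:

import numpy as np

def gaussian_mle(X):
    # ML fit of a multivariate Gaussian: X has shape (n, d), one point per row.
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu                 # centered data points
    Sigma = (Xc.T @ Xc) / n     # (1/n) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma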
Expanding the squared distances, the \|x\|^2 terms cancel each other:

y^*(x) = (\mu_{+1} - \mu_{-1})^T x - \frac{1}{2} \big( \|\mu_{+1}\|^2 - \|\mu_{-1}\|^2 \big) + \log \frac{\pi_1}{1 - \pi_1}.   (6.6)
This is a linear discriminant with weight vector w = \mu_{+1} - \mu_{-1}. If c = \log\{\pi_1/(1 - \pi_1)\}, we can use \|\mu_{+1}\|^2 - \|\mu_{-1}\|^2 = (\mu_{+1} - \mu_{-1})^T (\mu_{+1} + \mu_{-1}) to obtain

y^*(x) = w^T (x - x_0), \qquad x_0 = \frac{1}{2} (\mu_{+1} + \mu_{-1}) - \frac{c}{\|w\|^2} w.
The geometrical picture is clear (Figure 6.6). Imagine a line through the class
means µ+1 , µ−1 . The optimal hyperplane is orthogonal to this line. It intersects
the line at x0 , which is the midpoint between the class means, (µ+1 + µ−1 )/2, if
and only if c = 0, or P (t = +1) = P (t = −1) = 1/2. For unequal class priors, x0
is obtained by translating the midpoint along the line, towards the class mean
whose P (t) is smaller. For this simple setup, we can compute the Bayes error
analytically (Section 6.4.2). For the special case c = 0 (equal class priors),
R^* = \Phi(-\|w\|/2), \qquad \Phi(x) = \int_{-\infty}^x (2\pi)^{-1/2} e^{-t^2/2}\, dt.
Here, Φ(x) is the cumulative distribution function of the standard normal distribution
N(0, 1): Φ(x) = P\{t ≤ x\}, t ∼ N(0, 1). As expected, R^* is a decreasing
function of the distance \|w\| = \|\mu_{+1} - \mu_{-1}\| between the class means, R^* = 1/2
for \mu_{+1} = \mu_{-1}, and R^* → 0 as \|w\| → ∞.
In order to use this in the real world, we can estimate the parameters µ+1 , µ−1 ,
and π1 = P (t = +1) from data D = {(xi , ti ) | i = 1, . . . , n}, using the principle
of maximum likelihood:

p(D|\mu_{+1}, \mu_{-1}, \pi_1) = \Big( \prod_{i=1}^n N(x_i|\mu_{t_i}, I) \Big)\, \pi_1^{n_1} (1 - \pi_1)^{n - n_1}, \qquad n_1 = \sum_i I_{\{t_i = +1\}}.
• Pick a model p(x, t|θ) = p(x|t, θ)P (t|θ), parameterized by θ. This choice
determines everything else, and for real-world situations there is no uni-
formly optimal recipe for it. In this course, we will learn about conse-
quences of modelling choices, so we can do them in an informed way. We
need two basic properties:
– For any fixed θ, the Bayes-optimal classifier is known and has a
reasonably simple form.
– ML density estimation is tractable for p(x|t, θ) and P(t|θ).
In our example above, p(x|t, θ) = N(x|\mu_t, I), P(t|θ) = \pi_1^{(1+t)/2} (1 - \pi_1)^{(1-t)/2}, and θ = [\mu_{+1}^T, \mu_{-1}^T, \pi_1]^T.
• Given training data D = {(xi , ti )}, estimate θ by maximizing the likeli-
hood, resulting in θ̂.
• The ML plug-in classifier is obtained by plugging the estimated parameters
θ̂ into the optimal rule (a small code sketch follows below).
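A minimal sketch of this recipe for the spherical Gaussian example above (the synthetic dataset and all names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)
# Synthetic data drawn from the assumed model p(x|t) = N(x | mu_t, I).
mu_true = {+1: np.array([1.0, 0.0]), -1: np.array([-1.0, 0.0])}
t = rng.choice([-1, +1], size=200)
X = np.stack([mu_true[ti] + rng.standard_normal(2) for ti in t])

# ML estimation of theta = (mu_+1, mu_-1, pi_1).
mu_pos, mu_neg = X[t == +1].mean(axis=0), X[t == -1].mean(axis=0)
pi1 = np.mean(t == +1)

def y_hat(x):
    # Plug-in discriminant of the form (6.6); classify +1 if positive.
    return ((mu_pos - mu_neg) @ x
            - 0.5 * (mu_pos @ mu_pos - mu_neg @ mu_neg)
            + np.log(pi1 / (1 - pi1)))

print(y_hat(np.array([1.0, 0.0])))   # positive: classified as +1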
Plug-in classifiers are examples of the generative modelling paradigm. Our goal
is to predict t from x, and we get there by estimating the whole joint density
by p(x, t|θ̂), then use Bayes’ formula to obtain the posterior P (t|x, θ̂), based
on which we classify. Another idea would be to estimate the posterior P (t|x)
directly, bypassing the modelling of inputs x altogether, which is what we do in
discriminative modelling. We will get to the bottom of this important distinction
in Chapter 8.
Equal Covariances ≠ I
which is what we obtained above. This means that the Bayes error is given by
the expressions derived in Section 6.4.2, replacing \|w\| by

\|\tilde{w}\| = \sqrt{(\mu_{+1} - \mu_{-1})^T \Sigma^{-1} (\mu_{+1} - \mu_{-1})} = \|\mu_{+1} - \mu_{-1}\|_{\Sigma}.
Multi-Way Classification
[Figure 6.7: ambiguous regions (marked “?”) when combining binary classifiers: one-against-rest (left; classes C1, C2, C3, decision regions R1, R2, R3, votes “not C1”, “not C2”) and one-against-one (right).]
The discriminant functions take the form y_k^*(x) = \mu_k^T x - \frac{1}{2}\|\mu_k\|^2 + \log P(t = k) + C
in the spherical covariance case, where C is a constant. Given our model assumptions,
the Bayes-optimal rule is f^*(x) = \operatorname{argmax}_{k \in T} y_k^*(x). The ML plug-in
classifier has the same form, plugging in ML estimates

\hat{\mu}_k = n_k^{-1} \sum_{i=1}^n I_{\{t_i = k\}} x_i, \qquad \hat{P}(t = k) = \frac{n_k}{n}, \qquad n_k = \sum_{i=1}^n I_{\{t_i = k\}}.
Importantly, the optimal rule for Gaussian class-conditional densities, while de-
fined in terms of K linear discriminant functions, is not a combination of binary
linear classifiers. A lot of effort has been spent in machine learning in order
to combine multi-way classifiers from binary linear ones. The most commonly
used heuristic, “one-against-rest”, uses K binary classifiers fk (x), discriminat-
ing t = k versus t 6= k. There are always regions in which the binary votes of all
f_k(x) remain ambiguous (Figure 6.7, left). Another heuristic uses K(K − 1)/2
binary classifiers f_{k,k'}(x), k ≠ k', discriminating t = k versus t = k'. Again,
their votes remain ambiguous in some regions (Figure 6.7, right). More complex
heuristics employ “error-correcting codes” or directed acyclic graphs. Decision
theory indicates that such attempts cannot in general attain optimal perfor-
mance5 .
where x ∼ N(\mu_{+1}, I). Denote w = \mu_{+1} - \mu_{-1}. If \tilde{x} = x - \mu_{+1} ∼ N(0, I), the
event is

\frac{1}{2}\|\tilde{x}\|^2 > \frac{1}{2}\|\tilde{x} + w\|^2 + c \ \Leftrightarrow\ w^T \tilde{x} < -\frac{1}{2}\|w\|^2 - c.
Since w^T \tilde{x} ∼ N(0, \|w\|^2), this event has probability

\Phi\Big( -\frac{\|w\|^2/2 + c}{\|w\|} \Big) = \Phi\Big( -\frac{1}{2}\|w\| - \frac{c}{\|w\|} \Big),
where

\Phi(x) = \int_{-\infty}^x N(t|0, 1)\, dt = \int_{-\infty}^x (2\pi)^{-1/2} e^{-t^2/2}\, dt
is the cumulative distribution function of N (0, 1). The second part of the error
is P {f ∗ (x) = 1 and t = −1}. Due to the symmetry of the setup, this must be
the same as the first if we replace indices +1 and −1 everywhere: \|w\| remains
unchanged, c is replaced by −c. Therefore, the Bayes error is

R^* = P(t = +1)\, \Phi\Big( -\frac{1}{2}\|w\| - \frac{c}{\|w\|} \Big) + P(t = -1)\, \Phi\Big( -\frac{1}{2}\|w\| + \frac{c}{\|w\|} \Big).
A few sanity checks. First, R^* → 0 as \|w\| → ∞. Also, R^* = \min\{P(t =
+1), P(t = -1)\} for \|w\| = 0: if x is independent of the target t, the optimal
rule uses P(t) only. In the equal class prior case P(t = +1) = P(t = -1) = 1/2,
R^* simplifies to \Phi(-\|w\|/2).
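These expressions are easy to evaluate numerically. A sketch using the identity Φ(x) = (1 + erf(x/√2))/2 (function names are illustrative):

from math import erf, log, sqrt

def Phi(x):
    # Cumulative distribution function of N(0, 1).
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bayes_error(norm_w, p_pos):
    # Bayes error for the two-Gaussian setup; norm_w = ||mu_+1 - mu_-1|| > 0.
    c = log(p_pos / (1.0 - p_pos))
    return (p_pos * Phi(-0.5 * norm_w - c / norm_w)
            + (1.0 - p_pos) * Phi(-0.5 * norm_w + c / norm_w))

print(bayes_error(2.0, 0.5))   # equal priors: Phi(-1), about 0.1587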
where C1 , C2 do not depend on µ. But this looks like the quadratic we know
from a Gaussian over µ, something like N (µ|x̄, Σ/n). We know this quadratic
is smallest at the mean, therefore at x̄. Note that we dropped additive terms
not involving µ immediately: they do not influence the minimization, and there
is no merit in keeping them around.
The derivation of Σ̂ is a bit more difficult. First, since µ̂ does not depend on
Σ, we can plug it into L, then minimize the result w.r.t. Σ. To this end, we will
compute the gradient ∇Σ L, set it equal to zero and solve for Σ. The gradient
w.r.t. a matrix may be a bit unfamiliar, but recall that matrix spaces are vector
spaces as well. Before we start, recall the trace of a square matrix:

\operatorname{tr} A = \sum_{j=1}^d a_{jj} = \mathbf{1}^T \operatorname{diag}(A), \qquad A \in \mathbb{R}^{d \times d},
the sum of the diagonal entries. Unlike the determinant, the trace is a linear
function: tr(A + αB) = tr A + α tr B. An important result is tr BC = tr C B,
given that the product is a square matrix. Also, \operatorname{tr} A^T = \operatorname{tr} A is obvious. Note that

\operatorname{tr} B^T C = \sum_{j,k} b_{jk} c_{jk},
Here, we dropped an additive constant and also the prefactor 1/2. We also used
the outer product manipulation involving the trace. If P = \Sigma^{-1}, then

\hat{P} = \operatorname{argmin}_P \big\{ f(P) = \operatorname{tr} S P - \log |P| \big\}, \qquad \hat{\Sigma} = \hat{P}^{-1},
(we used the symmetry of S here). Next, we will show a result which will be of
independent interest later during the course. At any P ∈ R^{d×d} with |P| > 0:
\nabla_P \log |P| = P^{-T}.   (6.7)
\frac{\partial \log |P|}{\partial p_{jk}} = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \big( \log |P + \varepsilon\, \delta_j \delta_k^T| - \log |P| \big).
Recall from Section 6.3.1 that we can do column eliminations without changing
the determinant. Eliminating all entries of v except v_k:

|M| = \big| I + (v_k - 1)\, \delta_k \delta_k^T \big| = v_k = (P^{-1})_{kj}.
Figure 6.8: The Reuters RCV1 collection is a set of 800,000 documents (news
articles), with about 200 words per document on average. After standard pre-
processing (stop word removal), its dictionary (set of distinct words) is roughly
of size 400,000. A common machine learning problem associated with this data
is to classify documents into groups (for example: politics, business, sports, sci-
ence, movies), which are often organized in a hierarchical fashion.
Such grouping schemes can be hierarchical (a document could be about politics,
US politics, Barack Obama) and may associate parts of a document with different topics.
Some vocabulary. The atomic unit is the word. A document is an ordered set
of words. A corpus (plural: corpora) is a dataset of documents, part of which
may be labeled according to a grouping scheme. Roughly, preprocessing works
as follows:
• Remove stop words: frequent words which occur in most documents and
tend to carry no discriminative information. For example: a, and, is, it,
be, by, for, to, . . .
P(x, N|t = k) = P(N) \prod_{j=1}^N p^{(k)}_{x_j},
where P(N) is a distribution over N which does not depend on t. For our
classification purposes, we can use P(x|N, t) in place of P(x, N|t), since the P(N)
factor cancels out in ratios like P(x, N|t = k)/P(x, N|t = k'), and these are all
we ever need in the end. If you are confused at this point, simply move on and
ignore the N in P(x|N, t) as a technical detail.
Imagine the build-up of P(x|N, t) for two distinct classes t = k, k', say business
(k) and sports (k'). For each word x_j of x, we multiply P(x|N, t = k)
and P(x|N, t = k') by p^{(k)}_{x_j} and p^{(k')}_{x_j} respectively. For example, if c_{x_1} = “CEO”,
presumably p^{(k)}_{x_1} > p^{(k')}_{x_1}, so the fraction P(x|N, t = k)/P(x|N, t = k') increases.
On the other hand, c_{x_5} = “Football” implies a decrease of the ratio by way of
multiplication with p^{(k)}_{x_5}/p^{(k')}_{x_5} < 1. Each word contributes to
the accumulation of evidence in a multiplicative fashion, independent of the
distribution of other words. The combined parameters^7 of this model are
θ = [(p^{(0)})^T, \dots, (p^{(K-1)})^T, P(t = 0), \dots, P(t = K - 1)]^T. In practice, we
deal with many documents x_i of different lengths N_i, and a representation with
M factors is preferable:
P(x|N, t = k) = \prod_{j=1}^N p^{(k)}_{x_j} = \prod_{m=1}^M \big( p^{(k)}_m \big)^{\phi_m(x)}, \qquad \phi_m(x) = \sum_{j=1}^N I_{\{x_j = m\}}.   (6.8)
φ_m(x) is the number of times the word c_m occurs in document x. Note that in
general, the majority of the counts will be zero. Also, \sum_{m=1}^M \phi_m(x) = N. The
feature vector φ(x) = [φ_m(x)] ∈ N^M summarizes the information in x required
to compute P(x|N, t = k) for all k = 0, ..., K − 1. Such summaries are called
sufficient statistics. Given our model assumptions, this is all the information we
need to know (and therefore, compute) in order to learn and predict. Whenever a
model for documents x has single word occurrence features as sufficient statistics,
it falls under the bag of words assumption. The same count vector is obtained
by permuting words in x arbitrarily; their ordering does not matter. It is as if
we cut x into single words, put them in a bag and mix them up. This seems like
a drastic assumption, but it is frequently made in information retrieval, since
the resulting computational simplifications are very substantial.
7 We do not include P (N ) in θ, since the discriminant functions do not depend on it.
8 Namely, \sum_m e^{w_{k,m}} = 1.
9 In particular, 0^0 = 1 everywhere in this course. This is mainly a convention to make
indicators work. The argument \lim_{x \to 0} x^x = \exp(\lim_{x \to 0} x \log x) = 1 is also convincing.
This was pretty simple. A more complex example is given by the likelihood for
our text classification model. Here, we use indicators I{xi,j =m} , where xi,j is the
j-th word of the i-th document, xi = [xi,j ]. Moreover, we use label indicators
I{ti =k} . The first step is to expand the likelihood, a product over i = 1, . . . , n,
by introducing products over label values k, word values m:
P(D|\theta) = \prod_{i=1}^n P(x_i, N_i|t_i)\, P(t_i) = \prod_{i=1}^n \prod_{k=0}^{K-1} \big( P(x_i|N_i, t = k)\, P(t = k)\, P(N_i) \big)^{I_{\{t_i = k\}}}
= \prod_{i=1}^n P(N_i) \prod_{k=0}^{K-1} P(t = k)^{I_{\{t_i = k\}}} \prod_{m=1}^M \big( p^{(k)}_m \big)^{I_{\{t_i = k\}} \sum_{j=1}^{N_i} I_{\{x_{i,j} = m\}}}.
The important point here is that the products over k and m are unconstrained
over all label and word values. All constraints are encoded in the indicators. In
particular, we can always interchange unconstrained products (or sums). Pulling
the product over data points inside, then summing exponents instead:
P(D|\theta) = C \prod_{k=0}^{K-1} P(t = k)^{\sum_i I_{\{t_i = k\}}} \prod_{m=1}^M \big( p^{(k)}_m \big)^{\sum_i I_{\{t_i = k\}} \sum_{j=1}^{N_i} I_{\{x_{i,j} = m\}}}, \qquad C = \prod_i P(N_i).
Here, \prod_i and \sum_i run over all data points. In the same way, we could write \prod_k
and \prod_m over all label and word values. Dropping the range in sums and products
is very commonly done in research papers, and we will sometimes follow suit to
keep notations simple.
Indicator variables provide a simple and mechanical way to convert the likelihood
in its original form (\prod_i) into the likelihood in a useful form (\prod_k \prod_m),
directly in terms of the model parameters P(t = k) and p^{(k)}_m. We have to
accumulate the counts which appear in the exponents:
n_k = \sum_i I_{\{t_i = k\}}, \qquad N^{(k,m)} = \sum_i I_{\{t_i = k\}} \sum_{j=1}^{N_i} I_{\{x_{i,j} = m\}}.
i i j=1
N (k,m) is the number of times the word cm occurs in the documents xi labeled
as ti = k, and nk is the number of documents labeled as ti = k. With these, the
log likelihood is
K−1
X K−1
X X M
log P (D|θ) = nk log P (t = k) + N (k,m) log p(k)
m + log C. (6.10)
k=0 k=0 m=1
There is no need to drag around additive constants, and we can drop log C at
will. Given the thumbtack example of Section 6.2 and common sense, you may
guess the following maximizers of the log likelihood:

\hat{p}^{(k)}_m = \frac{N^{(k,m)}}{\sum_{m'} N^{(k,m')}}, \qquad \hat{P}(t = k) = \frac{n_k}{n},   (6.11)
the ratios of empirical counts. This is correct. We establish these maximum
likelihood estimators for our text classification model in Section 6.5.3.
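A minimal numpy sketch of the count-based estimators (6.11) (array layout and names are illustrative assumptions; every class is assumed to occur at least once in the data):

import numpy as np

def fit_word_count_model(counts, t, K):
    # counts: (n, M) array, counts[i, m] = phi_m(x_i); t: (n,) labels in {0, ..., K-1}.
    n, M = counts.shape
    p_hat = np.empty((K, M))
    P_hat = np.empty(K)
    for k in range(K):
        mask = (t == k)
        N_k = counts[mask].sum(axis=0)   # N^(k,m) for m = 1, ..., M
        p_hat[k] = N_k / N_k.sum()       # note: zero counts yield p_hat = 0
        P_hat[k] = mask.sum() / n        # n_k / n
    return p_hat, P_hat

The comment on zero counts foreshadows the problem discussed at the end of this section.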
Each term in the rightmost product depends on a distinct part of the parameter
vector θ (here: a single coefficient p^{(k)}_m). This factorization assumption implies
crucial simplifications. The plug-in discriminants are linear functions, which can
be evaluated rapidly. More importantly, the log likelihood decomposes additively
and can be maximized efficiently.
A naive Bayes classifier does not have to use word count features φm (x) over
a dictionary, but can be run based on any feature map φ(x) = [φm (x)] over
the input point x. Naive Bayes far transcends text classification and can be
used with rather arbitrary input points, even discrete and continuous attributes
mixed together. Our example above is special, in that the different parameter
values p^{(k)}_m, m = 1, ..., M, come together to form one distribution in ∆_M. Naive
Bayes can be used with features and corresponding probabilities which are not
linked in any way. For example, consider a binary feature map φ̃(x) = [φ̃m (x)],
where φ̃m (x) ∈ {0, 1} are not linearly dependent like the word counts. If x is a
document once more, each φ̃m (x) could be sensitive to a certain word pattern,
indicating its presence by φ̃_m(x) = 1. A naive Bayes setup^{10} would be

P(x|t = k) = \prod_{m=1}^M \big( p^{(k)}_m \big)^{\tilde{\phi}_m(x)} \big( 1 - p^{(k)}_m \big)^{1 - \tilde{\phi}_m(x)}.   (6.12)
If you parse this expression with indicators in mind, you see that p^{(k)}_m encodes
P\{φ̃_m(x) = 1 \mid t = k\}. Once more, the log likelihood decouples additively over
the different p^{(k)}_m, and we can estimate them independently of each other. Naive
Bayes classification for binary features is summarized at the end of this subsection.
At this point, we should note that the term “naive Bayes” is not used consistently
in the literature. There is a narrow and a more general definition, and
we use the latter here. To understand the difference, consider the two examples
we looked at. The narrow definition stipulates that the class-conditionals
P(x|t = k) are such that, given t = k, the different features φ_m(x) are conditionally
independent in the probabilistic sense (see Section 5.1.1). This is the
case for our latter binary feature example (6.12), but it is not the case for the
document classification setup (6.8): the constraint \sum_{m=1}^M \phi_m(x) = N obviously
links the features. The more general definition of naive Bayes, adopted here,
requires the class-conditionals P(x|t = k) to factorize w.r.t. the parameters p^{(k)}_m,
so that the log likelihood decouples additively. However, the features may still
be linked^{11}, and so are the corresponding parameters.
10 For the meticulous: We should use P(φ̃(x)|t = k) instead of P(x|t = k), in order to be precise.
One final observation, in preparation of things to come. Back to the document
classification example. What happens if one specific word cm does not occur in
any documents xi labeled as ti = k? Go back up and check for yourself. The
ML estimator p̂^{(k)} will have p̂^{(k)}_m = 0 in this case. Under this distribution, it
is impossible that any document of class k ever contains c_m. In other words,
suppose I come along with a document x_* which contains c_m at least once:
φ_m(x_*) > 0. Then, the ML plug-in discriminant function ŷ_k(x) of the form
(6.9) is

\hat{y}_k(x_*) = \phi_m(x_*) \log \hat{p}^{(k)}_m + \dots = -\infty.
The hypothesis t = k is entirely ruled out for x∗ , due to the absence of a
single word cm from its training data. Does this matter? Yes, very much so.
Natural language distributions over words are extremely heavy-tailed, meaning
that new unseen words pop up literally all the time. The fact that some cm does
not occur in documents for some class will happen with high probability for any
real-world corpus. We will analyze this serious shortcoming of ML plug-in rules
in Chapter 7, where we will learn how it can be alleviated.
Suppose that φ̃(x) = [φ̃_m(x)], where φ̃_m(x) ∈ {0, 1}. Different from word
count features, the different φ̃_m can be on or off independently of each other.
Naive Bayes classification is based on the model (6.12). This means that given
t = k, the features φ̃_m(x) are conditionally independent. The classifier comes
with parameters p^{(k)}_m ∈ [0, 1], m = 1, ..., M, k = 0, ..., K − 1, one for each
feature and class. These are estimated by maximum likelihood. If there are n_k
input points x_i labeled as t_i = k, and n = \sum_{k=0}^{K-1} n_k, then

\hat{p}^{(k)}_m = \frac{\sum_{i=1}^n I_{\{t_i = k\}}\, \tilde{\phi}_m(x_i)}{n_k},

the fraction of points x_i of class k with φ̃_m(x_i) = 1. The ML estimator for P(t) is

\hat{P}(t = k) = \frac{n_k}{n}.
The trained classifier uses P̂(t = k) and P̂(x|t = k), the latter being (6.12) with
p̂^{(k)}_m plugged in. For example, the (posterior) probability of class k̃ for a new
point x is computed by first computing

\hat{P}(x|t = k)\, \hat{P}(t = k) = \Big( \prod_{m=1}^M \big( \hat{p}^{(k)}_m \big)^{\tilde{\phi}_m(x)} \big( 1 - \hat{p}^{(k)}_m \big)^{1 - \tilde{\phi}_m(x)} \Big) \frac{n_k}{n}.
11 For the meticulous: This linkage is typically a linear one (linear equality constraints),
expressed in not overly many constraints. It is possible to write down models with decoupling
log likelihood and intricate nonlinear constraints between the features. Such models would not
be called “naive Bayes” anymore.
In practice, we have to use log in order to convert products into sums, otherwise
we produce overflow or underflow. It is easiest to work in terms of discriminant
functions:

\hat{y}_k(x) = \log\big\{ \hat{P}(x|t = k)\, \hat{P}(t = k) \big\}
= \sum_{m=1}^M \Big( \tilde{\phi}_m(x) \log \hat{p}^{(k)}_m + (1 - \tilde{\phi}_m(x)) \log\big( 1 - \hat{p}^{(k)}_m \big) \Big) + \log \frac{n_k}{n}.
The naive Bayes classifier decides for the class \operatorname{argmax}_k \hat{y}_k(x). Moreover,

\hat{P}(t = \tilde{k}|x) = \frac{e^{\hat{y}_{\tilde{k}}(x)}}{\sum_k e^{\hat{y}_k(x)}}.
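Putting the pieces together, a minimal numpy sketch of the binary-feature naive Bayes classifier (assuming all p̂^{(k)}_m lie strictly inside (0, 1); names are illustrative):

import numpy as np

def nb_discriminants(phi, p_hat, P_hat):
    # phi: (M,) binary features; p_hat: (K, M); P_hat: (K,) class priors.
    return ((phi * np.log(p_hat)
             + (1 - phi) * np.log(1 - p_hat)).sum(axis=1) + np.log(P_hat))

def nb_predict(phi, p_hat, P_hat):
    y = nb_discriminants(phi, p_hat, P_hat)
    y = y - y.max()                       # subtract max before exp (stability)
    post = np.exp(y) / np.exp(y).sum()    # posterior P(t = k | x)
    return int(np.argmax(post)), post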
If you are given a distribution q ∈ ∆_L and seek the maximizer of
F_q(p) = \sum_{l=1}^L q_l \log p_l for p ∈ ∆_L, the unique answer is p̂ = q. This problem
appears over and over again^{12}
12 It holds just as well for distributions over continuous variables, even though our proof
is for the discrete case.
The key quantity is

D[q \,\|\, p] = \sum_{l=1}^L q_l \log \frac{q_l}{p_l}.

This function of two distributions over the same set is called relative entropy
(or Kullback-Leibler divergence, or also “cross-entropy” in the neural networks
literature). We need to show that D[q ‖ p] ≥ 0 for any q, p ∈ ∆_L, and that
D[q ‖ p] = 0 only if q = p. This is the information inequality (or Gibbs
inequality), one of the most powerful “inequality generators” in mathematics.
It holds for general probability distributions; a general proof is found in [10].
[Figure: the curves x − 1 and log(x); note log(x) ≤ x − 1, with equality only at x = 1.]
for all l = 1, ..., L. Suppose that q_{l_*} ≠ p_{l_*} for some l_* with q_{l_*} ≠ 0. Then,
ε = f(p_{l_*}/q_{l_*}) > 0, and

-D[q \,\|\, p] = E_q\Big[ \log \frac{p_l}{q_l} \Big] \le E_q\Big[ \frac{p_l}{q_l} - 1 - \varepsilon I_{\{l = l_*\}} \Big] = -\varepsilon q_{l_*} < 0,
Chapter 7

Generalization. Regularization
In this chapter, we introduce the concept of generalization and shed some light
on the phenomenon of over-fitting, which can negatively affect estimation-based
learning techniques. We will study regularization as a simple and frequently
effective remedy against over-fitting. Finally, MAP estimation is introduced as
a general way of regularizing ML estimation, employing prior distributions over
model parameters.
7.1 Generalization
The world around us is complex. And the closer we look, the more details we
see. Arguably, for a model to stand any chance of making interesting predictions, it
ought to reflect this complexity: many variables, high-dimensional feature maps,
a great depth of hidden layers separated by nonlinearities, so that training data
can be fit with high precision. What could be wrong with that?
Certainly, there will be computational issues. For highly detailed models, max-
imum likelihood estimation can be a hard problem to solve. But leaving these
issues aside, there is a fundamental problem: generalization. Understanding gen-
eralization is arguably the single most important lesson we will learn in this
course. Without a solid understanding of this concept, there can be no valid
statistics or useful machine learning. Its relevance goes far beyond statistics and
machine learning, essentially governing all natural sciences: Occam’s razor dic-
tates that we should always favour the simplest model appropriate for our data,
not the most complex one.
We start with some definitions which link back to decision theory (Sec-
tion 5.2) and discriminants (Chapter 2). Suppose you are given some data
D = {(xi , ti ) | i = 1, . . . , n}, ti ∈ {−1, +1}, and your goal is binary classi-
fication: finding a classifier f (x), mapping to labels {−1, +1}, which predicts
well. What does “predict well” mean? We can give several answers. After read-
ing Chapter 2, you might say: f (x) is a good classifier if it does well on the
Of course, we don’t know p∗ (x, t), we only know the data D. But we could use
D in order to learn about p∗ (x, t), using any number of subtle ideas (an exam-
ple is probabilistic modelling, leading to maximum likelihood plug-in rules; see
Chapter 6), and this may well lead to a classifier f (x) which does not minimize
the training error R̂n (f ) well, but attains a small test error R(f ). Training error
R̂n (f ) and test error R(f ) are different numbers, which in extreme cases can
have little relationship with each other. It is perfectly possible to attain very
small, even zero, training error and at the same time run up a large test error.
This nightmare scenario for statistical machine learning is called over-fitting.
This is all not very deep and fairly intuitive. But here is the interesting part. It
is possible to predict under which circumstances over-fitting is likely to occur.
Moreover, there are automatic techniques to guard against it, one of which we
will study in this chapter.
As far as over-fitting is concerned, there is nothing special about the training
error as a learning statistic. We will see that maximum likelihood estimation is
equally affected, where the statistic to minimize is the negative log likelihood.
Before we look into over-fitting and what to do about it, let us clarify our goals.
The correct answer above is the second: we wish to find a predictor with as small
a test error as possible. The catch with this goal is that it is in general impossible
to attain. We do not know p∗ (x, t), but only have a finite dataset D drawn from
it. The next best idea seems to select a classifier which minimizes the training
error, a statistic we can compute on D. This idea is a good one in general, it
works well in many cases. Yet training error minimization has some problems
which can make it fail badly in certain relevant situations. Understanding and
mitigating some of these problems is the subject of this chapter.
7.1.1 Over-fitting
We have already encountered over-fitting at several places. In Section 4.1, poly-
nomial curve fitting gave absurd, yet interpolating results for too high a poly-
nomial degree. In Section 4.2.3, we noted potential difficulties when solving
the normal equations of linear regression. In Section 6.4.1, we mentioned that
maximum likelihood plug-in classification can run into trouble if Gaussian class-
conditionals come with a full covariance matrix to be estimated. Finally, at the
end of Section 6.5, we observed an extreme sensitivity of our naive Bayes docu-
ment classifier to zero word counts. In this section, we expose the commonalities
between these issues.
Figure 7.1: Example of over-fitting for binary classification. The simple linear
discriminant on the left errs on a single pattern. In order to drive the training
error to zero, a more complex nonlinear discriminant is required (right). Given
the limited amount of data, the latter solution is less likely to generalize well.
In the Bayesian statistics approach to machine learning, over-fitting is ruled
out up front, at least in principle. Bayesian machine learning will not feature much in this
basic course, but see [28, 5].
[Figure 7.2: least squares polynomial fits for p = 1, 2, 4, 10 free parameters, over inputs in [0, 1].]
In Figure 7.2, these least squares solutions are plotted for n = 10 data points
and different numbers p of free parameters. The data comes from a smooth
curve, yet the targets ti are obscured by additive noise. In the presence of noise,
a good predictor should refrain from interpolating the points. However, as p
grows close to n, it is the interpolants which minimize the training error, no
matter how erratically they behave elsewhere. For p = n, the training error drops
to zero, all training points are fit exactly. For p > n, we face a non-identifiable
problem: infinitely many polynomials interpolate the data with training error
zero. However, even for p ≈ n, p ≤ n, the least squares solutions behave terribly.
As noted in Section 4.2.3, this is due to ill-conditioning. For p ≤ n, the design
matrix Φ ∈ Rn×p typically has full rank p, and the system matrix ΦT Φ of
the normal equations is invertible. However, in particular for large p and n,
some of its eigenvalues are very close to zero. Geometrically speaking, for some
directions d, d^T Φ^T Φ d = ‖Φd‖² ≈ 0. This means that our data remains silent
about contributions of the weight vector along d. Such matrices are called ill-
conditioned, and solving systems with them tends to produce solutions with
large coefficients, which in turn give rise to highly erratic polynomials. In short,
over-fitting occurs for least squares polynomial regression if the data is noisy
and the number of parameters p is overly close to n. We can do little about noise,
but in Section 7.2 we will encounter remedies against ill-conditioning and over-fitting.
7.2 Regularization
Simple learning techniques like training error minimization or maximum like-
lihood estimation can run into serious trouble, collectively termed over-fitting.
Will we have to sacrifice their simple geometrical structure and efficient learning
algorithms and do something else altogether? Will we have to painstakingly sift
through data, identify smallish counts and treat them by hand-tuned heuris-
tics? We don’t. It turns out that with a simple modification of the standard
techniques, we can alleviate some of the most serious over-fitting issues. Impor-
tantly, this modification does not add any computational complexity. In fact,
many algorithms behave better and may converge faster in that case. This idea
is called regularization (or penalization).
[Plot for Figure 7.3: data points, the true curve, and fits for ν = 0 (α ≈ 1.4·10⁶), ν = 10⁻⁷ (α = 1024), ν = 10⁻⁴ (α = 113), ν = 10⁻² (α = 6.8).]
Figure 7.3: Regularized polynomial curve fitting. n = 30 data points were drawn
from a smooth curve (dashed) plus Gaussian noise. Polynomials of degree p − 1
are used for the fitting, where p = 20. The black curve is the standard least
squares solution without regularization. The other curves are regularized least
squares solutions with different regularization parameter values ν. The α values
are α(ν) = ‖ŵ_ν‖, where ŵ_ν = argmin_w E_ν(w). Notice the extreme size of α(0)
for the non-regularized solution (ν = 0). Its erratic behaviour is smoothed out
in what amounts to better tradeoffs between data fit and curve complexity.
Recall how over-fitting manifests itself in polynomial curve fitting. Since some
directions are almost unconstrained by the data, contributions of the weight
vector w can become large along these. Not required to pay for it, least squares
estimation uses these degrees of freedom in order to closely fit all training points.
As these are noisy, large weights are required, and the least squares fit behaves
very erratically elsewhere (Figure 7.3, black curve). A remedy is to make LS
estimation pay for using large weights, no matter along which directions. We
do so by adding a penalty term to the criterion, minimizing the regularized squared error
E_ν(w) = E_0(w) + (ν/2)‖w‖²,  ν > 0,
where E_0(w) is the standard squared error and ν controls the strength of the regularization; the minimizer is ŵ_ν = (Φ^T Φ + νI)^{-1} Φ^T t.
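To make this concrete, here is a minimal Python sketch of regularized least squares for polynomial curve fitting, in the spirit of Figure 7.3. The generating curve, noise level, and all constants are illustrative assumptions, not the values behind the figure.

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 20                               # data points, free parameters (degree p - 1)
x = rng.uniform(-1.0, 1.0, size=n)
t = np.sin(2.5 * x) + 0.1 * rng.standard_normal(n)   # assumed smooth curve plus noise

Phi = np.vander(x, p, increasing=True)      # design matrix of polynomial features

def ridge_fit(Phi, t, nu):
    # Minimizer of the regularized squared error: (Phi^T Phi + nu I)^{-1} Phi^T t
    return np.linalg.solve(Phi.T @ Phi + nu * np.eye(Phi.shape[1]), Phi.T @ t)

for nu in [0.0, 1e-7, 1e-4, 1e-2]:
    w = ridge_fit(Phi, t, nu)
    print(f"nu={nu:g}: alpha(nu) = ||w|| = {np.linalg.norm(w):.2f}")

Running this, the non-regularized solution (nu = 0) typically shows a far larger weight norm than the regularized ones, mirroring the α values in Figure 7.3.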
Figure 7.4: Illustration of over-fitting and early stopping. Shown are error on
the training dataset (left) and on an independent validation dataset not used
for training (right), for an MLP applied to a regression problem (details in [5],
Figure 5.12); the horizontal axis shows the number of gradient descent iterations. The
training error curve is monotonically decreasing as expected. In contrast, the
validation error curve drops only up to a point, after which it increases. Early
stopping corresponds to monitoring the error on a validation set, terminating
MLP training once this statistic starts to increase.
Figure from [5] (used with permission).
There are other techniques to keep over-fitting at bay. One simple technique,
early stopping, is frequently used with multi-layer perceptrons or other neural
network models. Early stopping is based on monitoring an estimate of the test
error R(f) alongside training (minimization of R̂_n(f)). To do so, we hold out
some of our data exclusively for the purpose of validation: we split our data into
a training set D_T and a validation set D_V. Since the latter is never used for
training, the empirical error R̂(f̂; D_V) provides a reliable estimate of R(f̂) even
as f̂ is fitted to D_T. A typical MLP training run is shown in Figure 7.4. By
definition, the training error R̂(f̂; D_T) (left panel) decreases monotonically. In
contrast, the validation error R̂(f̂; D_V) does so only up to a point, after which
it begins to increase. We can stop the MLP training early at this point, since
any further decrease in training error does not lead to a decrease in R̂(f̂; D_V).
Compared to regularization by adding a complexity penalty term (for example,
a Tikhonov squared norm), early stopping has the advantage of not changing
the standard training procedure at all, so existing code does not have to be
modified. In contrast to penalization, whose success relies on a good choice of
the regularization parameter ν, early stopping requires only a validation split.
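The idea is easy to sketch in code. The following minimal Python example, with made-up data and a plain linear model rather than an MLP, monitors the validation error during gradient descent and remembers the weights at its minimum:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 8))
t = X @ rng.standard_normal(8) + 0.5 * rng.standard_normal(60)
X_T, t_T = X[:40], t[:40]                   # training set D_T
X_V, t_V = X[40:], t[40:]                   # validation set D_V, never used for training

w = np.zeros(8)
best_val, best_w = np.inf, w.copy()
for step in range(500):
    w -= 0.01 * X_T.T @ (X_T @ w - t_T) / len(t_T)   # gradient step on the training error
    val = np.mean((X_V @ w - t_V) ** 2)              # validation estimate of the test error
    if val < best_val:
        best_val, best_w = val, w.copy()             # keep the best weights seen so far
# Early stopping returns best_w, the weights at the validation error minimum.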
Let Φ^T Φ = U Λ U^T be its eigendecomposition, with orthonormal U = [u_1, ..., u_p] and Λ = diag[λ_j], λ_1 ≤ λ_2 ≤ ··· ≤ λ_p; the regularized least squares weights ŵ_ν can be expanded in this basis, as worked out below.
The predictor ŷ_ν(x) is a weighted sum of the basis functions φ(x)^T u_j, each being
the inner product between the feature map φ(x) and the eigendirection u_j.
At least for large p, these basis functions differ dramatically when we evaluate
them at the training points {x_i} only. Namely,
[φ(x_i)^T u_j]_{i=1,...,n} = Φ u_j,  ‖Φ u_j‖² = u_j^T Φ^T Φ u_j = λ_j.
For large p, the smallest eigenvalue λ_1 is typically very close to zero, so that
Φ u_1 ≈ 0, therefore φ(x_i)^T u_1 ≈ 0 for all i = 1, ..., n. In contrast, basis
functions for eigendirections with large λ_j vary substantially across the training points.
In fact,
Φ^T Φ + νI = U Λ U^T + ν U U^T = U (Λ + νI) U^T,
so that
(Φ^T Φ + νI)^{-1} = U (Λ + νI)^{-1} U^T
and
ŵ_ν = (Φ^T Φ + νI)^{-1} Φ^T t = U (Λ + νI)^{-1} ỹ = Σ_{j=1}^p (ỹ_j / (λ_j + ν)) u_j,  ỹ = U^T Φ^T t.
Writing ŵ_ν = Σ_{j=1}^p β_{ν,j} u_j, we therefore have
β_{0,j} = ỹ_j / λ_j,  β_{ν,j} = ỹ_j / (λ_j + ν) = (λ_j / (λ_j + ν)) β_{0,j}.
Regularization with ν > 0 transforms β0,j (standard least squares) to βν,j .
Obviously, |βν,j | < |β0,j |, but the amount of shrinkage depends strongly on λj :
λ_j / (λ_j + ν) ≈ 1 if λ_j ≫ ν (well determined by data);  λ_j / (λ_j + ν) ≪ 1 if λ_j ≪ ν (poorly determined by data).
Therefore, regularization precisely counteracts the erratic behaviour of the standard
least squares solution. Put differently, β_{0,j} is inversely proportional to λ_j,
explaining the very large contributions of the smallest eigendirections to ŵ_0
and ŷ_0(x). Regularization replaces the denominator λ_j by λ_j + ν, uniformly limiting
the influence of these poorly determined directions. In contrast, for well
determined directions with λ_j ≫ ν, regularization hardly makes a difference.
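A few lines of Python confirm this spectral picture numerically; the matrix sizes and ν below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((30, 20))
t = rng.standard_normal(30)
nu = 1e-2

lam, U = np.linalg.eigh(Phi.T @ Phi)        # Phi^T Phi = U diag(lam) U^T
y_tilde = U.T @ (Phi.T @ t)                 # coefficients y~ = U^T Phi^T t
w_spectral = U @ (y_tilde / (lam + nu))     # sum_j y~_j / (lam_j + nu) u_j
w_direct = np.linalg.solve(Phi.T @ Phi + nu * np.eye(20), Phi.T @ t)
print(np.allclose(w_spectral, w_direct))    # True: the two routes agree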
The expansion of ŵ_ν in the orthonormal eigendirections u_j implies that for
0 ≤ ν_1 < ν_2:
‖ŵ_{ν_2}‖² = Σ_{j=1}^p β_{ν_2,j}² = Σ_{j=1}^p ỹ_j² / (λ_j + ν_2)² < Σ_{j=1}^p ỹ_j² / (λ_j + ν_1)² = Σ_{j=1}^p β_{ν_1,j}² = ‖ŵ_{ν_1}‖².
The first equality uses the orthonormality: u_i^T u_j = I_{i=j}. On the other hand,
the squared training error increases with ν:
‖Φ ŵ_{ν_2} − t‖² > ‖Φ ŵ_{ν_1} − t‖².
To establish this, assume that the opposite holds, E_0(ŵ_{ν_2}) ≤ E_0(ŵ_{ν_1}). Then,
E_{ν_1}(ŵ_{ν_2}) = E_0(ŵ_{ν_2}) + (ν_1/2)‖ŵ_{ν_2}‖² < E_0(ŵ_{ν_2}) + (ν_1/2)‖ŵ_{ν_1}‖²
≤ E_0(ŵ_{ν_1}) + (ν_1/2)‖ŵ_{ν_1}‖² = E_{ν_1}(ŵ_{ν_1}),
contradicting the fact that ŵ_{ν_1} minimizes E_{ν_1}(w).
7.3 Maximum A-Posteriori Estimation
In the thumbtack example, we toss a thumbtack n times and observe “point up” n_1 times. We estimate p_1, the probability of “point up”, by maximizing the likelihood P(D|p_1) = (p_1)^{n_1} (1 − p_1)^{n−n_1} over p_1 ∈ [0, 1],
resulting in the ML estimator p̂1 = n1 /n. However, having enjoyed some training
in physics, you take a good look at the thumbtack and convince yourself that
its shape implies that p1 > 1/2 is substantially more probable than p1 < 1/2
(“if not, I will eat my hat”). It is not that you dare to pin down the value p1
from first principles. But you are more or less certain about some properties of
p1 . You could formulate your thoughts in a probability distribution over p1 .
Wait a second. “More probable”, “probability distribution”? p1 is not random,
it’s just a parameter we don’t know. Recall our introduction of probability
in Section 5.1. We do not care whether p1 is “random” or just an unknown
parameter (whatever this distinction may mean). We are uncertain about it,
and that is enough. We encode our uncertainty in the precise value of p1 by
treating it as random variable. In other words, we can maintain a distribution
over p1 in the same way as we maintain one over x. The latter is a condi-
tional distribution P (x|p1 ). If you plug in data D, this becomes the likelihood
P(D|p_1) = ∏_i P(x_i|p_1). The former³ is p(p_1), called the prior distribution over p_1.
If this vocabulary reminds you of decision theory (Section 5.2), you are on the
right track. Back there, we predict a class label t from an observed input point
x as follows. We know the class-conditional distributions p(x|t) and the class
³ p(p_1) is a density, since p_1 ∈ [0, 1] is continuous.
prior P (t). We determine the class posterior P (t|x) = p(x|t)P (t)/p(x) by Bayes’
formula, then predict f ∗ (x) = argmaxt P (t|x). The normalization by p(x) does
not matter, so that f ∗ (x) = argmaxt p(x|t)P (t). Here, we observe data D and
want to predict the probability p1 . We determine the posterior distribution
p(p_1|D) = P(D|p_1) p(p_1) / P(D),  P(D) = ∫ P(D|p_1′) p(p_1′) dp_1′,
and compute
p̂_1^MAP = argmax_{p_1∈[0,1]} p(p_1|D) = argmax_{p_1∈[0,1]} { P(D|p_1) p(p_1) },
the maximum point of the posterior, the maximum a-posteriori (MAP) esti-
mator. How does this differ from ML estimation? As usual, we minimize the
negative log:
− log p(p_1|D) = − log P(D|p_1) − log p(p_1) + C,  C = log P(D).
The MAP criterion is the sum of the negative log likelihood and the negative
log prior. The first quantifies data fit and plays the role of an error function.
The second is a regularizer. MAP estimation is a form of regularized estimation,
obtained by choosing the negative log prior as regularizer.
Figure 7.5: Density functions of the Beta distribution Beta(α, β) for different
values (α, β).
Figures from Wikipedia (used with permission). Copyright by Krishnavedala,
Creative Commons Attribution-Share Alike 3.0 Unported license.
What is a good prior p(p1 )? In principle, we can choose any distribution we like.
The choice of a prior is simply a part of the choice of a probabilistic model. But
certain families of models are easier to work with than others, and the same
holds for prior distributions. For our thumbtack example, a prior from the Beta
family Beta(α, β) is most convenient:
p(p_1|α, β) = (1/B(α, β)) (p_1)^{α−1} (1 − p_1)^{β−1} I_{p_1∈[0,1]},  B(α, β) = Γ(α)Γ(β) / Γ(α + β),  α, β > 0.
Here, Γ(x) is Euler’s Gamma function (Section 6.3.2). Before we move on, a gen-
eral hint about working with probability distributions. Never write, or even try
to remember, the normalization constants. What matters is the part depending
on p1 . Memorize p(p1 |α, β) ∝ (p1 )α−1 (1 − p1 )β−1 as Beta(α, β) density. Here,
A ∝ B reads “A proportional to B”, or A = CB for some constant C > 0. If you
need the normalization constant at all (you usually do not), you can look it up.
In particular, keeping normalization constants up to date during a derivation of
a posterior is a waste of time and an unnecessary source of mistakes.
Back to Beta(α, β). It is a distribution over a probability p1 ∈ [0, 1]. Its mean
is E[p1 ] = α/(α + β). Density functions for a range of (α, β) values are shown
in Figure 7.5. The larger the sum α + β, the more peaked the distribution. The
density is bounded above only if α ≥ 1, β ≥ 1. Beta(1, 1) is a special case you
already know: the uniform distribution over [0, 1]. For α > 1, β > 1, the unique
mode (maximum point) of the density is (α − 1)/(α + β − 2), which is always
further away from 1/2 than the mean, unless α = β. Beta(α, β) is symmetric
around 1/2 if and only if α = β. What is special about this family? You may have
noticed the similarity between p(p_1|α, β) and the likelihood P(D|p_1). Indeed,
p(p_1|D) ∝ P(D|p_1) p(p_1|α, β) ∝ (p_1)^{n_1+α−1} (1 − p_1)^{n−n_1+β−1},
so that the posterior is again a Beta distribution, Beta(α + n_1, β + (n − n_1)): α and β are simply added to the counts n_1 and n − n_1. Families of priors which are closed under updating by the likelihood in this way are called conjugate.
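In code, the resulting estimators are one-liners; the counts and prior parameters below are invented for illustration:

# ML vs. MAP for the thumbtack example, under a Beta(alpha, beta) prior.
n, n1 = 10, 7                 # tosses, "point up" outcomes (assumed)
alpha, beta = 4.0, 2.0        # prior pseudo-counts, encoding that p1 > 1/2 is more probable

p1_ml = n1 / n                                        # maximum likelihood estimator
p1_map = (n1 + alpha - 1) / (n + alpha + beta - 2)    # mode of Beta(alpha + n1, beta + n - n1)
print(p1_ml, p1_map)          # 0.7 vs. 10/14 = 0.714...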
Finally, some more advanced comments, which can be skipped at first reading.
ML estimation was motivated by treating the likelihood P (D|p1 ) as a scoring
function to assess data fit, whose maximization w.r.t. p1 intuitively makes sense.
However, the posterior p(p1 |D) is a distribution over p1 . Why should the pos-
terior be particularly well represented by its mode? Why not for example the
mean E[p1 | D]? Our analogy with decision theory is useful, in that it motivates
the role of the prior p(p1 ), but it is not perfect. p1 is not a class label from a
finite set, but a continuous probability. Why not output all of p(p_1|D) as the result,
given that it is just a Beta distribution with two parameters? These questions
are resolved in the Bayesian approach to machine learning, the ultimately cor-
rect way to implement decision theory from finite data, whereas ML or MAP
estimation are computationally attractive shortcuts. Bayesian machine learning
will not feature much in this course, but [28, 5] provide good introductions.
If MAP corresponds to regularized ML estimation, then Beta(α, β) implies the
following regularizer:
− log p(p_1|α, β) = −(α − 1) log p_1 − (β − 1) log(1 − p_1)
= −(α + β − 2) ( q_1 log p_1 + (1 − q_1) log(1 − p_1) ),
dropping an additive constant. Here, q_1 = (α − 1)/(α + β − 2) is the mode of p(p_1|α, β). This is
certainly not a quadratic function of p_1. What is it then? The attentive reader may
remember the pattern from Section 6.5.3, which allows us to write the regularizer, up to
another additive constant, as (α + β − 2) D[q_1 ‖ p_1], a relative entropy between the prior mode and p_1.
Recall our naive Bayes document classifier from Section 6.5. Its over-fitting issues
have been discussed in Section 7.1.1, where we noted a surprisingly simple and
widely used fix: Laplace smoothing, add 1 to each count and move on. Compare
this to MAP estimation for the thumbtack example, where we added α and β
to the counts n1 and n − n1 . Maybe Laplace smoothing is MAP estimation with
a prior much like Beta? Abstracting from naive Bayes details, suppose that
p ∈ ∆_M = { q | q_m ≥ 0, m = 1, ..., M, Σ_{m=1}^M q_m = 1 }.
The natural generalization of the Beta family to ∆_M is the Dirichlet family, with densities
Dir(p|α) ∝ ∏_{m=1}^M (p_m)^{α_m − 1} I_{p∈∆_M},  α = [α_m], α_m > 0.
The normalization constant is not important for our purposes here. Densi-
ties for different Dirichlet distributions over ∆3 are shown in Figure 7.6. Its
properties parallel those of the Beta. The mean is E[p|α] = α/(1^T α), and
1^T α = Σ_{m=1}^M α_m determines the concentration of the density. Dir(p|1) is
the uniform distribution over ∆M . If any αm < 1, the density grows un-
boundedly as p → δ m . If all αm > 1, the density has a unique mode at
(α − 1)/(1^T α − M). Most importantly, the Dirichlet family is conjugate for
the likelihood P(D|p) = ∏_{m=1}^M (p_m)^{N^{(m)}} of i.i.d. data sampled from p. If the
prior is p(p) = Dir(p|α), then the posterior is
p(p|D) ∝ ∏_{m=1}^M (p_m)^{N^{(m)}+α_m−1} I_{p∈∆_M} ∝ Dir(p | α + [N^{(m)}]).
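The following Python fragment sketches this for made-up counts, showing that the Dirichlet-MAP mode with all α_m = 2 reproduces Laplace smoothing (add 1 to each count):

import numpy as np

counts = np.array([0, 3, 7, 2])             # observed counts N^(m), illustrative
M, N = len(counts), counts.sum()
alpha = np.full(M, 2.0)                     # Dir(alpha) prior with alpha_m = 2

p_ml = counts / N                           # ML estimate: can contain exact zeros
p_map = (counts + alpha - 1) / (N + alpha.sum() - M)   # mode of Dir(alpha + counts)
p_laplace = (counts + 1) / (N + M)          # Laplace smoothing: add 1 to each count
print(np.allclose(p_map, p_laplace))        # True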
As a final example, consider ML estimation of the covariance matrix of a zero-mean Gaussian from data D = {x_i}: the maximum likelihood estimator is the sample covariance Σ̂ = n^{-1} Σ_{i=1}^n x_i x_i^T.
Note that this lacks the −x̄x̄^T term of the standard sample covariance, since the
mean is fixed to zero here. If p is comparable to the sample size n, then Σ̂ is ill-
conditioned, and the ML plug-in classifier is adversely affected (Section 7.1.1).
One remedy is to use MAP estimation instead, with a Wishart prior distribution.
Let us focus on the precision matrix P = Σ−1 . The likelihood is
p(D|P) ∝ |P|^{n/2} e^{−(1/2) Σ_{i=1}^n x_i^T P x_i} = |P|^{n/2} e^{−(n/2) tr(Σ̂ P)}.
Here, we used manipulations involving the trace which were introduced in Sec-
tion 6.4.3. The Wishart family contains distributions over symmetric positive
definite matrices, with densities of the form
p(P) = W(V, α) ∝ |P|^{(α−p−1)/2} e^{−(α/2) tr(V P)},  α degrees of freedom.
This is of course conjugate to our Gaussian likelihood for the precision matrix.
For the Wishart prior p(P ) = W (I, α) with mean I, the posterior is
p(P|D) ∝ |P|^{(α+n−p−1)/2} e^{−(α/2) tr P − (n/2) tr(Σ̂ P)}
∝ |P|^{(α+n−p−1)/2} e^{−((α+n)/2) tr( ( (α/(α+n)) I + (n/(α+n)) Σ̂ ) P )},
Wishart with α + n degrees of freedom and mean (αI + nΣ̂)/(α + n). Its mode
is the inverse of
Σ̂^MAP = (αI + nΣ̂) / (α + n − p − 1),
derived as in Section 6.4.3. If λ1 ≤ · · · ≤ λp are the eigenvalues of Σ̂, then
Σ̂^MAP has the same eigenvectors, but eigenvalues
λ_j^MAP = (α + nλ_j) / (α + n − p − 1).
In particular, λ_1^MAP ≥ α/(α + n − p − 1). Eigenvalues are bounded away from
zero, alleviating the over-fitting problems of the plug-in classifier. To conclude,
the MAP estimator with a Wishart prior with mean I is obtained from the stan-
dard sample covariance matrix by regularizing the eigenspectrum. Such spectral
regularizers are widely used in machine learning.
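A short numerical sketch of this spectral regularization follows; the dimensions and α are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(3)
p, n, alpha = 20, 25, 30.0                  # p comparable to n: Sigma_hat is ill-conditioned
X = rng.standard_normal((n, p))
Sigma_hat = (X.T @ X) / n                   # zero-mean sample covariance

Sigma_map = (alpha * np.eye(p) + n * Sigma_hat) / (alpha + n - p - 1)
lam_min = np.linalg.eigvalsh(Sigma_map).min()
print(lam_min, alpha / (alpha + n - p - 1)) # smallest eigenvalue is bounded below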
Chapter 8
Conditional Likelihood.
Logistic Regression
The machine learning methods we encountered so far can be classified into two
different groups. We can pick some error function and function class, then opti-
mize for the error-minimizing hypothesis within the class. In order to alleviate
over-fitting problems, the error function can be augmented by a regularization
term. Examples for this approach include perceptron classification, least squares
estimation or multi-layer perceptron learning. Alternatively, we can capture our
uncertainty in relevant variables by postulating a probabilistic model, then use
probability calculus for prediction or decision making. The most important ideas
in this context are maximum likelihood estimation (Chapter 6) and maximum
a-posteriori estimation (Section 7.3). Both paradigms come with strengths and
weaknesses. Probabilistic modelling is firmly grounded on decision theory (Sec-
tion 5.2), and it is more “user-friendly”. As noted in Section 5.1, it is natural for
us to formulate our ideas about a setup in terms of probabilities and conditional
independencies among them, while constructing some regularizer or feature map
from scratch is unintuitive business. On the other hand, a probabilistic model
forces us to specify each and every relationship between variables in a sound
way, which can be more laborious than fitting some classifier to the data.
Its gradient is easily computed, and we can run gradient descent optimization
(Section 2.4.1). If y(xi ; w) is linear in the weights w, minimizing Esq (w) (least
squares estimation) corresponds to orthogonal projection onto the linear model
space (Section 4.2.1), for which we can use robust and efficient algorithms from
numerical mathematics (Section 4.2.3). For multi-layer perceptrons, the gradient
can be computed by error backpropagation. Why would we ever use anything
else? As we will see in this section, the squared error function is not always
a sensible choice in practice. In many situations, we can do much better by
optimizing other error functions. Switching from squared error to something
else, we should be worried about sacrificing the amazing properties of least
squares estimation. However, as we will see in the second half of this chapter,
such worries are unfounded. The minimization of most frequently used error
functions can be reduced to running least squares estimation a few times with
reweighted data points. MLP training works pretty much the same way with all
these alternatives, in particular error backpropagation remains essentially the
same.
Where do alternative error functions come from? How do we choose a good error
function for a problem out there? We will address this question in much the same
spirit as in the last two chapters: we let ourselves be guided by decision theory
and probabilistic modelling. We will discover a second way of doing maximum
likelihood estimation and learn about the difference between generative and
discriminative (or diagnostic) modelling.
In this section, we work through some examples, demonstrating why the squared
error (8.1) can be improved upon in certain situations. Recall that we introduced
Esq (w) in Section 2.4 for training a binary classifier, even before discussing the
classical application of least squares linear regression (Section 4.1). Is Esq (w)
a good error function for binary classification? No, it is not! Let us see why.
Consider a binary classification dataset D = {(xi , ti ) | i = 1, . . . , n}, where
ti ∈ {−1, +1}, for which we wish to train a linear classifier f (x) = sgn(y(x)),
y(x) = wT φ(x). The only other classification error function we know up to
now is the perceptron error
E_perc(w) = Σ_{i=1}^n g(−t_i y(x_i)),  g(z) = z I_{z≥0},
derived in Section 2.4.2, so let us compare the two. Before we do that, one
comment is necessary in the light of Chapter 2. We do not have to assume in this
comparison that the dataset D is linearly separable. The perceptron algorithm
from Section 2.3 will fail for a non-separable set, but other algorithms1 minimize
Eperc (w) properly for any dataset.
Figure 8.1: Top: Perceptron and squared error for binary classification, shown
as function of ty. Notice the rapid increase of the squared error on both sides of
ty = 1. Bottom: Two binary classification datasets. On both sides, we show the
least squares discriminant (squared error; magenta line) and the logistic regres-
sion discriminant (logistic error; green line). The logistic error is introduced in
Section 8.2. For now, think of it as a smooth approximation of the perceptron
error. Bottom right: The least squares fit is adversely affected by a few additional
datapoints in the lower right, even though they are classified correctly by a large
margin. Figures on the bottom from [5] (used with permission).
Note that the perceptron error per pattern is a function of ti y(xi ), which is
positive if and only if y(·) classifies (xi , ti ) correctly. We can also get the squared
error into this form (using that t2i = 1):
E_sq(w) = (1/2) Σ_{i=1}^n (y(x_i) − t_i)² = (1/2) Σ_{i=1}^n (t_i y(x_i) − 1)².
¹ E_perc(w) is convex and lower bounded. It can be minimized by solving a linear program,
which you may derive for yourself.
We can compare the error functions by plotting their contribution for pattern
(x_i, t_i) as a function of t_i y_i, where y_i = y(x_i) (Figure 8.1, top). As far as sim-
ilarities go, both error functions are lower bounded by zero and attain zero
only for correctly classified patterns (yi = ti for squared error). But there are
strong differences. The perceptron error does not penalize correctly classified
patterns at all, which is why the perceptron algorithm never updates on such.
Also, Eperc (w) grows gracefully (linearly) with −ti yi for misclassified points. In
contrast, the squared error grows quadratically in ti yi on both sides of 1. Misclas-
sified patterns (ti yi ≤ 0) are penalized much more than for the perceptron error.
The behaviour of the squared error to the right of 1 is much more bizarre. t_i y_i > 1
means that y(·) classifies (xi , ti ) correctly with a large margin (Section 2.3.3).
That is a good thing, but the squared error penalizes it! In fact, it penalizes
it the more, the larger the margin is! This leads to absurd situations, as de-
picted in Figure 8.1, bottom right. This irrational behaviour of the squared
error, applied to binary classification, can lead to serious problems in practice.
Minimizing the squared error is not a useful procedure to train a classifier. In
this section, we learn about alternatives which are only marginally more costly
to run and require about the same coding efforts.
Why don’t we simply always use the perceptron error? While it is widely
used, it comes with problems of its own. It is not differentiable everywhere,
so most gradient-based optimizers do not work without subtle modifications.
Also, Eperc (w) does not assign any error to correctly classified points (xi , ti ),
which means that all separating hyperplanes are equally optimal under this cri-
terion. A different and more compelling reason for using other error functions
is discussed in Section 8.2.2.
As noted in Section 6.3, random errors are often captured well by a Gaussian
distribution. Let us choose p(t|y) = N (y, σ 2 ), meaning that each εi is drawn
from N (0, σ 2 ), where σ 2 is the noise variance. Here is a surprise. The negative
log conditional likelihood for Gaussian noise is the squared error function:
− log p(t|w) = Σ_{i=1}^n (1/(2σ²)) (y(x_i; w) − t_i)² + C = (1/σ²) E_sq(w) + C,
C = (n/2) log(2πσ²).   (8.2)
Least squares estimation of w, minimizing Esq (w), is equivalent to maximizing
a Gaussian conditional likelihood.
This observation has several important consequences. First, it furnishes us with
statistical intuition about the squared error. Independent of computational ben-
efits and simplicity, it is the correct error function to minimize if residuals away
from a smooth curve behave more or less like Gaussian noise. On the other
hand, if ti are classification labels and y(x) a discriminant, or if the problem is
curve fitting, but we expect large errors to occur with substantial probability,
then other criteria can lead to far better solutions, and we should make the
additional effort to implement estimators for them. Second, just as squared er-
ror and Gaussian noise are related by the now familiar − log(·) transform, the
same holds for other such pairs. It is fruitful to walk this bridge in both ways.
Given an exotic error function, it may correspond to a negative log conditional
likelihood, which provides a clear idea about the scales of residuals to which the
error is most sensitive. And if a conditional density p(ti |yi ) captures properties
of the random errors we expect for our data, we have to search no further for a
sensible error function to minimize: Σ_{i=1}^n − log p(t_i|y_i).
Our running example couples linear functions y(x) = w^T φ(x) with a Gaussian noise model p(t|y) = N(y, σ²).
In our examples so far, y(x) was a univariate function (a “clean” curve in
regression, or a discriminant function in binary classification), and noise was
additive (ti = yi +εi , where εi ∼ pε (·) i.i.d. and independent of yi ), but these are
special cases. In the sequel of this chapter, we will learn to know a non-additive
noise model for binary classification. Its extension to multi-way classification
will necessitate a multivariate systematic mapping y(x).
Given such a setup, we can fit the conditional model to data D = {(xi , ti ) | i =
1, . . . , n} by maximizing the conditional likelihood
p(t|w) = ∏_{i=1}^n p(t_i|y_i),  y_i = y(x_i; w).
We treat any additional parameters of the noise model, such as the variance σ², as hyper-
parameters fixed and given, and concentrate our learning efforts on the weights
(or primary parameters) w.
Figure 8.3: Logistic function σ(v) = 1/(1 + e^{−v}), the transfer function implied
by the logistic regression likelihood.
P (t|x) depends on x only through y ∗ (x), so this is the noise model we have
been looking for: P (t|y) = σ(ty).
This went a bit fast, so let us repeat what we did here. We used a simple
generative setup of binary classification (two Gaussian class-conditional distri-
butions with unit covariance matrix) in order to specify a noise model P (t|y)
for conditional maximum likelihood. Given the joint setup p(x, t) = p(x|t)P (t),
we worked out the posterior P (t|x) and the corresponding optimal discriminant
function y ∗ (x) in Section 6.4.1. Two things happen. First, y ∗ (x) turns out to be
linear. Second, the posterior distribution P (t|x) can be expressed as P (t|y ∗ (x))
for a conditional distribution P (t|y), y ∈ R. In probabilistic terms, t and x are
conditionally independent given y ∗ (x). The “two spherical Gaussian” generative
setup exhibits the separation into systematic part and random noise required
for conditional modelling (Section 8.1.3). The systematic part y ∗ (x) is linear,
while the noise model turns out to be logistic: P (t|y) = σ(ty).
There is nothing special about the setup with spherical Gaussian class-
conditional distributions p(x|t). The logistic link between y ∗ (x) and P (t|y) is
simply an inversion of the definition of the log odds ratio (8.3) in terms of P (t|y).
Our conditional likelihood requirements are satisfied whenever the log odds ra-
tio is linear in the parameters w to be learned, and this happens for a range
of other generative models as well. An example is the naive Bayes document
classification model for K = 2 classes (Section 6.5).
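A quick numerical check of this link, with made-up means, class prior, and input point:

import numpy as np

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))

mu_p, mu_m, pi1 = np.array([1.0, 0.0]), np.array([-1.0, 0.5]), 0.4   # assumed setup
x = np.array([0.3, -0.2])

def log_gauss(x, mu):
    # log N(x | mu, I), dropping constants shared by both classes
    return -0.5 * np.sum((x - mu) ** 2)

# Class posterior by Bayes' formula ...
num = np.exp(log_gauss(x, mu_p)) * pi1
post = num / (num + np.exp(log_gauss(x, mu_m)) * (1 - pi1))
# ... equals the logistic function of the log odds ratio y*(x)
y_star = log_gauss(x, mu_p) - log_gauss(x, mu_m) + np.log(pi1 / (1 - pi1))
print(post, sigma(y_star))                  # the two values agree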
[Plot: perceptron, squared, and logistic error per pattern, as functions of ty.]
In this section, we show how to compute the gradient of the logistic error (8.4)
w.r.t. the weights w, which allows us to minimize Elog (w) by gradient descent
(Section 2.4.1). We begin with the linear case, y(x) = wT φ(x). Recall that
E_log(w) = Σ_{i=1}^n E_i(w),  E_i(w) = − log σ(t_i y_i) = log(1 + e^{−t_i y_i}),
For a linearly separable training dataset, logistic regression can drive ‖w‖ to very large values at
ever decreasing gains in E_log(w).
We denote gradients by g, and use g_i for single patterns (x_i, t_i) (Section 2.4.2). Before we begin, recall the
simple form of the gradient for the squared error (8.1) from Section 2.4.1:
∇_w (1/2)(y_i − t_i)² = (y_i − t_i) ∇_w y_i = (y_i − t_i) φ(x_i).
The gradient is the sum of feature vectors φ(xi ), each weighted by the residual
yi − ti .
Recall that ti ∈ {−1, +1}. Let t̃i = (ti + 1)/2 ∈ {0, 1}, and define πi = P (ti =
1|yi ) = σ(yi ) ∈ (0, 1). In order to work out the gradient, note that
σ′(v) = e^{−v} / (1 + e^{−v})² = σ(v) · e^{−v} / (1 + e^{−v}) = σ(v) σ(−v) = σ(v)(1 − σ(v)).
Therefore,
∂E_i/∂y_i = −t_i σ(−t_i y_i) = π_i − t̃_i,
and
∇_w E_i = (π_i − t̃_i) ∇_w y_i = (π_i − t̃_i) φ(x_i),  π_i = σ(y_i).
This has the same form as for the squared error, only that the residuals are
redefined: πi − t̃i instead of yi − ti . For the squared error, a pattern (xi , ti ) does
not contribute much to the gradient if and only if yi − ti ≈ 0, or yi ≈ ti . As
noted in Section 8.1.1, this does not make much sense for binary classification.
For example, ti = +1, yi = 5 is penalized heavily, even though the pattern
is classified correctly with large margin. For the logistic error, the residual is
redefined in a way which makes sense for binary classification. A pattern does
not contribute much if and only if πi = P (ti = 1|yi ) ≈ t̃i , namely if the pattern
is classified correctly with high confidence. If t_i = +1, y_i = 5, then π_i ≈ 1 and
t̃i = 1, so that (xi , ti ) contributes very little to ∇w Elog . Using the vectorized
notation of Section 2.4.1:
∇_w E_log = Σ_{i=1}^n (π_i − t̃_i) φ(x_i) = Φ^T (π − t̃),  π = [π_i], t̃ = [t̃_i].   (8.5)
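In code, the vectorized gradient (8.5) and one gradient descent step read as follows (random data, purely for illustration):

import numpy as np

rng = np.random.default_rng(4)
Phi = rng.standard_normal((50, 5))          # n x p matrix of feature vectors
t = rng.choice([-1, 1], size=50)            # labels in {-1, +1}
w, eta = np.zeros(5), 0.1

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))

pi = sigma(Phi @ w)                         # pi_i = P(t_i = 1 | y_i)
t_tilde = (t + 1) / 2                       # map labels to {0, 1}
grad = Phi.T @ (pi - t_tilde)               # equation (8.5)
w = w - eta * grad                          # one gradient descent step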
E_sq,i(w) = (1/2) (a^{(L)}(x_i) − t_i)²,  E_log,i(w) = − log σ(t_i a^{(L)}(x_i)).
The corresponding residuals at the uppermost layer are
r_sq,i^{(L)} = a^{(L)}(x_i) − t_i,  r_log,i^{(L)} = σ(a^{(L)}(x_i)) − t̃_i.
Recall our comparison between logistic and perceptron error above. A com-
pelling reason to prefer Elog (w) over Eperc (w) is understood by noting that for
our decision-theoretic setup, the link between the optimal discriminant y ∗ (x)
and P (t|x) is one-to-one. In other words, the posterior distribution P (t|x) is
a fixed and known function of y∗(x). If we employ the logistic link for condi-
tional maximum likelihood, we can not only learn to discriminate well, but also
estimate the posterior probabilities P (t|x). There are numerous reasons why we
would want to do the latter, some of which were summarized in Section 5.2.5.
After all, the Bayes-optimal classifier f ∗ (x) decides f ∗ (x) = 1 in the same
way for P (t = 1|x) = 0.51 and P (t = 1|x) = 0.99, while knowledge of P (t|x)
may lead to different decisions in these cases, and is clearly more informative
in general. For instance, class probability estimates are essential in the cancer
screening example of Section 5.2.4.
The representation of f ∗ (x) in terms of a discriminant function, f ∗ (x) =
sgn(y ∗ (x)), is obviously not unique. When we talked about “the” optimal dis-
criminant y ∗ (x) above, we meant the log odds ratio (8.3), but there are many
others. For some of them, we can reconstruct the posterior probability P (t|x)
from y ∗ (x) for each input point x, for others we cannot. An example for the
first class of discriminant functions is the log odds ratio:
y∗(x) = log [ P(t = +1|x) / P(t = −1|x) ]  ⇒  P(t|x) = σ(t y∗(x)).
An example for the second class is f ∗ (x) itself. This is an optimal discriminant
function, since f∗(x) = sgn(f∗(x)). On the other hand, f∗(x) contains no
more information about P(t|x) than whether P(t = +1|x) is larger or smaller than 1/2.
If our goal is to estimate posterior class probabilities, we have to choose an
error function whose minimizer tends to a “probability-revealing” discriminant
function at least in principle, for a growing number of training data points and
discriminant functions to choose from.
Giving an operational meaning to “at least in principle” is the realm of learn-
ing theory, which is not in the scope of this lecture. But a necessary con-
dition for an error function to allow for posterior probability estimation is
easily understood by the concept of population minimizers. Recall the setup
of decision theory from Section 5.2. Suppose we are given an error function
E(y(·)) = n^{-1} Σ_{i=1}^n g(t_i, y(x_i)). Then, y∗(x) is a population minimizer if
y∗(x) ∈ argmin_y E[g(t, y) | x]
for almost all x. Intuitively, if we use more and more training data and allow
for any kind of function y(x), we should end up with a population minimizer.
An error function allows for estimating posterior probabilities only if all its
population minimizers y ∗ (x) are probability-revealing, in that P (t = +1|x) is
a fixed function of y ∗ (x).
For the logistic error function, g_log(t, y) = log(1 + e^{−ty}), and the unique popula-
tion minimizer is the log odds ratio (8.3), which is probability-revealing. On the
other hand, the perceptron error function performs poorly w.r.t. this test. One
of its population minimizers is y ∗ (x) ≡ 0. This comment may paint an overly
pessimistic picture of the perceptron criterion, which often works well if applied
with linear functions y(x) = wT φ(x) under constraints such as kwk = 1 (which
rules out the all-zero solution). But if Eperc is used in less restrictive settings, it
often does lead to solutions with a tiny margin (see Section 2.3.3), from which
posterior probabilities cannot be estimated. As we will see in Chapter 9, one way
of addressing these problems of Eperc is to insist on a maximally large margin.
However, as shown in Section 9.4, the resulting maximum margin perceptron (or
support vector machine) does not allow for the consistent estimation of posterior
class probabilities either.
To sum up, the logistic error function, being the negative log conditional like-
lihood under the logistic noise model P (t|y) = σ(ty), supports the consistent
estimation of posterior class probabilities, while the perceptron error does not.
It is interesting to note that the squared error (8.1), for all its shortcomings with
binary classification, shares this beneficial property with Elog . We will see in Sec-
tion 10.1.3 that its population minimizer is the conditional expectation E[t | x],
which is obviously probability-revealing in the case of binary classification.
Armed with the example of logistic regression, we are now ready to highlight
the fundamental difference between joint and conditional likelihood maximiza-
tion, between generative and discriminative modelling. Recall our motivation
of the logistic noise model by the “two spherical Gaussians” setup; there are two ways to fit this setup to data. For the first option, we estimate the generative parameters θ_gen = (μ_{+1}, μ_{−1}, π_1) by maximum likelihood (Section 6.4.1),
then plug the solution θ̂_gen into the log odds ratio
ŷ_gen(x; θ̂_gen) = (μ̂_{+1} − μ̂_{−1})^T x + b̂ = (ŵ_gen)^T φ(x),  ŵ_gen = [μ̂_{+1} − μ̂_{−1}; b̂],
b̂ = −(1/2)(‖μ̂_{+1}‖² − ‖μ̂_{−1}‖²) + log (π̂_1 / (1 − π̂_1)).
For the second option, we parameterize the log odds ratio directly as
ydsc (x; θ dsc ) = (wdsc )T φ(x), so that θ dsc = wdsc , and fit it to the data by
maximizing the conditional likelihood
max_{θ_dsc} ∏_{i=1}^n P(t_i|θ_dsc, x_i),  P(t_i|θ_dsc, x_i) = σ(t_i y_dsc(x_i; θ_dsc)).
Even though these two classification methods share the same decision-theoretic
motivation and result in P (t|x) estimates of the same functional form, they can
behave very differently in practice. The weights ŵdsc in conditional maximum
likelihood are estimated directly, while the weights ŵgen are assembled from
θ̂ gen , which are twice as many parameters estimated in a different way. In Fig-
ure 8.5, the generative ML plug-in rule is compared to discriminative logistic
regression. The two methods clearly produce different results. They constitute
examples of different modelling and learning paradigms:
[Figure 8.5: generative ML plug-in rule vs. discriminative logistic regression, for training set sizes n = 8, 20, 34, 58.]
The most important point to note about generative and discriminative mod-
elling is that they constitute two different options we have for addressing a
problem such as classification or regression, each coming with strengths and
weaknesses. For a particular application, it is not usually obvious a priori which
of the two will work better. However, some general statements can be made.
Discriminative modelling is more direct and often leads to models with fewer pa-
rameters to estimate. After all, the distribution p(x) over input points is not
represented at all. Most “black-box” classification or curve fitting methods are
based on discriminative models, even though the feature map φ(x) still has to
be specified. On the other hand, it is often much simpler to encode structural
prior knowledge about a task in a generative “forward” model (as emphasized
in Section 6.1) than to guess a sensible form for P (t|x). Moreover, training a
generative model is often simpler and more modular. A general advantage of
generative over discriminative models is that cases with corrupted or partly
missing input point xi can be used rather easily with the former (using tech-
niques discussed in Chapter 12), but typically have to be discarded with the
latter. It is possible to combine the two paradigms in different ways, a topic
which is not in the scope of this course.
For the logistic error, there is no direct reduction to LSE. The next best idea is to proceed iteratively. Suppose we are at
the point w. The locally closest quadratic fit to E_log(w′) is given by the Taylor
approximation
E_log(w′) ≈ q_w(w′) := E_log(w) + (∇_w E_log)^T (w′ − w) + (1/2)(w′ − w)^T (∇∇_w E_log)(w′ − w).
Here, ∇∇_w E_log is the Hessian (matrix of second derivatives) at w. The idea be-
hind Newton-Raphson is to minimize the surrogate q_w(w′) instead of E_log(w′),
updating w to the quadratic minimizer. We will see that ∇∇_w E_log is positive
definite, so the quadratic minimizer w′ is given by
w′ = w − (∇∇_w E_log)^{-1} ∇_w E_log.
We already worked out the gradient (8.5) in Section 8.2.1. For the Hessian, we
employ the same strategy. First, it will be the sum of contributions ∇∇w Ei ,
one for each data point. Second, we can use the chain rule once more. If y =
[yi ] = Φw, then
∇_w E_i = (∂E_i/∂y_i) ∇_w y_i = (∂E_i/∂y_i) φ(x_i)
⇒ ∇∇_w E_i = (∇_w y_i) (∂²E_i/∂y_i²) (∇_w y_i)^T = (∂²E_i/∂y_i²) φ(x_i) φ(x_i)^T.
Therefore, the Hessian is
∇∇_w E_log = Σ_{i=1}^n κ_i φ(x_i) φ(x_i)^T = Φ^T (diag κ) Φ,  κ_i = ∂²E_i/∂y_i² = π_i (1 − π_i) > 0.
Standard least squares solvers typically allow for such weighting. The solution d∗ of the Newton system
(Φ^T (diag κ) Φ) d∗ = −∇_w E_log
is called the Newton direction; q_w(w′) is minimized for w′ = w + d∗. This observation explains the name
IRLS (iteratively reweighted least squares): logistic regression is solved by iterating over a sequence of reweighted least squares
estimation problems.
If you ever implement IRLS in practice, you should note one more detail. The
derivation above suggests to update w to w + d∗ at the end of an iteration, as
this is the minimizer of the quadratic fit q_w(w′). In practice, it works better to
employ a line search:
w ← w + λ∗ d∗,  λ∗ = argmin_{λ>0} E_log(w + λ d∗).
In other words, we search for a minimum point along the line segment {w +
λd∗ | λ > 0} determined by the Newton direction. The minimum does not have
to be found to high accuracy, so that very few evaluations of Elog (and possibly
its derivative) along the line are sufficient. It is common practice to start the
line search with λ = 1 and accept the full Newton step if this leads to sufficient
descent. Details can be found in [2].
We developed logistic regression in Section 8.2 for the case of two classes. How
about the general case of K ≥ 2 classes? We can let ourselves be guided
by decision theory in much the same way. We derived a generative maximum
likelihood approach to multi-way classification in Section 6.4.1, which employed
K discriminant functions
y_k∗(x) = log ( p(x|t = k) P(t = k) ) = (w_k)^T φ(x) + C(x),  k = 0, ..., K − 1.
Here, C(x) does not depend on k. In the case of spherical Gaussian class-
conditional distributions discussed back then, we had φ(x) = [xT , 1]T . What
matters is that the log posterior log P (t = k|x) is a linear function of φ(x) plus
a part which does not depend on the class label k. The implied classification
rule was
f∗(x) = argmax_{k=0,...,K−1} y_k∗(x) = argmax_{k=0,...,K−1} P(t = k|x),
since neither the addition of C(x) nor the increasing exp(·) transform changes
the maximizing value for k. How do we obtain P(t|x) from the y_k∗(x)? We take
the softmax:
σ_k(v) = e^{v_k} / Σ_{k̃} e^{v_{k̃}} = e^{v_k − lsexp(v)},  lsexp(v) = log Σ_{k̃} e^{v_{k̃}}.
σ(v) = [σk (v)] ∈ ∆K is called softmax mapping (or multivariate logit mapping
in statistics). Recall from Section 6.5 that ∆K denotes the simplex of probability
distributions over {0, . . . , K − 1}. The softmax mapping is not one-to-one, since
σ(v + α1) = σ(v) for any α ∈ R, reflecting the fact that we can always add
any fixed C(x) to our discriminant functions yk∗ (x) and still obtain the same
posterior probabilities. The softmax mapping is conveniently defined in terms of
the logsumexp function lsexp(v), a convex function we previously encountered
in Section 8.2.1.
Once more, our motivation of the softmax link for multi-way classification is not
limited to spherical Gaussian generative models, but holds more generally. At
this point, we take the same step as in Section 8.2. We define a discriminative
model for P (t|x) as
P(t = k|x, w) = σ_k([y_{k̃}(x; w_{k̃})]),  y_k(x; w_k) = (w_k)^T φ(x),  w = [w_0; ...; w_{K−1}].
Suppose we are given some multi-way classification data D = {(xi , ti ) | i =
1, . . . , n}, where ti ∈ {0, . . . , K − 1}. What is the negative log conditional like-
lihood − log P (t|w), the generalization of the logistic error function (8.4) to
K classes? Recall from Section 6.5 that the use of indicators can simplify ex-
pressions substantially. With this in mind, let us define t̃i = [t̃ik ] ∈ {0, 1}K ,
where t̃ik = I{ti =k} , or t̃i = δ ti . t̃i has a one in component ti , zeros elsewhere.
The representation of {ti } in terms of the t̃i is called 1-of-K coding. Using the
techniques from Section 6.5:
P(t_i|x_i, w) = P(t_i|y_i) = ∏_{k=0}^{K−1} σ_k(y_{ik})^{t̃_{ik}} = exp( Σ_{k=0}^{K−1} t̃_{ik} y_{ik} − lsexp(y_i) ),
y_i = [y_{ik}],  y_{ik} = y_k(x_i; w_k).
In this derivation, we used that Σ_k t̃_{ik} = 1. The negative log conditional likelihood,
or K-way logistic error function, is
− log P(t|w) = E_log(w) = Σ_{i=1}^n ( lsexp(y_i) − (t̃_i)^T y_i ),  y_i = [y_k(x_i; w_k)],
where the i-th summand defines E_i(w).
This is still a sum of independent terms Ei (w), one for each data point (xi , ti ),
but it couples the entries of each y i ∈ RK . It is also a convex function, due to
the convexity of lsexp(·). We can understand the structure of this error function
by noting that lsexp(v) is a “soft” approximation to maxk vk :
log Σ_k e^{v_k} = M + log Σ_k e^{v_k − M} ∈ (M, M + log K],  M = max_k v_k.
The error is close to zero if y_{i,t_i} = max_k y_{ik}, i.e. if (x_i, t_i) is classified correctly,
while an error is penalized linearly in the distance between the predicted max_k y_{ik}
and the desired y_{i,t_i}.
In order to minimize − log P (t|w), we need to work out its gradient w.r.t. w.
To this end, we use a remarkable property of lsexp(v):
(∂/∂v_k) lsexp(v) = e^{v_k} / Σ_{k̃} e^{v_{k̃}} = σ_k(v)  ⇒  ∇_v lsexp(v) = σ(v).
Therefore,
∇_{y_i} E_i = σ(y_i) − t̃_i,  so that  ∇_{w_k} E_i = (σ_k(y_i) − t̃_{ik}) φ(x_i).
Note that
Σ_{k=0}^{K−1} ∇_{w_k} E_i = Σ_{k=0}^{K−1} (σ_k(y_i) − t̃_{ik}) φ(x_i) = 0,
since Σ_k σ_k(y_i) = Σ_k t̃_{ik} = 1. The contribution φ(x_i) is distributed between
the gradients ∇wk Elog in a zero-sum fashion: a positive example for class ti , a
negative example for all other classes k ≠ t_i.
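A small Python sketch of lsexp, the softmax, and the per-pattern gradient, for one illustrative pattern with K = 3:

import numpy as np

def lsexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())  # numerically stable logsumexp

def softmax(v):
    return np.exp(v - lsexp(v))

y_i = np.array([2.0, -1.0, 0.5])            # discriminant values for K = 3 classes
t_tilde = np.eye(3)[0]                      # 1-of-K coding of true class t_i = 0

E_i = lsexp(y_i) - t_tilde @ y_i            # per-pattern K-way logistic error
grad_y = softmax(y_i) - t_tilde             # gradient w.r.t. y_i
print(E_i, grad_y, grad_y.sum())            # gradient entries sum to zero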
Let us apply our newfound K-way logistic error function to training a multi-layer
perceptron for K-way classification. We need K output layer activations
a_k^{(L)}(x), one for each class, collected as a^{(L)}(x) = [a_k^{(L)}(x)]. Much like in Section 8.2.1,
we only have to modify error backpropagation in a minor way. Fix pattern
(x_i, t_i). The output residuals now form a vector
r_i^{(L)} = σ(a^{(L)}(x_i)) − t̃_i.
The backward pass starting at output activation a_k^{(L)}(x_i) is seeded with
the error r_{i,k}^{(L)}.
As in Section 7.3, MAP estimation maximizes the posterior
p(w|t) ∝ p(t|w) p(w),
or equivalently minimizes the sum of negative log likelihood and negative log prior:
− log p(t|w) − log p(w) + C.
For Gaussian noise and a Gaussian prior p(w) = N(w|0, β^{-1} I), this becomes
(1/(2σ²)) ‖t − Φw‖² + (β/2) ‖w‖² + C.
You should confirm for yourself that the second expression is σ^{-2} times the
regularized least squares criterion (7.1) up to a constant, if ν = βσ². The prob-
abilistic conditional MAP interpretation of Tikhonov-regularized least squares
is as follows. Our model assumptions are twofold. First, the noise is additive
Gaussian with variance σ 2 . Second, the clean curve y(x) is a linear function
of weights w with a Gaussian prior N (w|0, β −1 I), which keeps the weights
uniformly small.
MAP estimation is useful for logistic regression just as well. In fact, let us
consider what happens with the logistic error function (8.4) for a linearly sep-
arable dataset D. In this case, there exists some weight vector w1 so that
t_i y(x_i; w_1) = t_i (w_1)^T φ(x_i) > 0 for all i = 1, ..., n. But then, E_log(βw_1) → 0
as β → ∞: the logistic error can be driven arbitrarily close to its infimum of zero simply by
scaling up the weights, and its minimum is never attained. A regularizer, such as the
negative log of a Gaussian prior, removes this degeneracy.
Just like the variance σ 2 of the Gaussian noise model, the prior parameter β is
a hyperparameter (Section 8.1.3). Choosing good hyperparameter values is an
instance of model selection (see Chapter 10).
We already know the MAP estimator for this model: it is the minimizer of the regularized squared error (7.1). But what is the poste-
rior distribution? Let us fill in a hole we left in our survey of the Gaussian in
Section 6.3. First,
p(w|t) ∝ p(t|w) p(w) ∝ exp( −(1/(2σ²)) ‖t − Φw‖² − (β/2) ‖w‖² ).
Recall our timesaving trick from Section 7.3: when working out a distribution
(or density) over w, we ignore all multiplicative terms which do not depend on
w. Note the absence of 2π and determinant terms in our derivation. Writing
ν = βσ², we have that p(w|t) ∝ e^{−q(w)/(2σ²)}, where
q(w) = ‖t − Φw‖² + ν‖w‖² = w^T (Φ^T Φ + νI) w − 2 t^T Φw + C.
C is some constant, we could work it out, but we don’t, as it does not depend
on w. In the sequel, instead of writing C1 , C2 , . . . for all sorts of constants we
don’t care about anyway, we call all of them C (even though the value may
change from line to line). Now, q(w) is a quadratic function, and we begin to
suspect that maybe p(w|t) is Gaussian after all. Indeed, if Σ = (Φ^T Φ + νI)^{-1},
then
q(w) = (w − ΣΦ^T t)^T Σ^{-1} (w − ΣΦ^T t) + C  ⇒  p(w|t) = N(w | ŵ, σ²Σ),  ŵ = ΣΦ^T t.
We first completed the square (Section 6.4.3), then matched the result to the
form of a Gaussian density (6.2). If we find such a match, we can just read off
mean and covariance matrix. The posterior distribution is a Gaussian, its mode
ŵ is also its mean (we already know that from Section 6.3), and its covariance
matrix is σ²Σ. In other words, a Gaussian prior p(w) is conjugate for a Gaussian
likelihood p(t|w) = N(t|Φw, σ²I) (Section 7.3.1). One more closedness property
for the amazing Gaussian family: Bayesian inference with a Gaussian prior and a Gaussian likelihood results in a Gaussian posterior, joining closure under marginalization, conditioning, and affine transformation.
This list is by no means exhaustive. If you recall the sum and product rule
from Section 5.1, you note that all these operations keep us within the Gaussian
family, which is one reason why it is so frequently used in practice. An example
which throws you out of the family: if x ∼ N (0, 1), then x2 is not a Gaussian
random variable.
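To summarize the computation, here is a minimal sketch of the Gaussian posterior for the linear model, with made-up data and hyperparameters:

import numpy as np

rng = np.random.default_rng(6)
Phi = rng.standard_normal((30, 5))
t = rng.standard_normal(30)
sigma2, beta = 0.25, 4.0
nu = beta * sigma2                          # nu = beta * sigma^2

Sigma = np.linalg.inv(Phi.T @ Phi + nu * np.eye(5))
w_hat = Sigma @ Phi.T @ t                   # posterior mean = mode = MAP estimate
cov = sigma2 * Sigma                        # posterior covariance sigma^2 Sigma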
Chapter 9
Support Vector Machines
In this chapter, we explore kernel methods for binary classification, first and
foremost the support vector machine. These are nonparametric methods, like
nearest neighbour (Section 2.1), which can be regarded as regularized linear
methods in huge, even infinite-dimensional feature spaces. We first encountered
the margin of a dataset in Chapter 2, in the context of the perceptron algorithm.
In this chapter, we will gain a deeper understanding of this concept.
This chapter is mathematically somewhat more demanding than previous ones,
but we will move slowly and build intuition. The work will be very worthwhile.
Support vector machines are not just among the most powerful “black box”
classifiers, with high impact on many applications, they also seeded a link be-
tween machine learning and convex optimization which has been and continues
to be enormously fruitful, far beyond SVMs.
Figure 9.1: Different separating hyperplanes for binary classification data. (a) is
the largest margin hyperplane, while (b) and (c) are potential solutions for the
perceptron learning algorithm.
Fine, but ask yourself which of the solutions in Figure 9.1 you would pick if
you had to. Would you go for (b) or (c), sporting tiny distances to some of
the datapoints? Or would you choose (a), which allows for maximum room
to move? The principle behind our intuitive preference for (a) is as follows.
Our basic assumption about D is that it is an i.i.d. random sample. If we could
repeat the sampling process, we would get another set D0 whose overall structure
resembled that of D, but details would be different. With this in mind, it makes
sense to search for a discriminant function exhibiting stability to small changes in
the individual data points. For example, suppose that each pattern xi is slightly
displaced, leading to equally slight displacements of the φ(xi ). A stable solution
would in general remain unchanged. Given this particular notion of stability,
the best hyperplane is the one whose distance to the nearest training pattern is
maximum. We should look for the discriminant function with maximum margin.
For much of this chapter, we will use a slightly redefined version of the margin
for unnormalized patterns (Section 2.3.3):
γ_D(w, b) = min_{i=1,...,n} t_i (w^T φ(x_i) + b) / ‖w‖.   (9.1)
Figure 9.2: The signed distance δi of a pattern φ(xi ) to a hyperplane with unit
normal vector w0 is defined by φ(xi ) = âi + δi ti w0 , where âi is the orthogonal
projection of φ(xi ) onto the hyperplane. The margin of the hyperplane w.r.t.
a dataset is the smallest signed distance over all datapoints. The hyperplane
is separating the data iff its margin is positive. The maximum margin over all
hyperplanes is the margin γD of the dataset.
Recall the geometrical picture of the margin from Section 2.3.3 and Figure 9.2.
For any i, ti (wT φ(xi ) + b)/kwk is the signed distance δi of φ(xi ) from the hy-
perplane (negative if the point is misclassified). Namely, let âi be the orthogonal
projection of φ(xi ) onto the hyperplane. Then, wT âi + b = 0, since âi is on the
plane. Moreover, to get to φ(xi ), we march to âi , then move δi along the unit
normal vector w0 = w/kwk: φ(xi ) = âi + δi ti w0 . Multiplying with wT :
w^T φ(x_i) = w^T â_i + δ_i t_i ‖w‖ = −b + δ_i t_i ‖w‖  ⇔  δ_i = t_i (w^T φ(x_i) + b) / ‖w‖.
Here, we used w^T w_0 = ‖w‖ and 1/t_i = t_i. The margin is the smallest signed
distance to any of the training cases. If (w, b) describes a separating hyperplane,
you can remove “signed”. A difference2 to the margin concept in Section 2.3.3
is that here, we normalize w only, but leave b unregularized. Moreover, we do
not insist on normalizing the feature vectors φ(xi ) here, even though this is
typically done in SVM practice (see end of Section 9.2.3).
The maximum margin perceptron learning problem (also known as optimal per-
ceptron learning problem) is given by
max_{w,b} γ_D(w, b),  γ_D(w, b) = min_{i=1,...,n} t_i (w^T φ(x_i) + b) / ‖w‖.   (9.2)
We maximize the minimum margin per pattern, where the minimum is over the
data points, the maximum over the hyperplane. The solution to this problem
is the margin γD for the dataset D. As in Section 2.3.3, we can interpret 2γD
as the maximum width of a slab of parallel separating hyperplanes in between
datapoints of the two classes (the light-red region in Figure 9.2). Note that at
least one degree of freedom is left unspecified in this problem. If (w∗ , b∗ ) is a
solution, then so is (βw∗ , βb∗ ) for any β > 0.
² For the margin concept used in this chapter, we can translate all datapoints by a common offset vector without changing the margin, since the offset can be absorbed into the bias b.
Figure 9.3: Examples of convex and non-convex sets (top panel), as well as
convex and non-convex functions (bottom panel).
By rescaling (w, b), which does not change the margin, we can insist on min_i t_i(w^T φ(x_i) + b) = 1, turning (9.2) into the equivalent problem
min_{w,b} (1/2)‖w‖²  subject to  t_i(w^T φ(x_i) + b) ≥ 1,  i = 1, ..., n.   (9.3)
This is a convex optimization problem. For one, ‖w‖² is a convex function. More-
over, the constraints determine the set to be optimized over, called the feasible
set. Each such affine constraint defines a halfspace in R^{p+1} where [w^T, b]^T lives
(Section 2.2), and the intersection of halfspaces, if not empty, defines a convex
set (please prove this for yourself). Why is the feasible set not empty? Because
we assume that the dataset D is linearly separable.
The maximum margin formulation has two shortcomings shared with the perceptron algorithm. First,
it works only for linearly separable datasets D. Second, we are limited to linear
discriminants y(x) = wT φ(x) + b in some finite-dimensional feature space,
w ∈ R^p. In this section, we will develop remedies for both shortcomings. At
the end, we will not only have derived support vector machines, but also gained
a much broader perspective on “kernel” variants of regularized estimation.
If D is not linearly separable, no (w, b) satisfies all constraints of (9.3). The standard remedy is to permit violations t_i y_i ≥ 1 − ξ_i by nonnegative slack variables ξ_i, penalized in the criterion:
min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^n ξ_i  subject to  t_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n.   (9.4)
The problem (9.4) is often called soft margin SVM problem, while our previous
variant (9.3) becomes the hard margin SVM problem. Note that the hard margin
problem is a special4 case of the soft margin problem, where C = ∞. The soft
margin extension can also lead to an improved solution for a linearly separable
dataset D. To understand this point, let us define the soft margin as 1/‖w∗‖
for an optimal solution (w∗, b∗, ξ∗) of (9.4), in analogy with Section 9.1.1. Note
that this quantity depends on C. If D is separable, the soft margin for C = ∞
coincides with the normal (hard) margin. In geometrical terms, the soft mar-
gin is the closest distance to the hyperplane of any pattern i with ξ∗,i = 0. It
⁴ In general, the soft margin solution will coincide with the hard margin solution from some finite value of C onwards.
Figure 9.4: The soft margin support vector machine allows for a larger soft
margin γsoft at the expense of violating the large margin constraints ti yi ≥ 1
for certain patterns. Different from the hard margin SVM, it can be applied to
datasets which are not linearly separable.
is therefore no smaller, and typically larger, than the hard margin. Figure 9.4
illustrates the consequences of optimizing the more general soft margin. Intu-
itively, we extend the width of our slab (light-red region), thus the stability of
the solution, at the expense of engulfing a few patterns. To conclude, the soft
margin SVM is more generally useful than its hard margin special case, and we
will mainly be concerned with the former in the remainder of this chapter.
The soft margin SVM can be understood as a Tikhonov-regularized estimator,
directly related to conditional maximum a-posteriori estimation (Section 8.3.2).
To this end, we eliminate the slack variables in (9.4) by using the hinge function
[x]+ := max{x, 0} and divide the criterion by C, arriving at the equivalent
problem
min_{w,b} (1/(2C)) ‖w‖² + Σ_{i=1}^n [1 − t_i y_i]_+,  y_i = w^T φ(x_i) + b.
This is regularized estimation with error function E_svm(w, b) = Σ_{i=1}^n [1 − t_i y_i]_+, the hinge error,
together with the Tikhonov regularizer (2C)^{-1}‖w‖². We can compare E_svm with the
perceptron and logistic error (Figure 9.5). It corresponds to the perceptron error
[−ti yi ]+ shifted to the right in order to enforce a discrimination with margin.
Unlike the logistic error, it is exactly zero for ti yi ≥ 1. The consequences of this
fact will be highlighted in Section 9.4.
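The three error functions of Figure 9.5 are easily tabulated; a small sketch:

import numpy as np

ty = np.linspace(-4.0, 3.0, 8)              # margin values t_i * y_i
perceptron = np.maximum(-ty, 0.0)           # [-ty]_+
hinge = np.maximum(1.0 - ty, 0.0)           # [1 - ty]_+ , the SVM error
logistic = np.log1p(np.exp(-ty))            # log(1 + e^{-ty})
print(np.c_[ty, perceptron, hinge, logistic])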
Figure 9.5: Error per pattern i as function of ti yi , for perceptron, logistic and
SVM hinge error.
Consider regularized estimation problems of the form
min_{w,b} Σ_{i=1}^n E_i(y_i) + (ν/2)‖w‖²,  y_i = w^T φ(x_i) + b.   (9.5)
We assume that the criterion is lower bounded and has globally optimal solutions.
What we will show is that there is always an optimal solution (w∗, b∗) for
which we can write
w∗ = Σ_{i=1}^n α_{∗,i} φ(x_i).
The weight vector w_* can be written as a linear combination of the feature vectors φ(x_i). In other words, w_* = Φ^T α_*. Writing w_* in terms of α_* is known as the dual representation. The consequence of this so-called representer theorem is that we may as well optimize over α, with w = Φ^T α, without loss in optimality. This statement is trivial for p ≤ n, so for the rest of this section we will assume that p > n (more features than datapoints), and that ν > 0 in (9.5).
We begin with a method which does not use Tikhonov regularization, the perceptron algorithm of Section 2.3. Recall from Algorithm 1 that we start with w ← 0, and each update has the form w ← w + t_i φ(x_i). This means that we can just as well maintain a dual representation α ∈ R^n, start with α ← 0, and update α ← α + t_i δ_i. The perceptron algorithm is ready for “kernelization”. Note that the dual representation can be a good idea even in the case p < n. If D has a sizeable margin, the perceptron algorithm will only ever update on a few patterns, and the dual vector α remains sparse, thus can be stored and handled efficiently.
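As a minimal sketch of this kernelized perceptron (generic Python, assuming a precomputed kernel matrix K and labels t ∈ {−1, +1}; not code from the notes):

    import numpy as np

    def kernel_perceptron(K, t, max_sweeps=100):
        """Dual perceptron: w = sum_i alpha_i phi(x_i) is kept implicitly."""
        n = len(t)
        alpha = np.zeros(n)
        for _ in range(max_sweeps):
            mistakes = 0
            for i in range(n):
                y_i = K[i] @ alpha        # y_i = w^T phi(x_i), via kernels only
                if t[i] * y_i <= 0:       # mistake on pattern i
                    alpha[i] += t[i]      # dual update: alpha <- alpha + t_i delta_i
                    mistakes += 1
            if mistakes == 0:             # no mistakes in a full sweep: done
                break
        return alpha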
How about linear classification by minimizing the squared error E_sq? Consider the stochastic gradient variant from Section 2.4.2. Upon visiting pattern i, we update

    w ← w − η ∇_w E_i = w − η (y_i − t_i) φ(x_i),

which again gives rise to a dual representation, in which we update α ← α − η (y_i − t_i) δ_i. We derive the underlying dual optimization problem at the end of this section.
How about any problem of the form (9.5)? If you read Section 4.2.2 on orthogonal projection, you have all the tools together. The criterion (9.5) consists of two parts. The error function depends on (w, b) through y = Φw + b1 only, while the regularizer is (ν/2)‖w‖². We will show the following. For any (w, b) giving rise to y, we can find some w̃ = Φ^T α so that Φw̃ + b1 = y = Φw + b1 and ‖w̃‖² ≤ ‖w‖². We just need to apply this to an optimal solution in order to establish the representer theorem. Recall from Section 4.2.2 that w = w_∥ + w_⊥ uniquely, where w_∥ ∈ Φ^T R^n (therefore w_∥ = Φ^T α for some α ∈ R^n) and w_⊥ is orthogonal to Φ^T R^n, meaning that Φw_⊥ = 0. Therefore,

    y = Φ(w_∥ + w_⊥) + b1 = Φw_∥ + b1.

The claim follows with w̃ = w_∥. The intuition behind the representer theorem is that the contribution w_⊥ orthogonal to w_∥ cannot help in decreasing the error (it does not affect y), but it adds to the regularization costs.
To conclude, for a wide range of Tikhonov⁵-regularized estimation problems of the form (9.5), including penalized least squares, MAP estimation for logistic regression (Section 8.3.2) and soft margin support vector machines, the optimal solution can be represented in terms of w_* = Φ^T α_* = \sum_{i=1}^n α_{*,i} φ(x_i). This fact is valuable if p > n, as it allows us to optimize over (α, b) instead of (w, b).

⁵ You will have no problems to confirm that the representer theorem holds more generally, for any regularizer which is a nondecreasing function of ‖w‖.
Let us work out the dual representation for linear least squares regression, namely (9.5) with E_i = (y_i − t_i)²/2. Note that for this example, we deviate from the policy here and include b into w, appending 1 to φ(x). The other case leads to the same conclusion, but is more messy to derive. Starting from Section 7.2, we have that

    (νI + Φ^T Φ) w_* = Φ^T y   ⇒   w_* = Φ^T α_*,   α_* = ν^{-1} (y − Φw_*).

In particular, (νI + ΦΦ^T) α_* = y. Note the remarkable duality in this formulation. The system for the dual representation α_* is obtained by replacing Φ^T Φ ∈ R^{p×p} with ΦΦ^T ∈ R^{n×n}, and the right hand side Φ^T y with y. Ultimately, this simple classical relationship is the basis for the “kernel trick”. We will work out details in Section 9.2.3. This example provides an important lesson about penalized linear regression. If we encounter p > n, we solve for α_* (an n × n linear system) rather than for w_* directly (a p × p linear system). Computations scale superlinearly only in the smaller of p and n.
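A small numpy sketch (synthetic data; b absorbed into w as above) confirms that the p × p primal system and the n × n dual system yield the same weight vector:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, nu = 30, 200, 0.1                  # more features than datapoints
    Phi = rng.standard_normal((n, p))        # rows are phi(x_i)^T
    y = rng.standard_normal(n)

    # Primal: (nu I + Phi^T Phi) w = Phi^T y, a p x p system
    w_primal = np.linalg.solve(nu * np.eye(p) + Phi.T @ Phi, Phi.T @ y)

    # Dual: (nu I + Phi Phi^T) alpha = y, then w = Phi^T alpha, an n x n system
    alpha = np.linalg.solve(nu * np.eye(n) + Phi @ Phi.T, y)
    w_dual = Phi.T @ alpha

    print(np.max(np.abs(w_primal - w_dual)))  # agree up to round-off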
Finally, what happens to the representer theorem for unregularized criteria, (9.5) without the Tikhonov regularizer? Should we still use a dual representation for w? The answer is yes, although some clarification is needed. If p > n and (9.5) comes without regularizer, then the problem has infinitely many optimal solutions. In particular, we can always add any multiple of w_⊥ to w without changing the criterion at all. To see why dual representations are still possible, in fact constitute a very sensible choice, note that (9.5) without regularizer is obtained by setting ν = 0. For any ν > 0, the representer theorem provides an optimal solution w_*^{(ν)} = Φ^T α_*^{(ν)}. By continuity, α_*^{(ν)} converges to α_*^{(0)} as ν → 0, and w_*^{(0)} = Φ^T α_*^{(0)} is a solution of the unregularized problem. Moreover, if rk Φ = n, then among the infinitely many solutions, w_*^{(0)} is the one with smallest norm ‖w_*‖, simply because this holds true for any ν > 0. For the regularized LSE example above, we have that

    w_*^{(0)} = Φ^T α_*^{(0)} = Φ^T (ΦΦ^T)^{-1} y.
We need finite quantities only in order to make our idea work, namely the matrix ΦΦ^T during training, and the mapping x ↦ Φφ(x) for predictions later on, to be evaluated at finitely many x. This is the basic observation which makes kernelization work.
The entries of ΦΦ^T are φ(x_i)^T φ(x_j), while [Φφ(x)]_i = φ(x_i)^T φ(x). We can write

    K(x, x′) = φ(x)^T φ(x′),

a kernel function. It is now clear that given the kernel function K(x, x′), we never need to access the underlying φ(x). In fact, we can forget about the dimensionality p and vectors of this size altogether. What makes K(x, x′) a kernel function? It must be the inner product in some feature space, but what does that imply? Let us work out some properties. First, a kernel function is obviously symmetric: K(x′, x) = K(x, x′). Second, consider some arbitrary set {x_i} of n input points and construct the kernel matrix K = [K(x_i, x_j)] ∈ R^{n×n}. Also, denote Φ = [φ(x_1), . . . , φ(x_n)]^T ∈ R^{n×p}. Then,

    α^T K α = α^T ΦΦ^T α = ‖Φ^T α‖² ≥ 0.
In other words, the kernel matrix K is positive semidefinite (see Section 6.3). This property defines kernel functions. K(x, x′) is a kernel function if the kernel matrix K = [K(x_i, x_j)] for any finite set of points {x_i} is symmetric positive semidefinite. An important subfamily are the infinite-dimensional or positive definite kernel functions. A member K(x, x′) of this subfamily is defined by all its kernel matrices K = [K(x_i, x_j)] being positive definite, for any set {x_i} of any size. In particular, all its kernel matrices are invertible. As we will see shortly, it is positive definite kernel functions which give rise to infinite-dimensional feature spaces, therefore to nonlinear kernel methods.
Let us look at some examples. Maybe the simplest kernel function is K(x, x′) = x^T x′, the standard inner product. Moreover, for any finite-dimensional feature map φ(x) ∈ R^p (p < ∞), K(x, x′) = φ(x)^T φ(x′) is a kernel function. Since any kernel matrix of this type can at most have rank p, such kernel functions are positive semidefinite, but not positive definite. However, even for finite-dimensional kernels, it can be much simpler to work with K(x, x′) directly than to evaluate φ(x). For example, recall polynomial regression estimation from Section 4.1, giving rise to a polynomial feature map φ(x) = [1, x, . . . , x^r]^T for x ∈ R. Now, if x ∈ R^d is multivariate, a corresponding polynomial feature map would consist of very many features. Is there a way around their explicit representation? Consider the polynomial kernel

    K(x, x′) = (x^T x′)^r = \sum_{j_1,...,j_r} (x_{j_1} · · · x_{j_r}) (x′_{j_1} · · · x′_{j_r}).
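A quick numerical check of this identity (a sketch for r = 2 and x ∈ R²; the explicit map enumerates all d^r ordered monomials):

    import numpy as np
    from itertools import product

    def poly_features(x, r):
        # all ordered monomials x_{j1} * ... * x_{jr}
        return np.array([np.prod([x[j] for j in js])
                         for js in product(range(len(x)), repeat=r)])

    r = 2
    x = np.array([0.5, -1.0])
    xp = np.array([2.0, 0.3])
    lhs = (x @ xp) ** r                               # O(d) kernel evaluation
    rhs = poly_features(x, r) @ poly_features(xp, r)  # O(d^r) explicit features
    print(lhs, rhs)                                   # both 0.49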
Therefore, a kernel function gives rise to a normalized feature map if its diagonal entries K(x, x) are all 1. For example, the Gaussian kernel (9.6) is normalized. Moreover, if K(x, x′) is a kernel, then so is

    K(x, x′) / √(K(x, x) K(x′, x′))

(see Section 9.2.4), and the latter is normalized. It is a good idea to use normalized kernels in practice.
are all kernels, and so is the limit e^{x^T x′} = lim_{r→∞} K_r(x, x′). More generally, if K(x, x′) is a kernel, so is e^{K(x,x′)}. Now,

    e^{−(τ/2)‖x − x′‖²} = e^{−(τ/2)‖x‖²} e^{τ x^T x′} e^{−(τ/2)‖x′‖²}.

The middle factor is a kernel, and we apply our normalization rule with f(x) = e^{−(τ/2)‖x‖²}. The Gaussian kernel is infinite-dimensional (positive definite), although we will not show this here.
Covariance functions are kernel functions. For some set {x_i}, let a = [a(x_i) − E[a(x_i)]] ∈ R^n be a random vector. Then, for any v ∈ R^n:

    v^T K v = v^T E[a a^T] v = E[(v^T a)²] ≥ 0.

Finally, there are some symmetric functions which are not kernels. One example is

    K(x, x′) = tanh(α x^T x′ + β).
Writing K_1 = V_1 V_1^T and K_2 = V_2 V_2^T, we have

    K_1 ⊗ K_2 = (V_1 ⊗ V_2)(V_1 ⊗ V_2)^T.

The same proof works to show that the positive definiteness of K_1, K_2 implies the positive definiteness of the elementwise (Hadamard) product K_1 ◦ K_2, a result due to Schur.
9.2.5 Summary

Let us summarize the salient points leading up to support vector machine binary classification. We started with the observation that for a linearly separable dataset, many different separating hyperplanes result in zero training error. Among all those potential solutions, which could arise as the outcome of the perceptron algorithm, there is one which exhibits maximum stability against small displacements of patterns x_i, by attaining the maximum margin γ_D(w, b). Pursuing this lead, we addressed a number of problems:
⁷ It is even on Wikipedia (en.wikipedia.org/wiki/Support_vector_machine).
• Maximizing the margin does not look like a convex optimization problem. However, exploiting the fact that the size of (w, b) does not matter (only the direction does), we can find an equivalent convex program formulation: minimize a positive definite quadratic function subject to linear constraints, i.e. a quadratic program. This is the hard margin support vector machine. Each margin constraint is enforced in a hard, non-negotiable manner.
At this point, we could plug the kernel expansion in terms of α and b into the soft margin SVM problem (9.4), then solve the resulting quadratic program in (α, b, ξ). Indeed, this is how we would proceed in order to “kernelize” logistic regression. However, in the case of the SVM, some more work leads to a simpler optimization problem, whose properties provide important insights into the influence of single patterns on the final solution. We will derive this dual problem in the following section.
The new variables α = [α_i] represent the slopes of the linear bounds; they come with two-sided linear inequality constraints. Given these variables, our problem becomes

    p̃_* = \min_{w,b} \max_{0≤α_i≤C} L(w, b, α),    L(w, b, α) = (1/2)‖w‖² + \sum_{i=1}^n α_i (1 − t_i y_i).
At the expense of introducing new variables αi , one for each soft margin con-
straint, the inner criterion L is continuously differentiable. This is called the
primal problem, and p̃∗ is called the primal value. Notice the min-max structure
of this problem: what we are after is a saddlepoint, minimizing L w.r.t. the pri-
mal variables w, b, while at the same time maximizing L w.r.t. the new dual
variables α.
These are not the only saddlepoints; we might just as well look at

    d̃_* = \max_{0≤α_i≤C} \min_{w,b} L(w, b, α).
The dual value d̃_* bounds the primal value p̃_* from below. This fact is called weak duality. While weak duality holds for any primal problem, convex or not, more is true for our soft margin SVM problem. There, we have strong duality, in that the two saddlepoint values are identical:

    \max_{0≤α_i≤C} \min_{w,b} L(w, b, α) = d̃_* = p̃_* = \min_{w,b} \max_{0≤α_i≤C} L(w, b, α).
This means that we can solve the original primal problem by solving the dual
problem instead, which will turn out to be simpler in the case of soft margin
SVM.
Let us solve the dual problem. Denote φ_i := φ(x_i). The inner criterion is

    L = (1/2)‖w‖² + \sum_{i=1}^n α_i (1 − t_i y_i),    y_i = w^T φ_i + b,   i = 1, . . . , n.

The criterion for the dual problem is obtained by minimizing over the primal variables (w, b):

    Φ_D(α) = \min_{w,b} L(w, b, α).

Finally an unconstrained and differentiable problem: we can apply our “set the gradient to zero” procedure! First,

    ∇_w L = w − \sum_{i=1}^n α_i t_i φ_i = 0   ⇒   w = \sum_{i=1}^n α_i t_i φ_i.

Second, ∂L/∂b = −\sum_{i=1}^n α_i t_i = 0. This means that the dual problem has an equality constraint. Plugging these into the criterion, we obtain the dual optimization problem:
    \max_α  Φ_D(α) = (1/2)‖w‖² + \sum_i α_i (1 − t_i y_i) = \sum_i α_i − (1/2) \sum_{i,j} α_i α_j t_i t_j φ_i^T φ_j,
    subj. to  α_i ∈ [0, C],  i = 1, . . . , n,   \sum_i α_i t_i = 0.        (9.7)

We used that

    \sum_i α_i t_i y_i = \sum_i α_i t_i (φ_i^T w + b) = \sum_i α_i t_i φ_i^T w = ‖w‖²,

since \sum_i α_i t_i = 0 (first step) and w = \sum_i α_i t_i φ_i (second step). The dual problem is indeed simpler than the primal. First, it depends on n variables only, not on p + n + 1. It is convex, since −Φ_D(α) is a positive semidefinite quadratic. Finally, its feasible set is simply the intersection of the hypercube [0, C]^n with the hyperplane α^T t = 0.
It is naturally kernelized (Section 9.2.3). Picking a kernel function K(x, x′), we have that

    K_{ij} = K(x_i, x_j) = φ_i^T φ_j,    K = [K_{ij}] ∈ R^{n×n}.

In terms of this kernel matrix, the dual criterion becomes

    Φ_D(α) = \sum_i α_i − (1/2) \sum_{i,j} α_i t_i K_{ij} t_j α_j = 1^T α − (1/2) α^T (diag t) K (diag t) α.

Moreover, if α_* is the solution of the dual problem, then w_* = \sum_{i=1}^n α_{*,i} t_i φ_i solves the primal, and we obtain the discriminant

    y_*(x) = (w_*)^T φ(x) + b_* = \sum_{i=1}^n α_{*,i} t_i K(x_i, x) + b_*.        (9.8)
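Standard SVM packages work with exactly these quantities. As a sketch using scikit-learn (if available; the toy data is made up for illustration), SVC exposes the nonzero α_{*,i} t_i as dual_coef_ and b_* as intercept_, so the kernel expansion (9.8) can be evaluated by hand and compared against the library's decision function:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.standard_normal((80, 2))
    t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy two-class problem

    clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, t)

    # Reassemble (9.8): dual_coef_[0, j] holds alpha_{*,i} t_i for support vector j
    Xtest = rng.standard_normal((5, 2))
    sq = (np.sum(clf.support_vectors_**2, 1)[:, None]
          + np.sum(Xtest**2, 1)[None, :]
          - 2.0 * clf.support_vectors_ @ Xtest.T)
    Kmat = np.exp(-0.5 * sq)                     # RBF kernel with gamma = 0.5
    y_manual = clf.dual_coef_[0] @ Kmat + clf.intercept_[0]
    print(np.allclose(y_manual, clf.decision_function(Xtest)))  # True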
Support Vectors
There is a property we have not yet used, which links primal and dual variables
at the saddlepoint. By working it out, we will complete our derivation of the
soft margin SVM dual problem. First, we will find how to solve for b∗ . Second,
we will obtain an important classification of patterns (xi , ti ) in terms of values
of αi , which will finally clarify the catchy name “support vectors”. To gain some
intuition, consider the soft margin SVM solution in Figure 9.7. Three different
things can happen for a pattern (x_i, t_i). First, it may be classified correctly with more than the margin, in that t_i y_i > 1. In this case, the optimal solution does not really depend on it. We could remove the pattern from D and still get the same solution. The solution is not supported by these vectors. Second, it may
lie precisely on the margin, ti yi = 1. These ones are important, they provide
the essential support. Finally, for the soft margin SVM, there may be patterns
in the margin area or even misclassified, ti yi < 1. They support the solution as
well, because we pay for them, and removing them may allow us to increase the
soft margin.
Figure 9.7: Different types of patterns for the soft margin SVM solution. (x_1, t_1) is classified correctly with more than the margin, t_1 y_1 > 1. The SV solution does not depend on it. (x_2, t_2) is an essential support vector and lies directly on the margin, t_2 y_2 = 1. Both (x_3, t_3) and (x_4, t_4) are bound support vectors, α_3 = α_4 = C. (x_3, t_3) lies in the margin area, while (x_4, t_4) is misclassified (t_4 y_4 < 0).
Suppose now that we are at a saddlepoint of the dual problem, dropping the “*” subscript for easier notation. Recall how we got the α_i into the game above:

    C [1 − t_i y_i]_+ = \max_{α_i ∈ [0,C]} α_i (1 − t_i y_i).
A glance at Figure 9.6 reveals that α_i is linked with y_i (and therefore with w, b) through the requirement that the lower bound has to touch the hinge function at the argument 1 − t_i y_i. As illustrated in Figure 9.8, there are three different cases: if t_i y_i > 1, then α_i = 0; if t_i y_i = 1, then α_i can lie anywhere in [0, C]; and if t_i y_i < 1, then α_i = C.

(Figure 9.8: three panels, plotting the bound against 1 − t_i y_i in each of the three cases.)
If (xi , ti ) is not a support vector, its αi = 0 and it does not appear in the
kernel expansion (9.8) of y ∗ (x). If this happens for many patterns, the kernel
expansion (9.8) is sparse, and a lot of time can be saved both when predicting
on future test data and during training. Sparsity is a major attractive point
about support vector machines in practice (see Section 9.4).
Finally, the essential support vectors allow us to determine the solution for b. Denote S = {i | α_i ∈ (0, C)}. For these, let ỹ_i = \sum_j α_j t_j K(x_j, x_i) = y_i − b. We know that 1 = t_i y_i = t_i (ỹ_i + b), so that

    b = |S|^{-1} \sum_{i∈S} (t_i − ỹ_i).
Note that we could also just compute b from any single i ∈ S, but the above formula averages out numerical errors.
Most SVM codes in use today solve this dual problem, making use of the so-called Karush-Kuhn-Tucker (KKT) conditions we just derived. A particularly simple algorithm is sequential minimal optimization (SMO) [32], where pairs (α_i, α_j) are updated in each iteration. There are some good and some not so good SVM packages out there. An example of a good code is LIBSVM at www.csie.ntu.edu.tw/~cjlin/libsvm/.
This concludes our derivation of the soft margin SVM optimization problem in
both primal and dual form. Our derivation is a special case of the much more
general and powerful framework of Lagrange multipliers and Lagrange duality,
which we discuss in Appendix A.
Sparsity
The solution (α, b) to a soft margin SVM problem is sparse if most patterns
are not support vectors and most of the coefficients of α are zero. This does
not always happen, but it happens frequently in practice. A beneficial aspect
of sparsity is that the final predictor y ∗ (x) (9.8) can be evaluated efficiently
on test data. In fact, we only have to store the support vectors for prediction
purposes. In many real-world applications, it is very important to be able to
predict quickly, and sparsity is a substantial asset then. Even better, sparsity
tends to help with the training as well. Most SVM solvers exploit sparsity in one
way or another, which often lets them cope with dataset sizes n for which the
storage of the full kernel matrix K is out of the question. An extreme example is
SMO (Section 9.3), which is very slow without, yet highly effective with sparsity.
One problem with SVM sparsity is that it cannot be controlled in a sensible way,
nor can we easily say up front how sparse our solution will be. To some extent,
sparsity can be linked to the Bayes error of a problem (details in [36]), but the
link is not perfect. It is sometimes advocated to regulate sparsity by the choice of
C in (9.4). It is true that cranking up C can lead to sparser solutions, but that is
not what this parameter is for. C should be chosen solely with one goal in mind,
namely to attain the best possible generalization error. There are alternative
methods which allow the degree of sparsity to be regulated, for example the
relevance vector machine [5, ch. 7.2] or the informative vector machine [26].
However, when it comes to combining sparsity with computational efficiency
and robustness, there is still no match for the SVM.
Recall from Section 8.2.2 that a conditional maximum likelihood method such as logistic regression can estimate class posterior probabilities P(t = +1|x). Here, K is the kernel matrix. Unlike in the SVM dual problem, the α_i can be negative (in fact, α_i plays the role of t_i α_i in the SVM problem). For kernels such as the Gaussian (9.6) and a certain decay of ν = ν_n with training set size n, the minimizers \sum_j α_{*,j} K(x_j, ·) + b_* of (9.9) converge to the log odds ratio log{P(t = +1|x)/P(t = −1|x)} for the true data distribution almost surely⁸. In other words, posterior class probabilities P(t = +1|x) can be estimated consistently with regularized kernel logistic regression.
How about the soft margin SVM? First, it is not a conditional maximum a-
posteriori technique, mainly since the hinge loss [1 − t_i y_i]_+ does not correspond
to a negative log likelihood [37]. A way of understanding the SVM in terms of
probabilities is shown in [24], but this is markedly different from MAP estima-
tion. It should therefore not come as a surprise that posterior class probabilities
cannot be estimated consistently with the SVM. The population minimizer for
the hinge loss is

    y_*(x) = sgn(P(t = +1|x) − P(t = −1|x))   wherever the two posteriors differ.

The proof is left to the reader. Note that y_*(x) can take any value in [−1, +1]
if P (t = 1|x) = P (t = −1|x). Now, y ∗ (x) is certainly not probability-revealing.
For each x, it merely contains the information whether P (t = +1|x) > P (t =
−1|x) or P (t = +1|x) < P (t = −1|x), but nothing else. Indeed, Bartlett
and Tewari [1] demonstrated rigorously that P (t = +1|x) cannot be estimated
consistently by SVMs. The reason for this is precisely the sparsity of SVMs.
The failure of SVMs does not just happen on some unusual data distributions,
but is a direct implication of margin maximization. Ironically, despite this clear
result, SVMs are used to “estimate class probabilities” to this day, using [33]
or more elaborate variants. This cannot be justified as an approximation; it is just plain wrong. If we want a sparse classifier trained by convex optimization,
we can use the SVM. If our goal is to estimate class probabilities for automatic
decision making (Section 5.2), the SVM is not a trustworthy option, and we
should use other methods such as kernelized logistic regression, trained by the
IRLS algorithm (Section 8.2.4).
We close this section by noting that there is a probabilistic counterpart to support vector machines, going far beyond MAP estimation for logistic regression: Gaussian process methods. Kernels are covariance functions there (Section 9.2.4), which give rise to Gaussian priors over functions y(·). GP methods are out of the scope of this basic course, but do constitute an important part of modern machine learning, with applications ranging from computer vision and robotics to spatial statistics and bioinformatics. If you are interested, start with [45] or the textbook [34]. A useful website is www.gaussianprocess.org.

⁸ This is a classical result, which follows for example from Corollary 3.62 and Example 3.66 in [41].
Chapter 10

Model Selection and Evaluation
• Model assessment: How will the finally chosen method behave on future
test data?
Figure 10.1: Two datasets of size 10, closely fitted by polynomials of degree 1
(left) and degree 7 (right). Top row: Least squares fits exhibit similar squared
error in either case. Bottom left: Degree 1 polynomials fitted to subsets of size 5
are close to the fit for the whole set. Bottom right: Degree 7 polynomials fitted
to subsets of size 5 are very different from each other and from the fit for the
whole set.
    E = E[L(ŷ(x), t)] = (1/2) E[ E[(ŷ(x) − t)² | x] ].

If you are unsure about the notation, you might want to revisit Chapter 5, in particular Section 5.2. Our convention is that a conditional expectation E[A|B] is over all variables in the expression A which do not appear in B. For example, the expectation above is over (x, t) from the true law, which in the second expression is decomposed into conditional expectation over t given x inside and marginal expectation over x outside. We will worry about the dependence of ŷ(·) on the training sample D in a moment. For now, let us find out what the optimal solution for ŷ(·) would be. Namely,
    E[(ŷ(x) − t)²] = E[(ŷ(x) − y_opt(x) + y_opt(x) − t)²]
                   = E[(ŷ(x) − y_opt(x))²] + 2 E[(ŷ(x) − y_opt(x))(y_opt(x) − t)] + E[(y_opt(x) − t)²],

where y_opt(x) does not depend on t. The middle term is “cross-talk”; it vanishes if we choose y_opt(x) = E[t | x]:

    E[(ŷ(x) − t)²] = E[(ŷ(x) − E[t | x])²] + E[(E[t | x] − t)²]
                   = E[(ŷ(x) − E[t | x])²] + E[Var[t | x]].        (10.1)
You should test your understanding on this derivation. Why does the cross-talk term vanish under this particular choice of y_opt(x)? Why can the final term be written as expected conditional variance? Is this the same as Var[t]? If not¹, why not? What is missing?
The expression (10.1) is uniquely minimized by ŷ(x) = E[t | x], the conditional
expectation of t given x. In other words, E[t | x] is the population minimizer
under the squared loss function (see Section 8.2.2), the optimal solution to the
curve fitting problem, which requires knowledge of the true underlying law.
The term E[Var[t | x]] is a lower bound on the population squared error; we cannot perform better than this. For example, suppose the data is generated as t_i = y_*(x_i) + ε_i, where y_*(·) is an underlying curve, and the ε_i are i.i.d. noise variables with zero mean and variance σ² (Section 4.1). Then, E[t | x] = E[y_*(x) + ε | x] = y_*(x) and E[Var[t | x]] = σ². For this additive noise setup, the conditional expectation recovers the underlying curve perfectly, while the minimum unresolvable error is equal to the noise variance.
Our argument so far is decision-theoretic, much like in Section 5.2, but dealing with curve fitting instead of classification. In practice, ŷ(·) is learned from data D, without knowledge of the true law. Let us fix x and use the quadratic decomposition once more, this time taking expectations over the sample D:

    E[(ŷ(x|D) − E[t | x])² | x] = E[(ŷ(x|D) − y_avg(x) + y_avg(x) − E[t | x])² | x],

where y_avg(x) does not depend on D. Using the same argument as above, the cross-talk vanishes for y_avg(x) = E[ŷ(x|D) | x], the expected prediction curve, and

    E[(ŷ(x|D) − E[t | x])² | x] = (E[ŷ(x|D) | x] − E[t | x])² + Var[ŷ(x|D) | x],

where the first term on the right is the squared bias, the second the variance. This is the bias-variance decomposition for regression with squared loss. Note that it holds pointwise at each x. A further expectation over x provides a bias-variance decomposition of the expected test risk:

    E[R(ŷ_ν|D)] = E[(E[ŷ(x|D) | x] − E[t | x])²] + E[Var[ŷ(x|D) | x]] + E[Var[t | x]]

¹ Answer: Var[t] = E[Var[t | x]] + Var[E[t | x]]. Why?
into average squared bias, average variance and intrinsic noise variance. While
the third term is constant (and typically unknown), there is a tradeoff between
the first two terms which can be explored by varying the model complexity
(regularization parameter ν in the example above).
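This tradeoff is easy to reproduce numerically. The following sketch (setup modelled loosely on the figures below; not the notes' code) estimates average squared bias and variance of polynomial least squares fits over many training samples:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma, n_samples = 12, 0.25, 1000
    x_grid = np.linspace(0.0, 1.0, 101)
    f = lambda x: np.sin(2.0 * np.pi * x)        # underlying curve y_*(x)

    for degree in (1, 6):
        preds = np.empty((n_samples, x_grid.size))
        for s in range(n_samples):
            x = np.r_[0.0, 1.0, rng.uniform(0.0, 1.0, n - 2)]  # include endpoints
            t = f(x) + sigma * rng.standard_normal(n)
            preds[s] = np.polyval(np.polyfit(x, t, degree), x_grid)
        bias2 = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
        variance = np.mean(preds.var(axis=0))
        print(degree, round(bias2, 4), round(variance, 4))
    # degree 1: large bias, small variance; degree 6: small bias, large variance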
Figure 10.2: Bias and variance for least squares polynomial curve fitting. 1000
samples of size 12 each were drawn from y∗ (x) = sin(2πx) plus N (0, 1/16) noise,
where xi are drawn uniformly from [0, 1]. Each sample includes 0 and 1 among
the xi , to avoid high variance due to extrapolation. Note that this favours the
higher-order polynomials.
Left column: Each sample is fitted by a line y(x) = w1 x + w0 . The expected fit
E[ŷ(x|D) | x] (upper left; solid black) is rather poor, giving rise to substantial
squared bias (lower left; blue). On the other hand, the variance across samples
(lower left; red) is small, in that most fitted lines are close to the average. Right
column: Each sample is fitted by a polynomial y(x) = w6 x6 + · · · + w1 x + w0 of
degree 6. The expected fit is very good (small squared bias), but the variance
across samples (lower right; red) is large. Notice that these curves represent
sample averages of the population quantities E[ŷ(x|D) | x] and Var[ŷ(x|D) | x]
(averaged over 1000 samples).
In Figure 10.3, we repeat the same polynomial curve fitting setup (1000 training
samples D of size 12 each), showing average squared bias, average variance and
expected test risk as function of the polynomial degree (p − 1, where p is the
number of weights). Here is a bias-variance tradeoff in action. As p (and therefore
the model complexity) increases, the bias decreases, while the variance increases.
The expected test risk is smallest for degree 3 (p = 4).
In this case, the bias vanishes for ν = 0, while being positive in general for ν > 0. For general curves y_*(x), we can show that the average bias on the training sample is nondecreasing in ν:

    E[‖t − Φŵ_{ν_1}‖²] ≤ E[‖t − Φŵ_{ν_2}‖²],    ν_1 < ν_2.

This is because for any fixed sample D, the training error is nondecreasing in ν (Section 7.2). Regularization is a means to decrease variance at the expense of increasing the bias.
Given these L methods, we can pick one of them for prediction; write ŷ_l(x) = y_*(x) + ε̂_l(x) for the prediction error ε̂_l(x) of method l. Doing so, the average error is

    E_avg(x) = L^{-1} \sum_{l=1}^L E_l(x) = L^{-1} \sum_{l=1}^L E[ε̂_l(x)² | x].

Alternatively, we can use the ensemble predictor ŷ_ens(x) = L^{-1} \sum_{l=1}^L ŷ_l(x), whose error is

    E_ens(x) = E[ (L^{-1} \sum_{l=1}^L (y_*(x) + ε̂_l(x)) − y_*(x))² | x ]
             = E[ (L^{-1} \sum_{l=1}^L ε̂_l(x))² | x ],

so that E_ens(x) ≤ E_avg(x). The ensemble error is never worse than the average error of picking a single predictor. In the best case, it can be much better. Denote the expected error of each method by B_l(x) = E[ε̂_l(x) | x]. Let us assume that the errors of different methods are uncorrelated: Cov[ε̂_{l_1}(x), ε̂_{l_2}(x) | x] = 0, l_1 ≠ l_2. Then,

    E_ens(x) = L^{-2} \sum_{l_1,l_2} E[ε̂_{l_1}(x) ε̂_{l_2}(x) | x]
             = L^{-2} \sum_{l_1,l_2} ( Cov[ε̂_{l_1}(x), ε̂_{l_2}(x) | x] + B_{l_1}(x) B_{l_2}(x) )
             = L^{-2} \sum_l Var[ε̂_l(x) | x]  +  (L^{-1} \sum_l B_l(x))²,

where the first term is the variance, the second the squared bias, while

    E_avg(x) = L^{-1} \sum_l Var[ε̂_l(x) | x]  +  L^{-1} \sum_l B_l(x)²,

the average variance plus the average squared bias. The ensemble method reduces the variance by a factor of L in this case! Moreover, the bias is not increased:

    (L^{-1} \sum_l B_l(x))² = L^{-2} (\sum_l 1 · B_l(x))² ≤ L^{-2} · L · \sum_l B_l(x)² = L^{-1} \sum_l B_l(x)²,

by the Cauchy-Schwarz inequality.
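A quick simulation (a sketch with artificial, uncorrelated errors) illustrates both effects:

    import numpy as np

    rng = np.random.default_rng(0)
    L, trials = 10, 100000
    biases = rng.uniform(-0.2, 0.2, L)          # B_l(x), small and different
    stds = rng.uniform(0.5, 1.5, L)
    eps = biases + stds * rng.standard_normal((trials, L))  # uncorrelated errors

    E_avg = np.mean(eps ** 2)                   # error when picking one method
    E_ens = np.mean(np.mean(eps, axis=1) ** 2)  # error of the averaged predictor
    print(E_avg, E_ens)   # E_ens is roughly L times smaller here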
Our goal is to select a value of ν giving rise to small test risk R(ŷ_ν|D) or small expected test risk E[R(ŷ_ν|D)].

As noted in Section 10.1.1, we can estimate R(ŷ_ν|D) using a validation dataset D_valid of size n_valid, independent of the training dataset D. Since ŷ_ν(·) is independent of D_valid, we have that

    R(ŷ_ν|D) = E[L(ŷ_ν(x), t) | D] ≈ n_valid^{-1} \sum_{(x̃_j, t̃_j) ∈ D_valid} L(ŷ_ν(x̃_j), t̃_j)

by the law of large numbers. We can now minimize the estimator on the right hand side w.r.t. ν. This is typically done by evaluating the estimator over a candidate set of values for ν, then picking the minimizer. The problem with this approach is that D_valid cannot be used for training. In the remainder of this section, we discuss a technique for model selection on the training dataset only.
10.2.1 Cross-Validation

Given some dataset D of size n, we would like to split it into a training and a validation part of 4/5 and 1/5 the total size respectively. Let us partition the complete set randomly into 5 equisized parts, D_1, . . . , D_5, then use parts 2–5,

    D_{−1} = D \ D_1 = ∪_{k=2,...,5} D_k,

for training, validating on the held-out part D_1. Repeating this for each of the five choices of the held-out part and averaging the five validation risk estimates, every part is used for validation exactly once. This is the 5-fold cross-validation (CV) estimator. Note how each case (x_i, t_i) in D is used once for validation and four times for training. Defining the CV estimator requires some notation, which can be confusing. You should be guided by the idea, which is simple indeed:
• Split the available training dataset D into M equisized³ parts (or folds) D_m, which do not overlap. In the previous example, M = 5.
• For m = 1, . . . , M: Train on D_{−m} = D \ D_m, then evaluate the validation risk on D_m. Importantly, D_m was not used for training.
• Average the M validation risk values obtained in this round-robin way (see the sketch below).
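In code, the round-robin procedure is a few lines (a generic sketch; fit_fn and risk_fn stand in for whatever training and validation-risk routines are used):

    import numpy as np

    def cross_validation_risk(X, t, fit_fn, risk_fn, M=5, seed=0):
        """M-fold CV: average validation risk over M held-out folds."""
        perm = np.random.default_rng(seed).permutation(len(t))
        folds = np.array_split(perm, M)           # near-equisized parts D_m
        risks = []
        for m in range(M):
            valid = folds[m]
            train = np.concatenate([folds[k] for k in range(M) if k != m])
            model = fit_fn(X[train], t[train])    # train on D_{-m}
            risks.append(risk_fn(model, X[valid], t[valid]))  # validate on D_m
        return float(np.mean(risks))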
where D′ is an i.i.d. sample of size (M − 1)n/M from the true underlying distribution. For not too small M, R̂_CV^{(M)}(D) can be used as an approximate estimator of the expected test risk E[R(ŷ_ν|D)] for the full sample. In Figure 10.4, we compare the expected test risk with both 5-fold and leave-one-out CV estimators on the polynomial curve fitting task used already above (we use a sample size of 50 here, since cross-validation is unreliable for small datasets). While CV somewhat overestimates the test risk for too small and too large degrees, its minimum points coincide with those of the expected test risk in this example. Notice how it errs on the conservative side, and how its variance (over samples) grows sharply with polynomial degree.

The curious reader may wonder how we computed the leave-one-out cross-validation estimator in Figure 10.4. Do we have to run least squares regression 50 times on samples of size 49? And what if n = 1000 or a million? In this light,

³ In practice, the folds can be of slightly different size, which does not change the general procedure.
Figure 10.4: Cross-validation estimators versus expected test risk for polynomial
curve fitting task. 100 samples of size 50 each were drawn from y∗ (x) = sin(2πx)
plus N (0, 1/16) noise, where xi are drawn uniformly from [0, 1]. Each sample
includes 0 and 1 among the xi , to avoid high variance due to extrapolation.
Shown are expected test risk (green) and its cross-validation estimate (mean
(solid) and standard deviation (dashed) over 100 samples), as function of poly-
nomial degree. Left: 5-fold cross-validation (test batches of size 10). Right: 50-
fold leave-one-out cross-validation.
over-fitting comes from the fact that weights w are fitted to data D in the first place. In maximum likelihood and MAP estimation, we do treat w as a random variable. The sum rule of probability (Chapter 5) tells us to marginalize over w in order to robustly estimate hyperparameters such as ν from data. Following this argument, we should maximize the log marginal likelihood

    log p(D|ν) = log ∫ p(D|w, ν) p(w|ν) dw.
Suppose we leave out pattern (φ_i, t_i). The normal equations for ŷ^{(−i)} = (w^{(−i)})^T φ(·) are

    (A − φ_i φ_i^T) w^{(−i)} = Φ^T t − t_i φ_i = A w − t_i φ_i.

Solving them and comparing with the full-sample fit yields the leave-one-out residual

    α_i = t_i − ŷ^{(−i)}(x_i) = (t_i − ŷ(x_i)) / (1 − φ_i^T A^{-1} φ_i),

so that

    R̂_CV^{(n)}(D) = n^{-1} \sum_{i=1}^n α_i².

How do we compute the α_i efficiently? We need the residuals r = t − Φw, moreover the vector [φ_i^T A^{-1} φ_i]. The latter is the diagonal of the matrix Φ A^{-1} Φ^T. Recall how the least squares problem can be solved by the QR decomposition (Section 4.2.3):

    Φ = Q R   ⇒   w = A^{-1} Φ^T t = R^{-1} Q^T t.

Since A = Φ^T Φ = R^T R, we have Φ A^{-1} Φ^T = Q R (R^T R)^{-1} R^T Q^T = Q Q^T, so that

    [φ_i^T A^{-1} φ_i] = diag(Φ A^{-1} Φ^T) = \sum_{k=1}^d (q_k)²,

where q_k denotes the k-th column of Q and the square is taken elementwise.
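Putting the pieces together (a sketch for unregularized least squares as above), the LOO CV estimator costs little more than a single QR decomposition:

    import numpy as np

    def loo_cv_least_squares(Phi, t):
        """Leave-one-out CV risk for least squares, via one QR decomposition."""
        Q, R = np.linalg.qr(Phi)            # Phi = Q R, so A = Phi^T Phi = R^T R
        w = np.linalg.solve(R, Q.T @ t)     # w = R^{-1} Q^T t
        r = t - Phi @ w                     # residuals on the full sample
        h = np.sum(Q ** 2, axis=1)          # h_i = phi_i^T A^{-1} phi_i
        alpha = r / (1.0 - h)               # LOO residuals t_i - y^{(-i)}(x_i)
        return np.mean(alpha ** 2)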
Chapter 11

Dimensionality Reduction
In many, if not most real-world machine learning problems, data cases are most
naturally represented in terms of very high-dimensional vectors. Our canon-
ical handwritten digits recognition problem is rather on the low end in this
respect. High-resolution images have millions of pixels. In speech recognition,
audio waveforms are represented by a large number of windowed Fourier trans-
form features. Text documents are commonly represented as count vectors w.r.t.
some dictionary, which for realistic corpora contains hundreds of thousands of
words. If learning theory tells us one thing, it is that learning in such huge-
dimensional spaces is impossible in general. A practical way out of this dilemma
is to reduce the dimensionality of attributes by some form of feature mapping.
In this chapter, we will get to know some of the most widely used linear dimensionality reduction techniques. Principal components analysis is among
the most fundamental of all techniques used in machine learning, and we will
cover it in some detail. Linear dimensionality reduction techniques are typically
based on the eigendecomposition of certain sample covariance matrices, and
this foundation will be studied in detail. Other more advanced machine learn-
ing techniques share the same mathematical foundations: spectral clustering,
manifold learning, multi-dimensional scaling and metric embedding techniques
for visualization.
Many machine learning problems are characterized by input points living in very
high-dimensional spaces. For example, the MNIST handwritten digit bitmaps
are of size 28 × 28, so can be seen as vectors in R784 . In bioinformatics, gene
microarray measurements can easily give rise to input space dimensionalities of
many thousands. It is important to understand that the underlying “effective”
dimensionality of what is predictively relevant about a signal is typically much
smaller. After all, it is impossible to do meaningful statistics on general distri-
butions in hundreds or thousands of dimensions, as there is simply never enough
training data to explore the vast majority of dependencies between attributes.
This observation is sometimes termed the “curse of dimensionality”, but let us be precise. The general impossibility of statistics in R^{1000} is a fact; there is nothing to lament about it. If a problem cannot be represented using much fewer unknowns, it is out of scope and does not deserve¹ our attention. Fortunately, many real-world problems are amenable to useful low-dimensional modelling.
The curse is that we have to find these adequate low-dimensional representa-
tions, and we have to do so in an automated data-driven way. Given we suc-
ceed, we will have found a probabilistic low-dimensional representation of the
high-dimensional input variable, which conveys enough of the latter’s predictive
information: this is dimensionality reduction. In general, relevant parameters
may be nonlinearly related to input points, and this process of “finding needles
in haystacks” can be very challenging. In this chapter, we will concentrate on
linear dimensionality reduction.
Consider our running MNIST 8s versus 9s classification problem from Chapter 2.
Suppose we want to apply a maximum likelihood plug-in rule with Gaussian
class-conditional distributions p(x|t) = N (µt , Σt ), t = −1, +1, and x ∈ Rd ,
where d = 784. How should we parameterize the covariances Σt ? A simple
choice would be Σt = I, leaving us only with the means µt to be estimated.
However, the variances of pixels are clearly different across the bitmap. Pixels
at the boundary almost always have values close to zero, the variance there is
much smaller than in the center. Diagonal covariances Σt would capture differ-
ent variances at a modest increase in the number of parameters to be estimated.
Still, these imply pairwise conditional independence of individual pixels, an as-
sumption which is clearly violated in our data. Digits are composed of smooth
pen strokes, which induce substantial correlations between neighbouring pixel
values. Full matrices Σ_t then? Unfortunately, this needs d(d + 1)/2 parameters for Σ_t, so
d(d + 3)/2 = 308504 parameters for each class. We cannot reliably estimate this
many parameters from our limited training data.
Covariances between pixels matter, but maybe not all of them do. Here is
the idea behind linear dimensionality reduction. Instead of modelling the high-
dimensional input variable x ∈ Rd , we map it to z = U T x and model z ∈ RM
instead. Since M is chosen much smaller than d, we can afford full covariance
matrices for the class-conditionals on z. This is of course a special case of a
linear feature map, z = φ(x) = U T x. The matrix U ∈ Rd×M determines the
dimensionality reduction. It has to be learned from training data for the whole
idea to work well. In this section, we concentrate on learning U in order to
represent an input point distribution per se, ignoring class label information
even if such is available (this is a simple example of unsupervised learning, see
1 Unless a financial analysis firm pays you lots of money for it.
Chapter 12). We will take several different routes and end up at the same idea:
principal components analysis.
We will adopt the following conventions. First, we will consider zero-mean distributions on x only. Given we work with data, we first compute the sample mean µ̂ = n^{-1} \sum_i x_i and subtract it off. Without this preprocessing, the feature map would be U^T(x − µ̂). Second, we will restrict U to have orthonormal columns: U^T U = I ∈ R^{M×M}. In other words, z = U^T x is the “encoding” part of an orthogonal projection (Section 4.2.2). To justify this restriction, notice that we can decompose any full-rank matrix in R^{d×M} as a product of U with orthonormal columns and an invertible R^{M×M} matrix (QR decomposition, Section 4.2.3), and we can absorb the latter into the definition of z at no loss.
What is a good projection U for dimensionality reduction? Let us collect some
general criteria of merit.
    x̂ = U z = U U^T x.
At first sight, these requirements seem to have little to do with each other.
However, we will see that they are closely related. We can optimally satisfy all
three of them by principal components analysis.
Figure 11.1: Illustration of principal components direction for data (blue dots) in
R2 . The first PCA direction minimizes the squared reconstruction error between
points and their projections (green dots) onto the PCA subspace. Squared error
terms are visualized as area of the red squares.
For each x, inside the expectation, we can choose the encoding coefficient z so as to minimize ‖x − z u‖². The direction u should work well on average over x, so the minimization is outside. We already know that z is chosen by way of orthogonal projection (Section 4.2.2), but let us derive it again:

    (∂/∂z) ‖x − z u‖² = 2 (z u − x)^T u = 2 (z − x^T u) = 0   ⇔   z = u^T x,

using u^T u = 1. Plugging this minimizer in:

    E[‖x − (u^T x) u‖²] = E[‖x‖² − 2 (u^T x)² + (u^T x)² u^T u] = E[‖x‖²] − E[(u^T x)²].

As the first term does not depend on u, and E[(u^T x)²] = u^T E[x x^T] u = u^T Cov[x] u (recall E[x] = 0), our optimal direction is given by

    u_* = argmax_{u: ‖u‖=1} u^T Cov[x] u.
This is also the direction for which z = u^T x has maximum variance, since Var[z] = E[(u^T x)²] = u^T Cov[x] u (recall that E[x] = 0, so that E[z] = 0). The solution u_* is called the first principal components direction, and the corresponding z_* = u_*^T x is called the first principal component² of the random variable x.

Importantly, this definition depends on the covariance of x only. In practice, this means that all we need to estimate of x are first and second order moments (mean and covariance). In fact, our argument amounts to a decomposition of the total amount of covariance:

    E[‖x‖²] = tr E[x x^T] = tr Cov[x] = Var[u^T x] + E[‖x − (u^T x) u‖²],

the variance explained plus the squared reconstruction error.
Multiple Directions

What about more than one direction? Denote the first principal components direction by u_1; how shall we choose a second one, u_2? This choice has to be constrained, otherwise we would simply choose u_1 again. By our convention, u_2 is orthogonal to u_1. We will see in a moment that this is particularly well suited to principal components. We can now define u_2 in the same way as above, subject to u_2^T u_1 = 0.

This is the second principal components direction for x, and it once more corresponds to an eigendirection of Cov[x], corresponding⁴ to the second-largest eigenvalue (see Section 11.1.2). Note that eigendirections of a symmetric matrix corresponding to different eigenvalues are necessarily orthogonal (Section 11.1.2), which in hindsight motivates our convention on the columns of U.

² To be precise, u_* and z_* are population principal components, while in practice we estimate them from data through the sample covariance matrix, as is detailed below.
³ In general, eigenvalues of real-valued matrices can be complex-valued, but this does not happen for symmetric matrices, as shown in Section 11.1.2.
More generally, the k-th principal components direction u_k can be defined recursively via a variant of (11.1), where the constraints read u_k^T u_j = 0 for all j < k. The main point is this. All principal components (PC) directions are eigendirections of the covariance matrix Cov[x]. Their ordering follows the ordering of the eigenvalues: first PC for the largest eigenvalue, and so on. Selecting the leading M eigendirections, corresponding to the M largest eigenvalues (counting multiplicities), to form a dimensionality reduction matrix U ∈ R^{d×M} is called principal components analysis (PCA).
Figure 11.2: Mean and first four principal components directions for MNIST 8s
(subset of size 200, drawn at random from the training database). For the latter,
positive values are shown in red, negative values in blue. The first direction
corresponds to rotations, the second to changes in scale.
In Figure 11.2, we show means and first few principal components directions
for a set of 8s from the MNIST database. Using this example, we can test
the autoencoder property of PCA. In Figure 11.3, we show reconstructions x̂
for digits x, using a growing number of PC directions. Notice how scale and
orientation are shaped before finer details.
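As a generic numpy sketch (not the notes' code), PCA amounts to an eigendecomposition of the sample covariance matrix, keeping the M leading eigendirections:

    import numpy as np

    def pca(X, M):
        """Return (mean, U), U in R^{d x M}: the M leading PCA directions."""
        mu = X.mean(axis=0)
        Xc = X - mu                           # center the data
        C = (Xc.T @ Xc) / X.shape[0]          # sample covariance, d x d
        lam, V = np.linalg.eigh(C)            # eigenvalues in ascending order
        return mu, V[:, ::-1][:, :M]          # leading M eigendirections

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.3]])
    mu, U = pca(X, 1)
    Z = (X - mu) @ U                # encode: z = U^T (x - mu)
    X_hat = mu + Z @ U.T            # reconstruct: x_hat = U z + mu
    print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # small residual error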
We can arrive at a non-recursive definition of PCA for general M < d and gain additional intuition by developing the minimum reconstruction error and variance decomposition perspective in the general case. The optimal choice for \min_{r∈R^M} ‖x − U r‖² is r = U^T x (why?), and

    E[‖x − U U^T x‖²] = E[‖x‖²] − E[‖U^T x‖²],

where E[‖U^T x‖²] = tr U^T Cov[x] U = tr Cov[U^T x]. The complete PCA transformation can therefore be described as the solution of

    U_* = argmax_{U^T U = I} tr U^T Cov[x] U.        (11.3)

It is noted in Section 11.1.2 that the columns of U_* are given by the M leading eigendirections⁵, and the maximal value of the trace is the sum of the M largest eigenvalues (counting multiplicities).

Figure 11.3: PCA reconstructions of an MNIST digit (upper left: original) from the mean and a growing number M of PCA directions (M = 1, 3, 5, 10, 30).
    A v = λ v,   λ ∈ C.

Both v and λ can be complex-valued in general (we will show below that these quantities are real-valued for symmetric matrices). v is called an eigenvector, λ an eigenvalue of A, and (v, λ) is an eigenvector-eigenvalue pair. If (v, λ) is such a pair, so is (α v, λ) for any α ∈ C \ {0}, which is why we refer to v as an eigendirection as well. Importantly, we can find a basis of C^d, u_1, . . . , u_d linearly independent, so that each u_j is an eigenvector (this is not possible for every matrix, but it is for the symmetric ones we are concerned with here):

    A u_j = λ_j u_j,   j = 1, . . . , d.

Since we know all about A if we know what it does to a complete basis (A u_j for all j), this information determines A entirely. If U = [u_1, . . . , u_d], Λ = diag[λ_j], then

    A U = U Λ   ⇒   A = U Λ U^{-1}.        (11.4)

This representation of A is called the eigendecomposition. Why is this useful? For one, we know immediately everything about A^k for any k ∈ N:

    A^k = (U Λ U^{-1})^k = U Λ^k U^{-1},
The trace is the sum of the eigenvalues. For any matrix, the sum of the eigenvalues is the same as the sum of the diagonal values. For the determinant:

    |A| = |U| |Λ| |U^{-1}| = |U| |Λ| |U|^{-1} = |Λ| = \prod_{j=1}^d λ_j,

the product of the eigenvalues (but not the product of the diagonal entries). The last mysteries about determinants should disappear now. For example, the volume interpretation of the determinant given in Section 6.3.1 becomes natural. The parallelepiped spanned by {u_j} is mapped to one spanned by {A u_j} = {λ_j u_j}, therefore the volume changes by |λ_1 · · · λ_d|, the absolute value of |A|.
Eigenvalues are the roots of the characteristic polynomial. Namely,

    A v = λ v   ⇔   (A − λ I) v = 0,

so v ∈ ker(A − λ I), v ≠ 0. This happens only if q(λ) = |A − λ I| = 0. Now, a closer look at determinants (Section 6.3.2) reveals that q(λ) is nothing but a degree d polynomial in λ with coefficients in R, determined by A: q(λ) = α_0 + α_1 λ + · · · + (−1)^d λ^d. The spectrum of A is the set of roots of q(λ). If you need to determine the eigenvalues of a 2 × 2 matrix, find the roots of the characteristic polynomial:

    q(λ) = | a_11 − λ, a_12 ; a_21, a_22 − λ | = (a_11 − λ)(a_22 − λ) − a_12 a_21.
    A = U Λ U^T,   Λ = diag[λ_j],   λ_j ∈ R,   U^T U = I.        (11.5)

We are one step away now from understanding PCA in terms of the eigendecomposition of the (sample) covariance matrix. More generally, let A be a symmetric matrix. Then, A has a real-valued spectrum, and we can choose an orthonormal eigenbasis u_1, . . . , u_d with u_i^T u_j = I_{i=j}. Suppose that the corresponding eigenvalues are ordered: λ_1 ≥ λ_2 ≥ · · · ≥ λ_d. A result due to Rayleigh and Ritz states that

    u_1 ∈ argmax_{‖u‖=1} u^T A u.

If λ_1 > λ_2, this maximizer is unique up to sign flip. To show this, we use the eigendecomposition A = U Λ U^T, where U = [u_1, . . . , u_d] and Λ = diag[λ_j]. Let u be any unit vector, and let z = U^T u. Then,

    u^T A u = u^T U Λ U^T u = z^T Λ z = \sum_{j=1}^d λ_j z_j².

The numbers z_j² are nonnegative and sum to one; they form a distribution over j ∈ {1, . . . , d}. Clearly, the expression is maximized by concentrating all probability mass on j = 1, since λ_1 = max{λ_j}. This means that u = U z = u_1 is a maximizer of the Rayleigh-Ritz ratio (u^T A u)/(u^T u). The recursive definition of the k-th PC direction is obtained in the same way. Suppose that the unit vector u is orthogonal to u_j, j < k. Then, z = U^T u must have z_j = 0 for j < k, and z^T Λ z is maximized by placing all probability mass on k, which implies the maximizer u = u_k.

Finally, a non-recursive min-max characterization of the eigenvalues is due to Courant and Fisher [23, ch. 4.2]. It directly implies the result (11.3) we used in Section 11.1.1.
then (recalling that the data is centered) the data covariance matrix is

    Σ̂ = X^T X.

Compute the singular value decomposition (SVD) of X:

    X = V Ψ U^T,   V^T V = U^T U = I,   V ∈ R^{n×k},   U ∈ R^{d×k}.

Here, k = rk(X) ≤ min{d, n}, and Ψ ∈ R^{k×k} is diagonal with positive entries. The SVD is a generalization of the eigendecomposition to arbitrary matrices. Now,

    Σ̂ = X^T X = U Ψ (V^T V) Ψ U^T = U Ψ² U^T,

using V^T V = I. This means that Λ = Ψ² contains the eigenvalues, and U the eigenvectors, of Σ̂.
What if n < d (fewer datapoints than dimensions)? For example, gene microarray data can come with thousands of dimensions, yet less than a hundred datapoints. In this case, we compute the SVD of X^T instead:

    X^T = U Ψ V^T,   U ∈ R^{d×k},   V ∈ R^{n×k}.

Then,

    X^T X = U Ψ² U^T,   X X^T = V Ψ² V^T,

so we can obtain V and Ψ from the eigendecomposition of the small n × n matrix X X^T, and recover U via

    X^T V = U Ψ V^T V = U Ψ   ⇒   U = X^T V Ψ^{-1}.
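A numpy sketch of this trick (random data with n much smaller than d): eigendecompose the small n × n matrix X X^T and recover U from it, without ever touching a d × d matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, M = 50, 2000, 5
    X = rng.standard_normal((n, d))
    X -= X.mean(axis=0)                        # center

    G = X @ X.T                                # n x n matrix = V Psi^2 V^T
    lam, V = np.linalg.eigh(G)
    lam, V = lam[::-1][:M], V[:, ::-1][:, :M]  # leading M eigenpairs
    U = X.T @ V / np.sqrt(lam)                 # U = X^T V Psi^{-1}, d x M

    print(np.allclose(U.T @ U, np.eye(M)))     # True: orthonormal columns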
    ṽ_k = \sum_j α_j λ_j^k u_j,    ‖ṽ_k‖² = \sum_j α_j² λ_j^{2k} = λ_1^{2k} (α_1² + \sum_{j>1} α_j² (λ_j/λ_1)^{2k}).

We may assume that α_1 ≠ 0. Since λ_j/λ_1 ∈ [0, 1) for j > 1, we know that

    ‖ṽ_k‖ / λ_1^k = √(α_1² + \sum_{j>1} α_j² (λ_j/λ_1)^{2k}) → |α_1|   (k → ∞).
The Lanczos algorithm is slightly more advanced than the power method, but both require one MVM with A per iteration. The Lanczos algorithm is far superior to the power method when it comes to approximating M > 1 PCA directions. Details are found in [17].

⁷ The power method is widely used in machine learning. Unfortunately, it typically needs many more MVMs than Lanczos, in particular if M is moderately large. We should use Lanczos whenever possible.
⁸ This can be done by sampling i.i.d. Gaussians, then normalizing the resulting vector.
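A minimal power-method sketch (generic numpy; the MVM with the covariance is done implicitly through the data matrix, so no d × d matrix is formed):

    import numpy as np

    def power_method(mvm, d, iters=200, seed=0):
        """Leading eigendirection of a PSD matrix, given only an MVM routine."""
        v = np.random.default_rng(seed).standard_normal(d)
        v /= np.linalg.norm(v)                 # random unit start vector
        for _ in range(iters):
            v = mvm(v)                         # one MVM per iteration
            v /= np.linalg.norm(v)
        return v

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 30))
    X -= X.mean(axis=0)
    u1 = power_method(lambda v: X.T @ (X @ v), d=30)   # MVM with X^T X
    lam, V = np.linalg.eigh(X.T @ X)
    print(abs(u1 @ V[:, -1]))                  # close to 1: leading eigenvector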
Centering

The centered data matrix is the product of the uncentered X with the centering matrix H = I − n^{-1} 1_n 1_n^T, a rank-one modification of the identity. Note that H is the orthogonal projection onto the hyperplane with normal vector 1_n (Section 4.2.2), which is the subspace of vectors whose components sum to zero. Implicit centering works by using H X in place of X and X^T H in place of X^T. Here, MVMs with H carry essentially no additional cost.
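In code, implicit centering means subtracting means on the fly; a sketch (H is never formed, and no centered copy of X is stored):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 50))        # uncentered data matrix
    m = X.mean(axis=0)                         # column means, computed once

    def centered_mvm(v):
        """(H X)^T (H X) v without forming H or centering X explicitly."""
        u = X @ v - (m @ v)                    # (H X) v = X v - 1 (m^T v)
        return X.T @ u - m * u.sum()           # (H X)^T u = X^T u - m (1^T u)

    v = rng.standard_normal(50)
    Xc = X - m                                 # explicit centering, for checking
    print(np.allclose(centered_mvm(v), Xc.T @ (Xc @ v)))   # True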
Figure 11.4: PCA directions of largest variance need not be directions which are
most useful for classification. In this example, the maximum variance direction
is closely aligned with the horizontal axis, yet projecting onto it results in sub-
stantial overlap between the classes. We would be better off to project onto the
vertical axis.
Here, the label space is T = {0, 1} (for our MNIST example, assign 8 → 0, 9 → 1), and n_k is the number of patterns x_i with t_i = k. The total number of patterns is n = n_0 + n_1. We can quantify the between-class variance as the squared distance between the class means projected onto u: (m_0 − m_1)², where m_k = u^T µ̂_k. On the other hand, the within-class scatter for class k can be measured, in terms of the same units, by summing up the squared distance between z_i = u^T x_i and m_k over all patterns x_i belonging to class k:

    s_k² = n^{-1} \sum_{i=1}^n I_{t_i=k} (z_i − m_k)² = n^{-1} \sum_{i=1}^n I_{t_i=k} (u^T (x_i − µ̂_{t_i}))².
The total amount of within-class variance is s_0² + s_1². One way to maximize between-class scatter while minimizing within-class scatter, the one chosen by Fisher's linear discriminant (FLD), is to maximize the ratio J(u) = (m_0 − m_1)²/(s_0² + s_1²) over unit norm vectors u. Stop for a second to note the subtle form of the denominator s_0² + s_1². It looks like the usual covariance, but note that in x_i − µ̂_{t_i}, the mean we subtract depends on the label t_i of x_i. How do we solve this problem? First, both numerator and denominator are quadratic functions of u:

    J(u) = (u^T S_B u) / (u^T S_W u),

where

    S_B = (µ̂_1 − µ̂_0)(µ̂_1 − µ̂_0)^T

is the between-class scatter matrix,

    S_W = n^{-1} \sum_{i=1}^n (x_i − µ̂_{t_i})(x_i − µ̂_{t_i})^T = (1 − α) Σ̂_0 + α Σ̂_1
is the within-class scatter matrix, and α = n_1/n. We will assume here and below that the within-class scatter matrix S_W is invertible⁹. Define d := µ̂_1 − µ̂_0, so that S_B = d d^T. The form of J(u) looks like a Rayleigh-Ritz ratio as encountered in Section 11.1.2, only that the denominator squared norm u^T S_W u is weighted by S_W. We will develop this generalized eigenproblem notion in full generality below, when we generalize FLD to multiple classes. For now, let us just set the gradient to zero. Working out ∇_u J(u) is a bit simpler if we apply differentiation to the identity

    J(u) u^T S_W u = u^T S_B u,

resulting in

    (dJ(u)) u^T S_W u = 2 (S_B u − J(u) S_W u)^T (du).

At a stationary point, dJ(u) = 0 for all du, so that

    J(u) S_W u = S_B u.        (11.6)

Since S_B u = d (d^T u) is proportional to d, any solution must satisfy S_W u ∝ d, so that

    û_FLD = S_W^{-1} d / ‖S_W^{-1} d‖,   d = µ̂_1 − µ̂_0.

In practice, we obtain û_FLD by solving the linear system

    S_W u = µ̂_1 − µ̂_0,

then normalizing the solution.
(Figure: MNIST 8s versus 9s. Histograms of the patterns projected onto two directions, and a scatter plot of the data with axes labelled “Fisher LD” and “First PCA”.)
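A numpy sketch of the binary FLD computation (toy Gaussian data, labels in {0, 1} as above):

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[2.0, 0.5], [0.0, 0.5]])               # shared class shape
    X0 = rng.standard_normal((100, 2)) @ A               # class 0
    X1 = rng.standard_normal((120, 2)) @ A + [1.0, 1.5]  # class 1, shifted
    n = len(X0) + len(X1)

    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    d = mu1 - mu0
    R = np.vstack([X0 - mu0, X1 - mu1])                  # x_i - mu_hat_{t_i}
    S_W = (R.T @ R) / n                                  # within-class scatter

    u = np.linalg.solve(S_W, d)                          # solve S_W u = d
    u /= np.linalg.norm(u)                               # normalized u_FLD
    print(u)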
How does FLD relate to the total covariance matrix S on which PCA is based? In this section, we show that S can be decomposed as the sum of S_W and a multiple of S_B. Recall the definitions of S_W and S_B from above, and set d = µ̂_1 − µ̂_0. We plug

    x_i − µ̂ = (x_i − µ̂_{t_i}) + (µ̂_{t_i} − µ̂)

into the equation for S. Expanding the squares, we see that the “cross-talk” vanishes:

    \sum_{i=1}^n (x_i − µ̂_{t_i})(µ̂_{t_i} − µ̂)^T = \sum_{k=0,1} \sum_{i=1}^n I_{t_i=k} (x_i − µ̂_k)(µ̂_k − µ̂)^T = 0,

since

    \sum_{i=1}^n I_{t_i=k} (x_i − µ̂_k) = n_k (µ̂_k − µ̂_k) = 0.

Therefore,

    S = n^{-1} \sum_{i=1}^n (x_i − µ̂_{t_i})(x_i − µ̂_{t_i})^T + n^{-1} \sum_{k=0,1} n_k (µ̂_k − µ̂)(µ̂_k − µ̂)^T.

The first part of this equation is just S_W. Now, n µ̂ = n_0 µ̂_0 + n_1 µ̂_1, so that µ̂_k − µ̂ = ±(n_{1−k}/n) d, and therefore

    n^{-1} \sum_{k=0,1} n_k (µ̂_k − µ̂)(µ̂_k − µ̂)^T = n^{-1} \sum_{k=0,1} (n_k (n_{1−k})² / n²) d d^T = α(1 − α) d d^T,
This simple decomposition provides insight into the relationship between PCA
and FLD. Maximizing total variance is a good idea for data without class labels,
but once we know the assignment of training pattern to classes, we recognize
that a part of this variance helps with classification (variance between patterns
of different classes), while the remaining part hurts (variance of patterns within
each class) by creating overlap. When maximizing the former part, we have
to control the size of the latter in order to obtain a discriminative direction.
Note that the two parts of S are different in nature. The within-class scatter
matrix S W is a positive definite matrix just like S , while the between-class
scatter matrix S B is of rank one only. We will understand the significance of
this observation in Section 11.2.3.
    Σ_W = E[(x − µ_t)(x − µ_t)^T]

is the true within-class covariance matrix. Note the subtlety in this definition: this is not the covariance of x, which would be

    Σ = E[(x − µ)(x − µ)^T],

since the expectation is over (x, t), not just x. The Bayes optimal rule was determined in Section 6.4.1; it comes with the weight vector w_* = Σ^{-1} d. Since Σ ≠ Σ_W, we have that w_* ≠ w_FLD. However, we will see that w_* and w_FLD are in fact proportional to each other. Since the length of the weight vector does not matter, we end up with the same optimal discriminant.

From Section 11.2.1, we know that Σ = Σ_W + α(1 − α) d d^T, where α(1 − α) > 0. Therefore,

    Σ w_* = d = (Σ − α(1 − α) d d^T) w_FLD,

so that

    Σ w_FLD = d + α(1 − α)(w_FLD^T d) d = (1 + α(1 − α) d^T Σ_W^{-1} d) d = (1 + α(1 − α) d^T Σ_W^{-1} d) Σ w_*.

Here, we used that w_FLD^T d = d^T Σ_W^{-1} d > 0. This means that w_FLD is proportional to w_*, and FLD recovers the Bayes optimal discriminant in this case.
At this point, you should go through the derivation in Section 11.2.1 and confirm for yourself that

    S = S_W + S_B,   S_B = n^{-1} \sum_k n_k d_k d_k^T,   d_k = µ̂_k − µ̂.

Here, µ̂_k is the sample mean over the patterns from class k, and µ̂ is the overall sample mean. What is the rank of S_B? Obviously, rk(S_B) ≤ C. We even have rk(S_B) ≤ C − 1. Define

    M = [√(n_k/n) d_k] ∈ R^{d×C},

so that S_B = M M^T. Since \sum_k n_k d_k = \sum_k n_k (µ̂_k − µ̂) = 0, the columns of M satisfy a nontrivial linear relation, so the C columns of M are not linearly independent. We may assume that the class means µ̂_k are linearly independent, so that rk(M) = C − 1 and rk(S_B) = C − 1. Note that for the binary case C = 2, we recover the FLD situation of a rank one between-class scatter matrix.
We need to choose U so that the covariance U^T S_B U is as large as possible, while the covariance U^T S_W U is as small as possible. There are several ways to measure the size of a covariance matrix, for example its trace (sum of eigenvalues) or its determinant (product of eigenvalues), see Section 11.1.2. For the derivation of LDA, both give the same result, so let us go with the trace. The situation for LDA is a bit more complicated than for PCA, as we need to simultaneously deal with two
covariance matrices S_B and S_W here, not just with one. At this point, it is important to understand that what we are after is not the matrix U itself, but rather the M-dimensional subspace spanned by its columns. We can replace any U by U R for an invertible R ∈ R^{M×M}. Also, the scale of U is arbitrary, so we can fix it without loss of generality. In order to maximize the between-class scatter tr U^T S_B U while controlling the within-class scatter, we can just fix the size of the latter by imposing the constraint U^T S_W U = I. The linear discriminant analysis (LDA) problem is
\[
\max_U \; \mathrm{tr}\, U^T S_B U \quad \text{subj. to} \quad U^T S_W U = I. \tag{11.7}
\]
First, note that a solution of (11.7) is determined only up to orthonormal transformations Q ∈ R^{M×M}:
\[
(UQ)^T S_W (UQ) = Q^T \big( U^T S_W U \big) Q = Q^T Q = I,
\]
\[
\mathrm{tr}\, (UQ)^T S_B (UQ) = \mathrm{tr}\, U^T S_B U\, Q Q^T = \mathrm{tr}\, U^T S_B U.
\]
The same indeterminacy holds for PCA as well: U and U Q span the same subspace. Second, note that a solution of (11.7) is not in general orthonormal. In fact, U whitens the within-class scatter S_W (Section 11.1.1). As we will see in Section 11.2.4, more is true. For a solution U:
\[
U^T S_W U = I, \qquad U^T S_B U \ \text{diagonal}.
\]
Since S_B = M M^T, the objective of (11.7) can be written as
\[
\mathrm{tr}\, U^T S_B U = \mathrm{tr}\, \big( U^T M \big)^T \big( U^T M \big).
\]
In the remainder of this section, we show how to solve LDA efficiently in practice. We will see that by combining the lessons learned in this chapter in a clever way, we can get away with a small eigenproblem of size C × C, along with having to solve C linear systems with S_W: problems which can be solved much more efficiently than a d × d eigendecomposition. First, since S_W is positive definite, it has a Cholesky decomposition S_W = V V^T (Section 4.2.2). Substituting W = V^T U, the LDA problem becomes
\[
\max_W \; \mathrm{tr}\, W^T (V^{-1} M)(V^{-1} M)^T W \quad \text{s.t.} \quad W^T W = I.
\]
This is a PCA-like problem for X = V^{-1} M, which we can solve through the small eigendecomposition
\[
X^T X = (V^{-1} M)^T (V^{-1} M) = M^T (S_W)^{-1} M = Q \Lambda Q^T \in \mathbb{R}^{C \times C}:
\]
the columns of Ŵ = X Q Λ^{-1/2} are orthonormal eigenvectors of X X^T for the largest eigenvalues. Back-substituting,
\[
\hat{U} = V^{-T} \hat{W} = (S_W)^{-1} M Q \Lambda^{-1/2}
\]
solves the LDA problem. Here, we can reuse (S_W)^{-1} M, which was computed above.
With a bit more work, we can even get by with solving C − 1 linear systems
and a (C − 1) × (C − 1) eigendecomposition. For C = 2, this modified procedure
exactly recovers the FLD method derived at the beginning of this section.
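To make the recipe concrete, here is a sketch of the procedure in numpy/scipy (our own rendering under the notation S_W, M, V, Q, Λ from above, not reference code from the course; num_dirs should be at most C − 1, since the smallest eigenvalue of X^T X is zero):

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular, eigh

    def lda_directions(S_W, M, num_dirs):
        # S_W: d x d positive definite within-class scatter; M: d x C as in the text
        V = cholesky(S_W, lower=True)              # S_W = V V^T
        X = solve_triangular(V, M, lower=True)     # X = V^{-1} M: C triangular solves
        lam, Q = eigh(X.T @ X)                     # small C x C eigenproblem (ascending)
        lam, Q = lam[::-1], Q[:, ::-1]             # sort largest eigenvalues first
        W = X @ Q[:, :num_dirs] / np.sqrt(lam[:num_dirs])   # orthonormal W-hat
        # U = V^{-T} W = (S_W)^{-1} M Q Lambda^{-1/2}
        return solve_triangular(V.T, W, lower=False)

One can check that the returned U satisfies U^T S_W U = I and that U^T S_B U is diagonal, as claimed above.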
Chapter 12
Unsupervised Learning
• How to score the “quality” of cluster assignment {ti }, given our assump-
tions?
• How to find an assignment of maximum score among the combinatorial
set of all possible clusterings?
• How to choose the number K of clusters?
A large number of answers to these questions have been proposed, all com-
ing with strengths and weaknesses. Among the more basic concepts, two are
most widely used: agglomerative and divisive clustering. Both are hierarchi-
cal clustering principles, in that not only a single K-clustering is produced,
but a tree of nested clusterings at different granularities (number of groups).
They start with a user-supplied distance function d(x, x′) and the general assumption that distance values between two patterns in the same cluster should be smaller than distance values between two patterns in different clusters.
[Figure: K-means clustering (nine panels on a two-dimensional toy dataset).]
criterion. However, when we run it multiple times, we can end up with different solutions, depending on how the initialization was done. Since each assignment we find satisfies our assumption, we would have to be more specific about how to compare two different outcomes of K-means. In Section 12.1.1, we will devise an energy function which scores assignments {t_i}, and on which K-means descends.
All in all, K-means is a simple and efficient algorithm, which is widely used
for data analysis and within preprocessing pipelines of machine learning appli-
cations. Its main attraction is that it reduces to nearest neighbour search, a
computational primitive which is very well studied, and for which highly effi-
cient algorithms and data structures are known. On the other hand, the non-
uniqueness of solutions can be a significant problem in practice, necessitating
restarts with different initial conditions. Also, K-means is not a robust algo-
rithm in general: small changes in the data {xi } can imply large changes in the
clustering {ti }. Some of these problems are due to the hard cluster assignments
we are forced to make in every single iteration. Finally, a good value for K is
hard to select automatically from data. We will address some of these issues
in the remainder of this chapter, when we devise a probabilistic foundation for
K-means with “soft” assignments.
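For concreteness, here is a minimal numpy sketch of the K-means iteration (our own rendering; the initialization on random datapoints and the stopping rule are common practice, not prescriptions from the text). It also returns the energy φ of Section 12.1.1, whose monotone descent is proved next:

    import numpy as np

    def kmeans(X, K, num_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)]   # means on datapoints
        for _ in range(num_iters):
            # assignment step: each pattern goes to its nearest mean
            dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            t = dist.argmin(axis=1)
            # update step: each mean becomes the average of its assigned patterns
            new_mu = np.stack([X[t == k].mean(axis=0) if np.any(t == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):                     # no mean moved: done
                break
            mu = new_mu
        phi = dist[np.arange(len(X)), t].sum()              # energy of Section 12.1.1
        return t, mu, phi

Because of the non-uniqueness discussed above, one would run this several times from different seeds and keep the solution of smallest energy.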
We will show that φ(t, µ) cannot increase with an iteration of K-means. More-
over, whenever an iteration does not produce a decrease, a target assignment is
reached. This means that the algorithm is finitely convergent. Namely, both t
and µ can only ever attain a finite number of different values. This is clear for
t. Moreover, each µk is the empirical average over a subset of the datapoints
xi , which are fixed up front.
Suppose we are at (t, µ). We first update t → t′, then µ → µ′, so that at the end of an iteration, each updated mean µ′_k is the sample average over the patterns assigned to class k under t′. Denote φ_0 = φ(t, µ), φ_1 = φ(t′, µ), φ_2 = φ(t′, µ′). We also assume that ties are broken conservatively: if, for any i,
\[
\| x_i - \mu_{t_i} \|^2 = \min_k \| x_i - \mu_k \|^2,
\]
then t′_i = t_i. The assignment step can only decrease the energy, so φ_1 ≤ φ_0. For the update step, we expand the squares, making use of the vanishing cross-talk Σ_i I_{\{t′_i = k\}}(x_i − µ′_k) = 0, so that
\[
\phi_1 - \phi_2 = \sum_{k=1}^K \sum_{i=1}^n I_{\{t'_i = k\}} \big( \| x_i - \mu_k \|^2 - \| x_i - \mu'_k \|^2 \big) = \sum_{k=1}^K n'_k \| \mu'_k - \mu_k \|^2 \geq 0.
\]
Did we not do all that before in Chapter 6, under the umbrella of generative
modelling? True, but we will apply this principle to more complex models in
this chapter, thereby unleashing its power. For the models in Chapter 6, vari-
ables were either observed or could be estimated by simple formulae such as
empirical mean, empirical covariance, or count ratios. Here, we augment models
by additional latent variables, such as our group indicators ti , with the aim of
making the model more expressive and realistic.
[Figure 12.2: a two-dimensional dataset (top middle), the maximum likelihood fit of a single Gaussian (top left), and further density estimates, including a three-component Gaussian mixture fit (bottom middle).]
Given the data in Figure 12.2 (top middle), what type of model should we use for density estimation? The simplest choice would be a single Gaussian N(x|µ, Σ). However, the maximum likelihood fit is not good in this case: it is unlikely that this data comes from a single Gaussian (Figure 12.2, top left). Another simple idea is known as kernel density estimation. We place a Gaussian function N(x|x_i, h²I) on each datapoint, then estimate the density as
\[
\hat{p}(x|h) = \frac{1}{n} \sum_{i=1}^n N(x \,|\, x_i, h^2 I).
\]
The kernel width h > 0 is the only free parameter; its choice is a model selection problem (Chapter 10). The kernel density estimator is an instance of a nonparametric1 technique.
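As a short sketch (ours, not part of the original text), the estimator p̂(x|h) can be written directly in numpy:

    import numpy as np

    def kde(x, X, h):
        # p_hat(x | h) = (1/n) sum_i N(x | x_i, h^2 I), for one query point x
        n, d = X.shape
        sq = ((X - x) ** 2).sum(axis=1)               # squared distances ||x - x_i||^2
        norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0)   # normalizer of N(. | x_i, h^2 I)
        return norm * np.exp(-0.5 * sq / h ** 2).mean()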
Here, the p(x|t) are called mixture components and P(t) is called the prior distribution. The free parameters of this model are {(µ_k, Σ_k)} and {P(t = k)}. If x is high-dimensional, it is common to use spherical (Σ_k = σ_k² I) or diagonal covariance matrices. Our probabilistic viewpoint of K-means developed above corresponds to a GMM with Σ_k = I. As seen in Figure 12.2 (bottom middle), a GMM with three components provides an excellent2 fit for our data. At the same time, unlike the kernel density estimator, it can be evaluated cheaply on future data.
Mixture models are ubiquitous in machine learning and applied statistics. For data in R^d, whenever a single Gaussian does not look like a good choice, the next candidate is usually a GMM with a few components. Much of their popularity stems from the simple and intuitive expectation maximization algorithm for parameter fitting, which we will learn about shortly. For discrete data, mixtures of multinomials or of independent Bernoulli (binary) distributions are widely used. There are many variations on the theme, such as hierarchical mixtures, parameter tying, and many more beyond the scope of this course. The AutoClass code3 can be used to set up mixture models over data of many different attribute types and to run expectation maximization. Mixture models have a profound impact on many applications. For example, modern large vocabulary continuous speech recognition systems are based on Gaussian mixture densities.
Given a mixture model, we can fit its parameters to data {x_i} by maximum likelihood estimation. For the GMM introduced at the beginning of this section, the log likelihood is
\[
L(\mu, \pi) = \sum_{i=1}^n \log \underbrace{ \sum_{k=1}^K \pi_k N(x_i \,|\, \mu_k, I) }_{= p(x_i)},
\]
1 Other examples of nonparametric techniques are nearest neighbour classification (Section 2.1) and kernel methods such as the support vector machine (Chapter 9).
2 Not too surprisingly, as the true data generating distribution was a Gaussian mixture.
The update equation for the mean µ_k is an empirical average over the datapoints x_i, as usual in ML estimation. However, each datapoint is weighted by its posterior probability of belonging to the k-th class. If x_i lies halfway between classes 1 and 2, then half of it contributes to µ_1 and half to µ_2. This is the soft group assignment we are after. A similar derivation provides us with the update for π_k. However, we need to take into account that π is a distribution, so we make use of the technique detailed in Section 6.5.3. You should confirm for yourself that the update is
\[
\pi_k = \frac{n_k}{n}, \qquad n_k = \sum_{i=1}^n P(t_i = k \,|\, x_i). \tag{12.2}
\]
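A single EM iteration for this unit-covariance GMM can be sketched as follows (our own illustration; the max-subtraction in the E step is a standard numerical safeguard, not something prescribed by the text):

    import numpy as np

    def em_step(X, mu, pi):
        # E step: posteriors P(t_i = k | x_i) for the unit-covariance GMM
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logp = np.log(pi)[None, :] - 0.5 * sq         # log pi_k + log N up to const.
        logp -= logp.max(axis=1, keepdims=True)       # subtract max for stability
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
        # M step: posterior-weighted means, and mixing weights as in (12.2)
        n_k = post.sum(axis=0)
        mu_new = (post.T @ X) / n_k[:, None]
        pi_new = n_k / len(X)
        return mu_new, pi_new, post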
[Figure 12.3: iterations of EM on a Gaussian mixture model with three components; one panel reports mixing weights π_1 = 0.36, π_2 = 0.36, π_3 = 0.27.]
L(µ, π). Since L is not in general a convex function, this is all we can hope for. Iterations of EM applied to a Gaussian mixture model are shown in Figure 12.3. Properties and generalizations of EM are discussed in the next section. Let us close by making the link between EM and K-means precise. Recall that Σ_k = I and π_k = 1/K, and the group means µ_k are the sole parameters to learn.
This is called the E step ("E" for "expectation"). The Q_i(·) are distribution parameters, on a par with θ. In particular, they do not depend on θ during the second part of the EM iteration, the M step ("M" for "maximization"). The gradient we consider there is
\[
E_{Q_i}\big[ \nabla_\theta \log P(x_i, h_i \,|\, \theta) \big] = \nabla_\theta E_i(\theta; Q_i), \qquad E_i(\theta; Q_i) := E_{Q_i}\big[ \log P(x_i, h_i \,|\, \theta) \big].
\]
In the M step, we maximize Σ_i E_i(θ; Q_i) w.r.t. θ, for the Q_i(·) determined in the E step. To sum up, we initialize4 θ to some sensible values, then iterate over the following two steps:
• E step: compute Q_i(h_i) ← P(h_i | x_i, θ) for each i = 1, . . . , n.
• M step: maximize Σ_i E_i(θ; Q_i) w.r.t. θ. Given the E step statistics, this step is typically not more difficult to do than maximizing the complete data log likelihood.
4 For mixture models, it is often a good idea to assign component means µ_k to randomly chosen datapoints and to set (co)variance parameters to large values (say, to the total (co)variance of the whole dataset).
Here, x_0 = ∅ and p_{a|∅} is an additional distribution for the first word, which we assume to be known. For complete data, we would write
\[
P(x) = \left( \prod_{a=1}^M (p_{a|\emptyset})^{I_{\{x_1 = a\}}} \right) \prod_{a=1}^M \prod_{b=1}^M (p_{a|b})^{\phi_{a|b}(x)}, \qquad \phi_{a|b}(x) = \sum_{j=2}^N I_{\{x_j = a,\; x_{j-1} = b\}}, \tag{12.3}
\]
then estimate pa|b by maximum likelihood, accumulating counts over the docu-
ments assigned to each class.
However, what do we do if some words xj are missing in our training documents?
A simple idea would be to break up documents at missing word locations, to
treat all completely observed parts as independent documents per se. This would
probably be a good working solution in this case, even though it would wreak
havoc with a structured5 document model. But let us work out a principled
solution based on latent variables and EM, which in the case of a substantial
fraction of missing words may well make a difference even in this simple setup.
Our training corpus consists of documents x_i = [x_{ij}] ∈ {1, . . . , M}^{N_i}. For notational simplicity, we drop the document index i and concentrate on a single document x = [x_j], denoting by O the set of observed positions and by H the set of missing ones.
5 For example, on top of the bigram probabilities, we may impose a higher order structure
(title, abstract, introduction, main, references), or we may want to model document length.
What does EM look like for this missing data example? In the E step, we need to compute the posterior distributions Q(x_H) ← P(x_H | x_O, θ), one for each document. To fix ideas, consider the four-word example in Figure 12.4. Here, H = {2}, while O = {1, 3, 4}. The posterior distribution is
\[
P(x_2 \,|\, x_1, x_3, x_4) = \frac{P(x_1, x_2, x_3, x_4)}{\sum_{x_2'} P(x_1, x_2', x_3, x_4)} = \frac{p_{x_4|x_3}\, p_{x_3|x_2}\, p_{x_2|x_1}\, p_{x_1|\emptyset}}{\sum_{x_2'} p_{x_4|x_3}\, p_{x_3|x_2'}\, p_{x_2'|x_1}\, p_{x_1|\emptyset}}.
\]
The first point to note is that all terms in the numerator which do not depend on x_2 appear in the denominator as well, and they cancel each other out:
\[
P(x_2 \,|\, x_1, x_3, x_4) = \frac{p_{x_3|x_2}\, p_{x_2|x_1}}{\sum_{x_2'} p_{x_3|x_2'}\, p_{x_2'|x_1}}.
\]
6 It is straightforward to remove this assumption, but the derivation becomes a bit more involved.
What happens in the general case? No matter how many other observed words occur elsewhere: if only a single word x_j is missing, cancellation happens all the same:
\[
P(x_j \,|\, x_O) = \frac{p_{x_{j+1}|x_j}\, p_{x_j|x_{j-1}}}{\sum_{x_j'} p_{x_{j+1}|x_j'}\, p_{x_j'|x_{j-1}}}. \tag{12.4}
\]
The same formula holds even if other words are missing as well. To understand
the following general argument, it helps to write down some small examples (do
that!). Namely,
\[
p(x_j \,|\, x_O) = \frac{P(x_j, x_O)}{P(x_O)}, \qquad P(x_j, x_O) = \sum_{x_H \setminus x_j} P(x).
\]
Since C_{<j} and C_{>j} do not depend on x_j, they cancel out in (12.4). We have shown that
\[
Q(x_H) \leftarrow P(x_H \,|\, x_O, \theta) = \prod_{j \in H} P(x_j \,|\, x_O, \theta),
\]
where P(x_j | x_O, θ) is computed as in (12.4). A naive7 way to do the E step computations costs O(|H| M) per document.
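As a sketch of this E step computation (ours; it assumes, as in footnote 6, that the neighbours of the missing word are observed, and uses 0-based indices with p[a, b] standing for p_{a|b}):

    import numpy as np

    def posterior_missing_word(p, x_prev, x_next):
        # Q(x_j = a) as in (12.4), given observed x_{j-1} = x_prev, x_{j+1} = x_next
        q = p[x_next, :] * p[:, x_prev]     # p_{x_{j+1}|a} * p_{a|x_{j-1}} over all a
        return q / q.sum()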
For the M step, we follow the mantra laid out in Section 12.3.1. The surrogate
criterion E(θ; Q) for our document is obtained by averaging
\[
\log P(x) \doteq \sum_{a=1}^M \sum_{b=1}^M \phi_{a|b}(x) \log p_{a|b}
\]
over Q(x_H) (here, ≐ denotes equality up to a constant independent of θ). All
we have to do is to average the indicators φa|b (x), then update the pa|b in the
same way as before, based on these average statistics. A convenient way to write
the average statistics is to extend the Q distribution over the whole document
x. To this end, we deviate from our convention above and denote the observed
values by x̃O instead of xO . Then,
\[
Q(x) = Q(x_H) \prod_{j \in O} I_{\{x_j = \tilde{x}_j\}}.
\]
Taking an expectation w.r.t. Q(x) means plugging in x̃O for xO , then taking
the expectation over Q(xH ). Given this definition, the M step surrogate is
\[
E(\theta; Q) \doteq \sum_{a=1}^M \sum_{b=1}^M Q_{a|b} \log p_{a|b}, \qquad Q_{a|b} = E_Q\big[ \phi_{a|b}(x) \big] = \sum_{j=2}^N Q(x_j = a)\, Q(x_{j-1} = b).
\]
7 In practice, for any given b, p_{a|b} ≈ 0 for most a ∈ {1, . . . , M}. This approximate sparsity of the distributions can be used to speed up accumulations dramatically.
Please check for yourself how this computation would be done efficiently in practice. For example, at least one of the factors in Q(x_j = a) Q(x_{j-1} = b) is an indicator. If most of the words are observed, it may be fastest to first do the usual accumulation over observed pairs (x_j, x_{j-1}), followed by adding terms for the missing x_j. Finally, according to Section 6.5.3, the M step updates are
\[
p_{a|b} = \frac{Q_{a|b}}{\sum_{a'} Q_{a'|b}}, \qquad a, b \in \{1, \ldots, M\}.
\]
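The accumulation and normalization for a single document can be sketched as follows (our own illustration; in practice, Q would be accumulated over all documents before normalizing, and all-zero columns would need guarding):

    import numpy as np

    def m_step_bigram(q_list, M):
        # q_list[j] is Q(x_j = .): a one-hot vector for observed words, the
        # posterior (12.4) for missing ones. Accumulate Q_{a|b}, then normalize.
        Q = np.zeros((M, M))
        for q_j, q_jm1 in zip(q_list[1:], q_list[:-1]):
            Q += np.outer(q_j, q_jm1)                 # adds Q(x_j = a) Q(x_{j-1} = b)
        return Q / Q.sum(axis=0, keepdims=True)       # p_{a|b} = Q_{a|b} / sum Q_{a'|b}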
Here, the latent variable h is discrete. Our derivation applies without changes for continuous (or mixed) latent variables as well; we only have to replace sums by integrals. During EM, we maintain distributions Q_i(h_i) over h_i, one for each datapoint, writing Q for {Q_i}. Our argument is based on the auxiliary criterion
\[
\phi(\theta; Q) = \sum_{i=1}^n \Big( E_{Q_i}\big[ \log P(x_i, h_i \,|\, \theta) \big] + E_{Q_i}\big[ -\log Q_i(h_i) \big] \Big).
\]
Here,
\[
H[Q_i(h_i)] = E_{Q_i}\big[ -\log Q_i(h_i) \big] = \sum_{h_i} Q_i(h_i) \big( -\log Q_i(h_i) \big)
\]
is the entropy of Q_i(h_i), so that φ(θ; Q) = E(θ; Q) + Σ_i H[Q_i(h_i)], where E(θ; Q) is the surrogate criterion defined in Section 12.3.1. The entropic part is added to the energy E for technical reasons. It does not depend on θ, so the M step is not influenced.
In order to establish EM convergence, we show that L(θ) cannot decrease with an EM iteration. Moreover, if it does not increase, we must be at a local maximum. A key step will be the bound
\[
\phi(\theta; Q) \leq L(\theta), \tag{12.5}
\]
valid for any θ and any distributions Q_i(h_i). Before we begin, note that for a GMM, the likelihood as a function of the means and
covariances for each component is not upper bounded per se. We can place one
component on top of a datapoint and shrink the variance to zero in order to
obtain infinite likelihood! Such degenerate solutions are avoided by constraining
the variances by some positive lower bound. In the remainder of this section,
we assume that L(θ) is upper bounded.
We begin by relating L(θ) and φ(θ; Q) for arbitrary distributions Q_i(h_i). Pick any i ∈ {1, . . . , n}. For the following, recall the definition of the posterior P(h_i | x_i, θ):
\[
\log P(x_i|\theta) = E_{Q_i}\big[ \log P(x_i|\theta) \big] = E_{Q_i}\left[ \log \frac{P(x_i, h_i|\theta)}{P(h_i|x_i,\theta)} \right] = E_{Q_i}\left[ \log \frac{P(x_i, h_i|\theta)\, Q_i(h_i)}{P(h_i|x_i,\theta)\, Q_i(h_i)} \right]
\]
\[
= E_{Q_i}\left[ \log \frac{P(x_i, h_i|\theta)}{Q_i(h_i)} \right] + E_{Q_i}\left[ \log \frac{Q_i(h_i)}{P(h_i|x_i,\theta)} \right] = E_{Q_i}\big[ \log P(x_i, h_i|\theta) \big] + H[Q_i(h_i)] + D\big[ Q_i(h_i) \,\|\, P(h_i|x_i,\theta) \big].
\]
Here, D[Q_i(h_i) ‖ P(h_i | x_i, θ)] is the relative entropy from Section 6.5.3 (make sure you recall the derivations in that section; we will need them here). This means that
\[
\phi(\theta; Q) = L(\theta) - \sum_{i=1}^n D\big[ Q_i(h_i) \,\|\, P(h_i \,|\, x_i, \theta) \big] \tag{12.6}
\]
for any θ and any distributions Q_i(h_i). But we know that the relative entropy between two distributions is nonnegative, and zero if and only if the distributions are the same, so that (12.6) implies (12.5), with equality if and only if Q_i(h_i) = P(h_i | x_i, θ) for all i.
The right-hand side consists of partial derivatives; the Q_i do not depend on θ. Therefore, a stationary point of the EM algorithm must be a stationary point of L(θ) as well. We already derived the gradient of L(θ) in Section 12.3.1:
\[
\nabla_\theta \log \sum_{h_i} P(x_i, h_i|\theta) = \frac{1}{P(x_i|\theta)} \sum_{h_i} P(x_i, h_i|\theta)\, \nabla_\theta \log P(x_i, h_i|\theta) = E_{Q_i(h_i)}\big[ \nabla_\theta \log P(x_i, h_i|\theta) \big] = \nabla_\theta \Big( E_{Q_i}\big[ \log P(x_i, h_i|\theta) \big] + H[Q_i(h_i)] \Big),
\]
where Q_i(h_i) = P(h_i | x_i, θ) is held fixed under ∇_θ.
To EM or not to EM
This concludes our discussion of the EM algorithm, a simple and often surpris-
ingly efficient method for finding a stationary point (local maximum) of the log
likelihood of observed data. As discussed in Section 12.3.1, EM is particularly
attractive if the M step computations can be done in closed form, by accumulat-
ing sufficient statistics over E step posteriors rather than counts. However, it is
important to note that EM is not always the best algorithm to solve maxθ L(θ)
locally. We briefly discussed nonlinear gradient-based optimizers in Section 3.4.2.
As seen above, computing the gradient ∇θ L(θ) comes at exactly the same cost
as a single E step in EM. When comparing EM with alternatives, we should
focus on two points:
fields, or log-linear undirected models in general. Most of the early work used
variants of EM, such as iterative proportional fitting. In these algorithms, we
obtain closed form M step updates only after additional crude bounding. These
approaches have vanished entirely today, as modern optimizers such as scaled
conjugate gradients, limited memory quasi-Newton or truncated Newton run
several orders of magnitude faster. Another poor idea is to use EM in order
to compute principal components analysis (PCA) directions (Section 11.1): the
Lanczos algorithm is much faster in practice (Section 11.1.4), and good code is
publicly available.
A final comment concerns the E step computations, which are required in order
to compute ∇θ L(θ) just as well. For the Gaussian mixture models with a mod-
erate number K of components, the E step posteriors are cheap to compute. But
in general, this computation can be hard to do if h is large and has dependent
components. If you find yourself in such a situation, you need to consider approximate posterior computations. It is possible to extend the EM algorithm to allow for such approximations. The resulting variational EM algorithm will still be convergent, but it does not in general maximize the exact log likelihood anymore.
Appendix A
Lagrange Multipliers and Lagrangian Duality
We have derived the dual formulation of the soft margin SVM problem in Section 9.3. In this section, we provide a much more general picture of the underlying principle of Lagrange duality, which plays a central role in modern machine learning, far beyond support vector machines. We will introduce the framework in two stages, first seeking to generalize the first-order stationarity condition to constrained problems by way of a Lagrangian function, then exposing the duality embedded in this function, leading to a dual optimization problem. Finally, we will rederive the soft margin SVM dual from this more general perspective. This section is slightly more advanced and can be skipped in a first reading.
Lagrange duality is a powerful framework which helps in solving constrained optimization problems with continuously differentiable objectives. We will restrict ourselves to linear constraints, as this is all we need in this course, but the framework is more generally applicable. There are two aspects to the Lagrangian technique. The first is a generalization of the famous first order optimality condition1 ∇_x f(x) = 0 to the constrained case, by introducing new variables (multipliers) and the Lagrangian function. This is in fact what Lagrange did. It often helps to fence in optimal solutions by exploring stationary points of the Lagrangian, but it does not provide a method to find them. The second aspect is Lagrange duality. The nontrivial stationary points of the Lagrangian are saddlepoints, which can be narrowed down naturally via two optimization problems (primal and dual). At least for convex optimization problems, Lagrange duality directly leads to algorithms for finding optimal solutions. Modern optimization textbooks often start with the duality and present the optimality condition in passing, but we will follow Lagrange and develop the Lagrangian from first principles.
The optimization problem we will be interested in here is
\[
\tilde{p}_* = \min_{x \in \mathbb{R}^p} f(x) \quad \text{subj. to} \quad g_j(x) = 0 \;\;\forall j, \quad h_k(x) \leq 0 \;\;\forall k, \tag{A.1}
\]
where each g_j(x) = 0 is a linear equality constraint and
each hk (x) ≤ 0 is a linear inequality constraint. The set of those x which fulfil
all constraints is called the feasible set, and its members x are called feasible.
For an unconstrained problem, all x ∈ Rp are feasible. The feasible set in our
case is a convex polytope (Section 9.1.1). The number p̃∗ is called the value of
the primal problem. We will assume that the feasible set is not empty, and that
p̃∗ ∈ R, in particular p̃∗ > −∞.
Optimality Conditions
Suppose we are at a feasible point x, so that g(x) = 0. Let us first determine along which directions d we may move at all without violating the constraint: to first order, these are the d with (∇_x g)^T d = 0. Now, if (∇_x g)^T d = 0 and (∇_x f)^T d > 0, we can descend along −d, so at an optimal point we need
\[
\nabla_x f = -\lambda \nabla_x g, \quad \text{i.e.,} \quad \nabla_x f + \lambda \nabla_x g = 0, \quad \lambda \in \mathbb{R}.
\]
1 First order in this context means that only the gradient (first derivatives) is used. In contrast, second order conditions look at the Hessian as well (Section 3.4).
Let us see whether this works for our example above. f(x) = x_1, a = δ_1, b = 1, so that ∇_x f = δ_1 = ∇_x g, and the condition holds for λ = −1. It is not very useful in this example, since it holds for any x ∈ R², but it certainly is a necessary condition for optimality. We illustrate Lagrange's condition in Figure A.1, top.
[Figure A.1: feasible directions and descent directions at a constrained point (two panels).]
Two things can happen now at x. Either h(x) < 0, so that h(x + εd) < 0 for small ε, no matter what d. The constraint is inactive, and we can simply pretend it is not there. The optimality condition is then ∇_x f = 0. Or h(x) = 0: the constraint is active. Then,
\[
\nabla_x f + \alpha \nabla_x h = 0, \quad \alpha \geq 0, \quad \alpha h(x) = 0.
\]
Our primal problem (A.1) has several equality and inequality constraints. What do we do then? Our conditions all have the same form, so we simply add them all up:
\[
\nabla_x f + \sum_j \lambda_j \nabla_x g_j + \sum_k \alpha_k \nabla_x h_k = 0, \quad \alpha_k \geq 0, \quad \alpha_k h_k(x) = 0 \;\;\forall k.
\]
These are the stationarity conditions of the Lagrangian
\[
L(x, \lambda, \alpha) = f(x) + \sum_j \lambda_j g_j(x) + \sum_k \alpha_k h_k(x), \quad \alpha \succeq 0.
\]
Here, α ⪰ 0 is short for α_k ≥ 0 for all k. The Lagrangian is a function not only of x, but also of λ, α. These additional variables are called Lagrange multipliers.
The complete optimality conditions for (x, λ, α) read
\[
\frac{\partial L}{\partial x} = 0, \qquad \frac{\partial L}{\partial \lambda_j} = 0 \;\;\forall j, \qquad \alpha_k \geq 0, \;\; \frac{\partial L}{\partial \alpha_k} \leq 0, \;\; \alpha_k \frac{\partial L}{\partial \alpha_k} = 0 \;\;\forall k. \tag{A.2}
\]
Here, we used that ∂L/∂λj = gj (x) and ∂L/∂αk = hk (x). Note how all con-
ditions are expressed in terms of derivatives of the Lagrangian. A worthy re-
placement for ∇x f = 0 indeed. A brief way of describing (A.2) is that (x, λ, α)
constitutes a stationary point of the Lagrangian.
Lagrange Duality
For fixed x, maximizing the Lagrangian over the multipliers gives
\[
\max_{\lambda,\, \alpha \succeq 0} L(x, \lambda, \alpha) = \begin{cases} f(x), & x \text{ feasible}, \\ +\infty, & \text{otherwise}, \end{cases} \tag{A.3}
\]
whose minimum over x is precisely the primal problem (A.1). What does that mean? The Lagrangian may have stationary points of many kinds, but as our goal is to solve the primal problem, we should only be interested in its saddlepoints of the type (A.3). These are not the only saddlepoints; we might just as well look at the opposite order of optimization, the dual problem
\[
\tilde{d}_* = \max_{\lambda,\, \alpha \succeq 0} \; \min_x L(x, \lambda, \alpha), \tag{A.4}
\]
whose value always satisfies d̃_* ≤ p̃_*.
The dual value d˜∗ bounds the primal value p̃∗ from below, a fact which is called
weak duality. Why is this useful? It turns out that in many situations, the dual
problem is easier to solve than the primal problem. For example, it is easily
confirmed that (A.4) is a concave3 maximization problem, no matter what the
primal problem is. Even for very hard primal problems, the dual (A.4) can
often be solved easily and provides a lower bound d˜∗ on the primal value p̃∗ .
This technique is used frequently in theoretical computer science. Moreover, there may be far fewer dual variables (λ, α) than primal ones (x), which is what happens for the soft margin SVM problem (Section 9.3).
2 We explicitly allow minima to be −∞ and maxima to be +∞, with the understanding
Let us look at an example, illustrated in Figure A.2. The primal problem is
\[
\min_x \; \frac{1}{2} x^2 \quad \text{subj. to} \quad x \leq -1,
\]
with Lagrangian
\[
L(x, \alpha) = \frac{1}{2} x^2 + \alpha(x + 1), \quad \alpha \geq 0.
\]
For any α ≥ 0:
\[
\frac{\partial L}{\partial x} = x + \alpha,
\]
so that x_*(α) = −α minimizes L(x, α). The dual function is
\[
g(\alpha) = L(x_*(\alpha), \alpha) = \alpha - \frac{1}{2} \alpha^2.
\]
A proof of this statement can be found in [2, ch. 3.4]. The practical meaning of this exercise is as follows: if we know that strong duality holds, we can solve our primal problem through the dual. First, we solve the dual problem, giving rise to λ_*, α_* ⪰ 0, and the value d̃_*. Then, x_* = argmin_x L(x, λ_*, α_*) is an optimal point for the primal problem, whose value is p̃_* = d̃_*. Even better, some modern primal-dual optimizers work on the primal and dual problems in an interleaved fashion; the gap between their current values is used to monitor progress.
Figure A.2: Example of primal and dual functions for a simple QP. The primal criterion is f(x) = x²/2, subject to x ≤ −1. Shown are Lagrangian curves L(x, α) for various values of α (dashed), as well as their minimum points x_*(α) (circles). Note that L(x, 0) coincides with f(x). The largest of these minimum values, g(α) = L(x_*(α), α), is attained at α_* = 1.
Since x_*(α) = −α, we can plot the "inversion" of the dual g(α), namely x ↦ g(−x), in the same figure. Note how g(α) ≤ f(x) on the respective feasible regions x ≤ −1 and α ≥ 0, and that the two coincide at the saddlepoint x_* = −1, α_* = 1.
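A quick numeric check of this example (our own sketch) confirms that dual maximum and primal minimum agree:

    import numpy as np

    alpha = np.linspace(0.0, 3.0, 3001)          # dual feasible region alpha >= 0
    g = alpha - 0.5 * alpha ** 2                 # dual function from above
    a_star = alpha[g.argmax()]                   # -> 1.0
    x_star = -a_star                             # minimizer of L(x, alpha*) -> -1.0
    p_star = 0.5 * x_star ** 2                   # primal value at x* -> 0.5
    print(a_star, g.max(), x_star, p_star)       # d* = p* = 0.5: strong duality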
\[
\Phi_D(\alpha, \nu) = \min_{w, b, \xi} L.
\]
The inner minimization w.r.t. w and b works in the same way as in Section 9.3.
Setting the gradient equal to zero results in
\[
w = \sum_{i=1}^n \alpha_i t_i \phi(x_i), \qquad \alpha^T t = \sum_{i=1}^n \alpha_i t_i = 0.
\]
Finally,
\[
\nabla_\xi L = C \mathbf{1} - \alpha - \nu = 0 \;\;\Rightarrow\;\; \nu = C \mathbf{1} - \alpha.
\]
Bibliography
[17] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins Univer-
sity Press, 3rd edition, 1996.
[24] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Solla et al. [40], pages 470–476.
[30] C. Paige and M. Saunders. LSQR: An algorithm for sparse linear equations
and sparse least squares. ACM Transactions on Mathematical Software,
8(1):43–71, 1982.
[32] J. Platt. Fast training of support vector machines using sequential minimal
optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances
in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press,
1998.
[33] J. Platt. Probabilistic outputs for support vector machines and comparisons
to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf,
and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT
Press, 1999.
[36] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 1st edition,
2002.
[37] M. Seeger. Bayesian model selection for support vector machines, Gaussian
processes and other kernel classifiers. In Solla et al. [40], pages 603–609.
[38] M. Seeger. Gaussian processes for machine learning. International Journal
of Neural Systems, 14(2):69–106, 2004.
[39] G. Simmons. Calculus with Analytic Geometry. McGraw-Hill, 2nd edition,
1996.
[40] S. Solla, T. Leen, and K.-R. Müller, editors. Advances in Neural Informa-
tion Processing Systems 12. MIT Press, 2000.
[41] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 1st
edition, 2008.
[42] G. Strang. Introduction to Linear Algebra. Wellesley – Cambridge Press,
4th edition, 2009.