Pattern Classification and Machine Learning

Matthias Seeger
Probabilistic Machine Learning Laboratory
Ecole Polytechnique Fédérale de Lausanne
INR 112, Station 14, CH-1015 Lausanne
matthias.seeger@epfl.ch

May 15, 2012


Contents

1 Introduction 1
1.1 Learning Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 How To Read These Notes . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Mathematical Preliminaries . . . . . . . . . . . . . . . . . . . . . 2
1.4 Recommended Machine Learning Textbooks . . . . . . . . . . . . 3
1.5 Thanks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Linear Classification 5
2.1 A First Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Techniques: Vectors, Inner Products, Norms . . . . . . . . 9
2.2 Hyperplanes and Feature Spaces . . . . . . . . . . . . . . . . . . 13
2.3 Perceptron Classifiers . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 The Perceptron Convergence Theorem . . . . . . . . . . . 19
2.3.2 Normalization of Feature Vectors . . . . . . . . . . . . . . 21
2.3.3 The Margin of a Dataset (*) . . . . . . . . . . . . . . . . 22
2.4 Error Function Minimization. Gradient Descent . . . . . . . . . . 23
2.4.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Online and Batch Learning. Perceptron Algorithm as Gra-
dient Descent . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.3 Techniques: Matrices and Vectors. Outer Product . . . . . 29

3 The Multi-Layer Perceptron 33


3.1 Why Nonlinear Classification? . . . . . . . . . . . . . . . . . . . . 33
3.2 Multi-Layer Perceptrons . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Vectorization of MLP Formalism . . . . . . . . . . . . . . 37
3.3 Error Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Vectorization of Error Backpropagation . . . . . . . . . . 40
3.4 Training a Multi-Layer Perceptron . . . . . . . . . . . . . . . . . 40


3.4.1 Gradient Descent Optimization in Practice . . . . . . . . 43


3.4.2 Optimization beyond Gradient Descent (*) . . . . . . . . 47

4 Linear Regression. Least Squares Estimation 51


4.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Techniques: Solving Univariate Linear Regression . . . . . 54
4.2 Linear Least Squares Estimation . . . . . . . . . . . . . . . . . . 55
4.2.1 Geometry of Least Squares Estimation . . . . . . . . . . . 56
4.2.2 Techniques: Orthogonal Projection. Quadratic Functions . 56
4.2.3 Solving the Normal Equations (*) . . . . . . . . . . . . . 58

5 Probability. Decision Theory 61


5.1 Essential Probability . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1.1 Independence. Conditional Independence . . . . . . . . . 65
5.1.2 Probability Densities . . . . . . . . . . . . . . . . . . . . . 66
5.1.3 Expectations. Mean and Covariance . . . . . . . . . . . . 67
5.2 Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.1 Minimizing Classification Error . . . . . . . . . . . . . . . 71
5.2.2 Discriminant Functions . . . . . . . . . . . . . . . . . . . 73
5.2.3 Example: Class-conditional Cauchy Distributions . . . . . 73
5.2.4 Loss Functions. Minimizing Risk . . . . . . . . . . . . . . 74
5.2.5 Inference and Decisions . . . . . . . . . . . . . . . . . . . 76

6 Probabilistic Models. Maximum Likelihood 79


6.1 Generative Probabilistic Models . . . . . . . . . . . . . . . . . . . 79
6.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 81
6.3 The Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . 83
6.3.1 Techniques: Determinants . . . . . . . . . . . . . . . . . . 87
6.3.2 Techniques: Working with Densities (*) . . . . . . . . . . 89
6.3.3 Techniques: Density after Transformation (*) . . . . . . . 89
6.4 Maximum Likelihood for Gaussian Distributions . . . . . . . . . 90
6.4.1 Gaussian Class-Conditional Distributions . . . . . . . . . 91
6.4.2 Techniques: Bayes Error for Gaussian Class-Conditionals
(*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4.3 Techniques: MLE for Multivariate Gaussian (*) . . . . . . 96
6.5 Maximum Likelihood for Discrete Distributions . . . . . . . . . . 98
6.5.1 Using Indicators in Maximum Likelihood Estimation . . . 101
6.5.2 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . 103
6.5.3 Techniques: Maximizing Discrete Log Likelihoods (*) . . . 105

7 Generalization. Regularization 109


7.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.1.1 Over-fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.1 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.2 Regularized Least Squares Estimation (*) . . . . . . . . . 117
7.3 Maximum A-Posteriori Estimation . . . . . . . . . . . . . . . . . 119
7.3.1 Examples of Conjugate Prior Distributions (*) . . . . . . 123

8 Conditional Likelihood. Logistic Regression 127


8.1 Conditional Maximum Likelihood . . . . . . . . . . . . . . . . . . 128
8.1.1 Issues with the Squared Error Function . . . . . . . . . . 128
8.1.2 Squared Error and Gaussian Noise . . . . . . . . . . . . . 130
8.1.3 Conditional Maximum Likelihood . . . . . . . . . . . . . . 131
8.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.2.1 Gradient Descent Optimization . . . . . . . . . . . . . . . 135
8.2.2 Estimating Posterior Class Probabilities (*) . . . . . . . . 137
8.2.3 Generative and Discriminative Models . . . . . . . . . . . 138
8.2.4 Iteratively Reweighted Least Squares (*) . . . . . . . . . . 141
8.3 Discriminative Models . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3.1 Multi-Way Logistic Regression . . . . . . . . . . . . . . . 143
8.3.2 Conditional Maximum A-Posteriori Estimation (*) . . . . 146

9 Support Vector Machines 149


9.1 Maximum Margin Perceptron Learning . . . . . . . . . . . . . . . 149
9.1.1 A Convex Optimization Problem . . . . . . . . . . . . . . 152
9.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . 153
9.2.1 Soft Margins . . . . . . . . . . . . . . . . . . . . . . . . . 154
9.2.2 Feature Expansions. Representer Theorem . . . . . . . . . 156
9.2.3 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . 159
9.2.4 Techniques: Properties of Kernels (*) . . . . . . . . . . . . 162
9.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.3 Solving the Support Vector Machine Problem . . . . . . . . . . . 165
9.4 Support Vector Machines and Kernel Logistic Regression (*) . . 170

10 Model Selection and Evaluation 173


10.1 Bias, Variance and Model Complexity . . . . . . . . . . . . . . . 173
10.1.1 Validation and Test Data . . . . . . . . . . . . . . . . . . 174
10.1.2 A Simple Example . . . . . . . . . . . . . . . . . . . . . . 174
10.1.3 Bias-Variance Decomposition . . . . . . . . . . . . . . . . 176
10.1.4 Examples of Bias-Variance Decomposition . . . . . . . . . 178
10.2 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
10.2.1 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . 182
10.2.2 Leave-One-Out Cross-Validation (*) . . . . . . . . . . . . 186

11 Dimensionality Reduction 189


11.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . 189
11.1.1 Three Ways to Principal Components Analysis . . . . . . 191
11.1.2 Techniques: Eigendecomposition. Rayleigh-Ritz Charac-
terization . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
11.1.3 Principal Components Analysis in Practice . . . . . . . . 200
11.1.4 Large Scale Principal Components Analysis (*) . . . . . . 201
11.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . 202
11.2.1 Decomposition of Total Covariance . . . . . . . . . . . . . 206
11.2.2 Relationship to Optimal Classification (*) . . . . . . . . . 207
11.2.3 Multiple Classes . . . . . . . . . . . . . . . . . . . . . . . 208
11.2.4 Techniques: Generalized Eigenproblems. Simultaneous
Diagonalization (*) . . . . . . . . . . . . . . . . . . . . . . 210

12 Unsupervised Learning 213


12.1 Clustering. K-Means Algorithm . . . . . . . . . . . . . . . . . . . 214
12.1.1 Analysis of the K-Means Algorithm . . . . . . . . . . . . 217
12.2 Density Estimation. Mixture Models . . . . . . . . . . . . . . . . 218
12.2.1 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . 219
12.3 Latent Variable Models. Expectation Maximization . . . . . . . . 222
12.3.1 The Expectation Maximization Algorithm . . . . . . . . . 223
12.3.2 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . 225
12.3.3 Convergence of Expectation Maximization . . . . . . . . . 228

A Lagrange Multipliers and Lagrangian Duality 233


A.1 Soft Margin SVM Revisited . . . . . . . . . . . . . . . . . . . . . 238
Chapter 1

Introduction

In this chapter, some general comments are given on the subject of these notes,
how to read them, what kind of mathematical preliminaries there are, as well
as some recommendations for further reading.

1.1 Learning Outcomes


We will understand what machine learning is all about as we move through the
course. Suffice it to say, it is of growing importance in most fields which grow,
and it is an incredibly exciting field to learn about and to work in, whether in
research or in one of those many companies which embrace the concept.
These notes complement a course on machine learning at a fairly introductory
level. However, the major aim is to train the reader as an expert in the devel-
opment and understanding of machine learning concepts and techniques, not
simply as a user. At the end of this course, you will understand

• What the basic machine learning methods and techniques are, at least
when it comes to supervised machine learning (pattern classification, curve
fitting).
• How to apply these methods to real-world problems and datasets.
• Why/when these methods work, and why/when they do not.
• Relationships between methods, opportunities for recombinations. We will
learn about the basic “red threads” through this rapidly growing field.
• How to derive and safely implement machine learning methodology by
building on underlying standard technology.

1.2 How To Read These Notes


It is highly recommended to read these notes in an interleaved fashion with
the classroom lectures. The reader should allocate a substantial amount of time


to working through the notes, in particular as a follow-up to the classroom
lectures.
Two points are special about these course notes. First, there is substantial ma-
terial on Techniques. Mastering machine learning, whether in research or for
working in one of those cool companies, is mastering a bag of tricks. Fortu-
nately, a rather limited number of mathematical tools come up over and over
again, and the reader will find many of these in the Techniques sections. Each
of them is directly motivated by an application in the main text. In order to
benefit from this material, the reader has to actively work through it, jotting
down notes on a piece of paper.
Second, these notes contain more material than is covered in the classroom lec-
tures. Slightly more advanced material is presented in sections labelled by “(*)”.
Notice that sections labelled as “(*)” may still feature in classroom lectures and
the exam (sections labelled as “(*)”, which are not discussed during classroom
lectures, will not feature in the exam, unless explicitly said otherwise).
These course notes are brand new. The author very much hopes for feedback
from readers on how to make them more useful, more readable, more fun to
learn from, and for fixing mistakes.

1.3 Mathematical Preliminaries


Given that machine learning may be considered a discipline of computer science,
it places somewhat different demands on students compared to fields which have
traditionally been considered core computer science (such as logics, complexity,
verification, databases, etc.). Briefly, we need continuous mathematics: linear al-
gebra, differential calculus and continuous optimization. And we need probability
and computational statistics. Here is the good news: machine learning offers you
an amazing route to refresh your skills in these disciplines, to learn them from
a different and immediately practically rewarding perspective, and to recognize
their purpose even if you do not care about physics.
In these notes, required basic mathematical skills are reviewed on the fly, a sort
of “maths as we go” approach. Instead of tucking it away in some Appendix,
the maths is presented where needed, and a lot of “hyperlinks” are used to get
the reader to visit these pages. However, these parts can be no more than a
reminder. They are neither complete, nor very didactic, and certainly they are
not rigorous. Here are some recommended sources for further study, which meet
these criteria:

• For linear algebra, the author’s favourite is the book by Strang [42]. Make
sure to check out the accompanying videos.

• For calculus, the book by Simmons [39] seems to be widely acclaimed for
its intuitive approach and focus on applications to real-world problems.

• An acclaimed introductory text on probability is by Grinstead and Snell


[19], this book can be downloaded free of charge from the web. The author
can also recommend the book by Grimmett and Stirzaker [18].

1.4 Recommended Machine Learning Textbooks
There are many books on machine learning. Some are more basic than these
course notes, and they contain many examples, code and “tricks of the trade”.
While these are certainly helpful as an addition to the material here, the author
is not convinced enough by any of them for a recommendation. One intermediate
level book is by Duda, Hart and Stork [12]. While this is a useful book to read,
it is the author’s opinion that the second edition is rather a step backwards from
the first one, which is excellent but out of date. It also comes with an excessive
price tag, so think twice before you buy it. Then, there is a host of books
written by leading machine learning researchers today, each of which contains
substantially more (and somewhat more advanced) material than these notes.
The book by Bishop [5] may be the best to read in addition to these notes,
several of the excellent figures are used here. His older textbook [4] is less up to
date, but still an excellent source, in particular for nonlinear optimization and
multi-layer perceptrons. The book by MacKay [28] is one of the author’s absolute
favourites, and full of great exercises to train your understanding (check it out;
it is freely available from the author’s homepage). A good book on statistical
data analysis and kernel methods is that by Hastie et.al. [20]. The book by
Koller and Friedmann [25] is encyclopaedic and deals exclusively with graphical
models and Bayesian inference. Finally, as to date (spring 2012), there are new
books coming out by David Barber (UC London) and Kevin Murphy (UBC
Vancouver): time will tell how they fit in.

1.5 Thanks
The author would like to thank Hesam Setareh for creating many of the figures.
Chapter 2

Linear Classification

In this chapter, we will study classifiers based on linear discriminant functions


in a fixed, finite-dimensional feature space. A running example will be the clas-
sification of hand-written digits. We will get to know a first algorithm to train
a linear classifier on data. By studying properties and convergence of this per-
ceptron algorithm, we develop geometric intuition which will be important in
subsequent chapters. More generally, we will see that many machine learning
problems can be phrased in terms of mathematical optimization, as minimiza-
tion of an error function. We will get a first idea about the squared error func-
tion, learn to know batch and online gradient descent optimization, and finally
develop a second viewpoint on the perceptron learning algorithm.

2.1 A First Example

Machine learning is a big field, but let’s not be timid and jump right into the
middle. Consider the problem of classifying hand-written digits (Figure 2.1).
The US postal service (USPS) decides to commission a system to automate
the presorting of envelopes according to ZIP codes, and your company wins the
contract. What does such a system look like? Let us ignore hardware issues: a
scanned image of the envelope comes in, a ZIP code prediction comes out. The
first step is preprocessing: detecting the writing on the envelope, detecting the
ZIP code, segmenting out the separate digits, normalizing them by correcting for
simple transformations such as translation, rotation and scale, and quantizing
them onto a standard grid of equal size. Preprocessing is of central importance
for any real-world machine learning system. However, since preprocessing tends
to be domain-dependent, we will not discuss it any further during this course,
whose objective is to cover machine learning principles of wide or general appli-
cability.
Datasets of preprocessed handwritten digits are publicly available, the most
well known are MNIST (http://yann.lecun.com/exdb/mnist/) and USPS
(http://www.gaussianprocess.org/gpml/data/). In the middle ages of ma-
chine learning, these datasets have played a very important role, providing a


Figure 2.1: Patterns from the training dataset of MNIST handwritten digits
database. Shown are ten patterns per class, drawn at random.

unified testbed for all sorts of innovative ideas and algorithms, publicly available
to anyone.
What does the MNIST dataset look like? For much of this course, a dataset is
an unordered set of instances or cases. Each case decomposes into attributes in
the same way. MNIST is a classification dataset (more specifically, multi-way
classification): each case consists of an input point (or pattern) and a target
(or class label). We can formalize this by introducing variables x for the input
point attribute, t for the target, and (x, t) for the case. A dataset like MNIST
is simply a set of variable instances:
D = {(xi , ti ) | i = 1, . . . , n} .
In other words, the dataset has n instances (or data points), and the i-th instance
is (xi , ti ). Each attribute has a data type and comes from a defined value range.
For MNIST, t ∈ {0, . . . , 9}, the ten digits. The input point x is a 28 × 28
bitmap, each pixel quantized to {0, . . . , 255} (0 for no, 255 for full intensity). It
is common practice to represent bitmaps as vectors, stacking columns on top of
each other (from left to right):
   
$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}, \qquad x_i = \begin{bmatrix} x_{i,1} \\ x_{i,2} \\ \vdots \\ x_{i,d} \end{bmatrix}. $$
Refresh your memory on vectors in Section 2.1.1. Given the MNIST specifica-
tion, the attribute space for x should be {0, . . . , 255}d , where d = 28 · 28 = 784.

However, we tend to use the weaker specification x ∈ Rd . d is known as input


space dimensionality.
The MNIST multi-way classification problem is as follows. The goal is to learn a
classifier f (x), mapping x ∈ Rd to target predictions f (x) ∈ {0, . . . , 9}. How to
do that? We could hire some people who spend the rest of their life fine-tuning
some large, elaborate set of rules and cryptic computer code, driven by their
“intuition”. This is not learning in any sense, whether human or machine. We
could also use a large set of examples and choose a good classifier by statistically
fitting it to this data. In essence, this is what machine learning is about. Don’t
bother with constructing every detail of an architecture, supposed to solve a
problem you do not fully understand anyway. Instead, collect big data and
induce predictive knowledge directly from there. For the rest of this chapter, we
restrict ourselves to binary classification, where the target can take two different
values only.

Binary Classification

A binary classifier f (x) maps input points x (often, x ∈ Rd ) to binary targets


t ∈ T . The value space T for t is of size two. In principle, any binary set can be
used. We will mainly use T = {0, 1} or T = {−1, +1} in this course.
A running example for binary classification during the rest of this section is
discriminating 8s from 9s among the MNIST digits. We will use the label space
T = {−1, 1}, mapping 8 to -1, 9 to 1. Any good ideas? As a computer scien-
tist, your first attempt may be to set up a database, feeding it with your data
{(xi , ti ) | ti ∈ {−1, 1}}. For this purpose, we would stick with the finite domain
{0, . . . , 255}d for x. Given some pattern x∗ , we query the database. In case we
find it, say x∗ = xi for some i ∈ {1, . . . , n}, we output ti . Otherwise, we output
“don’t know”. This method is also known as lookup table:
 
$$ f_{\text{lut}}(x) = \begin{cases} t_i & x = x_i,\ i \in \{1, \ldots, n\} \\ \text{don't know} & x \neq x_i,\ i = 1, \ldots, n \end{cases}. $$

Unfortunately, the lookup table approach will never work. While the input do-
main is finite, it is extremely large (exponential in the dimensionality d), and
any conceivable dataset would only ever cover a vanishing fraction. Essentially,
flut (x∗ ) = “don’t know” all the time. For people who like lookup tables and
similar things, this is known as the “curse of dimensionality”.
Learning starts with the insight that most attributes of the real world we care
about are robust to some changes. If x represents an 8 written by hand, neither
of the following modifications will in general turn it into a 9: modifying a few
components xj by some small amount; translating the bitmap slightly; rotating
the bitmap a bit. It is therefore a reasonable assumption on how the world works
that with respect to a sensible distance function between bitmaps, if x belongs
to a class (say, 8), any x′ close to x in this distance is highly likely to belong to
the same class. As x, x′ are vectors, let us use the standard Euclidean distance:

$$ \|x - x'\| = \sqrt{(x - x')^T (x - x')} = \sqrt{\sum_{j=1}^{d} (x_j - x'_j)^2}. $$

Refresh your memory on Euclidean distance in Section 2.1.1. Here is a vastly


better idea than flut . We still use the whole dataset {(xi , ti )} for classification.
For a pattern x∗ , we output the label ti of the nearest1 pattern xi in our
database:

$$ f_{\text{NN}}(x) = t_i \;\Leftrightarrow\; \|x - x_i\| \le \|x - x_j\|, \quad j = 1, \ldots, n. $$

This is the nearest neighbour (NN) classification rule. It is one of the oldest
machine learning algorithms, and variants of it are widely used in practice.
Moreover, the theoretical analysis of this rule, initiated by Cover and Hart, is a
cornerstone of learning theory. Some details about nearest neighbour methods
can be found in [5, Sect. 2.5.2], its theoretical analysis (with original citations)
is detailed in [11]. Despite the fact that variants of NN work well on our digits
problem (whether binary or multi-way), there are some substantial drawbacks:

• The whole dataset has to be kept around, as each single case (xi , ti ) po-
tentially takes part in a classification decision. For large datasets, this is
a costly liability.

• Research in NN tends to be mainly algorithm-driven. After all, clever


data structures and search techniques are needed in order to find near-
est neighbours rapidly. Most of these techniques have a hard time if the
dimensionality d is larger than 20 or so.

• There is not much to be learned about the data from NN. In fact, some ma-
chine learning work aims to replace the Euclidean by a distance adapted to
the job, but compared to model-based mechanisms for specifying domain
knowledge, these possibilities are limited.
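To make the rule concrete, here is a minimal sketch of the NN rule in Python/NumPy (not part of the original notes; array names, shapes and the tie-breaking convention are assumptions):

```python
import numpy as np

def nn_classify(x_star, X, t):
    """Nearest neighbour rule: label of the training pattern closest to x_star.

    X : (n, d) array of training patterns x_i, t : (n,) array of labels t_i.
    Ties are broken by taking the first minimizer (the notes break them at random).
    """
    dists = np.linalg.norm(X - x_star, axis=1)   # ||x_star - x_i|| for all i
    return t[np.argmin(dists)]
```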

In order to avoid having to store our whole MNIST dataset, we have to represent
the information relevant for classification in a far smaller number of parameters.
For example, we could try to find m ≪ n (m "much smaller than" n) prototype
vectors w_c, c = 1, . . . , m, each associated with a target value t^w_c, then to do NN
classification w.r.t. the prototype set {(w_c, t^w_c) | c = 1, . . . , m} rather than the
full dataset:

$$ f^{(m)}(x) = f_{\text{NN}}\left( x; \{(w_c, t^w_c)\} \right). $$

Here, learning refers to the automatic choice of the parameters {(w_c, t^w_c)} from
the data {(xi , ti )}, or to the modification of the parameters as new data comes
in. Having learned our classifier (i.e., fixed its parameters), we predict the label of
a pattern x∗ by NN on the prototype vectors. How can we learn automatically?
How to choose m (number of prototypes) given n (size of dataset)? If n increases,
as new data comes in, do we need to increase m or can we keep it fixed? These
are typical machine learning questions.
Taking this idea to the limit, let us pick exactly m = 2 prototype vectors w_{−1},
w_{+1}, one for each class, with t^w_c = c, c = −1, +1. The classification rule f^{(2)}(x_*)
for a pattern x_* is simple: if ‖x_* − w_{+1}‖ < ‖x_* − w_{−1}‖ output +1, otherwise
1 If there are ties, we pick one of the nearest neighbour labels at random.

output −1. Let us simplify this even more:

$$ \begin{aligned}
\|x_* - w_{+1}\| < \|x_* - w_{-1}\|
&\;\Leftrightarrow\; \|x_* - w_{+1}\|^2 < \|x_* - w_{-1}\|^2 \\
&\;\Leftrightarrow\; \|x_*\|^2 - 2 w_{+1}^T x_* + \|w_{+1}\|^2 < \|x_*\|^2 - 2 w_{-1}^T x_* + \|w_{-1}\|^2 \\
&\;\Leftrightarrow\; w_{+1}^T x_* - \|w_{+1}\|^2/2 > w_{-1}^T x_* - \|w_{-1}\|^2/2 \\
&\;\Leftrightarrow\; (w_{+1} - w_{-1})^T x_* + \tfrac{1}{2}\left( \|w_{-1}\|^2 - \|w_{+1}\|^2 \right) > 0.
\end{aligned} $$
Problems with expanding ‖x_* − w_{+1}‖²? Have a look at Section 2.1.1. If we set
w = w_{+1} − w_{−1}, b = (‖w_{−1}‖² − ‖w_{+1}‖²)/2, this is simply a linear inequality:

$$ f^{(2)}(x_*) = \begin{cases} +1 & w^T x_* + b > 0 \\ -1 & \text{otherwise} \end{cases} = \operatorname{sgn}\left( w^T x_* + b \right). $$

The sign function sgn(a) takes the value +1 for a > 0, −1 for a < 0. The
definition2 of sgn(0) typically does not matter, let us define sgn(0) = 0. We
came some way: from lookup tables over nearest neighbours all the way to
linear classifiers.
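A small numerical check of this derivation (a sketch, not from the notes; per-class means are just one plausible choice of prototypes, and the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))           # toy stand-in for digit patterns
t = rng.choice([-1, 1], size=200)         # toy labels

# Prototypes: here simply the per-class means (one plausible choice).
w_neg, w_pos = X[t == -1].mean(axis=0), X[t == 1].mean(axis=0)

# Linear form derived above: w = w_{+1} - w_{-1}, b = (||w_{-1}||^2 - ||w_{+1}||^2)/2.
w = w_pos - w_neg
b = 0.5 * (w_neg @ w_neg - w_pos @ w_pos)

# The nearest-prototype rule and sgn(w^T x + b) agree on every pattern.
pred_proto = np.where(np.linalg.norm(X - w_pos, axis=1)
                      < np.linalg.norm(X - w_neg, axis=1), 1, -1)
pred_linear = np.where(X @ w + b > 0, 1, -1)
assert np.array_equal(pred_proto, pred_linear)
```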

2.1.1 Techniques: Vectors, Inner Products, Norms

In order to understand linear classification or anything else built on top of it,


we need to be familiar with vector spaces. Start with some scalar space A for
the coefficients and consider columns of d entries aj from A:
 
$$ a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix}. $$

Note that we use columns, not rows. The set of all such columns a is denoted
by Ad : the d-fold direct product. This is a d-dimensional vector space if A is
a field. In this course, A = R, the real numbers. We will denote vectors a in
bold face, to distinguish them from scalars. In other books, you might encounter
notations like ~a or a, or simply a (no distinction between scalars and vectors).
The dimension of the space, d, is the maximum number of linearly independent
vectors it contains (read up on “linearly independent”, “basis” in [42]). You
can add vectors, multiply with scalars, it all works by doing the operations
on the coefficients separately. In fact, the general definition of vector space is
that of a set which is closed under addition and multiplication with scalars
(being “closed” means that if you perform any such operation on vectors, you
end up with another vector, not with something outside of your vector space).
The transpose operator converts a column into a row vector (and vice versa):
aT = [a1 , . . . , ad ] ∈ R1×d . We can identify Rd with Rd×1 : column and row
vectors are special cases of matrices (Section 2.4.3). Some texts use a0 to denote
transposition instead of aT .
2 If sgn(w T x + b) is a classification rule supposed to output 1 or −1 for every input x, we

should sample the output at random if wT x + b = 0.



We will often define vectors by giving an expression for their coefficients. For
example, a = [f (xi )]i (or a = [f (xi )] if the index i is obvious from the context)
means that the coefficients are defined by ai = f (xi ). A number of special
vectors will be used frequently in these notes. 0 = [0, . . . , 0]T is the vector
of all zeros. 1 = [1, . . . , 1]T is the vector of all ones. Their dimensionality is
typically clear from the context. We also need delta vectors δ k ∈ Rd , defined by
δ k = [I{j=k} ]j (recall that I{j=k} = 1 if j = k, 0 otherwise): the k-th coefficient
of δ k is 1, all others are 0. Again, the dimensionality is typically clear from the
context.

Figure 2.2: A vector a ∈ R2 can be visualized as arrow from the origin to a.


The order in which vectors are added does not matter (parallelogram identity).

Vectors a in R2 or R3 can be visualized as arrows in a Cartesian coordinate


system, pointing from the origin 0 to the position a = [a1 , a2 ]T or [a1 , a2 , a3 ]T .
You can visualize a + b by translating the b arrow so that it starts from the
endpoint of a: it will then point to a + b. Flip a and b, and you get to the same
point (what you have now is a parallelogram, see Figure 2.2). −a just mirrors
the a arrow about the origin. Add −a to a, and you are back where you started:
great fun (do it!). In our notation, the vectors defining the Cartesian coordinate
axes are δ 1 , δ 2 (and δ 3 ), but be aware that in some physics-based texts, they
may be called i, j (and k).

If V is a vector space, then U ⊂ V is a (linear) subspace if U itself is a vector


space: closed under addition and multiplication with scalars α ∈ R. Note that
each vector space always contains 0, since that is zero times any other vector.
An affine subspace of V has the form
$$ v + \mathcal{U} = \{ v + u \mid u \in \mathcal{U} \} \subset \mathcal{V}, $$

where U is a (linear) subspace of V. An affine subspace need not contain 0. In


fact, it is a linear subspace if and only if it contains 0 (Figure 2.3).

We can also combine two vectors. The most basic operation is the inner product

Figure 2.3: Left: One-dimensional linear subspace in R2 , spanned by u. Right:


Two-dimensional affine subspace in R3 , spanned by the basis {u1 , u2 }.

(or scalar product), resulting in a scalar:


 
$$ a^T b = [a_1, \ldots, a_d] \begin{bmatrix} b_1 \\ \vdots \\ b_d \end{bmatrix} = \sum_{j=1}^{d} a_j b_j. $$

Other notations you might encounter are a · b or (a, b) (or ⟨a|b⟩ if you want to
be a really cool quantum physicist), but not in these notes. There are obvious
properties, such as aᵀb = bᵀa (symmetry) or aᵀ(b + c) = aᵀb + aᵀc (linearity).
By the way, what are ab or aᵀbᵀ? Nothing, such operations are not
defined (unless you are in R¹). What is abᵀ? That works. More about these in
Section 2.4.3.
The geometrical meaning of the inner product will be discussed in Section 2.3.
Here only two points. First, the square root of the inner product of a with itself
is the standard (or Euclidean) distance of a from the origin, its length or its
(Euclidean) norm:
$$ \|a\| = \sqrt{a^T a} = \sqrt{\sum_{j=1}^{d} a_j^2}. $$

We will often use the squared norm ‖a‖² to get rid of the square root. The
(Euclidean) distance between a and b is the norm of b − a (or a − b, since
‖−c‖ = ‖c‖), something that is obvious once you draw it. Convince yourself
that if ‖a‖ = 0, then a must be the zero vector 0. If ‖a‖ ≠ 0, we can normalize
the vector: a → a/‖a‖. The outcome is a unit vector (vector of length 1). We
will use unit vectors if we really only want to represent a direction. In fact, any
non-zero vector a can be represented by its length ‖a‖ and its direction a/‖a‖.
Second, sometimes it happens that aT b = 0: the inner product between a and b
is zero, such vectors are called orthogonal (or perpendicular). The angle between
orthogonal vectors is a right one (90 degrees). Note that a set of mutually
orthogonal vectors (“mutually” applied to a set means: “for each pair”) is also
linearly independent, so there can be no more than d such vectors (an example
of such a set is {δ 1 , . . . , δ d }).

Moreover, a vector v is orthogonal to a subspace U ⊂ Rd if and only if v T u = 0


for all u ∈ U. Two subspaces U1 , U2 of Rd are called orthogonal if for any
u1 ∈ U1 , u2 ∈ U2 : uT1 u2 = 0. In other words, each u1 ∈ U1 is orthogonal to
U2 . Given a subspace U ⊂ Rd , its orthogonal complement U ⊥ is the subspace of
all v ∈ Rd orthogonal to U. We will take up these definitions in Section 4.2.2,
where we discuss orthogonal projections.

Figure 2.4: Pythagorean theorem: ka − bk2 = kak2 + kbk2 if and only if a, b


are orthogonal (aT b = 0).

We will frequently manipulate squared distances between vectors, an example


has been given in Section 2.1. This is a computation you should know by heart,
in either direction:

$$ \|a - b\|^2 = (a - b)^T (a - b) = a^T a - a^T b - b^T a + b^T b = \|a\|^2 - 2 a^T b + \|b\|^2. $$

First, we use the linearity, then the symmetry of the inner product. This equa-
tion relates the squared distance between a and b to the squared distance of a
and b from the origin respectively. The term −2aT b is sometimes called “cross-
talk”. It vanishes if a and b are orthogonal, so that ka −bk2 = kak2 +kbk2 . This
is simply the Pythagorean theorem from high school (Figure 2.4). I bet it took
more pains back then to understand it! As we move on, you will note that this
innocuous observation is ultimately the basis for orthogonal projection, condi-
tional expectation, least squares estimation, and bias-variance decompositions.
Test your understanding by working out

$$ \|a - b\|^2 + \|a + b\|^2 \quad \text{and} \quad \|a\|^2 + 4 a^T c. $$



The first is called the parallelogram identity (draw it!). Anything still unclear?
Then you should pick your favourite source from Section 1.3.
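A quick numerical sanity check of this expansion and its Pythagorean special case (a sketch with arbitrary vectors, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=5), rng.normal(size=5)

# Squared-distance expansion: ||a - b||^2 = ||a||^2 - 2 a^T b + ||b||^2.
lhs = np.linalg.norm(a - b) ** 2
rhs = np.linalg.norm(a) ** 2 - 2 * a @ b + np.linalg.norm(b) ** 2
assert np.isclose(lhs, rhs)

# Pythagorean special case: remove the component of b along a, the cross-talk vanishes.
b_orth = b - (a @ b) / (a @ a) * a
assert np.isclose(a @ b_orth, 0.0)
assert np.isclose(np.linalg.norm(a - b_orth) ** 2,
                  np.linalg.norm(a) ** 2 + np.linalg.norm(b_orth) ** 2)
```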

2.2 Hyperplanes and Feature Spaces


In Section 2.1 we have seen one motivation for linear classifiers, discriminat-
ing handwritten 8s from 9s. There are many other motivations for this class
of discriminants, and we will explore a number of them as we proceed. Linear
classifiers (or, more generally, linear functions) are the single most important
building block of statistics and machine learning, and it is essential to under-
stand their properties. While our goal is of course to find algorithms for learning
such classifiers from data (such as the MNIST database), let us first build some
geometrical intuition about them.
One way to define a binary classifier f : X → {−1, +1}, mapping input points
x ∈ X (for example, X = R^d) to target predictions f(x) ∈ {−1, +1}, is via a
discriminant function y : X → R:

$$ f(x) = \operatorname{sgn}(y(x)) = \begin{cases} +1 & y(x) > 0 \\ -1 & y(x) < 0 \end{cases}. $$
In other words, we threshold y(x) at zero. The value of f (x) for y(x) = 0 is
often left unspecified, a good practice is to draw it uniformly at random. The
set

$$ \{ x \mid y(x) = 0 \} $$

is called decision boundary, literally the region where the decisions made by
f (x) are switching between the classes. Classifier and discriminant function are
equivalent concepts. However, it is usually much simpler to define and work with
functions mapping to R than to {−1, +1}.
For the moment, let us accept the following definition, which we will slightly
generalize towards the end of this section. A binary classifier f (x), x ∈ Rd , is
called linear if it is defined in terms of a linear discriminant function y(x):
f (x) = sgn(y(x)), y(x) = wT x + b, w ∈ Rd \ {0}, b ∈ R.
Here, w is called weight vector (or also normal vector), and b is called bias
parameter.

Geometry of Linear Discriminants

How can we picture a linear discriminant function? Let us start with a linear
equation x2 = ax1 + b of the high school type. While this form suggests that
x1 is the input (or free), x2 the output (or dependent) variable, we can simply
treat (x1 , x2 ) as points in the Euclidean R2 coordinate system. Obviously, this
equation describes a line. a is the slope: for a = 0, the line is horizontal, for
a > 0 it is increasing. b is the offset: we cross the x2 axis at x2 = b, and the line
runs through the origin if and only if b = 0. Let us rewrite this equation:
 
$$ x_2 = a x_1 + b \;\Leftrightarrow\; [a, -1] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + b = 0. $$

Figure 2.5: Line x2 = ax1 + b in R2 (one-dimensional affine subspace). The


parameterization wT x + b = 0, w the normal vector, generalizes to hyperplanes
((d − 1)-dimensional affine subspaces) in Rd .

Our equation is equivalent to y(x) = wT x + b = 0, where w = [a, −1]T . We can


also describe our line as the set of points of the form [0, b]T + λd, λ ∈ R, where
d = [1, a]T is a direction along the line. Importantly, wT d = 0: the weight
vector of y(x) is orthogonal to any direction along the line (Figure 2.5).
How about R3 or general Rd ? We simply generalize from R2 . A line is a one-
dimensional affine subspace (“affine what?” Section 2.1.1). In Rd , wT x + b = 0
for w 6= 0 defines a (d − 1)-dimensional affine subspace of Rd . In R3 , this is a
plane, and in the “hyperspace” Rd , it is called a hyperplane. It corresponds to a
set of points of the form v 0 + U, where v 0 ∈ Rd and U is a (linear) subspace of
Rd , of dimension d − 1. w is a normal vector of this hyperplane: it is orthogonal
to any vector u ∈ U. w is uniquely determined, in that any normal vector of
the hyperplane is a nonzero scalar multiple of w. Just as in R2 , the hyperplane
contains the origin (i.e., is a linear subspace) if and only if b = 0. Finally, we
can easily construct an offset point v 0 as follows: if v 0 = −(b/kwk2 )w, then
wT v 0 + b = −(b/kwk2 )wT w + b = 0,
so that v 0 lies on the hyperplane. This choice of v 0 is special: it is the point
on the hyperplane closest to the origin, which you can see by noting that it is
proportional to the normal vector w (it is the orthogonal projection of 0 onto
the hyperplane, see Section 4.2.2).
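As a small numerical check (a sketch with arbitrary numbers, assuming NumPy), the point v₀ = −(b/‖w‖²)w indeed lies on the hyperplane and is no farther from the origin than any other point on it:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=3)
b = 1.7

v0 = -(b / (w @ w)) * w
assert np.isclose(w @ v0 + b, 0.0)        # v0 lies on the hyperplane

# Any other point on the hyperplane is at least as far from the origin.
u = np.cross(w, rng.normal(size=3))       # a direction orthogonal to w
v = v0 + 0.5 * u                          # another point on the hyperplane
assert np.isclose(w @ v + b, 0.0)
assert np.linalg.norm(v0) <= np.linalg.norm(v) + 1e-12
```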
The geometry of a linear discriminant function y(x) = wT x + b is clear now.
The decision boundary, defined by y(x) = 0, is a hyperplane with normal vector
w. This hyperplane separates the full space Rd into two halfspaces, defined by
$$ \mathcal{H}_{+1} = \{ x \mid y(x) > 0 \} \quad \text{and} \quad \mathcal{H}_{-1} = \{ x \mid y(x) < 0 \}, $$

respectively. f (x) = sgn(y(x)) outputs 1 for x ∈ H+1 , −1 for x ∈ H−1 , so the


decision regions for each target value are these halfspaces. The normal vector
w, started from anywhere on the hyperplane, points into H+1 .

Figure 2.6: Separating hyperplane in R2 . The decision boundary (blue) is defined


by the normal vector w and an offset b ∈ R. v 0 is a point on the hyperplane,
obtained by orthogonal projection of the origin. The plane separates R2 into two
halfspaces H+1 (wT x + b > 0) and H−1 (wT x + b < 0), the decision regions of
the corresponding linear discriminant.

There is some redundancy in our definition. If we replace y(x) by αy(x), α > 0,


the linear classifier f (x) does not change at all. If y(x) = wT x + b is a linear
discriminant, we can always normalize w to become a unit vector: w → w0 =
w/kwk, b → b0 = b/kwk, and y0 (x) = wT0 x + b0 induces the same classifier.
It is therefore without loss of generality to restrict the weight vector w to be of
unit norm, and this is often done in papers. The set
$$ S_d = \{ w \mid \|w\| = 1 \} \subset \mathbb{R}^d $$

is called unit hypersphere, the set of all unit vectors in Rd . Another “hypercon-
cept”, in R2 this is the unit circle (draw it!), and in R3 the unit sphere.
How do we learn a classifier? Given a dataset of examples from the problem of
interest, D = {(xi , ti ) | xi ∈ X , ti ∈ T , i = 1, . . . , n}, we should pick a classifier
which does well on D. For example, if f (x) is a linear classifier with weight
vector w and bias parameter b, we should aim to adjust these parameters so
to achieve few errors on D. This process of fitting f to D is called training. A
dataset used to train a classifier is called training dataset.
Suppose we are to train a linear classifier f (x) = sgn(wT x + b) on a training
dataset D = {(xi , ti ) | i = 1, . . . , n}. At least within this chapter, our goal will be
to classify all training cases correctly, to find a hyperplane for which xi ∈ Hti ,

Figure 2.7: Left column: Datasets which are not linearly separable in R2 and R3
respectively. Right column: Linearly separable datasets.

or f (xi ) = ti , for all i = 1, . . . , n. A hyperplane with this property is called


separating hyperplane for D. Not all datasets have a separating hyperplane. A
dataset D for which there is at least one separating hyperplane is called linearly
separable. If no mistakes on the training set are permitted, the best we can
hope for is to find a training algorithm which outputs a separating hyperplane
for every linearly separable training set. In Figure 2.7, we depict non-separable
datasets in the left, linearly separable datasets in the right column.

Feature Maps

Until now, we concentrated on classifiers whose discriminant functions are linear


in the input point x: y(x) = wT x +b. However, this restriction seems somewhat
artificial. We should allow ourselves some freedom to preprocess x before doing
the linear combination. A feature function (or just feature) is a real-valued func-
tion X → R. A feature map φ(x) is a collection of p features φj (x), mapping
X to Rp as φ(x) = [φj (x)] ∈ Rp . In this context, X is called input space, while
Rp is called feature space (a vector space containing the image φ(X )).
If nothing else is said, a feature map φ(x) is fixed up front and does not have

to be adjusted to data at the time of learning a classifier.


Given a fixed feature map φ(x), we generalize our definition of linear classifiers.
A linear discriminant function (or just linear discriminant) is a real-valued func-
tion of x which is linear in the feature vector φ(x): y(x) = wT φ(x) + b. Here,
w ∈ Rp is called weight vector, b ∈ R bias parameter. A linear classifier is a
classifier of the form f (x) = sgn(y(x)), where y(x) is a linear discriminant.
Here is an example. Maybe we can boost the performance of a digits classifier
by allowing it to take dependencies between pixels in x into account, such as
correlations of the form xj xk for j ≠ k. If x ∈ Rd (d = 28·28 for MNIST digits),
consider the map of all linear and quadratic features:

$$ \phi(x) = \begin{bmatrix} x_1 \\ \vdots \\ x_d \\ x_1 x_1 \\ \vdots \\ x_1 x_d \\ x_2 x_2 \\ \vdots \\ x_d x_d \end{bmatrix} = \begin{bmatrix} x \\ [x_j x_k]_{j \le k} \end{bmatrix} \in \mathbb{R}^{d(d+3)/2}. $$

You should work out as an exercise how the dimensionality d(d + 3)/2 comes
about. What does a linear discriminant for this feature map look like? The weight
vector w lives in R^{d(d+3)/2}. If we index its components by j for the linear, and by
jk (j ≤ k) for the quadratic features, then

$$ y(x) = w^T \phi(x) + b = \sum_{j} w_j x_j + \sum_{j \le k} w_{jk} x_j x_k + b. $$

This means that y(x) is a quadratic function of x. The decision boundary


{y(x) = 0}, a hyperplane in the feature space Rd(d+3)/2 , can take more complex
forms in the input space Rd (namely, solutions to quadratic equations). In the
same way, we could take into account higher-order dependencies between more
than two components of x, leading to discriminants y(x) which are multivari-
ate polynomials. Moreover, we do not have to include all interactions, but can
restrict correlation terms to pixels spatially close in the bitmap, thus reduc-
ing the feature space dimensionality. To conclude, choosing appropriate feature
maps can greatly enhance the flexibility of linear classifiers, resulting in decision
boundaries which can be arbitrarily far from linear hyperplanes.
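A minimal sketch (assuming NumPy; the function name is made up) of this linear-plus-quadratic feature map:

```python
import numpy as np

def quad_features(x):
    """Map x in R^d to [x_j] followed by [x_j * x_k for j <= k]: d(d+3)/2 features."""
    d = x.shape[0]
    quad = [x[j] * x[k] for j in range(d) for k in range(j, d)]
    return np.concatenate([x, np.array(quad)])

x = np.arange(1.0, 5.0)                    # d = 4
phi = quad_features(x)
assert phi.shape[0] == 4 * (4 + 3) // 2    # d(d+3)/2 = 14
```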
Given that we step from x to φ(x), we can just as well incorporate the bias
parameter b into the weight vector, by appending a constant 1 to the feature
map. If w̃ = [wᵀ, b]ᵀ and φ̃(x) = [φ(x)ᵀ, 1]ᵀ, then wᵀφ(x) + b = w̃ᵀφ̃(x). In
this new feature space, all our hyperplanes contain 0, therefore are subspaces.
Whether or not the bias parameter is kept separate, is often up to the taste
of the community. We will frequently use the bias-free parameterization for
simplicity of notation (and of geometry), but will keep b explicit in cases where
incorporating it into φ(x) would change the underlying method.

2.3 Perceptron Classifiers


The perceptron has been an early cornerstone of brain-inspired machine learn-
ing. Its history is detailed in several books. I recommend you have a look at
Figure 4.8 in [5, Sect. 4.1.7]. For the purpose of this course, the perceptron is
simply a binary linear classifier f (x) = sgn(y(x)), y(x) = wT φ(x), which is
trained on data in a particular way. In this section, we will absorb a bias pa-
rameter into w, so that decision hyperplanes in feature space always contain
the origin 0.
Suppose we are given a training dataset D = {(xi , ti ) | i = 1, . . . , n}, where
ti ∈ {−1, +1}. We will use the simplifying notation φi = φ(xi ) ∈ Rp . Let us
assume for now that D is linearly separable in our feature space: there exists
a separating linear classifier. The perceptron algorithm is a simple method for
finding such a classifier. It has the following properties:

• Iterative: The algorithm cycles over the data points in some random or-
dering. Visiting (φi , ti ), it extracts some information, based on which w
may be updated.

• Local updates: Each update depends only on the pattern currently visited,
not on what has been encountered previously.

• Updates only for misclassified patterns: When we visit a pattern (φi , ti )


which is correctly classified by the current discriminant,

sgn(wT φi ) = ti , or simpler : ti wT φi > 0,

then w is not changed.

Recall our geometric picture. w defines a hyperplane, separating Rp into two


halfspaces, H+1 and H−1 , with w pointing into the positive halfspace H+1 . It
makes a mistake on the i-th pattern if and only if φi does not lie in its halfspace
Hti , or simpler if the signed pattern ti φi lies in the negative H−1 . Intuitively,
the angle between ti φi and w is more than a right one, larger than 90◦ .
Here is what the inner product is really doing: if a, b ∈ R^d, the angle θ between
them satisfies

$$ \cos\theta = \frac{a^T b}{\|a\|\,\|b\|} \in [-1, 1]. $$

If we consider normalized vectors only, the inner product is simply a way to
measure the angle. The Cauchy-Schwarz inequality is related to this definition:

$$ \left| a^T b \right| \le \|a\|\,\|b\|. $$

In our linear classification problem, we can aim to correct mistakes by decreasing


the angle between w and signed patterns ti φi . In other words, we should increase
ti wT φi on patterns for which it is negative. A simple way to do so is to add the
signed pattern ti φi to w:

$$ t_i \phi_i^T (w + t_i \phi_i) = t_i \phi_i^T w + \|\phi_i\|^2 > t_i \phi_i^T w. $$




Figure 2.8: Perceptron algorithm on R2 vectors. Blue crosses represent −1, red
circles +1 patterns. For each iteration, the update pattern is marked by black
circle. Shown is signed pattern as well as old and new weight vector w.

Algorithm 1 Perceptron algorithm for training a linear classifier.


Discriminant is y(x) = wT φ(x). Use normalized patterns:
φ̃i = φ(xi )/kφ(xi )k. Initialize w ← 0.
repeat
for i ∈ {1, . . . , n} (random ordering) do
if ti wT φ̃i ≤ 0 (pattern misclassified) then
w ← w + ti φ̃i
end if
end for
until no update on any i ∈ {1, . . . , n}

In fact, this is all we need for the perceptron algorithm, which is given in Algo-
rithm 1.
In practice, we run random sweeps over all data points, terminating if there is
no more mistake during a complete sweep. One subtle point is that we normalize
all feature vectors φ(xi ) before starting the algorithm. Convince yourself that
this cannot do any harm: w is a separating hyperplane for D if and only if it is
one for the set obtained by normalizing each φ(xi ). We will see in Section 2.3.2
that normalization does not only simplify the analysis, but also leads to faster
convergence of the algorithm. A run of the algorithm is shown in Figure 2.8.
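A direct sketch of Algorithm 1 in Python/NumPy (data layout, the stopping safeguard and the function name are assumptions; the notes only specify the pseudocode above):

```python
import numpy as np

def perceptron(Phi, t, max_sweeps=1000, seed=0):
    """Perceptron algorithm on normalized feature vectors.

    Phi : (n, p) array of feature vectors phi(x_i)
    t   : (n,)  array of labels in {-1, +1}
    Returns a weight vector w that separates the data, if one is found.
    """
    rng = np.random.default_rng(seed)
    Phi = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)  # normalize patterns
    n, p = Phi.shape
    w = np.zeros(p)
    for _ in range(max_sweeps):
        updated = False
        for i in rng.permutation(n):                 # random ordering
            if t[i] * (w @ Phi[i]) <= 0:             # pattern misclassified
                w = w + t[i] * Phi[i]
                updated = True
        if not updated:                              # full sweep without mistakes
            return w
    raise RuntimeError("no separating hyperplane found (data may not be separable)")
```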

2.3.1 The Perceptron Convergence Theorem

Convergence? There is no a priori reason to believe that the perceptron algo-


rithm will work. While an update decreases the angle to a misclassified (signed)

pattern, these angles can at the same time increase for other patterns not cur-
rently visited (see Figure 2.8).

Theorem 2.1 (Perceptron Convergence) Let D = {(xi , ti ) | i = 1, . . . , n}


be a binary classification dataset, ti ∈ {−1, 1}, which is linearly separable in a
feature space given by φ(x). Then, the perceptron algorithm terminates after
finitely many updates and outputs a separating discriminant. More precisely, let
w_* be the unit norm weight vector for some separating hyperplane: t_i w_*^T φ̃_i > 0
for all i = 1, . . . , n, where φ̃_i = φ(x_i)/‖φ(x_i)‖. The perceptron algorithm
terminates after no more than 1/γ² updates, where

$$ \gamma = \min_{i=1,\ldots,n} t_i w_*^T \tilde{\phi}_i. \tag{2.1} $$

Beware that w∗ determines just some separating hyperplane, not necessarily the
one which is output by the algorithm. w∗ has to exist because D is separable,
it is needed only in order to define γ, and thus to bound the number of updates.
γ is a magic number, we come back to it shortly.
Recall that w∗ and all φ̃i are unit vectors. Denote by wk the weight vector after
the k-th update of the algorithm, and w0 = 0. wk is not a unit vector, its norm
will grow during the course of the algorithm. The proof is geometrically intuitive.
We will monitor two numbers, the inner product wT∗ wk and the length kwk k.
Why these two? The larger wT∗ wk the better, because we want to decrease the
angle between w∗ and wk . However, the inner product could simply grow by wk
becoming longer, so we need to take care of its length kwk k (while kw∗ k = 1).
Now, both numbers may increase with k, but the first does so much faster than
the second. By Cauchy-Schwarz (or the definition of angle), wT∗ wk ≤ kwk k, so
this process cannot go on forever. Let us zoom into the update wk → wk+1 ,
which happens (say) on pattern (φ̃i , ti ). First,

$$ w_*^T w_{k+1} = w_*^T (w_k + t_i \tilde{\phi}_i) = w_*^T w_k + t_i w_*^T \tilde{\phi}_i \ge w_*^T w_k + \gamma $$

by definition of γ. At each update, wT∗ wk increases by at least γ, so after M


updates: wT∗ wk ≥ M γ. Second, let us expand a squared norm (Section 2.1.1):

$$ \|w_{k+1}\|^2 = \|w_k + t_i \tilde{\phi}_i\|^2 = \|w_k\|^2 + \underbrace{2 t_i w_k^T \tilde{\phi}_i}_{\le 0 \text{ (mistake!)}} + \underbrace{\|t_i \tilde{\phi}_i\|^2}_{=1} \le \|w_k\|^2 + 1. $$

Crucially, the "cross-talk" t_i w_k^T φ̃_i is nonpositive, because w_k errs on the i-th
pattern (otherwise, there would be no update). Therefore, at each update,
‖w_k‖² increases by at most 1, so after M updates: ‖w_k‖ ≤ √M. Now,

$$ w_*^T w_k \le \|w_k\| \;\Rightarrow\; M\gamma \le \sqrt{M} \;\Leftrightarrow\; M \le 1/\gamma^2. $$

What does this mean? γ is a fixed number, and we must have M ≤ 1/γ 2 , where
M is the number of updates to the weight vector. Since the perceptron algorithm
updates on misclassified patterns only, this situation arises at most 1/γ 2 times.
After that, all patterns are classified correctly, and the algorithm stops. This
concludes the proof.
The perceptron algorithm is probably the simplest procedure for linear classifi-
cation one can think of. Nevertheless, it provably works on every single linearly

separable dataset. However, there are always routes for improvement. What
happens if D is not linearly separable? In such a case, the perceptron algorithm
runs forever. Moreover, if γ is very small but positive, it may run for many it-
erations. It is also difficult to generalize the perceptron algorithm to multi-way
classification problems. Nevertheless, its simplicity and computational efficiency
render the perceptron algorithm an attractive choice in practice.

2.3.2 Normalization of Feature Vectors

Why do we normalize the feature vectors φi = φ(xi ) in the perceptron algo-


rithm, using φ̃i = φi /kφi k instead of φi ? We already convinced ourselves that
doing so does not change the set of all separating hyperplanes, so normalization
should not make things worse. In fact, a closer look at the proof of Theorem 2.1
suggests that normalization should be beneficial, making the perceptron algo-
rithm converge faster in general. To this end, let us try to replicate the proof
with general unnormalized φi . w∗ will still be a unit vector. While in the nor-
malized case, we needed one number:

$$ \gamma = \min_{i=1,\ldots,n} \frac{t_i w_*^T \phi_i}{\|\phi_i\|}, $$

now we need two of them:

$$ \tilde{\gamma} = \min_{i=1,\ldots,n} t_i w_*^T \phi_i, \qquad P = \max_{i=1,\ldots,n} \|\phi_i\|. $$

The first step remains unchanged: wT∗ wk+1 ≥ wT∗ wk + γ̃. For the second step:

$$ \|w_{k+1}\|^2 = \|w_k\|^2 + \underbrace{2 t_i w_k^T \phi_i}_{\le 0 \text{ (mistake!)}} + \|t_i \phi_i\|^2 \le \|w_k\|^2 + P^2. $$

Proceeding as above:

$$ w_*^T w_k \le \|w_k\| \;\Rightarrow\; M \tilde{\gamma} \le \sqrt{M}\, P \;\Leftrightarrow\; M \le (P/\tilde{\gamma})^2. $$

We need to show that

$$ (P/\tilde{\gamma})^2 \ge (1/\gamma)^2 \;\Leftrightarrow\; \tilde{\gamma}/P \le \gamma. $$

Now,
$$ \tilde{\gamma}/P = \min_{i=1,\ldots,n} t_i w_*^T \left( \phi_i / P \right) \le \min_{i=1,\ldots,n} t_i w_*^T \left( \phi_i / \|\phi_i\| \right) = \gamma, $$

using that P ≥ kφi k and ti wT∗ φi > 0 for all i. This means that the upper bound
on the number of updates in the perceptron algorithm never increases (and typ-
ically decreases) if we normalize the feature vectors. The message is: normalize
your feature vectors before running the perceptron algorithm. You should try it
out for yourself, seeing is believing. Run the algorithm on normalized vectors,
then rescale some of them and compare the number of updates the algorithm
needs until convergence.
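A sketch of this experiment on synthetic separable data (the counting helper and the data-generating choices are assumptions, not from the notes; the exact update counts will vary with the data and the random ordering):

```python
import numpy as np

def count_updates(Phi, t, max_sweeps=1000, seed=0):
    """Run the perceptron on the given (already scaled) patterns, count updates."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    updates = 0
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(t)):
            if t[i] * (w @ Phi[i]) <= 0:
                w += t[i] * Phi[i]
                updates += 1
                changed = True
        if not changed:
            return updates
    return updates

rng = np.random.default_rng(3)
w_true = np.array([1.0, -0.5])
Phi = rng.normal(size=(100, 2))
Phi = Phi[np.abs(Phi @ w_true) > 0.3]       # keep a clear margin around the boundary
t = np.where(Phi @ w_true > 0, 1, -1)

norm_Phi = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
scale = np.where(t == 1, 10.0, 0.1)         # rescale the two classes very differently

print(count_updates(norm_Phi, t))                # updates with normalized patterns
print(count_updates(Phi * scale[:, None], t))    # updates without normalization
```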

Figure 2.9: Illustration of margin as maximum distance to closest patterns (top)


or as angle swept out by all possible separating hyperplanes (bottom). Details
given in the text.

2.3.3 The Margin of a Dataset (*)

This section can be skipped at a first reading. We will require the margin concept
again in Chapter 9, when we introduce large margin classification and support
vector machines.

Recall the magic number γ from Theorem 2.1. Intuitively, γ quantifies the dif-
ficulty of finding a separating hyperplane (the smaller, the more difficult). For
general vectors, we have

$$ \gamma_D(w) = \min_{i=1,\ldots,n} \frac{t_i w^T \phi(x_i)}{\|w\|\,\|\phi(x_i)\|}. $$

Note that w is a separating hyperplane for D if and only if γD (w) > 0. Aiming

for the best bound in Theorem 2.1, we choose3

$$ \gamma_D = \max_{w \in \mathbb{R}^p \setminus \{0\}} \gamma_D(w) = \max_{w \in \mathbb{R}^p \setminus \{0\}} \; \min_{i=1,\ldots,n} \frac{t_i w^T \phi(x_i)}{\|w\|\,\|\phi(x_i)\|}. $$

γD is called the margin of D for normalized patterns. A dataset D is linearly


separable if and only if γD > 0. In that case, the margin determines the best
upper bound on the number of perceptron algorithm updates we can obtain by the
proof of Theorem 2.1.
What is the geometrical meaning of the margin? Pick a unit vector w so that
γD (w) > 0. Then, arccos(γD (w)) is the largest angle between w and any signed
normalized pattern ti φ̃(xi ). Since w is orthogonal to the separating hyperplane,
π/2 − arccos(γD (w)) is the smallest angle between any pattern φ̃(xi ) and the
hyperplane (Figure 2.9, bottom). For unit vectors, angle and distance are closely
related. The distance between a ∈ Sp (the unit hypersphere, Section 2.2) and
the hyperplane is δ = ka − âk, where â is the orthogonal projection of a
onto the hyperplane (more about orthogonal projection in Section 4.2.2). What
do we know about â? First, wT â = 0, as it lies on the hyperplane. Second,
a = â + δw, since a − â is orthogonal to the hyperplane, therefore pointing
along w. Therefore, wT a = wT â + δkwk2 = δ. If we apply this derivation to
a = ti φ̃(xi ), we see that γD (w) is the smallest distance between any pattern
φ̃(xi ) and the hyperplane. γD (w) quantifies the space between the separating
hyperplane and the closest pattern (Figure 2.9, top).
The margin γD = maxw∈Rp \{0} γD (w) quantifies the separation between the
sets of patterns for each class −1, +1 (assuming that γD > 0). It is the largest
room to move for any separating hyperplane. Imagine you have to separate
the training set not by a thin hyperplane, but by a slab with as large a width
as possible. The maximum width you can attain is 2γD (Figure 2.9, top). Or
imagine all possible separating hyperplanes for D. Then, π − 2 arccos(γD ) is the
largest signed angle4 between any two of them (Figure 2.9, bottom). The larger
γD , the easier it is to linearly separate D.
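A sketch of computing γ_D(w) for a given weight vector (names and array layout are assumptions; maximizing this quantity over w yields the margin γ_D, which we take up again in Chapter 9):

```python
import numpy as np

def margin_of_w(Phi, t, w):
    """gamma_D(w): smallest signed, normalized projection of the patterns onto w.

    Phi : (n, p) array of feature vectors phi(x_i), t : (n,) labels in {-1, +1}.
    Positive return value means w defines a separating hyperplane for the data.
    """
    proj = t * (Phi @ w) / (np.linalg.norm(w) * np.linalg.norm(Phi, axis=1))
    return proj.min()
```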

2.4 Error Function Minimization. Gradient Descent
We have worked out the simple perceptron algorithm (Algorithm 1) for finding
a separating hyperplane if there exists one. Moreover, we obtained a precise
geometrical understanding of this algorithm and analyzed its convergence be-
haviour. There are open issues. Most troubling perhaps, if our data D is not
linearly separable, then the algorithm never terminates. How can we fix that?
What about more general situations, such as multi-way classification or nonlin-
ear discrimination? We just have an algorithm, but do not really know what it
is doing. Which mathematical problem does it solve?
3 Why max over w ∈ Rp \ {0} and not sup (supremum)? The supremum will always be

attained for some w. To see this, note that we restrict it to the unit hypersphere w ∈ Sd ,
which is compact, and the argument is continuous in w.
4 In the context of this statement, the signed angle between two hyperplanes is the angle

between their normal vectors, which can lie in [0, π].



As computer scientists, we love algorithms. Yet, the modern approach to science and engineering is to think about problems5 instead. Once these are well characterized and understood, we may search for algorithms to solve them well. In this context, a “problem” is an abstract formulation of what we really want to do (for example, classifying those MNIST digits), which is amenable to computationally tractable mathematical treatment. In the context of much of machine learning, this translates to a mathematical optimization problem.

Error Functions

Recall that our problem is to find a linear classifier f (x) = sgn(y(x)), y(x) =
wT φ(x), which separates handwritten 8s from 9s. To this end, we use the
MNIST subset D = {(xi , ti ) | i = 1, . . . , n} as training dataset: if f (x) commits
few errors on D, it should work well for other patterns. Let us formalize this idea
in an optimization problem, by setting up an error function E(w) of the weight
vector, so that small E(w) translates to good performance on the dataset. We
could minimize the number of errors:
E0/1(w) = ∑_{i=1}^n I{f(xi) ≠ ti} = ∑_{i=1}^n I{ti wT φ(xi) ≤ 0}.

Recall that I{A} = 1 if A is true, I{A} = 0 otherwise. While this error func-
tion formalizes our objective well, its minimization is computationally6 difficult.
E0/1 (w) is piecewise constant in w, so that local information such as the gradi-
ent cannot be used. For a better choice, we turn to the oldest and most profound
concept of scientific computing, the one with which Gauss and Legendre started
it all: least squares estimation. Consider the squared error function
E(w) = (1/2) ∑_{i=1}^n (y(xi) − ti)² = (1/2) ∑_{i=1}^n (wT φ(xi) − ti)² = (1/2) ∑_{i=1}^n (∑_{j=1}^p wj φj(xi) − ti)².

The significance of the prefactor 1/2 will become clear below. This error function
is minimized by achieving ti ≈ wT φ(xi ) across all training patterns. Whenever
(xi , ti ) is misclassified, i.e. ti y(xi ) < 0, then (y(xi ) − ti )2 > 1: errors are penalized even more strongly than in E0/1 (w). By pushing y(xi ) closer to ti , the mistake
is corrected. All in all, E(w) seems a reasonable criterion to minimize in order
to learn a classifier on D.
The major advantage of E(w) is that its minimization is computationally
tractable and well understood in terms of properties and algorithms. Once
more, geometry helps our understanding. Build the vector of targets t = [ti ] ∈
5 As we proceed with this course, we will see that it is even more useful to think about models of the domain of interest. These give rise, in an almost automatic fashion, to the problems we should really solve, and then we can be clever about algorithms. But for now, let us stick with the problems.
6 The problem min_w E0/1(w) is NP-hard to solve in general.

{−1, +1}^n and the so-called design matrix

        [ φ(x_1)^T ]   [ φ_1(x_1) ... φ_p(x_1) ]
    Φ = [    ...   ] = [    ...    ...    ...   ]  ∈ R^{n×p}.
        [ φ(x_n)^T ]   [ φ_1(x_n) ... φ_p(x_n) ]

Feeling rusty about matrices? Visit Section 2.4.3. Now, wT φ(xi ) = φ(xi )T w is
simply the i-th element of Φw ∈ Rn . In other words, if we define y = [y(xi )] ∈
Rn , meaning that yi = y(xi ) = wT φ(xi ), then y = Φw, therefore
E(w) = (1/2) ∑_{i=1}^n (y(xi) − ti)² = (1/2) ||y − t||² = (1/2) ||Φw − t||².

The squared error is simply half of the squared Euclidean distance between the
vector of targets t and the vector of predictions y, which in itself is a linear
transform of w, given by the design matrix Φ. We will frequently use this rep-
resentation of E(w) in the sequel. Converting expressions from familiar sums
over i and j into vectors, matrices and squared norms may seem unfamiliar, but
once you get used to it, you will appreciate the immense value of “vectoriza-
tion”. First, no more opportunities to get the indices wrong. Debugging is built
in: if you make a mistake, you usually end up with matrix multiplications of
incompatible dimensions. More important, we can use our geometrical intuition
about lengths, orthogonality and projections. Finally, writing E(w) as squared
norm immediately leads to powerful methods to solve the minimization prob-
lem minw E(w) with tools from (numerical) linear algebra, as we will see in
Chapter 4.
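To make the vectorized form concrete, here is a minimal NumPy sketch (Python, not the Matlab/Netlab toolchain referred to elsewhere in these notes); the arrays Phi, t and w below are hypothetical toy data, and the snippet simply checks that the sum-over-cases form and the squared-norm form of E(w) agree.

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((4, 3))        # design matrix: n = 4 cases, p = 3 features
t = np.array([1.0, -1.0, 1.0, -1.0])     # targets t_i in {-1, +1}
w = rng.standard_normal(3)               # weight vector

# Squared error, once as an explicit sum over cases, once vectorized.
E_sum = 0.5 * sum((Phi[i] @ w - t[i]) ** 2 for i in range(Phi.shape[0]))
E_vec = 0.5 * np.sum((Phi @ w - t) ** 2)  # = (1/2) ||Phi w - t||^2

print(E_sum, E_vec)                       # both numbers agree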

2.4.1 Gradient Descent


How can we solve the optimization problem minw E(w), more specifically find
a minimizer w∗ so that E(w∗ ) = minw E(w)? In this section, we will specify a
simple method which can be applied to many other error functions as well. A few
points about the least squares problem. First, a minimizer exists. This is because
E(w) is lower bounded (by zero), continuous, and the domain w ∈ Rp is closed7 .
Second, the error function E(w) is continuously differentiable everywhere. The
mental picture you should develop is that of an error function landscape, where
w is your location and E(w) is height above sea level. To go down, a good idea
is to take a small step in the direction of steepest descent. Standing at w, for
a tiny ε > 0 you step to w + εd, where kdk = 1 (unit vector). The steepest
descent direction is that d for which E(w) − E(w + εd) is largest.
Let us start with a one-dimensional example: E(w) for w ∈ R. There are not many directions in R, in fact only d = −1, +1. By Taylor’s theorem, E(w ± ε) = E(w) ± E′(w)ε + O(ε²), where E′(w) = dE(w)/dw is the derivative. This is simply a way of saying that lim_{ε→0} (E(w + ε) − E(w))/ε = E′(w), the definition of the derivative. Suppose that E′(w) ≠ 0. For very small ε > 0, E(w) − E(w + dε) = ε(−E′(w)d + O(ε)), which is positive for the direction
7 We also need the domain to be bounded. It is easy to see that E(w) → ∞ for kwk → ∞,

so we can restrict ourselves to some large enough hyperball.



Figure 2.10: Optimization by steepest descent, following the negative gradient direction.

d = sgn(−E′(w)) only: the sign of the negative derivative. For w ∈ R^p, we use the multivariate Taylor’s theorem:

E(w + εd) = E(w) + ε(∇w E(w))^T d + O(ε²),   ∇w E(w) = [∂E(w)/∂wj] ∈ R^p.

Here, ∇w E(w) is the gradient of E(w) at w. As in the univariate case, we


find the direction d of steepest descent as the maximizer of E(w) − E(w +
εd) = ε(−∇w E(w))T d + O(ε2 ) as ε → 0. By the Cauchy-Schwarz inequality,
(−∇w E(w))T d is maximized by d being parallel to −∇w E(w): the direction
of steepest descent is
d = −∇w E(w) / ||∇w E(w)||,
the direction of the negative gradient. To descend most steeply from E(w),
we compute the gradient ∇w E(w) and then march precisely in the opposite
direction. This is the basis for gradient descent optimization. In practice, an
iteration works as follows:

• Compute the gradient g = ∇w E(w).


• Pick a step size η > 0. Update w′ = w − ηg.

The step size η is typically kept constant for many iterations, which corresponds
to scaling the steepest descent direction by the gradient size k∇w E(w)k. We
take large8 steps as long as much progress can be made, but slow down when
the terrain gets flatter. If ∇w E(w) = 0, we have reached a stationary point
of E(w), and gradient descent terminates. Whether or not this means that we
8 This does not always work well, and we will discuss improvements shortly.

have found a minimum point of E(w) (a local or even a global one), depends on characteristics of E(w) [2]. This much can be said already: for the squared error E(w) over linear functions, we will have found a global minimum point in this case (see Section 4.2).
Will this method converge (and how fast)? How do I choose the step size η? Shall
I modify η as I go? Answers to these questions depend ultimately on properties of
E(w). For the squared error, they are well understood. Many results have been
obtained in mathematical optimization and machine learning. A clear account
is given in [2].

Gradient for Squared Error

What is the gradient for the squared error E(w)? Let us move in steps. First,

∂E 1 ∂
= (yi − ti )2 = yi − ti ⇒ ∇y E = y − t = Φw − t.
∂yi 2 ∂yi

This is why we use the prefactor 1/2. Second, yi = φ(xi )T w, so that ∇w yi =


φ(xi ). Using the chain rule of differential calculus:
∇w E = ∑_{i=1}^n (∂E/∂yi) ∇w yi = ∑_{i=1}^n (yi − ti) φ(xi) = Φ^T (y − t) = Φ^T (Φw − t).

As more complicated settings lie ahead, you should train yourself to do deriva-
tives in little chunks, exploiting this fabulous divide and conquer rule. The ex-
pression for ∇w E is something we will see over and over again. y − t = Φw − t
is called the residual vector; it is the difference between prediction and true target
for each data case. The gradient is obtained by mapping the residuals back to
the weights, multiplying by ΦT . In a concrete sense developed in Section 4.2,
we project that part of the residual error we can do anything about by changing
w.
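As a hedged illustration of these formulas, the following NumPy sketch computes the gradient Φ^T(Φw − t) and runs a few plain batch gradient descent steps on hypothetical toy data; the step size eta is picked by hand and is not tuned in any principled way.

import numpy as np

def squared_error(w, Phi, t):
    r = Phi @ w - t                      # residual vector y - t
    return 0.5 * r @ r                   # E(w) = 0.5 * ||Phi w - t||^2

def gradient(w, Phi, t):
    return Phi.T @ (Phi @ w - t)         # grad E(w) = Phi^T (Phi w - t)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 5))       # hypothetical design matrix
t = np.sign(rng.standard_normal(50))     # hypothetical +/-1 targets
w = np.zeros(5)
eta = 0.005                              # small constant step size
for _ in range(500):
    w = w - eta * gradient(w, Phi, t)    # step along the negative gradient
print(squared_error(w, Phi, t))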

Problems, Error Functions and Algorithms

We have reformulated linear binary classification as the optimization problem


of minimizing the squared error function E(w), which we can do in practice by
gradient descent optimization. The importance of this step will become apparent
as we move to more challenging problems. The perceptron algorithm is nice, but
it solves precisely one problem. Modify the latter a bit, and you are stuck. For the next machine learning problem you need to solve, it is easier to design an appropriate error function and optimization problem, then pick a solver for it “off the shelf”, or at least build on 200 years of prior experience in this domain. Our
guiding principles will be:

• For the most part, machine learning is about mapping real-world clas-
sification, estimation or decision-making problems to mathematical opti-
mization problems, which can be solved by robust, well understood and
efficient algorithms.

• We separate three things: (a) the real-world problem motivating our ef-
forts, (b) the optimization problem we try to map it to, and (c) the method
for solving the optimization problem. In this section, (a) we would like to
classify handwritten 8s versus 9s. To this end, (b) we minimize the squared
error function E(w), and (c) we do so by gradient descent optimization.
The rationale should be obvious: the mapping from one to the other is not
one to one. If we mix up the problem and an algorithm to solve it, we will
miss out on other algorithms solving the same problem in a better way.

2.4.2 Online and Batch Learning. Perceptron Algorithm as Gradient Descent

Note that the squared error function E(w) is a sum of terms (yi − ti )2 /2, one
for each data point (xi , ti ). This additive structure occurs very frequently in
machine learning:
E(w) = (1/n) ∑_{i=1}^n Ei(w).
Note that in this section, we consider error functions to be normalized by the
number n of training points. The structure implies that in order to compute the
gradient ∇w E, we have to run an accumulation loop over the whole dataset.
If the number of data points n is very large, gradient descent looks a whole
lot less attractive. To ensure proper convergence, the step size η must be rather
small, yet every update requires a run over all training points. Such optimization
algorithms are called batch methods: for every update of w, the whole dataset (or
at least large batches thereof) has to be processed. In contrast, online methods
update the weight vector based on ∇w Ei (w) computed on a single case. An
example is stochastic gradient descent, whose k-th iteration is:

• Pick a training case at random (say, i(k)). Compute gk = ∇wk Ei(k).

• Update wk+1 = wk − ηk gk.
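In code, one iteration of this scheme is only a few lines. The NumPy sketch below applies it to the normalized squared error, with the ηk = 1/k schedule mentioned further below; the data arrays Phi and t are hypothetical placeholders.

import numpy as np

def sgd_least_squares(Phi, t, num_iters=5000, seed=0):
    # Minimizes E(w) = (1/n) sum_i 0.5 * (phi_i^T w - t_i)^2 by stochastic gradient descent.
    rng = np.random.default_rng(seed)
    n, p = Phi.shape
    w = np.zeros(p)
    for k in range(1, num_iters + 1):
        i = rng.integers(n)                   # pick a training case at random
        g = (Phi[i] @ w - t[i]) * Phi[i]      # gradient of E_i(w) at the current w
        w = w - (1.0 / k) * g                 # step size eta_k = 1/k
    return w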

The goal remains unchanged: minimize E(w), a sum over all data points. Com-
pared to gradient descent, each update is about n times faster to compute. On
extremely large training sets, running an online algorithm may be the sole viable
option. On the other hand, stochastic gradient descent does not update along
the direction of steepest descent for E(w). An update may even increase the
error function value, and this will go unnoticed in general. Why would such an
algorithm ever work?
Some rough intuition goes as follows. The usual gradient is
∇w E(w) = (1/n) ∑_{i=1}^n ∇w Ei(w),

the empirical average of the stochastic gradient terms ∇w Ei (w). Each


∇w Ei (w) is like a random sample with mean ∇w E(w), a random perturbation
of the true gradient. As ηk decreases to zero, the fluctuations average out over
successive steps. On the other hand, ηk must decrease sufficiently slowly in order

to avoid premature convergence (a rate of ηk = 1/k is typically used). Under


certain conditions on Ei (w), convergence can be established [6].
Many machine learning algorithms are of the online, stochastic gradient descent type. Let us close this section by showing that the perceptron algorithm (Algorithm 1) is among them. To this end, we need to determine an error function Eperc(w) = n^{-1} ∑_{i=1}^n Eperc,i(w), so that the update of the perceptron algorithm on case (xi, ti) corresponds to a stochastic gradient step. As the perceptron algorithm does not feature a step size ηk and is finitely convergent, we can assume9 that ηk = 1 here. Then,
∇w Eperc,i(w) = { −ti φ̃i  if ti wT φ̃i ≤ 0;   0  if ti wT φ̃i > 0 },

since the right hand side is what is subtracted from w when visiting (φ̃i , ti ). This
means that Eperc,i (w) should be zero for ti wT φ̃i > 0, Eperc,i (w) = −ti wT φ̃i
for ti wT φ̃i ≤ 0. Concisely,

Eperc,i(w) = (−ti wT φ̃i) I{ti wT φ̃i ≤ 0} = g(−ti wT φ̃i),   g(z) = z I{z ≥ 0}.

To conclude, the perceptron algorithm can be understood as a variant of stochas-


tic gradient descent, minimizing the perceptron error function
Eperc(w) = (1/n) ∑_{i=1}^n g(−ti wT φ̃i),   g(z) = z I{z ≥ 0}.

This function is continuous and piecewise linear, but not everywhere differentiable. In general, stochastic gradient descent and even gradient descent can fail
to converge on nondifferentiable criteria. However, in the case of the perceptron
algorithm, convergence is established independently by Theorem 2.1.
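To connect the formula to code, here is a small sketch of the perceptron error function and of the perceptron update read as a stochastic gradient step with step size 1; Phi_tilde stands for the matrix whose rows are the patterns φ̃i and, like the other names, is a hypothetical placeholder.

import numpy as np

def perceptron_error(w, Phi_tilde, t):
    # E_perc(w) = (1/n) sum_i g(-t_i w^T phi_tilde_i), with g(z) = z * 1{z >= 0}.
    z = -t * (Phi_tilde @ w)
    return np.mean(np.maximum(z, 0.0))

def perceptron_update(w, phi_tilde_i, t_i):
    # One perceptron step: subtract the gradient of E_perc,i (step size 1).
    if t_i * (phi_tilde_i @ w) <= 0.0:        # case is misclassified (or on the boundary)
        w = w + t_i * phi_tilde_i
    return w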

2.4.3 Techniques: Matrices and Vectors. Outer Product


At first sight, a matrix is a two-dimensional array of numbers:

        [ a_11 ... a_1q ]
    A = [  ...  ...  ... ]  ∈ R^{p×q}.
        [ a_p1 ... a_pq ]
The space of all real-valued matrices of p rows, q columns is denoted Rp×q .
It is itself a vector space. Addition and scalar-multiplication of matrices work
component-wise as for vectors. In fact, our column vectors are special matrices:
Rp = Rp×1 . We use similar notation as with vectors. For example, the design
matrix of Section 2.4 can be defined as Φ = [φj (xi )]ij ∈ Rn×p . Special matrices
are 0 = [0] (matrix of all zeros) and the identity

        [ 1 0 ... 0 ]
    I = [ 0 1 ... 0 ]  ∈ R^{p×p}.
        [ ... ... ... ]
        [ 0 0 ... 1 ]
9 More precisely, the perceptron algorithm runs independently of the size of ηk > 0, since we can rescale w by any positive constant. Setting ηk = 1 leads to the simplest notation.

A matrix is called square if p = q (same number of rows and columns).


A more useful way to understand matrices is as linear transforms of finite-
dimensional vector spaces. A matrix A ∈ Rp×q acts on a vector x ∈ Rq , mapping
it to y = Ax ∈ Rp :
yi = ∑_{j=1}^q a_ij xj   ⇒   y = ∑_{j=1}^q xj a_j,   A = [a_1 . . . a_q].

In particular, Aδ j = aj (recall δ j from Section 2.1.1). A good way to think


about a matrix is in terms of its columns. It is the linear transformation which
maps the standard Euclidean coordinate system [δ j ], j = 1, . . . , q, to [aj ]. For
example, the identity matrix has columns δ j : I = [δ 1 . . . δ p ]. It maps each
vector onto itself. The transpose of a matrix A ∈ Rp×q, denoted by AT, is obtained by interchanging row and column indices: AT = [a_ji]_ij ∈ R^{q×p}. The columns of A become the rows of AT, and vice versa.
If A ∈ Rp×q and B ∈ Rr×p , we can concatenate the corresponding linear
transformations. If x ∈ Rq, then y = Ax ∈ Rp and z = By ∈ Rr:

z_i = ∑_{j=1}^p b_ij y_j = ∑_{j=1}^p b_ij ∑_{k=1}^q a_jk x_k = ∑_{k=1}^q (∑_{j=1}^p b_ij a_jk) x_k = ∑_{k=1}^q c_ik x_k.

The new matrix C = BA ∈ Rr×q is the matrix-matrix product (or just matrix
product) of B and A. The matrix product is associative and distributive, but
not commutative: AB 6= BA. If A and B have no special structure, computing
C costs O(r p q). Compare this to O(p q + r p) for computing z from x by two
matrix-vector products. These differences become even more pronounced if A
and B have useful structure, which is lost in C .
A special matrix product is that between a column and a row vector:

                                  [ x_1 y_1 ... x_1 y_q ]
    x y^T = [y_1 x . . . y_q x] = [    ...   ...    ...  ]  ∈ R^{p×q},   x ∈ R^p, y ∈ R^q.
                                  [ x_p y_1 ... x_p y_q ]

This is called the outer product between x and y. Compare this to the inner
product (Section 2.1.1) xT y ∈ R, which works for x, y ∈ Rp only (same dimen-
sionality). The linear transformation of the outer product xy T maps any vector
z ∈ Rq to a scalar multiple of x:

(x y^T) z = x (y^T z) = α x,   α = y^T z.     (2.2)

It is an extreme case which shows you that even if a linear transform maps into
an output vector space Rp of p dimensions, it may not reach every vector in
that space. Test your understanding by the following column representations of
a matrix in terms of outer products:
A = ∑_{j=1}^q a_j δ_j^T,   I = ∑_{j=1}^q δ_j δ_j^T.

Moreover, what does the matrix 11^T look like (1 ∈ R^p, 1^T ∈ R^{1×q})?


When learning about a subject like machine learning or data analysis, there are
a few very big steps you can take, which will simplify your working life enor-
mously. One of them is getting used to writing tedious, complicated expressions
involving sums and indexing in terms of matrices and vectors. Doing so avoids
indexing bugs, exposes the bigger picture about an equation, and directly leads
to expressions which are most efficient to compute. We have already started do-
ing that for linear least squares estimation in Section 2.4. Here some examples.
To extract columns and rows of a matrix, use the delta vectors:

Aδ j = aj = [aij ]i , δ Ti A = [ai1 , . . . , aiq ].

To sum up columns or rows of a matrix, use the vector 1 of all ones:

A1 = ∑_{j=1}^q a_j = [∑_{j=1}^q a_ij]_i,   1^T A = [∑_{i=1}^p a_i1, . . . , ∑_{i=1}^p a_iq].

For example, q −1 A1 is the arithmetic mean of the columns of A. We will use


tricks like these repeatedly during the course, and I strongly encourage you to
adopt vectorization for yourself.
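Here is a small NumPy sketch of these tricks on a hypothetical 3 × 4 matrix; extracting a column or row, summing columns and rows, column means, and outer products are all one-liners.

import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)   # a small 3 x 4 example matrix

col_3 = A[:, 2]                # the effect of A delta_3 (third column)
row_2 = A[1, :]                # the effect of delta_2^T A (second row)

col_sum = A @ np.ones(4)       # A 1: sum of the columns
row_sum = np.ones(3) @ A       # 1^T A: sum of the rows
col_mean = A @ np.ones(4) / 4  # q^{-1} A 1: arithmetic mean of the columns

x, y = np.array([1.0, 2.0, 3.0]), np.arange(4.0)
outer = np.outer(x, y)         # x y^T, a rank-one 3 x 4 matrix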
Vectors are matrices: column vectors in Rp×1, row vectors in R1×q. Another important class of square matrices determined by vectors is that of diagonal matrices:

             [ a_1  0  ...  0  ]
    diag a = [  0  a_2 ...  0  ]  ∈ R^{p×p},   a ∈ R^p.
             [ ...  ... ... ... ]
             [  0   0  ... a_p ]

The matrix-vector multiplication with a diagonal matrix is simply the


component-wise product: (diag a)x = [ai xi ]. This is also called the Schur or
Hadamard product, denoted a ◦ x. The operator diag maps vectors in Rp to
diagonal matrices in Rp×p . We often have to access the diagonal of a square
matrix:
diag(A) = [aii ] ∈ Rp , A ∈ Rp×p .
Note that we use the diag operator in two different ways, which can be distinguished by checking whether its argument is a vector or a square matrix. Test:
What is diag diag(A)? What is diag(diag a)?
Finally, A ∈ Rp×q linearly maps vectors from Rq to Rp , but the best way to
think about A is that it linearly maps the vector space Rq onto a linear subspace
of Rp , called the range and denoted by ARq . This is a linear subspace precisely
because A is a linear transform. The range is equal to the span of {aj }, the
columns of A:
A R^q = span({a_j}) = { ∑_{j=1}^q x_j a_j | x ∈ R^q }.

Picture it for yourself. A line in R2 is the range of a vector d ∈ R2 \{0}, pointing


along the line. The rank of a matrix A, rk(A), is the dimensionality of its range.
We have that rk(A) ≤ min{p, q}. Namely, ARq is a subspace of Rp (≤ p), and

it is generated by the q columns of A (≤ q). If rk(A) = min{p, q}, we say that


A has full rank. An important result is that rk A = rk AT : the ranges ARq and
AT Rp have the same dimensionality.
Some examples. The identity I ∈ Rp×p always has full rank p. On the opposite
end, rk 0 = 0 (zero matrices are the only matrices of rank zero). What is the
rank of the outer product xy T ? Recall (2.2): its range is the span of the single
vector x, so rk(xy T ) = 1 (unless x = 0 or y = 0, then the rank is zero).
While ARq is a subspace of Rp , the second important player is a subspace of
Rq , the null space (or kernel) of A:
ker A = { x ∈ R^q | Ax = 0 }.

Please verify for yourself that ker A is indeed a linear subspace. The role of the
null space is as follows. Suppose that y = Ax0 . If you add any v ∈ ker A to
x0 , then x0 + v is still mapped to y. In fact, the set of all solutions x to the
linear system Ax = y is precisely the affine subspace
x_0 + ker A = { x_0 + v | v ∈ ker A }.

Test yourself: what is the kernel of the matrix wT ∈ R1×q ? Does this remind you
of something in Section 2.2? It is argued in [42] that you only really understand
matrices if you understand the four subspaces: ARq , AT Rp , ker A, and ker AT .
I highly recommend you study chapters 3 and 4 of [42] to refresh your memory
and gain a better perspective. One important result is that the range ARq
and the null space ker AT are orthogonal subspaces (Section 2.1.1). Namely, if
x = Av is from ARq , y ∈ ker AT , then
y^T x = y^T A v = (A^T y)^T v = 0^T v = 0.

To test your understanding, combine this fact with rk A = rk AT to show that


for A ∈ Rp×q , the sum of rk A and the dimensionality of ker A is q.
Finally, a square matrix A ∈ Rp×p is invertible if and only if it is full rank:
rk A = p, ARp = Rp . Equivalently, the square matrix A is invertible if and
only if ker A = {0}: Av = 0 implies that v = 0. It is only in this case that we
can go back and invert a system for every pair of vectors: y = Ax to x = A−1 y.
The inverse is defined by

AA−1 = I = A−1 A.

As we will see later during the course, it is not in general a good idea to compute
the inverse in practice, where other techniques are faster and more reliable.
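As a brief, hedged illustration of that last remark: in NumPy one would solve a linear system directly rather than form the inverse; the matrix below is a hypothetical random example.

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))          # random square matrix (full rank with probability one)
y = rng.standard_normal(5)

x_via_inverse = np.linalg.inv(A) @ y     # works, but forms A^{-1} explicitly
x_via_solve = np.linalg.solve(A, y)      # solves A x = y directly; preferred in practice

print(np.allclose(x_via_inverse, x_via_solve))   # True: same solution
print(np.linalg.matrix_rank(A))                  # 5: full rank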
Chapter 3

The Multi-Layer Perceptron

In this chapter, we discuss multi-layer perceptrons (MLPs), a layered nonlinear


extension of linear techniques treated in the previous chapter. MLPs are among
the most widely used so-called neural network models. We will motivate why
they can improve upon linear classifiers and detail how they can be trained
efficiently through error backpropagation. We discuss some aspects of MLPs in
practice and give links to modern nonlinear optimization methods which play a
central role in applied machine learning.
While many books1 and very many papers have been written about MLPs, they
are mathematically speaking not very deep concepts. They are nonlinear func-
tion classes, whose convenient layered structure leads to efficient computation
of gradients (and even Hessians). The celebrated backpropagation technique is
simply the chain rule of differential calculus. In particular, MLPs are not good
models for parts of the brain. Real neurons exhibit stochastic (noise, leaking
of current) and dynamical behaviour (rhythms, synchronization), MLPs simply
represent fixed deterministic functions. In this course, we will treat MLPs as a
convenient tool to approach machine learning problems.

3.1 Why Nonlinear Classification?


In the previous chapter, we studied linear classification. We worked out a sim-
ple algorithm to learn such perceptrons and understood a good deal about its
geometric properties. Using feature spaces, linear classifiers can realize complex decision boundaries, and they can be trained very efficiently. Why
would we need anything else?
The simplicity of linear discriminants comes at a price: they lack in flexibility.
One way to assess the flexibility of a class of classifiers is to ask how many
different discrimination problems can be solved perfectly, versus how many
1 In the author’s opinion, based on limited exposure, there are few good books about MLPs,

or more general neural networks. The author’s favourite is [4]. Then there is [22] and [35].


Figure 3.1: The XOR problem is not linearly separable in R2 . In order to solve it
without error, we need discriminant functions which are nonlinear in the input
space R2 .

parameters have to be adjusted during training. Consider a linear classifier


f (x) = sgn(wT φ(x)), where w, φ(x) ∈ Rp . No matter what features you pick,
it is not hard to show that there is always a dataset {(xi , ti )} of size ≤ p + 1
which is not linearly separable. The ratio between size of (always) separable
datasets and number of free parameters is one. One example is the XOR prob-
lem (Figure 3.1): four patterns in R2 which are not linearly separable (note that
p = 3, since w includes a bias parameter). If the input points are (0, 0), (0, 1),
(1, 0), (1, 1), the function sought after is the exclusive-or. In higher dimensions,
these non-separable datasets are not rare worst-case examples, but they are in
fact the typical case. With nonlinear classifiers, we expect to do much better in
that respect.
There are many ways to step from linear to nonlinear. Two powerful ideas will
be studied in this course:

• Make use of feature maps φ(x; θ) which are themselves parameterized


by θ, and learn θ during training as well. Since we know about linear
mappings already, a most sensible approach would be to make use of them
in the construction of φ(x; θ). This idea leads to multi-layer perceptrons,
the topic of this chapter.

• Use a feature map φ(x) and weight vector w in a very high-dimensional


space, p exponential in d (input space dimensionality), maybe even p =
∞. Find training and prediction algorithms which do not represent w
directly, and whose scaling is independent of p. We follow up on this idea
in Chapter 9.

3.2 Multi-Layer Perceptrons


We are computer scientists. If a technique works and we understand it well,
but it just does not do the next job properly, we glue it together in a composite

architecture and check whether it works. A first attempt would be to concatenate


linear discriminants. For example, for the linear discriminant y(x) = wT φ(x),
we could use feature functions φj (x) = θ Tj x, coming with free parameters θ j ∈
Rd . However, this does not lead us outside of the linear world. The concatenation
of linear maps is a linear map again (Section 2.4.3), so our combined discriminant
is a linear function and nothing is gained. How about combining linear classifiers
(as opposed to discriminants) in this way? In the example above, we could
use φj (x) = sgn(θ Tj x + bj ). Since sgn(·) is a highly nonlinear function, the
corresponding discriminant y(x) is nonlinear. A major problem with this class
in practice is that, due to the discontinuous sgn(·) functions, it is very hard
to train such a model on data. If we used a continuously differentiable, yet
nonlinear transfer function in place of sgn(·), we could compute the gradient
w.r.t. parameters and run gradient descent. This would be a first example of a
multi-layer perceptron (MLP).

Figure 3.2: Multi-layer perceptron with two layers and tanh(·) transfer function.
See text for explanations.

Let us work out an example and thereby fix concepts. Given inputs x ∈ Rd ,
define activations
a_q^{(1)} = (w_q^{(1)})^T x + b_q^{(1)} = ∑_{j=1}^d w_{qj}^{(1)} x_j + b_q^{(1)},   q = 1, . . . , h_1.

We can collect the activations in a vector a^{(1)} = [a_q^{(1)}] ∈ R^{h_1}. Note that this is really a mapping of x. Next, we pass each activation through a transfer function g(·): z_q^{(1)} = g(a_q^{(1)}). h_1 is the size of the first layer. We say that the first layer has h_1 units, more specifically hidden units (“hidden” refers to the fact that activation values are not given as part of a training dataset). z_q^{(1)} is the output of the q-th unit in the first layer.

The transfer function is chosen as part of the architecture. It must be defined


on all of R, nonlinear and continuously differentiable. A simple generalization
is to use different transfer functions for each layer, or even for each unit. We
use a single g(·) in this chapter for notational simplicity only. A frequently used transfer function is the hyperbolic tangent:

g(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}).

This is an odd function, g(−a) = −g(a), g(0) = 0. Its asymptotes are lim_{a→±∞} g(a) = ±1 (Figure 3.2, right). Moreover, g′(a) = 1 − g(a)², in particular g′(0) = 1, so that g(ε) ≈ ε for ε ≈ 0: for small arguments, g(a) behaves like a linear function, but it saturates for larger ones. Notice that a ↦ g(a/ε) tends to the step function sgn(a) as ε → 0, so that tanh(·) can be seen as a smooth approximation to the sign function.
We now use z^{(1)} = [z_q^{(1)}] ∈ R^{h_1} as input to the next layer:

a_q^{(2)} = (w_q^{(2)})^T z^{(1)} + b_q^{(2)},   z_q^{(2)} = g(a_q^{(2)}),   q = 1, . . . , h_2.

Note that this layer looks the same as the first, only that its inputs are hidden values z^{(1)} instead of observed x. We can use as many layers as we like, but the simplest choice above the linear class is to use two layers. The activations a_q^{(2)} of the uppermost layer are called outputs. They are taken as they are; no further transfer function is applied, so no z_q^{(2)} are computed. For our binary classification example, we would have h_2 = 1 (just one unit in the second layer), and a^{(2)}(x) corresponds to the nonlinear discriminant function y(x) (we drop the subscript “1” here, as there is a single activation only). The resulting classifier is f(x) = sgn(a^{(2)}(x)).
An example of a two-layer MLP is given in Figure 3.2. Its parameters are {(w_q^{(1)}, b_q^{(1)}) | q = 1, . . . , h_1} for the first layer, and the usual w^{(2)}, b^{(2)} for the second, linear layer. Altogether,

f(x) = sgn(a^{(2)}(x)),   a^{(2)}(x) = ∑_{q=1}^{h_1} w_q^{(2)} g(∑_{j=1}^d w_{qj}^{(1)} x_j + b_q^{(1)}) + b^{(2)},

where the argument of g(·) is the first-layer activation a_q^{(1)}.

Our comment at the beginning of this section is clear now. A two-layer MLP has the form of a linear model a^{(2)}(x) = (w^{(2)})^T φ(x; θ) + b^{(2)}, where the features

φ_q(x; θ) = g((w_q^{(1)})^T x + b_q^{(1)})

are nonlinear functions of x, configured by additional parameters θ = {w_q^{(1)}, b_q^{(1)}}. The fact that θ has to be learned alongside w^{(2)} and b^{(2)} is what
makes this setup nonlinear. Note that the number of layers is determined by
the number of linear activation maps. Beware that other books may count the
number of variable layers, so call our example “three-layer network” (x would
be the “input layer” for them), while others count the number of hidden layers
and would call this a “single hidden layer network”.
Multi-layer perceptrons (MLPs) are learned from training data in the same
general way as linear classifiers. Recall Section 2.4. We pick an error function

E(w), quantifying the mismatch between predictions (outputs) and training set targets. Here, we collect all parameters of the network model in a single large vector w. Test your understanding by writing down what w looks like for the two-layer example above. You have to end up with (d + 1)h1 + h1 + 1 parameters. Let us use the squared error function:

E(w) = (1/2) ∑_{i=1}^n (a^{(2)}(x_i) − t_i)².

Next, we learn weights w by minimizing E(w), for example by gradient descent.


From Section 2.4.1, we know that the gradient is
∇w E = ∑_{i=1}^n (a^{(2)}(x_i) − t_i) ∇w a^{(2)}(x_i).

This has the same flavour as in the linear case. First, we compute residuals
a(2) (xi ) − ti for every training case. Then, we combine these with gradients of
the outputs in order to accumulate the gradient. For online learning (stochastic
gradient descent), we compute the gradient on a single case only. Then, we make
a short step along the negative gradient direction. There is of course one key
difference to the linear case: ∇w a(2) (xi ) is not just some fixed φ(xi ) anymore,
but looks daunting to compute. We will meet this challenge in Section 3.3.
A final “historical” note before we move on. A surprisingly large amount of work
has been spent on figuring out which functions can or cannot be represented by
an MLP with two layers (one hidden, one output). If you really want to know
about this, [4, ch. 4] gives a brief overview. For example, a two-layer network
with tanh transfer functions can approximate any continuous function on a
compact set arbitrarily closely, if only the number h1 of hidden units is large
enough. These results are of close to no practical relevance, not only because the
number h1 has to be very large, but also because what really counts is whether
we can learn complex functions from training data in a reliable and efficient
way.

3.2.1 Vectorization of MLP Formalism

Recall from Section 2.4.3 that part of our mission is to vectorize our procedures.
Vectorization does not only expose important structure most clearly and helps
with “debugging” expressions, it is an essential step towards elegant and efficient
implementation. Most machine learners use Matlab or Python for development
and prototyping, interpreted languages in which vectorization is a must. Let
us vectorize the MLP function evaluation. We already have vectors x ∈ Rd ,
a^{(1)} ∈ R^{h_1}. The weight matrix W^{(1)} ∈ R^{h_1×d} consists of the rows (w_q^{(1)})^T, the bias vector is b^{(1)} = [b_q^{(1)}] ∈ R^{h_1}. Then, a^{(1)} = W^{(1)} x + b^{(1)}. Moreover, we extend scalar functions to vectors by simply applying them component-wise. For example, f(v) = [f(v_j)], where f : R → R. Then, the vector of first layer unit outputs is z^{(1)} = g(a^{(1)}). All in all, the first layer is vectorized as:

z^{(1)} = g(a^{(1)}),   a^{(1)} = W^{(1)} x + b^{(1)}.



Our second layer is linear with a single activation a(2) :

a(2) = (w(2) )T z (1) + b(2) .

If instead, we used a second hidden layer, it would be given by

z (2) = g(a(2) ), a(2) = W (2) z (1) + b(2) , W (2) ∈ Rh2 ×h1 . (3.1)

Note the clear separation between linear activation maps and nonlinear trans-
fers.
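The vectorized forward pass translates almost line by line into code. Below is a minimal NumPy sketch of the two-layer case with a single linear output; all parameter arrays are hypothetical stand-ins (the course’s reference implementation is Netlab, see Section 3.4.1).

import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    # Two-layer MLP with tanh transfer: a2 = w2^T tanh(W1 x + b1) + b2.
    a1 = W1 @ x + b1          # first-layer activations, shape (h1,)
    z1 = np.tanh(a1)          # hidden unit outputs
    a2 = w2 @ z1 + b2         # single output activation, the discriminant y(x)
    return a2, a1, z1

rng = np.random.default_rng(0)
d, h1 = 4, 3                                  # hypothetical sizes
W1, b1 = rng.standard_normal((h1, d)), np.zeros(h1)
w2, b2 = rng.standard_normal(h1), 0.0
x = rng.standard_normal(d)
a2, _, _ = mlp_forward(x, W1, b1, w2, b2)
print(np.sign(a2))                            # the classifier f(x) = sgn(a2(x))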

3.3 Error Backpropagation


The content of this section can be summarized in two sentences. Error back-
propagation is nothing but the chain rule of differential calculus. Due to the
multi-layered feed-forward architecture of an MLP, this rule allows us to com-
pute the gradient ∇w E of an error function E(w) in time linear in the number
of training points and the number of weights. This computational edge is what
makes MLPs so attractive in practice.
One point up front. Recall our mantra from Section 2.4. We will carefully sep-
arate our real-world problem, its abstraction as optimization problem, and the
algorithm for solving the latter. Backpropagation is even lower down the hier-
archy, it is simply a deterministic computational technique. You can call it an
algorithm, but not one which solves our optimization problem minw E(w): it
simply computes a gradient. There are many gradient-based optimization algo-
rithms (see Section 3.4.2), and all of them use backpropagation when applied
to MLP training.
Let us compute ∇w E for our two-layer binary classification MLP, continuing where we left off at the end of Section 3.2. First, the error function decomposes as E(w) = ∑_{i=1}^n E_i(w), where E_i(w) = (1/2)(a^{(2)}(x_i) − t_i)². We compute ∇w E_i,
then sum up the results. We fix one i ∈ {1, . . . , n} and drop the index and the
argument xi from the notation (for example, we write a(2) instead of a(2) (xi )).
As in the linear case, we break the derivatives into small, manageable steps. To
this end, we define error (or residual) variables as

r^{(2)} = ∂E_i/∂a^{(2)},   r_q^{(1)} = ∂E_i/∂a_q^{(1)}.

Note that for every activation variable, there is one error variable. Moreover, the error variables determine the gradient components:

∇_{w^{(2)}} E_i = (∂E_i/∂a^{(2)}) ∇_{w^{(2)}} a^{(2)} = r^{(2)} z^{(1)},   ∂E_i/∂b^{(2)} = r^{(2)},
∇_{w_q^{(1)}} E_i = (∂E_i/∂a_q^{(1)}) ∇_{w_q^{(1)}} a_q^{(1)} = r_q^{(1)} x,   ∂E_i/∂b_q^{(1)} = r_q^{(1)}.     (3.2)

All we need to do is to compute the errors. First, r(2) = a(2) − ti . This is just
the residual we already know from the linear case (Section 2.4.1). Finally, we

need the chain rule:

r_q^{(1)} = (∂E_i/∂a^{(2)}) · (∂a^{(2)}/∂a_q^{(1)}) = r^{(2)} ∂/∂a_q^{(1)} ∑_{k′=1}^{h_1} w_{k′}^{(2)} g(a_{k′}^{(1)}) = r^{(2)} w_q^{(2)} g′(a_q^{(1)}).

This means we compute r_q^{(1)} in terms of r^{(2)}, multiplying it with the weight w_q^{(2)} linking the output and the q-th hidden unit, then by the derivative g′(a_q^{(1)}). Note the remarkable symmetry with the activation variables. For them, a^{(2)} is computed in terms of a_q^{(1)}, multiplied by w_q^{(2)} as well.

Figure 3.3: Forward pass (left) and backward pass (right) of error backpropagation algorithm for a two-layer MLP. Notice the symmetry between the two passes, and how information g′(a_q^{(1)}) computed during the forward pass and stored locally at each node is recycled during the backward pass.

To see the backpropagation technique in its full glory, let us consider a network of at least three layers. The relationship between gradient terms and error variables remains the same. To work out r_q^{(1)}, note that the q-th unit in the first layer is now connected to h_2 second layer units. We have to use the chain rule with respect to the error variables for all of them:

r_q^{(1)} = ∑_{j=1}^{h_2} (∂E_i/∂a_j^{(2)}) · (∂a_j^{(2)}/∂a_q^{(1)}) = ∑_{j=1}^{h_2} r_j^{(2)} w_{jq}^{(2)} g′(a_q^{(1)}) = g′(a_q^{(1)}) ∑_{j=1}^{h_2} w_{jq}^{(2)} r_j^{(2)}.     (3.3)

Compare this to the equation

a_j^{(2)} = ∑_{q=1}^{h_1} w_{jq}^{(2)} g(a_q^{(1)}) + b_j^{(2)}

to appreciate the symmetry between the forward propagation of the activation


variables and the subsequent backward propagation of the error variables. These
two passes are illustrated in Figure 3.3. The backpropagation technique to compute the gradient term ∇w Ei for an L-layer network can be summarized as follows:

1. Forward pass: Propagate the activation variables forward through the network, from the inputs up to the output layer. At the j-th unit of layer l, maintain g′(a_j^{(l)}) (or alternatively a_j^{(l)}).

2. Compute the output residual r^{(L)} = a^{(L)} − t_i, in order to initialize the backward pass.

3. Backward pass: Propagate the error variables backward through the network, from the output down to the inputs, using (3.3).

4. Compute the gradient term ∇w E_i based on the error variables, as in (3.2).

The computational cost is dominated by the linear mappings, both in the for-
ward and the backward pass, and the final gradient assembly. In each of the
passes, each weight is touched exactly once (the bias parameters are not used in
the backward pass, but their number is subdominant). Therefore, both passes
scale linearly in the number of weights, and so does the final gradient assembly.
This is the backpropagation technique applied to gradient computations for the
squared error function. Later during the course, in Chapter 8, we will consider
models with several output units and different error functions. Our mantra holds
true: one of the key aspects of backpropagation is that it is invariant to such
changes. You have to write code for it only once in order to serve a wide variety
of machine learning problems.

3.3.1 Vectorization of Error Backpropagation


Once more, vectorization will bring out the basic features even more clearly.
Recall Section 3.2.1, in particular the forward equations (3.1). Here is the vec-
torization corresponding to the backward equations (3.3):
 
r (1) = diag g 0 (a(1) ) (W (2) )T r (2) .

Forward propagation uses W (2) (and b(2) ), backward propagation the transpose
(W (2) )T . For the gradient assembly, (3.2) becomes

∇w(2) Ei = rq(2) z (1) , ∇b(2) Ei = r (2) ,


q

or even more concisely:


∇W (2) Ei = r (2) (z (1) )T .
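Putting the forward pass, the backward pass and the gradient assembly together, the following NumPy sketch computes ∇w E_i for the two-layer tanh MLP of Section 3.2 on a single case (x_i, t_i); it is an illustration under the stated conventions, not the Netlab implementation.

import numpy as np

def backprop_single_case(x, t, W1, b1, w2, b2):
    # Gradient of E_i(w) = 0.5 * (a2(x) - t)^2 for the two-layer tanh MLP.
    # Forward pass (store what the backward pass needs).
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    a2 = w2 @ z1 + b2
    # Output residual initializes the backward pass.
    r2 = a2 - t
    # Backward pass: r1 = diag(g'(a1)) w2 r2, with g'(a) = 1 - tanh(a)^2.
    r1 = (1.0 - z1 ** 2) * w2 * r2
    # Gradient assembly, as in (3.2).
    grad_w2, grad_b2 = r2 * z1, r2
    grad_W1, grad_b1 = np.outer(r1, x), r1
    return grad_W1, grad_b1, grad_w2, grad_b2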

3.4 Training a Multi-Layer Perceptron


In this section, we have a closer look at the problem of minimizing the squared
error function E(w) for a multi-layer perceptron (w is the vector of all weights).
Since error backpropagation offers a computationally attractive way to deter-
mine the gradient ∇w E, we will focus on gradient-based optimization. It will
become clear that the problem minw E(w) is not easily characterized or solved,
and in practice a host of “tricks of the trade” are usually required in order to get

good results. We will cover some of the most important ones. Finally, for batch
optimization, there are far better optimization methods than gradient descent.
Covering them in any depth is out of the scope of this course, but if you use
MLPs in practice, you need to know about them. Some pointers are provided
below.
Suppose our MLP for binary classification has at least two layers, one linear
output layer and at least one hidden layer. In (batch) gradient descent opti-
mization, we decrease E(w) along the negative gradient direction, until we find
w∗ so that ∇w∗ E = 0. This means that w∗ is a stationary point. Suppose we
take a tiny step in direction d, to w∗ + εd. The change in error function value
is
E(w∗ + εd) − E(w∗ ) = ε(∇w∗ E)T d + O(ε2 ) = O(ε2 ).
No matter what direction d we pick (kdk = 1), the error function value will
not change (to first order in ε). Now, for some error functions of specific kind,
a stationary point is a global minimum point, so that minw E(w) is solved.
Alas, not for our MLP problem. First, in order to check whether w∗ is a local
minimum point in the strict sense of mathematical optimization, we have to use
a Taylor expansion of E(w∗ + εd) to second order:
E(w∗ + εd) − E(w∗) = (1/2) ε² d^T (∇∇_{w∗} E) d + O(ε³),
where we used that ∇w∗ E = 0, so the first-order term vanishes. Here,

∇∇_{w∗} E = [∂²E / (∂w_{∗,j} ∂w_{∗,k})]_{j,k}

is the Hessian. w∗ is a local minimum in the strict sense if d^T (∇∇_{w∗} E) d > 0
for any unit norm direction d, which guarantees E(w∗ + εd) > E(w∗ ) for small
ε > 0. In other words, the Hessian has to be positive definite, a concept we will
revisit later during the course (Section 4.2.2). More details about conditions for
local minima can be found in [2, ch. 1].
There is a more serious problem. Suppose that w∗ is a local minimum point
in the strict sense. It will in general not be a global minimum point: E(w∗ ) >
minw E(w). And there are no tractable conditions for finding out how big the
difference is, or how far away w∗ is from a global minimum point. In Figure 3.4, we illustrate the concepts of local and global minima, maxima, and station-
ary points. The squared error function for an MLP with ≥ 2 layers has many lo-
cal minimum points which are suboptimal. To see why that is, consider the high
degree of non-identifiability of the model class: for a particular weight vector w,
we can easily find very many different w0 so that E(w) = E(w0 ). Here are two
examples. First, the transfer function g(a) = tanh(a) is odd, g(−a) = −g(a).
Pick one latent unit with activation a_q^{(l)} and flip the signs of all weights and bias parameters of connections feeding into the unit from below. This flips the sign of a_q^{(l)}, therefore of g(a_q^{(l)}) as well. Now, flip the signs of all weights and biases on connections coming from the unit, which compensates the sign flip of g(a_q^{(l)}). All outputs remain the same, and so does the error function value.
Second, we can simply interchange any two units in a hidden layer, leading to
a permutation of the weight vector w. In short, the error function E(w) for an

Figure 3.4: A non-convex function, such as the squared error for a multi-layer
perceptron, can have local minima and maxima, as well as stationary points
which are neither of the two (called saddle points). A stationary point is defined
by ∇w E = 0.

MLP is non-identifiable on a massive scale, which implies local minima issues


in general.
While there are methods from global mathematical optimization which provably
track down global minimum points, these are far too expensive to run for most
real-world MLP training problems. The general consensus is to be satisfied with
local minimum solutions. It is good practice to use a number of training runs
from different initial points (more on initialization below), then to pick the
one attaining the lowest error value. Another good idea is to pick an optimizer
different from gradient descent, which is less prone to getting stuck in shallow
local minima (Section 3.4.2).
A final comment, whose profound significance we will explore in more detail in
later parts of this course. Do we even want to minimize E(w) globally? Mini-
mizing the squared error on our dataset seems overall a sensible objective, but
it does not in general guarantee good performance on the underlying statistical
problem (say, discriminating hand-written 8s from 9s). Remember the lookup
table classifier from Section 2.1, it achieves zero error on the training data.
Anybody who uses MLPs in practice will have run into some situations where
a smaller training error E(w) comes with worse overall performance on unseen
patterns. This over-fitting problem is central to what makes machine learning
challenging and interesting. We will start to understand over-fitting in Chap-
ter 7, where we also learn about strategies how to alleviate this problem in
MLP practice: regularization (Section 7.2) and early stopping (Section 7.2.1).
It is also the case that larger networks with more units and weights do not nec-
essarily imply better performance, a lesson we will learn about in Chapter 10,
along with techniques to select model size. For now, it is important not to draw
erroneous conclusions from the fact that straight minimization of the training
squared error E(w) does not always lead to optimal results. Counter-measures
against over-fitting almost invariably modify, instead of abandon, the data fit
criterion, and its robust numerical optimization remains at the heart of machine
learning practice.

3.4.1 Gradient Descent Optimization in Practice


Recall the general structure of gradient descent minimization of E(w) from
Section 2.4.1. Methods differ from each other in two independent aspects. First, a
starting point w0 has to be chosen. Second, the k-th update consists in choosing
a direction dk and step size ηk > 0, then updating wk+1 = wk + ηk dk . This
choice is based on the gradient ∇wk E or a stochastic gradient ∇wk Ei(k) .

Initialization (*)

We begin with the initialization issue. For model classes like MLPs, the rela-
tionship between weights w and outputs is highly non-transparent, and it is not
usually possible to select a good starting point w0 in a well-informed way. It is
therefore common practice to initialize w0 at random, drawing each component
of w0 independently from a zero-mean Gaussian distribution (see Section 6.3)
with component-dependent variances. In order to understand why the choice of
these variances makes a difference, and also why we should not just start with
w0 = 0, consider the following argument. A good starting point is one from
which training can proceed rapidly. Recall the transfer function g(a) = tanh(a)
from Section 3.2 (the argument holds for other transfer functions as well). In a
nutshell, a good initial w0 is chosen so that many or most activation values a_q^{(l)} lie in the area of largest curvature |g″(a)|. Let us see why. If a is far away from zero, g(a) is saturated at sgn(a), with g′(a) ≈ 0. Large coefficients of w0 imply large (absolute) activation values, and the role of g′(a) in the error backpropagation formulae (3.2) implies small gradients and slow progress in general. On
the other hand, for a ≈ 0, g(a) ≈ a is linear. If w0 is chosen such that most
activations are small (most extreme: w0 = 0), the network behaves like a linear
model, and many iterations are needed to step out of this linear regime.
The upshot of this argument is that we should sample w0 in a way that most
activations a_q^{(l)}(x_i) are of order unity, on average over the training sample {x_i}
and w0 (recall that n−1 E(w) is the average of individual errors over the train-
ing set). Assume that the data has been preprocessed to zero mean and unit
covariance:
(1/n) ∑_{i=1}^n x_i = 0,   (1/n) ∑_{i=1}^n x_i x_i^T = I.
This transformation, known as whitening, is a common step in preprocessing
pipelines (see also Section 11.1.1). Consider the first layer activation a_q^{(1)}. Since
E[w0 ] = 0, we have that
E[a_q^{(1)}] = ∑_{j=1}^d E[w_{qj}^{(1)}] E[x_{i,j}] + E[b_q^{(1)}] = 0.

For the variance,

Var[a_q^{(1)}] = E[(∑_{j=1}^d w_{qj}^{(1)} x_{i,j} + b_q^{(1)})²]
             = ∑_{j=1}^d Var[w_{qj}^{(1)}] Var[x_{ij}] + Var[b_q^{(1)}] = ∑_{j=1}^d Var[w_{qj}^{(1)}] + Var[b_q^{(1)}].

Here, we used that w_{q1}^{(1)}, . . . , w_{qd}^{(1)}, b_q^{(1)} are independent under the distribution of w0, and that Var[x_{ij}] = 1 due to preprocessing. In order to obtain Var[a_q^{(1)}] = 1, we should choose Var[w_{qj}^{(1)}] and Var[b_q^{(1)}] on the order of 1/(d + 1). While for units higher up, this argument does not exactly hold anymore (as variables become dependent), it still provides us with the right scaling for our distribution over w0. For a unit in the l-th layer, the activation a_q^{(l)} receives input from h_{l−1} units below (here, h_0 = d), and Var[w_{qj}^{(l)}] and Var[b_q^{(l)}] should be chosen on the order of 1/(h_{l−1} + 1).
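A sketch of such an initialization in NumPy, assuming whitened inputs; layer_sizes is a hypothetical list [d, h1, ..., hL] and the names are placeholders.

import numpy as np

def init_mlp_weights(layer_sizes, seed=0):
    # Draw each weight and bias from N(0, 1/(h_{l-1} + 1)), per the scaling argument above.
    rng = np.random.default_rng(seed)
    params = []
    for h_in, h_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        std = 1.0 / np.sqrt(h_in + 1)                  # standard deviation; variance 1/(h_in + 1)
        W = std * rng.standard_normal((h_out, h_in))
        b = std * rng.standard_normal(h_out)
        params.append((W, b))
    return params

params = init_mlp_weights([10, 20, 1])                 # d = 10, h1 = 20 hidden units, one output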

Learning Rate. Learning with Momentum Term

Choosing the learning rate ηk seems more of an art than a science. Ultimately,
the best choice is determined by the local curvature (second derivative), which
is typically not computed for MLPs (although it can be obtained at surprisingly
little extra effort, see [4, ch. 4.10] or [5, ch. 5.4]). In the context of online learn-
ing, the proposal ηk = 1/k is common. For batch gradient descent, a constant
learning rate often works well and is faster [4, ch. 7.5]. If assessing the Hessian is
too much for you, there is a wealth of learning rate adaptation techniques from
the old neural networks days [4, ch. 7.5.3].

Figure 3.5: Gradient descent optimization without (top) and with momentum term (middle, bottom). We update ∆wk = −η(1 − µ)∇wk E + µ∆wk−1, where η is selected by line minimization.

One serious problem with gradient descent is known as zig-zagging. If E(w)



is highly curved2 around the current point wk , the steepest descent direction,
which ignores the curvature information (Section 3.4), is seriously misleading.
Subsequent steps along the negative gradient lead to erratic steps with slow
overall progress. Zig-zagging is illustrated in Figure 3.5, top panel. The idea
behind a momentum term is the observation that during a phase of zig-zagging,
it would be better to move along a direction averaged over several subsequent
steps, which would smooth out the erratic component, but leave useful system-
atic components in place. Denote by ∆wk = wk+1 − wk the update done in
iteration k. Gradient descent with momentum works as follows:

∆wk = −η(1 − µ)∇wk E + µ∆wk−1 , µ ∈ [0, 1).

Note that we assume a constant learning rate η > 0 for simplicity. The new
update step is a convex combination of the steepest descent direction and the
previous update step. Effects of momentum terms of different strength are shown
in Figure 3.5. To understand what this is doing, we consider two regimes. First,
in a region of low curvature of E(w), the gradient will remain approximately
constant over several updates. Solving ∆w = −η(1 − µ)∇w E + µ∆w results
in ∆w = −η∇w E. In a low-curvature regime, the rule does full-size steep-
est descent updates. On the other hand, if E(w) has high curvature, then
∇w E changes erratically. In this case, subsequent ∆wk will tend to cancel each
other out, and the effective learning rate reduces to η(1 − µ)  η or even to
η(1 − µ)/(1 + µ), damping the oscillations. If you use momentum in practice,
you can afford to experiment with larger learning rates. A momentum term is
pretty obligatory with online learning by stochastic gradient descent. Stochas-
tic gradients have erratic behaviour built in, and a momentum term is often
successful in smoothing away much of the noise.
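A small sketch of the momentum update, applied to a hypothetical poorly conditioned quadratic; grad_fn, eta and mu are placeholders to experiment with, and for MLPs grad_fn would be the backpropagation gradient.

import numpy as np

def descent_with_momentum(grad_fn, w0, eta=0.1, mu=0.9, num_iters=200):
    # delta_w_k = -eta * (1 - mu) * grad E(w_k) + mu * delta_w_{k-1}
    w = np.array(w0, dtype=float)
    delta_w = np.zeros_like(w)
    for _ in range(num_iters):
        delta_w = -eta * (1.0 - mu) * grad_fn(w) + mu * delta_w
        w = w + delta_w
    return w

# E(w) = 0.5 * (w_1^2 + 25 w_2^2): steep in one direction, flat in the other.
grad = lambda w: np.array([w[0], 25.0 * w[1]])
print(descent_with_momentum(grad, [5.0, 1.0]))        # ends close to the minimum [0, 0]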

Analysis of Learning with Momentum Term (*)

Here are details for the analysis of momentum learning in the two extreme
regimes just mentioned. In reality, the gradient will have both constant and
oscillatory components, in which case momentum will roughly leave the former
alone, but damp the latter. First, suppose that ∇wk E = ∇w E over many
updates. To see what the updates are doing, we equate ∆wk+1 and ∆wk :

∆w = −η(1 − µ)∇w E + µ∆w ⇒ (1 − µ)∆w = (1 − µ)(−η∇w E),

so that ∆w = −η∇w E: undamped gradient descent. Next, suppose that ∇wk E


is maximally oscillatory: ∇wk E = (−1)k ∇w E. Expanding for two steps, and
using ∇wk+1 E = ∇w E, ∇wk E = −∇w E, we have

∆wk+2 = −η(1 − µ)∇w E + µ (η(1 − µ)∇w E + µ∆wk ) .

Equating ∆wk+2 and ∆wk :


(1 − µ²)∆w = (1 − µ)² (−η∇w E)   ⇒   ∆w = ((1 − µ)/(1 + µ)) (−η∇w E).
Here, (1 − µ)/(1 + µ) ≪ 1, so the oscillations are reduced substantially.
2 In precise terms, the Hessian ∇∇w E has eigenvalues of very different size, or a large condition number.

The Netlab Package

There is a number of free MLP implementations available. In case you use


Matlab (or the free Octave), the author’s favourite is Netlab, which can
be obtained from http://www1.aston.ac.uk/eas/research/groups/ncrg/
resources/netlab/. It consists of a toolbox of functions and scripts based
on the approach and techniques described in [4], but also including more recent
developments in the field. Netlab comes with a textbook [29], which provides a
host of useful advice for MLP-type machine learning in practice.

Testing the Gradient (*)

Training an MLP is tricky enough, due to local minima issues and the fact
that we cannot develop a good “feeling” about what individual parameters in
the network stand for. But there is one point we can (and absolutely should)
ensure: that the backpropagation code for the gradient computation is bug-free.
To keep the presentation simple, we focus on an error function E(w) of a single
weight w ∈ R only. For the general case, the recipe is repeated for each gradient
component (but see below for an idea to save time).
Suppose we are at a point w, and g(w) = E′(w) = dE/dw is the gradient (or derivative) there. Gradient testing works by comparing the result for g(w) with finite differences, computed by evaluating the error function at other points w′ very close to w:

g(w) = E′(w) ≈ g̃_{1;ε}(w) := (E(w + ε) − E(w)) / ε
for a very small ε > 0. This means that we can test our gradient code by
computing g̃1;ε (w) (which costs one forward pass), then inspecting the relative
error

|g(w) − g̃_{1;ε}(w)| / |g(w)|.
Since finite differences and derivatives are not exactly the same, we cannot
expect the error to be zero, but it should be small, in fact of the same order as
ε if g(w) itself is away from zero. It is important to choose ε not too small, say
ε = 10−8 .
There is a more accurate symmetric finite-difference approximation of gradient
components, about twice as costly to evaluate. As debugging is not about speed,
we recommend this latter one:
g(w) = E′(w) ≈ g̃_{2;ε}(w) := (E(w + ε) − E(w − ε)) / (2ε).
This needs two forward passes to compute, instead of just one. Why does this
work better? Let us look at the Taylor expansion of the function E at the current
value w:
E(w ± ε) = E(w) ± E′(w)ε + (1/2) E″(w)ε² + O(ε³).

Notice that we expand up to second order, and that the ε² term does not depend on the sign in ±ε. Subtracting one line from the other:

E(w + ε) − E(w − ε) = 2E′(w)ε + O(ε³)   ⇒   g̃_{2;ε}(w) = E′(w) + O(ε²),

where we divided by 2ε. The error between E′(w) and g̃_{2;ε}(w) is only O(ε²), whereas you can easily confirm that the error between E′(w) and g̃_{1;ε}(w) is O(ε), thus much larger for small ε.
If your network has lots of parameters, it can be tedious to test every gradient component separately. It is far better to run the following randomized test in order to spot problems (which can then be analyzed component by component), and to instead test the gradient at many different points w. The idea is to test
directional derivatives along random directions3 d ∈ Rp , kdk = 1. Suppose we
wish to test the gradient g = ∇w E(w) at the point w. If f (t) := E(w + td),
t ∈ R, the directional derivative of E at w along d is f′(0) = g^T d. Our finite difference approximation is

(f(ε) − f(−ε)) / (2ε) = (E(w + εd) − E(w − εd)) / (2ε).
We monitor the absolute difference between the directional derivative and its
finite difference approximation. The gradient g is tested by doing this compari-
son for maybe 30 random directions. If any substantial differences are detected, we have to track down where the problem comes from. To this end, we can now test
along directions d of our choice. For example, d = δ j will test the j-th gradient
component in isolation.
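The randomized test is only a few lines of code; here is a sketch in which E and grad_E are placeholders for your error function and backpropagation gradient, and the toy usage checks a quadratic for which the symmetric difference is exact.

import numpy as np

def check_gradient(E, grad_E, w, num_dirs=30, eps=1e-6, seed=0):
    # Compare g^T d with (E(w + eps d) - E(w - eps d)) / (2 eps) along random unit directions.
    rng = np.random.default_rng(seed)
    g = grad_E(w)
    for _ in range(num_dirs):
        d = rng.standard_normal(w.shape)
        d /= np.linalg.norm(d)                    # random direction of unit length
        fd = (E(w + eps * d) - E(w - eps * d)) / (2.0 * eps)
        assert abs(g @ d - fd) < 1e-5, "gradient and finite difference disagree"

w = np.linspace(-1.0, 1.0, 5)
check_gradient(lambda v: 0.5 * v @ v, lambda v: v, w)   # E(w) = 0.5 ||w||^2, grad = w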
A final hint. At the debugging stage, it is not important to evaluate gradients
over all your training data. Work on very small batches, or even on single data
points. Also, choose a smaller network architecture, in particular a reduced
number of hidden units. Bugs will show up nevertheless, and you save a lot of
time. On the other hand, it is important to test gradients for a range of different
values for the weights.

3.4.2 Optimization beyond Gradient Descent (*)


In this section, we focus on batch training exclusively, minimization of an error
defined as empirical average over a complete training set or large batches thereof.
Our arguments are not specific to the squared error function, but apply just as
well to all other continuously differentiable error functions we will learn about
during this course.
The simplest method for MLP batch training is gradient descent. Gradients are
computed by error backpropagation, which is efficient compared to lesser alter-
natives, but still constitutes a substantial effort for a large network. Sometimes,
simple algorithms are also good algorithms. This is not the case for gradient
descent. There is a range of numerical optimization methods which supersede
gradient descent in any conceivable sense. For many of them, high-quality imple-
mentations are freely available, and you can simply plug in your backpropagation
code.
Basic numerical optimization is a fascinating topic. Maths like calculus and
linear algebra is brought to life, doing useful things for you. Methods are based
on geometric intuition and just fun to learn about. And we already know that
3 A random direction d is sampled by drawing the components independently from a Gaussian N(0, 1), then normalizing the resulting vector to unit length.
optimization is center stage for machine learning. Unfortunately, treating this


topic in more detail is not in the scope of this course. There are many good
and readable books on optimization. For machine learning purposes, a good
place to start is [4, ch. 7]. The Netlab package (Section 3.4.1) implements4 the
optimizers discussed there and has demos you can download for plug-and-play.
Beyond, we recommend [2, 15, 27, 16] for general nonlinear programming and
[7] for convex optimization.
Our discussion is mainly based on [4, ch. 7]. Consider the unconstrained opti-
mization problem minw E(w), where E(w) is continuously differentiable and
lower bounded (unconstrained means that w ∈ Rp arbitrarily), not necessarily
the squared error function. On the surface, modern gradient-based optimizers
have a similar structure to gradient descent. An iteration from wk proceeds as
follows:

• Evaluate the gradient g_k = ∇_{w_k} E. Determine a search direction d_k, based
on g_k and information gathered in earlier iterations.

• Find w_{k+1} by a line search, approximately minimizing f(η) = E(w_k + η d_k)
for η > 0.

Beyond variants of gradient descent, the most basic algorithm of this form is con-
jugate gradients, designed originally for minimizing quadratic functions. Exam-
ples for quadratic minimization are linear least squares problems (Section 2.4).
Gradient descent is a poor method for minimizing general quadratic functions,
and even a momentum term does not help much in general. In contrast, conju-
gate gradients is the optimal5 method for minimizing quadratics, given that one
gradient is evaluated per iteration. It is easily extended to non-quadratic func-
tions by incorporating a line search. Applied to MLP training, it tends to out-
perform gradient descent dramatically, at the same cost per iteration. A variant
called scaled conjugate gradients avoids line searches for most updates, which
can be expensive for MLPs. Methods advancing on conjugate gradients are all
motivated by approximating the gold-standard method: Newton-Raphson opti-
mization. Recall from high school that an update of this algorithm is based on
approximating E(w) by a quadratic q(w) locally at wk , then minimizing q(w)
as surrogate for E(w). Unfortunately, this needs computing the Hessian ∇∇wk E
and solving a linear system with it, which is out of scope6 of our recipe above.
However, quasi-Newton methods manage to build search directions over sev-
eral updates which share important properties with Newton directions. Among
them, limited memory quasi-Newton methods are most useful for large MLPs,
and are in general widely used for many other machine learning problems as
well. Finally, the Levenberg-Marquardt algorithm is specialized to minimizing
squared error functions, and is maybe the most widely used general technique
4 For serious applications, codes from numerical mathematicians should be preferred, such as packages found on www.netlib.org/.
5 In fact, since minimizing a quadratic is equivalent to solving a linear system with a positive definite matrix (see Section 4.2.2), conjugate gradient is the gold-standard for iterative linear solvers as well.
6 However, note that there are surprisingly efficient methods for computing expressions such as (∇∇_{w_k} E)v, v an arbitrary vector, by a slight extension of error backpropagation [4, ch. 4.10.7]. This would allow for applying truncated Newton algorithms to MLP problems, where each Newton direction is approximated by a conjugate gradient solver.
for this purpose. It approximates Newton directions directly, using a simple
outer product approximation to the Hessian, together with a model trust region
mechanism to control resulting errors by damping.
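Very little glue code is needed to use such optimizers. Here is a sketch of handing an error function and its backpropagation gradient to a limited memory quasi-Newton method, using SciPy's L-BFGS implementation; the names error(w), gradient(w) and num_weights are placeholders for your own code.

# Sketch: plugging an error function and its backpropagation gradient into a
# limited memory quasi-Newton optimizer (L-BFGS, via SciPy).
import numpy as np
from scipy.optimize import minimize

def objective(w):
    return error(w), gradient(w)            # value and gradient at w

w0 = 0.1 * np.random.randn(num_weights)     # small random initial weights
res = minimize(objective, w0, jac=True, method="L-BFGS-B")
w_hat = res.x                               # weights after training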
Chapter 4

Linear Regression. Least Squares Estimation

In this chapter, we introduce linear regression or curve fitting, a problem which


is at least as fundamental for machine learning as classification. A running
example will be the fitting of a polynomial curve to real-valued data. We will
encounter over-fitting for the first time.
The optimization problem behind linear regression is (linear) least squares esti-
mation, which has already been introduced for binary classification in Chapter 2.
In this chapter, we will gain a thorough geometrical intuition about least squares
estimation and learn about algorithms for solving it in practice.

4.1 Linear Regression


Not all of machine learning is classification. As we advance through the course,
we will see that it is not even the most basic problem on which we can build
the others: this role is played by curve fitting, or more generally, representing
and learning real-valued functions. Moreover, since the statistics behind linear
curve fitting is much simpler than for classification, this problem serves as a prime
example for introducing some of the most profound concepts of machine learning
later during the course: generalization, regularization and model selection. In
this section, we introduce the linear regression problem, using the example of
polynomial curve fitting.
In fact, we know linear regression already from Section 2.4, where we used it
for binary classification. As will become fully clear later in the course, this is a
somewhat misguided1 application, so let us start all over.
Consider the data D = {(xi , ti ) | i = 1, . . . , n} in Figure 4.1. The targets
are real-valued here, ti ∈ R. Curve fitting amounts to learning the real-valued
1 This was no cheat. People do use linear regression for classification, mainly because it

is so simple. We will see that classification is much better served by error functions whose
minimization is not much harder to do, whether for linear methods (Chapter 2) or multi-layer
perceptrons (Chapter 3).



Figure 4.1: Two different ways of fitting the data {(xi , ti )}. Left: Piece-wise
linear interpolation. Right: Linear least squares regression.

function x → y represented by this data. What does that mean? For example,
we could just connect neighbouring points by straight lines and be done with it:
piece-wise linear interpolation2 (Figure 4.1, left). However, interpolation is not
what curve fitting is about. Rather, we aim to represent the data as the sum of
two functions: a systematic curve, smooth and simple, plus some random errors,
highly erratic but small. We will refine and clarify this notion in Chapter 8, but
for now this working definition will suffice. The rationale is not to get sidetracked
by errors made during the recording of the dataset.


Figure 4.2: Illustration of squared error criterion for linear regression. Each
triangle area corresponds to an error contribution of (y(xi ) − ti )2 /2.

The simplest curve fitting technique is linear regression (more correctly: linear
regression estimation, but the shorter term is commonly used). We assume a
2 Interpolation differs from curve fitting in that all data points have to be represented

exactly: if y(x) is my interpolant, then y(xi ) = ti for all i.



linear function y(x) = wx + b for the systematic part, then fit it to the data by
minimizing the squared error:
E(w, b) = (1/2) Σ_{i=1}^n (y(x_i) − t_i)² = (1/2) Σ_{i=1}^n (w x_i + b − t_i)².    (4.1)

This problem is known as least squares estimation. Its prediction on our data
above is shown in Figure 4.1, right. The squared error criterion is illustrated
in Figure 4.2. Note the rapid growth with |y(xi ) − ti |: large differences are not
tolerated. We solve this problem by setting the gradient w.r.t. [w, b]T to zero
and solving for w, b. You should do this as an exercise, the solution is given in
Section 4.1.1.

Polynomial Regression Estimation

Figure 4.3: Linear regression estimation with polynomials of degree p − 1, shown for p = 1, 2, 4, 10. The generating curve (green) is sin(2πx), the noise is Gaussian with standard deviation 0.15.

In order to move beyond lines, we adopt an idea already introduced in Sec-


tion 2.2. We define a feature map φ(x) ∈ Rp and employ linear regression with
y(x) = wT φ(x). Beware that some books speak of “generalized linear regres-
sion” in this case, but we do not follow this convention. Our example will be

polynomial regression:

φ(x) = [1, x, . . . , x^{p−1}]^T = [x^{j−1}]_j,    y(x) = w^T φ(x) = w_0 + w_1 x + · · · + w_{p−1} x^{p−1}.
The function y(x) is a polynomial of maximum degree p − 1, whose coefficients
are fit by least squares estimation. The larger the number of features p, the more
complex functions we are able to represent. In fact, if p0 > p, then the class for
p is strictly included in the class for p0 : lines are special cases of quadratic
functions. We might assume that the accuracy of curve fitting improves as p
grows, since we run our method with more and more flexible function classes.
Indeed, if w_*^{(p)} is a minimizer of the squared error E^{(p)}(w^{(p)}) (making the
dimensionality p explicit in the notation for now), we have

E^{(p)}(w_*^{(p)}) = E^{(p')}([(w_*^{(p)})^T, 0, . . . , 0]^T) ≥ E^{(p')}(w_*^{(p')}),    p < p'.

Our fit in terms of squared error can never get worse as p increases, and typi-
cally it improves. In Figure 4.3, we show least squares solutions using different
dimensionalities p. The dataset consists of 10 points, drawn from the smooth
green curve plus random errors. The fit is rather poor for p = 1 (constant) and
p = 2 (line), giving rise to large errors. It improves much for p = 4 (cubic), rep-
resenting the generating curve well. The fit for p = 10 provides a first example
of over-fitting. The fit to the data is even better than for p = 4: the polynomial
is exactly interpolating each data point. But away from the xi locations of the
training set, the prediction behaves highly erratically. It is not at all a useful
representation of the generating curve. We will return to this important issue
in Chapter 7.
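An experiment of this kind is easy to reproduce. The sketch below (Python/NumPy) follows the setup stated in the caption of Figure 4.3 (ten points from sin(2πx) plus Gaussian noise of standard deviation 0.15); everything else, such as the random seed, is an arbitrary choice.

# Sketch: polynomial least squares fits of increasing dimensionality p on
# ten noisy samples of sin(2*pi*x), as in Figure 4.3.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.15 * rng.standard_normal(10)

for p in (1, 2, 4, 10):
    Phi = np.vander(x, p, increasing=True)        # columns 1, x, ..., x^(p-1)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least squares solution
    err = 0.5 * np.sum((Phi @ w - t) ** 2)
    print(p, err)                                 # squared error never grows with p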

4.1.1 Techniques: Solving Univariate Linear Regression


Our aim is to minimize the squared error E(w, b) defined in (4.1). This is a
special case of solving the general normal equation. It is useful as practice and
to get a feeling for the properties of the solution. Denoting yi = y(xi ) = wxi + b,
we have
∂E/∂w = Σ_{i=1}^n (y_i − t_i) x_i,    ∂E/∂b = Σ_{i=1}^n (y_i − t_i).

Defining the empirical expectations

⟨x⟩ = n^{−1} Σ_i x_i,  ⟨x²⟩ = n^{−1} Σ_i x_i²,  ⟨t x⟩ = n^{−1} Σ_i t_i x_i,  ⟨t⟩ = n^{−1} Σ_i t_i,

it is an easy exercise(!) to show that

n^{−1} ∂E/∂w = w⟨x²⟩ + b⟨x⟩ − ⟨t x⟩,    n^{−1} ∂E/∂b = w⟨x⟩ + b − ⟨t⟩.

Setting these equal to zero, and subtracting ⟨x⟩ times the second from the first
equation gives

w(⟨x²⟩ − ⟨x⟩²) = ⟨t x⟩ − ⟨t⟩⟨x⟩.


Now,

⟨x²⟩ − ⟨x⟩² = n^{−1} Σ_{i=1}^n (x_i − ⟨x⟩)² = Var[x],
⟨t x⟩ − ⟨t⟩⟨x⟩ = n^{−1} Σ_{i=1}^n (x_i − ⟨x⟩)(t_i − ⟨t⟩) = Cov[x, t],

as confirmed(!) by multiplying it out (the relations hold in general, see Section 5.1.3). Therefore, the minimizer (w, b) is given by

w = Cov[x, t] / Var[x],    b = ⟨t⟩ − w⟨x⟩.
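In code, this closed-form solution amounts to two lines; below is a minimal NumPy sketch (the function name fit_line is ours).

# Sketch of the closed-form univariate least squares solution derived above.
import numpy as np

def fit_line(x, t):
    x, t = np.asarray(x, float), np.asarray(t, float)
    w = np.cov(x, t, bias=True)[0, 1] / np.var(x)   # Cov[x, t] / Var[x]
    b = t.mean() - w * x.mean()                     # <t> - w <x>
    return w, b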

4.2 Linear Least Squares Estimation


Having motivated linear least squares estimation in order to drive linear re-
gression, let us work out the underlying optimization problem and understand
properties of the solution. Geometrically, this leads to the concept of orthogonal
projection. You will appreciate the foundational nature of this problem as we
move through the course: it is behind concepts like conditional expectation or
the bias-variance decomposition, and serves as general building block of modern
nonlinear optimizers.
We continue where we left off in Section 2.4, only that ti ∈ R here, while ti ∈
{−1, +1} there. Our linear function class is given by y(x) = wT φ(x) (the bias
parameter being part of w as usual, for example w0 in polynomial regression
above). We vectorize everything in terms of target vector t = [ti ] ∈ Rn and
design matrix Φ = [φj (xi )]ij ∈ Rn×p . We assume that n ≥ p, and that Φ has
full rank p (Section 2.4.3). Also, denote by y = [yi ] = [y(xi )] ∈ Rn the vector
of predictions. The squared error is
E(w) = (1/2) Σ_{i=1}^n (y(x_i) − t_i)² = (1/2)‖y − t‖² = (1/2)‖Φw − t‖².

We determined its gradient in Section 2.4.1:


∇_w E = Σ_{i=1}^n (∂E/∂y_i) ∇_w y_i = Σ_{i=1}^n (y_i − t_i) φ(x_i) = Φ^T (y − t) = Φ^T (Φw − t).

Let us set ∇w E = 0:
ΦT Φw = ΦT t. (4.2)
These are the celebrated normal equations: a linear system we need to solve in
order to obtain w. Since Φ has full rank, it is easy to see that ΦT Φ ∈ Rp×p
has full rank as well, therefore is invertible. We can write the solution as
ŵ = argmin_w E(w) = (Φ^T Φ)^{−1} Φ^T t.    (4.3)

Beyond simple techniques like gradient descent, many modern machine learning
algorithms solve the normal equations inside, so it is important to understand
how to do this well. Some advice is given in Section 4.2.3.

As an exercise, you should convince yourself that the solution for univariate
linear regression of Section 4.1.1 is a special case of solving the normal equations.

4.2.1 Geometry of Least Squares Estimation

Figure 4.4: Geometrical interpretation of linear least squares estimation. The


vector of targets t is orthogonally projected onto the model subspace ΦRp ,
resulting in the prediction vector ŷ = Φ ŵ.

The first point to note about (4.3) is that ŵ, and also ŷ = Φ ŵ, are linear
maps of the vector of targets t. Not only the forward mapping from weights
to predictions is linear, but also the inverse, the estimator itself. What type of
linear mapping is t 7→ ŷ? The normal equations read ΦT (t − ŷ) = 0. This means
that the residual vector t − ŷ is orthogonal to the subspace ΦRp of possible
prediction vectors (see Figure 4.4). Therefore,
t = (t − ŷ) + ŷ,    where t − ŷ ⊥ ΦR^p and ŷ ∈ ΦR^p.

This decomposition of t into ŷ ∈ ΦR^p and t − ŷ orthogonal to ΦR^p is unique
(Section 2.1.1): ŷ is the orthogonal projection of t onto the model space ΦR^p.
Try to get a feeling for this result, by drawing in R2 . ŷ is the closest point to t
in ΦRp , so t − ŷ must be orthogonal to this subspace.

4.2.2 Techniques: Orthogonal Projection. Quadratic


Functions
Let us review some general properties about orthogonal projections. Suppose U
is a subspace of Rn (above, U = ΦRp ). Then, any t ∈ Rn can be written uniquely
as t = t_∥ + t_⊥, t_∥ ∈ U, t_⊥ ∈ U^⊥ (orthogonal complement, Section 2.1.1), since
R^n = U ⊕ U^⊥ (direct sum). Then, t_∥ is the orthogonal projection of t onto U.
If the columns of Φ form a basis of U, then t_∥ is determined by

Φ^T (t − t_∥) = 0,    t_∥ = Φw,  w ∈ R^p,

solved by

t_∥ = Φ (Φ^T Φ)^{−1} Φ^T t = M t,    M := Φ (Φ^T Φ)^{−1} Φ^T.

M is the matrix of the orthogonal projection. Its properties are clear from our
geometrical picture: M u = u for u ∈ U, and M v = 0 for v ∈ U ⊥ . In other
words, ker M = U ⊥ , and M has eigenvalues 1 (p dimensions) and 0 (n − p
dimensions) only (see Section 11.1.2).


Figure 4.5: Plot of a positive definite quadratic function.

The linear least squares problem is a special case of minimizing a quadratic
function, a topic which we will require several times during the course. A general
quadratic function is q(x) = (1/2) x^T A x − b^T x + c, where A ∈ R^{d×d}, x, b ∈ R^d.
Convince yourself that we can always replace A by (1/2)(A + A^T) without changing
q(x), so we can assume that A is a symmetric matrix: A^T = A. Our problem is

min_x q(x) = (1/2) x^T A x − b^T x + c.
We want to find the minimum value q∗ = minx q(x) and a minimizer x∗ . This
problem makes sense only if q(x) is lower bounded. If d ≠ 0 is such that
α = d^T A d ≤ 0, then

q(td) = (1/2) t² α − t b^T d + c → −∞    (t → ∞).

For the case α = 0, we pick d so that bT d > 0 (ignoring the special case that
b^T d may be zero). This means that in general3, we require that d^T A d > 0 for
all d ≠ 0. A symmetric matrix with this property is called positive definite. It is
these matrices which give rise to lower bounded quadratics, curving upwards like
a bowl in all directions (Figure 4.5). To solve our problem: ∇x q(x) = Ax − b =
0, so that x∗ = A−1 b. A positive definite matrix A is also invertible (why?),
so x∗ is the unique stationary point, which is a global minimum point by virtue
3 For the meticulous: The precise condition is d^T A d ≥ 0 for all d ≠ 0, and if d^T A d = 0, then b^T d must be zero as well.



of the Hessian ∇∇x q(x) = A being positive definite. Minimizing quadratic


functions is equivalent to solving positive definite linear systems.
For moderate d (up to a few thousand), the best way to solve such a system is
as follows. We use the Cholesky decomposition [42, ch. 6.5] A = LLT , where L
is lower triangular (lij = 0 for i < j) and invertible. This decomposition exists
if and only if A is positive definite, and it is easy to compute in O(d3 ). Then,
 
L(L^T x_*) = b  ⇔  x_* = L^{−T}(L^{−1} b).

Here, C^{−T} is short for (C^{−1})^T = (C^T)^{−1}. Solving two systems instead of one?


The point is that systems with L and LT are easy to solve by backsubstitution,
in O(d2 ). Hints and code pointers are given in Section 4.2.3.
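As a concrete illustration, here is a minimal sketch of this recipe, using SciPy's Cholesky and triangular-solve routines.

# Sketch: minimizing a positive definite quadratic, i.e. solving A x* = b,
# via the Cholesky decomposition A = L L^T and two triangular solves.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def min_quadratic(A, b):
    L = cholesky(A, lower=True)                   # fails if A is not positive definite
    y = solve_triangular(L, b, lower=True)        # forward substitution: L y = b
    return solve_triangular(L.T, y, lower=False)  # backsubstitution: L^T x* = y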

4.2.3 Solving the Normal Equations (*)


How to solve the normal equations (4.2) in practice? There are many bad ways
to do it, and some good ways. In this section, we focus on the best practice,
obtained by numerical mathematicians over a long period. The history and
details can be found in [17]. Recall that n ≥ p and rk Φ = p. The main difficulty
with solving (4.2) is that while ΦT Φ is invertible in exact arithmetic, in practice
it is often just barely so: some of its eigenvalues can be very close to zero, in
other words its condition number (ratio between largest and smallest eigenvalue)
can be very large. For such matrices, numerical roundoff errors can get amplified
dramatically. We will understand the statistical meaning of close-to-singular, or
ill-conditioned ΦT Φ in Chapter 7, where we study mechanisms to improve the
conditioning. For now, assume we really just want to solve (4.2) to best practice.
In essence, there are two different families of solvers: direct and iterative meth-
ods. As a simple rule, you use the former if n and p are not too large, otherwise
you use the latter. “Not too large” depends on your hardware, but you should
certainly use direct methods for n and p up to a few thousand. In contrast, if
n and p are in the tens of thousands or beyond, you will have to go for iter-
ative methods. In a nutshell, direct methods are black-box and safe to use. If
a best practice direct method breaks down, your problem is hopeless. On the
other hand, runtime and memory requirements of a direct method are O(np²)
and O(p²) respectively, which may be prohibitive. Special structure or sparsity
of Φ is often present, but in order to use it to speed things up, you need to
employ an iterative solver.
A warning up front. Maybe the worst way to approach the normal equations is
to compute and invert ΦT Φ, something you should never do even with small n
and p. Matrix inversion is mainly a theoretical concept, it should not be used in
practice due to its inherently poor numerical properties. In the real world, we
use matrix factorizations (such as Cholesky or QR) instead of inversion. In the
case of the normal equations, we should not even compute and factorize ΦT Φ.
Namely, if Φ has a large condition number, the condition number of ΦT Φ is the
square of that, which is much worse. As we will see next, the normal equations
can be solved without ever computing ΦT Φ.
A best practice direct method for solving (4.2) employs the QR decomposition:
Φ = QR, where Q ∈ Rn×p has orthonormal columns (QT Q = I), and R ∈
Rp×p is upper triangular (rij = 0 for i > j) and invertible [23, 42]. The QR
decomposition is computed using Householder’s method. We do not have to
care about this, as we simply use good numerical4 code. Plugging this into
the normal equations and using QT Q = I, you get RT Rw = RT QT t, or
Rw = QT t (as R is invertible). This is solved as

ŵ = R−1 (QT t),

by a simple algorithm called backsubstitution (exercise: derive backsubstitution


for yourself; solution in [42, ch. 2.2]). The cost of the QR decomposition is
O(n p2 ), so you see that p matters more than n. It is not necessary to store Q,
since QT t can be computed alongside the decomposition. Note that ΦT Φ is not
computed.
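In NumPy/SciPy, the whole procedure takes a few lines; a minimal sketch:

# Sketch of the QR route to the least squares solution: Phi = Q R,
# then w_hat = R^{-1} (Q^T t) by backsubstitution. Phi^T Phi is never formed.
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(Phi, t):
    Q, R = np.linalg.qr(Phi, mode="reduced")          # Q: n x p, R: p x p upper triangular
    return solve_triangular(R, Q.T @ t, lower=False)  # backsubstitution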
Let us get some geometrical intuition about the QR decomposition, linking it to
the geometrical picture of Section 4.2.1. We can decompose the squared distance
as
‖Φw − t‖² = ‖Rw − Q^T t‖² + ‖t − QQ^T t‖².    (4.4)

This is because

‖Φw − t‖² = ‖(QRw − QQ^T t) + (QQ^T t − t)‖² = ‖Qv + (QQ^T t − t)‖²,    v = Rw − Q^T t.

Note that ‖Qv‖² = ‖v‖² by the column orthonormality of Q, and

(Qv)^T (QQ^T t − t) = v^T (Q^T t − Q^T t) = 0,

so that the “cross-talk” vanishes (Section 2.1.1). The solution ŵ makes the
first term in (4.4) vanish, and the second is the remaining (minimum) squared
distance. Recalling Section 4.2.1,

ŷ = Φŵ = QRŵ = QQ^T t

is the orthogonal projection of t onto the space ΦR^p, and Q is an orthonormal
basis of this space. The projection matrix is

M = Φ (Φ^T Φ)^{−1} Φ^T = QQ^T.

A discussion of iterative solvers is beyond the scope of this course. For systems
of the form (4.2), with ΦT Φ symmetric positive definite, the most commonly
used method is conjugate gradients [17]. I can recommend the intuitive deriva-
tion in [4, ch. 7.7]. However, specifically for the normal equations, the LSQR
algorithm [30] is to be preferred (code at http://www.stanford.edu/group/
SOL/software/lsqr.html).

4 Good numerical code is found on www.netlib.org. The standard package for matrix de-

compositions (QR, Cholesky, up to the singular value decomposition) is LAPACK. This code
is wrapped in Matlab, Octave, Python, or the GNU scientific library. In contrast, “Numerical
recipes for X” is not recommended.
Chapter 5

Probability. Decision Theory

In this chapter, we refresh our knowledge about probability, the language and
basic calculus behind decision making and machine learning. We also introduce
concepts from decision theory, setting the stage for further developments. Rising
from technical details like binary classifiers, linear functions, or numerical
optimizers up towards the big picture, we will understand how decision making
works in the optimal case, and which probabilistic aspects of a problem are
relevant. We will see that Bayesian computations, or inference, are the basis for
optimal decision making. In later chapters, we will see that learning is based on
inference as well.

5.1 Essential Probability


Probability is the language of modern machine learning. The most important
concept you need when you want to think about, formalize and automate learn-
ing and decision making, is uncertainty, and probability is the calculus of uncer-
tainty. Pierre Simon Laplace, a founding father of probability theory: “Probabil-
ity is nothing but common sense reduced to calculation.” When humans think,
act, speak, or reason, they use probabilistic statements all the time (“the chance
of rain tomorrow is 20%”, “it is more likely that Switzerland wins the next world
cup than Austria”). There are a range of books which lucidly explain why valid
reasoning needs probability [31]. Machine learning needs decision making in the
presence of uncertainty about the data (measurement errors, outliers, missing
cases), the parameters, and even the model itself. If we do not understand why
and how probability ties all these concepts together into a consistent whole, we
may end up chasing algorithms, estimators, loss functions, hypothesis classes,
weighting or voting schemes, theoretical assumptions and “learning paradigms”
forever, never to see the wood for all the trees.
This course is about machine learning, but let us briefly step out of this context,
so that you do not get the wrong impression. The importance of probability,


understanding and quantifying uncertainty and risk, far transcends machine


learning, statistics, and even scientific applications. There is simply no other
rational way to make sense of the data gathered today which is not based on
these concepts. We are bombarded with a growing stream of ever-more sen-
sational and poorly researched news. In order to maintain a half-way consis-
tent world view about important topics, we have to understand the context
in which these numbers live, and that is provided by probability and statis-
tics. The most interesting jobs at cool companies (such as your favourite search
engine) go to those with a profound understanding of probability and algo-
rithms. Statistics and probability is used to decipher speech, decode communi-
cations, route data packets, predict prices, understand images, make robots
explore their environment and translate natural language texts. Probability
is useful but also great fun. http://understandinguncertainty.org/ is a
website dedicated to understanding probability. If you have the impression
that statistics is the boring pastime of old men in tweeds, watch this movie
http://www.gapminder.org/videos/the-joy-of-stats/ and think again.
We cannot give more than a slight reminder of probability in this section, infor-
mal and on a very elementary level. This is not enough to see what it is all about
and to enjoy it. The role that probability plays for machine learning should mo-
tivate you to refresh your knowledge. There are many good elementary books
on this topic, for example [19, 18, 8]. The classics remain the books by Feller
[13, 14].

Fruit \ Box     red     blue
apple           1/10    9/20
orange          3/10    3/20

Figure 5.1: A simple experiment to introduce probability concepts. There are
two Boxes, a red and a blue one. Each contains a certain number of Fruit, namely
apples and oranges (the apples are the green balls). One of the boxes is picked
with probability P(B), then a fruit is drawn out of it at random, according to
P(F|B).

We will use an example loosely based on [5, ch. 1.2]. We draw a fruit out of a box,
as illustrated and explained in Figure 5.1. When thinking about a probabilistic
setup, the following concepts have to be defined:

• The probability (or sample) space Ω: the set of all possible outcomes. In
our example, (F, B) can take values Ω = {(a, r), (o, r), (a, b), (o, b)}.

• The joint probability distribution P over the probability space. This is a


nonnegative function on Ω which sums to one over the whole space:
P(ω) ≥ 0,  ω ∈ Ω,    Σ_{ω∈Ω} P(ω) = 1.

In our example, P (ω) = P (F, B) is given by the table in Figure 5.1.

Figure 5.2: Illustration of events A, B within a probability space Ω. The joint


event A ∩ B is the intersection of A and B, containing all outcomes ω which lie
both in A and in B.

At least for a finite sample space Ω, we can picture probabilities as sizes of


subsets A ⊂ Ω, called events. In Figure 5.2, we illustrate two events A, B. Their
intersection is the joint event A ∩ B, the set of all outcomes both in A and
in B. For example, if A = {F = a} = {(a, r), (a, b)} and B = {B = r} =
{(a, r), (o, r)}, then A ∩ B = {F = a, B = r} = {(a, r)}. The probability of an
event E is obtained by summing P (ω) over all ω ∈ E:
P(E) = Σ_{ω∈E} P(ω).

For example,

P ({F = a}) = P ((a, r)) + P ((a, b)) = 1/10 + 9/20 = 11/20.

Note that both Ω and ∅ (the empty set) are events, with P (Ω) = 1 and P (∅) =
0 for any joint probability distribution P . Many rules of probability can be
understood by drawing Venn diagrams such as Figure 5.2. For example,

P ({A or B}) = P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

For most probability experiments, the relevant events are conveniently described
by random variables. Formally, a random variable is a function from Ω into some
set, for example R, {0, 1}, or {a, o}. This notion is so natural that we have chosen
it implicitly in order to define the example in Figure 5.1: F is a random variable
mapping to {a, o}, and B is a random variable mapping to {r, b}. We can define
random variables in terms of others. For example I{F =a or B=b} is a random
variable mapping into {0, 1}. It is equal to zero for the outcome ω = (o, r),
equal to one otherwise.
The joint probability distribution, P (F, B), is a complete description of the ex-
periment. All other distributions are derived from it. First, there are marginal

distributions. Suppose we want to predict which fruit will be drawn with which
probability, no matter what the box. Every time you hear “no matter what B”,
you translate “sum over all possible values of B”. The marginal distribution
P(F) is

P(F) = Σ_{B=r,b} P(F, B),    P(F = a) = 11/20,  P(F = o) = 9/20.

P (F ) is a probability distribution just like P (F, B) (confirm this for yourself).


It can also be understood in terms of events and Venn diagrams (Figure 5.2).
For our example above, A = {F = a} is the union of the disjoint events {F =
a, B = r} = A ∩ B and {F = a, B ≠ r} = A ∩ (Ω \ B).
Second, there are conditional distributions. Given that I picked the red box
(B = r), which fruit will I get? This is P (F |B = r), the conditional distribution
of F , given that B = r. There are three times as many oranges as apples in
the red box, so P (F = o|B = r) = 3/4, P (F = a|B = r) = 1/4. Here is a rule
linking all these probabilities. I pick (F, B) according to P (F, B) by first picking
B according to P (B), then F according to P (F |B):

P (F, B) = P (F |B)P (B).

If you are given a joint distribution P (F, B), you can first work out the marginals
P (B) and P (F ) by summing out over the other variable respectively, then obtain
P(F|B) = P(F, B) / P(B),  P(B) ≠ 0,    P(B|F) = P(F, B) / P(F),  P(F) ≠ 0.
For our Venn diagram example,
P(A|B) = P({F = a}|{B = r}) = P({F = a, B = r}) / P({B = r}) = P(A ∩ B) / P(B).
Given that we are in B, the probability of being in A is obtained by the fractional
size of the intersection A ∩ B within B.
Note that we speak about the conditional distribution P (F |B), even though it
is really a set of distributions, P (F |B = r) and P (F |B = b). Storing it in a
table needs as much space as the joint distribution. Note that if, for a different
experiment, P (B = r) was zero, then P (F |B = r) would remain undefined: it
does not make sense to reason about F given B = r if B = r cannot happen.
To sum up, you only really need two rules:

• Sum rule: If P (X, Y ) is a joint distribution, the marginal distribution


P (X) is obtained by summing Y over all possible values:
P(X) = Σ_Y P(X, Y).

The interpretation of summing (or marginalization) over Y is disinterest


in its value. What will the fruit be, no matter what the box?
• Product rule: If P (X, Y ) is a joint distribution with conditional distri-
bution P (X|Y ) and marginal distribution P (Y ), then

P (X, Y ) = P (X|Y )P (Y ).

This rule is the basis for factorizations of probability distributions into


simpler components.

These rules formalize two basic operations which drive probabilistic reasoning.
Whenever a variable Y is observed, so we know its value (for example, it is a
case in a dataset), we condition on it, consider conditional distributions given
Y . Whenever a variable X, whose precise value is uncertain, is not currently of
interest (not the aim of prediction), we marginalize over it, sum distributions
over all values of X, thereby eliminate it.
There is no sense in which the variables F and B are ordered, we could write
P (B, F ) instead of P (F, B). After all, the intersection of events (sets) is not
ordered either. In fact, in derivations, we will sometimes simply use P (B, F ) =
P (F, B): the argument names matter, not their ranking. Given that, let us apply
the product rule in both orderings:

P(F|B)P(B) = P(F, B) = P(B, F) = P(B|F)P(F)  ⇒  P(B|F) = P(F|B)P(B) / P(F).

This is Bayes’ formula (or Bayes’ rule, or Bayes’ theorem). This looks harmless,
but substitute “cause” for B, “effect” for F, and Bayes’ formula provides the
ultimate solution for the inverse problem of reasoning. Suppose you show me
the fruit you got: F = o (an orange). What can I say about the box you picked?

P(B = r|F = o) = P(F = o|B = r)P(B = r) / P(F = o) = (3/4 · 2/5) / (9/20) = 2/3,
P(B = b|F = o) = 1 − P(B = r|F = o) = 1/3.

It is twice as likely that your box was the red one. One point should become
clear even from this simple example. The fact that events happen at random
and we have to be uncertain about them, does not mean that they are entirely
unpredictable. It just means that we might not be able to predict them with
complete certainty. Most events around you are uncertain to some degree, almost
none are completely unpredictable. The key to decision making from uncertain
knowledge is to quantify your probabilistic belief1 in dependent variables, so that
if you observe some of them, you can predict others by probability computations.
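All computations in this example can be checked with a few lines of NumPy; a minimal sketch operating on the joint table of Figure 5.1:

# Sketch: sum rule, product rule and Bayes' formula on the joint table of
# Figure 5.1. Rows index the fruit (apple, orange), columns the box (red, blue).
import numpy as np

P_FB = np.array([[1/10, 9/20],
                 [3/10, 3/20]])

P_F = P_FB.sum(axis=1)              # marginal P(F) = [11/20, 9/20]
P_B = P_FB.sum(axis=0)              # marginal P(B) = [2/5, 3/5]
P_F_given_B = P_FB / P_B            # conditionals P(F|B), one column per box
P_B_given_F = P_FB / P_F[:, None]   # Bayes' formula: P(B|F), one row per fruit
print(P_B_given_F[1])               # P(B|F=o) = [2/3, 1/3]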

5.1.1 Independence. Conditional Independence


Let us consider two random variables X, Y with joint distribution P (X, Y ).
Suppose we observe X. Does this tell us anything new about Y ? Our knowledge
about Y without observing X is P (Y ). Knowing X, this becomes P (Y |X). X
contains no information about Y if and only if

P (Y |X) = P (Y ) ⇔ P (X, Y ) = P (X)P (Y ).

Such variables are independent: the joint distribution is the product of the
marginals. Equivalently, the events A and B are independent if P (A|B) = P (A),
1 Belief, or “subjective probability”, is another word for probability distribution.

or P (A ∩ B) = P (A)P (B). For example, if we pick a fruit Fr from the red box
and a fruit Fb from the blue box, then Fr and Fb are independent random
variables. Within a machine learning problem, independent variables are good
and bad. They are good, because we can treat them separately, thereby sav-
ing computing time and memory. Independence can simplify derivations a lot.
They are bad, because we only learn things from dependent variables. Inde-
pendence is often too strong a concept, we need something weaker. Consider
three random variables X, Y , Z with joint distribution P (X, Y, Z). X and Y
are conditionally independent, given Z if they are independent under the con-
ditional P (X, Y |Z), namely if P (X, Y |Z) = P (X|Z)P (Y |Z). Conditional inde-
pendence constraints are useful for learning. In general, X, Y , Z are still all
dependent, but the conditional independence structure can be used to simplify
derivations and computations. In Figure 5.1, draw a box B, then two fruits
F1 , F2 at random from B (with replacement: you put the fruits back). Then,
P(F1, F2|B) = P(F1|B)P(F2|B), but P(F1, F2) ≠ P(F1)P(F2) (check for yourself
that P(F2 = o|F1 = o) = 7/12 ≠ P(F2 = o) = 9/20). Note that it can go
the other way around. Throw two dice X, Y ∈ {1, . . . , 6} independently. If I tell
you that Z = X + Y = 7, then X and Y are dependent given Z, but they are
independent as such.
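The numbers of the two-fruit example can be verified by reusing the arrays from the previous sketch:

# Sketch: F1 and F2 are conditionally independent given B, but dependent
# marginally. P(F2|F1) = sum_B P(F2|B) P(B|F1), reusing P_F_given_B, P_B_given_F.
P_F2_given_F1o = P_F_given_B @ P_B_given_F[1]   # distribution of F2 given F1 = o
print(P_F2_given_F1o[1])                        # 7/12, whereas P(F2 = o) = 9/20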

Figure 5.3: From a discrete distribution (histogram) on a finer and finer grid to
a continuous probability density.

5.1.2 Probability Densities


Recall xi ∈ Rd , w ∈ Rp . We need probability distributions over continuous
variables as well. The probability space Ω is a continuous (overcountably) in-
finite set, and P (·) is a probability measure. Sums over ω ∈ Ω do not make
sense anymore, but from calculus we know how they are replaced by integrals.
Consider a random variable x ∈ R defined on such a space. Imagine R being
gridded up into cells of size ∆x each. The interval [x, x + ∆x) is an event with
probability P ([x, x + ∆x)). The key idea is to relate this probability “volume”
(the probability of x landing in the interval) to the volume ∆x of the interval

itself:
p(x) = lim_{∆x→0} P([x, x + ∆x)) / ∆x.
Under technical conditions on P (·) and the random variable, this limit exists
for all x ∈ R: it is called the probability density of the random variable x. Just
as for P , we have that
p(x) ≥ 0,  x ∈ R,    ∫ p(x) dx = 1.

A “graphical proof” for the latter is given in Figure 5.3. Careful: the value p(x)
of a density at some x ∈ R can be larger than 1, in fact some densities are
unbounded2 above. The density allows us to compute probabilities over x, for
example:
P({x ∈ [a, b]}) = ∫_a^b p(x) dx = ∫ I_{a≤x≤b} p(x) dx,    P(E) = ∫ I_{x∈E} p(x) dx.
The cumulative distribution function is
F(x) = P({x ∈ (−∞, x]}) = ∫_{−∞}^x p(t) dt  ⇒  p(x) = F'(x).

In this course, what probability distributions are for discrete variables, prob-
ability densities are for continuous variables. Sum and product rule work for
densities as well:
p(x) = ∫ p(x, y) dy,    p(x, y) = p(x|y)p(y),

and so does Bayes rule, always replacing sums by integrals. At this point, you
will start to appreciate random variables. If x and y are random variables, so
are ex and x + y, and computing probabilities for the latter is merely an exercise
in integration.
This is about the level of formality we need in this course. However, you should
be aware that we are avoiding some difficulties here which do in fact become
relevant in a number of branches of machine learning, notably nonparametric
statistics and learning theory. For example, it is not possible to construct a
probability space on R so that every subset is an event. We have to restrict
ourselves to measurable (or Borel) sets. Also, not every function is allowed as
random variable, and not every distribution of a random variable has a prob-
ability density (a simple example: the constant variable x = 0 does not have
a density). Things also become difficult if we consider an infinite number of
random variables at the same time, for example in order to make statements
about limits of random variable sequences. None of these issues are in the scope
of this course. Good general expositions are given in [3, 18].

5.1.3 Expectations. Mean and Covariance


Let x ∈ R be a random variable. The expectation (or mean, or expected value)
of x is

E[x] = Σ_x x P(x)
2 An example is the gamma density p(x) = π^{−1/2} x^{−1/2} e^{−x} I_{x>0}.

if x is discrete with distribution P (x),


E[x] = ∫ x p(x) dx

if x is continuous with density p(x) (always under the assumption that sum or
integral exists). Note that our definition covers extensions such as
E[f(x)] = ∫ f(x) p(x) dx,

since f (x) is a random variable if x is one. If x ∈ Rd is a random vector, then

E[x] = [E[xj ]] ∈ Rd .

Expectation is linear: if x, y are random variables, α ∈ R a constant, then

E[x + αy] = E[x] + αE[y].

If x and y are independent, then

E[xy] = E[x]E[y].

However, the reverse3 is not true in general. Moreover, the conditional expecta-
tion is

E[x | y] = Σ_x x P(x|y)

or

E[x | y] = ∫ x p(x|y) dx.

Note that in general, E[x | y] is itself a random variable (a function of y, which is


random), although we can also consider expectations conditioned on an event,
for example

E[x | y = y_0] = Σ_x x P(x|y = y_0),

which are simply numbers.

Variance and Covariance

Picture the mean as value around which a random variable is fluctuating. By


how much? The variance gives a good idea:

Var[x] = E[(x − E[x])²],

the expected squared distance of x from its mean. Another formula for the
variance is

Var[x] = E[x² − 2xE[x] + E[x]²] = E[x²] − E[x]².


 

3 If E[f (x)g(y)] = E[f (x)]E[g(y)] for every pair of (measurable) functions f , g, then x and

y are independent.

Again,
Var[x | y] = E[(x − E[x | y])² | y] = E[x² | y] − E[x | y]².

Mean and variance are examples of moments of a distribution. The covariance


between random variables x, y ∈ R is
Cov[x, y] = E [(x − E[x])(y − E[y])] = E [xy − E[x]y − xE[y] + E[x]E[y]]
= E[xy] − E[x]E[y].
Note that

|Cov[x, y]| ≤ √(Var[x] Var[y]),
a special case of the Cauchy-Schwarz inequality (Section 2.3). Note that
Var[x + y] = Var[x] + 2Cov[x, y] + Var[y].
Namely, if x' = x − E[x], y' = y − E[y], then

Var[x + y] = E[(x' + y')²] = E[(x')²] + 2E[x'y'] + E[(y')²].
 

Therefore, Var[x+y] = Var[x]+Var[y] if x and y are uncorrelated, which holds in


particular if they are independent. More generally, if x ∈ R^d is a random vector,
we collect all Cov[x_j, x_k] in the covariance matrix

Cov[x] = [Cov[x_j, x_k]] = E[(x − E[x])(x − E[x])^T] = E[xx^T] − E[x]E[x]^T.

More generally, the cross-covariance matrix between x ∈ R^d and y ∈ R^q is

Cov[x, y] = E[(x − E[x])(y − E[y])^T] = E[xy^T] − E[x]E[y]^T.

Mean and Covariance after Linear Transformation

If x ∈ Rd is a random vector with mean E[x], covariance matrix Cov[x], what


are the corresponding moments of y = Ax, where A ∈ Rq×d ? By linearity,
E[Ax] = AE[x].
Also,

Cov[Ax] = E[Ax(Ax)^T] − E[Ax]E[Ax]^T
        = A E[xx^T] A^T − A E[x]E[x]^T A^T = A Cov[x] A^T,

left-multiplication by A, right-multiplication by A^T.
Given a dataset D = {xi | i = 1, . . . , n}, empirical mean and empirical covari-
ance (or sample mean, sample covariance) are computed as
µ̂ = (1/n) Σ_{i=1}^n x_i,    Σ̂ = (1/n) Σ_{i=1}^n x_i x_i^T − µ̂ µ̂^T.

If the data points are drawn independently from a distribution with mean E[x],
covariance Cov[x], then µ̂ and Σ̂ converge4 to E[x] and Cov[x] as n → ∞.
We will gain a more precise idea about these estimators in Chapter 6.
4 Convergence happens almost surely (with probability one), a notion of stochastic conver-

gence. This result is called law of large numbers.
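Both the empirical estimators and the transformation rule above are easy to check numerically; a small sketch (sample size and dimensions chosen arbitrarily):

# Sketch: empirical mean/covariance, and a numerical check of
# Cov[Ax] = A Cov[x] A^T on simulated data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100000, 3))                  # n samples of x in R^3
mu_hat = X.mean(axis=0)
Sigma_hat = (X.T @ X) / X.shape[0] - np.outer(mu_hat, mu_hat)

A = rng.standard_normal((2, 3))
Sigma_y = np.cov((X @ A.T).T, bias=True)              # empirical Cov[Ax]
print(np.max(np.abs(Sigma_y - A @ Sigma_hat @ A.T)))  # close to zero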



Finally, be aware that not all continuous random variables with a density have a
variance, some do not even have a mean, since corresponding integrals diverge.
For example, the Cauchy distribution with density

p(x) = 1 / (π(1 + (x − x_0)²))

does not have mean or variance. The reason for this is that too much proba-
bility mass resides in the tails (−∞, −x0 ] ∪ [x0 , ∞), x0 large. Such heavy-tailed
distributions become important in modern statistics and machine learning, as
good models for data with outliers. They can be challenging to work with.

5.2 Decision Theory


Assessing, manipulating and learning about probabilities is a means towards an
end. That end is expressed in decision theory. We want to act optimally, make
optimal decisions in the presence of uncertain knowledge. This is a two-stage
process. First, we condition on data in order to resolve uncertainty by way of
probability calculus. Second, we take a decision which minimizes expected loss.
Consider an example. In a hospital, a tissue sample is taken from a patient,
giving rise to an input vector x. An automatic classifier f (x) is to predict
whether the patient has cancer (t = 1) or not (t = 0). How can we calibrate this
procedure in terms of familiar units?
In order to develop decision theory, it is customary to make the assumption that
the true probabilistic law between relevant variables (x and t in our screening
example) is known exactly. First, this allows us to talk quantitatively about the
best possible solution to a statistical problem. Optimal solutions can be worked
out for simple setups, and they will provide additional motivation for common
model assumptions. Second, a decision-theoretic analysis can clarify what are
the most important aspects about a problem, and we can concentrate modelling
and learning efforts on those. Some vocabulary:

• Class-conditional distribution/density p(x|t): The distribution of inputs


x, given class label t. In our example, p(x|t = 0) is the distribution of
tissue sample vectors for healthy, p(x|t = 1) for cancerous patients. Be-
ware that the class-conditional distribution is termed “likelihood” in some
books (this becomes clear in Section 6.2), but this nomenclature mixes up
concepts and will not be used in this course.

• Class prior probability distribution P (t): The distribution of the class label
t on its own. What is the fraction of healthy versus cancerous patients to
be exposed to our screening procedure?

• Class posterior probability distribution P (t|x): Obtained from the class-


conditional and prior probabilities by Bayes’ rule:

P(t|x) = p(x|t)P(t) / p(x),    p(x) = Σ_t p(x|t)P(t).    (5.1)

The same definition applies to a multi-way classification problem, where t ∈


{0, . . . , K − 1}. Given our “apples and oranges” intuition about probability, we
would proceed as follows. The setup is defined by the joint probability density
p(x, t) = p(x|t)P (t). When a patient comes in, we have to predict t from x, our
(un)certainty about which is quantified by the posterior P (t|x), so this would
be the basis for our prediction.

5.2.1 Minimizing Classification Error

A natural goal in classification is to commit as few errors as possible, in other


words to choose a classification rule f (x) such that the error probability

R(f) = P{f(x) ≠ t}

is as small as possible. How does such an optimal classifier look like? Consider
the example in Figure 5.4, where x ∈ R and t ∈ {1, 2}.


Figure 5.4: Joint distributions p(x, Ct ) = p(x|t)P (t) for a binary classification
problem with x ∈ R, T = {1, 2}. The classifier f (x) = 1 + I{x≥x̂} has decision
regions H2 = [x̂, ∞) and H1 = (−∞, x̂) (called R2 and R1 in the figure). Its
error probability R(f ) is visualized as the combined area of the blue, green and
red regions. The blue region stands for points x from class 1 being classified as
f (x) = 2, while the union of green and red regions stands for points x from class
2 being classified as f (x) = 1. If we move the decision threshold away from x̂,
the combined area of green and blue regions stays the same. It symbolizes the
unavoidable Bayes error R∗ . On the other hand, we can reduce the red area to
zero by setting x̂ = x0 , the point where the joint density functions intersect.
f ∗ (x) = 1 + I{x≥x0 } is the Bayes-optimal rule.
Figure from [5] (used with permission).

It becomes clear from this example that the error R(f ) can be decomposed into
several parts, some of which are intrinsic, while others can be avoided by a better
choice of f . From the figure, it seems that the optimal classifier f ∗ follows the
lower of the two curves p(x|t)P (t), t = 1, 2, and its error R∗ is the area under

the curve min{p(x, t = 1), p(x, t = 2)}, the best we can do. In particular, this
minimum achievable error is not zero.
To cement our intuition, denote the space of label values by T : T = {0, 1} for
our cancer screening example, or T = {0, . . . , K − 1} for K-way classification.
A classifier f (x) is characterized by its decision regions
H_t = {x | f(x) = t}.

For example, recall from Section 2.2 that the decision regions for a binary linear
classifier are half-spaces in feature space. First,
 
P{f(x) ≠ t} = E[I_{f(x)≠t}],

recalling that I{A} = 1 if A is true, I{A} = 0 otherwise. The expectation is just


a weighted sum/integral over the joint probability. The product rule provides
the factorizations p(x, t) = p(x|t)P (t) = P (t|x)p(x), so we can express R(f ) in
two different ways. First,
R(f) = Σ_{k∈T} P(t = k) ∫ I_{f(x)≠k} p(x|t = k) dx    (the integral being R(f | t = k))
     = 1 − Σ_{k∈T} P(t = k) ∫_{H_k} p(x|t = k) dx.

The total error is the expectation of the class-conditional errors R(f |t = k) over
the class prior P (t). This is useful in order to understand the composition of
the error. Test your understanding by deriving the second equation (note that
1 − R(f ) is the probability of getting it right). Second,
R(f) = ∫ p(x) ( Σ_{k∈T} I_{f(x)≠k} P(t = k|x) ) dx
     = ∫ p(x) (1 − P(t = f(x)|x)) dx.

In order to minimize R(f ), we should minimize 1 − P (t = f (x)|x) for every x.


The best possible classifier f ∗ (x), called Bayes-optimal classifier, is given by

f*(x) = argmax_{t∈T} P(t|x) = argmax_{t∈T} p(x|t)P(t).

The second equation is due to the fact that in the definition (5.1) of the posterior
P (t|x), the denominator p(x) does not depend on t. In Figure 5.4, we plot
p(x|t = k)P (t = k) for two classes. The Bayes-optimal classifier picks t = k
whenever the curve for k lies above the curve for all other classes. Its decision
regions are
H_k^* = {x | p(x|t = k)P(t = k) > p(x|t = k')P(t = k'),  ∀k' ∈ T \ {k}}.

The probability of error for the Bayes-optimal classifier is called Bayes error:
R* = R(f*) = ∫ p(x) (1 − max_{k∈T} P(t = k|x)) dx = 1 − E[max_{k∈T} P(t = k|x)].

In the binary case, 1 − max_{t∈T} P(t|x) = min_{t∈T} P(t|x), so that

R* = E[min_{k=0,1} P(t = k|x)] = ∫ min_{k=0,1} P(t = k|x) p(x) dx.

In Figure 5.4, the Bayes error R∗ is the area under the curve min{p(x, t =
1), p(x, t = 2)}.

5.2.2 Discriminant Functions


We saw that the Bayes-optimal classifier compares joint probabilities p(x|t)P (t)
and decides for the largest. It is most convenient to describe this procedure by
way of discriminant functions. Consider binary classification, t ∈ T = {0, 1}.
The optimal rule decides for t = 1 if

p(x|t = 1)P(t = 1) / (p(x|t = 0)P(t = 0)) = (p(x|t = 1)/p(x|t = 0)) · (P(t = 1)/P(t = 0)) > 1.

A product, thresholded at 1? Recall from Chapter 2 that we prefer sums which


are thresholded at 0. Let us take the logarithm5 , a strictly increasing function:

y*(x) = log(p(x|t = 1)/p(x|t = 0)) + log(P(t = 1)/P(t = 0)) > 0.

y ∗ (x) is a Bayes-optimal discriminant function, in that thresholding it at zero


provides a Bayes-optimal classifier: f ∗ (x) = I{y∗ (x)>0} . To relate this to Chap-
ter 2, note that if our class labels were −1, +1 instead of 0, 1, the relationship
would be f ∗ (x) = sgn(y ∗ (x)).
How about K-way classification, K > 2? In this case, the optimal classifier picks
the maximum among K functions

yk∗ (x) = log p(x|t = k) + log P (t = k), k = 0, . . . , K − 1,

in that f ∗ (x) = argmaxt∈T yt∗ (x). As in the binary case, we could get by with
K − 1 functions, say yk∗ (x) − y0∗ (x), k = 1, . . . , K − 1. However, this singles out
one class (t = 0) arbitrarily and creates more problems than it solves, so usually
K discriminant functions are employed.

5.2.3 Example: Class-conditional Cauchy Distributions


Let us work out a binary classification example. We have a uniform class prior
P (t = 0) = P (t = 1) = 1/2 and class-conditional Cauchy densities
p(x|t) = (1/(πb)) · 1/(1 + ((x − a_t)/b)²),    b > 0.

The setup is illustrated in Figure 5.5. Recall from Section 5.1.3 that the Cauchy
distribution is peculiar in that mean and variance do not exist. If you sample
5 It does not matter to which base the logarithm is taken, as long as we keep consistent. In this course, we will use the natural logarithm to base e.




Figure 5.5: Bayes-optimal classifier and Bayes error for two class-conditional
Cauchy distributions, centered at a0 and a1 . The optimal rule thresholds at the
midpoint a = (a0 + a1 )/2. Since the class prior is P (t = 0) = P (t = 1) = 1/2,
the Bayes error R* is twice the yellow area. The right plot shows R* as a function of
separation parameter ∆. The slow decay of R∗ is due to the very heavy tails of
the Cauchy distributions.

values from it, you may encounter very large values occasionally. Assume a1 > a0
and define the midpoint a = (a0 + a1 )/2. The Bayes-optimal rule is obvious by
symmetry: f ∗ (x) = I{x>a} . Moreover,

R* = (1/2) Σ_{k=0,1} ∫_{H*_{1−k}} p(x|t = k) dx.

By symmetry, the two summands are the same, so


R* = ∫_{−∞}^a (1/(πb)) · 1/(1 + ((x − a_1)/b)²) dx.

Substitute y = (x − a1 )/b, and denote ∆ = (a1 − a0 )/(2b):


R* = (1/π) ∫_{−∞}^{−∆} dy/(1 + y²) = (1/π)(arctan(−∆) + π/2) = 1/2 − arctan(∆)/π.

The Bayes error R∗ is a function of the separation ∆ of the classes (Figure 5.5,
right). For ∆ = 0, the class-conditional densities are the same, so that R∗ = 1/2.
Also, R∗ → 0 as ∆ → ∞. However, the decay is very slow, which is a direct
consequence of the heavy tails of the p(x|t). Any x > a is classified as f ∗ (x) = 1,
but even an x ≫ a could still come from p(x|t = 0) whose probability mass far
to the right is considerable.
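The closed-form Bayes error is easy to check against numerical integration; a small sketch (the values of a_0, a_1, b are arbitrary):

# Sketch: numerical check of R* = 1/2 - arctan(Delta)/pi for the Cauchy example.
import numpy as np
from scipy.integrate import quad

a0, a1, b = 0.0, 2.0, 0.5
a = 0.5 * (a0 + a1)                      # decision threshold (midpoint)
delta = (a1 - a0) / (2 * b)

p1 = lambda x: 1.0 / (np.pi * b * (1.0 + ((x - a1) / b) ** 2))  # p(x|t=1)
numeric, _ = quad(p1, -np.inf, a)        # mass of class 1 on the wrong side
print(numeric, 0.5 - np.arctan(delta) / np.pi)                  # the two agree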

5.2.4 Loss Functions. Minimizing Risk


Our cancer screening procedure is to be implemented at the local hospital.
Knowing about decision theory, we determine the Bayes-optimal classifier f ∗ (x).

Based on a tissue sample x, if f ∗ (x) = 1 we alert a human medical doctor. If


f ∗ (x) = 0, we send the patient home. Optimal.
Wait a moment! If t = 0 (no cancer), but f ∗ (x) = 1, we waste the time of a
doctor and some money for additional tests. If t = 1 and f ∗ (x) = 0, a human
being will die for not being treated on time. Even the most cynical health care
reformer will agree that the losses are hugely imbalanced, yet our Bayes-optimal
rule treats them exactly the same. Fortunately, this problem is simple to address.
Along with other specifications (such as the attributes of x), we have to assess
a loss function L(y, t) from T × T → R. If our classifier predicts f (x), we incur
a loss of L(f (x), t), t the true label. In our cancer screening example, we could
choose
L(0, 0) = L(1, 1) = 0,    L(1, 0) = 1,    L(0, 1) = λ ≫ 1.
There is no loss for getting it right. Calling an ultimately unnecessary checkup
costs 1, while a misdiagnosis costs λ. We can now calibrate λ to our needs. The
goal is to minimize expected loss, called risk:

R(f ) = E [L(f (x), t)] .

The Bayes-optimal classifier f ∗ (x) under loss L(y, t) minimizes the risk, and
R∗ = R(f ∗ ) is called Bayes risk. Since
R(f) = ∫ p(x) ( Σ_{k∈T} L(f(x), k) P(t = k|x) ) dx,

the Bayes-optimal classifier is given by


f*(x) = argmin_{j∈T} Σ_{k∈T} L(j, k) P(t = k|x),

its Bayes risk is

R* = E[ min_{j∈T} Σ_{k∈T} L(j, k) P(t = k|x) ].

You have probably noticed by now that minimizing the classification error is
just a special case under the zero-one loss L(y, t) = I_{t≠y}: no loss for getting it
right, loss 1 for an error.
How does the optimal decision rule depend on the loss function values? Let us
work out the Bayes-optimal discriminant function y ∗ (x) for our cancer screening
example. It is positive (classifies 1) if

L(1, 0)P(t = 0|x) + L(1, 1)P(t = 1|x) < L(0, 0)P(t = 0|x) + L(0, 1)P(t = 1|x)
⇔ (L(0, 1) − L(1, 1))P(t = 1|x) > (L(1, 0) − L(0, 0))P(t = 0|x)
⇔ log(p(x|t = 1)/p(x|t = 0)) + log(P(t = 1)/P(t = 0)) > log((L(1, 0) − L(0, 0))/(L(0, 1) − L(1, 1))) = − log λ
⇔ y*(x) = log(p(x|t = 1)/p(x|t = 0)) + log(P(t = 1)/P(t = 0)) + log λ > 0.

The loss function values only shift the threshold of the optimal discriminant.
The larger λ, the more f*(x) = 1 decisions will happen (human checkup). You

might think an error is unacceptable and set λ = ∞. However, this leads to
f*(x) = 1 for every patient, and the screening becomes uninformative.
Armed with loss functions, we can extend decision theory to scenarios where
the prediction space (output of f (x)) is different from the label space T . A
common example is to allow the classifier to output “don’t know”. None of this
adds complexity, it is left to the reader to be explored.
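In code, the risk-minimizing decision is a single argmin; below is a minimal sketch for the cancer screening loss, with an arbitrary value for λ and a made-up posterior.

# Sketch: risk-minimizing decision f*(x) = argmin_j sum_k L(j,k) P(t=k|x),
# for the screening loss (lambda chosen arbitrarily, posterior made up).
import numpy as np

lam = 100.0
L = np.array([[0.0, lam],    # L[j, k]: decide j while the true label is k
              [1.0, 0.0]])

def decide(post):            # post = [P(t=0|x), P(t=1|x)]
    return int(np.argmin(L @ post))

print(decide(np.array([0.95, 0.05])))   # -> 1: even 5% cancer risk triggers a checkup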
Finally, we need to mention some vocabulary coming from the area of statis-
tical tests, as it is widely used in machine learning. Take our cancer screening
example. Our method outputs f (x), the true label is t. If f (x) = 1, our vote
is positive, if f (x) = 0, it is negative. Then, if f (x) = t, it is true, otherwise
it is false. Any combination (f (x), t) is a true/false positive/negative. The two
types of errors are false positive and false negative. A false positive is a checkup
for a healthy patient, a false negative is sending a cancerous patient home. In
most situations, false negatives are more serious than false positives.


Figure 5.6: Left: Class-conditional densities for binary classification problem,


where P (t = 1) = P (t = 2). Right: Corresponding posterior probabilities. The
green line constitutes the Bayes-optimal threshold. The complex bimodal shape
of p(x|C1 ) = p(x|t = 1) has no effect on the posterior probabilities and the
optimal classifier.
Figure from [5] (used with permission).

5.2.5 Inference and Decisions

Our insights from decision theory can be summed up in the statement that
optimal decision making is driven by probabilistic inference, the computation of
posterior probabilities. The standard approach to a problem is to:

• Model the problem by assessing a complete joint distribution, such as p(x, t) = p(x|t)P (t), as well as a loss function L(y, t).

• Compute the posterior distribution P (t|x).


• Make decisions so as to minimize risk (expected loss), using optimal discriminant functions based on P (t|x).

However, we are in a comfortable position in this chapter: we know all of p(x, t),
and P (t|x) is easily computed. This is not so for most real-world situations.
The true, or even a very good model is unknown. We have to learn from data,
and computing posterior distributions can be very difficult. In the real world,
we have to find shortcuts to the standard approach. For example, if only the
posterior P (t|x) enters decision making, why don’t we model it directly, rather
than bothering with all of p(x, t)? It may well be that the precise shape of the
class-conditionals p(x|t) is irrelevant for optimal decision making (Figure 5.6),
in which case learning it from data is wasteful. This rationale underlies the
discriminative (as opposed to generative) approach to probabilistic modelling,
which we will learn about in Chapter 8. Even more economical, we could learn
discriminant functions directly, without taking the detour over the posterior
P (t|x). On the other hand, the posterior P (t|x) provides much more useful
information about a problem than any single discriminant function. For one, we
can evaluate our risk and act accordingly. Moreover, the combination of multiple
probabilistic predictors is easy to do by rules of probability.
Chapter 6

Probabilistic Models.
Maximum Likelihood

In this chapter, we introduce probabilistic modelling, the leading approach to


make decision theory work in practice. We also establish the principle of max-
imum likelihood, the most widely used framework for deriving statistical esti-
mators in order to learn from data. We will learn about Gaussian (or normal)
distributions, the most important family of probability distributions over con-
tinuous variables. We will also introduce naive Bayes classifiers based on discrete
distributions, which are widely used in the context of information retrieval and
machine learning on natural language documents.

6.1 Generative Probabilistic Models


One problem with decision theory (Chapter 5) is that it assumes full knowledge
of “true” distributions, which we know little about. All we have in practice is
some vague ideas and data to learn the distributions from.
Consider the dataset of final grades of the PCML course in 2010, shown in
Figure 6.1, left. How can we understand the gist of this data? Let us compute
a histogram. We divide1 the relevant range into bins of width ∆x. For each bin
j, we count the number of data points falling in there (say, nj ) and draw a bar
of height nj /(n∆x). The normalization ensures that the total area of bars adds
up to one. This way, we get an idea about the distribution of scores. However,
histograms come with a few problems:

• The choice of ∆x is crucial. Too large, and we miss important details. Too
fine, and most bins will be empty. Histograms are smoothed out in kernel
density estimators [5, ch. 2.5.1].
• Histograms do not work in more than five dimensions or so. The number of cells grows exponentially with the number of dimensions, and most cells will simply be empty. This problem of histograms and related techniques is called the curse of dimensionality. Our exam scores are in one dimension, but how do we get a good idea about data in 25 dimensions? Or in 784 dimensions, where our MNIST digits live?

• Histograms are often not versatile enough. There are no knobs we can adjust in order to start analyzing the data, instead of just staring at it from one angle.

1 For the final grades data, responses are naturally quantized at bin width 0.5, so there is little to choose in this case.

Figure 6.1: Left: Final grades from Pattern Classification and Machine Learning course 2010. Right: Histogram of data. Overlaid is the maximum likelihood fit for a Gaussian distribution. For this data, the responses ti are quantized to {0, 0.5, 1, . . . , 5.5, 6}, so the bin width is ∆x = 0.5.
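As a side note, the normalized histogram described above (bar height nj /(n∆x)) is easy to compute; here is a minimal sketch in Python/NumPy, with made-up grade data standing in for the real exam results:

    import numpy as np

    # Made-up stand-in for the exam grades (the real data is not reproduced here).
    rng = np.random.default_rng(0)
    x = np.round(rng.normal(4.0, 1.0, size=120) * 2) / 2  # quantized to steps of 0.5

    dx = 0.5                                    # bin width
    bins = np.arange(x.min() - dx / 2, x.max() + dx, dx)
    counts, edges = np.histogram(x, bins=bins)

    heights = counts / (len(x) * dx)            # n_j / (n * dx)
    print("total area:", np.sum(heights * dx))  # sums to 1 by construction
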

These problems sound familiar. In Chapter 4, we decided to fit simple lines or


polynomials to noisy data, so as to find the essentials behind the chaff. Maybe we
should fit simple distributions to our data. In Figure 6.1, right, a Gaussian is
fitted to the exam results. One glance reveals the mean score as well as its
approximate spread. On the other hand, it seems that the fit could be improved
by employing a bimodal2 density (two bumps). The assumptions on the density
are the knobs we can play with in order to understand our data.
Fitting a probability density to data is called density estimation. In fact, his-
tograms and kernel smoothers are density estimation methods. However, in this
chapter, we focus on parametric techniques, where, just as in Chapter 4, the
densities to choose from are parameterized by some w ∈ Rp .

Assumptions about Data

At this point, we need to specify our assumptions about the data. Consider a
dataset D = {xi | i = 1, . . . , n}. The most commonly made assumption is that
the data points xi are independently and identically distributed (short: i.i.d.).
There exists a “true” distribution, from which the xi are drawn independently
(recall independence from Section 5.1.1).
2 We
will not be concerned with multimodal densities in the present chapter, but the tech-
niques we develop here will be the major ingredient for Gaussian mixture models in Sec-
tion 12.2.

In order to build a foundation for density estimation, we go further and formu-


late model assumptions. In fact, we propose a generative probabilistic model for
our data. In order to fit a Gaussian to our exam results D = {xi | i = 1, . . . , n},
we postulate that each xi is drawn independently from a Gaussian distribution
with unknown mean µ and unknown variance σ 2 (Gaussians are introduced
shortly, their details do not matter at this point). This model assumption di-
rectly and uniquely leads to an optimization problem for the fitting. Let us
pause a second to understand that. We do not come up with an algorithm, an
optimization problem, or an error function to minimize. All we do is to suggest
a way in which the data could have been generated, leaving our ignorance about
any details in the unknown parameters. As we will see shortly, everything else
is automatic. We just have to do the forward modelling, from unknown param-
eters to observed data. The inverse problem, from data back to parameters, is
implied by general statistical principles.
It is important to distinguish the i.i.d. assumption from model assumptions.
Model assumptions are choices we make about our model or method. They are
on par with options such as “I will use a linear classifier” or “let me use an MLP
with 4 layers.” These choices can be good or bad, useful or not useful, but it
does not make sense to call them “right” or “wrong”. As George Box, one of the
pioneers of Bayesian inference, quality control and design of experiments, put it: “All
models are wrong, but some are useful.” The i.i.d. assumption3 is a different
story. It is like an axiom for much of machine learning. If it fails for our data,
we may as well read tea leaves as try to learn something from it.

Figure 6.2: Repeatedly dropping a thumbtack on an even surface can be used


to draw a sample of binary data.

6.2 Maximum Likelihood Estimation


Browsing in your drawer, you find an odd-shaped thumbtack (Figure 6.2). When
you drop it on the floor, its point x is either facing up (x = 1) or down (x = 0).
You get curious. What is the probability for it landing point up, p1 = P {x = 1}?
3 Or weaker assumptions such as ergodicity.

You throw it n = 100 times, thereby collecting data D = {x1 , . . . , x100 }. We can
assume that this data is i.i.d. Now, if P {x1 = 1} = p1 , the probability of
generating D is
P (D|p1 ) = ∏_{i=1}^{n} p1^{xi} (1 − p1 )^{1−xi} = p1^{n1} (1 − p1 )^{n−n1},   n1 = Σ_{i=1}^{n} xi .

n1 is the number of times the thumbtack lands with point facing up. The distri-
bution of xi is called Bernoulli distribution with parameter p1 . We can regard
P (D|p1 ) as the probability of the data D under the model assumption. This has
nothing to do with the “true” probability of data, whatever that may be. After
all, for every value of p1 ∈ [0, 1], we get a different probability P (D|p1 ). This
model probability, as a function of the parameter p1 , is called likelihood func-
tion. If our goal is to fit the model P {x1 = 1} = p1 with parameter p1 to the
i.i.d. data D, it makes sense to maximize the likelihood. The maximum likelihood
estimator (MLE) for p1 is

p̂1 = argmax_{p1 ∈ [0,1]} P (D|p1 ).


Figure 6.3: Log likelihood functions for Bernoulli distributions (thumbtack ex-
ample). The sample size is n = 20 here. Note that the log likelihood function
log P (D|p1 ) depends on the data D only through p̂1 = n1 /n and n. Its mode is
at p1 = p̂1 , the maximum likelihood estimate.

Let us solve for p̂1 . Assume for now that n1 ∈ {1, . . . , n − 1}. Then, P (D|p1 ) > 0
for p1 ∈ (0, 1), P (D|p1 ) = 0 for p1 ∈ {0, 1}, so we can assume that p1 ∈
(0, 1). It is always simpler to maximize the log-likelihood log P (D|p1 ) instead.
The derivative is
d log P (D|p1 )/dp1 = n1 /p1 − (n − n1 )/(1 − p1 ) = 0  ⇔  n1 (1 − p1 ) = p1 (n − n1 )  ⇔  p1 = n1 /n.
p̂1 = n1 /n is indeed a maximum point, the unique maximizer, and this holds for
n1 ∈ {0, n} as well. The maximum likelihood estimator (MLE) for p1 = P {x1 = 1} is

p̂1 = n1 /n,

the fraction of throws in which the thumbtack points up. This is what we
would have estimated anyway. Indeed, maximum likelihood estimation often
coincides with common sense (ratios of counts) in simple situations. Bernoulli
log likelihood functions for the thumbtack setup are shown in Figure 6.3.
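A quick numerical sanity check (a minimal sketch in Python/NumPy; the thumbtack counts are made up) confirms that the maximizer of the Bernoulli log likelihood is the count ratio n1 /n:

    import numpy as np

    n, n1 = 20, 14                       # made-up thumbtack throws: 14 of 20 point up

    def log_lik(p1):
        # log P(D | p1) = n1 log p1 + (n - n1) log(1 - p1)
        return n1 * np.log(p1) + (n - n1) * np.log(1.0 - p1)

    grid = np.linspace(1e-6, 1 - 1e-6, 100001)
    p_hat = grid[np.argmax(log_lik(grid))]
    print(p_hat, n1 / n)                 # both are (up to grid resolution) 0.7
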
Is this “correct”? Remember that this question is ill-defined. Is it any good
then? Maximum likelihood estimation works well in many situations in prac-
tice. It comes with a well-understood asymptotic theory. Given that, we will see
that it can have shortcomings as a statistical estimation technique, in particular
if applied with small datasets, and we will study modifications of maximum like-
lihood in order to overcome these problems. For now, it is simply a surprisingly
straightforward and general principle to implement density estimation.

6.3 The Gaussian Distribution


Our exam marks data is real-valued, xi ∈ R. The most important distribu-
tion over real numbers, arguably the most important distribution of all, is the
Gaussian distribution (or normal distribution). It is behind the squared error
function. Maximum likelihood estimation for Gaussians is maybe the most fre-
quently used data analysis technique there is. There is virtually no machine
learning work dealing with continuous variable data, for which Gaussians do
not play a role. In the context of this chapter, Gaussians are important for (at
least) two reasons:

• Maximum likelihood estimators for Gaussian densities coincide with sam-


ple mean and sample covariance matrix.

• For Gaussian class-conditional densities p(x|t), t = 0, 1, linear classifiers


turn out to be Bayes-optimal classifiers (for equal covariance matrices).

Here is the probability density of a Gaussian:


N (x|µ, σ^2 ) = (2πσ^2 )^{−1/2} exp( −(x − µ)^2 / (2σ^2 ) ),   σ^2 > 0. (6.1)

We also sometimes write N (µ, σ 2 ) if the argument x is clear from context. We


will work out in Section 6.3.2 that if x ∼ N (µ, σ 2 ) (read: “x is distributed
according to N (µ, σ 2 )”), then µ = E[x], σ 2 = Var[x]. The parameters are mean
and variance of the Gaussian. Please note one thing: it is µ and σ squared, mean
and variance. N (0, 2) means variance σ 2 = 2, not standard deviation σ = 2. We
work out several properties of the Gaussian below in this chapter, and more as
we move through the course. Just one observation here:
− log N (x|µ, σ^2 ) = (x − µ)^2 / (2σ^2 ) + (1/2) log(2πσ^2 ).
The negative log of a Gaussian is a quadratic function.

The real importance of the Gaussian becomes apparent only for multivariate dis-
tributions. In fact, Gaussians are maybe the only distributions we can tractably
work with in high dimensions. The density of a Gaussian distribution of x ∈ Rd
is
N (x|µ, Σ) = |2πΣ|^{−1/2} exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ). (6.2)
This looks daunting, but is nothing to worry about. |Σ| is the determinant of
Σ. Refresh your memory about determinants in Section 6.3.1. In Section 6.3.3,
we work out the form of this density from the univariate Gaussian, using trans-
formation rules which come in handy elsewhere as well. We also show there that
µ = E[x], Σ = Cov[x]. The parameters of the multivariate Gaussian are mean
and covariance matrix.
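As an illustration (a minimal sketch in Python/NumPy, not part of the text), the log density (6.2) can be evaluated numerically via a Cholesky factorization of Σ, which also checks positive definiteness as a by-product; the example values are made up:

    import numpy as np

    def gauss_log_density(x, mu, Sigma):
        """log N(x | mu, Sigma) for a single point x in R^d, Eq. (6.2)."""
        d = mu.shape[0]
        L = np.linalg.cholesky(Sigma)               # fails if Sigma is not pos. def.
        z = np.linalg.solve(L, x - mu)              # z = L^{-1}(x - mu)
        quad = z @ z                                # (x-mu)^T Sigma^{-1} (x-mu)
        log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log |Sigma|
        return -0.5 * (quad + log_det + d * np.log(2.0 * np.pi))

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
    print(gauss_log_density(np.array([0.5, 0.5]), mu, Sigma))
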

Covariance Matrices are Positive Definite

Not every matrix Σ ∈ Rd×d is allowed here. First, Σ must be invertible. Also,
− log N (x|µ, Σ) is a quadratic function. Recall our discussion of quadratic func-
tions in Section 4.2.2: Σ−1 , therefore Σ should be symmetric. It also must be a
valid covariance matrix. What does that mean? If x ∈ Rd is a random variable
with covariance Cov[x] (not necessarily Gaussian), then for any v ∈ Rd :

0 ≤ Var[v T x] = v T Cov[x]v.

A symmetric matrix A ∈ Rd×d is called positive semidefinite if

v T Av ≥ 0 ∀v ∈ Rd .

Valid covariance matrices are precisely symmetric positive semidefinite matrices.


We used a stronger condition in Section 4.2.2. A symmetric matrix A is positive
definite if
v T Av > 0 ∀v ∈ Rd \ {0}.
So our Σ is invertible and positive semidefinite. These two conditions together
are equivalent to positive definite. Why? If v^T Σ v = 0 for v ≠ 0, we must have
Σv = 0. Namely, for any z ∈ Rd , λ ∈ R:

0 ≤ (z + λv)T Σ(z + λv) = z T Σz + 2λz T Σv.

This means that z T Σv = 0, otherwise sending λ to ∞ or −∞ gives a contra-


diction. But then, Σ cannot be invertible. To conclude, Σ in (6.2) must be a
symmetric positive definite matrix.

Contour Plots of Gaussian Distributions

A Gaussian distribution4 over x ∈ R2 (Figure 6.4, left) can be visualized by a


contour plot (Figure 6.4, right), which works like a topographic map (a contour
is a curve of equal density value, or “height over sea level”). How to read such
a plot? First, the point of highest density (the mode of the density) is the mean
E[x]. For a Gaussian, mean and mode coincide. Next, all contours of a Gaussian
4 In dimensions d > 2, we can still visualize aspects of a Gaussian by projecting it into 2D

subspaces (Chapter 11).



[Figure 6.4 panels: Isotropic Gaussian, Independent Gaussian, General Gaussian]

Figure 6.4: Left: Bivariate Gaussian density function. Right: Contours of Gaus-
sian density functions. Contours are spherical for isotropic Gaussians (no pre-
ferred direction), they are aligned with the standard coordinate axes for inde-
pendent Gaussians (diagonal covariance matrix). Each contour line is an ellipse,
whose major axes are given by the eigenvectors of the covariance matrix.

density are ellipses. This is easy to understand: − log N (x|µ, Σ) is a quadratic


function with positive definite Σ (therefore positive definite Σ−1 , why?), and el-
lipses are solutions of positive definite quadratic equations. For the same reason,
different contour lines are related by a uniform scaling transformation centered
at µ. In general, the ellipses are maximally elongated along a single direction
(axis). We will find in Section 11.1.1 that this principal axis d, maximizing the
variance Var[d^T x] over all ‖d‖ = 1, corresponds to the maximum eigendirection
of the covariance Σ.
For now, let us distinguish between a few characteristic contour plot shapes,
shown in the right panel of Figure 6.4. The upper two contour plots are mirror-
symmetric with respect to the standard coordinate system anchored at µ, while
the lower is not. Mirror symmetry implies that Σ−1 (and therefore Σ) is diago-
nal. The coordinates of a random vector x ∈ Rd with diagonal covariance Cov[x]
are uncorrelated variables. In general, the correlation between two variables xj ,
xk is Cov[xj , xk ] / √( Var[xj ] Var[xk ] ).
Knowing about the Cauchy-Schwarz inequality, you will have no problem show-
ing that correlations range from −1 (fully anti-correlated) over 0 (uncorrelated)
to 1 (fully correlated). A glance at (6.2) reveals that for a Gaussian distribution,
uncorrelated components are independent components. For example, if d = 2 and
x1 , x2 are uncorrelated, then Σ is diagonal, so that Σ−1 is diagonal as well, and
p(x1 , x2 ) = p(x1 )p(x2 ). This implication is specific to Gaussians and does not
hold for other distributions in general. Both contour plots on the top depict
Gaussians with independent components. For the left of these, the contours are
circles: the variance along any direction is the same: Var[dT x] = dT Σd is the
same for all ‖d‖ = 1, which is possible only if Σ is a multiple of I. Such a
covariance structure is called isotropic or spherical. Imagine you stand at the
mean and look into any direction. For an isotropic Gaussian, what you see is
always the same.
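To relate the picture to numbers, here is a small sketch (Python/NumPy; the covariance matrix is made up) that extracts the ellipse axes of the contour lines from the eigendecomposition of Σ:

    import numpy as np

    Sigma = np.array([[2.0, 1.2],
                      [1.2, 1.0]])            # a "general" (non-diagonal) covariance

    evals, evecs = np.linalg.eigh(Sigma)      # eigenvalues ascending, columns = eigenvectors

    # Contours of N(mu, Sigma) are ellipses centered at mu; their principal axes
    # point along the eigenvectors, with half-axis lengths proportional to sqrt(eigenvalue).
    for lam, v in zip(evals[::-1], evecs[:, ::-1].T):
        print("axis direction", v, "relative length", np.sqrt(lam))
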

Marginal Distribution of Gaussian

Given that x ∼ N (µ, Σ), x ∈ Rd , what is the marginal distribution over a


subset of the x components? Say, xJ = [xj ]j∈J for J ⊂ {1, . . . , d}. The an-
swer is obvious from our geometrical intuition: xJ is Gaussian again, with
xJ ∼ N (µJ , ΣJ ). A more general statement is as follows. If A ∈ Rp×d has
full rank rk A = p, p ≤ d and y = Ax, then y is Gaussian again, with
y ∼ N (Aµ, AΣAT ). First, mean and covariance of y follow from the transfor-
mation rules of Section 5.1.3. The Gaussianity of y is shown as in Section 6.3.3:
its density must have the form of (6.2). In other words, the family of Gaussian
distributions is closed under full-rank affine linear transformations. In partic-
ular, all marginal distributions of a joint Gaussian distribution are Gaussian
again.
A final point which you might be puzzled about. We saw that covariance matri-
ces are positive semidefinite in general, while Σ in (6.2) must be positive definite.
Are there Gaussians whose covariance matrix is not invertible? Consider an ex-
treme example. For a fixed vector z ∈ Rd \ {0}, d > 1, and α ∼ N (0, 1), does
x = αz have a Gaussian distribution? Its covariance would be Cov[αz] = zz T ,
which is not invertible. The geometry behind such degenerate Gaussians is sim-
ple: they are perfectly normal Gaussians, but confined to an affine subspace
of Rd . Picture them as entirely flat in Rd . There are always directions v ≠ 0
along which they do not fluctuate at all: Var[v T x] = 0, in fact v T x is constant.
In contrast, a proper Gaussian with density (6.2) always fluctuates along all
directions. Beware that some texts simply extend the family of Gaussian dis-
tributions to these cases. However, in this course a Gaussian distribution has a
positive definite covariance matrix and a density of the form (6.2).

Gaussians in a Non-Gaussian World (*)

In this basic course, we will mainly be concerned with linear methods and mod-
els based on Gaussian distributions. However, it is increasingly appreciated in
statistical physics, statistics, machine learning, and elsewhere that many real-
world distributions of interest are not Gaussian at all. Relevant buzzwords are
“heavy-tailed” (or “fat tails”), “power law decay”, “scale-free”, “small world”,
etc. Real processes exhibit large jumps now and then, which are essentially
ruled out in Gaussian random noise. As Gaussian distributions are simple to
work with, these facts have been widely ignored until quite recently. In this
sense, parts of classical statistics, economics and machine learning are founded
on principles which do not fit the data.
Does this mean we waste our time learning about Gaussians? Absolutely not!
The most natural way to construct realistic heavy-tailed distributions with
power law decay is by mixing together Gaussians of different scales. Non-
Gaussian statistics is built on top of the Gaussian world. The bottomline is this.
Classical statistics use Gaussians and linear models to represent data directly.
Modern statistics employs non-Gaussian distributions and non-linear models,
which are built from Gaussians and linear mappings inside, mainly via the pow-
erful concept of latent variables (we will explore this idea in Chapter 12). Modern
methods behave non-Gaussian, but they are driven by Gaussian mathematics
and corresponding numerical linear algebra as major building blocks.

6.3.1 Techniques: Determinants

The multivariate Gaussian density (6.2) features a determinant |Σ| of the matrix
Σ. It is highly recommended that you study [42, ch. 5.1], even if you think you
know everything about determinants. The following is just a brief exposition,
mainly taken from there.
Every square matrix A ∈ Rp×p has a determinant |A|. Other texts use the
notation detA to avoid ambiguities, but the |A| is most common and will be
used in this course. The number |A| ∈ R contains a lot of information about
the matrix A. First, |A| ≠ 0 if and only if A is invertible. If this is the case,
then |A^{−1}| = 1/|A|. You should memorize the 2 × 2 case:

A = [a b; c d]  ⇒  |A| = ad − bc,   A^{−1} = (1/(ad − bc)) [d −b; −c a].

The last equation holds only if A is invertible, |A| ≠ 0.
Some useful facts about determinants (they all follow from three simple axioms,
see [42, ch. 5.1]):

• Linear in each column/row: If

F (a1 , . . . , ap ) = |[a1 , . . . , ap ]| ,

ak the columns of the matrix A, then F is a linear function in each of its


arguments (F is called multilinear). The same holds for the rows of A.

• Product: If A, B ∈ Rp×p , then |AB| = |A| |B|.


In particular: |A−1 | = 1/|A|, since |I| = 1 and

AA−1 = I.

• Transpose: The determinant is invariant under transposition:


|AT | = |A|

• Triangular matrices: If a matrix A is upper triangular, meaning that aij = 0 for all i > j (all entries below the diagonal are zero), its determinant is the product of the diagonal entries:

|A| = ∏_{i=1}^{p} aii .

The same holds for lower triangular matrices (and of course for diagonal
matrices).

The determinant in the Gaussian density (6.2) is of a positive definite matrix Σ.


In this case, the rules provide a method for computing |Σ|. We use the Cholesky
decomposition from Section 4.2.2: Σ = LL^T . Then,

|Σ| = |L| |L^T | = |L|^2 = ∏_{i=1}^{p} lii^2 .

For large matrices Σ, it is always numerically better to compute

log |Σ| = 2 log |L| = 2 Σ_{i=1}^{p} log lii .
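A minimal sketch of this computation (Python/NumPy; the matrix is made up for illustration):

    import numpy as np

    Sigma = np.array([[4.0, 1.0, 0.5],
                      [1.0, 3.0, 0.2],
                      [0.5, 0.2, 2.0]])       # symmetric positive definite

    L = np.linalg.cholesky(Sigma)             # Sigma = L L^T, L lower triangular
    log_det = 2.0 * np.sum(np.log(np.diag(L)))

    # Same value, but computing |Sigma| directly can over/underflow for large matrices.
    print(log_det, np.log(np.linalg.det(Sigma)))
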

Figure 6.5: Illustration of determinant of R = [r j ] in 2D and 3D.


Left: The absolute value of the determinant in 2D quantifies the area of the
parallelogram spanned by r 1 = [a, b]T , r 2 = [c, d]T . Right: The absolute value
of the determinant in 3D quantifies the volume of the parallelepiped spanned
by r 1 , r 2 , r 3 .
Figures from wikipedia (used with permission). Left: Copyright by Jitse
Nielsen. Right: Copyright Claudio Rocchini, GNU free documentation license.

Finally, what is the geometrical meaning of |A| = F (a1 , . . . , ap )? Properties


such as |I| = 1 and |αA| = α^p |A| hint towards volume, and that is correct.
Consider p = 2. The two vectors a1 , a2 span a parallelogram (Figure 6.5, left),
and the absolute value of |A| = F (a1 , a2 ) is its area. We denote the absolute
value of the determinant by ||A||, but typically point this out to avoid confusion.
The area of a triangle? No problem, take half the determinant (why?). The
same holds for p = 3. Now, a1 , a2 , a3 span a parallelepiped, and ||A|| is its
volume (Figure 6.5, right). In the context of multivariate distributions such as
the Gaussian (6.2),
|Σ| = |Cov[x]|
measures the “volume of covariance”.

6.3.2 Techniques: Working with Densities (*)


Recall the Gaussian density from (6.1). Suppose we did not know what µ and σ 2
were. Let us practice some elementary manipulation of densities, not restricted
to the Gaussian. If the random variable t has the density p(t), then the variable
x = µ+σt, σ > 0, has the density σ −1 p((x−µ)/σ). This is because expectations
have to come out the same, whether we do them w.r.t. x or t. And dx = σdt,
so that
p(t)dt = p((x − µ)/σ)σ −1 dx.
Confirm for yourself that the Gaussian (6.1) is obtained in this way from the
standard normal density
N (t|0, 1) = (2π)^{−1/2} exp(−t^2 /2).

Moreover, we know how mean and variance transform from t to x (Section 5.1.3):

E[x] = µ + σE[t], Var[x] = σ 2 Var[t]. (6.3)

So it all comes down to N (0, 1). Being an even function, its mean must be zero if
it exists (it may not, remember the Cauchy). Let us go for the variance directly:
if this exists, so does the mean (why?). We substitute r = t2 /2, so dr = t dt:
Var[t] = 2 ∫_0^∞ t^2 (2π)^{−1/2} e^{−t^2 /2} dt = (2/π)^{1/2} ∫_0^∞ (2r)^{1/2} e^{−r} dr
= 2 π^{−1/2} ∫_0^∞ r^{1/2} e^{−r} dr = 2 π^{−1/2} Γ(3/2).

The last integral involves Euler’s Gamma function:


Γ(x) = ∫_0^∞ r^{x−1} e^{−r} dr,

interpolating the factorial via Γ(x + 1) = x!, x ∈ N. In general, Γ(x + 1) = xΓ(x)
and Γ(1/2) = √π. Therefore, Γ(3/2) = √π/2, and Var[t] = 1. The standard
normal distribution N (0, 1) has mean 0, variance 1, and (6.3) implies that µ is
the mean, σ^2 the variance of the Gaussian (6.1).

6.3.3 Techniques: Density after Transformation (*)


Let us make sense of the multivariate Gaussian density (6.2) and show that µ
and Σ do correspond to mean and covariance. We have already firmly estab-
lished the univariate Gaussian (6.1) in Section 6.3.2. We construct a random
vector t ∈ Rd with independent components, distributed as tj ∼ N (0, 1), so
that E[t] = 0, Cov[t] = I (the identity). Its density is
N (t|0, I) = ∏_{j=1}^{d} (2π)^{−1/2} e^{−t_j^2 /2} = |2πI|^{−1/2} e^{−(1/2) t^T I^{−1} t},   (6.4)

using that t^T I^{−1} t = ‖t‖^2 = Σ_j t_j^2 and |2πI| = (2π)^d |I| = (2π)^d . Not bad.
Given µ and Σ, we work backwards. Recall that Σ is positive definite. We use

the Cholesky decomposition discussed in Section 4.2.2: Σ = LLT , where L is


triangular and invertible.
If x = µ+Lt, how does the density p(t) of t transform? The following step is not
specific to Gaussian distributions. Again, we have to make sure that expectation
is conserved when switching from t to x. And by multivariate calculus:

p(t) dt = p( L^{−1} (x − µ) ) ||L||^{−1} dx.
Here, ||L|| denotes the absolute value of the determinant |L|. This rule is easy
to understand. The differentials dx and dt are infinitesimal volume elements.
Picture dt as tiny hypercube. Since t → x involves multiplication with L, the
cube is transformed. What is its new volume? By our volume intuition about
the determinant, it must be dx = ||L||dt. Plugging this into (6.4):
e^{−‖t‖^2 /2} = e^{−(1/2)(x−µ)^T L^{−T} L^{−1} (x−µ)} = e^{−(1/2)(x−µ)^T Σ^{−1} (x−µ)} .

Moreover, |LT | = |L|, so that |Σ| = |LT L| = |L|2 , and the prefactor becomes
||L||−1 = |Σ|−1/2 . All in all, the density of x is (6.2). Finally, we know how
mean and covariance transform (Section 5.1.3):

E[x] = µ + LE[t] = µ, Cov[x] = LCov[t]LT = LLT = Σ.

We have derived (6.2) from the univariate case and shown that µ, Σ are mean
and covariance respectively.
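The construction x = µ + Lt is also exactly how one samples from a multivariate Gaussian in practice; a minimal sketch (Python/NumPy, with a made-up µ and Σ):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])

    L = np.linalg.cholesky(Sigma)            # Sigma = L L^T
    t = rng.standard_normal((100000, 2))     # rows t ~ N(0, I)
    x = mu + t @ L.T                         # rows x = mu + L t  ~  N(mu, Sigma)

    print(x.mean(axis=0))                    # close to mu
    print(np.cov(x, rowvar=False))           # close to Sigma
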

6.4 Maximum Likelihood for Gaussian Distributions
We are ready now to fit a Gaussian distribution N (µ, σ 2 ) to our exam results
dataset. We minimize the negative log likelihood
L(µ, σ^2 ) = − log p(D|µ, σ^2 ) = (1/2) Σ_{i=1}^{n} [ (xi − µ)^2 /σ^2 + log(2πσ^2 ) ].

First,

∂L/∂µ = Σ_{i=1}^{n} (µ − xi )/σ^2 = (n/σ^2 )(µ − x̄),   x̄ = (1/n) Σ_{i=1}^{n} xi .

Therefore, µ̂ = x̄: the empirical mean. Note that this is a minimum point, since
the second derivative is positive. Plugging in µ = µ̂:

L(σ^2 ) = (1/2) Σ_{i=1}^{n} [ (xi − x̄)^2 /σ^2 + log(2πσ^2 ) ] = (n/2) [ S/σ^2 + log(2πσ^2 ) ],
S = (1/n) Σ_{i=1}^{n} (xi − x̄)^2 .

If τ = σ^{−2} , then

L(τ ) = (n/2) ( Sτ + log(2π) − log τ ),   ∂L/∂τ = (n/2) ( S − 1/τ ).

Therefore, τ̂ = 1/S, or σ̂^2 = S. This is a minimum point of L, hence a maximum of the likelihood, since

∂^2 (2L/n)/∂τ^2 = ∂(S − 1/τ )/∂τ = 1/τ^2 > 0.

The maximum likelihood estimator for mean and variance of a Gaussian (6.1) is

µ̂ = x̄ = (1/n) Σ_{i=1}^{n} xi ,   σ̂^2 = (1/n) Σ_{i=1}^{n} (xi − x̄)^2 .

The Gaussian ML fit to our exam results data employs the empirical mean and
variance. For multivariate data D = {xi | i = 1, . . . , n}, xi ∈ R^d , the maximum
likelihood estimator is equally intuitive:

µ̂ = x̄ = (1/n) Σ_{i=1}^{n} xi ,   Σ̂ = (1/n) Σ_{i=1}^{n} (xi − x̄)(xi − x̄)^T , (6.5)

sample mean and covariance. The latter holds only if the sample covariance has
full rank d (in particular, n ≥ d). Otherwise, the MLE for the covariance is
undefined. We derive these estimators in Section 6.4.3.
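In code, the ML fit (6.5) is just the sample mean and the (biased, 1/n) sample covariance; a minimal sketch in Python/NumPy with made-up data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))            # made-up data, rows are points x_i in R^d

    mu_hat = X.mean(axis=0)                  # sample mean
    Xc = X - mu_hat
    Sigma_hat = (Xc.T @ Xc) / X.shape[0]     # 1/n * sum (x_i - mean)(x_i - mean)^T

    # Note: np.cov divides by n-1 by default; bias=True gives the ML (1/n) version.
    print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))
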

6.4.1 Gaussian Class-Conditional Distributions


Fitting bell-shaped curves to exam results is fine. But how does all this help
us with classification? Let us combine decision theory (Section 5.2) with ML
estimation for Gaussians. Consider a binary classification problem with input
patterns x ∈ Rd and targets t ∈ {−1, +1}. Decision theory provides optimal
discriminants, given that we know the joint density p(x, t). MLE allows us to
fit parametric distribution families to data. Therefore, if we make a parametric
model assumption about what p(x, t) could be, everything else falls in place. Let
us assume that the class-conditional densities are Gaussian with unit covariance
matrix:

p(x|t) = N (x|µt , I) = (2π)^{−d/2} e^{−‖x−µt‖^2 /2} ,   µt ∈ R^d .
Moreover, P (t = +1) = π1 ∈ (0, 1), P (t = −1) = 1 − π1 . The setup is depicted
in Figure 6.6. Knowing that contour lines of spherical Gaussians are circles
centered at the mean, the optimal classifier assigns x to class t for the closest
mean µt in Euclidean distance. At least in this two-dimensional example, the
Bayes-optimal decision boundary is a line orthogonal to µ+1 − µ−1 . We have
seen this example before in Section 2.1.
In general, the optimal discriminant under this setup is
y ∗ (x) = log [ p(x|t = +1) π1 / (p(x|t = −1)(1 − π1 )) ]
       = −(1/2) ( ‖x − µ+1‖^2 − ‖x − µ−1‖^2 ) + log [ π1 /(1 − π1 ) ].

Expanding the squared distances, the ‖x‖^2 terms cancel each other:

y ∗ (x) = (µ+1 − µ−1 )^T x − (1/2) ( ‖µ+1‖^2 − ‖µ−1‖^2 ) + log [ π1 /(1 − π1 ) ]. (6.6)
The optimal discriminant function is linear. A hyperplane in input space, just


like those we discussed in Chapter 2. Its (unnormalized) normal vector is w =
µ+1 − µ−1 . If c = log{π1 /(1 − π1 )}, we can use ‖µ+1‖^2 − ‖µ−1‖^2 = (µ+1 − µ−1 )^T (µ+1 + µ−1 ) to obtain

y ∗ (x) = w^T (x − x0 ),   x0 = (1/2)(µ+1 + µ−1 ) − (c/‖w‖^2 ) w.

Figure 6.6: Two Gaussian class-conditional distributions with spherical covariance I. The optimal discriminant is a hyperplane with normal vector w = µ+1 − µ−1 and offset point x0 . If P (t = −1) = P (t = +1) (equal class priors), then x0 = (µ+1 + µ−1 )/2. Otherwise, it is translated along the line through µ−1 and µ+1 , towards the class mean whose P (t) is smaller.

The geometrical picture is clear (Figure 6.6). Imagine a line through the class
means µ+1 , µ−1 . The optimal hyperplane is orthogonal to this line. It intersects
the line at x0 , which is the midpoint between the class means, (µ+1 + µ−1 )/2, if
and only if c = 0, or P (t = +1) = P (t = −1) = 1/2. For unequal class priors, x0
is obtained by translating the midpoint along the line, towards the class mean
whose P (t) is smaller. For this simple setup, we can compute the Bayes error
analytically (Section 6.4.2). For the special case c = 0 (equal class priors),
R∗ = Φ(−‖w‖/2),   Φ(x) = ∫_{−∞}^{x} (2π)^{−1/2} e^{−t^2 /2} dt.

Here, Φ(x) is the cumulative distribution function of the standard normal distri-
bution N (0, 1): Φ(x) = P {t ≤ x}, t ∼ N (0, 1). As expected, R∗ is a decreasing
function of the distance ‖w‖ = ‖µ+1 − µ−1‖ between the class means, R∗ = 1/2
for µ+1 = µ−1 , and R∗ → 0 as ‖w‖ → ∞.

Maximum Likelihood Plug-in Discriminants

In order to use this in the real world, we can estimate the parameters µ+1 , µ−1 ,
and π1 = P (t = +1) from data D = {(xi , ti ) | i = 1, . . . , n}, using the principle

of maximum likelihood:
p(D|µ+1 , µ−1 , π1 ) = ( ∏_{i=1}^{n} N (xi |µti , I) ) π1^{n1} (1 − π1 )^{n−n1} ,   n1 = Σ_i I_{ti =+1} .

Using our results from above,


µ̂+1 = (1/n1 ) Σ_i I_{ti =+1} xi ,   µ̂−1 = (1/(n − n1 )) Σ_i I_{ti =−1} xi ,   π̂1 = n1 /n.

The maximum likelihood plug-in discriminant ŷ(x) is given by plugging the ML


estimates into (6.6):
ŷ(x) = ŵ^T x − (1/2) ( ‖µ̂+1‖^2 − ‖µ̂−1‖^2 ) + log [ π̂1 /(1 − π̂1 ) ],   ŵ = µ̂+1 − µ̂−1 .
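Putting the estimation and the plug-in rule together (a minimal sketch in Python/NumPy on made-up data; labels are in {−1, +1}):

    import numpy as np

    rng = np.random.default_rng(0)
    # Made-up training data: two spherical Gaussian clouds in R^2.
    X = np.vstack([rng.normal([2, 0], 1.0, size=(100, 2)),
                   rng.normal([-2, 0], 1.0, size=(80, 2))])
    t = np.concatenate([np.ones(100), -np.ones(80)])

    # ML estimates of the class means and the class prior.
    mu_p, mu_m = X[t == 1].mean(axis=0), X[t == -1].mean(axis=0)
    pi1 = np.mean(t == 1)

    # Plug-in linear discriminant, Eq. (6.6) with estimated parameters.
    w = mu_p - mu_m
    b = -0.5 * (mu_p @ mu_p - mu_m @ mu_m) + np.log(pi1 / (1 - pi1))
    y_hat = X @ w + b
    print("training error:", np.mean(np.sign(y_hat) != t))
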

In summary, plug-in classifiers are obtained by the following schema:

• Pick a model p(x, t|θ) = p(x|t, θ)P (t|θ), parameterized by θ. This choice
determines everything else, and for real-world situations there is no uni-
formly optimal recipe for it. In this course, we will learn about conse-
quences of modelling choices, so we can do them in an informed way. We
need two basic properties:
– For any fixed θ, the Bayes-optimal classifier is known and has a
reasonably simple form.
– ML density estimation is tractable for p(x|t, θ) and P (t|θ)
In our example above, p(x|t, θ) = N (x|µt , I), P (t|θ) = π1^{(1+t)/2} (1 − π1 )^{(1−t)/2} , and θ = [µ+1^T , µ−1^T , π1 ]^T .
• Given training data D = {(xi , ti )}, estimate θ by maximizing the likeli-
hood, resulting in θ̂.
• The ML plug-in classifier is obtained by plugging the estimated parameters
θ̂ into the optimal rule.

Plug-in classifiers are examples of the generative modelling paradigm. Our goal
is to predict t from x, and we get there by estimating the whole joint density
by p(x, t|θ̂), then use Bayes’ formula to obtain the posterior P (t|x, θ̂), based
on which we classify. Another idea would be to estimate the posterior P (t|x)
directly, bypassing the modelling of inputs x altogether, which is what we do in
discriminative modelling. We will get to the bottom of this important distinction
in Chapter 8.

Equal Covariances ≠ I

A slightly more general case is given by the model assumptions p(x|t) =


N (µt , Σ). We allow for a general covariance matrix Σ, which is shared by all
class-conditional distributions. To save space, let us write
‖v‖_Σ^2 := v^T Σ^{−1} v,   ‖v‖_Σ := √( v^T Σ^{−1} v ).

The optimal discriminant function is


y ∗ (x) = log [ p(x|t = +1) π1 / (p(x|t = −1)(1 − π1 )) ] = −(1/2) ( ‖x − µ+1‖_Σ^2 − ‖x − µ−1‖_Σ^2 ) + c
       = (µ+1 − µ−1 )^T Σ^{−1} x − (1/2) ( ‖µ+1‖_Σ^2 − ‖µ−1‖_Σ^2 ) + c,   c = log [ π1 /(1 − π1 ) ].
A linear discriminant once more, with normal vector w = Σ−1 (µ+1 − µ−1 ).
The difference between the means is transformed by the covariance. Roughly
speaking, contributions of µ+1 − µ−1 along directions of large variance are
downweighted, since the Gaussians overlap more along these directions. What
about the Bayes error? We can reduce the equal covariance case to the spherical
covariance case, for which we know the Bayes error already (Section 6.4.2).
Namely, let Σ = LL^T be the Cholesky decomposition (Section 4.2.2), and x̃ =
L−1 x. Then, p(x̃|t) = N (µ̃t , I), where µ̃t = L−1 µt . The optimal discriminating
hyperplane has normal vector w̃ = µ̃+1 − µ̃−1 , and

w̃ T x̃ = (µ+1 − µ−1 )T L−T L−1 x = (µ+1 − µ−1 )T Σ−1 x,

which is what we obtained above. This means that the Bayes error is given by
the expressions derived in Section 6.4.2, replacing ‖w‖ by

‖w̃‖ = √( (µ+1 − µ−1 )^T Σ^{−1} (µ+1 − µ−1 ) ) = ‖µ+1 − µ−1‖_Σ .

For fixed positive definite Σ, this norm is known as Mahalanobis distance.
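Numerically, one should form w = Σ^{−1}(µ+1 − µ−1) by solving a linear system rather than inverting Σ; a small sketch (Python/NumPy, made-up values):

    import numpy as np

    Sigma = np.array([[2.0, 0.9], [0.9, 1.0]])   # shared class covariance (made up)
    mu_p, mu_m = np.array([1.0, 0.5]), np.array([-1.0, 0.0])

    diff = mu_p - mu_m
    w = np.linalg.solve(Sigma, diff)             # w = Sigma^{-1} (mu_+1 - mu_-1)
    mahalanobis = np.sqrt(diff @ w)              # ||mu_+1 - mu_-1||_Sigma
    print(w, mahalanobis)
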


The corresponding ML plug-in classifier is obtained by estimating means µt ,
t = −1, 1, and covariance Σ by maximum likelihood (6.5). The reduction from
equal to spherical covariances is unproblematic in the decision-theoretic context.
However, if we lift it to plug-in rules by ML estimation, the non-spherical case
can lead to substantial difficulties, in particular if n (number of data points) is
not much larger than d (dimensionality of x). To appreciate the nature of these
difficulties, consider the case d > n. If the dimensionality is greater than the
number of data points, the ML estimator for the covariance matrix is not even
defined! This is an important general problem with generative ML approaches,
we will analyze it in greater detail in Chapter 7.
What about the general case p(x|t) = N (µt , Σt )? In this case, the optimal
discriminant function is quadratic in x. Contrary to quadratic functions used
elsewhere in this course, it is not positive definite in general (its matrix is Σ_{+1}^{−1} − Σ_{−1}^{−1} ), and can therefore give rise to complex decision boundaries. In particular,
decision regions need not be connected. We will not be concerned with this
general case any further during this course. If you are interested, [12] has some
pretty figures.

Multi-Way Classification

Finally, what about multi-way classification, K > 2 classes? Assume that


t ∈ T = {0, . . . , K − 1}. Decision theory (Section 5.2) suggests to employ one
discriminant function yk∗ (x) per class,
yk∗ (x) = log{p(x|t = k)P (t = k)} = −(1/2) ‖x − µk‖^2 + log P (t = k) + C


Figure 6.7: Attempts to construct a multi-way discriminant from a number of binary discriminants lead to ambiguous regions, shown in green. In the left panel, two “one-against-rest” discriminants are combined, one for class C1 , the other for class C2 . Both label the green region as positive. In the right panel, three discriminants are employed, one for each pair of classes. Each of them predicts the green region to belong to a different class.
Figure from [5] (used with permission).

in the spherical covariance case, where C is a constant. Given our model assump-
tions, the Bayes-optimal rule is f ∗ (x) = argmaxk∈T yk∗ (x). The ML plug-in
classifier has the same form, plugging in ML estimates

µ̂k = (1/nk ) Σ_{i=1}^{n} I_{ti =k} xi ,   P̂ (t = k) = nk /n,   nk = Σ_{i=1}^{n} I_{ti =k} .
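A minimal sketch of the resulting K-way plug-in rule (Python/NumPy, made-up data with three spherical classes):

    import numpy as np

    rng = np.random.default_rng(0)
    K, d = 3, 2
    means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])      # made-up true means
    X = np.vstack([rng.normal(means[k], 1.0, size=(60, d)) for k in range(K)])
    t = np.repeat(np.arange(K), 60)

    # ML estimates per class.
    mu_hat = np.array([X[t == k].mean(axis=0) for k in range(K)])
    prior_hat = np.array([np.mean(t == k) for k in range(K)])

    # y_k(x) = -0.5 ||x - mu_k||^2 + log P(t=k); classify by argmax over k.
    dists = ((X[:, None, :] - mu_hat[None, :, :]) ** 2).sum(axis=2)  # (n, K)
    scores = -0.5 * dists + np.log(prior_hat)
    print("training error:", np.mean(np.argmax(scores, axis=1) != t))
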

Importantly, the optimal rule for Gaussian class-conditional densities, while de-
fined in terms of K linear discriminant functions, is not a combination of binary
linear classifiers. A lot of effort has been spent in machine learning in order
to combine multi-way classifiers from binary linear ones. The most commonly
used heuristic, “one-against-rest”, uses K binary classifiers fk (x), discriminat-
ing t = k versus t 6= k. There are always regions in which the binary votes of all
fk (x) remain ambiguous (Figure 6.7, left). Another heuristic uses K(K − 1)/2
binary classifiers fk,k0 (x), k 6= k 0 , discriminating t = k versus t = k 0 . Again,
their votes remain ambiguous in some regions (Figure 6.7, right). More complex
heuristics employ “error-correcting codes” or directed acyclic graphs. Decision
theory indicates that such attempts cannot in general attain optimal perfor-
mance5 .

5 Note that if “one-against-rest” is based on optimal binary discriminant functions yk∗ (x) = log{P (t = k|x)/(1 − P (t = k|x))} (instead of classifiers fk∗ (x) = sgn(yk∗ (x))), the optimal K-way classifier can be combined as f ∗ (x) = argmaxk yk∗ (x), since p 7→ log{p/(1 − p)} is increasing. However, “one-against-rest” is typically used with large margin binary classifiers, which do not provide consistent estimators of posterior class probabilities P (t = k|x) (see Section 9.4).

6.4.2 Techniques: Bayes Error for Gaussian Class-Conditionals (*)
We saw in Section 6.4.1 that the Bayes-optimal classifier for Gaussian class-
conditionals p(x|t) = N (µt , I) is f ∗ (x) = sgn(y ∗ (x)) with a linear discriminant
function y ∗ (x) (6.6). In this section, we derive the generalization error of this
rule, the Bayes error. Recall that c = log{P (t = +1)/P (t = −1)}. The Bayes
error is the sum of two parts, the first being
P {f ∗ (x) = −1 and t = +1} = P (t = +1) P { ‖x − µ+1‖^2 > ‖x − µ−1‖^2 + 2c | t = +1 },

where x ∼ N (µ+1 , I). Denote w = µ+1 − µ−1 . If x̃ = x − µ+1 ∼ N (0, I), the
event is

(1/2) ‖x̃‖^2 > (1/2) ‖x̃ + w‖^2 + c  ⇔  w^T x̃ < −(1/2) ‖w‖^2 − c.

Since w^T x̃ ∼ N (0, ‖w‖^2 ) (recall Section 6.3), this probability is

Φ( −(‖w‖^2 /2 + c)/‖w‖ ) = Φ( −(1/2)‖w‖ − c/‖w‖ ),

where

Φ(x) = ∫_{−∞}^{x} N (t|0, 1) dt = ∫_{−∞}^{x} (2π)^{−1/2} e^{−t^2 /2} dt

is the cumulative distribution function of N (0, 1). The second part of the error
is P {f ∗ (x) = +1 and t = −1}. Due to the symmetry of the setup, this must be
the same as the first if we replace indices +1 and −1 everywhere: ‖w‖ remains
unchanged, c is replaced by −c. Therefore, the Bayes error is

R∗ = P (t = +1) Φ( −(1/2)‖w‖ − c/‖w‖ ) + P (t = −1) Φ( −(1/2)‖w‖ + c/‖w‖ ).

A few sanity checks. First, R∗ → 0 as ‖w‖ → ∞. Also, R∗ = min{P (t = +1), P (t = −1)} for ‖w‖ = 0: if x is independent of the target t, the optimal
rule uses P (t) only. In the equal class prior case P (t = +1) = P (t = −1) = 1/2,
R∗ simplifies to Φ(−‖w‖/2).
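This formula is easy to evaluate (a minimal sketch using Python with scipy, assuming it is available; the means and prior are made up):

    import numpy as np
    from scipy.stats import norm

    mu_p, mu_m = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
    p1 = 0.3                                        # P(t = +1)

    w = mu_p - mu_m
    c = np.log(p1 / (1 - p1))
    nw = np.linalg.norm(w)

    bayes_err = (p1 * norm.cdf(-0.5 * nw - c / nw)
                 + (1 - p1) * norm.cdf(-0.5 * nw + c / nw))
    print(bayes_err)
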

6.4.3 Techniques: MLE for Multivariate Gaussian (*)


The ML estimators for mean and covariance of a multivariate Gaussian
N (x|µ, Σ), x ∈ Rd , are given by (6.5). In this section, we derive this result
and learn to know some useful techniques.
Let D = {xi | i = 1, . . . , n}, xi ∈ Rd , and L = − log p(D|µ, Σ). For the mean
µ̂, we can set ∇µ L = 0 and solve for µ. But since L is quadratic in µ, it is
often easier to simply “see the solution” using a technique called completing the
square:
(1/2) Σ_{i=1}^{n} (xi − µ)^T Σ^{−1} (xi − µ) = −n x̄^T Σ^{−1} µ + (n/2) µ^T Σ^{−1} µ + C1
= (n/2) (µ − x̄)^T Σ^{−1} (µ − x̄) + C2 ,

where C1 , C2 do not depend on µ. But this looks like the quadratic we know
from a Gaussian over µ, something like N (µ|x̄, Σ/n). We know this quadratic
is smallest at the mean, therefore at x̄. Note that we dropped additive terms
not involving µ immediately: they do not influence the minimization, and there
is no merit in keeping them around.
The derivation of Σ̂ is a bit more difficult. First, since µ̂ does not depend on
Σ, we can plug it into L, then minimize the result w.r.t. Σ. To this end, we will
compute the gradient ∇Σ L, set it equal to zero and solve for Σ. The gradient
w.r.t. a matrix may be a bit unfamiliar, but recall that matrix spaces are vector
spaces as well. Before we start, recall the trace of a square matrix:
tr A = Σ_{j=1}^{d} ajj = 1^T diag(A),   A ∈ R^{d×d} ,

the sum of the diagonal entries. Unlike the determinant, the trace is a linear
function: tr(A + αB) = tr A + α tr B. An important result is tr BC = tr C B,
given the product is a square matrix. Also, tr AT = tr A is obvious. Note that
tr B^T C = Σ_{j,k} bjk cjk ,

so that tr B T C can be seen as generalization of the Euclidean inner product to


the space of matrices B, C (which need not be square). A manipulation we will
use frequently is
xT Ax = tr xT Ax = tr AxxT ,
the trace of A times an outer product.
Define the sample covariance matrix as
S = (1/n) Σ_{i=1}^{n} (xi − x̄)(xi − x̄)^T .

We assume that S is invertible. As a covariance matrix, it is symmetric positive


definite. The negative log likelihood as function of Σ, where we plug in x̄ for
µ, is
n
X
log |Σ| + (xi − x̄)T Σ−1 (xi − x̄)

i=1
n
X
tr(xi − x̄)(xi − x̄)T Σ−1 = n log |Σ| + tr S Σ−1 .

= n log |Σ| +
i=1

Here, we dropped an additive constant and also the prefactor 1/2. We also used
the outer product manipulation involving the trace. If P = Σ−1 , then
P̂ = argmin_P { f (P ) = tr S P − log |P | },   Σ̂ = P̂^{−1} ,

where we used |P | = 1/|Σ|, so log |P | = − log |Σ|. The minimization is over


positive definite matrices P , but we ignore this constraint for now. Let us com-
pute the gradient ∇P f = [∂f /∂pjk ] ∈ Rd×d . Obviously, ∇P tr S P = S (we

used the symmetry of S here). Next, we will show a result which will be of
independent interest later during the course. At any P ∈ Rd×d with |P | > 0:

∇P log |P | = P −T . (6.7)

Altogether, ∇P f (P ) = S − P −T = 0 if and only if P = S −1 . This solu-


tion is positive definite, so the constraint is satisfied automatically in this case.
Therefore, the ML estimator for the covariance is Σ̂ = S .
To prove6 (6.7), we employ properties of the determinant (recall Section 6.3.1).

∂ log |P |/∂pjk = lim_{ε→0} (1/ε) ( log |P + ε δj δk^T | − log |P | ).

Here, P + εδ j δ Tk denotes the matrix obtained by adding ε to element (j, k) of


P . Now,
log |P + ε δj δk^T | − log |P | = log { |P + ε δj δk^T | · |P^{−1} | } = log |I + ε P^{−1} δj δk^T |.

Denote v = P −1 δ j , the j-th column of the inverse. Now, I + εvδ Tk is obtained


from the identity by adding εv to the k-th column. We know that the determi-
nant is linear w.r.t. each column, so

|I + εvδ Tk | = |I| + ε|M | = 1 + ε|M |, M = I + (v − δ k )δ Tk ,

where M is I, except the k-th column is replaced by v. Plugging this in, we


have
∂ log |P |/∂pjk = lim_{ε→0} (1/ε) log (1 + ε|M |) = |M |.

Recall from Section 6.3.1 that we can do column eliminations without changing
the determinant. Eliminating all entries of v except vk :

|M | = |I + (vk − 1) δk δk^T | = vk = (P^{−1} )kj .

This concludes the proof.

6.5 Maximum Likelihood for Discrete Distributions
Handwritten digits? The internet and the “data deluge” provide a richer play-
ground for machine learning today. Consider the problem of text classifica-
tion (Figure 6.8). Given a document (for example, a news article), what is
it talking about? We will concentrate on a simple K-way classification setup,
where each document is to be classified according to a flat fixed-sized target
set T = {0, . . . , K − 1} (for example: politics, business, sports, science, movies,
. . . ). Modern models tend to employ hierarchical grouping schemes (a document
6 There are more direct proofs, involving the eigendecomposition of S and P . Our proof

just uses elementary properties of the determinant.



Figure 6.8: The Reuters RCV1 collection is a set of 800,000 documents (news
articles), with about 200 words per document on average. After standard pre-
processing (stop word removal), its dictionary (set of distinct words) is roughly
of size 400,000. A common machine learning problem associated with this data
is to classify documents into groups (for example: politics, business, sports, sci-
ence, movies), which are often organized in a hierarchical fashion.

could be about politics, US politics, Barack Obama) and may associate parts
of a document with different topics.
Some vocabulary. The atomic unit is the word. A document is an ordered set
of words. A corpus (plural: corpora) is a dataset of documents, part of which
may be labeled according to a grouping scheme. Roughly, preprocessing works
as follows:

• Remove punctuation, non-text entities.

• Remove stop words: frequent words which occur in most documents and
tend to carry no discriminative information. For example: a, and, is, it,
be, by, for, to, . . .

• Stemming: Strip prefixes and endings in order to reduce words to their


stem (this may not be done for certain natural language processing tasks,
but is typically done for text classification).

• Build dictionary C = {c1 , . . . , cM } of distinct words occurring somewhere


in the corpus. The dictionary size is M .

Given that, we can represent a document of N words as x = [x1 , . . . , xN ]T , xj ∈


{1, . . . , M }. xj = m specifies that the j-th word is cm . In order to implement a
generative classifier, we have to specify class-conditional distributions P (x|t =
k). Note that the occurence of individual words can be highly indicative for one
class or another. “Currency”, “Dow”, “Credit” sounds more like business than
sports. To model this observation, we can use one distribution p(k) over words
in the dictionary C for each class k = 0, . . . , K − 1. Formally,
p^{(k)} ∈ ∆M = { q ∈ R^M | qm ≥ 0 ∀m = 1, . . . , M,  Σ_{m=1}^{M} qm = 1 }.

The set ∆M of all distributions over M objects is called the M -dimensional


probability simplex. Now, if p(k) ∈ ∆M , k = 0, . . . , K − 1, our generative model
is
P (x|N, t = k) = ∏_{j=1}^{N} p^{(k)}_{xj} ,   x ∈ {1, . . . , M }^N .

The conditioning on the document length N is a technical point, which will play
no role for the discriminant. We need it in order to ensure that Σ_x P (x|N, t = k) = 1. We will use x, N to represent a document, with

P (x, N |t = k) = P (N ) ∏_{j=1}^{N} p^{(k)}_{xj} ,

where P (N ) is a distribution over N which does not depend on t. For our clas-
sification purposes, we can use P (x|N, t) in place of P (x, N |t), since the P (N )
factor cancels out in ratios like P (x, N |t = k)/P (x, N |t = k 0 ), and these are all
we ever need in the end. If you are confused at this point, simply move on and
ignore the N in P (x|N, t) as a technical detail.
Imagine the build-up of P (x|N, t) for two distinct classes t = k, k', say business (k) and sports (k'). For each word xj of x, we multiply P (x|N, t = k) and P (x|N, t = k') by p^{(k)}_{xj} and p^{(k')}_{xj} respectively. For example, if cx1 = “CEO”, presumably p^{(k)}_{x1} > p^{(k')}_{x1}, so the fraction P (x|N, t = k)/P (x|N, t = k') increases. On the other hand, cx5 = “Football” implies a decrease of the ratio by way of multiplication with p^{(k)}_{x5}/p^{(k')}_{x5} < 1. Each word contributes to
the accumulation of evidence in a multiplicative fashion, independent of the
distribution of other words. The combined parameters7 of this model are
θ = [(p(0) )T , . . . , (p(K−1) )T , P (t = 0), . . . , P (t = K − 1)]T . In practice, we
deal with many documents xi of different lengths Ni , and a representation with
M factors is preferable:

P (x|N, t = k) = ∏_{j=1}^{N} p^{(k)}_{xj} = ∏_{m=1}^{M} ( p^{(k)}_{m} )^{φm (x)} ,   φm (x) = Σ_{j=1}^{N} I_{xj =m} . (6.8)

φm (x) is the number of times the word cm occurs in document x. Note that in
general, the majority of the counts will be zero. Also, Σ_{m=1}^{M} φm (x) = N . The
feature vector φ(x) = [φm (x)] ∈ NM summarizes the information in x required
to compute P (x|N, t = k) for all k = 0, . . . , K − 1. Such summaries are called
sufficient statistics. Given our model assumptions, this is all the information we
need to know (and therefore, compute) in order to learn and predict. Whenever a
model for documents x has single word occurence features as sufficient statistics,
it falls under the bag of words assumption. The same count vector is obtained
by permuting words in x arbitrarily, their ordering does not matter. It is as if
we cut x into single words, put them in a bag and mix them up. This seems like
a drastic assumption, but it is frequently made in information retrieval, since
resulting computational simplifications are very substantial.
7 We do not include P (N ) in θ, since the discriminant functions do not depend on it.
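As a quick illustration of these sufficient statistics (a minimal sketch in Python; the toy dictionary and document are made up), the count vector φ(x) is computed as follows:

    import numpy as np

    dictionary = ["market", "stock", "goal", "match", "ceo"]   # toy dictionary, M = 5
    word_index = {c: m for m, c in enumerate(dictionary)}

    # A toy document, already preprocessed (stop words removed, stemmed).
    doc = ["stock", "market", "stock", "ceo"]

    phi = np.zeros(len(dictionary), dtype=int)
    for word in doc:
        phi[word_index[word]] += 1          # phi_m(x) = number of occurrences of c_m

    print(phi)            # [1 2 0 0 1]; note sum(phi) == len(doc) == N
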

What are the optimal discriminant functions under this model?


yk∗ (x) = log {P (x|N, t = k)P (t = k)} = Σ_{m=1}^{M} φm (x) log p^{(k)}_{m} + log P (t = k)
       = (wk )^T φ(x) + log P (t = k),   wk = [ log p^{(k)}_{m} ] ∈ R^M . (6.9)

Linear functions again! The weight vectors wk correspond to log probabilities,


which implies constraints8 on the coefficients, but these are satisfied automat-
ically in ML estimation. To discriminate between classes k and k 0 , we would
use
M (k)
X pm P (t = k)
yk∗ (x) − yk0 (x) = φm (x) log (k0 ) + log .
m=1 pm P (t = k 0 )
| {z }
wk,m −wk0 ,m

The evidence for our decision is combined by summing up log odds


log{ p^{(k)}_{m} / p^{(k')}_{m} } for words cm , each proportional to the number of occurrences
in x. Each term depends on statistics and distributions for a single word only,
there are no cross-terms.

6.5.1 Using Indicators in Maximum Likelihood Estimation
Given data D = {(xi , ti ) | i = 1, . . . , n}, where xi ∈ {1, . . . , M }Ni (Ni is the
length of the i-th document in our training corpus), we can estimate the model
parameters θ by maximizing the data likelihood. We will simplify the deriva-
tion of the ML estimators by systematic use of indicator variables, a technique
which is indispensable with discrete variable models of complex structure used
in modern information retrieval and machine learning. Let us rederive (6.8),
using a funny way to put things:
p^{(k)}_{xj} = ∏_{m=1}^{M} ( p^{(k)}_{m} )^{I_{xj =m}} .

This identity is based on the rules9 a^1 = a, a^0 = 1 for all a ∈ R. I_{xj =m} is an


example for an indicator variable associated with xj . The name for indicators
in some machine learning texts is “1-of-M-coding” or “winner-takes-all-coding”.
Seen as a vector [I{xj =m} ] ∈ RM , exactly one component is one (namely, the
xj -th), all others are zero. On its own, the term p^{(k)}_{xj} is of course simpler than
the indicator product, but if we multiply many such terms, we simply have to
add up the indicators in the exponent:
∏_{j=1}^{N} ∏_{m=1}^{M} ( p^{(k)}_{m} )^{I_{xj =m}} = ∏_{m=1}^{M} ( p^{(k)}_{m} )^{Σ_{j=1}^{N} I_{xj =m}} .

8 Namely, Σ_m e^{wk,m} = 1.
9 In particular, 0^0 = 1 everywhere in this course. This is mainly a convention to make indicators work. The argument lim_{x→0} x^x = exp(lim_{x→0} x log x) = 1 is also convincing.

This was pretty simple. A more complex example is given by the likelihood for
our text classification model. Here, we use indicators I{xi,j =m} , where xi,j is the
j-th word of the i-th document, xi = [xi,j ]. Moreover, we use label indicators
I{ti =k} . The first step is to expand the likelihood, a product over i = 1, . . . , n,
by introducing products over label values k, word values m:
P (D|θ) = ∏_{i=1}^{n} P (xi , Ni |ti ) P (ti ) = ∏_{i=1}^{n} ∏_{k=0}^{K−1} ( P (xi |Ni , t = k) P (t = k) P (Ni ) )^{I_{ti =k}}
= ∏_{i=1}^{n} P (Ni ) ∏_{k=0}^{K−1} P (t = k)^{I_{ti =k}} ∏_{m=1}^{M} ( p^{(k)}_{m} )^{I_{ti =k} Σ_{j=1}^{Ni} I_{xi,j =m}} .

The important point here is that the products over k and m are unconstrained
over all label and word values. All constraints are encoded in the indicators. In
particular, we can always interchange unconstrained products (or sums). Pulling
the product over data points inside, then summing exponents instead:
P (D|θ) = C ∏_{k=0}^{K−1} P (t = k)^{Σ_i I_{ti =k}} ∏_{m=1}^{M} ( p^{(k)}_{m} )^{Σ_i I_{ti =k} Σ_{j=1}^{Ni} I_{xi,j =m}} ,   C = ∏_i P (Ni ).
Here, ∏_i and Σ_i are over all data points. In the same way, we could write ∏_k and
∏_m over all label and word values. Dropping the range in sums and products
is very commonly done in research papers, and we will sometimes follow suit to
keep notations simple.
Indicator variables provide a simple and mechanical way to convert the likelihood in its original form (∏_i ) into the likelihood in a useful form (∏_k ∏_m ), directly in terms of the model parameters P (t = k) and p^{(k)}_{m} . We have to accu-
mulate the counts which appear in the exponents:
nk = Σ_i I_{ti =k} ,   N^{(k,m)} = Σ_i I_{ti =k} Σ_{j=1}^{Ni} I_{xi,j =m} .

N (k,m) is the number of times the word cm occurs in the documents xi labeled
as ti = k, and nk is the number of documents labeled as ti = k. With these, the
log likelihood is
log P (D|θ) = Σ_{k=0}^{K−1} nk log P (t = k) + Σ_{k=0}^{K−1} Σ_{m=1}^{M} N^{(k,m)} log p^{(k)}_{m} + log C. (6.10)

There is no need to drag around additive constants, and we can drop log C at
will. Given the thumbtack example of Section 6.2 and common sense, you may
guess the following maximizers of the log likelihood:

p̂^{(k)}_{m} = N^{(k,m)} / Σ_{m'} N^{(k,m')} ,   P̂ (t = k) = nk /n, (6.11)
likelihood estimators for our text classification model in Section 6.5.3.
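To make the counting mechanics concrete, here is a small illustrative sketch (Python with NumPy; the toy corpus, variable names and class count are made up and not part of the text) of how the counts $n_k$, $N^{(k,m)}$ and the ML estimators (6.11) could be computed.

```python
import numpy as np

# Toy corpus (made up for illustration): each document is a list of word
# indices in {0, ..., M-1}, each label t_i is a class in {0, ..., K-1}.
docs = [[0, 2, 2, 1], [3, 3, 0], [1, 1, 2, 3, 3]]
labels = [0, 1, 0]
M, K = 4, 2

n_k = np.zeros(K)            # n_k: number of documents labeled t_i = k
N_km = np.zeros((K, M))      # N^{(k,m)}: count of word m in documents of class k

for x_i, t_i in zip(docs, labels):
    n_k[t_i] += 1
    for w in x_i:            # accumulate the indicator sums over words
        N_km[t_i, w] += 1

# ML estimators (6.11): ratios of empirical counts
P_hat_t = n_k / n_k.sum()                          # estimate of P(t = k)
p_hat_km = N_km / N_km.sum(axis=1, keepdims=True)  # estimate of p^{(k)}_m
print(P_hat_t)
print(p_hat_km)
```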
6.5.2 Naive Bayes Classifiers


Let us put things together. The optimal discriminant functions for known prob-
abilities are the linear functions of (6.9). Not knowing them, we plug in their ML
estimates (6.11) instead. Variants of this particular maximum likelihood plug-
in classifier are widely used for document classification. While there are more
advanced methods, they are also more costly to train and evaluate. Comparing
against this simple and efficient baseline technique is a must. It is an example
of a naive Bayes classifier.
A naive Bayes classifier comes with class-conditional distributions $P(x|t)$ which have a factorizing structure. For our text classifier,
$$P(x|N, t=k) = \prod_{j=1}^{N} p^{(k)}_{x_j} = \prod_{m=1}^{M} \left(p^{(k)}_m\right)^{\phi_m(x)}.$$

Each term in the rightmost product depends on a distinct part of the parameter vector $\theta$ (here: a single coefficient $p^{(k)}_m$). This factorization assumption implies crucial simplifications. The plug-in discriminants are linear functions, which can be evaluated rapidly. More importantly, the log likelihood decomposes additively and can be maximized efficiently.
A naive Bayes classifier does not have to use word count features $\phi_m(x)$ over a dictionary, but can be run based on any feature map $\phi(x) = [\phi_m(x)]$ over the input point $x$. Naive Bayes far transcends text classification and can be used with rather arbitrary input points, even discrete and continuous attributes mixed together. Our example above is special, in that the different parameter values $p^{(k)}_m$, $m = 1, \dots, M$, come together to form one distribution in $\Delta_M$. Naive Bayes can be used with features and corresponding probabilities which are not linked in any way. For example, consider a binary feature map $\tilde{\phi}(x) = [\tilde{\phi}_m(x)]$, where the $\tilde{\phi}_m(x) \in \{0, 1\}$ are not linearly dependent like the word counts. If $x$ is a document once more, each $\tilde{\phi}_m(x)$ could be sensitive to a certain word pattern, indicating its presence by $\tilde{\phi}_m(x) = 1$. A naive Bayes setup$^{10}$ would be
$$P(x|t=k) = \prod_{m=1}^{M} \left(p^{(k)}_m\right)^{\tilde{\phi}_m(x)} \left(1 - p^{(k)}_m\right)^{1-\tilde{\phi}_m(x)}. \qquad (6.12)$$

If you parse this expression with indicators in mind, you see that $p^{(k)}_m$ encodes $P\{\tilde{\phi}_m(x) = 1 \mid t = k\}$. Once more, the log likelihood decouples additively over the different $p^{(k)}_m$, and we can estimate them independently of each other. Naive Bayes classification for binary features is summarized at the end of this subsection.
At this point, we should note that the term “naive Bayes” is not used consis-
tently in the literature. There is a narrow and a more general definition, and
we use the latter here. To understand the difference, consider the two exam-
ples we looked at. The narrow definition stipulates that the class-conditionals
P (x|t = k) are such that, given t = k, the different features φm (x) are condi-
tionally independent in the probabilistic sense (see Section 5.1.1). This is the
$^{10}$ For the meticulous: We should use $P(\tilde{\phi}(x)|t=k)$ instead of $P(x|t=k)$, in order to obtain a proper distribution. After all, $x \leftrightarrow \tilde{\phi}(x)$ may not be one-to-one.
case for our latter binary feature example (6.12), but it is not the case for the document classification setup (6.8): the constraint $\sum_{m=1}^{M} \phi_m(x) = N$ obviously links the features. The more general definition of naive Bayes, adopted here, requires the class-conditionals $P(x|t=k)$ to factorize w.r.t. the parameters $p^{(k)}_m$, so that the log likelihood decouples additively. However, the features may still be linked$^{11}$, and so are the corresponding parameters.
One final observation, in preparation of things to come. Back to the document classification example. What happens if one specific word $c_m$ does not occur in any documents $x_i$ labeled as $t_i = k$? Go back up and check for yourself. The ML estimator $\hat{p}^{(k)}$ will have $\hat{p}^{(k)}_m = 0$ in this case. Under this distribution, it is impossible that any document of class $k$ ever contains $c_m$. In other words, suppose I come along with a document $x_*$ which contains $c_m$ at least once: $\phi_m(x_*) > 0$. Then, the ML plug-in discriminant function $\hat{y}_k(x)$ of the form (6.9) is
$$\hat{y}_k(x_*) = \phi_m(x_*) \log \hat{p}^{(k)}_m + \cdots = -\infty.$$
The hypothesis t = k is entirely ruled out for x∗ , due to the absence of a
single word cm from its training data. Does this matter? Yes, very much so.
Natural language distributions over words are extremely heavy-tailed, meaning
that new unseen words pop up literally all the time. The fact that some cm does
not occur in documents for some class will happen with high probability for any
real-world corpus. We will analyze this serious shortcoming of ML plug-in rules
in Chapter 7, where we will learn how it can be alleviated.

Summary: Naive Bayes Classification for Binary Features

Suppose that $\tilde{\phi}(x) = [\tilde{\phi}_m(x)]$, where $\tilde{\phi}_m(x) \in \{0, 1\}$. Different from word count features, the different $\tilde{\phi}_m$ can be on or off independent of each other. Naive Bayes classification is based on the model (6.12). This means that given $t = k$, the features $\tilde{\phi}_m(x)$ are conditionally independent. The classifier comes with parameters $p^{(k)}_m \in [0, 1]$, $m = 1, \dots, M$, $k = 0, \dots, K-1$, one for each feature and class. These are estimated by maximum likelihood. If there are $n_k$ input points $x_i$ labeled as $t_i = k$, and $n = \sum_{k=0}^{K-1} n_k$, then
$$\hat{p}^{(k)}_m = \frac{\sum_{i=1}^{n} I_{\{t_i=k\}} \tilde{\phi}_m(x_i)}{n_k},$$
the fraction of points $x_i$ of class $k$ with $\tilde{\phi}_m(x_i) = 1$. The ML estimator for $P(t)$ is
$$\hat{P}(t=k) = \frac{n_k}{n}.$$
The trained classifier uses $\hat{P}(t=k)$ and $\hat{P}(x|t=k)$, the latter being (6.12) with $\hat{p}^{(k)}_m$ plugged in. For example, the (posterior) probability of class $\tilde{k}$ for a new point $x$ is computed by first computing
$$\hat{P}(x|t=k)\hat{P}(t=k) = \left(\prod_{m=1}^{M} \left(\hat{p}^{(k)}_m\right)^{\tilde{\phi}_m(x)} \left(1 - \hat{p}^{(k)}_m\right)^{1-\tilde{\phi}_m(x)}\right) \frac{n_k}{n}.$$
$^{11}$ For the meticulous: This linkage is typically a linear one (linear equality constraints), expressed in not overly many constraints. It is possible to write down models with a decoupling log likelihood and intricate nonlinear constraints between the features. Such models would not be called "naive Bayes" anymore.
Then, using Bayes' formula:
$$\hat{P}(t=\tilde{k}|x) = \frac{\hat{P}(x|t=\tilde{k})\hat{P}(t=\tilde{k})}{\sum_k \hat{P}(x|t=k)\hat{P}(t=k)}.$$

In practice, we have to use log in order to convert products into sums, otherwise
we produce overflow or underflow. It is easiest to work in terms of discriminant
functions:
$$\hat{y}_k(x) = \log\left\{\hat{P}(x|t=k)\hat{P}(t=k)\right\} = \sum_{m=1}^{M}\left\{\tilde{\phi}_m(x)\log\hat{p}^{(k)}_m + \left(1-\tilde{\phi}_m(x)\right)\log\left(1-\hat{p}^{(k)}_m\right)\right\} + \log\frac{n_k}{n}.$$

The naive Bayes classifier decides for the class $\mathrm{argmax}_k\,\hat{y}_k(x)$. Moreover,
$$\hat{P}(t=\tilde{k}|x) = \frac{e^{\hat{y}_{\tilde{k}}(x)}}{\sum_k e^{\hat{y}_k(x)}}.$$
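As an illustration of this summary, the following sketch (Python/NumPy; the function names and the random toy data are hypothetical, not part of the text) fits the binary-feature naive Bayes model by maximum likelihood and evaluates the discriminants $\hat{y}_k(x)$, shifting by the maximum before exponentiating to avoid overflow.

```python
import numpy as np

def fit_naive_bayes_binary(Phi, t, K):
    """ML estimates for the binary-feature naive Bayes model (6.12).
    Phi: (n, M) array with entries in {0, 1}; t: (n,) labels in {0, ..., K-1}."""
    P_t = np.array([(t == k).mean() for k in range(K)])            # P(t = k)
    p_km = np.array([Phi[t == k].mean(axis=0) for k in range(K)])  # p^(k)_m
    return P_t, p_km

def predict_naive_bayes_binary(phi, P_t, p_km):
    """Class posterior for one binary feature vector phi via the discriminants.
    Note: an empirical frequency of exactly 0 or 1 would give log(0) = -inf;
    this zero-count issue is taken up in Chapter 7."""
    y = (phi * np.log(p_km) + (1 - phi) * np.log(1 - p_km)).sum(axis=1) + np.log(P_t)
    y = y - y.max()                          # shift before exponentiating
    post = np.exp(y) / np.exp(y).sum()
    return post.argmax(), post

# Made-up data, purely for illustration
rng = np.random.default_rng(0)
Phi = rng.integers(0, 2, size=(200, 5))
t = rng.integers(0, 2, size=200)
P_t, p_km = fit_naive_bayes_binary(Phi, t, K=2)
print(predict_naive_bayes_binary(Phi[0], P_t, p_km))
```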

6.5.3 Techniques: Maximizing Discrete Log Likelihoods (*)

In this section, we establish the maximum likelihood estimators (6.11) for the document classification setup. In fact, the log likelihood decomposes additively into $K + 1$ separate terms, one for $P(t) \in \Delta_K$, and $K$ for $p^{(k)} \in \Delta_M$, $k = 0, \dots, K-1$. They are instances of ML estimation for a multinomial distribution. Suppose $D = \{x_1, \dots, x_n\}$, $x_i \in \{1, \dots, L\}$, is sampled independently from $p \in \Delta_L$ (model assumption). The log likelihood is
$$L = \log P(D|p) = \sum_{l=1}^{L} n_l \log p_l, \qquad n_l = \sum_{i=1}^{n} I_{\{x_i=l\}}.$$

The maximum likelihood estimation problem is $\max_{p\in\Delta_L} \log P(D|p)$. For $L = 2$, we can parameterize $p$ by a single parameter, and the solution was obtained in Section 6.2. In the general case, we could use the vanilla technique of Lagrange multipliers (see Appendix A). However, let us do things differently, learning about one of the most fundamental inequalities in mathematics in passing. First,
$$L = n \sum_{l=1}^{L} \hat{p}_l \log p_l, \qquad \hat{p}_l = \frac{n_l}{n}.$$
Note that $\hat{p} = [\hat{p}_l] \in \Delta_L$ just like $p$, since $\sum_l n_l = n$. What you should take away from this section is the following:
$$q = \underset{p \in \Delta_L}{\mathrm{argmax}} \left\{ F_q(p) = \sum_{l=1}^{L} q_l \log p_l \right\}, \quad q \in \Delta_L. \qquad (6.13)$$

If you are given a distribution q ∈ ∆L and seek the maximizer of Fq (p) for
p ∈ ∆L , the unique answer is p̂ = q. This problem appears over and over again12
$^{12}$ It holds just as well for distributions over continuous variables, even though our proof here only covers the discrete finite case.


in machine learning. Whenever you recognize this pattern, you immediately


know the solution: no need for Lagrange multipliers. Sometimes, the best way
to derive pattern recognition methods is pattern recognition!
Let us use (6.13) to establish the ML estimators (6.11). How often do you spot our pattern in the log likelihood (6.10)? $K + 1$ times. First, $L \to K$, $q_k \to n_k/n$, $p_k \to P(t=k)$ (multiply and divide by $n$ to obtain frequencies $n_k/n$). Solution: $\hat{P}(t=k) = \hat{p}_k = q_k = n_k/n$. Next, fix $k = 0, \dots, K-1$, and let $N^{(k)} = \sum_m N^{(k,m)}$. $L \to M$, $q_m \to N^{(k,m)}/N^{(k)}$, $p_m \to p^{(k)}_m$ (multiply and divide by $N^{(k)}$ to obtain frequencies). Solution: $\hat{p}^{(k)}_m = \hat{p}_m = q_m = N^{(k,m)}/N^{(k)}$.
In order to establish our pattern rule (6.13), let us look at the difference
$$D[q\,\|\,p] := F_q(q) - F_q(p) = \sum_{l=1}^{L} q_l \log q_l - \sum_{l=1}^{L} q_l \log p_l = \sum_{l=1}^{L} q_l \log\frac{q_l}{p_l}, \qquad q, p \in \Delta_L.$$

This function of two distributions over the same set is called relative entropy (or Kullback-Leibler divergence, or also "cross-entropy" in the neural networks literature). We need to show that $D[q\,\|\,p] \geq 0$ for any $q, p \in \Delta_L$, and that $D[q\,\|\,p] = 0$ only if $q = p$. This is the information inequality (or Gibbs inequality), one of the most powerful "inequality generators" in mathematics. It holds for general probability distributions; a general proof is found in [10].


Figure 6.9: $\log(x)$ is upper bounded by $x - 1$. This bound is used in order to prove the information (or Gibbs) inequality.

Let us prove the information inequality for discrete finite distributions $q, p \in \Delta_L$. The proof is based on the inequality $\log x \leq x - 1$, which holds for all $x > 0$ (Figure 6.9). Namely, if $f(x) = x - 1 - \log x$ (continuously differentiable), then $f(1) = 0$, $f'(x) = 1 - 1/x \geq 0$ if and only if $x \geq 1$, so that $f(x) \geq 0$ for all $x > 0$. In fact, $|f'(x)| > 0$ for $x \neq 1$, so that $f(x) = 0$ if and only if $x = 1$. If we define $\log 0 = -\infty$, then $\log x \leq x - 1$ holds for all $x \geq 0$. Given that,
   
$$-D[q\,\|\,p] = \mathbb{E}_q\left[\log\frac{p_l}{q_l}\right] \leq \mathbb{E}_q\left[\frac{p_l}{q_l} - 1\right] = \mathbb{E}_p[1] - \mathbb{E}_q[1] = 0.$$
Here, $\mathbb{E}_q[f(l)] = \sum_l q_l f(l)$ denotes expectation over $q$. This proves the inequality. Now, suppose that $D[q\,\|\,p] = 0$. We need to show that $q_l = p_l$ for
all $l = 1, \dots, L$. Suppose that $q_{l_*} \neq p_{l_*}$ for some $l_*$ with $q_{l_*} \neq 0$. Then, $\varepsilon = f(p_{l_*}/q_{l_*}) > 0$, and
$$-D[q\,\|\,p] = \mathbb{E}_q\left[\log\frac{p_l}{q_l}\right] \leq \mathbb{E}_q\left[\frac{p_l}{q_l} - 1 - \varepsilon I_{\{l=l_*\}}\right] = -\varepsilon q_{l_*} < 0,$$

a contradiction. Therefore, $p_l = q_l$ for all $q_l \neq 0$. Since both $q$ and $p$ sum to 1, we must have $p_l = q_l$ for all $l = 1, \dots, L$. This completes the proof.
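A quick numerical sanity check of the information inequality and of the pattern (6.13) may be instructive (a Python/NumPy sketch on randomly drawn distributions; purely illustrative):

```python
import numpy as np

def rel_entropy(q, p):
    """Relative entropy D[q || p] for discrete distributions (0 log 0 := 0)."""
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

rng = np.random.default_rng(1)
q = rng.random(5); q /= q.sum()
for _ in range(5):
    p = rng.random(5); p /= p.sum()
    assert rel_entropy(q, p) >= 0.0                        # information inequality
    assert np.sum(q * np.log(p)) <= np.sum(q * np.log(q))  # (6.13): F_q(p) <= F_q(q)
print("information (Gibbs) inequality holds on these samples")
```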
Chapter 7

Generalization.
Regularization

In this chapter, we introduce the concept of generalization and shed some light
on the phenomenon of over-fitting, which can negatively affect estimation-based
learning techniques. We will study regularization as a simple and frequently
effective remedy against over-fitting. Finally, MAP estimation is introduced as a general way of regularizing ML estimation, employing prior distributions over
model parameters.

7.1 Generalization
The world around us is complex. And the closer we look, the more details we
see. Arguably, for a model to stand any chance of making interesting predictions, it
ought to reflect this complexity: many variables, high-dimensional feature maps,
a great depth of hidden layers separated by nonlinearities, so that training data
can be fit with high precision. What could be wrong with that?
Certainly, there will be computational issues. For highly detailed models, max-
imum likelihood estimation can be a hard problem to solve. But leaving these
issues aside, there is a fundamental problem: generalization. Understanding gen-
eralization is arguably the single most important lesson we will learn in this
course. Without a solid understanding of this concept, there can be no valid
statistics or useful machine learning. Its relevance goes far beyond statistics and
machine learning, essentially governing all natural sciences: Occam’s razor dic-
tates that we should always favour the simplest model appropriate for our data,
not the most complex one.
We start with some definitions which link back to decision theory (Sec-
tion 5.2) and discriminants (Chapter 2). Suppose you are given some data
D = {(xi , ti ) | i = 1, . . . , n}, ti ∈ {−1, +1}, and your goal is binary classi-
fication: finding a classifier f (x), mapping to labels {−1, +1}, which predicts
well. What does “predict well” mean? We can give several answers. After read-
ing Chapter 2, you might say: f (x) is a good classifier if it does well on the


training data $D$. In other words, its training error
$$\hat{R}_n(f) = \hat{R}_n(f; D) = \frac{1}{n}\sum_{i=1}^{n} I_{\{f(x_i)\neq t_i\}}$$

is small. For example, if D is linearly separable in some feature space we chose,


then the perceptron algorithm outputs a linear classifier with R̂n (f ) = 0. How-
ever, after reading about decision theory (Section 5.2), you might give a dif-
ferent answer. After all, there is the i.i.d. assumption (Section 6.1): the data
D is drawn independently from some underlying “true” distribution with joint
density p∗ (x, t). If we knew this distribution, the optimal classifier would be the
one which minimizes the generalization error (or test error)
$$R(f) = P^*\{f(x) \neq t\} = \mathbb{E}^*\left[I_{\{f(x)\neq t\}}\right].$$
Of course, we don’t know p∗ (x, t), we only know the data D. But we could use
D in order to learn about p∗ (x, t), using any number of subtle ideas (an exam-
ple is probabilistic modelling, leading to maximum likelihood plug-in rules; see
Chapter 6), and this may well lead to a classifier f (x) which does not minimize
the training error R̂n (f ) well, but attains a small test error R(f ). Training error
R̂n (f ) and test error R(f ) are different numbers, which in extreme cases can
have little relationship with each other. It is perfectly possible to attain very
small, even zero, training error and at the same time run up a large test error.
This nightmare scenario for statistical machine learning is called over-fitting.
This is all not very deep and fairly intuitive. But here is the interesting part. It
is possible to predict under which circumstances over-fitting is likely to occur.
Moreover, there are automatic techniques to guard against it, one of which we
will study in this chapter.
As far as over-fitting is concerned, there is nothing special about the training
error as a learning statistic. We will see that maximum likelihood estimation is
equally affected, where the statistic to minimize is the negative log likelihood.
Before we look into over-fitting and what to do about it, let us clarify our goals.
The correct answer above is the second: we wish to find a predictor with as small
a test error as possible. The catch with this goal is that it is in general impossible
to attain. We do not know p∗ (x, t), but only have a finite dataset D drawn from
it. The next best idea seems to select a classifier which minimizes the training
error, a statistic we can compute on D. This idea is a good one in general, it
works well in many cases. Yet training error minimization has some problems which can make it fail badly in certain relevant situations. Understanding and mitigating some of these problems is the subject of this chapter.

7.1.1 Over-fitting
We have already encountered over-fitting at several places. In Section 4.1, poly-
nomial curve fitting gave absurd, yet interpolating results for too high a poly-
nomial degree. In Section 4.2.3, we noted potential difficulties when solving
the normal equations of linear regression. In Section 6.4.1, we mentioned that
maximum likelihood plug-in classification can run into trouble if Gaussian class-
conditionals come with a full covariance matrix to be estimated. Finally, at the
7.1 Generalization 111

end of Section 6.5, we observed an extreme sensitivity of our naive Bayes docu-
ment classifier to zero word counts. In this section, we expose the commonalities
between these issues.

Figure 7.1: Example of over-fitting for binary classification. The simple linear
discriminant on the left errs on a single pattern. In order to drive the training
error to zero, a more complex nonlinear discriminant is required (right). Given
the limited amount of data, the latter solution is less likely to generalize well.

Over-fitting comes about due to a mismatch$^1$ between the amount of training data on the one hand, and the choice of model parameterization and learning method on the other. It has several aspects. First, a certain model parameterization and learn-
ing procedure (for example, minimizing the training error for linear discrim-
inants) may not result in a unique solution, at least in practice. This aspect
is linked to non-identifiability (or ill-posedness) and to ill-conditioning. Non-
identifiability is easy to understand. If the family of classifiers you learn with is
so large that many different candidates attain the minimum training error (say,
zero), then the training error alone remains silent about how to choose among
them, its minimization does not identify a unique solution. Ill-conditioning is
slightly more subtle. You should think about it as “non-identifiability about to
happen”. It is often closely linked to numerical inaccuracy. Examples below will
make this clear. A second aspect of over-fitting is that often the best predictors
in hindsight, which minimize the test error, are not among those which minimize
the training error. Remember our discussion of curve fitting in Section 4.1. The
data is a stochastic sample from the “true” distribution, its points are typically
obscured by random noise. Solutions which minimize the training error are often
those which fit the noise on top of the systematic signal an optimal predictor
would uncover. The same comments apply to classification as well (Figure 7.1;
see also Section 9.2.1).
Recall polynomial curve fitting from Section 4.1, an instance of linear regression.
A polynomial y(x) = w0 + w1 x + · · · + wp−1 xp−1 of degree p − 1 is fit to n data
points D = {(xi , ti ) | i = 1, . . . , n} by way of minimizing the squared error.
$^1$ As the name suggests, over-fitting is contingent on the fact that a model is fit to data in the first place. In the Bayesian statistics approach to machine learning, over-fitting is ruled out up front, at least in principle. Bayesian machine learning will not feature much in this basic course, but see [28, 5].
Figure 7.2: Linear regression estimation with polynomials of degree $p - 1$, shown for $p = 1, 2, 4, 10$. The generating curve (green) is $\sin(2\pi x)$, the noise is Gaussian with standard deviation 0.15.

In Figure 7.2, these least squares solutions are plotted for n = 10 data points
and different numbers p of free parameters. The data comes from a smooth
curve, yet the targets ti are obscured by additive noise. In the presence of noise,
a good predictor should refrain from interpolating the points. However, as p
grows close to n, it is the interpolants which minimize the training error, no
matter how erratic they behave elsewhere. For p = n, the training error drops
to zero, all training points are fit exactly. For p > n, we face a non-identifiable
problem: infinitely many polynomials interpolate the data with training error
zero. However, even for p ≈ n, p ≤ n, the least squares solutions behave terribly.
As noted in Section 4.2.3, this is due to ill-conditioning. For $p \leq n$, the design matrix $\Phi \in \mathbb{R}^{n\times p}$ typically has full rank $p$, and the system matrix $\Phi^T\Phi$ of the normal equations is invertible. However, in particular for large $p$ and $n$, some of its eigenvalues are very close to zero. Geometrically speaking, for some directions $d$, $d^T\Phi^T\Phi d = \|\Phi d\|^2 \approx 0$. This means that our data remains silent about contributions of the weight vector along $d$. Such matrices are called ill-conditioned, and solving systems with them tends to produce solutions with large coefficients, which in turn give rise to highly erratic polynomials. In short, over-fitting occurs for least squares polynomial regression if the data is noisy and the number of parameters $p$ is overly close to $n$. We can do little about noise, but in Section 7.2 we will get to know remedies against ill-conditioning and
non-identifiability. In Figure 7.2, we can also observe the opposite problem of


under-fitting. Clearly, constant (p = 1) or affine (p = 2) functions are insufficient
to describe the systematic part of the data well. Given the data D, how can we
choose p so that fits are most likely to generalize well and avoid both over-
and under-fitting? We will take up this model selection problem in Chapter 10.
To summarize, the root of over-fitting in least squares polynomial regression is
that for p of a certain size, the n data points remain nearly silent about certain
directions d, and the training error is minimized only by growing the weight
vector dramatically along d.
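The following small sketch (Python/NumPy) reproduces the flavour of this discussion with the setup of Figure 7.2: as $p$ approaches $n$, the training error drops while the norm of the least squares weight vector typically grows dramatically. Details such as the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
x = np.linspace(0, 1, n)
t = np.sin(2 * np.pi * x) + 0.15 * rng.standard_normal(n)   # noisy targets

for p in (2, 4, 10):                        # number of polynomial weights
    Phi = np.vander(x, p, increasing=True)  # design matrix [1, x, ..., x^{p-1}]
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    train_err = 0.5 * np.sum((Phi @ w - t) ** 2)
    # the training error shrinks, while ||w|| typically grows dramatically
    print(p, train_err, np.linalg.norm(w))
```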
Our problems with the naive Bayes bag of words document classifier of Sec-
tion 6.5 come from the unruly importance attached to word counts being zero
(say, word cm does not occur in data for class k). Dictionaries for natural lan-
guage tasks grow rapidly with corpus size, and there are usually many classes.
As many zero count events happen simply by chance, the infinite sensitivity
attached to them is plain wrong. Once more, this is an over-fitting effect. For a
large dictionary size $M$ and many classes $K$, there are very many parameters $p^{(k)}_m$ and too little data to fit all of them by maximum likelihood. Small (but
nonzero) probabilities in particular are not determined well by training data,
but have a large effect on predictions. A simple remedy is called Laplace smooth-
ing: add 1 to each count N (k,m) and use the modified (or “smoothed”) counts
to estimate the word probabilities. This mitigates the zero counts artefact of
straight ML naive Bayes without any substantial negative influence.
Finally, recall the ML plug-in rule for Gaussian class-conditionals, where x ∈ Rp
(Section 6.4.1). Typically, class-conditional data has a pronounced covariance
structure, which is not captured by spherical distributions P (x|t) = N (µt , I).
This affects classification performance. If classes spread unequally in different
directions, they will overlap more along certain directions than a spherical covariance fit would make us believe. General assumptions $P(x|t) = N(\mu_t, \Sigma)$ should improve things, but now we have to estimate a full covariance matrix $\Sigma$ of about $p^2$ parameters from our data. If the training set size $n$ is not much larger than
p, plugging in the ML estimator Σ̂ (Section 6.4) can lead to poor classification
performance. Once more, the culprits are directions of small variance which are
underestimated in Σ̂, and which exercise a large effect on the final predictor.
To sum up, over-fitting happens in the absence of large enough training sets,
given all our choices. If you can get more$^2$ data, do so by any means. However,
there is a pattern in the examples discussed above. Small variations or probabil-
ities are typically underestimated from limited data. With little room to move,
they might even be set to zero (for example, a rare word cm may occur zero
times in training documents for some class k). These estimation effects happen
by chance, since our data is a random sample. Nevertheless, they exert a very
strong influence on the final predictor. Viewed this way, over-fitting is an arte-
fact of learning methodology applied to small samples, and in the next section,
we discuss a remedy. Beyond, over-fitting may come from non-optimal choice of
model size and parameterization. In Chapter 10, we will learn about techniques
to assess the suitability of our model choices, and ways to validate learned
predictors.
$^2$ However, with more data, you might also want to explore more complex and realistic models. Over-fitting will not go away with the "data deluge".


7.2 Regularization
Simple learning techniques like training error minimization or maximum like-
lihood estimation can run into serious trouble, collectively termed over-fitting.
Will we have to sacrifice their simple geometrical structure and efficient learning
algorithms and do something else altogether? Will we have to painstakingly sift
through data, identify smallish counts and treat them by hand-tuned heuris-
tics? We don’t. It turns out that with a simple modification of the standard
techniques, we can alleviate some of the most serious over-fitting issues. Impor-
tantly, this modification does not add any computational complexity. In fact,
many algorithms behave better and may converge faster in that case. This idea
is called regularization (or penalization).

Figure 7.3: Regularized polynomial curve fitting. $n = 30$ data points were drawn from a smooth curve (dashed) plus Gaussian noise. Polynomials of degree $p - 1$ are used for the fitting, where $p = 20$. The black curve is the standard least squares solution without regularization. The other curves are regularized least squares solutions with different regularization parameter values $\nu \in \{10^{-7}, 10^{-4}, 10^{-2}\}$. The $\alpha$ values in the legend are $\alpha(\nu) = \|\hat{w}_\nu\|$, where $\hat{w}_\nu = \mathrm{argmin}_w E_\nu(w)$. Notice the extreme size of $\alpha(0)$ for the non-regularized solution ($\nu = 0$). Its erratic behaviour is smoothed out in what amounts to better tradeoffs between data fit and curve complexity.

Recall how over-fitting manifests itself in polynomial curve fitting. Since some
directions are almost unconstrained by the data, contributions of the weight
vector w can become large along these. Not required to pay for it, least squares
estimation uses these degrees of freedom in order to closely fit all training points.
As these are noisy, large weights are required, and the least squares fit behaves
very erratically elsewhere (Figure 7.3, black curve). A remedy is to make LS
estimation pay for using large weights, no matter along which directions. We
can do so by adding an extra penalty term $(\nu/2)\|w\|^2$ to the squared error, where $\nu \geq 0$, ending up with a different criterion function
$$E_\nu(w) = \underbrace{\frac{1}{2}\|\Phi w - t\|^2}_{\text{error function}} + \underbrace{\frac{\nu}{2}\|w\|^2}_{\text{regularization term}}. \qquad (7.1)$$

The procedure of finding the weights w by minimizing Eν (w) is known as


(Tikhonov) regularized least squares estimation (or penalized least squares es-
timation). Eν (w) is the sum of the usual squared error function and a second
term, called regularization term (or regularizer, or penalization term, or penal-
izer). The constant ν ≥ 0 is known as regularization constant. In Figure 7.3,
regularized least squares estimates $\hat{w}_\nu^T\phi(x)$ are shown for different values of $\nu$, where $\hat{w}_\nu = \mathrm{argmin}_w E_\nu(w)$. Obviously, curves become smoother the larger $\nu$ is. This makes sense. The larger $\nu$, the higher the price to pay (in terms of size of $E_\nu(w)$) for large weights. Therefore, $\|\hat{w}_\nu\|$ will decrease as $\nu$ increases. On the
other hand, the training squared error alone increases as ν increases. We estab-
lish these points rigorously in Section 7.2.2.
The regularized least squares problem implies modified normal equations:
$$\left(\Phi^T\Phi + \nu I\right) w = \Phi^T t.$$
This is because the quadratic term $(1/2)w^T\Phi^T\Phi w$ in $E(w)$ is extended by adding $(\nu/2)w^T w$. Recalling Section 4.2.3, we see that regularization also improves the conditioning of this problem. As analyzed in Section 7.2.2, the effect of regularization is to add $\nu$ to all eigenvalues of $\Phi^T\Phi$, in particular to lift the tiny ones at the lower end of the spectrum. Regularization does not only smooth out solutions to make them behave less erratically; regularized problems can also be solved more robustly in practice.
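A minimal sketch (Python/NumPy; the helper name `ridge_fit` and the toy data are hypothetical, not part of the text) of regularized least squares via the modified normal equations:

```python
import numpy as np

def ridge_fit(Phi, t, nu):
    """Minimize E_nu(w) of (7.1) by solving (Phi^T Phi + nu I) w = Phi^T t."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + nu * np.eye(p), Phi.T @ t)

# Toy polynomial setup: larger nu gives smaller ||w_nu|| and smoother fits
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
t = np.sin(2 * np.pi * x) + 0.15 * rng.standard_normal(30)
Phi = np.vander(x, 20, increasing=True)
for nu in (1e-7, 1e-4, 1e-2):
    print(nu, np.linalg.norm(ridge_fit(Phi, t, nu)))
```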
Moreover, our analysis in Section 7.2.2 shows that regularization tackles over-
fitting at its roots. Comparing the standard least squares solution ŵ0 with ŵν ,
it is not just that kŵν k is smaller than kŵ0 k. ŵν is shrunk compared to ŵ0
along all directions d, but shrinkage is most pronounced along such directions d
which are least determined by the data. The large effect of such poorly estimated
directions on the least squares solution is diminished.
How do we choose the regularization parameter ν? In a way, this is just one more
model choice, along with the number p of weights or aspects of the feature map
φ(x), and model selection techniques discussed in Chapter 10 can be applied.
Since ν is a single parameter with an obvious effect on the smoothness of the
prediction, another common approach in data analysis is to inspect curves for
different values of ν and to choose it by hand.
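If one prefers an automatic choice, a simple option is to evaluate a grid of candidate values on held-out data and keep the one with the smallest validation error. A sketch under these assumptions (Python/NumPy; the function name and arguments are made up for illustration):

```python
import numpy as np

def choose_nu(Phi_train, t_train, Phi_val, t_val, nu_grid):
    """Return the nu from nu_grid with smallest squared error on validation data."""
    p = Phi_train.shape[1]
    best_nu, best_err = None, np.inf
    for nu in nu_grid:
        w = np.linalg.solve(Phi_train.T @ Phi_train + nu * np.eye(p),
                            Phi_train.T @ t_train)
        err = 0.5 * np.sum((Phi_val @ w - t_val) ** 2)
        if err < best_err:
            best_nu, best_err = nu, err
    return best_nu, best_err
```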
To conclude, in order to regularize an estimation method which is based on
minimizing an error function (for example the squared error for linear regres-
sion, or the training error or perceptron error function for a linear classifier), we
add a regularization term which grows with the sizes of the weights. Due to the
presence of this term, regularized estimation will prefer a solution with small
weights (and therefore smooth functions) if it fits the training data equally well
or even a little less well than another solution with large weights. No matter what
the error function, by far the most commonly used regularizer is the squared
Euclidean norm $(\nu/2)\|w\|^2$. If this term is used, we speak of Tikhonov regu-


larization. The same idea is behind Wiener filtering in signal processing, ridge
regression in statistics, and weight decay regularization in the neural networks
literature.
Tricks of the trade of regularization in practice are numerous and out of the
scope of this course. If you are working with multi-layer perceptrons, you should
be aware of them. Both [5, ch. 5.5] and [4, ch. 9.2] give a good overview.


Figure 7.4: Illustration of over-fitting and early stopping. Shown are error on
the training dataset (left) and on an independent validation dataset not used
for training (right), for a MLP applied to a regression problem (details in [5],
Figure 5.12), the horizontal unit is number of gradient descent iterations. The
training error curve is monotonically decreasing as expected. In contrast, the
validation error curve drops only up to a point, after which it increases. Early
stopping corresponds to monitoring the error on a validation set, terminating
MLP training once this statistic starts to increase.
Figure from [5] (used with permission).

7.2.1 Early Stopping

There are other techniques to keep over-fitting at bay. One simple technique,
early stopping, is frequently used with multi-layer perceptrons or other neural
network models. Early stopping is based on monitoring an estimate of the test
error R(f ) alongside training (minimization of R̂n (f )). To do so, we hold out
some of our data exclusively for the purpose of validation, split our data into
a training set DT and a validation set DV . Since the latter is never used for
training, the empirical error R̂(fˆ; DV ) provides a reliable estimate of R(fˆ) even
as fˆ is fitted to DT . A typical MLP training run is shown in Figure 7.4. By
definition, the training error R̂(fˆ, DT ) (left panel) decreases monotonically. In
contrast, the validation error R̂(fˆ; DV ) does so only up to a point, after which
it begins to increase. We can stop the MLP training early at this point, since
any further decrease in training error does not lead to a decrease in R̂(fˆ; DV ).
Compared to regularization by adding a complexity penalty term (for example,
a Tikhonov squared norm), early stopping has the advantage of not changing
the standard training procedure at all, so existing code does not have to be
modified. In contrast to penalization, whose success relies on a good choice of
the regularization constant, early stopping is free of parameters. Moreover, it


can sometimes speed up training. Its chief drawback is that a part of the data
has to be sacrificed for validation and cannot be used to learn the weights. It
is tempting to choose only a few points for DV , but in this case, R̂(fˆ; DV ) does
not represent the test error R(f ) well enough, which defies the whole purpose.
Typically, at least 20% of the data should be used for validation. Moreover, early
stopping can be difficult to use and is hard to automate. It can happen that
R̂(fˆ; DV ) increases for some iterations, then continues to drop. Finally, notice
that early stopping does not modify the training procedure as such, therefore
does not help with ill-conditioning, apart from stopping at the first sign of things
going wrong.
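The bookkeeping behind early stopping is simple. The following sketch (Python/NumPy; the "patience" rule is one common heuristic and not prescribed by the text, and the function arguments are placeholders) monitors a validation error alongside gradient descent and returns the best weights seen:

```python
import numpy as np

def train_with_early_stopping(w0, grad_fn, val_error_fn, lr=0.1,
                              max_iters=1000, patience=10):
    """Gradient descent on the training set, stopped once the validation error
    has not improved for `patience` consecutive iterations."""
    w = w0.copy()
    best_w, best_val, wait = w.copy(), np.inf, 0
    for _ in range(max_iters):
        w = w - lr * grad_fn(w)          # one descent step on the training error
        val = val_error_fn(w)            # error on the held-out validation set
        if val < best_val:
            best_w, best_val, wait = w.copy(), val, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_w, best_val
```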

7.2.2 Regularized Least Squares Estimation (*)


In this section, we obtain insight into Tikhonov regularization for least squares
estimation by working out some properties. We focus on linear regression (for
example, polynomial curve fitting). The minimizer of the regularized squared
error is given by
$$\hat{w}_\nu = \left(\Phi^T\Phi + \nu I\right)^{-1}\Phi^T t.$$
The main message is this. The norm of ŵν is shrunk towards zero as ν gets
larger. This shrinkage does not happen uniformly, but ŵ0 (the standard least
squares solution) is shrunk more along directions which are less well determined
by training data. This shrinkage alleviates the erratic behaviour of the least
squares estimate, it has to be paid for by an increase in the training set squared
error.
Let us compare the standard least squares solution ŵ0 against the regularized
ŵν . To do so, we will expand the weight vectors in the eigenbasis of the relevant
system matrix $\Phi^T\Phi$. Let
$$\Phi^T\Phi = U\Lambda U^T$$
be the eigendecomposition of $\Phi^T\Phi$. Here, $U = [u_j] \in \mathbb{R}^{p\times p}$ is an orthonormal matrix, $U^T U = I$, whose columns $u_j$ are the eigenvectors, and $\Lambda = \mathrm{diag}[\lambda_j]$ is a diagonal matrix of the eigenvalues $\lambda_1 \leq \dots \leq \lambda_p$. If you are feeling lost about all things eigen, you might want to skip to Section 11.1.2, or otherwise return to the present section at a later time. Now, $\{u_j\}$ is an orthonormal basis of $\mathbb{R}^p$, so we can represent each $\hat{w}_\nu$ as a linear combination of the eigenvectors:
$$\hat{w}_\nu = \sum_{j=1}^{p} \beta_{\nu,j} u_j, \qquad \hat{y}_\nu(x) = \hat{w}_\nu^T\phi(x) = \sum_{j=1}^{p} \beta_{\nu,j}\, \phi(x)^T u_j.$$

The predictor ŷν (x) is a weighted sum of the basis functions φ(x)T uj , each be-
ing the inner product between the feature map φ(x) and the eigendirection uj .
At least for large p, these basis functions differ dramatically when we evaluate
them at the training points $\{x_i\}$ only. Namely,
$$\lambda_j = u_j^T\Phi^T\Phi u_j = \|\Phi u_j\|^2.$$

For large p, the smallest eigenvalue λ1 is typically very close to zero, so that
Φu1 ≈ 0, therefore φ(xi )T u1 ≈ 0 for all i = 1, . . . , n. In contrast, the basis
function φ(x)T up for the largest eigenvalue makes sizeable contributions at


the training points. Importantly, these differences manifest themselves at the
training points only: as a polynomial with weights of unit norm, φ(x)T u1 is of
healthy size elsewhere.
What does this mean? If φ(x)T u1 is to contribute significantly to the fit of the
training data, then β0,1 has to be very large, which is precisely what happens
for the least squares solution ŵ0 , in its effort to minimize the training error. But
then, ŷ0 (x) sports a very large component β0,1 φ(x)T u1 , which explains its er-
ratic behaviour away from the training data. To understand what regularization
does about this, we need to work out the βν,j coefficients.

$$\Phi^T\Phi + \nu I = U\Lambda U^T + \nu U U^T = U(\Lambda + \nu I)U^T, \qquad \left(\Phi^T\Phi + \nu I\right)^{-1} = U(\Lambda + \nu I)^{-1}U^T,$$
and
$$\hat{w}_\nu = \left(\Phi^T\Phi + \nu I\right)^{-1}\Phi^T t = U(\Lambda + \nu I)^{-1}\tilde{y} = \sum_{j=1}^{p} \frac{\tilde{y}_j}{\lambda_j + \nu} u_j, \qquad \tilde{y} = U^T\Phi^T t.$$
Therefore,
$$\beta_{0,j} = \frac{\tilde{y}_j}{\lambda_j}, \qquad \beta_{\nu,j} = \frac{\tilde{y}_j}{\lambda_j + \nu} = \frac{\lambda_j}{\lambda_j + \nu}\,\beta_{0,j}.$$
Regularization with $\nu > 0$ transforms $\beta_{0,j}$ (standard least squares) to $\beta_{\nu,j}$. Obviously, $|\beta_{\nu,j}| < |\beta_{0,j}|$, but the amount of shrinkage depends strongly on $\lambda_j$:
$$\frac{\lambda_j}{\lambda_j + \nu} = \begin{cases} \approx 1, & \lambda_j \gg \nu \text{ (well determined by data)} \\ \ll 1, & \lambda_j \ll \nu \text{ (poorly determined by data)} \end{cases}.$$
Therefore, regularization precisely counteracts the erratic behaviour of the standard least squares solution. Put differently, $\beta_{0,j}$ is inversely proportional to $\lambda_j$, explaining the very large contributions of the smallest eigendirections to $\hat{w}_0$ and $\hat{y}_0(x)$. Regularization exchanges the denominator for $\lambda_j + \nu$, uniformly limiting the influence of these poorly determined directions. In contrast, for well determined directions with $\lambda_j \gg \nu$, regularization hardly makes a difference.
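These shrinkage factors are easy to verify numerically. The sketch below (Python/NumPy, random toy data for illustration) computes $\hat{w}_\nu$ both directly and via the eigendecomposition, and checks $\beta_{\nu,j} = \frac{\lambda_j}{\lambda_j+\nu}\beta_{0,j}$:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 8))
t = rng.standard_normal(30)
nu = 0.5

lam, U = np.linalg.eigh(Phi.T @ Phi)   # eigenvalues (ascending) and eigenvectors
y_tilde = U.T @ (Phi.T @ t)

beta_0 = y_tilde / lam                 # coefficients of the plain LS solution
beta_nu = y_tilde / (lam + nu)         # shrunk coefficients

w_direct = np.linalg.solve(Phi.T @ Phi + nu * np.eye(8), Phi.T @ t)
print(np.allclose(w_direct, U @ beta_nu))               # True
print(np.allclose(beta_nu, beta_0 * lam / (lam + nu)))  # shrinkage factors
```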
The expansion of $\hat{w}_\nu$ in the orthonormal eigendirections $u_j$ implies that for $0 \leq \nu_1 < \nu_2$:
$$\|\hat{w}_{\nu_2}\|^2 = \sum_{j=1}^{p}\beta_{\nu_2,j}^2 = \sum_{j=1}^{p}\frac{\tilde{y}_j^2}{(\lambda_j + \nu_2)^2} < \sum_{j=1}^{p}\frac{\tilde{y}_j^2}{(\lambda_j + \nu_1)^2} = \sum_{j=1}^{p}\beta_{\nu_1,j}^2 = \|\hat{w}_{\nu_1}\|^2.$$
The first equality uses the orthonormality $u_i^T u_j = I_{\{i=j\}}$. On the other hand, the squared training error increases with $\nu$:
$$\|\Phi\hat{w}_{\nu_2} - t\|^2 > \|\Phi\hat{w}_{\nu_1} - t\|^2.$$
To establish this, assume that the opposite holds, namely $\|\Phi\hat{w}_{\nu_2} - t\|^2 \leq \|\Phi\hat{w}_{\nu_1} - t\|^2$. Then,
$$E_{\nu_1}(\hat{w}_{\nu_2}) = E_0(\hat{w}_{\nu_2}) + \frac{\nu_1}{2}\|\hat{w}_{\nu_2}\|^2 < E_0(\hat{w}_{\nu_2}) + \frac{\nu_1}{2}\|\hat{w}_{\nu_1}\|^2 \leq E_0(\hat{w}_{\nu_1}) + \frac{\nu_1}{2}\|\hat{w}_{\nu_1}\|^2 = E_{\nu_1}(\hat{w}_{\nu_1}),$$
which contradicts the optimality of ŵν1 . To conclude, a stronger penalty term


leads to strictly smaller weights and a smoother predictor ŷν (x), but also implies
a worse training set fit.

7.3 Maximum A-Posteriori Estimation


Regularization can help to alleviate over-fitting problems which can affect lin-
ear regression and classification. How does this principle generalize to maximum
likelihood estimation? We could add a Tikhonov regularizer to the negative log
likelihood and see what we get. Apart from being unfounded, this approach
creates several problems. First, a quadratic penalizer does not make much sense
on probability distributions or covariance matrices. Second, the simple and el-
egant MLE solutions (ratios of counts, empirical covariance) do not carry over
to such a modification, and we would have to solve tedious optimization prob-
lems. In this section, we introduce a probabilistic viewpoint on regularization
of ML estimation, which not only helps to construct regularizers in an informed
way, but whose modified optimization problems are typically solved in much
the same way as their MLE counterparts. This framework is called maximum a-
posteriori (MAP) estimation, and regularizers correspond to negative log prior
distributions over parameters.
Recall the first ML estimation example we came across earlier, featuring the
thumbtack you found in your drawer (Section 6.2). Curious about the probability
of landing point up, p1 = P {x = 1}, you could use ML estimation: throw it
n = 100 times, collect data D = {x1 , . . . , x100 }, and maximize the likelihood
function
$$P(D|p_1) = (p_1)^{n_1}(1 - p_1)^{n-n_1}, \qquad n_1 = \sum_{i=1}^{n} x_i,$$

resulting in the ML estimator p̂1 = n1 /n. However, having enjoyed some training
in physics, you take a good look at the thumbtack and convince yourself that
its shape implies that p1 > 1/2 is substantially more probable than p1 < 1/2
(“if not, I will eat my hat”). It is not that you dare to pin down the value p1
from first principles. But you are more or less certain about some properties of
p1 . You could formulate your thoughts in a probability distribution over p1 .
Wait a second. “More probable”, “probability distribution”? p1 is not random,
it’s just a parameter we don’t know. Recall our introduction of probability
in Section 5.1. We do not care whether p1 is “random” or just an unknown
parameter (whatever this distinction may mean). We are uncertain about it,
and that is enough. We encode our uncertainty in the precise value of p1 by
treating it as random variable. In other words, we can maintain a distribution
over p1 in the same way as we maintain one over x. The latter is a condi-
tional distribution P (x|p1 ). If you plug in data D, this becomes the likelihood
$P(D|p_1) = \prod_i P(x_i|p_1)$. The former$^3$ is $p(p_1)$, called prior distribution over $p_1$.

If this vocabulary reminds you of decision theory (Section 5.2), you are on the
right track. Back there, we predict a class label t from an observed input point
x as follows. We know the class-conditional distributions p(x|t) and the class
$^3$ $p(p_1)$ is a density, since $p_1 \in [0, 1]$ is continuous.
prior P (t). We determine the class posterior P (t|x) = p(x|t)P (t)/p(x) by Bayes’
formula, then predict f ∗ (x) = argmaxt P (t|x). The normalization by p(x) does
not matter, so that f ∗ (x) = argmaxt p(x|t)P (t). Here, we observe data D and
want to predict the probability p1 . We determine the posterior distribution
$$p(p_1|D) = \frac{P(D|p_1)p(p_1)}{P(D)}, \qquad P(D) = \int P(D|p_1')p(p_1')\,dp_1'$$

and predict (or estimate)
$$\hat{p}_1^{\mathrm{MAP}} = \underset{p_1\in[0,1]}{\mathrm{argmax}}\; p(p_1|D) = \underset{p_1\in[0,1]}{\mathrm{argmax}}\left\{P(D|p_1)p(p_1)\right\},$$
the maximum point of the posterior, the maximum a-posteriori (MAP) estimator. How does this differ from ML estimation? As usual, we minimize the negative log:
$$-\log\left\{P(D|p_1)p(p_1)\right\} = \underbrace{-\log P(D|p_1)}_{\text{error function}} + \underbrace{\left(-\log p(p_1)\right)}_{\text{regularization term}}.$$

The MAP criterion is the sum of the negative log likelihood and the negative
log prior. The first quantifies data fit and plays the role of an error function.
The second is a regularizer. MAP estimation is a form of regularized estimation,
obtained by choosing the negative log prior as regularizer.

Figure 7.5: Density functions of the Beta distribution Beta(α, β) for different
values (α, β).
Figures from wikipedia (used with permission). Copyright by Krishnavedala,
Creative Commons Attribution-Share Alike 3.0 Unported license.

What is a good prior p(p1 )? In principle, we can choose any distribution we like.
The choice of a prior is simply a part of the choice of a probabilistic model. But
certain families of models are easier to work with than others, and the same
holds for prior distributions. For our thumbtack example, a prior from the Beta
family Beta(α, β) is most convenient:

$$p(p_1|\alpha,\beta) = \frac{1}{B(\alpha,\beta)}(p_1)^{\alpha-1}(1-p_1)^{\beta-1}I_{\{p_1\in[0,1]\}}, \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \qquad \alpha,\beta > 0.$$

Here, Γ(x) is Euler’s Gamma function (Section 6.3.2). Before we move on, a gen-
eral hint about working with probability distributions. Never write, or even try
to remember, the normalization constants. What matters is the part depending
on p1 . Memorize p(p1 |α, β) ∝ (p1 )α−1 (1 − p1 )β−1 as Beta(α, β) density. Here,
A ∝ B reads “A proportional to B”, or A = CB for some constant C > 0. If you
need the normalization constant at all (you usually do not), you can look it up.
In particular, keeping normalization constants up to date during a derivation of
a posterior is a waste of time and an unnecessary source of mistakes.
Back to Beta(α, β). It is a distribution over a probability p1 ∈ [0, 1]. Its mean
is E[p1 ] = α/(α + β). Density functions for a range of (α, β) values are shown
in Figure 7.5. The larger the sum α + β, the more peaked the distribution. The
density is bounded above only if α ≥ 1, β ≥ 1. Beta(1, 1) is a special case you
already know: the uniform distribution over [0, 1]. For α > 1, β > 1, the unique
mode (maximum point) of the density is (α − 1)/(α + β − 2), which is always
further away from 1/2 than the mean, unless α = β. Beta(α, β) is symmetric
around 1/2 if and only if α = β. What is special about this family? You may have
noticed the similarity between p(p1 |α, β) and the likelihood P (D|p1 ). Indeed,

$$P(D|p_1)p(p_1|\alpha,\beta) \propto (p_1)^{n_1}(1-p_1)^{n-n_1} \times (p_1)^{\alpha-1}(1-p_1)^{\beta-1} \propto (p_1)^{\alpha+n_1-1}(1-p_1)^{\beta+n-n_1-1}.$$

In our first exercise in dropping normalization constants, we directly conclude


that the posterior $p(p_1|D)$ must be $\mathrm{Beta}(\alpha + n_1, \beta + n - n_1)$. That is because $p(p_1|D) \propto (p_1)^{\alpha+n_1-1}(1-p_1)^{\beta+n-n_1-1}$, and there is only one density of this form (of course, we have to check that both $\alpha + n_1$ and $\beta + n - n_1$ are positive). The posterior is Beta again! All the data is doing is changing the constants to $\alpha_n = \alpha + n_1$, $\beta_n = \beta + n - n_1$. Since we know what the mode of a Beta density is, we can look up the MAP estimator:
$$\hat{p}_1^{\mathrm{MAP}} = \frac{\alpha + n_1 - 1}{\alpha + \beta + n - 2}.$$
This can be seen as a convex combination between the ML estimate $\hat{p}_1$ and the MAP estimate based on the prior $p(p_1)$ alone:
$$\hat{p}_1^{\mathrm{MAP}} = \frac{\alpha+\beta-2}{\alpha+\beta-2+n}\cdot\frac{\alpha-1}{\alpha+\beta-2} + \frac{n}{\alpha+\beta-2+n}\cdot\frac{n_1}{n} = \kappa\cdot\frac{\alpha-1}{\alpha+\beta-2} + (1-\kappa)\cdot\hat{p}_1, \qquad \kappa = \frac{\alpha+\beta-2}{\alpha+\beta-2+n}.$$
If the sample size n is not too large compared to α + β − 2, the estimator is
pulled away from p̂1 towards the prior mode. For example, to incorporate your
insight that p1 should rather be > 1/2, you would specify α > β, and p̂1 is
shifted towards the prior mode > 1/2.
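As a toy illustration (Python/NumPy; the prior parameters and the simulated data are made up), the MAP estimator can be computed and compared against the convex-combination form above:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 6.0, 2.0            # Beta(6, 2) prior: mode 5/6, encodes p1 > 1/2
x = rng.random(100) < 0.7         # 100 simulated throws with "true" p1 = 0.7
n, n1 = x.size, x.sum()

p1_ml = n1 / n
p1_map = (alpha + n1 - 1) / (alpha + beta + n - 2)
kappa = (alpha + beta - 2) / (alpha + beta - 2 + n)
# the MAP estimate equals the convex combination of prior mode and MLE
print(p1_ml, p1_map, kappa * (alpha - 1) / (alpha + beta - 2) + (1 - kappa) * p1_ml)
```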
Figuring out thumbtack probabilities is just a simple example of what turns


out to be a very useful framework. It is often much easier to formulate one’s
prior belief about a parameter in a distribution than to come up with a useful
regularizer from scratch. Examples are given in Section 7.3.1, where we rederive
Laplace smoothing as MAP estimation and show how to regularize the MLE for
a Gaussian covariance matrix.
ML estimation can often be interpreted as MAP estimation with a particular
prior. In the thumbtack example, MLE corresponds to MAP estimation with
prior Beta(1, 1): p(p1 ) = I{p1 ∈[0,1]} , the uniform distribution. This is a fairly
benign assumption, and MLE is unproblematic in this case. But in other cases
(Section 7.3.1), MLE’s lack of a prior runs the estimation method into serious
trouble, and MAP regularization can save the day.

Final Comments for Curious Readers (*)

Finally, some more advanced comments, which can be skipped at first reading.
ML estimation was motivated by treating the likelihood P (D|p1 ) as a scoring
function to assess data fit, whose maximization w.r.t. p1 intuitively makes sense.
However, the posterior p(p1 |D) is a distribution over p1 . Why should the pos-
terior be particularly well represented by its mode? Why not for example the
mean E[p1 | D]? Our analogy with decision theory is useful, in that it motivates
the role of the prior p(p1 ), but it is not perfect. p1 is not a class label from a
finite set, but a continuous probability. Why not output all of p(p1 |D) as result,
given that it is just a Beta distribution with two parameters? These questions
are resolved in the Bayesian approach to machine learning, the ultimately cor-
rect way to implement decision theory from finite data, whereas ML or MAP
estimation are computationally attractive shortcuts. Bayesian machine learning
will not feature much in this course, but [28, 5] provide good introductions.
If MAP corresponds to regularized ML estimation, then Beta(α, β) implies the
following regularizer:

$$-\log p(p_1|\alpha,\beta) = -(\alpha-1)\log p_1 - (\beta-1)\log(1-p_1) = (\alpha+\beta-2)\left\{-q_1\log p_1 - (1-q_1)\log(1-p_1)\right\}, \qquad q_1 = \frac{\alpha-1}{\alpha+\beta-2},$$

dropping an additive constant. Here, q1 is the mode of p(p1 |α, β). This is cer-
tainly not a quadratic function of p1 . What is it then? The attentive reader may
remember the pattern from Section 6.5.3, which allows us to write

$$-\log p(p_1|\alpha,\beta) = (\alpha+\beta-2)\,D[(q_1, 1-q_1)\,\|\,(p_1, 1-p_1)],$$
dropping another additive constant. We can interpret $\alpha+\beta-2$ as regularization constant. The larger $\alpha+\beta-2$, the more peaked the regularizer is around the mode $q_1$. The special role of this constant for the convex combination of $\hat{p}_1^{\mathrm{MAP}}$ from $q_1$ and the MLE $\hat{p}_1$ becomes apparent now. Second, the regularization term is simply the relative entropy between the prior mode $q_1$ and $p_1$, viewed as binary distributions.
7.3.1 Examples of Conjugate Prior Distributions (*)

In this section, we provide further examples of MAP estimation. We focus on


models where ML estimation is prone to over-fitting (Section 7.1), and MAP
estimation can be seen as a particular form of regularization. Recall the thumb-
tack example. Using a Beta prior distribution proved convenient there: no matter
what the data, the posterior is Beta again. Such prior distributions are called
conjugate (for the likelihood), and the examples in this section are of this type
as well.

Figure 7.6: Density plots of Dirichlet distributions over the probability simplex $\Delta_3$, with corners $x = [1, 0, 0]$, $y = [0, 1, 0]$, $z = [0, 0, 1]$. Note how $\sum_m \alpha_m$ determines the concentration.
Figure from wikipedia (used with permission). Released into public domain by en:User:ThG (no license).

Recall our naive Bayes document classifier from Section 6.5. Its over-fitting issues
have been discussed in Section 7.1.1, where we noted a surprisingly simple and
widely used fix: Laplace smoothing, add 1 to each count and move on. Compare
this to MAP estimation for the thumbtack example, where we added α and β
to the counts n1 and n − n1 . Maybe Laplace smoothing is MAP estimation with
a prior much like Beta? Abstracting from naive Bayes details, suppose that
$$p \in \Delta_M = \left\{ q \,\middle|\, q_m \geq 0,\ m = 1, \dots, M,\ \sum_{m=1}^{M} q_m = 1 \right\}.$$

$\Delta_M$ is the $M$-dimensional probability simplex (Section 6.5). What the Beta distribution is for $\Delta_2$, the Dirichlet distribution is for general $\Delta_M$:
$$\mathrm{Dir}(p|\alpha) \propto \prod_{m=1}^{M} (p_m)^{\alpha_m - 1}\, I_{\{p\in\Delta_M\}}, \qquad \alpha_m > 0,\ m = 1, \dots, M.$$

The normalization constant is not important for our purposes here. Densities for different Dirichlet distributions over $\Delta_3$ are shown in Figure 7.6. Its properties parallel those of the Beta. The mean is $\mathbb{E}[p|\alpha] = \alpha/(\mathbf{1}^T\alpha)$, and $\mathbf{1}^T\alpha = \sum_{m=1}^{M}\alpha_m$ determines the concentration of the density. $\mathrm{Dir}(p|\mathbf{1})$ is the uniform distribution over $\Delta_M$. If any $\alpha_m < 1$, the density grows unboundedly as $p \to \delta_m$. If all $\alpha_m > 1$, the density has a unique mode at $(\alpha - \mathbf{1})/(\mathbf{1}^T\alpha - M)$. Most importantly, the Dirichlet family is conjugate for the likelihood $P(D|p) = \prod_{m=1}^{M}(p_m)^{N^{(m)}}$ of i.i.d. data sampled from $p$. If the prior is $p(p) = \mathrm{Dir}(p|\alpha)$, then the posterior
$$p(p|D) \propto \prod_{m=1}^{M}(p_m)^{N^{(m)}+\alpha_m-1}\, I_{\{p\in\Delta_M\}} \propto \mathrm{Dir}(p|\alpha + [N^{(m)}])$$

is Dirichlet again, with counts $[N^{(m)}]$ added to $\alpha$. If $N = \sum_m N^{(m)}$ is the sum of counts, the concentration grows by $N$. The MAP estimator is
$$\hat{p}^{\mathrm{MAP}} = \frac{1}{\mathbf{1}^T\alpha + N - M}\left[N^{(m)} + \alpha_m - 1\right].$$
This is the same as the ML estimator if $\alpha = \mathbf{1}$, the uniform distribution over $\Delta_M$. On the other hand, Laplace smoothing is obtained by setting $\alpha = 2\cdot\mathbf{1}$. Compared to the MLE, the MAP estimator is shrunk towards $\mathbf{1}/M \in \Delta_M$, which alleviates over-fitting issues due to zero counts. Variants of Laplace smoothing used in practice correspond to MAP estimation with different $\alpha$, which may even be learned from data.
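A tiny sketch (Python/NumPy, made-up counts) of the Dirichlet MAP estimator, showing how $\alpha = 2\cdot\mathbf{1}$ reproduces Laplace smoothing and removes the zero-probability artefact:

```python
import numpy as np

N_m = np.array([12, 5, 0, 3])   # made-up word counts for one class; one zero count
M, N = N_m.size, N_m.sum()

p_ml = N_m / N                                     # ML: word 3 gets probability zero
alpha = 2 * np.ones(M)                             # Dirichlet(2, ..., 2) prior
p_map = (N_m + alpha - 1) / (alpha.sum() + N - M)  # = (N_m + 1) / (N + M), Laplace
print(p_ml)    # [0.6  0.25 0.   0.15]
print(p_map)   # strictly positive entries, still sums to one
```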

The Wishart Distribution (*)

Our second example is concerned with covariance estimation. To keep things


simple, assume that data $x_i \in \mathbb{R}^p$ is drawn i.i.d. from $N(0, \Sigma)$, and we wish to estimate the covariance matrix $\Sigma \in \mathbb{R}^{p\times p}$. The ML estimator is
$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T.$$

Note that this lacks the $-\bar{x}\bar{x}^T$ term of the standard sample covariance, since the mean is fixed to zero here. If $p$ is comparable to the sample size $n$, then $\hat{\Sigma}$ is ill-conditioned, and the ML plug-in classifier is adversely affected (Section 7.1.1). One remedy is to use MAP estimation instead, with a Wishart prior distribution. Let us focus on the precision matrix $P = \Sigma^{-1}$. The likelihood is
$$p(D|P) \propto |P|^{n/2} e^{-\frac{1}{2}\sum_{i=1}^{n} x_i^T P x_i} = |P|^{n/2} e^{-\frac{n}{2}\mathrm{tr}\,\hat{\Sigma}P}.$$

Here, we used manipulations involving the trace which were introduced in Sec-
tion 6.4.3. The Wishart family contains distributions over symmetric positive
definite matrices $P \succ 0$ (Section 6.3). The Wishart distribution with $\alpha > p - 1$ degrees of freedom and mean $V \succ 0$ has the density
$$W(P|V, \alpha) \propto |P|^{(\alpha-p-1)/2} e^{-\frac{\alpha}{2}\mathrm{tr}\,V^{-1}P}\, I_{\{P\succ 0\}}.$$

This is of course conjugate to our Gaussian likelihood for the precision matrix.
For the Wishart prior p(P ) = W (I, α) with mean I, the posterior is
$$p(P|D) \propto |P|^{(\alpha+n-p-1)/2} e^{-\frac{\alpha}{2}\mathrm{tr}\,P - \frac{n}{2}\mathrm{tr}\,\hat{\Sigma}P} \propto |P|^{(\alpha+n-p-1)/2} e^{-\frac{\alpha+n}{2}\mathrm{tr}\left(\frac{\alpha}{\alpha+n}I + \frac{n}{\alpha+n}\hat{\Sigma}\right)P},$$
Wishart with $\alpha + n$ degrees of freedom and mean $(\alpha I + n\hat{\Sigma})/(\alpha + n)$. Its mode is the inverse of
$$\hat{\Sigma}^{\mathrm{MAP}} = \frac{1}{\alpha+n-p-1}\left(\alpha I + n\hat{\Sigma}\right),$$
derived as in Section 6.4.3. If $\lambda_1 \leq \dots \leq \lambda_p$ are the eigenvalues of $\hat{\Sigma}$, then $\hat{\Sigma}^{\mathrm{MAP}}$ has the same eigenvectors, but eigenvalues
$$\lambda_j^{\mathrm{MAP}} = \frac{\alpha + n\lambda_j}{\alpha+n-p-1}.$$

In particular, $\lambda_1^{\mathrm{MAP}} \geq \alpha/(\alpha+n-p-1)$. Eigenvalues are bounded away from zero, alleviating the over-fitting problems of the plug-in classifier. To conclude, the MAP estimator with a Wishart prior with mean $I$ is obtained from the standard sample covariance matrix by regularizing the eigenspectrum. Such spectral regularizers are widely used in machine learning.
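The effect on the eigenspectrum can be seen directly in a small sketch (Python/NumPy; the dimensions and the value of $\alpha$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 12                        # dimension comparable to the sample size
X = rng.standard_normal((n, p))      # toy data, zero mean, true covariance I

Sigma_ml = (X.T @ X) / n             # ML estimator (mean fixed to zero)
alpha = p + 2.0                      # prior degrees of freedom, must exceed p - 1
Sigma_map = (alpha * np.eye(p) + n * Sigma_ml) / (alpha + n - p - 1)

print(np.linalg.eigvalsh(Sigma_ml).min())   # small: poorly determined directions
print(np.linalg.eigvalsh(Sigma_map).min())  # >= alpha / (alpha + n - p - 1)
```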
Chapter 8

Conditional Likelihood.
Logistic Regression

The machine learning methods we encountered so far can be classified into two
different groups. We can pick some error function and function class, then opti-
mize for the error-minimizing hypothesis within the class. In order to alleviate
over-fitting problems, the error function can be augmented by a regularization
term. Examples for this approach include perceptron classification, least squares
estimation or multi-layer perceptron learning. Alternatively, we can capture our
uncertainty in relevant variables by postulating a probabilistic model, then use
probability calculus for prediction or decision making. The most important ideas
in this context are maximum likelihood estimation (Chapter 6) and maximum
a-posteriori estimation (Section 7.3). Both paradigms come with strengths and
weaknesses. Probabilistic modelling is firmly grounded on decision theory (Sec-
tion 5.2), and it is more “user-friendly”. As noted in Section 5.1, it is natural for
us to formulate our ideas about a setup in terms of probabilities and conditional
independencies among them, while constructing some regularizer or feature map
from scratch is unintuitive business. On the other hand, a probabilistic model
forces us to specify each and every relationship between variables in a sound
way, which can be more laborious than fitting some classifier to the data.

In this chapter, we will work out a model-based perspective on methodology


in the first group, thereby unifying the paradigms under the umbrella of prob-
abilistic modelling. To this end, we will introduce discriminative models and
conditional maximum likelihood estimation, providing a common basis for lin-
ear classification and regression, least squares estimation, and training MLPs.
We will understand what the squared error means in probabilistic terms and de-
vise alternative error functions which can work better in practice. Discriminative
modelling does for error function minimization what generative modelling does
for data analysis. It provides a natural and automatic route from prior knowl-
edge about the structure of a problem and the noise corrupting our observations
to the optimization problems to be solved in practice.


8.1 Conditional Maximum Likelihood


We do like the squared error function
$$E_{\mathrm{sq}}(w) = \frac{1}{2}\sum_{i=1}^{n}\left(y(x_i; w) - t_i\right)^2. \qquad (8.1)$$

Its gradient is easily computed, and we can run gradient descent optimization
(Section 2.4.1). If y(xi ; w) is linear in the weights w, minimizing Esq (w) (least
squares estimation) corresponds to orthogonal projection onto the linear mode
space (Section 4.2.1), for which we can use robust and efficient algorithms from
numerical mathematics (Section 4.2.3). For multi-layer perceptrons, the gradient
can be computed by error backpropagation. Why would we ever use anything
else? As we will see in this section, the squared error function is not always
a sensible choice in practice. In many situations, we can do much better by
optimizing other error functions. Switching from squared error to something
else, we should be worried about sacrificing the amazing properties of least
squares estimation. However, as we will see in the second half of this chapter,
such worries are unfounded. The minimization of most frequently used error
functions can be reduced to running least squares estimation a few times with
reweighted data points. MLP training works pretty much the same way with all
these alternatives, in particular error backpropagation remains essentially the
same.
Where do alternative error functions come from? How do we choose a good error
function for a problem out there? We will address this question in much the same
spirit as in the last two chapters: we let ourselves be guided by decision theory
and probabilistic modelling. We will discover a second way of doing maximum
likelihood estimation and learn about the difference between generative and
discriminative (or diagnostic) modelling.

8.1.1 Issues with the Squared Error Function

In this section, we work through some examples, demonstrating why the squared
error (8.1) can be improved upon in certain situations. Recall that we introduced
Esq (w) in Section 2.4 for training a binary classifier, even before discussing the
classical application of least squares linear regression (Section 4.1). Is Esq (w)
a good error function for binary classification? No, it is not! Let us see why.
Consider a binary classification dataset D = {(xi , ti ) | i = 1, . . . , n}, where
ti ∈ {−1, +1}, for which we wish to train a linear classifier f (x) = sgn(y(x)),
y(x) = wT φ(x). The only other classification error function we know up to
now is the perceptron error
$E_{perc}(w) = \sum_{i=1}^{n} g\left(-t_i y(x_i)\right), \qquad g(z) = z\,\mathrm{I}_{\{z \ge 0\}},$

derived in Section 2.4.2, so let us compare the two. Before we do that, one
comment is necessary in the light of Chapter 2. We do not have to assume in this
comparison that the dataset D is linearly separable. The perceptron algorithm

from Section 2.3 will fail for a non-separable set, but other algorithms1 minimize
Eperc (w) properly for any dataset.


Figure 8.1: Top: Perceptron and squared error for binary classification, shown
as function of ty. Notice the rapid increase of the squared error on both sides of
ty = 1. Bottom: Two binary classification datasets. On both sides, we show the
least squares discriminant (squared error; magenta line) and the logistic regres-
sion discriminant (logistic error; green line). The logistic error is introduced in
Section 8.2. For now, think of it as a smooth approximation of the perceptron
error. Bottom right: The least squares fit is adversely affected by few additional
datapoints in the lower right, even though they are classified correctly by a large
margin. Figures on the bottom from [5] (used with permission).

Note that the perceptron error per pattern is a function of ti y(xi ), which is
positive if and only if y(·) classifies (xi , ti ) correctly. We can also get the squared
error into this form (using that t2i = 1):
$E_{sq}(w) = \frac{1}{2}\sum_{i=1}^{n} \left(y(x_i) - t_i\right)^2 = \frac{1}{2}\sum_{i=1}^{n} \left(t_i y(x_i) - 1\right)^2.$
1 $E_{perc}(w)$ is convex and lower bounded. It can be minimized by solving a linear program, which you may derive for yourself.

We can compare the two error functions by plotting their contribution for pattern
(xi , ti ) as function of ti yi , where yi = y(xi ) (Figure 8.1, top). As far as sim-
ilarities go, both error functions are lower bounded by zero and attain zero
only for correctly classified patterns (yi = ti for squared error). But there are
strong differences. The perceptron error does not penalize correctly classified
patterns at all, which is why the perceptron algorithm never updates on such.
Also, Eperc (w) grows gracefully (linearly) with −ti yi for misclassified points. In
contrast, the squared error grows quadratically in ti yi on both sides of 1. Misclas-
sified patterns (ti yi ≤ 0) are penalized much more than for the perceptron error.
The behaviour of the squared error right of 1 is much more bizarre. ti yi > 1
means that y(·) classifies (xi , ti ) correctly with a large margin (Section 2.3.3).
That is a good thing, but the squared error penalizes it! In fact, it penalizes
it the more, the larger the margin is! This leads to absurd situations, as de-
picted in Figure 8.1, bottom right. This irrational2 behaviour of the squared
error, applied to binary classification, can lead to serious problems in practice.
Minimizing the squared error is not a useful procedure to train a classifier. In
this section, we learn about alternatives which are only marginally more costly
to run and require about the same coding efforts.
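
To make the comparison concrete, here is a small Python sketch which evaluates the per-pattern contributions of both error functions over a grid of values $t_i y_i$ (the grid and the printout are purely illustrative); it reproduces the qualitative behaviour plotted in Figure 8.1, top.

    import numpy as np

    ty = np.linspace(-1.0, 3.0, 9)           # values of t_i * y_i
    e_perc = np.maximum(-ty, 0.0)            # g(-t y), with g(z) = z I{z >= 0}
    e_sq = 0.5 * (ty - 1.0) ** 2             # (1/2)(t y - 1)^2, using t^2 = 1

    for v, ep, es in zip(ty, e_perc, e_sq):
        print(f"ty = {v:5.2f}   perceptron = {ep:5.2f}   squared = {es:5.2f}")

Note how the squared error grows again for $t_i y_i > 1$, while the perceptron error stays at zero.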
Why don’t we simply always use the perceptron error? While it is widely
used, it comes with problems of its own. It is not differentiable everywhere,
so most gradient-based optimizers do not work without subtle modifications.
Also, Eperc (w) does not assign any error to correctly classified points (xi , ti ),
which means that all separating hyperplanes are equally optimal under this cri-
terion. A different and more compelling reason for using other error functions
is discussed in Section 8.2.2.

8.1.2 Squared Error and Gaussian Noise


Recall how we motivated curve fitting in Section 4.1. The targets ti ∈ R arise
as sum of a systematic part y(xi ), where y(·) is a smooth curve, plus random
noise: ti = y(xi ) + εi . In the spirit of Chapter 6, let us write down what we
mean by that and formalize our model assumptions. By the i.i.d. assumption, the
variables εi are distributed independently according to a noise distribution with
density pε (ε). Put differently, the conditional distribution of a target ti , given
the underlying clean curve value yi = y(xi ), is p(ti |yi ) = pε (ti − yi ). Suppose
you were given the underlying curve y(·) and access to the noise distribution
pε (·). Then, for any xi , you could generate a corresponding target ti by drawing
εi ∼ pε , then ti = y(xi ) + εi . Generate data, given model and parameters?
Sounds like a likelihood function:
$p(\{t_i\}|y(\cdot), \{x_i\}) = \prod_{i=1}^{n} p(t_i|y(x_i)) = \prod_{i=1}^{n} p_\varepsilon\left(t_i - y(x_i)\right).$

If y(x) = y(x; w) is parameterized in terms of weights w (for example, a linear


function wT φ(x) or a MLP), then
$p(\{t_i\}|w, \{x_i\}) = \prod_{i=1}^{n} p\left(t_i|y(x_i; w)\right).$
2 In fact, any sensible error function for binary classification should be nonincreasing in $t_i y_i$. Certainly, it should not grow quadratically!



This likelihood function differs from what we saw in Chapter 6. We do not
generate all of the data D, but only the targets t = [ti ]. It is a conditional
likelihood: we condition on the input points. This makes sense. After all, our
goal is to predict the function x → t at inputs x∗ not seen in D, and we do
not need the distribution of x for that. In the sequel, we will typically drop the
conditioning on the input points {xi } from the notation:
$p(t|w) = \prod_{i=1}^{n} p(t_i|y_i), \qquad y_i = y(x_i; w).$

As noted in Section 6.3, random errors are often captured well by a Gaussian
distribution. Let us choose p(t|y) = N (y, σ 2 ), meaning that each εi is drawn
from N (0, σ 2 ), where σ 2 is the noise variance. Here is a surprise. The negative
log conditional likelihood for Gaussian noise is the squared error function:
$-\log p(t|w) = \sum_{i=1}^{n} \frac{1}{2\sigma^2}\left(y(x_i; w) - t_i\right)^2 + C = \frac{1}{\sigma^2} E_{sq}(w) + C, \qquad C = \frac{n}{2}\log(2\pi\sigma^2). \qquad (8.2)$
Least squares estimation of w, minimizing Esq (w), is equivalent to maximizing
a Gaussian conditional likelihood.
This observation has several important consequences. First, it furnishes us with
statistical intuition about the squared error. Independent of computational ben-
efits and simplicity, it is the correct error function to minimize if residuals away
from a smooth curve behave more or less like Gaussian noise. On the other
hand, if ti are classification labels and y(x) a discriminant, or if the problem is
curve fitting, but we expect large errors to occur with substantial probability,
then other criteria can lead to far better solutions, and we should make the
additional effort to implement estimators for them. Second, just as squared er-
ror and Gaussian noise are related by the now familiar − log(·) transform, the
same holds for other such pairs. It is fruitful to walk this bridge in both ways.
Given an exotic error function, it may correspond to a negative log conditional
likelihood, which provides a clear idea about the scales of residuals to which the
error is most sensitive. And if a conditional density p(ti |yi ) captures properties
of the random errors we expect for our data, we have to search no further for a
sensible error function to minimize: $\sum_{i=1}^{n} -\log p(t_i|y_i)$.
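
The identity (8.2) is also easy to check numerically. The following sketch draws some targets under Gaussian noise (all numbers are illustrative) and verifies that the negative log conditional likelihood equals $E_{sq}(w)/\sigma^2$ plus a constant independent of the clean values.

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma2 = 20, 0.5
    y = rng.normal(size=n)                                   # clean values y(x_i; w)
    t = y + rng.normal(scale=np.sqrt(sigma2), size=n)        # noisy targets

    E_sq = 0.5 * np.sum((y - t) ** 2)
    neg_log_lik = np.sum(0.5 * (y - t) ** 2 / sigma2 + 0.5 * np.log(2 * np.pi * sigma2))
    C = 0.5 * n * np.log(2 * np.pi * sigma2)

    print(np.allclose(neg_log_lik, E_sq / sigma2 + C))       # True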

8.1.3 Conditional Maximum Likelihood


Probabilistic modelling is all about formulating one’s assumptions. If the task is
to predict a target t from an input pattern x, as in classification or regression, it
may be most convenient to model the relationship x → t directly, not bothering
with the distribution of x itself. A powerful way to do so is to postulate a
systematic function x → y, subsequently obscured by conditionally i.i.d. random
noise (see Figure 8.2). In other words, our conditional model has two parts: a
deterministic function y(x) and a noise distribution (or density) p(t|y). The
underlying function y(x) is parameterized in terms of weights w, and our goal
is to learn w from data. For example, in least squares linear regression, we use

Figure 8.2: Graphical model for a conditional likelihood. y = y(x) is a linear
combination of a weight vector w and a feature map φ(x). The observed target
t is then sampled from P (t|y). While y is real-valued, t can be discrete (for
example, t ∈ {−1, +1} in binary classification).

linear functions y(x) = wT φ(x) and a Gaussian noise model p(t|y) = N (y, σ 2 ).
In our examples so far, y(x) was a univariate function (a “clean” curve in
regression, or a discriminant function in binary classification), and noise was
additive (ti = yi +εi , where εi ∼ pε (·) i.i.d. and independent of yi ), but these are
special cases. In the sequel of this chapter, we will learn to know a non-additive
noise model for binary classification. Its extension to multi-way classification
will necessitate a multivariate systematic mapping y(x).
Given such a setup, we can fit the conditional model to data D = {(xi , ti ) | i =
1, . . . , n} by maximizing the conditional likelihood
$p(t|w) = \prod_{i=1}^{n} p(t_i|y_i), \qquad y_i = y(x_i; w).$

In practice, we minimize the negative log conditional likelihood − log p(t|w)
instead. Much as in Chapter 6, the main advantage of conditional maximum
likelihood is that we do not have to come up with loss functions, estimators or
algorithms for every new problem we face. Instead, we simply phrase what we
know and observe about the setting in probabilistic terms, from which all these
details follow automatically. However, conditional modelling and estimation is
clearly different from the joint maximum likelihood procedure of Chapter 6. We
will get to the bottom of this distinction in Section 8.2.3, after working through
an illustrative example.
Finally, the attentive reader may ask whether parameters w to be learned are
confined to the systematic part y(x). After all, the noise model p(t|y) may
come with parameters as well, for example the noise variance σ 2 of the Gaussian
N (y, σ 2 ). In general, the weights w may parameterize the noise model as well.
However, parameters such as the noise variance σ 2 are not typically lumped
together with w, but kept separate as so-called hyperparameters. The reason
for this is that we cannot estimate them by maximum likelihood without run-
ning into the over-fitting problem (Chapter 7). For example, we can drive (8.2)
to −∞ by choosing w such that yi = ti for all i (interpolant), and σ 2 → 0.
Another example for a hyperparameter is the degree p in polynomial regression
(Section 7.1.1). Choosing hyperparameters belongs to the realm of model se-
lection, which we will learn about in Chapter 10. For now, we consider these

parameters fixed and given, and concentrate our learning efforts on the weights
(or primary parameters) w.

8.2 Logistic Regression


The squared error function is not useful for training a binary discriminant func-
tion. Neither are we completely happy with the perceptron error function (Sec-
tion 8.1.1). In this section, we employ the conditional likelihood viewpoint to
derive an error function with improved properties.


Figure 8.3: Logistic function σ(v) = 1/(1 + e−v ), the transfer function implied
by the logistic regression likelihood.

Consider the binary classification problem with data D = {(xi , ti ) | i =
1, . . . , n}, where ti ∈ {−1, +1}. Suppose we employ linear functions y(x) =
wT φ(x). How to choose a sensible noise model P (t|y), where t ∈ {−1, +1}
and y ∈ R? Given such a model, which aspect of the conditional distribution
P (t|x) does y(x) represent? Once more, decision theory comes to the rescue.
Recall our derivation of the optimal discriminant function y ∗ (x) for spherical
Gaussian class-conditional distributions (Section 6.4.1). We found that y ∗ (x) is
linear, representing the log odds ratio
$y^*(x) = \log\frac{P(t = 1|x)}{P(t = -1|x)}. \qquad (8.3)$
What we did not do back then was to work out P (t|x) in terms of y ∗ (x). Let
us do that now. Using that P (t = 1|x) + P (t = −1|x) = 1:

$e^{y^*(x)}\left(1 - P(t = 1|x)\right) = P(t = 1|x)$

$\Leftrightarrow\; P(t = 1|x) = \frac{e^{y^*(x)}}{1 + e^{y^*(x)}} = \sigma\left(y^*(x)\right), \qquad \sigma(v) = \frac{1}{1 + e^{-v}}.$
σ(v) is called the logistic function (Figure 8.3). It may remind you of the tanh(a)
transfer function we used with MLPs (Section 3.2). Indeed, you will have no
problem confirming that tanh(a) = 2σ(2a) − 1: while tanh(a) is symmetric
w.r.t. the origin, σ(v) is symmetric w.r.t. (0, 1/2). Since σ(−v) = 1 − σ(v), we
can write compactly:

P (t|x) = P (t|y ∗ (x)) = σ(ty ∗ (x)), t ∈ {−1, +1}.



P (t|x) depends on x only through y ∗ (x), so this is the noise model we have
been looking for: P (t|y) = σ(ty).
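
The two identities used above are quickly confirmed numerically (a small sketch, with an arbitrary grid of arguments):

    import numpy as np

    def sigma(v):
        return 1.0 / (1.0 + np.exp(-v))      # logistic function

    v = np.linspace(-5.0, 5.0, 101)
    print(np.allclose(np.tanh(v), 2.0 * sigma(2.0 * v) - 1.0))   # tanh(a) = 2 sigma(2a) - 1
    print(np.allclose(sigma(-v), 1.0 - sigma(v)))                # sigma(-v) = 1 - sigma(v)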

This went a bit fast, so let us repeat what we did here. We used a simple
generative setup of binary classification (two Gaussian class-conditional distri-
butions with unit covariance matrix) in order to specify a noise model P (t|y)
for conditional maximum likelihood. Given the joint setup p(x, t) = p(x|t)P (t),
we worked out the posterior P (t|x) and the corresponding optimal discriminant
function y ∗ (x) in Section 6.4.1. Two things happen. First, y ∗ (x) turns out to be
linear. Second, the posterior distribution P (t|x) can be expressed as P (t|y ∗ (x))
for a conditional distribution P (t|y), y ∈ R. In probabilistic terms, t and x are
conditionally independent given y ∗ (x). The “two spherical Gaussian” generative
setup exhibits the separation into systematic part and random noise required
for conditional modelling (Section 8.1.3). The systematic part y ∗ (x) is linear,
while the noise model turns out to be logistic: P (t|y) = σ(ty).

There is nothing special about the setup with spherical Gaussian class-
conditional distributions p(x|t). The logistic link between y ∗ (x) and P (t|y) is
simply an inversion of the definition of the log odds ratio (8.3) in terms of P (t|y).
Our conditional likelihood requirements are satisfied whenever the log odds ra-
tio is linear in the parameters w to be learned, and this happens for a range
of other generative models as well. An example is the naive Bayes document
classification model for K = 2 classes (Section 6.5).


Figure 8.4: Binary classification error functions used in perceptron learning,
least squares discrimination and logistic regression.

The corresponding error function per data point is

$-\log P(t_i|y_i) = -\log\sigma(t_i y_i) = \log\left(1 + e^{-t_i y_i}\right),$




giving rise to the logistic error function

$E_{log}(w) = -\log P(t|w) = \sum_{i=1}^{n} \log\left(1 + e^{-t_i y_i}\right), \qquad y_i = y(x_i; w). \qquad (8.4)$

Minimizing Elog (w) is an instance of conditional maximum likelihood, known as
logistic regression (if y(x; w) is linear in w). The nomenclature reinforces a point
we made in Section 4.1. Even though our goal is classification, mapping x to
t ∈ {−1, +1}, the basic relationship is a real-valued curve x → y, which is fitted
to data (“regression”) through the logistic link. As we will see in Section 8.2.2,
this procedure can do more for us than mere discrimination. Moreover, the
embedded real-valued function is the basis for a very efficient training algorithm,
which reduces logistic regression to a few (reweighted) calls of classical linear
regression (Section 8.2.4).
In Figure 8.4, the logistic error is compared to the squared error Esq (w) and
the perceptron error Eperc (w), plotting their contribution per data point (x, t)
as function of ty. Clearly, Elog (w) does not share the erratic behaviour of the
squared error. It decreases monotonically with ty, encouraging classification
with large margins. In fact, the logistic and perceptron error functions behave
the same for large |ty|, positive or negative. The logistic error improves3 upon
the perceptron error by being continuously differentiable everywhere. It also
encourages large margins, as patterns which are correctly classified just so still
imply a substantial penalty. A further more important advantage of Elog over
Eperc is discussed in Section 8.2.2.

8.2.1 Gradient Descent Optimization

In this section, we show how to compute the gradient of the logistic error (8.4)
w.r.t. the weights w, which allows us to minimize Elog (w) by gradient descent
(Section 2.4.1). We begin with the linear case, y(x) = wT φ(x). Recall that
$E_{log}(w) = \sum_{i=1}^{n} E_i(w), \qquad E_i(w) = -\log\sigma(t_i y_i) = \log\left(1 + e^{-t_i y_i}\right).$

One observation up front. Elog (w) is a convex4 , lower bounded function of w.


Convexity is a key concept in numerical optimization and machine learning,
which we will need in Chapter 9. Some definitions and links are given in Sec-
tion 9.1.1. For our purposes here, convexity implies that each local minimum
is a global minimum, so that gradient descent optimization will detect a global
minimum point upon convergence5 . The gradient computation decomposes as
$\nabla_w E_{log} = \sum_{i=1}^{n} \nabla_w E_i$. For standard gradient descent, we determine a descent
direction by summing up g i = ∇w Ei , while for stochastic gradient descent we
3 Differentiability can also be a disadvantage, as it does not encourage sparsity of the predictor (see Chapter 9).
4 Namely, $y_i \mapsto \log(1 + e^{-t_i y_i})$ is an instance of the logsumexp function, which is convex [7], and $y_i$ is linear in $w$.
5 There is one catch with the latter statement, which we will clarify in Section 8.3.2. For a linearly separable training dataset, logistic regression can drive $\|w\|$ to very large values at ever decreasing gains in $E_{log}(w)$.

use g i for single patterns (xi , ti ) (Section 2.4.2). Before we begin, recall the
simple form of the gradient for the squared error (8.1) from Section 2.4.1:

$\nabla_w \frac{1}{2}(y_i - t_i)^2 = (y_i - t_i)\nabla_w y_i = (y_i - t_i)\phi(x_i).$
The gradient is the sum of feature vectors φ(xi ), each weighted by the residual
yi − ti .
Recall that ti ∈ {−1, +1}. Let t̃i = (ti + 1)/2 ∈ {0, 1}, and define πi = P (ti =
1|yi ) = σ(yi ) ∈ (0, 1). In order to work out the gradient, note that

$\sigma'(v) = \frac{e^{-v}}{(1 + e^{-v})^2} = \sigma(v)\,\frac{e^{-v}}{1 + e^{-v}} = \sigma(v)\sigma(-v) = \sigma(v)\left(1 - \sigma(v)\right).$

Therefore,

$\frac{\partial}{\partial y_i}\left(-\log\sigma(t_i y_i)\right) = \frac{-t_i\,\sigma(t_i y_i)\sigma(-t_i y_i)}{\sigma(t_i y_i)} = -t_i\,\sigma(-t_i y_i) = \pi_i - \tilde t_i,$

and

$\nabla_w E_i = (\pi_i - \tilde t_i)\nabla_w y_i = (\pi_i - \tilde t_i)\phi(x_i), \qquad \pi_i = \sigma(y_i).$
This has the same form as for the squared error, only that the residuals are
redefined: πi − t̃i instead of yi − ti . For the squared error, a pattern (xi , ti ) does
not contribute much to the gradient if and only if yi − ti ≈ 0, or yi ≈ ti . As
noted in Section 8.1.1, this does not make much sense for binary classification.
For example, ti = +1, yi = 5 is penalized heavily, even though the pattern
is classified correctly with large margin. For the logistic error, the residual is
redefined in a way which makes sense for binary classification. A pattern does
not contribute much if and only if πi = P (ti = 1|yi ) ≈ t̃i , namely if the pattern
is classified correctly with high confidence. If ti = +1, yi = 5, then πi ≈ 1 and
t̃i = 1, so that (xi , ti ) contributes very little to ∇w Elog . Using the vectorized
notation of Section 2.4.1:
$\nabla_w E_{log} = \sum_{i=1}^{n} (\pi_i - \tilde t_i)\phi(x_i) = \Phi^T(\pi - \tilde t), \qquad \pi = [\pi_i],\ \tilde t = [\tilde t_i]. \qquad (8.5)$

If we limit ourselves to gradient descent optimization, there is no computational
difference between the two error functions, and the logistic error should be the
preferred option.
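
Putting (8.5) to work is a matter of a few lines. Below is a minimal batch gradient descent sketch for linear logistic regression; the function name, step size and iteration count are illustrative choices, not prescriptions.

    import numpy as np

    def sigma(v):
        return 1.0 / (1.0 + np.exp(-v))

    def fit_logistic(Phi, t, eta=0.1, iters=500):
        """Minimize E_log(w) by batch gradient descent; t has entries in {-1, +1}."""
        t_tilde = (t + 1) / 2.0                  # map {-1, +1} to {0, 1}
        w = np.zeros(Phi.shape[1])
        for _ in range(iters):
            pi = sigma(Phi @ w)                  # pi_i = P(t_i = 1 | y_i)
            w -= eta * (Phi.T @ (pi - t_tilde))  # gradient from eq. (8.5)
        return w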

Logistic Error for MLPs

Recall multi-layer perceptrons (MLPs) from Chapter 3. Back then, we mini-
mized the squared error Esq (w) by gradient descent, where the gradient could
be computed efficiently by error backpropagation (Section 3.3). Suppose we
want to train an MLP for a binary classification task, such as 8s versus 9s on
the MNIST digits. The squared error is bad for classification, whether we use
linear discriminants, MLPs, or anything else, so let us use the logistic error for
our MLP instead. Surprisingly, this switch does not introduce any additional
complications, neither in the implementation nor in computations. Recall our

derivation of error backpropagation in Section 3.3, and consider a network with
L layers, whose output activation is a(L) (xi ) for input xi . Let us compare the
i-th pattern’s contribution to the error functions and their respective gradient
computations:

$E_{sq,i}(w) = \frac{1}{2}\left(a^{(L)}(x_i) - t_i\right)^2, \qquad E_{log,i}(w) = -\log\sigma\left(t_i\, a^{(L)}(x_i)\right).$

The corresponding residuals at the uppermost layer are

$r^{(L)}_{sq,i} = a^{(L)}(x_i) - t_i, \qquad r^{(L)}_{log,i} = \sigma\left(a^{(L)}(x_i)\right) - \tilde t_i.$

Recall that backpropagation decomposes into forward pass, computation of the
output residual, backward pass and gradient assembly. Suppose you have code
up and running for training MLPs by minimizing the squared error. The only
change you have to make in order to minimize the logistic error instead, is to
replace the output residuals a(L) (xi ) − ti by σ(a(L) (xi )) − t̃i . This insight holds
for other (continuously differentiable) error functions besides Esq and Elog just
as well. You can and should keep your MLP code generic. For each new error
function to be supported, all you need to work out is which output residuals it
gives rise to. All dominating computations, such as forward and backward pass,
are independent of the error function used. In short, error backpropagation stays
exactly the same, the backward pass is simply initialized with different errors
(or residuals).
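
In code, this observation suggests isolating the error function behind a single "output residual" routine which seeds the backward pass; everything else in the MLP implementation stays untouched. A sketch of such an interface (the names are illustrative):

    import numpy as np

    def sigma(v):
        return 1.0 / (1.0 + np.exp(-v))

    def output_residual(a_L, t, error="logistic"):
        """Residual seeding the backward pass; a_L is the output activation a^(L)(x_i)."""
        if error == "squared":
            return a_L - t                        # r = a^(L)(x_i) - t_i
        if error == "logistic":                   # t in {-1, +1}, t_tilde in {0, 1}
            return sigma(a_L) - (t + 1) / 2.0     # r = sigma(a^(L)(x_i)) - t_tilde_i
        raise ValueError(error)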

8.2.2 Estimating Posterior Class Probabilities (*)

Recall our comparison between logistic and perceptron error above. A com-
pelling reason to prefer Elog (w) over Eperc (w) is understood by noting that for
our decision-theoretic setup, the link between the optimal discriminant y ∗ (x)
and P (t|x) is one-to-one. In other words, the posterior distribution P (t|x) is
a fixed and known function of y ∗ (x). If we employ the logistic link for condi-
tional maximum likelihood, we cannot only learn to discriminate well, but also
estimate the posterior probabilities P (t|x). There are numerous reasons why we
would want to do the latter, some of which were summarized in Section 5.2.5.
After all, the Bayes-optimal classifier f ∗ (x) decides f ∗ (x) = 1 in the same
way for P (t = 1|x) = 0.51 and P (t = 1|x) = 0.99, while knowledge of P (t|x)
may lead to different decisions in these cases, and is clearly more informative
in general. For instance, class probability estimates are essential in the cancer
screening example of Section 5.2.4.
The representation of f ∗ (x) in terms of a discriminant function, f ∗ (x) =
sgn(y ∗ (x)), is obviously not unique. When we talked about “the” optimal dis-
criminant y ∗ (x) above, we meant the log odds ratio (8.3), but there are many
others. For some of them, we can reconstruct the posterior probability P (t|x)
from y ∗ (x) for each input point x, for others we cannot. An example for the
first class of discriminant functions is the log odds ratio:

P (t = +1|x)
y ∗ (x) = log ⇒ P (t|x) = σ(ty ∗ (x)).
P (t = −1|x)

An example for the second class is f ∗ (x) itself. This is an optimal discriminant
function, since f ∗ (x) = sgn(f ∗ (x)). On the other hand, f ∗ (x) contains no
more information about P (t|x) than whether it is larger or smaller than 1/2.
If our goal is to estimate posterior class probabilities, we have to choose an
error function whose minimizer tends to a “probability-revealing” discriminant
function at least in principle, for a growing number of training data points and
discriminant functions to choose from.
Giving an operational meaning to “at least in principle” is the realm of learn-
ing theory, which is not in the scope of this lecture. But a necessary con-
dition for an error function to allow for posterior probability estimation is
easily understood by the concept of population minimizers. Recall the setup
of decision theory from Section 5.2. Suppose we are given an error function
$E(y(\cdot)) = n^{-1}\sum_{i=1}^{n} g(t_i, y(x_i))$. Then, $y^*(x)$ is a population minimizer if

$y^*(x) \in \operatorname*{argmin}_{y}\ \mathbb{E}\left[\,g(t, y)\,\big|\,x\,\right]$

for almost all x. Intuitively, if we use more and more training data and allow
for any kind of function y(x), we should end up with a population minimizer.
An error function allows for estimating posterior probabilities only if all its
population minimizers y ∗ (x) are probability-revealing, in that P (t = +1|x) is
a fixed function of y ∗ (x).
For the logistic error function, glog (t, y) = log(1 + e−ty ), and the unique popula-
tion minimizer is the log odds ratio (8.3), which is probability-revealing. On the
other hand, the perceptron error function performs poorly w.r.t. this test. One
of its population minimizers is y ∗ (x) ≡ 0. This comment may paint an overly
pessimistic picture of the perceptron criterion, which often works well if applied
with linear functions y(x) = wT φ(x) under constraints such as kwk = 1 (which
rules out the all-zero solution). But if Eperc is used in less restrictive settings, it
often does lead to solutions with a tiny margin (see Section 2.3.3), from which
posterior probabilities cannot be estimated. As we will see in Chapter 9, one way
of addressing these problems of Eperc is to insist on a maximally large margin.
However, as shown in Section 9.4, the resulting maximum margin perceptron (or
support vector machine) does not allow for the consistent estimation of posterior
class probabilities either.
To sum up, the logistic error function, being the negative log conditional like-
lihood under the logistic noise model P (t|y) = σ(ty), supports the consistent
estimation of posterior class probabilities, while the perceptron error does not.
It is interesting to note that the squared error (8.1), for all its shortcomings with
binary classification, shares this beneficial property with Elog . We will see in Sec-
tion 10.1.3 that its population minimizer is the conditional expectation E[t | x],
which is obviously probability-revealing in the case of binary classification.
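
The population minimizer of the logistic error can be checked by brute force: fix a posterior probability p = P(t = +1|x), form the conditional expected loss E[g_log(t, y) | x] on a grid of y values and locate its minimum. The sketch below does exactly this (p and the grid are arbitrary illustrative choices); the minimizer agrees with the log odds ratio log(p/(1 − p)).

    import numpy as np

    p = 0.8                                        # P(t = +1 | x), arbitrary
    ys = np.linspace(-5.0, 5.0, 10001)
    # E[ log(1 + exp(-t y)) | x ] with t in {-1, +1}
    risk = p * np.log1p(np.exp(-ys)) + (1.0 - p) * np.log1p(np.exp(ys))
    y_star = ys[np.argmin(risk)]

    print(y_star, np.log(p / (1.0 - p)))           # both approximately 1.386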

8.2.3 Generative and Discriminative Models

Armed with the example of logistic regression, we are now ready to highlight
the fundamental difference between joint and conditional likelihood maximiza-
tion, between generative and discriminative modelling. Recall our motivation

of logistic regression by way of decision theory for spherical Gaussian class-


conditional distributions. However, there are two different routes now from
decision-theoretic insight to a real-world classifier trained on data D by maxi-
mum likelihood estimation. We can proceed as in Section 6.4.1, estimating the
parameters of class-conditional and class prior distributions by joint maximum
likelihood and plugging them into the log odds ratio discriminant. Or we esti-
mate the discriminant parameters directly by conditional maximum likelihood
(Section 8.1.3).
For the first option, $p(x|t, \theta_{gen}) = N(x|\mu_t, I)$ and $P(t|\theta_{gen}) = \pi_1^{(1+t)/2}(1 - \pi_1)^{(1-t)/2}$, with parameters $\theta_{gen} = [(\mu_{-1})^T, (\mu_{+1})^T, \pi_1]^T$. We estimate those by maximizing the joint likelihood

$\max_{\theta_{gen}} \left\{ \prod_{i=1}^{n} p(x_i, t_i|\theta_{gen}) = \prod_{i=1}^{n} p(x_i|t_i, \theta_{gen}) P(t_i|\theta_{gen}) \right\},$

then plug the solution $\hat\theta_{gen}$ into the log odds ratio

$\hat y_{gen}(x; \hat\theta_{gen}) = \left(\hat\mu_{+1} - \hat\mu_{-1}\right)^T x + \hat b = (\hat w_{gen})^T \phi(x), \qquad \hat w_{gen} = \begin{bmatrix} \hat\mu_{+1} - \hat\mu_{-1} \\ \hat b \end{bmatrix},$

$\hat b = -\frac{1}{2}\left(\|\hat\mu_{+1}\|^2 - \|\hat\mu_{-1}\|^2\right) + \log\frac{\hat\pi_1}{1 - \hat\pi_1}.$

For the second option, we parameterize the log odds ratio directly as
ydsc (x; θ dsc ) = (wdsc )T φ(x), so that θ dsc = wdsc , and fit it to the data by
maximizing the conditional likelihood
$\max_{\theta_{dsc}} \left\{ \prod_{i=1}^{n} P(t_i|\theta_{dsc}, x_i) = \prod_{i=1}^{n} \sigma\left(t_i\, y_{dsc}(x_i; \theta_{dsc})\right) \right\}.$

Even though these two classification methods share the same decision-theoretic
motivation and result in P (t|x) estimates of the same functional form, they can
behave very differently in practice. The weights ŵdsc in conditional maximum
likelihood are estimated directly, while the weights ŵgen are assembled from
θ̂ gen , which are twice as many parameters estimated in a different way. In Fig-
ure 8.5, the generative ML plug-in rule is compared to discriminative logistic
regression. The two methods clearly produce different results. They constitute
examples of different modelling and learning paradigms:

• Generative modelling: Devise a joint model of all variables of the problem
domain, containing in particular the input point x. For classification, the
joint model would be

$p(x, t|\theta) = p(x|t, \theta)P(t|\theta).$

Learn parameters by joint maximum likelihood,

$\max_{\theta} \prod_{i=1}^{n} p(x_i, t_i|\theta),$

Figure 8.5: ML plug-in classifier for spherical Gaussian class-conditional dis-
tributions (black) versus logistic regression by conditional maximum likelihood
(red), for a growing number n of data points (panels: n = 8, 20, 34, 58). The
true data distribution consists of two equi-probable Gaussians with spherical
covariance (optimal decision boundary in dashed green), which favours the
plug-in rule. Note the different behaviour of the two methods, in particular for
small training set sizes.

or a maximum a-posteriori variant (Section 7.3). Plug the maximizer θ̂
into the log odds ratio discriminant function, or into the derived posterior

$P(t|x, \hat\theta) \propto p(x|t, \hat\theta)P(t|\hat\theta)$

in order to predict posterior class probabilities.
We mainly focussed on the generative modelling approach in Chapter 6.
The nomenclature is due to the fact that given a generative model, a
complete dataset could be sampled (or generated) from it. In particular,
the input point x is modelled.
• Discriminative modelling (or diagnostic modelling): Devise a conditional
model of the variable(s) to be predicted, conditional on variables which
will always be given at prediction time. For classification, the target t is
modelled conditioned on the input point x:
P (t|x, θ).
Learn parameters by conditional maximum likelihood,
$\max_{\theta} \prod_{i=1}^{n} P(t_i|x_i, \theta),$

or a maximum a-posteriori variant (Section 8.3.2). Plug the maximizer θ̂
directly into P (t|x, θ) in order to predict posterior class probabilities.
The discriminative modelling approach and conditional maximum likeli-
hood was introduced in the present chapter. The name “discriminative” is
somewhat misleading (and “diagnostic” is preferred in statistics), since re-
gression and other setups are covered just as well. As seen in Section 8.1.2,
least squares estimation is an example of conditional maximum likelihood
for a discriminative model with Gaussian noise. Note that a discriminative
model does not say anything about the generation of input points x.

The most important point to note about generative and discriminative mod-
elling is that they constitute two different options we have for addressing a
problem such as classification or regression, each coming with strengths and
weaknesses. For a particular application, it is not usually obvious a priori which
of the two will work better. However, some general statements can be made.
Discriminative modelling is more direct and often leads to models with less pa-
rameters to estimate. After all, the distribution p(x) over input points is not
represented at all. Most “black-box” classification or curve fitting methods are
based on discriminative models, even though the feature map φ(x) still has to
be specified. On the other hand, it is often much simpler to encode structural
prior knowledge about a task in a generative “forward” model (as emphasized
in Section 6.1) than to guess a sensible form for P (t|x). Moreover, training a
generative model is often simpler and more modular. A general advantage of
generative over discriminative models is that cases with corrupted or partly
missing input point xi can be used rather easily with the former (using tech-
niques discussed in Chapter 12), but typically have to be discarded with the
latter. It is possible to combine the two paradigms in different ways, a topic
which is not in the scope of this course.
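
As a small illustration of the two routes, the following sketch fits both classifiers to the same synthetic dataset drawn from two spherical Gaussians (sample size, means and optimization settings are illustrative choices). The generative plug-in rule assembles (w, b) from estimated class means and prior; the discriminative route fits w directly by gradient descent on the logistic error.

    import numpy as np

    def sigma(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(1)
    n = 40
    mu_pos, mu_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
    t = rng.choice([-1, 1], size=n)
    X = np.where(t[:, None] == 1, mu_pos, mu_neg) + rng.normal(size=(n, 2))

    # Generative ML plug-in: estimate class means and prior, assemble (w, b)
    mu_hat_pos, mu_hat_neg = X[t == 1].mean(axis=0), X[t == -1].mean(axis=0)
    pi1 = np.mean(t == 1)
    w_gen = mu_hat_pos - mu_hat_neg
    b_gen = -0.5 * (mu_hat_pos @ mu_hat_pos - mu_hat_neg @ mu_hat_neg) \
            + np.log(pi1 / (1.0 - pi1))

    # Discriminative: logistic regression on phi(x) = [x^T, 1]^T by gradient descent
    Phi, t_tilde = np.hstack([X, np.ones((n, 1))]), (t + 1) / 2.0
    w_dsc = np.zeros(3)
    for _ in range(5000):
        w_dsc -= 0.01 * (Phi.T @ (sigma(Phi @ w_dsc) - t_tilde))

    print("generative     w, b:", w_gen, b_gen)
    print("discriminative w, b:", w_dsc[:2], w_dsc[2])

The two hyperplanes are typically close but not identical, mirroring the behaviour shown in Figure 8.5.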

8.2.4 Iteratively Reweighted Least Squares (*)


As noted in Section 8.1.1, least squares estimation is not a useful approach
to binary classification. Nevertheless, it tends to be used rather frequently in
machine learning circles towards this end. As seen in Section 8.2.1, if you limit
yourself to gradient descent, there is really no reason to prefer Esq over Elog .
On the other hand, least squares estimation can in general be solved much more
efficiently by advanced solvers from numerical mathematics, for which code is
publicly available (Section 4.2.3). In this section, we discuss an algorithm which
globally minimizes the logistic error function (8.4) by solving a short sequence
of reweighted least squares problems. In other words, this algorithm reduces
logistic regression training to a few calls of least squares estimation. As we will
see, the implementational effort on top of LSE is rather minor.
The algorithm we will work out is called iteratively reweighted least squares
(IRLS). It is an instance of the Newton-Raphson algorithm, motivated in Sec-
tion 3.4.2. The application to logistic regression is also called Fisher scoring in
the statistics literature. We will derive it as a reduction to least squares esti-
mation. Recall from Chapter 4 that LSE is equivalent to the minimization of
the quadratic function Esq (w) with positive definite system matrix ΦT Φ. We
wish to minimize Elog (w) which is not quadratic, so we cannot hope for a direct

reduction to LSE. The next best idea is to proceed iteratively. Suppose we are at
the point w. The locally closest quadratic fit to $E_{log}(w')$ is given by the Taylor
approximation

$E_{log}(w') \approx q_w(w') := E_{log}(w) + (\nabla_w E_{log})^T(w' - w) + \frac{1}{2}(w' - w)^T\left(\nabla\nabla_w E_{log}\right)(w' - w).$
Here, $\nabla\nabla_w E_{log}$ is the Hessian (matrix of second derivatives) at w. The idea be-
hind Newton-Raphson is to minimize the surrogate $q_w(w')$ instead of $E_{log}(w')$,
updating w to the quadratic minimizer. We will see that $\nabla\nabla_w E_{log}$ is positive
definite, so the quadratic minimizer $w'$ is given by

$\left(\nabla\nabla_w E_{log}\right)(w' - w) = -\nabla_w E_{log}.$

We already worked out the gradient (8.5) in Section 8.2.1. For the Hessian, we
employ the same strategy. First, it will be the sum of contributions ∇∇w Ei ,
one for each data point. Second, we can use the chain rule once more. If y =
[yi ] = Φw, then
$\nabla_w E_i = \frac{\partial E_i}{\partial y_i}\nabla_w y_i = \frac{\partial E_i}{\partial y_i}\phi(x_i) \;\Rightarrow\; \nabla\nabla_w E_i = (\nabla_w y_i)\frac{\partial^2 E_i}{\partial y_i^2}(\nabla_w y_i)^T = \frac{\partial^2 E_i}{\partial y_i^2}\phi(x_i)\phi(x_i)^T.$

Therefore, the Hessian is

$\nabla\nabla_w E_{log} = \sum_{i=1}^{n} \kappa_i\, \phi(x_i)\phi(x_i)^T = \Phi^T(\operatorname{diag}\kappa)\Phi, \qquad \kappa_i = \frac{\partial^2 E_i}{\partial y_i^2}.$

Moreover, since $\partial E_i/\partial y_i = \sigma(y_i) - \tilde t_i$, then

$\kappa_i = \frac{\partial}{\partial y_i}\left(\sigma(y_i) - \tilde t_i\right) = \sigma(y_i)\sigma(-y_i) = \pi_i(1 - \pi_i) > 0,$
since πi = σ(yi ) > 0. In fact, κi ∈ (0, 1/4]. The Hessian is positive definite,
which provides another proof of the convexity of Elog (w) (see Section 9.1.1).
Better still, both Hessian and gradient are simply reweighted versions of the
corresponding entities in standard least squares estimation. If ∇w Elog = ΦT ξ,
ξi = σ(yi ) − t̃i , then
$q_w(w + d) = \xi^T\Phi d + \frac{1}{2} d^T\Phi^T(\operatorname{diag}\kappa)\Phi d + C_1 = \frac{1}{2}(\Phi d - f)^T(\operatorname{diag}\kappa)(\Phi d - f) + C_2 = \frac{1}{2}\sum_{i=1}^{n} \kappa_i\left(d^T\phi(x_i) - f_i\right)^2 + C_2, \qquad f_i = -\xi_i/\kappa_i,$

where $C_1, C_2$ are constants. $\min_d q_w(w + d)$ is a least squares estimation prob-
lem, where each data point xi is associated with a pseudotarget fi , and the
contribution of (xi , fi ) is weighted by κi ∈ (0, 1/4]. Publicly available LSE codes

typically allow for such weighting. The solution d∗ is called Newton direction,
$q_w(w')$ is minimized for $w' = w + d^*$. This observation explains the naming of
IRLS, which is solved by iterating over a sequence of reweighted least squares
estimation problems.
If you ever implement IRLS in practice, you should note one more detail. The
derivation above suggests to update w to w + d∗ at the end of an iteration, as
this is the minimizer of the quadratic fit qw (w0 ). In practice, it works better to
employ a line search:

$w' \leftarrow w + \lambda^* d^*, \qquad \lambda^* = \operatorname*{argmin}_{\lambda > 0} E_{log}(w + \lambda d^*).$

In other words, we search for a minimum point along the line segment {w +
λd∗ | λ > 0} determined by the Newton direction. The minimum does not have
to be found to high accuracy, so that very few evaluations of Elog (and possibly
its derivative) along the line are sufficient. It is common practice to start the
line search with λ = 1 and accept the full Newton step if this leads to sufficient
descent. Details can be found in [2].
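
A compact sketch of IRLS for linear logistic regression is given below (iteration count, convergence handling and the crude backtracking line search are illustrative simplifications). The Newton system solved in each iteration is exactly the reweighted least squares problem derived above.

    import numpy as np

    def sigma(v):
        return 1.0 / (1.0 + np.exp(-v))

    def E_log(w, Phi, t):
        return np.sum(np.logaddexp(0.0, -t * (Phi @ w)))  # sum_i log(1 + exp(-t_i y_i))

    def irls(Phi, t, iters=20):
        """Iteratively reweighted least squares; t has entries in {-1, +1}."""
        t_tilde = (t + 1) / 2.0
        w = np.zeros(Phi.shape[1])
        for _ in range(iters):
            pi = sigma(Phi @ w)
            xi = pi - t_tilde                      # residuals; gradient is Phi^T xi
            kappa = pi * (1.0 - pi)                # weights kappa_i in (0, 1/4]
            # Newton direction: solve (Phi^T diag(kappa) Phi) d = -Phi^T xi, which are
            # the normal equations of the weighted LSE with pseudotargets f_i = -xi_i / kappa_i
            A = Phi.T @ (kappa[:, None] * Phi)
            d = np.linalg.solve(A, -(Phi.T @ xi))
            lam = 1.0                              # backtracking line search
            while E_log(w + lam * d, Phi, t) > E_log(w, Phi, t) and lam > 1e-8:
                lam *= 0.5
            w = w + lam * d
        return w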

8.3 Discriminative Models

In this section, we discuss further examples and extensions of discriminative
modelling. First, we show how to extend logistic regression to more than two
classes. Next, we extend conditional maximum likelihood to conditional maxi-
mum a-posteriori (MAP) estimation and draw a bridge to regularized estima-
tion.

8.3.1 Multi-Way Logistic Regression

We developed logistic regression in Section 8.2 for the case of two classes. How
about the general case of K ≥ 2 classes? We can let ourselves being guided
by decision theory in much the same way. We derived a generative maximum
likelihood approach to multi-way classification in Section 6.4.1, which employed
K discriminant functions

yk∗ (x) = (wk )T φ(x) = log P (t = k|x) + C(x), k = 0, . . . , K − 1.

Here, C(x) does not depend on k. In the case of spherical Gaussian class-
conditional distributions discussed back then, we had φ(x) = [xT , 1]T . What
matters is that the log posterior log P (t = k|x) is a linear function of φ(x) plus
a part which does not depend on the class label k. The implied classification
rule was
$f^*(x) = \operatorname*{argmax}_{k=0,\ldots,K-1} y_k^*(x) = \operatorname*{argmax}_{k=0,\ldots,K-1} P(t = k|x),$

since neither the addition of C(x) nor the increasing exp(·) transform changes
the maximizing value for k. How do we obtain P (t|x) from the yk∗ (x)? We take

the exponential, then normalize the result to sum to one:



$e^{y_k^*(x)} = P(t = k|x)\, e^{C(x)} \;\Rightarrow\; P(t = k|x) = \frac{e^{y_k^*(x)}}{\sum_{\tilde k} e^{y_{\tilde k}^*(x)}} = \sigma_k\left(y^*(x)\right),$

$\sigma_k(v) = \frac{e^{v_k}}{\sum_{\tilde k} e^{v_{\tilde k}}} = e^{v_k - \operatorname{lsexp}(v)}, \qquad \operatorname{lsexp}(v) = \log\sum_{\tilde k} e^{v_{\tilde k}}.$

σ(v) = [σk (v)] ∈ ∆K is called softmax mapping (or multivariate logit mapping
in statistics). Recall from Section 6.5 that ∆K denotes the simplex of probability
distributions over {0, . . . , K − 1}. The softmax mapping is not one-to-one, since
σ(v + α1) = σ(v) for any α ∈ R, reflecting the fact that we can always add
any fixed C(x) to our discriminant functions yk∗ (x) and still obtain the same
posterior probabilities. The softmax mapping is conveniently defined in terms of
the logsumexp function lsexp(v), a convex function we previously encountered
in Section 8.2.1.
Once more, our motivation of the softmax link for multi-way classification is not
limited to spherical Gaussian generative models, but holds more generally. At
this point, we take the same step as in Section 8.2. We define a discriminative
model for P (t|x) as
 
$P(t = k|x, w) = \sigma_k\left(\left[y_{\tilde k}(x; w_{\tilde k})\right]_{\tilde k}\right), \qquad y_k(x; w_k) = (w_k)^T\phi(x), \qquad w = \begin{bmatrix} w_0 \\ \vdots \\ w_{K-1} \end{bmatrix}.$
Suppose we are given some multi-way classification data D = {(xi , ti ) | i =
1, . . . , n}, where ti ∈ {0, . . . , K − 1}. What is the negative log conditional like-
lihood − log P (t|w), the generalization of the logistic error function (8.4) to
K classes? Recall from Section 6.5 that the use of indicators can simplify ex-
pressions substantially. With this in mind, let us define t̃i = [t̃ik ] ∈ {0, 1}K ,
where t̃ik = I{ti =k} , or t̃i = δ ti . t̃i has a one in component ti , zeros elsewhere.
The representation of {ti } in terms of the t̃i is called 1-of-K coding. Using the
techniques from Section 6.5:
$P(t_i|x_i, w) = P(t_i|y_i) = \prod_{k=0}^{K-1} \sigma_k(y_i)^{\tilde t_{ik}} = \exp\left(\sum_{k=0}^{K-1} \tilde t_{ik} y_{ik} - \operatorname{lsexp}(y_i)\right), \qquad y_i = [y_{ik}],\ y_{ik} = y_k(x_i; w_k).$
In this derivation, we used that $\sum_k \tilde t_{ik} = 1$. The negative log conditional likeli-
hood, or K-way logistic error function, is

$-\log P(t|w) = E_{log}(w) = \sum_{i=1}^{n} \underbrace{\operatorname{lsexp}(y_i) - (\tilde t_i)^T y_i}_{=:E_i(w)}, \qquad y_i = \left[y_k(x_i; w_k)\right].$

This is still a sum of independent terms Ei (w), one for each data point (xi , ti ),
but it couples the entries of each y i ∈ RK . It is also a convex function, due to
the convexity of lsexp(·). We can understand the structure of this error function
by noting that lsexp(v) is a “soft” approximation to maxk vk :
$\log\sum_{k} e^{v_k} = M + \log\sum_{k} e^{v_k - M} \in (M, M + \log K], \qquad M = \max_k v_k.$

Namely, $v_k - M \le 0$, so that $e^{v_k - M} \le 1$, while $\sum_k e^{v_k - M} > 1$, since $v_k = M$
for at least one k. The approximation is close if $\max_k v_k$ is larger than the other
entries by some margin, and in this case $\sigma(v) \approx \delta_{\operatorname{argmax}_k v_k}$, which explains the
“softmax” nomenclature. Therefore,

$E_i(w) = \operatorname{lsexp}(y_i) - (\tilde t_i)^T y_i \approx \max_k y_{ik} - y_{i(t_i)}.$

The error is close to zero if yi(ti ) = maxk yik , i.e. if (xi , ti ) is classified correctly,
while an error is penalized linearly by the distance between predicted maxk yik
and desired yi(ti ) .

Gradient of Multi-Way Logistic Error

In order to minimize − log P (t|w), we need to work out its gradient w.r.t. w.
To this end, we use a remarkable property of lsexp(v):

$\frac{\partial}{\partial v_k}\operatorname{lsexp}(v) = \frac{e^{v_k}}{\sum_{\tilde k} e^{v_{\tilde k}}} = \sigma_k(v) \;\Rightarrow\; \nabla_v \operatorname{lsexp}(v) = \sigma(v).$

Therefore,

$\nabla_{y_i} E_i = \sigma(y_i) - \tilde t_i,$

an expression which is precisely analogous to what we found in the binary clas-
sification case. Using the chain rule,

$\nabla_{w_k} E_{log} = \sum_{i=1}^{n} \left(\sigma_k(y_i) - \tilde t_{ik}\right)\nabla_{w_k} y_{ik} = \sum_{i=1}^{n} \left(\sigma_k(y_i) - \tilde t_{ik}\right)\phi(x_i).$

Note that

$\sum_{k=0}^{K-1} \nabla_{w_k} E_i = \left(\sum_{k=0}^{K-1} \left(\sigma_k(y_i) - \tilde t_{ik}\right)\right)\phi(x_i) = 0,$

since $\sum_k \sigma_k(y_i) = \sum_k \tilde t_{ik} = 1$. The contribution φ(xi ) is distributed between
the gradients $\nabla_{w_k} E_{log}$ in a zero-sum fashion: a positive example for class ti , a
negative example for all other classes k ≠ ti .
Let us apply our newfound K-way logistic error function to training a multi-
layer perceptron for K-way classification. We need K output layer activations
$a_k^{(L)}(x)$, one for each class. Also, $a^{(L)}(x) = [a_k^{(L)}(x)]$. Much like in Section 8.2.1,
we only have to modify error backpropagation in a minor way. Fix pattern
(xi , ti ). The output residuals form a vector

$r_i^{(L)} = \sigma\left(a^{(L)}(x_i)\right) - \tilde t_i$

now. The backward pass starting at output activation $a_k^{(L)}(x_i)$ is seeded with
the error $r_{i,k}^{(L)}$.
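
Two implementation details are worth writing down. The logsumexp function should be computed by subtracting the maximum first (exactly the decomposition used above), and the K-way gradient is again of the form "features times residuals". A sketch (array shapes and names are illustrative):

    import numpy as np

    def lsexp(V):
        """Row-wise log-sum-exp, computed stably: lsexp(v) = M + log sum_k exp(v_k - M)."""
        M = V.max(axis=1, keepdims=True)
        return M + np.log(np.exp(V - M).sum(axis=1, keepdims=True))

    def softmax(V):
        return np.exp(V - lsexp(V))                # sigma_k(v) = exp(v_k - lsexp(v))

    def kway_logistic_grad(Phi, T_tilde, W):
        """Gradients of the K-way logistic error w.r.t. each w_k, stacked as columns.
        Phi: (n, p) features; T_tilde: (n, K) 1-of-K targets; W: (p, K) weights."""
        Y = Phi @ W                                # y_ik = (w_k)^T phi(x_i)
        return Phi.T @ (softmax(Y) - T_tilde)      # column k equals nabla_{w_k} E_log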

8.3.2 Conditional Maximum A-Posteriori Estimation (*)


In this chapter, we found novel interpretations of error function minimization,
such as least squares estimation, in terms of conditional likelihood. Recall from
Chapter 7 that such procedures can run into over-fitting problems, which pre-
vents them from generalizing well to unseen test data. A general remedy, regular-
ization, was found to alleviate over-fitting artefacts substantially. An example is
Tikhonov regularized least squares estimation, where a penalty term (β/2)kwk2
is added to the squared error function. As seen in Section 7.3, regularization has
a clean probabilistic interpretation as maximum a-posteriori (MAP) estimation
in generative models which would normally give rise to joint maximum likelihood
plug-in rules. But what about regularized LSE?
In this section, we close the circle by showing how regularization in conditional
maximum likelihood for discriminative models can be interpreted as condi-
tional maximum a-posteriori estimation. Recall our reformulation of LSE in Sec-
tion 8.1.2 as conditional likelihood maximization, where p(t|w) = N (t|y, σ 2 I),
where yi = wT φ(xi ), or y = Φw. In the spirit of Section 7.3, the Tikhonov reg-
ularizer (β/2)kwk2 corresponds to a Gaussian prior distribution on the weight
vector, p(w) = N (w|0, β −1 I). The posterior distribution is

p(w|t) ∝ p(t|w)p(w),

and the conditional MAP estimator is its mode:

$\hat w_{MAP} = \operatorname*{argmax}_{w}\ p(w|t) = \operatorname*{argmin}_{w}\left\{-\log p(t|w) - \log p(w)\right\}.$

You should confirm for yourself that the second expression is σ −2 times the
regularized least squares criterion (7.1) up to a constant, if ν = βσ 2 . The prob-
abilistic conditional MAP interpretation of Tikhonov-regularized least squares
is as follows. Our model assumptions are twofold. First, the noise is additive
Gaussian with variance σ 2 . Second, the clean curve y(x) is a linear function
of weights w with a Gaussian prior N (w|0, β −1 I), which keeps the weights
uniformly small.

MAP for Logistic Regression

MAP estimation is useful for logistic regression just as well. In fact, let us
consider what happens with the logistic error function (8.4) for a linearly sep-
arable dataset D. In this case, there exists some weight vector w1 so that
ti y(xi ; w1 ) = ti (w1 )T φ(xi ) > 0 for all i = 1, . . . , n. But then,

$\sigma\left(t_i y(x_i; \alpha w_1)\right) = \sigma\left(\alpha\, t_i (w_1)^T\phi(x_i)\right) \to 1 \qquad (\alpha \to \infty),$




and Elog (αw1 ) → 0 as α → ∞. While the logistic error is lower bounded by
zero, the only way to converge to zero is to scale kwk ever larger. Changing
α leaves the separating hyperplane invariant, but P (t|x, w) viewed along the
direction w1 converges to a hard step function as α → ∞, a telltale sign of over-
fitting. A simple remedy is to do MAP estimation instead, using a Gaussian prior
p(w) = N (0, β −1 I). Having to pay for large kwk, the fitting method will re-
frain from over-saturating the posterior probabilities. As noted in Section 7.2.2,

a beneficial side effect of employing a regularizer (or prior distribution) is to
improve the conditioning of the underlying optimization problem. This holds
both for regularized LSE, where the quadratic function becomes “more positive
definite”, and for penalized logistic regression, where the regularizer improves
the condition number of the Hessian matrices (Section 8.2.4). For multi-way
logistic regression (Section 8.3.1), it is common practice to use an independent
Gaussian prior
$p(w) = \prod_{k=0}^{K-1} p(w_k) = \prod_{k=0}^{K-1} N(w_k|0, \beta_k^{-1} I).$

Just like the variance σ 2 of the Gaussian noise model, the prior parameter β is
a hyperparameter (Section 8.1.3). Choosing good hyperparameter values is an
instance of model selection (see Chapter 10).
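
In terms of implementation, the step from conditional ML to conditional MAP with a Gaussian prior is one extra term in the criterion and in the gradient (a sketch; the function name and the way beta is passed are illustrative):

    import numpy as np

    def sigma(v):
        return 1.0 / (1.0 + np.exp(-v))

    def map_logistic_objective_grad(w, Phi, t, beta):
        """Value and gradient of E_log(w) + (beta/2) ||w||^2; t has entries in {-1, +1}."""
        y = Phi @ w
        value = np.sum(np.logaddexp(0.0, -t * y)) + 0.5 * beta * np.dot(w, w)
        grad = Phi.T @ (sigma(y) - (t + 1) / 2.0) + beta * w
        return value, grad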

Techniques: Gaussian Posterior Distribution (*)

By relating conditional MAP with regularized least squares estimation, we con-
vinced ourselves that the posterior distribution p(w|t) ∝ p(t|w)p(w) has the
mode (point of maximum density)

$\hat w = \left(\Phi^T\Phi + \nu I\right)^{-1}\Phi^T t,$

the minimizer of the regularized squared error (7.1). But what is the poste-
rior distribution? Let us fill in a hole we left in our survey of the Gaussian in
Section 6.3. First,
 
$p(w|t) \propto p(t|w)p(w) \propto \exp\left(-\frac{1}{2\sigma^2}\|t - \Phi w\|^2 - \frac{\beta}{2}\|w\|^2\right).$

Recall our timesaving trick from Section 7.3: when working out a distribution
(or density) over w, we ignore all multiplicative terms which do not depend on
w. Note the absence of 2π and determinant terms in our derivation. Writing
$\nu = \beta\sigma^2$, we have that $p(w|t) \propto e^{-\frac{1}{2\sigma^2} q(w)}$, where

$q(w) = \|t - \Phi w\|^2 + \nu\|w\|^2 = w^T\left(\Phi^T\Phi + \nu I\right)w - 2t^T\Phi w + C.$

C is some constant, we could work it out, but we don’t, as it does not depend
on w. In the sequel, instead of writing C1 , C2 , . . . for all sorts of constants we
don’t care about anyway, we call all of them C (even though the value may
change from line to line). Now, q(w) is a quadratic function, and we begin to
suspect that maybe p(w|t) is Gaussian after all. Indeed, if Σ = (ΦT Φ + νI)−1 ,
then

q(w) = wT Σ−1 w − 2ŵT Σ−1 w + C = (w − ŵ)T Σ−1 (w − ŵ) + C


1
 
⇒ p(w|t) ∝ e− 2σ2 q(w) ∝ N w ŵ, σ 2 Σ .

We first completed the square (Section 6.4.3), then matched the result to the
form of a Gaussian density (6.2). If we find such a match, we can just read off

mean and covariance matrix. The posterior distribution is a Gaussian, its mode
ŵ is also its mean (we already know that from Section 6.3), and its covariance
matrix is σ 2 Σ. In other words, a Gaussian prior p(w) is conjugate for a Gaussian
likelihood p(t|w) = N (t|Φw, σ 2 I) (Section 7.3.1). One more closedness property
for the amazing Gaussian family:

• Closed under full-rank affine linear transformations (Section 6.3), in par-
ticular under marginalization.

• Closed under conditioning and Bayes formula (this section).

This list is by no means exhaustive. If you recall the sum and product rule
from Section 5.1, you note that all these operations keep us within the Gaussian
family, which is one reason why it is so frequently used in practice. An example
which throws you out of the family: if x ∼ N (0, 1), then x2 is not a Gaussian
random variable.
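
For completeness, the posterior just derived is a two-line computation (a sketch; inputs are assumed to be a feature matrix Phi, target vector t and the hyperparameters σ² and β):

    import numpy as np

    def gaussian_posterior(Phi, t, sigma2, beta):
        """Posterior N(w | w_hat, sigma2 * Sigma) for Gaussian likelihood and prior."""
        nu = beta * sigma2
        Sigma = np.linalg.inv(Phi.T @ Phi + nu * np.eye(Phi.shape[1]))
        w_hat = Sigma @ (Phi.T @ t)                # posterior mean, equal to the mode
        return w_hat, sigma2 * Sigma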
Chapter 9

Support Vector Machines

In this chapter, we explore kernel methods for binary classification, first and
foremost the support vector machine. These are nonparametric methods, like
nearest neighbour (Section 2.1), which can be regarded as regularized linear
methods in huge, even infinite-dimensional feature spaces. We first encountered
the margin of a dataset in Chapter 2, in the context of the perceptron algorithm.
In this chapter, we will gain a deeper understanding of this concept.
This chapter is mathematically somewhat more demanding than previous ones,
but we will move slowly and build intuition. The work will be very worthwhile.
Support vector machines are not just among the most powerful “black box”
classifiers, with high impact1 on many applications, they also seeded a link be-
tween machine learning and convex optimization which has been and continues
to be enormously fruitful, far beyond SVMs.

9.1 Maximum Margin Perceptron Learning


In this chapter, we will be concerned with binary classification throughout.
Our goal will be to train a classifier on data D = {(xi , ti ) | i = 1, . . . , n},
where ti ∈ {−1, +1}. In this section, we will concentrate on linear classifiers
f (x) = sgn(y(x)), y(x) = wT φ(x) + b. Note that we make the bias parameter
explicit in this chapter, for reasons that will become clear soon. We will also
assume that D is linearly separable: there exists some w, b, so that ti y(xi ) > 0
for all i = 1, . . . , n. We will relax both assumptions in sections to come, so
the end result will be a powerful nonlinear classification method, which can
tolerate training errors in the interest of better generalization: the support vector
machine (SVM).
The story starts with a closer look at the perceptron learning algorithm. This is a
good point to revisit Section 2.3, and in particular Section 2.3.3. We know that
the perceptron algorithm converges finitely whenever D is linearly separable,
1 They are easily on par with neural networks in that respect. SVMs are far easier to use and are based on much more robust training algorithms, properties which are highly appreciated in most application fields.


which is equivalent to a positive margin γD . Quite literally, the margin quantifies


the room to move between different separating hyperplanes. The perceptron
convergence theorem (Theorem 2.1) bounds the number of updates in terms of
2
1/γD . Intuitively, the larger the fraction of separating hyperplanes among all
hyperplanes, the easier it is to find one of the former. However, the perceptron
algorithm stops once it finds any separating hyperplane. If γD > 0, there are
infinitely many such solutions. Which of them is the best? Since Chapter 7 we
know that the answer to this question is not realizable, requiring knowledge of
the “true” data distribution, while all we have is the data D.


Figure 9.1: Different separating hyperplanes for binary classification data. (a) is
the largest margin hyperplane, while (b) and (c) are potential solutions for the
perceptron learning algorithm.

Fine, but ask yourself which of the solutions in Figure 9.1 you would pick if
you had to. Would you go for (b) or (c), sporting tiny distances to some of
the datapoints? Or would you choose (a), which allows for maximum room
to move? The principle behind our intuitive preference for (a) is as follows.
Our basic assumption about D is that it is an i.i.d. random sample. If we could
repeat the sampling process, we would get another set D′ whose overall structure
resembled that of D, but details would be different. With this in mind, it makes
sense to search for a discriminant function exhibiting stability to small changes in
the individual data points. For example, suppose that each pattern xi is slightly
displaced, leading to equally slight displacements of the φ(xi ). A stable solution
would in general remain unchanged. Given this particular notion of stability,
the best hyperplane is the one whose distance to the nearest training pattern is
maximum. We should look for the discriminant function with maximum margin.

Problem: Among many separating hyperplanes, which one shall I choose?


Approach: The hyperplane with largest margin γD (w, b) exhibits maximum
stability against small xi displacements.

For much of this chapter, we will use a slightly redefined version of the margin
for unnormalized patterns (Section 2.3.3):
$\gamma_D(w, b) = \min_{i=1,\ldots,n} \frac{t_i\left(w^T\phi(x_i) + b\right)}{\|w\|}. \qquad (9.1)$

Figure 9.2: The signed distance δi of a pattern φ(xi ) to a hyperplane with unit
normal vector w0 is defined by φ(xi ) = âi + δi ti w0 , where âi is the orthogonal
projection of φ(xi ) onto the hyperplane. The margin of the hyperplane w.r.t.
a dataset is the smallest signed distance over all datapoints. The hyperplane
is separating the data iff its margin is positive. The maximum margin over all
hyperplanes is the margin γD of the dataset.

Recall the geometrical picture of the margin from Section 2.3.3 and Figure 9.2.
For any i, ti (wT φ(xi ) + b)/kwk is the signed distance δi of φ(xi ) from the hy-
perplane (negative if the point is misclassified). Namely, let âi be the orthogonal
projection of φ(xi ) onto the hyperplane. Then, wT âi + b = 0, since âi is on the
plane. Moreover, to get to φ(xi ), we march to âi , then move δi along the unit
normal vector w0 = w/kwk: φ(xi ) = âi + δi ti w0 . Multiplying with wT :
$w^T\phi(x_i) = w^T\hat a_i + \delta_i t_i\|w\| = -b + \delta_i t_i\|w\| \;\Leftrightarrow\; \delta_i = \frac{t_i\left(w^T\phi(x_i) + b\right)}{\|w\|}.$
Here, we used wT w0 = kwk and 1/ti = ti . The margin is the smallest signed
distance to any of the training cases. If (w, b) describes a separating hyperplane,
you can remove “signed”. A difference2 to the margin concept in Section 2.3.3
is that here, we normalize w only, but leave b unregularized. Moreover, we do
not insist on normalizing the feature vectors φ(xi ) here, even though this is
typically done in SVM practice (see end of Section 9.2.3).
The maximum margin perceptron learning problem (also known as optimal per-
ceptron learning problem) is given by
$\max_{w,b}\left\{\gamma_D(w, b) = \min_{i=1,\ldots,n} \frac{t_i\left(w^T\phi(x_i) + b\right)}{\|w\|}\right\}. \qquad (9.2)$
We maximize the minimum margin per pattern, where the minimum is over the
data points, the maximum over the hyperplane. The solution to this problem
is the margin γD for the dataset D. As in Section 2.3.3, we can interpret 2γD
as the maximum width of a slab of parallel separating hyperplanes in between
datapoints of the two classes (the light-red region in Figure 9.2). Note that at
least one degree of freedom is left unspecified in this problem. If (w∗ , b∗ ) is a
solution, then so is (βw∗ , βb∗ ) for any β > 0.
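
For a given hyperplane, the margin (9.1) is a one-line computation (a sketch; the rows of X play the role of the feature vectors φ(xi )):

    import numpy as np

    def margin(w, b, X, t):
        """gamma_D(w, b) = min_i t_i (w^T phi(x_i) + b) / ||w||."""
        return np.min(t * (X @ w + b)) / np.linalg.norm(w)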
2 For the margin concept used in this chapter, we can translate all datapoints by a common offset vector without changing the value of γD .

Figure 9.3: Examples of convex and non-convex sets (top panel), as well as
convex and non-convex functions (bottom panel).

9.1.1 A Convex Optimization Problem


A basic understanding of convex sets, functions, and convex optimization is
essential for working in machine learning today, whether in research or appli-
cations. The book of Boyd and Vandenberghe [7] is highly recommended, it
will help you develop geometrical intuition, and its many examples convey the
breadth of convex optimization in applications today. You should certainly read3
Sections 2.1 and 3.1. A set S ⊂ Rp is convex if all points on the line segment be-
tween any two a, b ∈ S are contained in S: λa + (1 − λ)b ∈ S for any λ ∈ [0, 1].
A real-valued function f : S → R is convex if its domain S is convex, and if
f (λa + (1 − λ)b) ≤ λf (a) + (1 − λ)f (b), a, b ∈ S, λ ∈ [0, 1].
If you plot the function, the line segment between (a, f (a)) and (b, f (b)) lies
on or above the graph of f . Examples of convex
(non-convex) sets and functions are given in Figure 9.3. Convex functions can
be seen as generalizations of linear functions (which are convex), where “≤”
would be “=”. We already met convex functions several times in this course.
The most elementary class of convex functions beyond linear ones are quadratics
with positive semidefinite covariance matrix (Section 6.3, Section 4.2.2). More
generally, if a function f (x) is twice differentiable everywhere, then it is convex
if and only if its Hessian ∇∇x f is positive semidefinite everywhere. Both the
logistic and the perceptron error functions are convex (Chapter 8); the latter is an
example of a convex function which is not differentiable everywhere. Maybe the
most important practical consequence of convexity is that every local minimum
point of f (x) must be a global minimum point as well (you should prove this
for yourself). Finally, a convex optimization problem has the form
min_{x∈S} f (x),



where S ⊂ Rp is convex and f (x) is a convex function on S.

3 The book is available online at www.stanford.edu/∼boyd/cvxbook.html.


The maximum margin perceptron learning problem (9.2) is not convex as it
stands, since the criterion γD (w, b) is not convex. Our goal in this section is
to show that a version of this problem corresponds to a convex optimization
problem with a unique solution.
The first step is to eliminate the minimum over i = 1, . . . , n by introducing linear
constraints. To this end, we introduce another variable γ̃ and rewrite (9.2) as
max_{w,b,γ̃} γ̃ / ‖w‖,   subj. to  ti (wT φ(xi ) + b) ≥ γ̃,  i = 1, . . . , n.
Make sure to understand why the two problems are equivalent before moving
on. In fact, for fixed (w, b), the optimum choice is γ̃ = mini ti (wT φ(xi ) + b).
At this point, we play out the card of being able to rescale [wT , b]T without
changing anything. This means that there is one parameter too much, which
can be fixed to an arbitrary positive value. Let us simply choose γ̃ to be that
parameter, and fix γ̃ = 1.
max_{w,b} 1 / ‖w‖,   subj. to  ti (wT φ(xi ) + b) ≥ 1,  i = 1, . . . , n.
Finally, instead of maximizing 1/kwk, we can just as well minimize the quadratic
(1/2)kwk2 . We end up with
min_{w,b} (1/2)‖w‖²,   subj. to  ti (wT φ(xi ) + b) ≥ 1,  i = 1, . . . , n.        (9.3)

This is a convex optimization problem. For one, kwk2 is a convex function. More-
over, the constraints determine the set to be optimized over, called the feasible
set. Each such affine constraint defines a halfspace in Rp+1 where [wT , b]T lives
(Section 2.2), and the intersection of halfspaces, if not empty, defines a convex
set (please prove this for yourself). Why is the feasible set not empty? Because
we assume that the dataset D is linearly separable.

Problem: Finding the largest margin hyperplane is not a convex problem.


Approach: Convert problem into an equivalent convex quadratic program.
Key step: Fix scale of (w∗ , b∗ ), whose size does not matter.

Our convex problem (9.3) is an example of a quadratic program. You may be


familiar with linear programs, which sport linear criteria to be minimized w.r.t.
linear constraints (the feasible set being a convex polytope, a bounded intersec-
tion of linear halfspaces). Quadratic programs are defined by a positive semidef-
inite quadratic criterion subject to linear constraints. We will defer the question
of how to solve (9.3) to Section 9.3. First, we will be concerned with lifting the
simplifying assumptions made at the beginning of this section.

9.2 Support Vector Machines


In the previous section, we turned the idea of stability and maximum margin
perceptron learning into a convex optimization problem (9.3). However, this

formulation has two shortcomings shared with the perceptron algorithm. First,
it works only for linearly separable datasets D. Second, we are limited to linear
discriminants y(x) = wT φ(x) + b in some finite-dimensional feature space,
w ∈ Rp . In this section, we will learn to know remedies for both shortcomings. At
the end, we will not only have derived support vector machines, but also gained
a much broader perspective on “kernel” variants of regularized estimation.

9.2.1 Soft Margins


How can we modify (9.3) in order to tolerate classification errors, while penal-
izing them? It is helpful to look at methods which already provide this feature,
such as for example logistic regression (Section 8.2). Writing yi = y(xi ) =
wT φ(xi ) + b, the constraints in (9.3) are ti yi ≥ 1. Compare this to the logistic
error per case in terms of ti yi (Figure 9.5). For ti yi ≥ 1, this is fairly close to
zero, while it grows linearly with −ti yi . Maybe we should relax the hard con-
straint ti yi ≥ 1 into something which implies a penalty linear in (1 − ti yi ). To
do so, we introduce nonnegative slack variables ξi ≥ 0, one for each pattern,
giving rise to the maximum soft margin perceptron learning problem
min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^n ξi ,   ξ = [ξi ],        (9.4)
subj. to  ti (wT φ(xi ) + b) ≥ 1 − ξi ,   ξi ≥ 0,   i = 1, . . . , n.
This is the quadratic program underlying the support vector machine (SVM),
as devised by Cortes and Vapnik [9]. The parameter C > 0 is a hyperparameter,
whose choice is an instance of model selection (see Chapter 10). Consider what
happens at an optimal solution (w∗ , b∗ , ξ ∗ ), where we must have ξ∗,i = max{1 −
ti y∗,i , 0} (why?). If ξ∗,i = 0, then ti y∗,i ≥ 1 and the case is classified correctly
with a margin, just as before. If ξ∗,i > 0, the pattern lies within the margin area
(ty(x) < 1), for ξ∗,i > 1 it is even misclassified. We pay for this flexibility by

Cξ∗,i in the criterion value. Note that due to this argument, the added penalty
Σi ξ∗,i can be seen as an upper bound on the number of training errors.

Problem: Margin constraints on all patterns can be too stringent.


No solution for linearly non-separable data (outliers).
Approach: Maximize a soft margin instead. Allow margin constraints to be
violated, but impose (linear) costs for each violation.

The problem (9.4) is often called soft margin SVM problem, while our previous
variant (9.3) becomes the hard margin SVM problem. Note that the hard margin
problem is a special4 case of the soft margin problem, where C = ∞. The soft
margin extension can also lead to an improved solution for a linearly separable
dataset D. To understand this point, let us define the soft margin as 1/kw∗ k
for an optimal solution (w∗ , b∗ , ξ ∗ ) of (9.4), in analogy with Section 9.1.1. Note
that this quantity depends on C. If D is separable, the soft margin for C = ∞
coincides with the normal (hard) margin. In geometrical terms, the soft mar-
gin is the closest distance to the hyperplane of any pattern i with ξ∗,i = 0. It
4 In general, the soft margin solution will coincide with the hard margin solution from some finite C < ∞ onwards.



Figure 9.4: The soft margin support vector machine allows for a larger soft
margin γsoft at the expense of violating the large margin constraints ti yi ≥ 1
for certain patterns. Different from the hard margin SVM, it can be applied to
datasets which are not linearly separable.

is therefore no smaller, and typically larger, than the hard margin. Figure 9.4
illustrates the consequences of optimizing the more general soft margin. Intu-
itively, we extend the width of our slab (light-red region), thus the stability of
the solution, at the expense of engulfing a few patterns. To conclude, the soft
margin SVM is more generally useful than its hard margin special case, and we
will mainly be concerned with the former in the remainder of this chapter.
The soft margin SVM can be understood as a Tikhonov-regularized estimator,
directly related to conditional maximum a-posteriori estimation (Section 8.3.2).
To this end, we eliminate the slack variables in (9.4) by using the hinge function
[x]+ := max{x, 0} and divide the criterion by C, arriving at the equivalent
problem
min_{w,b} (1/(2C))‖w‖² + Σ_{i=1}^n [1 − ti yi ]+ ,   yi = wT φ(xi ) + b.

This problem combines the hinge error function

Esvm (w, b) = Σ_{i=1}^n [1 − ti yi ]+ ,   yi = wT φ(xi ) + b,

with the Tikhonov regularizer (2C)−1 kwk2 . We can compare Esvm with the
perceptron and logistic error (Figure 9.5). It corresponds to the perceptron error
[−ti yi ]+ shifted to the right in order to enforce a discrimination with margin.
Unlike the logistic error, it is exactly zero for ti yi ≥ 1. The consequences of this
fact will be highlighted in Section 9.4.
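
As an aside, the hinge-plus-regularizer form can also be minimized directly by stochastic subgradient descent. The following rough sketch makes this concrete; the learning rate, epoch count and data format are assumptions, and this is not the dual solver derived in Section 9.3, which is what standard SVM packages use:

```python
# Sketch: stochastic subgradient descent on (1/(2C))||w||^2 + sum_i [1 - t_i y_i]_+ .
# X is n x p, t has entries in {-1, +1}; step size and epochs are assumed values.
import numpy as np

def train_soft_margin_linear(X, t, C=1.0, epochs=50, eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            y_i = X[i] @ w + b
            g_w, g_b = w / (C * n), 0.0      # per-sample share of the regularizer
            if t[i] * y_i < 1.0:             # hinge term [1 - t_i y_i]_+ is active
                g_w = g_w - t[i] * X[i]
                g_b = -t[i]
            w -= eta * g_w
            b -= eta * g_b
    return w, b
```
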


Figure 9.5: Error per pattern i as function of ti yi , for perceptron, logistic and
SVM hinge error.

As a final comment, some readers may be puzzled about the choice of 1 in


[1 − ti yi ]+ , why not 2 or 10−10 ? This choice is indeed arbitrary in the following
sense. Suppose we had chosen ε > 0 instead of 1. Then, a solution (w∗ , b∗ )
to the hard margin SVM problem corresponds to the solution (w∗ /ε, b∗ /ε) of
(9.3) in the original form. For the soft margin SVM, we also have to scale C. A
solution (w∗ , b∗ , ξ ∗ ) of the ε-modified problem with C is equivalent to a solution
(w∗ /ε, b∗ /ε, ξ ∗ /ε) of the original problem with C/ε (please confirm these points
for yourself). This makes sense, since C has to be measured in the units of y(x).

9.2.2 Feature Expansions. Representer Theorem


In this section, we prepare the ground for a powerful nonlinear extension not
only of SVMs, but of all other regularized estimation techniques we encoun-
tered so far (logistic regression, perceptron), by way of “kernelization”. This
property is often presented as “magical” or a “trick”, while it is a natural prop-
erty of estimation methods based on linear functions. Consider some regularized
estimation problem with ν > 0:
min_{w,b} { E(w, b) = Σ_{i=1}^n Ei (ti , yi ) + (ν/2)‖w‖² },   yi = y(xi ) = wT φ(xi ) + b.        (9.5)

We assume that the criterion is lower bounded and has globally optimal solu-
tions. What we will show is that there is always an optimal solution (w∗ , b∗ ) for
which we can write
w∗ = Σ_{i=1}^n α∗,i φ(xi ).

The weight vector w∗ can be written as linear combination of the feature vectors
φ(xi ). In other words, w∗ = ΦT α∗ . Writing w∗ in terms of α∗ is known as dual
representation. The consequence of this so called representer theorem is that we
may as well optimize over α, with w = ΦT α, without loss in optimality. This
statement is trivial for p ≤ n, so for the rest of this section we will assume that
p > n (more features than datapoints), and that ν > 0 in (9.5).
We begin with a method which does not use Tikhonov regularization, the per-
ceptron algorithm of Section 2.3. Recall from Algorithm 1 that we start with
w ← 0, and each update has the form w ← w + ti φ(xi ). This means that we
can just as well maintain a dual representation α ∈ Rn , start with α ← 0, and
update α ← α + ti δ i . The perceptron algorithm is ready for “kernelization”.
Note that the dual representation can even be a good idea in this case for p < n.
If D has a sizeable margin, the perceptron algorithm will only ever update on
a few patterns, and the dual vector α remains sparse, thus can be stored and
handled efficiently.
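
To make the dual representation concrete, here is a small sketch (not from the text) of the perceptron algorithm run entirely in terms of α; the kernel function passed in is a placeholder assumption, with the plain inner product as default:

```python
# Sketch: perceptron maintained in dual representation ("kernel perceptron").
# K is any kernel function K(x, x'); the default inner product is only a placeholder.
import numpy as np

def kernel_perceptron(X, t, K=lambda a, b: a @ b, sweeps=20):
    n = len(t)
    alpha = np.zeros(n)                                   # dual representation of w
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(sweeps):
        for i in range(n):
            y_i = alpha @ G[:, i]        # w^T phi(x_i) with w = sum_j alpha_j phi(x_j)
            if t[i] * y_i <= 0:          # mistake: alpha <- alpha + t_i * delta_i
                alpha[i] += t[i]
    return alpha
```
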
How about linear classification by minimizing the squared error Esq ? Consider
the stochastic gradient variant from Section 2.4.2. Upon visiting pattern i, we
update
w ← w − η∇w Ei = w − η(yi − ti )φ(xi ),
which again gives rise to a dual representation, in which we update α ← α −
η(yi − ti )δ i . We derive the underlying dual optimization problem at the end of
this section.
How about any problem of the form (9.5)? If you read Section 4.2.2 on orthog-
onal projection, you have all the tools together. The criterion (9.5) consists of
two parts. The error function depends on (w, b) through y = Φw + b1 only,
while the regularizer is (ν/2)kwk2 . We will show the following. For any (w, b)
giving rise to y, we can find some w̃ = ΦT α so that Φ w̃ + b1 = y = Φw + b1
and kw̃k2 ≤ kwk2 . We just need to apply this to an optimal solution in order to
establish the representer theorem. Recall from Section 4.2.2 that w = w∥ + w⊥
uniquely, where w∥ ∈ ΦT Rn (therefore w∥ = ΦT α for some α ∈ Rn ) and w⊥
is orthogonal to ΦT Rn , meaning that Φw⊥ = 0. Therefore,

y = Φ(w∥ + w⊥ ) + b1 = Φw∥ + b1.

Moreover, by the Pythagorean theorem (Section 2.1.1):

‖w‖² = ‖w∥ ‖² + ‖w⊥ ‖² ≥ ‖w∥ ‖².

The claim follows with w̃ = w∥ . The intuition behind the representer theorem
is that the contribution w⊥ orthogonal to w∥ cannot help in decreasing the
error (it does not affect y), but it adds to the regularization costs.
To conclude, for a wide range of Tikhonov5 regularized estimation problems of
the form (9.5), including penalized least squares, MAP estimation for logistic
regression (Section 8.3.2) and soft margin support vector machines, the optimal
solution can be represented in terms of w∗ = ΦT α∗ = Σ_{i=1}^n α∗,i φ(xi ). This
fact is valuable if p > n, as it allows us to optimize over (α, b) instead of (w, b).
5 You will have no problem confirming that the representer theorem holds more generally for regularizers of the form R(‖w‖), where R(v) is nondecreasing for v ≥ 0.



Problem: I want to use more features p than datapoints n.


Can I save time and storage?
Approach: You can! The representer theorem allows us to optimize over
α ∈ Rn instead of w ∈ Rp .

Dual Representation of Linear Least Squares (*)

Let us work out the dual representation for linear least squares regression, namely
(9.5) with Ei = (yi − ti )2 /2. Note that for this example, we deviate from the
policy here and include b into w, appending 1 to φ(x). The other case leads to
the same conclusion, but is more messy to derive. Starting from Section 7.2, we
have that

(νI + ΦT Φ) w∗ = ΦT y   ⇒   w∗ = ΦT α∗ ,   α∗ = (1/ν)(y − Φw∗ ).

Plugging in Φw∗ = ΦΦT α∗ , then

να∗ = y − ΦΦT α∗   ⇒   (νI + ΦΦT ) α∗ = y.

Note the remarkable duality in this formulation. The system for the dual rep-
resentation α∗ is obtained by replacing ΦT Φ ∈ Rp×p with ΦΦT ∈ Rn×n , and
the right hand side ΦT y with y. Ultimately, this simple classical relationship is
the basis for the “kernel trick”. We will work out details in Section 9.2.3. This
example provides an important lesson about penalized linear regression. If we
encounter p > n, we solve for α∗ (an n × n linear system) rather than for w∗
directly (a p × p linear system). Computations scale superlinearly only in the
smaller of p and n.
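
The duality just derived is easy to verify numerically. A sketch (random data of assumed sizes, not from the text) comparing the p × p primal system with the n × n dual system:

```python
# Sketch: primal vs dual solution of Tikhonov-regularized least squares, p > n.
import numpy as np

rng = np.random.default_rng(1)
n, p, nu = 30, 400, 0.1                    # more features than datapoints (assumed sizes)
Phi = rng.normal(size=(n, p))
y = rng.normal(size=n)

w_primal = np.linalg.solve(nu * np.eye(p) + Phi.T @ Phi, Phi.T @ y)   # p x p system
alpha = np.linalg.solve(nu * np.eye(n) + Phi @ Phi.T, y)              # n x n system
w_dual = Phi.T @ alpha

print(np.allclose(w_primal, w_dual))       # True: both give the same weight vector
```
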
Finally, what happens to the representer theorem for unregularized criteria, (9.5)
without the Tikhonov regularizer? Should we still use a dual representation for
w? The answer is yes, although some clarification is needed. If p > n and
(9.5) comes without regularizer, then the problem has infinitely many optimal
solutions. In particular, we can always add any multiple of w⊥ to w without
changing the criterion at all. To see why dual representations are still possible,
in fact constitute a very sensible choice, note that (9.5) without regularizer is
obtained by setting ν = 0. For any ν > 0, the representer theorem provides
an optimal solution w∗^(ν) = ΦT α∗^(ν). By continuity, α∗^(ν) converges to α∗^(0)
as ν → 0, and w∗^(0) = ΦT α∗^(0) is a solution of the unregularized problem.
Moreover, if rk Φ = n, then among the infinitely many solutions, w∗^(0) is the
one with smallest norm ‖w∗ ‖, simply because this holds true for any ν > 0. For
the regularized LSE example above, we have that

w∗^(0) = ΦT α∗^(0) = ΦT (ΦΦT )^{−1} y,

the so called Moore-Penrose pseudoinverse solution. Beware that computing


w∗^(0) in practice can be challenging. If you really do not want to regularize your
problem at all, you should be familiar with best practice methods discussed in
Section 4.2.3.
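
For illustration only (assumed random data with rk Φ = n): the explicit formula above agrees with the SVD-based pseudoinverse routines one would actually rely on in practice rather than inverting ΦΦT explicitly.

```python
# Sketch: Moore-Penrose pseudoinverse solution for p > n and full row rank Phi.
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(30, 400))
y = rng.normal(size=30)

w_explicit = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)   # Phi^T (Phi Phi^T)^{-1} y
w_pinv = np.linalg.pinv(Phi) @ y                       # SVD-based, numerically preferable
print(np.allclose(w_explicit, w_pinv))
```
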

9.2.3 Kernel Methods


We have all the tools together now to make an exciting step. Let us summarize
our findings. We are interested in regularized estimation problems of the form
(9.5), where y(x) = wT φ(x) + b is linear, examples include the soft margin
SVM and MAP for logistic regression. Here is a mad idea. Suppose we use a
huge number of features p, maybe even infinitely many. Before figuring out how
this could be done, let us first see whether this makes any sense in principle.
After all, we have to store w ∈ Rp and evaluate φ(x) ∈ Rp . Do we? In the
previous section, we learned that we can always represent w = ΦT α, where
α ∈ Rn , and our dataset is finite. Moreover, the error function in (9.5) depends
on
y = Φw + b = ΦΦT α + b
only, and ΦΦT is just an Rn×n matrix. Finally, the Tikhonov regularizer is
given by

(ν/2)‖w‖² = (ν/2)‖ΦT α‖² = (ν/2) αT ΦΦT α,
it also only depends on ΦΦT . Finally, once we are done and found (α∗ , b∗ ),
where w∗ = ΦT α∗ , we can predict on new inputs x with

y ∗ (x) = wT∗ φ(x) + b∗ = αT∗ Φφ(x) + b∗ .

We need finite quantities only in order to make our idea work, namely the
matrix ΦΦT during training, and the mapping Φφ(x) for predictions later on,
to be evaluated at finitely many x. This is the basic observation which makes
kernelization work.
The entries of ΦΦT are φ(xi )T φ(xj ), while [Φφ(x)] = [φ(xi )T φ(x)]. We can
write
K(x, x0 ) = φ(x)T φ(x0 ),
a kernel function. It is now clear that given the kernel function K(x, x0 ), we
never need to access the underlying φ(x). In fact, we can forget about the
dimensionality p and vectors of this size altogether. What makes K(x, x0 ) a
kernel function? It must be the inner product in some feature space, but what
does that imply? Let us work out some properties. First, a kernel function is
obviously symmetric: K(x0 , x) = K(x, x0 ). Second, consider some arbitrary set
{xi } of n input points and construct the kernel matrix K = [K(xi , xj )] ∈ Rn×n .
Also, denote Φ = [φ(x1 ), . . . , φ(xn )]T ∈ Rn×p . Then,
αT K α = αT ΦΦT α = ‖ΦT α‖² ≥ 0.

In other words, the kernel matrix K is positive semidefinite (see Section 6.3).
This property defines kernel functions. K(x, x0 ) is a kernel function if the kernel
matrix K = [K(xi , xj )] for any finite set of points {xi } is symmetric positive
semidefinite. An important subfamily are the infinite-dimensional or positive
definite kernel functions. A member K(x, x0 ) of this subfamily is defined by all
its kernel matrices K = [K(xi , xj )] being positive definite for any set {xi } of
any size. In particular, all kernel matrices are invertible. As we will see shortly,
it is positive definite kernel functions which give rise to infinite-dimensional
feature spaces, therefore to nonlinear kernel methods.

Problem: Can I use astronomically many features p? How about p = ∞?


Approach: No problem! As long as you can efficiently compute the kernel
function K(x, x0 ) = φ(x)T φ(x0 ), the representer theorem
saves the day.

Hilbert Spaces and All That (*)

Before we give examples of kernel functions, a comment for meticulous read-


ers (all others can safely skip this paragraph and move to the examples). How
can we even talk about φ(x)T φ(x) if p = ∞? Even worse, what is Φ ∈ Rn×p
in this case? In the best case, all this involves infinite sums, which may not
converge. Rest assured that all this can be made rigorous within the frame-
work of Hilbert function and functional spaces. In short, infinite dimensional
vectors become functions, their transposes become functionals, and matrices
become linear operators. A key result is Mercer’s theorem for positive semidefi-
nite kernel functions, which provides a construction for a feature map. However,
with the exception of certain learning-theoretical questions, the importance of
all this function space mathematics for down-to-earth machine learning is very
limited. Historically, the point about the efforts of mathematicians like Hilbert,
Schmidt and Riesz was to find conditions under which function spaces could
be treated in the same simple way as finite-dimensional vector spaces, working
out analogies for positive definite matrices, quadratic functions, eigendecom-
position, and so on. Moreover, function spaces governing kernel methods are of
the particularly simple reproducing kernel Hilbert type, where common patholo-
gies like “delta functions” do not even arise. You may read about all that in
[36] or other kernel literature, it will not play a role in this course. Just one
warning which you will not find spelled out much in the SVM literature. The
“geometry” in huge or infinite-dimensional spaces is dramatically different from
anything we can draw or imagine. For example, in Mercer’s construction of
K(x, x′ ) = Σ_{j≥1} φj (x)φj (x′ ), the different feature dimensions j = 1, 2, . . . are
by no means on equal terms, as far as concepts like distance or volume are
concerned. For most commonly used infinite-dimensional kernel functions, the
contributions φj (x)φj (x0 ) rapidly become extremely small, and only a small
number of initial features determine most of the predictions. A good intuition
about kernel methods is that they behave like (easy to use) linear methods of
flexible dimensionality. As the number of data points n grows, a larger (but
finite) number of the feature space dimensions will effectively be used.

Examples of Kernel Functions

Let us look at some examples. Maybe the simplest kernel function is K(x, x0 ) =
xT x0 , the standard inner product. Moreover, for any finite-dimensional feature
map φ(x) ∈ Rp (p < ∞), K(x, x0 ) = φ(x)T φ(x0 ) is a kernel function. Since
any kernel matrix of this type can at most have rank p, such kernel functions
are positive semidefinite, but not positive definite. However, even for finite-
dimensional kernels, it can be much simpler to work with K(x, x0 ) directly
than to evaluate φ(x). For example, recall polynomial regression estimation
from Section 4.1, giving rise to a polynomial feature map φ(x) = [1, x, . . . , xr ]T
for x ∈ R. Now, if x ∈ Rd is multivariate, a corresponding polynomial feature

map would consist of very many features. Is there a way around their explicit
representation? Consider the polynomial kernel
K(x, x′ ) = (xT x′ )^r = Σ_{j1 ,...,jr } (xj1 · · · xjr ) (x′j1 · · · x′jr ).

For example, if d = 3 and r = 2, then


K(x, x′ ) = (x1 x′1 + x2 x′2 + x3 x′3 )² = x1²(x′1 )² + x2²(x′2 )² + x3²(x′3 )²
+ 2(x1 x2 )(x′1 x′2 ) + 2(x1 x3 )(x′1 x′3 ) + 2(x2 x3 )(x′2 x′3 ),

a feature map of which is

φ(x) = [x1², x2², x3², √2 x1 x2 , √2 x1 x3 , √2 x2 x3 ]T .

If x ∈ Rd , K(x, x′ ) is evaluated in O(d), independent of r. Yet it is based on
a feature map φ(x) ∈ R^{d^r}, whose dimensionality6 scales exponentially in r. A
variant is given by

K(x, x′ ) = (xT x′ + ε)^r ,   ε > 0,

which can be obtained by replacing x by [xT , √ε ]T above. The feature map now
runs over all subranges 1 ≤ j1 ≤ · · · ≤ jk ≤ d, 0 ≤ k ≤ r.
A frequently used infinite-dimensional (positive definite) kernel is the Gaussian
(or radial basis function, or squared exponential) kernel:
K(x, x′ ) = e^{−(τ/2)‖x−x′ ‖²} ,   τ > 0.        (9.6)

We establish it as a kernel function in Section 9.2.4. The Gaussian is an example


of a stationary kernel, these depend on x−x0 only. We can weight each dimension
differently:
K(x, x′ ) = e^{−(1/2) Σ_{j=1}^d τj (xj −x′j )²} ,   τ1 , . . . , τd > 0.
Free parameters in kernels are hyperparameters, much like C in the soft margin
SVM or the noise variance σ 2 in Gaussian linear regression, choosing them is a
model selection problem (Chapter 10).
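
As a quick numerical illustration (the toy inputs and the value of τ are assumptions), the Gaussian kernel matrix (9.6) can be built directly and checked for positive semidefiniteness by inspecting its smallest eigenvalue:

```python
# Sketch: Gaussian kernel matrix (9.6) and a numerical positive semidefiniteness check.
import numpy as np

def gauss_kernel_matrix(X, tau=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - x_j||^2
    return np.exp(-0.5 * tau * sq_dists)

X = np.random.default_rng(3).normal(size=(50, 3))
K = gauss_kernel_matrix(X, tau=0.5)
print(np.linalg.eigvalsh(K).min())   # >= 0 up to round-off; diagonal entries are all 1
```
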
Choosing the right kernel is much like choosing the right model. In order to do
it well, you need to know your options. Kernels can be combined from others in
many ways, [5, ch. 6.2] gives a good overview. It is also important to understand
statistical properties implied by a kernel. For example, the Gaussian kernel pro-
duces extremely smooth solutions, while other kernels from the Matérn family
are more flexible. Most books on kernel methods will provide some overview,
see also [38].
One highly successful application domain for kernel methods concerns problems
where input points x have combinatorial structure, such as chains, trees, or
6 More economically, we can run over all 1 ≤ j1 ≤ · · · ≤ jr ≤ d.

graphs. Applications range from bioinformatics over computational chemistry


to structured objects in computer vision. The rationale is that it is often simpler
and much more computationally efficient to devise a kernel function K(x, x0 )
than a feature map φ(x). This field was seeded by independent work of David
Haussler [21] and Chris Watkins.
A final remark concerns normalization. As noted in Chapter 2 and above in
Section 9.1, it is often advantageous to use normalized feature maps kφ(x)k = 1.
What does this mean for a kernel?

K(x, x) = φ(x)T φ(x) = 1.

Therefore, a kernel function gives rise to a normalized feature map if its diagonal
entries K(x, x) are all 1. For example, the Gaussian kernel (9.6) is normalized.
Moreover, if K(x, x0 ) is a kernel, then so is

K(x, x′ ) / √(K(x, x) K(x′ , x′ ))

(see Section 9.2.4), and the latter is normalized. It is a good idea to use nor-
malized kernels in practice.
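
For a precomputed kernel matrix, this normalization is a two-line operation; the following sketch assumes K is any valid kernel matrix with positive diagonal entries:

```python
# Sketch: normalizing a kernel matrix so that all diagonal entries become 1.
import numpy as np

def normalize_kernel(K):
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)        # K(x, x') / sqrt(K(x, x) K(x', x'))
```
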

9.2.4 Techniques: Properties of Kernels (*)


In this section, we review a few properties of kernel functions and look at some
more examples. The class of kernel functions has formidable closedness prop-
erties. If K1 (x, x0 ) and K2 (x, x0 ) are kernels, so are cK1 for c > 0, K1 + K2
and K1 K2 . You will have no problem confirming the first two. The third
is shown at the end of this section. Moreover, f (x)K1 (x, x0 )f (x0 ) is a kernel
function as well, for any f (x). This justifies kernel normalization, as discussed
at the end of Section 9.2.3. If Kr (x, x0 ) is a sequence of kernel functions con-
verging pointwise to K(x, x0 ) = limr→∞ Kr (x, x0 ), then K(x, x0 ) is a kernel
function as well. Finally, if K(x, x0 ) is a kernel and ψ(y) is some mapping into
Rd , then (y, y 0 ) 7→ K(ψ(y), ψ(y 0 )) is a kernel as well.
Let us show that the Gaussian kernel (9.6) is a valid kernel function. First,
(xT x0 )r is a kernel for every r = 0, 1, 2, . . . , namely the polynomial kernel from
Section 9.2.3. By the way, K(x, x0 ) = 1 is a kernel function, since its kernel
matrices 11T are positive semidefinite. Therefore,
Kr (x, x′ ) = Σ_{j=0}^r (1/j!) (xT x′ )^j

are all kernels, and so is the limit e^{xT x′} = lim_{r→∞} Kr (x, x′ ). More generally, if
K(x, x′ ) is a kernel, so is e^{K(x,x′ )} . Now,

e^{−(τ/2)‖x−x′ ‖²} = e^{−(τ/2)‖x‖²} e^{τ xT x′} e^{−(τ/2)‖x′ ‖²} .

The middle factor is a kernel, and we apply our normalization rule with
f (x) = e^{−(τ/2)‖x‖²} . The Gaussian kernel is infinite-dimensional (positive
definite), although we will not show this here.

Another way to think about kernels is in terms of covariance functions. A ran-


dom process is a set of random variables a(x), one for each x ∈ Rd . Its covariance
function is

K(x, x0 ) = Cov[a(x), a(x0 )] = E [(a(x) − E[a(x)])(a(x0 ) − E[a(x0 )])] .

Covariance functions are kernel functions. For some set {xi }, let a = [a(xi ) −
E[a(xi )]] ∈ Rn be a random vector. Then, for any v ∈ Rn :

vT K v = vT E[a aT ] v = E[(vT a)²] ≥ 0.

Finally, there are some symmetric functions which are not kernels. One example
is
K(x, x′ ) = tanh(αxT x′ + β).


In an attempt to make SVMs look like multi-layer perceptrons, this non-kernel


was suggested and is shipped to this day in many SVM toolboxes7 . Running
the SVM with “kernels” like this spells trouble. Soft margin SVM is a convex
optimization problem only if kernel matrices are positive semidefinite, codes will
typically crash if that is not the case. A valid “neural networks” kernel is found
in [44], derived from the covariance function perspective.
Finally, why is K1 (x, x0 )K2 (x, x0 ) a kernel? This argument is for interested
readers only, it can be skipped at no loss. We have to show that for two positive
semidefinite kernel matrices K 1 , K 2 ∈ Rn×n the Schur (or Hadamard) product
K 1 ◦K 2 = [K1 (xi , xj )K2 (xi , xj )] (Section 2.4.3) is positive semidefinite as well.
To this end, we consider the Kronecker product K 1 ⊗ K 2 = [K1 (xi , xj )K 2 ] ∈
R^{n²×n²} . This is positive semidefinite as well. Namely, we can write K 1 = V 1 V T1 ,
K 2 = V 2 V T2 , then

K 1 ⊗ K 2 = (V 1 ⊗ V 2 )(V 1 ⊗ V 2 )T .

But the Schur product is a square submatrix of K 1 ⊗ K 2 , a so called minor.


In other words, for some index set J ⊂ {1, . . . , n2 } of size |J| = n: K 1 ◦ K 2 =
(K 1 ⊗ K 2 )J , so that for any v ∈ Rn :
vT (K 1 ◦ K 2 )v = zT (K 1 ⊗ K 2 )z ≥ 0,   z = I ·,J v ∈ R^{n²} .

The same proof works to show that the positive definiteness of K 1 , K 2 implies
the positive definiteness of K 1 ◦ K 2 , a result due to Schur.

9.2.5 Summary
Let us summarize the salient points leading up to support vector machine bi-
nary classification. We started with the observation that for a linearly separable
dataset, many different separating hyperplanes result in zero training error.
Among all those potential solutions, which could arise as outcome of the per-
ceptron algorithm, there is one which exhibits maximum stability against small
displacements of patterns xi , by attaining the maximum margin γD (w, b). Pur-
suing this lead, we addressed a number of problems:
7 It is even on Wikipedia (en.wikipedia.org/wiki/Support vector machine).

• Maximizing the margin does not look like a convex optimization prob-
lem. However, exploiting the fact that the size of (w, b) does not matter
(only the direction does), we can find an equivalent convex program for-
mulation: minimize a positive definite quadratic function, subject to linear
constraints, a quadratic program. This is the hard margin support vector
machine. Each margin constraint is enforced in a hard, non-negotiable
manner.

• Real data often contains outliers, arising through measurement noise or


labeling errors. They can result in overly small margins or even render a
training dataset linearly non-separable. In practice, we should enforce sta-
bility, yet at the same time tolerate a small number of margin violations
to not get sidetracked by outliers. Inspired by logistic regression, we in-
troduced the concept of a “soft margin”, allowing each margin constraint
to be violated, but requesting a linear cost for such slack. The result-
ing soft margin support vector machine comes in two equivalent forms.
First, we can represent costs to be paid as additional slack variables ξi ,
extending the quadratic program in a natural way. Second, the soft margin
SVM problem corresponds to minimizing the piecewise linear hinge error
function plus a Tikhonov regularizer, which allows for comparisons with
logistic regression.

• Once Tikhonov regularization is used, we can in principle use linear dis-


criminants with many more features p than training datapoints n. Maybe
the most surprising twist in this chapter is that we can do so and do
not even have to pay for it. The representer theorem and the kernel
trick allows us to reformulate Tikhonov-regularized estimation problems
in only n dual variables, no matter what p is. We never have to evalu-
ate the feature map φ(x) ∈ Rp , but can get by with kernel evaluations
K(x, x0 ) = φ(x)T φ(x0 ). In other words, we can formulate our estima-
tion problem in terms of a kernel function up front, ignoring the feature
map behind. Resulting discriminant functions come in the form of kernel
expansions:
y(x) = Σ_{i=1}^n αi K(x, xi ) + b.

This is a weighted sum of kernel functions K(·, xi ) placed on each data-


point.

• We have motivated kernel methods as arising from linear functions in a


feature space given by φ(x), so that K(x, x0 ) = φ(x)T φ(x0 ). However,
in practice we choose a kernel function K(x, x0 ) and work with it without
ever having to know the underlying feature map φ(x). In fact, for any
given kernel K(x, x0 ), there are many feature maps giving rise to it, so it
is not even possible to uniquely go from K(x, x0 ) to φ(x).

At this point, we could plug the kernel expansion in terms of α and b into the
soft margin SVM problem (9.4), then solve the resulting quadratic program in
(α, b, ξ). Indeed, this is how we would proceed in order to “kernelize” logistic
regression. However, in case of the SVM, some more work leads to a simpler

optimization problem, whose properties provide important insights into the in-
fluence of single patterns on the final solution. We will derive this dual problem
in the following section.

9.3 Solving the Support Vector Machine Problem
In this section, we will learn how to solve the soft margin SVM learning prob-
lem (9.4), using the concept of Lagrange duality. We will keep our exposition
simple and use graphical intuition rather than proofs, but there is no shortage
of literature to fill in these gaps [7, 36].
The soft margin SVM is the solution of a convex problem (9.4) with a convex
criterion and linear inequality constraints. It is posed in terms of p + n + 1
parameters and n constraints. Maybe the most powerful tool to address convex
programs of this kind is Lagrange duality, a technique to derive a second convex
optimization problem of the same kind, called the dual problem, which (a) has
the same optimal solution as the primal problem we start from, and (b)
operates roughly on as many parameters as the number of primal constraints. As
we will see, the dual problem for the soft margin SVM is a quadratic program in
n parameters, independent of p. In fact, if we did not know about the representer
theorem and kernel expansions, we would reinvent it for the SVM in this way.
In order to keep the exposition as simple as possible, we will derive the soft
margin SVM dual problem in an intuitive and somewhat informal way. While
this is sufficient for our purposes here, Lagrange duality for general convex
problems is a core concept in machine learning today, with applications far
beyond support vector machines. For this reason, we provide a more complete
account of Lagrange multipliers and Lagrange duality in Appendix A.
Our starting point is the formulation of the soft margin SVM problem in terms
of the hinge error function:
min_{w,b} { ΦP = (1/2)‖w‖² + C Σ_{i=1}^n [1 − ti yi ]+ },   yi = wT φ(xi ) + b.

This is an unconstrained optimization problem, so why don’t we proceed as


so many times before, solving ∇w,b ΦP = 0 for w and b? This strategy fails
badly in this case, since ΦP is not differentiable. Our goal must be to replace
ΦP by continuously differentiable functions, for which we can make use of the
first-order stationary condition. The key idea is sketched in Figure 9.6. The
non-differentiable term C[1 − ti yi ]+ is lower bounded by the linear αi (1 − ti yi ),
as long as αi ∈ [0, C]. This family of linear lower bounds is tight, in that
C[1 − ti yi ]+ = max_{αi ∈[0,C]} αi (1 − ti yi ).

The new variables α = [αi ] represent the slopes of the linear bounds, they come
with two-sided linear inequality constraints. Given these variables, our problem
becomes
p̃∗ = min_{w,b} max_{0≤αi ≤C} { L(w, b, α) = (1/2)‖w‖² + Σ_{i=1}^n αi (1 − ti yi ) }.


Figure 9.6: Hinge function v 7→ C[v]+ , C = 2, where v = 1 − ty (black), along


with some linear lower bounds v 7→ αv (α = 0.1, red; α = 1, blue; α = 1.9,
green). These are global lower bounds of the hinge function for any α ∈ [0, C].

At the expense of introducing new variables αi , one for each soft margin con-
straint, the inner criterion L is continuously differentiable. This is called the
primal problem, and p̃∗ is called the primal value. Notice the min-max structure
of this problem: what we are after is a saddlepoint, minimizing L w.r.t. the pri-
mal variables w, b, while at the same time maximizing L w.r.t. the new dual
variables α.
These are not the only saddlepoints; we might just as well look at
d̃∗ = max_{0≤αi ≤C} min_{w,b} L(w, b, α).

Here, the unconstrained minimization over w, b is inside, the maximization


over dual variables is outside. This is called the dual problem. How do these two
problems relate to each other? First of all, we always have
d˜∗ ≤ p̃∗ .
After all, this is how we constructed things. For each fixed α with αi ∈ [0, C]:
min_{w,b} L(w, b, α) ≤ min_{w,b} max_{0≤α′i ≤C} L(w, b, α′ ).

The dual value d˜∗ bounds the primal value p̃∗ from below. This fact is called
weak duality. While weak duality holds for any primal problem, convex or not,
more is true for our soft margin SVM problem. There, we have strong duality,
in that the two saddlepoint values are identical:
max_{0≤αi ≤C} min_{w,b} L(w, b, α) = d̃∗ = p̃∗ = min_{w,b} max_{0≤αi ≤C} L(w, b, α).

This means that we can solve the original primal problem by solving the dual
problem instead, which will turn out to be simpler in the case of soft margin
SVM.

Let us solve the dual problem. Denote φi := φ(xi ). The inner criterion is
L = (1/2)‖w‖² + Σ_{i=1}^n αi (1 − ti yi ),   yi = wT φi + b,   i = 1, . . . , n.

The criterion for the dual problem is obtained by minimizing over the primal
variables (w, b):
ΦD (α) = min_{w,b} L(w, b, α).

Finally an unconstrained and differentiable problem: we can apply our “set the
gradient to zero” procedure! First,
∇w L = w − Σ_{i=1}^n αi ti φi = 0   ⇒   w = Σ_{i=1}^n αi ti φi .

Here is the feature expansion we were waiting for! Also,


∇b L = −Σ_{i=1}^n αi ti = 0   ⇒   Σ_{i=1}^n αi ti = 0.

This means that the dual problem has an equality constraint. Plugging these
into the criterion, we obtain the dual optimization problem:

max_α { ΦD (α) = (1/2)‖w‖² + Σi αi (1 − ti yi ) = Σi αi − (1/2) Σ_{i,j} αi αj ti tj φTi φj },
subj. to  αi ∈ [0, C],  i = 1, . . . , n,   Σi αi ti = 0.        (9.7)
We used that
Σi αi ti yi = Σi αi ti (φTi w + b) = Σi αi ti φTi w = ‖w‖²,

using (1) Σi αi ti = 0 in the second equality and (2) w = Σi αi ti φi in the
third. The dual problem is indeed
simpler than the primal. First, it depends on n variables only, not on p+n+1. It is
convex, since −ΦD (α) is a positive semidefinite quadratic. Finally, its feasible set
is simply the intersection of the hypercube [0, C]n with the hyperplane αT t = 0.
It is naturally kernelized (Section 9.2.3). Picking a kernel function K(x, x0 ), we
have that
Kij = K(xi , xj ) = φTi φj , K = [Kij ] ∈ Rn×n .
In terms of this kernel matrix, the dual criterion becomes
ΦD (α) = Σi αi − (1/2) Σ_{i,j} αi ti Kij tj αj = 1T α − (1/2) αT (diag t)K (diag t)α.
Moreover, if α∗ is the solution of the dual problem, then w∗ = Σ_{i=1}^n α∗,i ti φi
solves the primal, and we obtain the discriminant

y∗ (x) = (w∗ )T φ(x) + b∗ = Σ_{i=1}^n α∗,i ti K(xi , x) + b∗ .        (9.8)

Support Vectors

There is a property we have not yet used, which links primal and dual variables
at the saddlepoint. By working it out, we will complete our derivation of the
soft margin SVM dual problem. First, we will find how to solve for b∗ . Second,
we will obtain an important classification of patterns (xi , ti ) in terms of values
of αi , which will finally clarify the catchy name “support vectors”. To gain some
intuition, consider the soft margin SVM solution in Figure 9.7. Three different
things can happen for a pattern (xi , ti ). First, it may be classified correctly
with at least the margin, in that ti yi ≥ 1. In this case, the optimal solution does
not really depend on it. We could remove the pattern from D and still get the
same solution. The solution is not supported by these vectors. Second, it may
lie precisely on the margin, ti yi = 1. These ones are important, they provide
the essential support. Finally, for the soft margin SVM, there may be patterns
in the margin area or even misclassified, ti yi < 1. They support the solution as
well, because we pay for them, and removing them may allow us to increase the
soft margin.

Figure 9.7: Different types of patterns for soft margin SVM solution. (x1 , t1 )
is classified correctly with margin, t1 y1 ≥ 1. The SV solution does not depend
on it. (x2 , t2 ) is an essential support vector and lies directly on the margin,
t2 y2 = 1. Both (x3 , t3 ) and (x4 , t4 ) are bound support vectors, α3 = α4 = C.
(x3 , t3 ) lies in the margin area, while (x4 , t4 ) is misclassified (t4 y4 < 0).

Suppose now that we are at a saddlepoint of the dual problem, dropping the
“*” subscript for easier notation. Recall how we got the αi into the game above:
C[1 − ti yi ]+ = max_{αi ∈[0,C]} αi (1 − ti yi ).

A glance at Figure 9.6 reveals that αi is linked with yi (and therefore with w, b)

through the requirement that the lower bound has to touch the hinge function
at the argument 1 − ti yi . As illustrated in Figure 9.8, there are three different
cases:

• αi = 0. In this case, 1 − ti yi ≤ 0 (Figure 9.8, left). These are points which


are classified correctly with at least the margin. Since αi = 0, the solution
indeed does not depend on them. They are not support vectors.
• αi ∈ (0, C). In this case, 1 − ti yi = 0 (Figure 9.8, middle). These points
lie directly on the margin. They are essential support vectors.
• αi = C. Then, 1 − ti yi ≥ 0 (Figure 9.8, right). These points lie in the
margin area or may even be misclassified. They are called bound support
vectors, since αi is bound to the maximum value C.


Figure 9.8: Left: No support vector (1 − ti yi ≤ 0, αi = 0). Middle: Essential


support vector (1 − ti yi = 0, 0 < αi < C). Right: Bound support vector (1 −
ti yi ≥ 0, αi = C).

If (xi , ti ) is not a support vector, its αi = 0 and it does not appear in the
kernel expansion (9.8) of y ∗ (x). If this happens for many patterns, the kernel
expansion (9.8) is sparse, and a lot of time can be saved both when predicting
on future test data and during training. Sparsity is a major attractive point
about support vector machines in practice (see Section 9.4).
Finally, the essential support vectors allow us to determine the solution for b.
Denote S = {i | αi ∈ (0, C)}. For these, let ỹi = Σj αj tj K(xj , xi ) = yi − b. We
know that 1 = ti yi = ti (ỹi + b), so that

b = (1/|S|) Σ_{i∈S} (ti − ỹi ).

Note we could also just compute b from any single i ∈ S, but the above formula
averages out numerical errors.
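
Putting the pieces of this section together, here is a compact sketch of solving the dual (9.7) and recovering b and the kernel expansion (9.8). The use of the generic solver cvxpy (rather than an SMO-type method), the tolerance for identifying essential support vectors, and the assumption that at least one essential support vector exists are illustrative choices, not a production recipe:

```python
# Sketch: soft margin SVM dual (9.7) via a generic convex solver, then b and y(x) via (9.8).
import numpy as np
import cvxpy as cp

def svm_dual_fit(K, t, C=1.0):
    n = len(t)
    Q = np.outer(t, t) * K                            # (diag t) K (diag t)
    Q = 0.5 * (Q + Q.T) + 1e-9 * np.eye(n)            # symmetrize, tiny jitter for PSD check
    a = cp.Variable(n)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, Q))
    constraints = [a >= 0, a <= C, t @ a == 0]
    cp.Problem(objective, constraints).solve()
    alpha = a.value
    essential = (alpha > 1e-6) & (alpha < C - 1e-6)   # 0 < alpha_i < C: on the margin
    y_tilde = K[essential] @ (alpha * t)              # sum_j alpha_j t_j K(x_j, x_i)
    b = np.mean(t[essential] - y_tilde)               # assumes at least one essential SV
    return alpha, b

def svm_predict(K_test_train, alpha, t_train, b):
    # Kernel expansion (9.8): y(x) = sum_i alpha_i t_i K(x_i, x) + b
    return K_test_train @ (alpha * t_train) + b
```
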
Most SVM codes in use today solve this dual problem, making use of the so
called Karush-Kuhn-Tucker (KKT) conditions we just derived. A particularly
simple algorithm is sequential minimal optimization (SMO) [32], where pairs
(αi , αj ) are updated in each iteration. There are some good and some not
so good SVM packages out there. An example for a good code is LIBSVM at
www.csie.ntu.edu.tw/∼cjlin/libsvm/.
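
For completeness, a minimal usage sketch with scikit-learn's SVC, which wraps LIBSVM; the toy data, C and the kernel width gamma are assumptions:

```python
# Sketch: soft margin SVM with an RBF kernel via scikit-learn (a LIBSVM wrapper).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, t)
print("number of support vectors:", clf.support_.size)   # indices are in clf.support_
print("training accuracy:", clf.score(X, t))
```
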
This concludes our derivation of the soft margin SVM optimization problem in
both primal and dual form. Our derivation is a special case of the much more
general and powerful framework of Lagrange multipliers and Lagrange duality,
which we discuss in Appendix A.

9.4 Support Vector Machines and Kernel Logistic Regression (*)
In this chapter, we worked out support vector machines and learned how to
kernelize estimation techniques based on linear functions. For the SVM, kernel-
ization is a consequence of working with the dual problem (Section 9.3), while
for general regularized estimators, it is justified by the representer theorem (Sec-
tion 9.2.2). So what’s the deal about SVMs? In this final section, we will address
this question by relating the soft margin SVM to kernel logistic regression for
binary classification.
Briefly, the main attraction of SVMs over regularized conditional likelihood
estimators is sparsity, while a major drawback of SVMs is their inability to
consistently estimate posterior class probabilities. These two aspects are linked
in a fascinating way, unfortunately out of scope here. The latter may seem like
a small problem, but it is the ultimate reason behind the appalling difficulties of
generalizing SVMs to multi-way classification in a sensible way, while this can
be done easily for logistic regression (Section 8.3.1).

Sparsity

The solution (α, b) to a soft margin SVM problem is sparse if most patterns
are not support vectors and most of the coefficients of α are zero. This does
not always happen, but it happens frequently in practice. A beneficial aspect
of sparsity is that the final predictor y ∗ (x) (9.8) can be evaluated efficiently
on test data. In fact, we only have to store the support vectors for prediction
purposes. In many real-world applications, it is very important to be able to
predict quickly, and sparsity is a substantial asset then. Even better, sparsity
tends to help with the training as well. Most SVM solvers exploit sparsity in one
way or another, which often lets them cope with dataset sizes n for which the
storage of the full kernel matrix K is out of the question. An extreme example is
SMO (Section 9.3), which is very slow without, yet highly effective with sparsity.
One problem with SVM sparsity is that it cannot be controlled in a sensible way,
nor can we easily say up front how sparse our solution will be. To some extent,
sparsity can be linked to the Bayes error of a problem (details in [36]), but the
link is not perfect. It is sometimes advocated to regulate sparsity by the choice of
C in (9.4). It is true that cranking up C can lead to sparser solutions, but that is
not what this parameter is for. C should be chosen solely with one goal in mind,
namely to attain the best possible generalization error. There are alternative
methods which allow the degree of sparsity to be regulated, for example the
relevance vector machine [5, ch. 7.2] or the informative vector machine [26].
However, when it comes to combining sparsity with computational efficiency
and robustness, there is still no match for the SVM.

Estimating Posterior Class Probabilities

Recall from Section 8.2.2 that a conditional maximum likelihood method such
as logistic regression can estimate class posterior probabilities P (t = +1|x),

since its population minimizer, the log odds ratio, is probability-revealing. In


this section, we consider regularized and kernelized logistic regression, which
corresponds to the problem
min_{α,b} Σ_{i=1}^n log(1 + e^{−ti y(xi )} ) + (ν/2) αT K α,   y(x) = Σ_{j=1}^n αj K(xj , x) + b.        (9.9)

Here, K is the kernel matrix. Other than in the SVM dual problem, the αi can
be negative (in fact, αi plays the role of ti αi in the SVM problem). For kernels
such as the Gaussian (9.6) and a certain decay of ν = νn with training set size
n, the minimizers Σj α∗,j K(xj , ·) + b∗ of (9.9) converge to the log odds ratio
log{P (t = +1|x)/P (t = −1|x)} for the true data distribution almost surely8 .
In other words, posterior class probabilities P (t = +1|x) can be estimated
consistently with regularized kernel logistic regression.
How about the soft margin SVM? First, it is not a conditional maximum a-
posteriori technique, mainly since the hinge loss [1 − ti yi ]+ does not correspond
to a negative log likelihood [37]. A way of understanding the SVM in terms of
probabilities is shown in [24], but this is markedly different from MAP estima-
tion. It should therefore not come as a surprise that posterior class probabilities
cannot be estimated consistently with the SVM. The population minimizer for
the hinge loss is

y∗ (x) = argmin_y E[ [1 − ty]+ | x ]
       = +1  if P (t = +1|x) > P (t = −1|x),
         −1  if P (t = +1|x) < P (t = −1|x),
         any value in [−1, +1]  if P (t = +1|x) = P (t = −1|x).

The proof is left to the reader. Note that y ∗ (x) can take any value in [−1, +1]
if P (t = 1|x) = P (t = −1|x). Now, y ∗ (x) is certainly not probability-revealing.
For each x, it merely contains the information whether P (t = +1|x) > P (t =
−1|x) or P (t = +1|x) < P (t = −1|x), but nothing else. Indeed, Bartlett
and Tewari [1] demonstrated rigorously that P (t = +1|x) cannot be estimated
consistently by SVMs. The reason for this is precisely the sparsity of SVMs.
The failure of SVMs does not just happen on some unusual data distributions,
but is a direct implication of margin maximization. Ironically, despite this clear
result, SVMs are used to “estimate class probabilities” to this day, using [33]
or more elaborate variants. This cannot be justified as an approximation, it is
just plain wrong. If we want a sparse classifier trained by convex optimization,
we can use the SVM. If our goal is to estimate class probabilities for automatic
decision making (Section 5.2), the SVM is not a trustworthy option, and we
should use other methods such as kernelized logistic regression, trained by the
IRLS algorithm (Section 8.2.4).
We close this section with noting that there is a probabilistic counterpart to
support vector machines, going far beyond MAP estimation for logistic regres-
sion: Gaussian process methods. Kernels are covariance functions there (Sec-
tion 9.2.4), which give rise to Gaussian priors over functions y(·). GP methods
8 This is a classical result, which follows for example from Corollary 3.62 and Example 3.66 in [41].

are out of scope of this basic course, but do constitute an important part of
modern machine learning, with applications ranging from computer vision over
robotics, spatial statistics to bioinformatics. If you are interested, start with [45]
or the textbook [34]. A useful website is www.gaussianprocess.org.
Chapter 10

Model Selection and Evaluation

The central problem of machine learning is generalization. We built a model


for our problem at hand and trained it on some data. How well will it work
on future test data? How can we estimate its generalization performance? How
complex should our model be, how much should we regularize? In this chapter,
we develop an understanding of such questions. We will learn about the training-
validation-test set paradigm of machine learning and about cross-validation as
a general technique for selecting complexity parameters of a model.

10.1 Bias, Variance and Model Complexity


Recall the fundamental concept of generalization from Chapter 7. Decision the-
ory (Section 5.2) tells us how to act if the “true” distribution underlying our
problem is known: we should minimize risk, or expected loss. Normally, we don’t
know the truth, but have access to a training dataset D = {(xi , ti ) | i = 1, . . . , n}
sampled from it. Our goal must now be to make use of D in order to approx-
imate the non-realizable decision-theoretic procedure as closely as possible. In
particular, we would want to know when things go wrong. We already identified
a major problem, over-fitting, and studied regularization techniques in order
to alleviate it (Chapter 7). But that is not the whole story. How do we decide
between different models, given D? For example, what is the best total degree
in polynomial regression (Section 4.1)? The best noise variance in linear regres-
sion with Gaussian noise (Section 8.1.3)? How do we choose the parameter ν
of a Tikhonov regularizer (Section 7.2), or prior parameters in MAP estimation
(Section 8.3.2)? All of these parameters have in common that they determine
the model’s expressiveness or complexity. They cannot be chosen by minimiz-
ing an error function on D, since more complex models always result in better
training data fits. In this chapter, we will be concerned with how to estimate
the (test) risk from data samples. Two major reasons for doing so are:

• Model selection: How do we choose between models or learning methods

of different complexity (for example, hyperparameters or regularization constants)?

• Model assessment: How will the finally chosen method behave on future
test data?

10.1.1 Validation and Test Data


The simplest, and ultimately the only 100% safe way of estimating the test error
is to use independent data which is never used for training, and to use it only
once. We distinguish between training data, validation data and test data. We
already know what the training data is for. The test data is kept in a vault and
only used at the very end, for model assessment or to compare final predictors
against each other. In a serious machine learning study, we are not allowed to go
back and change things after the test data has been used in any way. Otherwise,
the test set takes the role of a training set and test errors will be underestimated:
our method will look better than it really is.
The validation set is used for model selection, as explored in detail in Sec-
tion 10.2. Briefly, we compare trained predictors ŷν (·) for different values of
a complexity parameter ν by estimating their respective test errors using the
validation dataset, then choosing the value ν for which the test risk estimate
is lowest. This works because training and validation set are independent, so
that training and test risk estimation cannot influence each other. It is difficult
to give a general rule on how to choose respective sizes of training, test and
validation set. A typical split may be 50% for training, 25% for testing and
25% for validation. It is good practice to split a given database at random, or
permute cases up front, as there is often some particular ordering imposed on
the original set.
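
Such a random split is simple to implement; the following sketch (the 50/25/25 proportions and the seed are assumptions) permutes the cases first to break any ordering:

```python
# Sketch: random 50% / 25% / 25% split into training, validation and test sets.
import numpy as np

def split_data(X, t, seed=0):
    n = len(t)
    perm = np.random.default_rng(seed).permutation(n)   # break any ordering in the database
    n_train, n_val = n // 2, n // 4
    idx_train = perm[:n_train]
    idx_val = perm[n_train:n_train + n_val]
    idx_test = perm[n_train + n_val:]
    return ((X[idx_train], t[idx_train]),
            (X[idx_val], t[idx_val]),
            (X[idx_test], t[idx_test]))
```
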
An obvious problem with this procedure is waste of data. We need to hold
out a substantial fraction of available data, which may just as well be used
in order to train a better model. In many applications, high quality labeled
data is difficult and expensive to come by. In the remainder of this chapter,
we will mainly be interested in ways to understand and estimate generalization
performance by using the training dataset only. It is important to note that these
complement, but do not replace hold out dataset procedures. Which of them
we use in practice, must depend on reliability standards for model selection
and assessment as well as the size of available data in comparison to model
complexity.

10.1.2 A Simple Example


Let us start with an artificial polynomial curve fitting example, reminiscent of
Section 4.1 and Section 7.1.1. In Figure 10.1, top, we show two datasets of 10
points each. The data DL in the left panel is closely fitted by a line, while the
data DR on the right is equally closely fitted by a polynomial of degree 7. In
terms of goodness of the fit, these two examples are equivalent. However, most
of us would agree that on the left, we discovered underlying linear structure,
while the result on the right is probably arbitrary. Intuitively, there are so few


Figure 10.1: Two datasets of size 10, closely fitted by polynomials of degree 1
(left) and degree 7 (right). Top row: Least squares fits exhibit similar squared
error in either case. Bottom left: Degree 1 polynomials fitted to subsets of size 5
are close to the fit for the whole set. Bottom right: Degree 7 polynomials fitted
to subsets of size 5 are very different from each other and from the fit for the
whole set.

10 point datasets well described by lines, compared to what could happen by


chance, that the left fit is highly likely to witness underlying structure. On the
other hand, polynomials of degree 7 can closely fit a substantial fraction of all
10 point datasets, and the fit on the right is likely to arise just by chance. This
intuition is still too vague to be useful in practice, so let us try to be more clear.
Suppose we obtained another independent sample of size 10 on either side, D′L
and D′R , from the same distributions respectively. On the left, we would expect
the previous line to be a good fit for the new sample D′L as well. Put differently,
the best line fits of DL and D′L should agree closely. On the right, it seems
much more likely that the best degree 7 fits of DR and D′R are very different
from each other. In short, if we could resample the data, we would expect much
more variability on the right than on the left side. Why do we expect that?
Since data is sampled independently from the same distribution, we can split
DL randomly into two parts of 5 points each (Figure 10.1, bottom). The best
line fits to each subsample are very close to each other and to the best fit of
the whole DL . On the right side, a similar split of DR leads to entirely different
degree 7 polynomial fits. This simple example conveys two ideas which we will

develop in this chapter.

• Generalization behaviour can be understood and analyzed by considering


the training sample D as a random variable which can be resampled in
principle. The analysis focusses on both average value (mean) and vari-
ability (variance) of prediction errors w.r.t. D. In particular, high variance
is a telltale sign of over-fitting.

• In general, estimating the test error of a predictor requires independent


data not used for training. However, it may be possible to estimate average
and variability of prediction errors by splitting the single training sample
D into several parts, using some for training and others for evaluation.

To be more specific, we introduce the (test) risk


R(ŷν |D) = E[ L(ŷν (x), t) | D ]

and the empirical (or training) risk


R̂n (ŷν |D) = (1/n) Σ_{i=1}^n L(ŷν (xi ), ti ),

where D = {(xi , ti ) | i = 1, . . . , n} is a training dataset drawn i.i.d. from the


(unknown) true law, ŷν (·) is the predictor trained on D for a complexity param-
eter ν, and (x, t) in the definition of R(ŷν |D) is another (independent) test case
from the same distribution. Both test and training risk depend on the sample
D, because ŷν (·) depends on D. However, the training risk R̂n (ŷν |D) depends
on D twice: it is used to select the predictor ŷν (·) and to quantify its risk. We
already know that minimizing the empirical risk R̂n (ŷν |D) is not necessarily a
good way of choosing ν, since we may run into the overfitting problem. On the
other hand, we cannot minimize the true risk R(ŷν |D), because we do not know
the true underlying distribution. In this chapter, we analyze the relationship
between risk R(ŷν |D) and empirical risk R̂n (ŷν |D) more closely, with the aim
of finding more reliable empirical estimators of R(ŷν |D).

10.1.3 Bias-Variance Decomposition

Consider a curve estimation setup. ŷ(·) is fit to data D = {(xi , ti ) | i = 1, . . . , n},


ti ∈ R, with the aim of minimizing the population squared error

E = E[L(ŷ(x), t)] = (1/2) E[ E[ (ŷ(x) − t)2 | x ] ].
If you are unsure about the notation, you might want to revisit Chapter 5, in
particular Section 5.2. Our convention is that a conditional expectation E[A|B]
is over all variables in the expression A which do not appear in B. For example,
the expectation above is over (x, t) from the true law, which in the second
expression is decomposed into conditional expectation over t given x inside and
marginal expectation over x outside. We will worry about the dependence of

ŷ(·) on the training sample D in a moment. For now, let us find out what the
optimal solution for ŷ(·) would be. Namely,
E[ (ŷ(x) − t)2 ] = E[ (ŷ(x) − yopt (x) + yopt (x) − t)2 ]
= E[ (ŷ(x) − yopt (x))2 ] + 2 E[ (ŷ(x) − yopt (x)) (yopt (x) − t) ] + E[ (yopt (x) − t)2 ],

where yopt (x) does not depend on t. The middle term is “cross-talk”, it vanishes
if we choose yopt (x) = E[t | x]:
E[ (ŷ(x) − t)2 ] = E[ (ŷ(x) − E[t | x])2 ] + E[ (E[t | x] − t)2 ]
= E[ (ŷ(x) − E[t | x])2 ] + E[ Var[t | x] ].        (10.1)

You should test your understanding on this derivation. Why does the cross-talk
term vanish under this particular choice of yopt (x)? Why can the final term be
written as expected conditional variance? Is this the same as Var[t]? If not1 ,
why not, what is missing?
The expression (10.1) is uniquely minimized by ŷ(x) = E[t | x], the conditional
expectation of t given x. In other words, E[t | x] is the population minimizer
under the squared loss function (see Section 8.2.2), the optimal solution to the
curve fitting problem, which requires knowledge of the true underlying law.
The term E[Var[t | x]] is a lower bound on the population squared error, we
cannot perform better than this. For example, suppose the data is generated
as ti = y∗ (xi ) + εi , where y∗ (·) is an underlying curve, and the εi are i.i.d.
noise variables with zero mean and variance σ 2 (Section 4.1). Then, E[t | x] =
E[y∗ (x) + ε | x] = y∗ (x) and E[Var[t | x]] = σ 2 . For this additive noise setup,
the conditional expectation recovers the underlying curve perfectly, while the
minimum unresolvable error is equal to the noise variance.
Our argument so far is decision-theoretic much like in Section 5.2, but deal-
ing with curve fitting instead of classification. In practice, ŷ(·) is learned from
data D, without knowledge of the true law. Let us fix x and use the quadratic
decomposition once more, this time taking expectations over the sample D:
E[ (ŷ(x|D) − E[t | x])2 | x ] = E[ (ŷ(x|D) − yavg (x) + yavg (x) − E[t | x])2 | x ],

where yavg (x) does not depend on D. Using the same argument as above, the
cross-talk vanishes for yavg (x) = E[ŷ(x|D) | x], the expected prediction curve,
and
E[ (ŷ(x|D) − E[t | x])2 | x ] = (E[ŷ(x|D) | x] − E[t | x])2 + Var[ ŷ(x|D) | x ],
where the first term on the right is the squared bias (Bias2) and the second the variance.

This is the bias-variance decomposition for regression with squared loss. Note
that it holds pointwise at each x. A further expectation over x provides a bias-
variance decomposition of the expected test risk:
E[R(ŷν |D)] = E[ (E[ŷ(x|D) | x] − E[t | x])2 ] + E[ Var[ŷ(x|D) | x] ] + E[ Var[t | x] ]
1 Answer: Var[t] = E[Var[t | x]] + Var[E[t | x]]. Why?

into average squared bias, average variance and intrinsic noise variance. While
the third term is constant (and typically unknown), there is a tradeoff between
the first two terms which can be explored by varying the model complexity
(regularization parameter ν in the example above).

Figure 10.2: Bias and variance for least squares polynomial curve fitting. 1000
samples of size 12 each were drawn from y∗ (x) = sin(2πx) plus N (0, 1/16) noise,
where xi are drawn uniformly from [0, 1]. Each sample includes 0 and 1 among
the xi , to avoid high variance due to extrapolation. Note that this favours the
higher-order polynomials.
Left column: Each sample is fitted by a line y(x) = w1 x + w0 . The expected fit
E[ŷ(x|D) | x] (upper left; solid black) is rather poor, giving rise to substantial
squared bias (lower left; blue). On the other hand, the variance across samples
(lower left; red) is small, in that most fitted lines are close to the average. Right
column: Each sample is fitted by a polynomial y(x) = w6 x6 + · · · + w1 x + w0 of
degree 6. The expected fit is very good (small squared bias), but the variance
across samples (lower right; red) is large. Notice that these curves represent
sample averages of the population quantities E[ŷ(x|D) | x] and Var[ŷ(x|D) | x]
(averaged over 1000 samples).

10.1.4 Examples of Bias-Variance Decomposition


In this section, we work through a number of examples in order to develop
an intuition about bias-variance decompositions. We will use the squared error
throughout. The basic message is this. A simple model tends to give rise to large
bias, but small variance, while a complex model often exhibits small bias and
large variance. In Figure 10.2, we revisit the curve fitting problem of Section 4.1.

Data is sampled as ti = y∗ (xi ) + εi , xi ∈ [0, 1], εi ∼ N (0, σ 2 ), where y∗ (·) is a


smooth curve. We draw 1000 samples D of size 12 each and plot both average
curve E[ŷ(x|D) | x] and standard deviation Var[ŷ(x|D) | x]1/2 in the upper row,
squared bias and variance in the lower row. In each column, we use different
function classes for regression: lines y(x) = w1 x + w0 on the left, polynomials of
degree six on the right. The model on the left is less complex than the one on the
right. The underlying y∗ (·) is not well represented by any line, explaining the
large bias on the left. On the other hand, the two parameters w1 , w0 are robustly
estimated even from little data, so the variance is small. In contrast, the average
over six-degree polynomials represents y∗ (x) = E[t | x] closely, so the bias on the
right is small. However, as higher-order polynomials fit the erratic noise on top
of the signal (over-fitting), the variance over samples D is large. Neither of these
choices realizes a good tradeoff between squared bias and variance.
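To make the resampling experiment concrete, here is a minimal NumPy sketch along the lines of the setup behind Figure 10.2 (many samples of size 12, sin(2πx) plus Gaussian noise). The function name, the evaluation grid and the number of samples are illustrative, and np.polyfit is simply used as a stand-in for a least squares solver.

import numpy as np

def bias_variance_polyfit(degree, n_samples=1000, n=12, sigma2=1.0/16, seed=0):
    """Monte Carlo estimate of average squared bias and average variance for
    least squares polynomial fits, in the spirit of Figure 10.2."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(0.0, 1.0, 101)           # evaluation grid
    y_true = np.sin(2 * np.pi * xs)           # E[t | x] for this data model
    preds = np.empty((n_samples, xs.size))
    for s in range(n_samples):
        x = rng.uniform(0.0, 1.0, size=n)
        x[0], x[1] = 0.0, 1.0                 # include the endpoints, as in the figure
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(sigma2), size=n)
        coeffs = np.polyfit(x, t, deg=degree)
        preds[s] = np.polyval(coeffs, xs)
    avg_pred = preds.mean(axis=0)             # estimate of E[yhat(x|D) | x]
    bias2 = (avg_pred - y_true) ** 2
    variance = preds.var(axis=0)              # estimate of Var[yhat(x|D) | x]
    return bias2.mean(), variance.mean()

for deg in [1, 3, 6]:
    b2, var = bias_variance_polyfit(deg)
    print(f"degree {deg}: avg. Bias^2 = {b2:.3f}, avg. Variance = {var:.3f}")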


Figure 10.3: Bias-variance decomposition for least squares polynomial curve


fitting. Shown are average squared bias E[(E[ŷ(x|D) | x] − E[t | x])2 ], average
variance E[Var[ŷ(x|D) | x]] and expected test risk (sum of former two plus noise
variance) as function of polynomial degree, estimated from the same data as in
Figure 10.2. The best tradeoff between squared bias and variance (on average
over x) is obtained for degree 3 polynomials.

In Figure 10.3, we repeat the same polynomial curve fitting setup (1000 training
samples D of size 12 each), showing average squared bias, average variance and
expected test risk as function of the polynomial degree (p − 1, where p is the
number of weights). Here is a bias-variance tradeoff in action. As p (and therefore
the model complexity) increases, the bias decreases, while the variance increases.
The expected test risk is smallest for degree 3 (p = 4).

Regularized Least Squares Estimation (*)

One way to scale effective model complexity is regularization. Consider Tikhonov


regularized linear least squares estimation (Section 7.2), based on functions
y(x) = wT φ(x). Suppose that the true law is given by t = y∗ (x) + ε, where ε is
independent zero-mean noise of variance σ 2 . Moreover, let ŷν (x) = (ŵν )T φ(x)
be the regularized least squares fit to the training sample D, i.e. ŵν is the min-
imizer of (7.1). Note that E[ŷν (x) | x] = E[ŵν ]T φ(x). How does the choice of
the regularization constant ν scale the bias-variance tradeoff? The simplest case
is that the true curve is linear itself, y∗ (x) = wT∗ φ(x). It is well known that the
standard least squares estimator is unbiased:
E[ŵ0 ] = E[ (ΦT Φ)−1 ΦT (Φw∗ + ε) ] = E[ (ΦT Φ)−1 ΦT Φw∗ ] = w∗ .

In this case, the bias vanishes for ν = 0, while being positive in general for
ν > 0. For general curves y∗ (x), we can show that the average bias on the
training sample is nondecreasing in ν:
E[ kt − Φŵν1 k2 ] ≤ E[ kt − Φŵν2 k2 ],   ν1 < ν2 .

This is because for any fixed sample D, the training error is nondecreasing in ν
(Section 7.2). Regularization is a means to decrease variance at the expense of
increasing the bias.

Ensemble Methods (*)

A final example concerns ensembles (or committees) of predictors, a common


technique for variance reduction. Suppose we use a number L of different learn-
ing methods, giving rise to predictors ŷl (x) on a training sample D. For example,
the ŷl (x) could be multi-layer perceptrons of different architecture, initialized
in different ways. Or all ŷl (x) could use the same method, but work on differ-
ent2 subsets of D. If applied to data based on the underlying curve y∗ (x), we
can write ŷl (x) = y∗ (x) + ε̂l (x) for l = 1, . . . , L, where ε̂l (x) denotes the error
committed by the l-th predictor. In the sequel, we concentrate on errors at a
fixed x, but average quantities can always be obtained by a further expectation.
Let
El (x) = E[ (ŷl (x) − y∗ (x))2 | x ] = E[ ε̂l (x)2 | x ].

Given these L methods, we can pick one of them for prediction. Doing so, the
average error is
Eavg (x) = (1/L) Σ_{l=1}^L El (x) = (1/L) Σ_{l=1}^L E[ ε̂l (x)2 | x ].

Instead, suppose we use an ensemble of all of them,


ŷens (x) = (1/L) Σ_{l=1}^L ŷl (x),
2A popular technique in statistics, not discussed in this course, is to use bootstrap resam-
pling: each ŷl (x) is trained on a sample Dl of the same size as D, obtained from D by sampling
with replacement. The rationale behind bootstrap resampling is discussed in [20, ch. 7].

whose error is
Eens (x) = E[ ( L−1 Σ_{l=1}^L (y∗ (x) + ε̂l (x)) − y∗ (x) )2 | x ]
= E[ ( L−1 Σ_{l=1}^L ε̂l (x) )2 | x ].

How do these errors compare? Using the Cauchy-Schwarz inequality (Sec-


tion 2.3):
( Σ_{l=1}^L 1 · ε̂l (x) )2 ≤ L Σ_{l=1}^L ε̂l (x)2 ,

so that Eens (x) ≤ Eavg (x). The ensemble error is never worse than the average
error picking a single predictor. In the best case, it can be much better. Denote
the expected error of each method by Bl (x) = E[ε̂l (x) | x]. Let us assume that
the errors of different methods are uncorrelated: Cov[ε̂l1 (x), ε̂l2 (x) | x] = 0,
l1 ≠ l2 . Then,
Eens (x) = L−2 Σ_{l1,l2} E[ ε̂l1 (x) ε̂l2 (x) | x ]
= L−2 Σ_{l1,l2} ( Cov[ ε̂l1 (x), ε̂l2 (x) | x ] + Bl1 (x) Bl2 (x) )
= L−2 Σ_l Var[ ε̂l (x) | x ] + ( L−1 Σ_l Bl (x) )2 ,
where the first term is the variance and the second the squared bias, while
Eavg (x) = L−1 Σ_l Var[ ε̂l (x) | x ] + L−1 Σ_l Bl (x)2 ,
the average variance plus the average squared bias.

The ensemble method reduces the variance by a factor of L in this case! More-
over, the bias is not increased:
( L−1 Σ_l Bl (x) )2 = L−2 ( Σ_l 1 · Bl (x) )2 ≤ L−1 Σ_l Bl (x)2 ,

using the Cauchy-Schwarz inequality. In most realistic situations, errors are


correlated and gains are somewhat less dramatic. Ensemble methods are widely
used in practice in order to reduce variance. Our analysis provides insight into
how to choose component methods ŷl (·). They should be as diverse as possible,
so to give rise to weakly correlated errors. Moreover, they should have small bias,
since variance is mainly taken care of by the ensemble formation. An obvious
drawback of ensemble methods is the increased computational complexity of the
final predictor.
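The variance reduction can be checked numerically. Below is a small simulation sketch (Python/NumPy, with made-up bias and noise values) in which the L errors ε̂l(x) at a fixed x are independent with common bias and variance; the ensemble error should then be close to Bias2 + Variance/L, while picking a single predictor gives Bias2 + Variance.

import numpy as np

# L predictors whose errors at a fixed x are independent with identical bias
# and variance; compare the ensemble error with the average single-model error.
rng = np.random.default_rng(0)
L, n_trials = 10, 100000
bias, std = 0.2, 1.0
errors = bias + std * rng.standard_normal((n_trials, L))   # eps_l(x), one row per trial

E_avg = np.mean(errors ** 2)                   # error when picking a single predictor
E_ens = np.mean(errors.mean(axis=1) ** 2)      # error of the ensemble average

print(f"E_avg ~ {E_avg:.3f} (theory: bias^2 + var = {bias**2 + std**2:.3f})")
print(f"E_ens ~ {E_ens:.3f} (theory: bias^2 + var/L = {bias**2 + std**2 / L:.3f})")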
Finally, how about bias-variance decompositions for loss functions other than
the squared one? There has been quite some work in this direction, but the
exact additive decomposition of expected risk into squared bias and variance
holds for the squared loss only. Having said that, an increase in bias or variance
does in general imply an increase in expected risk.

10.2 Model Selection


Now that we understand what (expected) test risk is composed of, how can we
estimate it? The main rationale for doing so in machine learning is model selec-
tion. Examples for model selection have been noted in Section 10.1. For simplic-
ity, we will concentrate on selecting the regularization constant ν in Tikhonov-
regularized least squares in this section, but the results are transferable. Recall
that ŷν (x) = (ŵν )T φ(x), where
ŵν = argmin_w (1/n) Σ_{i=1}^n L(wT φ(xi ), ti ) + νkwk2 ,   L(y, t) = (y − t)2 ,
moreover
R̂n (ŷν |D) = (1/n) Σ_{i=1}^n L(ŷν (xi ), ti ),   R(ŷν |D) = E[ L(ŷν (x), t) | D ].

Our goal is to select a value of ν giving rise to small test risk R(ŷν |D) or small
expected test risk E[R(ŷν |D)].
As noted in Section 10.1.1, we can estimate R(ŷν |D) using a validation dataset
Dvalid of size nvalid , independent of the training dataset D. Since ŷν (·) is inde-
pendent of Dvalid , we have that
R(ŷν |D) = E[ L(ŷν (x), t) | D ] ≈ (1/nvalid) Σ_{(x̃j ,t̃j )∈Dvalid} L(ŷν (x̃j ), t̃j )

by the law of large numbers. We can now minimize the estimator on the right
hand side w.r.t. ν. This is typically done by evaluating the estimator over a
candidate set of values for ν, then picking the minimizer. The problem with this
approach is that Dvalid cannot be used for training. In the remainder of this
section, we discuss a technique for model selection on the training dataset only.
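Before turning to that, here is a minimal sketch of the hold-out grid search just described, for Tikhonov-regularized least squares (Python/NumPy). The helper names fit_ridge and holdout_select, as well as the candidate grid, are illustrative and not part of the text.

import numpy as np

def fit_ridge(Phi, t, nu):
    """Minimize (1/n) ||Phi w - t||^2 + nu ||w||^2 via the normal equations."""
    n, p = Phi.shape
    A = Phi.T @ Phi / n + nu * np.eye(p)
    return np.linalg.solve(A, Phi.T @ t / n)

def holdout_select(Phi_train, t_train, Phi_valid, t_valid, nu_grid):
    """Pick the nu whose validation risk (mean squared error) is smallest."""
    risks = []
    for nu in nu_grid:
        w = fit_ridge(Phi_train, t_train, nu)
        risks.append(np.mean((Phi_valid @ w - t_valid) ** 2))
    best = int(np.argmin(risks))
    return nu_grid[best], risks

A typical candidate set would be a logarithmically spaced grid, for instance np.logspace(-6, 1, 20).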

10.2.1 Cross-Validation
Given some dataset D of size n, we would like to split it into a training and a
validation part of 4/5 and 1/5 the total size respectively. Let us partition the
complete set randomly into 5 equisized parts, D1 , . . . , D5 , then use parts 2–5,
D−1 = D \ D1 = ∪_{k=2,...,5} Dk ,

as training dataset and part 1, D1 , as validation dataset. More specifically, we


train ŷν (·) on D−1 , then estimate the test risk using D1 . Denote this test risk
estimator by R̂(−1) (D) (test your understanding by working out its expression).
But what is special about D1 for validation, why not D2 ? That results in an-
other estimator R̂(−2) (D). Given how we got here, it is clear that the different
estimators R̂(−m) (D), m = 1, . . . , 5, have the same distribution. Their expected
value is equal to the expected test risk E[R(ŷν |D′)] for training samples D′ of
size 4n/5. We might as well use all of them in a round-robin fashion:
(1/5) Σ_{m=1}^5 R̂(−m) (D) ≈ E[R(ŷν |D′)],   |D′| = 4n/5.

This is the 5-fold cross-validation (CV) estimator. Note how each case (xi , ti )
in D is used once for validation and four times for training. Defining the CV
estimator requires some notation, which can be confusing. You should be guided
by the idea, which is simple indeed:

• Split the available training dataset D into M equisized3 parts (or folds)
Dm , which do not overlap. In the previous example, M = 5.
• For m = 1, . . . , M : Train on D−m = D \ Dm , then evaluate the validation
risk on Dm . Importantly, Dm was not used for training.
• Average the M validation risk values obtained in this round-robin way.

The M -fold cross-validation (CV) estimator is formally defined as follows. For


simplicity, assume that n = |D| is a multiple of M ≥ 2. Partition the dataset D
into M parts of size n/M each. Define m(i) ∈ {1, . . . , M } so that (xi , ti ) ∈ Dm(i) .
We need to train M predictors, one on each D−m :

ŷν^(−m) (·) = (ŵν^(−m) )T φ(·),
ŵν^(−m) = argmin_w (M/((M − 1)n)) Σ_{i: m(i)≠m} L(wT φ(xi ), ti ) + νkwk2 .

Then, the M -fold cross-validation (CV) estimator is


R̂CV^(M) (D) = (1/n) Σ_{i=1}^n L( ŷν^(−m(i)) (xi ), ti ) = (1/M) Σ_{m=1}^M (M/n) Σ_{i: m(i)=m} L( ŷν^(−m) (xi ), ti ),
where the inner sum for each m is exactly R̂(−m) (D).
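In code, the round-robin procedure is short. The sketch below (Python/NumPy) assumes the squared loss and a generic fit routine returning a weight vector, for instance the regularized least squares solver from the earlier sketch with a fixed candidate ν; the fold assignment by random permutation and the helper name cv_risk are illustrative.

import numpy as np

def cv_risk(Phi, t, fit, M=5, seed=0):
    """M-fold cross-validation estimate of the test risk under squared loss.

    fit: callable (Phi_train, t_train) -> weight vector w, e.g. a regularized
    least squares solver for a fixed candidate value of nu."""
    n = Phi.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), M)    # the partition D_1, ..., D_M
    fold_risks = []
    for m in range(M):
        valid = folds[m]
        train = np.concatenate([folds[k] for k in range(M) if k != m])
        w = fit(Phi[train], t[train])                # trained on D_{-m} only
        fold_risks.append(np.mean((Phi[valid] @ w - t[valid]) ** 2))
    return np.mean(fold_risks), np.std(fold_risks)   # CV score and spread over folds

For instance, cv_risk(Phi, t, lambda P, y: fit_ridge(P, y, nu)) evaluates a single candidate ν; the spread over folds will be useful below when judging the variability of the CV score.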

We can choose M between 2 and n. If M = n, then Dm = {(xm , tm )}, so that


we leave out single cases for the n validations. This particular case is called
leave-one-out cross-validation. We have that
E[ R̂CV^(M) (D) ] = E[ R(ŷν |D′) ],   |D′| = (M − 1)n/M,

where D′ is an i.i.d. sample of size (M − 1)n/M from the true underlying dis-
tribution. For not too small M , R̂CV^(M) (D) can be used as an approximate estimator
of the expected test risk E[R(ŷν |D)] for the full sample. In Figure 10.4, we com-
pare the expected test risk with both 5-fold and leave-one-out CV estimators
on the polynomial curve fitting task used already above (we use a sample size
of 50 here, since cross-validation is unreliable for small datasets). While CV
somewhat overestimates the test risk for too small and too large degrees, its
minimum points coincide with those of the expected test risk in this example.
Notice how it errs on the conservative side, and how its variance (over samples)
grows sharply with polynomial degree.
The curious reader may wonder how we computed the leave-one-out cross-
validation estimator in Figure 10.4. Do we have to run least squares regression
50 times on samples of size 49? And what if n = 1000 or a million? In this light,
3 In practice, the folds can be of slightly different size, which does not change the general

procedure.

Figure 10.4: Cross-validation estimators versus expected test risk for polynomial
curve fitting task. 100 samples of size 50 each were drawn from y∗ (x) = sin(2πx)
plus N (0, 1/16) noise, where xi are drawn uniformly from [0, 1]. Each sample
includes 0 and 1 among the xi , to avoid high variance due to extrapolation.
Shown are expected test risk (green) and its cross-validation estimate (mean
(solid) and standard deviation (dashed) over 100 samples), as function of poly-
nomial degree. Left: 5-fold cross-validation (test batches of size 10). Right: 50-
fold leave-one-out cross-validation.

leave-one-out CV seems more of an academic exercise. Fortunately, it is possible


to evaluate this estimator much more efficiently, at least for (kernelized) linear
methods, as is shown in Section 10.2.2.
How shall we choose M ? Again, there is no general best recipe. Let us gain in-
sight by applying the bias-variance concept to cross-validation estimators them-
selves. For the largest choice M = n (leave-one-out), the expected value of
(n)
R̂CV (D) is closest to the expected test risk for our sample size: leave-one-out
(M )
exhibits the smallest bias. The bias of R̂CV (D) for small M can be problematic
with kernelized estimators, where the optimal value of ν is a decreasing function
of the training set size n: cross-validation may regularize more than needed in
such cases. On the other hand, the variance of leave-one-out CV can be large,
since the individual “training subsets” D−m are strongly overlapping, so that
the ŷν^(−m) are highly dependent. Finally, leave-one-out cross-validation can be prob-
lematic for computational4 reasons, since we have to train n different predictors
on samples D−m as large as D itself. For kernelized estimators, efficient approx-
imations to leave-one-out have been developed in statistics [20, sect. 7.10.1]. In
practice, the choices M = 5 and M = 10 are frequently used.
We stress again that cross-validation, whatever the choice of M , is not a fool-
proof method for model selection, but should rather be viewed as a convenient
heuristic which often works well and is very widely used in machine learning.
The theoretical analysis of cross-validation is somewhat intricate and out of
scope of this course. In the remainder of this section, we give some hints about
using CV in practice, mainly based on [20, ch. 7]. First, the variance of R̂CV^(M) (D)
can be large, in particular for large M , and if the expected curve (as function
4 It is sometimes implied that leave-one-out CV is superior to other choices of M (you
pay more, you get more). This is not true in general. Independent of computational issues, a
bias-variance tradeoff has to be faced.

of ν) is rather flat, the CV minimizer for ν may be determined mainly by the


variance. We can get an idea about the variance by computing the standard
deviation between the different parts R̂(−m) (D), m = 1, . . . , M , and it is rec-
ommended to always determine this standard error along with CV scores. A
rule of thumb for choosing ν is to look at both the CV score curve and stan-
dard errors. Suppose the CV score is minimized at ν∗ , attaining score CV(ν∗ )
and standard error std(ν∗ ). We pick the largest ν (simplest model) such that
CV(ν) ≤ CV(ν∗ ) + std(ν∗ ).
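A sketch of this rule of thumb, assuming CV scores and their standard errors have already been computed on a grid of candidate ν values (for instance with the cv_risk sketch above); the function name is illustrative.

import numpy as np

def one_std_error_rule(nu_grid, cv_scores, cv_stds):
    """Pick the largest nu (simplest model) with CV(nu) <= CV(nu*) + std(nu*)."""
    nu_grid = np.asarray(nu_grid)
    cv_scores = np.asarray(cv_scores)
    best = int(np.argmin(cv_scores))              # minimizer nu*
    threshold = cv_scores[best] + cv_stds[best]
    admissible = nu_grid[cv_scores <= threshold]
    return admissible.max()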
It is important to use cross-validation correctly in practice. In real-world ap-
plications, it is customary (and probably unavoidable) to mix many different
machine learning and preprocessing techniques, often in an ad-hoc fashion. A
common but erroneous way to use CV goes as follows (a worked-out example
is given in [20, sect. 7.10.2]). Suppose our problem is noisy and comes with
high-dimensional input points x. First, as part of the preprocessing, we screen
a large number of potential predictors on the training sample D, retaining only
those in the feature map φ(·) which exhibit a substantial correlation with the
target t. Second, we use CV in order to select the regularization parameter ν
and to estimate the expected test risk. Used in this way, CV tends to underesti-
mate the test risk dramatically, and we run into over-fitting without a warning.
What is wrong? Cross-validation has to be applied in the outermost loop of the
whole procedure, including the preprocessing if this makes use of the training
set labels5 {ti }. The first thing we do is splitting the data into Dm , then we run
preprocessing and training on the different D−m separately.

Other Model Selection Techniques (*)

Cross-validation is maybe the most widely used model selection technique in


machine learning, but it is not the only one. First, there is a range of simpler
alternatives from statistics. The general idea is to realize that the expected
risk E[R(ŷν |D)] is typically underestimated by the training risk R̂n (ŷν ) by an
amount which decreases with the sample size n, but increases with the model
complexity (governed by ν in our running example). Under some more or less
plausible arguments, this “complexity term” can be represented by a tractable
expression. Implementations of this idea are AIC, BIC or minimum description
length (MDL). Compared to CV, these are typically more efficient to use, but
can be less reliable, in particular for complex nonlinear models. In learning
theory, concentration inequalities and Vapnik-Chervonenkis arguments are used
to bound E[R(ŷν |D)] in terms of R̂n (ŷν ) and complexity terms, but these bounds
are typically too loose to be useful in practice.
A general problem with hold out and cross-validation estimation is that only
a very small number of parameters can be selected. For models with many
hyperparameters, Bayesian model selection techniques can be far more useful in
practice, and they have found widespread use in machine learning. For example,
they led to many advances for multi-layer perceptrons [28]. The interested reader
is encouraged to study [5, sect. 3.3, sect. 3.4]. Recall that most of the trouble with
5 Preprocessing involving the input points {xi } only can be done up front, before CV, since
this does not reveal information about x 7→ t which would give preselected predictors an
opportunity for over-fitting.

over-fitting comes from the fact that weights w are fitted to data D in the first
place. In maximum likelihood and MAP estimation, w is simply fitted to the data.
If we instead treat w as a random variable, the sum rule of probability (Chapter 5)
tells us to marginalize over w in order to robustly estimate hyperparameters
such as ν from data. Following
this argument, we should maximize the log marginal likelihood
log p(D|ν) = log ∫ p(D|w, ν) p(w|ν) dw

in order to choose a hyperparameter ν. Marginal likelihood functions play a


pivotal role for unsupervised learning (Chapter 12). While their computation
is often intractable, it can be approximated in a number of ways [5, ch. 10].
The particularly simple Laplace approximation of the log marginal likelihood
gives rise to the evidence framework [5, sect. 3.4], [28], which can be seen as
generalization of BIC and MDL to general nonlinear models such as MLPs.

10.2.2 Leave-One-Out Cross-Validation (*)


For a dataset of size n, a naive way to compute the leave-one-out cross-validation
(LOO CV) estimator is to recompute ŷ (−m) (·) from scratch for each subset, at
about n times the cost of a single linear regression. Under such circumstances,
LOO CV would not be a tractable option in practice. In this section, we show
how the LOO CV can be computed much more efficiently for a linear regression
model, effectively at the cost of one linear regression.
Denote the design matrix by Φ, the target vector by t. The weights w for the
full dataset are given by the normal equations:
Aw = ΦT t = Σ_j tj φj ,   A = ΦT Φ = Σ_j φj φTj .

Suppose we leave out pattern (φi , ti ). The normal equations for ŷ (−i) =
(w(−i) )T φ(·) are
 
(A − φi φTi ) w(−i) = ΦT t − ti φi = Aw − ti φi .

Rearranging this equation, we see that A(w(−i) − w) = −αi φi for some αi ∈ R,


or w(−i) = w − αi A−1 φi . Plugging this ansatz into the LOO normal equations:
 
(A − φi φTi ) (w − αi A−1 φi ) = ΦT t − αi φi − (φTi w)φi + αi (φTi A−1 φi )φi ,
which we require to equal ΦT t − ti φi .

Rearranging this equation gives
αi (1 − φTi A−1 φi ) = ti − φTi w = ri ,   r = t − Φw.

The LOO CV estimator is the average of the squares of the LOO residuals
ti − φTi w(−i) = ri + αi φTi A−1 φi = ri + αi (φTi A−1 φi − 1) + αi = αi ,

so that
R̂CV^(n) (D) = (1/n) Σ_{i=1}^n αi2 .
How do we compute the αi efficiently? We need the residuals r = t −Φw, more-
over the vector [φTi A−1 φi ]. The latter is the diagonal of the matrix ΦA−1 ΦT .
Recall how the least squares problem can be solved by the QR decomposition
(Section 4.2.3):
Φ = QR   ⇒   w = A−1 ΦT t = R−1 QT t.

Since t is arbitrary in this equation, we have that A−1 ΦT = R−1 QT , therefore


 
Φ (A−1 ΦT ) = QRR−1 QT = QQT ,

a result we derived previously in Section 4.2.3. Therefore, if Q = [q k ], then

[φTi A−1 φi ] = diag( ΦA−1 ΦT ) = Σ_{k=1}^d (q k )2 ,

where (q k )2 denotes the pointwise multiplication of q k with itself. Given that


w is determined by a QR decomposition, the additional cost of LOO cross-
validation is negligible.
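A minimal sketch of this computation (Python/NumPy), together with a naive recomputation as a sanity check on a tiny problem; the helper name and the test data are illustrative, and no regularization (ν = 0) is assumed, as in the derivation above.

import numpy as np

def loo_cv_residuals(Phi, t):
    """Leave-one-out residuals alpha_i for unregularized least squares,
    computed from a single QR decomposition of the design matrix Phi."""
    Q, R = np.linalg.qr(Phi)                  # Phi = Q R, Q has orthonormal columns
    w = np.linalg.solve(R, Q.T @ t)           # full-data least squares weights
    r = t - Phi @ w                           # ordinary residuals r_i
    h = np.sum(Q ** 2, axis=1)                # h_i = phi_i^T A^{-1} phi_i = diag(Q Q^T)
    alpha = r / (1.0 - h)                     # LOO residuals
    return alpha, np.mean(alpha ** 2)         # residuals and LOO CV estimator

# Sanity check against naive recomputation on a tiny problem:
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 3))
t = rng.standard_normal(20)
alpha, loo = loo_cv_residuals(Phi, t)
naive = []
for i in range(20):
    keep = np.arange(20) != i
    w_i, *_ = np.linalg.lstsq(Phi[keep], t[keep], rcond=None)
    naive.append(t[i] - Phi[i] @ w_i)
print(np.allclose(alpha, naive))              # should print True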
Chapter 11

Dimensionality Reduction

In many, if not most real-world machine learning problems, data cases are most
naturally represented in terms of very high-dimensional vectors. Our canon-
ical handwritten digits recognition problem is rather on the low end in this
respect. High-resolution images have millions of pixels. In speech recognition,
audio waveforms are represented by a large number of windowed Fourier trans-
form features. Text documents are commonly represented as count vectors w.r.t.
some dictionary, which for realistic corpora contains hundreds of thousands of
words. If learning theory tells us one thing, it is that learning in such huge-
dimensional spaces is impossible in general. A practical way out of this dilemma
is to reduce the dimensionality of attributes by some form of feature mapping.
In this chapter, we will get to know some of the most widely used linear
dimensionality reduction techniques. Principal components analysis is among
the most fundamental of all techniques used in machine learning, and we will
cover it in some detail. Linear dimensionality reduction techniques are typically
based on the eigendecomposition of certain sample covariance matrices, and
this foundation will be studied in detail. Other more advanced machine learn-
ing techniques share the same mathematical foundations: spectral clustering,
manifold learning, multi-dimensional scaling and metric embedding techniques
for visualization.

11.1 Principal Components Analysis

In this section, we discuss the most important dimensionality reduction tech-


nique in machine learning and statistics: principal components analysis (PCA).
In Section 11.1.1, we will arrive at PCA along three seemingly different routes.
We gain intuition about PCA by applying it to handwritten digits data. The
technique behind PCA, eigendecomposition, is reviewed in Section 11.1.2. In
Section 11.1.3, we show how to compute PCA in practice for datasets of mod-
erate size. Companies like your favourite web search engine routinely compute
PCA directions for datasets of astronomical size. We give some hints into how
this is done in Section 11.1.4.


Many machine learning problems are characterized by input points living in very
high-dimensional spaces. For example, the MNIST handwritten digit bitmaps
are of size 28 × 28, so can be seen as vectors in R784 . In bioinformatics, gene
microarray measurements can easily give rise to input space dimensionalities of
many thousands. It is important to understand that the underlying “effective”
dimensionality of what is predictively relevant about a signal is typically much
smaller. After all, it is impossible to do meaningful statistics on general distri-
butions in hundreds or thousands of dimensions, as there is simply never enough
training data to explore the vast majority of dependencies between attributes.
This observation is sometimes termed “curse of dimensionality”, but let us be
precise. The general impossibility of statistics in R1000 is a fact, there is noth-
ing to lament about it. If a problem cannot be represented using much fewer
unknowns, it is out of scope and does not deserve1 our attention. Fortunately,
many real-world problems are amenable to useful low-dimensional modelling.
The curse is that we have to find these adequate low-dimensional representa-
tions, and we have to do so in an automated data-driven way. Given we suc-
ceed, we will have found a probabilistic low-dimensional representation of the
high-dimensional input variable, which conveys enough of the latter’s predictive
information: this is dimensionality reduction. In general, relevant parameters
may be nonlinearly related to input points, and this process of “finding needles
in haystacks” can be very challenging. In this chapter, we will concentrate on
linear dimensionality reduction.
Consider our running MNIST 8s versus 9s classification problem from Chapter 2.
Suppose we want to apply a maximum likelihood plug-in rule with Gaussian
class-conditional distributions p(x|t) = N (µt , Σt ), t = −1, +1, and x ∈ Rd ,
where d = 784. How should we parameterize the covariances Σt ? A simple
choice would be Σt = I, leaving us only with the means µt to be estimated.
However, the variances of pixels are clearly different across the bitmap. Pixels
at the boundary almost always have values close to zero, the variance there is
much smaller than in the center. Diagonal covariances Σt would capture differ-
ent variances at a modest increase in the number of parameters to be estimated.
Still, these imply pairwise conditional independence of individual pixels, an as-
sumption which is clearly violated in our data. Digits are composed of smooth
pen strokes, which induce substantial correlations between neighbouring pixel
values. Full matrices Σt then? Unfortunately, this needs d(d + 1)/2 for Σt , so
d(d + 3)/2 = 308504 parameters for each class. We cannot reliably estimate this
many parameters from our limited training data.
Covariances between pixels matter, but maybe not all of them do. Here is
the idea behind linear dimensionality reduction. Instead of modelling the high-
dimensional input variable x ∈ Rd , we map it to z = U T x and model z ∈ RM
instead. Since M is chosen much smaller than d, we can afford full covariance
matrices for the class-conditionals on z. This is of course a special case of a
linear feature map, z = φ(x) = U T x. The matrix U ∈ Rd×M determines the
dimensionality reduction. It has to be learned from training data for the whole
idea to work well. In this section, we concentrate on learning U in order to
represent an input point distribution per se, ignoring class label information
even if such is available (this is a simple example of unsupervised learning, see
1 Unless a financial analysis firm pays you lots of money for it.

Chapter 12). We will take several different routes and end up at the same idea:
principal components analysis.
We will adopt the following conventions. First, we will consider zero-mean dis-
tributions on x only. Given we work with data, we first compute the sample
mean µ̂ = n−1 Σ_i xi and subtract it off. Without this preprocessing, the fea-
ture map would be U T (x − µ̂). Second, we will restrict U to have orthonormal
columns: U T U = I ∈ RM ×M . In other words, z = U T x is the “encoding” part
of an orthogonal projection (Section 4.2.2). To justify this restriction, notice
that we can decompose any full-rank matrix in Rd×M as product of U with
orthonormal columns and an invertible RM ×M matrix (QR decomposition, Sec-
tion 4.2.3), and we can absorb the latter into the definition of z at no loss.
What is a good projection U for dimensionality reduction? Let us collect some
general criteria of merit.

• Retaining covariance: A distinct property of the distribution of x is its


covariance Cov[x]. As data is typically noisy, directions of small covariance
are most likely due to random errors, while the signal is often shaped by
directions of large covariance. The covariance of z = U T x is Cov[z] =
U T Cov[x]U (Section 5.1.3), and it seems sensible to choose U so as to
maximize this covariance.
• Minimizing reconstruction error of orthogonal projection: We can under-
stand dimensionality reduction as an autoencoding process. First, we en-
code x by the coefficients z = U T x. Second, we reconstruct it as

x̂ = U z = U U T x.

Recall from Section 4.2.2 that U U T defines an orthogonal projection x 7→


x̂ onto the M -dimensional linear subspace U RM of Rd . In order to retain
as much information in x̂ about x as possible, we should choose U so to
minimize the squared reconstruction error E[kx̂ − xk2 ].
• Removing linear redundancies: It is often the case with high-dimensional
variables x that several of the components are highly correlated or anticor-
related: Cov[xj , xk ] ≈ ±√(Var[xj ] Var[xk ]) (Section 6.3). This means that
xj and xk are approximately linear functions of each other. For example,
in natural images the intensities of neighbouring pixels are often highly
correlated. Modelling such components is wasteful. Given one of them,
the other does not convey much additional information. We should aim
for a decorrelating transformation U , in that the components of z = U T x
should be uncorrelated.

At first sight, these requirements seem to have little to do with each other.
However, we will see that they are closely related. We can optimally satisfy all
three of them by principal components analysis.

11.1.1 Three Ways to Principal Components Analysis


Let us begin by searching for a single direction u ∈ Rd (case M = 1, so that
z = uT x ∈ R), where kuk = 1. Consider the data in Figure 11.1, and compare

Figure 11.1: Illustration of principal components direction for data (blue dots) in
R2 . The first PCA direction minimizes the squared reconstruction error between
points and their projections (green dots) onto the PCA subspace. Squared error
terms are visualized as area of the red squares.

the two different directions u. Since kuk = 1, zi = uT xi is the signed distance


from zero of the projection onto uR. The reconstruction errors are the distances
kxi −zi uk. For the direction on the right, values of zi are larger than on the left,
giving rise to a larger sample variance of z. At the same time, the reconstruction
errors are smaller on average. The total variance of the data is decomposed into
variance along uR (therefore, variance of z) and variance orthogonal to the
direction (squared reconstruction error).
To fully understand this, let us abstract from datasets and assume that we know
the distribution of x (recall that we have E[x] = 0). The problem of minimizing
squared reconstruction error translates into
min_{u: kuk=1} E[ min_z kx − zuk2 ].

For each x, inside the expectation, we can choose the encoding coefficient z so
to minimize kx − zuk2 . The direction u should work well on average over x,
so the minimization is outside. We already know that z is chosen by way of
orthogonal projection (Section 4.2.2), but let us derive it again:
(∂/∂z) kx − zuk2 = 2(zu − x)T u = 2(z − xT u) = 0   ⇔   z = uT x,
using uT u = 1.
Plugging this minimizer in:
E[ kx − (uT x)uk2 ] = E[ kxk2 − 2(uT x)2 + (uT x)2 uT u ]
= E[ kxk2 − (uT x)2 ] = E[ kxk2 ] − uT Cov[x]u.
We used that uT u = 1 and
E[ (uT x)2 ] = uT E[xxT ]u = uT Cov[x]u.

As the first term does not depend on u, our optimal direction is given by
u∗ = argmax_{u: kuk=1} uT Cov[x]u.

This is also the direction for which z = uT x has maximum variance, since

Var[z] = E[ (uT x)2 ] = uT Cov[x]u

(recall that E[x] = 0, so that E[z] = 0). The solution u∗ is called first principal
components direction, and the corresponding z∗ = uT∗ x is called first principal
component2 of the random variable x.
Importantly, this definition depends on the covariance of x only. In practice,
this means that all we need to estimate of x are first and second order moments
(mean and covariance). In fact, our argument amounts to a decomposition of
the total amount of covariance:
E[kxk2 ] = tr E[xT x] = tr E[xxT ] = tr Cov[x]
= Var[uT x] + E[ kx − (uT x)uk2 ] (variance explained plus squared reconstruction error),

where the “amount of covariance” is measured as tr Cov[x].


The first principal components direction u∗ is a leading eigendirection of the
symmetric positive definite matrix Cov[x], an eigenvector corresponding to the
maximum eigenvalue of Cov[x]. Refresh your memory on all things eigen in
Section 11.1.2 (no harm to jump there now). Namely, there is some λ∗ > 0 such
that
Cov[x]u∗ = λ∗ u∗ .
Moreover, λ∗ is the largest scalar3 with this property. We establish this fact at
the end of Section 11.1.2.

Multiple Directions

What about more than one direction? Denote the first principal components
direction by u1 , how shall we choose a second one, u2 ? This choice has to be
constrained, otherwise we would simply choose u1 again. By our convention,
u2 is orthogonal to u1 . We will see in a moment that this is particularly well
suited to principal components. We can now define u2 in the same way as above,
subject to uT2 u1 = 0:

u2 = argmax_{ku2 k=1, uT2 u1 =0} uT2 Cov[x]u2 .        (11.1)

This is the second principal components direction for x, and it once more cor-
responds to an eigendirection of Cov[x], corresponding4 to the second-largest
eigenvalue (see Section 11.1.2). Note that eigendirections of a symmetric ma-
trix corresponding to different eigenvalues are necessarily orthogonal (Sec-
tion 11.1.2), which in hindsight motivates our convention on the columns of
U.
2 To be precise, u∗ and z∗ are population principal components, while in practice we esti-
mate them from data through the sample covariance matrix, as is detailed below.
3 In general, eigenvalues of real-valued matrices can be complex-valued, but this does not

happen for symmetric matrices (Section 11.1.2).


4 An exception is if Cov[x] has a multiple largest eigenvalue λ∗ . In this case, both u1 and u2
are eigenvectors corresponding to λ∗ . For real data, multiple eigenvalues are rarely observed.

More generally, the k-th principal components direction uk can be defined re-
cursively via a variant of (11.1), where the constraints read uTk uj = 0 for all
j < k. The main point is this. All principal components (PC) directions are
eigendirections of the covariance matrix Cov[x]. Their ordering follows the or-
dering of the eigenvalues: first PC for largest eigenvalue, and so on. Selecting the
leading M eigendirections, corresponding to the M largest eigenvalues (counting
multiplicities), to form a dimensionality reduction matrix U ∈ Rd×M is called
principal components analysis (PCA).

Examples of Principal Components Analysis

PCA is ubiquitous. It is hard to find any work in applied machine learning,


statistics or data analysis without stumbling across the technique. Given any
sort of multivariate data, the first thing to do is to plot the first few principal
components. PCA is the most basic visualization technique for high-dimensional
data. In applications, whenever people are interested in X, an eigen-X method
has usually been proposed long ago. Examples include eigenfaces, eigendigits
(below), or latent semantic analysis for text documents. You will have guessed
it by now: if we do not know Cov[x] exactly, but have access to some data
D = {x1 , . . . , xn }, we estimate PCA by the eigendecomposition of the sample
covariance matrix
Σ̂ = (1/n) Σ_{i=1}^n xi xTi ,   Σ_{i=1}^n xi = 0.        (11.2)

(Figure 11.2 panels: Mean; λ1 = 5.69; λ2 = 3.64; λ3 = 2.94; λ4 = 2.58)

Figure 11.2: Mean and first four principal components directions for MNIST 8s
(subset of size 200, drawn at random from the training database). For the latter,
positive values are shown in red, negative values in blue. The first direction
corresponds to rotations, the second to changes in scale.

In Figure 11.2, we show means and first few principal components directions
for a set of 8s from the MNIST database. Using this example, we can test
the autoencoder property of PCA. In Figure 11.3, we show reconstructions x̂
for digits x, using a growing number of PC directions. Notice how scale and
orientation are shaped before finer details.
We can arrive at a non-recursive definition of PCA for general M < d and
gain additional intuition by developing the minimum reconstruction error and
variance decomposition perspective in the general case. The optimal choice for
min_{r∈RM} kx − U rk2 is r = U T x (why?), and
E[ kx − U U T xk2 ] = E[ kxk2 ] − E[ kU T xk2 ],

(Figure 11.3 panels: Original; M = 1; M = 3; M = 5; M = 10; M = 30)

Figure 11.3: PCA reconstructions of MNIST digit (upper left) from the mean
and a growing number M of PCA directions.

where we used U T U = I. Therefore,


E[ kxk2 ] = tr Cov[x] = tr U T Cov[x]U + E[ kx − U U T xk2 ]
(variance explained plus squared error),
where tr U T Cov[x]U = tr Cov[U T x]. The complete PCA transformation can
therefore be described as solution of
U ∗ = argmax_{U T U =I} tr U T Cov[x]U .        (11.3)

It is noted in Section 11.1.2 that the columns of U ∗ are given by the M leading
eigendirections5 , and the maximal value of the trace is the sum of the M largest
eigenvalues (counting multiplicities).

Whitening and Estimation

What about the third requirement of removing linear redundancies in x by


producing decorrelated encoding coefficients z? It turns out that using PCA,
we get this property for free. Namely, suppose that
U = [u1 , . . . , uM ], Cov[x]uj = λj uj , λ1 ≥ λ2 ≥ · · · ≥ λM .
A compact way of writing this is Cov[x]U = U Λ, where Λ = diag[λj ] ∈ RM ×M
is a diagonal matrix of eigenvalues. The covariance of z = U T x is known from
Section 5.1.3 to be
Cov[U T x] = U T Cov[x]U = U T U Λ = Λ.
5 More specifically, U ∗ is defined only up to a right-multiplication with an orthonormal
RM ×M matrix. It is however customary to order the columns as if they came out of the
recursive procedure.

This means that the components zj of z are uncorrelated: Cov[zj , zk ] =


λj I{k=j} . This provides us with a second perspective on what PCA is doing.
In fact, PCA preprocessing is often taken a step further in what is known as
whitening. For moderate d, this may be done with M = d, exploiting the decor-
relation property of PCA only. Namely, we choose the feature transformation
matrix V = U Λ−1/2 , where Λ−1/2 = diag[1/√λj ]. Then, the features V T x are
not only uncorrelated, but have unit variances as well:
Cov[V T x] = Λ−1/2 U T Cov[x]U Λ−1/2 = Λ−1/2 ΛΛ−1/2 = I.
Why would we whiten our data? It is often the case with multivariate vari-
ables that components are scaled differently. For example, they might be mea-
sured using different units. Also, as noted already, there might be substantial
(anti)correlations between some of them. On the other hand, simple models of-
ten assume that components are (conditionally) independent, in order to save
parameters and gain robustness. For example, linear classification can be seen
as generative model with spherical Gaussian class-conditionals (Section 6.4.1).
Whitening ensures that such simplifying assumptions are met at least up to
second order (mean, covariance). Moreover, simple optimization methods like
gradient descent (Chapter 2) can be slow to converge in the presence of cor-
related and scaled variables. Whitening is easy to do and tends to pay off in
reduced nonlinear optimization time. It is particularly useful in the context of
multi-layer perceptron training (Section 3.4.1). Some care has to be taken with
whitening in practice. Even if our goal is M = d (no dimensionality reduction),
we should discard directions corresponding to eigenvalues very close to zero.
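A minimal sketch of whitening via the eigendecomposition of the sample covariance matrix (Python/NumPy); the threshold eps for discarding near-zero eigenvalues and the toy data are illustrative choices.

import numpy as np

def whiten(X, eps=1e-10):
    """PCA whitening of a data matrix X with rows x_i (assumed centered).

    Returns the whitened features Z = X V with V = U Lambda^{-1/2}, where
    directions with eigenvalues below eps are discarded."""
    n = X.shape[0]
    Sigma = X.T @ X / n                              # sample covariance
    lam, U = np.linalg.eigh(Sigma)                   # eigenvalues in ascending order
    keep = lam > eps                                 # drop near-zero directions
    lam, U = lam[keep][::-1], U[:, keep][:, ::-1]    # reorder descending
    V = U / np.sqrt(lam)                             # V = U Lambda^{-1/2}
    return X @ V, V

# The whitened features have (approximately) identity sample covariance:
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0, 0], [0.5, 1.0, 0], [0, 0, 0.1]])
X -= X.mean(axis=0)
Z, V = whiten(X)
print(np.round(Z.T @ Z / Z.shape[0], 3))             # close to the identity matrix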
A final comment relates to estimating PCA from data, replacing the population
covariance Cov[x] by the sample covariance matrix Σ̂ ((11.2)). For large d, it
might well happen that d2 > n, where n is the dataset size, and we should
expect to run into over-fitting problems (Section 7.1). Will we have to regu-
larize our estimator of Cov[x]? The answer is no, at least as long as M ≪ n.
Dimensionality reduction by way of PCA has a special form of regularization
built in. We have already seen in Section 7.2 that the smallest eigenvalues of
a covariance matrix are most troublesome. First, an eigenvalue close to zero
implies a hard constraint on the variable. Second, the smallest eigenvalues are
often strongly underestimated in Σ̂, up to being numerically equal to zero. In
the context of least squares estimation, Tikhonov regularization has the effect
of lifting small eigenvalues to get them out of the danger zone (Section 7.2.2). In
contrast, PCA simply discards the smallest eigenvalue directions: regularization
by clipping (or thresholding) rather than by lifting. PCA concentrates on the
dominating eigendirections, which are also most reliably6 estimated from noisy
data.

11.1.2 Techniques: Eigendecomposition. Rayleigh-Ritz


Characterization
A large number of multivariate statistics and machine learning techniques are
based on eigendecompositions of symmetric matrices, or more general singular
6 One can show that the largest eigenvalues of Cov[x] are overestimated by PCA on Σ, so

the dominating directions are somewhat strengthened by PCA.



value decompositions of arbitrary matrices. Beyond the brief reminder here,


make sure to read [42, ch. 6], where you will find examples, applications and
geometrical intuition. How does all this relate to PCA? Here is a roadmap
through this section:

• Every square matrix A has a decomposition (11.4) in terms of eigenvec-


tors and eigenvalues.

• For symmetric A, eigenvalues are real numbers, and eigenvectors can be


chosen orthonormal (11.5). For positive definite A, all eigenvalues are
positive.

• We can apply the eigendecomposition to the (positive definite) covariance


matrix Cov[x] in order to establish that the PCA directions correspond
to orthonormal eigenvectors for the largest eigenvalues of Cov[x].

Recall that a square matrix A ∈ Rd×d is a linear transform of Rd into Rd . Its


eigendecomposition characterizes all properties of this transform in a most useful
manner. It turns out that there are certain characteristic directions v ∈ Cd ,
v 6= 0, which are mapped back by A onto itself, except for scaling:

Av = λv, λ ∈ C.

Both v and λ can be complex-valued in general (we will show below that these
quantities are real-valued for symmetric matrices). v is called an eigenvector, λ
an eigenvalue of A, (v, λ) is an eigenvector-value pair. If (v, λ) is such a pair,
so is (αv, λ) for any α ∈ C \ {0}, which is why we refer to v as eigendirection
as well. Importantly, we can find a basis of Cd , u1 , . . . , ud , linearly independent,
so that each uj is an eigenvector:

Auj = λj uj , j = 1, . . . , d.

Since we know all about A if we know what is does to a complete basis (Auj for
all j), this information determines A entirely. If U = [u1 , . . . , ud ], Λ = diag[λj ],
then
AU = U Λ ⇒ A = U ΛU −1 . (11.4)
This representation of A is called eigendecomposition. Why is this useful? For
one, we know immediately everything about Ak for any k ∈ N:
Ak = (U ΛU −1 )k = U Λk U −1 ,

where we used U −1 U = I. Applying A k times does not change its eigenvectors,


but raises its eigenvalues to the k-th power. The set of all eigenvalues is called
eigenspectrum.
Some simple examples. The identity I has only a single eigenvalue λ = 1 which
is counted d times. In general, an eigenvalue λ has multiplicity k ≥ 1 if there
are at most k associated linearly independent eigenvectors. These span the k-
dimensional eigenspace corresponding to λ. Note that within an eigenspace, all
vectors are alike, any can serve as representative (except for 0). What about
λ = 0? The corresponding equation is Av = 0, which means that v ∈ ker A

is in the kernel of A. The kernel is the eigenspace corresponding to λ = 0. A


matrix is nonsingular if and only if it does not have a zero eigenvalue. What is
the eigendecomposition of A = vv T , v ≠ 0? First,
Av = vv T v = kvk2 v,
so λ = kvk2 with multiplicity 1. Second, for any u orthogonal to v: Au = 0,
so λ = 0 with multiplicity d − 1. Finally, suppose that A is the orthogonal
projection onto a subspace V of dimension k < d (Section 4.2.2). Then, Av = v
for any v ∈ V and Av = 0 for any v ∈ V ⊥ , so A has eigenspace V for λ = 1
and kernel V ⊥ (λ = 0). Picture it geometrically: A leaves v ∈ V invariant, while
removing all contributions orthogonal to V.
Trace and determinant can be expressed in terms of the eigenspectrum. First,
tr A = tr U ΛU −1 = tr ΛU −1 U = tr Λ = Σ_{j=1}^d λj .

The trace is the sum of eigenvalues. For any matrix, the sum of eigenvalues is
the same as the sum of diagonal values. For the determinant:
|A| = |U ||Λ||U −1 | = |U ||Λ||U |−1 = |Λ| = Π_{j=1}^d λj ,

the product of the eigenvalues (but not the product of the diagonal entries).
The last mysteries about determinants should disappear now. For example, the
volume interpretation of the determinant given in Section 6.3.1 becomes natural.
The parallelepiped spanned by {uj } is mapped to one spanned by {Auj } =
{λj uj }, therefore the volume changes by |λ1 · · · λd |, the absolute value of the determinant |A|.
Eigenvalues are the roots of the characteristic polynomial. Namely,
Av = λv ⇔ (A − λI) v = 0,
so v ∈ ker(A − λI), v ≠ 0. This happens only if q(λ) = |A − λI| = 0. Now,
a closer look at determinants (Section 6.3.2) reveals that q(λ) is nothing but
a degree d polynomial in λ with coefficients in R, determined by A: q(λ) =
α0 + α1 λ + · · · + (−1)d λd . The spectrum of A is the set of roots of q(λ). If
you need to determine the eigenvalues of a 2 × 2 matrix, find the roots of the
characteristic polynomial:
 
|A − λI| = (a11 − λ)(a22 − λ) − a12 a21 ,   A = [a11 a12 ; a21 a22 ].

Symmetric Matrices. Positive Semidefinite Matrices

Things become simpler if we restrict ourselves to symmetric matrices, AT = A.


First, eigenvalues are always real-valued then, and eigenvectors can be chosen in
Rd . Recall that (·)H denotes Hermitian transposition, where after transposition
each entry is replaced by its complex conjugate. Suppose that v ∈ Cd , λ ∈ C is
an eigen-pair of A. Then,
v H Av = λkvk2 ,   v H AT v = (v H Av)H = λ̄kvk2 .

Note that AH = AT , as A is real-valued. This means that λ = λ̄ (since kvk2 >


0), so that λ ∈ R. Moreover, the corresponding eigenspace is ker(A − λI), which
has a real-valued basis. Moreover, eigenvectors can be chosen to be orthogonal.
For a symmetric matrix A, suppose that u1 , u2 are eigenvectors corresponding
to different eigenvalues λ1 ≠ λ2 . Then, uT1 u2 = 0. Namely,
uT1 Au2 = λ2 uT1 u2 ,   uT1 Au2 = uT1 AT u2 = (uT1 AT u2 )T = uT2 Au1 = λ1 uT2 u1 ,

so that (λ1 − λ2 )uT1 u2 = 0. This means that the eigendecomposition (11.4) is


simpler in the symmetric case. U can be chosen orthonormal, U T U = I. Then,
U −1 = U T , so that

A = U ΛU T , Λ = diag[λj ], λj ∈ R, U T U = I. (11.5)

Finally, what about positive (semi)definite matrices? They are symmetric, so


all eigenvalues are real. Since uT Au ≥ 0 for a positive semidefinite A, all its
eigenvalues are nonnegative. For a positive definite A, we cannot have λ = 0, so
that all eigenvalues are strictly positive. The eigendecomposition of a covariance
matrix is given in (11.5), where all λj > 0.

Variational Characterization of Eigenvalues

We are one step away now from understanding PCA in terms of the eigen-
decomposition of the (sample) covariance matrix. More generally, let A be a
symmetric matrix. Then, A has a real-valued spectrum and we can choose an
orthonormal eigenbasis u1 , . . . , ud with uTi uj = I{i=j} . Suppose that the cor-
responding eigenvalues are ordered: λ1 ≥ λ2 ≥ . . . λd . A result due to Rayleigh
and Ritz states that
u1 ∈ argmax_{kuk=1} uT Au.

If λ1 > λ2 , this maximizer is unique up to sign flip. To show this, we use the
eigendecomposition A = U ΛU T , where U = [u1 , . . . , ud ] and Λ = diag[λj ].
Let u be any unit vector, and let z = U T u. Then,
uT Au = uT U ΛU T u = z T Λz = Σ_{j=1}^d λj zj2 .

The numbers zj2 are nonnegative and sum to one, they form a distribution
over j ∈ {1, . . . , d}. Clearly, the expression is maximized by concentrating all
probability mass on j = 1, since λ1 = max{λj }. This means that u = U z =
u1 is a maximizer of the Rayleigh-Ritz ratio (uT Au)/(uT u). The recursive
definition of the k-th PC direction is obtained in the same way. Suppose that
the unit vector u is orthogonal to uj , j < k. Then, z = U T u must have zj = 0
for j < k, and z T Λz is maximized by placing all probability mass on k, which
implies the maximizer u = uk .
Finally, a non-recursive min-max characterization of the eigenvalues is due to
Courant and Fisher [23, ch. 4.2]. It directly implies the result (11.3) we used in
Section 11.1.1.

11.1.3 Principal Components Analysis in Practice


How to compute PCA directions in practice? Just as with solving linear systems
(Section 4.2.3), there are two regimes. If min{n, d} is of moderate size, say
up to a few thousand, direct methods should be used, as their state-of-the-art
numerical implementations are essentially foolproof to run. We will concentrate
on this regime here. For much larger problems, iterative approximate methods
have to be used. While these are out of scope of this course, some intuition is
provided in Section 11.1.4.
Suppose for now that n ≥ d (more datapoints than dimensions). If
X = n^{-1/2} [x_1, . . . , x_n]^T ∈ R^{n×d},

then (recalling that the data is centered) the data covariance matrix is

Σ̂ = X T X .

We could compute Σ̂, then its eigendecomposition. However, for reasons of


numerical stability, it is preferable to compute the singular value decomposition
(SVD) of X instead [42, ch. 6.7]:

X = V ΨU T , V T V = U T U = I, V ∈ Rn×k , U ∈ Rd×k .

Here, k = rk(X ) ≤ min{d, n}, and Ψ ∈ Rk×k is diagonal with positive entries.
The SVD is a generalization of the eigendecomposition to arbitrary matrices.
Now,

Σ̂ = X^T X = U Ψ (V^T V) Ψ U^T = U Ψ² U^T,

since V^T V = I. This means that Λ = Ψ² are the eigenvalues, U the eigenvectors of Σ̂.
What if n < d (fewer datapoints than dimensions)? For example, gene microarray
data can come with thousands of dimensions, yet fewer than a hundred datapoints.
In this case, we compute the SVD of X T instead:

X T = U ΨV T , U ∈ Rd×k , V ∈ Rn×k .

Once more, Σ̂ = U Ψ2 U T , so that Λ = Ψ2 are the eigenvalues.


Representing PCA in terms of the SVD of the data matrix reveals a curious
fact which is important with iterative large scale PCA algorithms as well (Sec-
tion 11.1.4). Since

X T X = U Ψ2 U T , X X T = V Ψ2 V T ,

the matrices Σ̂ = X T X and X X T have the same set of non-zero eigenvalues.


Moreover, the corresponding eigenvectors, U and V , are closely related. For
example, suppose we have determined the eigenvectors V of X X T . Then,

X T V = U ΨV T V = U Ψ ⇒ U = X T V Ψ−1 .

This means that leading eigenvectors of Σ̂ are obtained from corresponding


eigenvectors of X X T by multiplication with X T .
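
In code, the whole procedure amounts to a few lines. The following sketch (assuming NumPy; the function name is ours) computes PCA directions via the SVD of the scaled, centered data matrix, covering both the n ≥ d and the n < d case through the economy-size SVD:

import numpy as np

def pca_directions(X_raw, M):
    # X_raw: (n, d) matrix with datapoints as rows. Returns the M leading PC
    # directions (columns of U) and the corresponding eigenvalues of the
    # sample covariance matrix.
    n = X_raw.shape[0]
    Xc = X_raw - X_raw.mean(axis=0)                      # center the data
    X = Xc / np.sqrt(n)                                  # X = n^{-1/2} [x_1, ..., x_n]^T
    V, psi, Ut = np.linalg.svd(X, full_matrices=False)   # X = V Psi U^T
    U = Ut.T                                             # eigenvectors of X^T X = Sigma_hat
    lam = psi ** 2                                       # eigenvalues of the sample covariance
    return U[:, :M], lam[:M]

rng = np.random.default_rng(0)
X_raw = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))
U, lam = pca_directions(X_raw, M=3)
print(U.shape, lam)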

11.1.4 Large Scale Principal Components Analysis (*)


If both d and n are very large, iterative approximate methods have to be used.
Two facts allow us to operate in the “astronomical” regime. First, only a small
number M of PC directions have to be extracted. Second, the data matrix X
is very sparse or otherwise structured, so that matrix-vector multiplications
(MVMs) X v, X T w can be computed much faster than O(dn). The method of
choice for computing leading eigenvectors of X T X or X X T is the Lanczos al-
gorithm [17], which gives rise to Matlab eigs. This algorithm requires one MVM
with X and X T per iteration. Typically, the first few dominating eigendirec-
tions are obtained rapidly (the convergence speed depends on the decay rate of
the spectrum). In practice, we apply Lanczos to the smaller of X X T or X T X ,
then use the equivalence noted in Section 11.1.3. Typically, our data will not
be centered, and centering it up front may turn a sparse into a dense matrix.
Fortunately, there is a simple remedy, detailed at the end of this section.
First, let us gain some understanding why methods such as Lanczos work. We
focus on the power method, which is simpler than Lanczos7 . Given a positive
semidefinite A (X T X or X X T in the case of PCA), we would like to ap-
proximate the leading eigendirection u1 . Suppose that λ1 > λ2 , the largest
eigenvalue has multiplicity one. Pick a unit vector v 0 uniformly8 at random.
Iterate ṽ k = Av k−1 , v k = ṽ k /kṽ k k. Then, v k converges to one of ±u1 . The
intuition behind this procedure is as follows. If we multiply v 0 with A repeat-
edly, its contribution along u1 will grow more rapidly than the others, so it will
eventually dominate the renormalized vectors. For the proof, we expand ṽ k into
the orthonormal basis given by the eigenvectors uj of A. In fact, we will use
a slight modification, defining ṽ k = Ak v 0 , which gives rise to the same v k se-
quence. The rationale is that we know what A^k is doing on the eigendirections: A^k u_j = λ_j^k u_j. Therefore, if v_0 = Σ_j α_j u_j, then

ṽ_k = Σ_{j=1}^n α_j λ_j^k u_j,   ‖ṽ_k‖² = Σ_{j=1}^n α_j² λ_j^{2k} = λ_1^{2k} ( α_1² + Σ_{j>1} α_j² (λ_j/λ_1)^{2k} ).

We may assume that α_1 ≠ 0. Since λ_j/λ_1 ∈ [0, 1) for j > 1, we know that

‖ṽ_k‖ / λ_1^k = √( α_1² + Σ_{j>1} α_j² (λ_j/λ_1)^{2k} ) → |α_1|   (k → ∞).

Plugging this in,

v_k = ṽ_k / ‖ṽ_k‖ = Σ_{j=1}^n α_j (λ_j/λ_1)^k (λ_1^k/‖ṽ_k‖) u_j → sgn(α_1) u_1   (k → ∞).

The Lanczos algorithm is slightly more advanced than the power method, but
both require one MVM with A per iteration. The Lanczos algorithm is far supe-
rior to the power method if it comes to approximating M > 1 PCA directions.
Details are found in [17].
7 The power method is widely used in machine learning. Unfortunately, it typically needs

many more MVMs than Lanczos, in particular if M is moderately large. We should use Lanczos
whenever possible.
8 This can be done by sampling i.i.d. Gaussians, then normalizing the resulting vector.
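
To make the power method above concrete, here is a minimal sketch (assuming NumPy; names are ours) which approximates the leading eigendirection of A = X^T X while only ever using matrix-vector multiplications with X and X^T:

import numpy as np

def power_method(mvm, d, num_iters=1000, seed=0):
    # mvm: function computing A @ v for the PSD matrix A of size d x d.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                     # random unit start vector
    for _ in range(num_iters):
        v_tilde = mvm(v)                       # one MVM with A per iteration
        v = v_tilde / np.linalg.norm(v_tilde)  # renormalize
    return v

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 50)) / np.sqrt(500)
v = power_method(lambda u: X.T @ (X @ u), d=50)   # A = X^T X, never formed explicitly
print(v @ (X.T @ (X @ v)))                     # Rayleigh quotient of the result ...
print(np.linalg.eigvalsh(X.T @ X)[-1])         # ... should be close to lambda_1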

Centering

Recall that we assumed so far that our data is centered up front by subtracting off the empirical mean µ̂ = n^{-1} Σ_i x_i. However, as noted above, doing so is
not always advisable. It turns out that centering can be folded into an iterative
method such as Lanczos at no extra cost. Suppose that X = n−1/2 [x1 , . . . , xn ]T
is our original data matrix. We can write the empirical mean as
µ̂ = n^{-1} Σ_{i=1}^n x_i = n^{-1/2} X^T 1_n,   1_n = [1, . . . , 1]^T ∈ R^n.

We would like to replace X by the corresponding centered data matrix, which is

n^{-1/2} [x_1 − µ̂, . . . , x_n − µ̂]^T = X − n^{-1/2} 1_n µ̂^T.
Note that 1n µ̂T = [µ̂, . . . , µ̂]T (n rows). Plugging in the expression for µ̂:

X − n−1 1n 1Tn X = H X , H = I − n−1 1n 1Tn .

The centered data matrix is the product of the uncentered X with the rank
one centering matrix H . Note that H is the orthogonal projection onto the
hyperplane with normal vector 1n (Section 4.2.2), which is the subspace of
vectors whose components sum to zero. Implicit centering works by using H X
in place of X and X T H in place of X T . Here, MVMs with H carry essentially
no additional cost.
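
A sketch of implicit centering (assuming NumPy and SciPy sparse; names are ours): the MVMs with H X and X^T H are implemented by subtracting means, so the sparse X is never densified.

import numpy as np
import scipy.sparse as sp

X = sp.random(1000, 500, density=0.01, random_state=0, format="csr")  # sparse data matrix
rng = np.random.default_rng(0)

def mvm_HX(v):                                 # (H X) v = X v - mean(X v) * 1
    u = X @ v
    return u - u.mean()

def mvm_XtH(w):                                # (X^T H) w = X^T (w - mean(w) * 1)
    return X.T @ (w - w.mean())

# Check against explicit centering on a small dense copy:
Xd = X.toarray()
Xc = Xd - Xd.mean(axis=0)
v, w = rng.standard_normal(500), rng.standard_normal(1000)
print(np.allclose(mvm_HX(v), Xc @ v))
print(np.allclose(mvm_XtH(w), Xc.T @ w))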

11.2 Linear Discriminant Analysis


Recall our motivation for principal components analysis (PCA) from the begin-
ning of this chapter. For our binary classification problem MNIST 8s versus 9s,
we would like to learn a projection U so that z = U T x, a much lower dimen-
sional vector than x itself, represents the most salient information about x. This
is a special case of a linear feature map, z = φ(x) = U T x: we extract M linear
features zj = uTj x from x. In other words, the columns of U are directions
onto which we project x so to obtain a feature zj . As we saw above, PCA di-
rections represent x well in terms of squared reconstruction error. Equivalently,
they maximize the amount of covariance of x which is retained in z. However,
our problem is binary classification. Why should the directions of maximum
covariance of x also be helpful in classifying (x, t) well?
In general, they will not always be helpful, as the simple example in Figure 11.4
demonstrates. This should not come as a surprise. After all, given some training
data {(xi , ti )}, PCA does not depend on the labels {ti }. For many real-world
problems, the directions of maximum variance in the data do not carry much
discriminative information. If x is drawn from a distribution over image bitmaps,
the largest variance direction often represents differences in global illumination
(brightness) across images. If x represents an audio waveform, the first PCA
direction may tell us mainly about volume. For natural image patches x, the
maximum covariance directions are typically low order sinusoids. None of these
directions help much with classification.

Figure 11.4: PCA directions of largest variance need not be directions which are most useful for classification. In this example, the maximum variance direction is closely aligned with the horizontal axis, yet projecting onto it results in substantial overlap between the classes. We would be better off to project onto the vertical axis. (After Bishop (1995), Figure 3.15.)

For binary classification, a single discriminative direction u can be found as


Fisher’s linear discriminant (FLD). To maximize class separation, we need to
maximize the variance between classes, while at the same time minimizing the
variance within classes. The example in Figure 11.4 shows that these require-
ments can be in conflict with each other. If we project onto the horizontal axis,
we attain a larger distance between the class means than if we choose the verti-
cal axis: the between-class variance is larger for the former choice. On the other
hand, both classes spread much more along the horizontal axis. In order to mini-
mize the within-class variance, we are better off projecting onto the vertical axis.
Fisher’s linear discriminant realizes a tradeoff between these two requirements.
Define sample means and covariance matrices:
µ̂_k = n_k^{-1} Σ_{i=1}^n I_{t_i=k} x_i,   k = 0, 1,

Σ̂_k = n_k^{-1} Σ_{i=1}^n I_{t_i=k} (x_i − µ̂_k)(x_i − µ̂_k)^T = n_k^{-1} Σ_{i=1}^n I_{t_i=k} x_i x_i^T − µ̂_k µ̂_k^T.

Here, the label space is T = {0, 1} (for our MNIST example, assign 8 → 0,
9 → 1), and nk is the number of patterns xi with ti = k. The total number
of patterns is n = n0 + n1 . We can quantify the between-class variance as
squared distance between the class means projected onto u: (m0 − m1 )2 , where
mk = uT µ̂k . On the other hand, the within-class scatter for class k can be
measured, in terms of the same units, by summing up the squared distance
between zi = uT xi and mk over all patterns xi belonging to class k:

s_k² = n^{-1} Σ_{i=1}^n I_{t_i=k} (z_i − m_k)² = n^{-1} Σ_{i=1}^n I_{t_i=k} (u^T(x_i − µ̂_{t_i}))².

The total amount of within-class variance is s_0² + s_1². One way to maximize
between-class scatter while minimizing within-class scatter, the one chosen by

Fisher, is to maximize the ratio


J(u) = (m_1 − m_0)² / (s_0² + s_1²) = (u^T(µ̂_1 − µ̂_0))² / ( n^{-1} Σ_{i=1}^n (u^T(x_i − µ̂_{t_i}))² )

over unit norm vectors u. Stop for a second to note the subtle form of the
denominator s20 + s21 . It looks like the usual covariance, but note that in xi − µ̂ti
the mean we subtract depends on the label ti of xi . How do we solve this
problem? First, both numerator and denominator are quadratic functions in u:

J(u) = (u^T S_B u) / (u^T S_W u),

where
S B = (µ̂1 − µ̂0 )(µ̂1 − µ̂0 )T
is the between-class scatter matrix,
S_W = n^{-1} Σ_{i=1}^n (x_i − µ̂_{t_i})(x_i − µ̂_{t_i})^T = (1 − α)Σ̂_0 + α Σ̂_1

is the within-class scatter matrix, and α = n1 /n. We will assume here and
below that the within-class scatter matrix S W is invertible9 . Define d := µ̂1 −
µ̂0 , so that S B = ddT . The form of J(u) looks like a Rayleigh-Ritz ratio we
encountered in Section 11.1.2, only that the denominator squared norm uT S W u
is weighted by S W . We will develop this generalized eigenproblem notion in full
generality below, when we generalize FLD to multiple classes. For now, let us
just set the gradient to zero. Working out ∇u J(u) is a bit simpler if we apply
differentiation to the identity

J(u)uT S W u = uT S B u,

resulting in
(dJ(u)) u^T S_W u = 2 (S_B u − J(u) S_W u)^T (du).

Setting the gradient equal to zero gives

J(u)S W u = S B u. (11.6)

Here, S B u = (dT u)d, so the right hand side of (11.6) is a multiple of d.


Multiplying both sides with S_W^{-1}, we obtain u ∝ S_W^{-1} d. Fisher's linear discriminant direction is given by

û_FLD = S_W^{-1} d / ‖S_W^{-1} d‖,   d = µ̂_1 − µ̂_0.

In order to compute ûFLD , we determine the within-class scatter matrix S W


and the class means, then solve the linear system

S W u = µ̂1 − µ̂0 ,


Figure 11.5: Comparison of single linear features on the MNIST 8 vs. 9 binary classification problem. The dataset consists of 500 patterns from each of the two classes, randomly drawn from the MNIST training database. Top left: Fisher's linear discriminant direction. Top right: First principal component direction. Shown are histograms of feature values over data from each class. While the FLD histograms are nicely separated, they overlap substantially for the PCA feature (little discriminative power). Bottom: Feature values FLD vs. PCA; the data is perfectly separable when projected onto the FLD direction (horizontal axis).

a procedure which is simpler even than PCA.
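
For concreteness, here is a minimal sketch of this computation (assuming NumPy; the function name is ours). It forms the within-class scatter matrix and the class means, then solves a single linear system:

import numpy as np

def fld_direction(X, t):
    # X: (n, d) data matrix, t: labels in {0, 1}. Returns the unit-norm FLD direction.
    mu0, mu1 = X[t == 0].mean(axis=0), X[t == 1].mean(axis=0)
    Xc = X - np.where(t[:, None] == 1, mu1, mu0)   # subtract the class mean of each pattern
    S_W = Xc.T @ Xc / X.shape[0]                   # within-class scatter matrix
    u = np.linalg.solve(S_W, mu1 - mu0)            # one linear system, no eigendecomposition
    return u / np.linalg.norm(u)

rng = np.random.default_rng(0)
X0 = rng.standard_normal((100, 5)) + np.array([1.0, 0, 0, 0, 0])
X1 = rng.standard_normal((100, 5)) - np.array([1.0, 0, 0, 0, 0])
X = np.vstack([X0, X1])
t = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]
print(fld_direction(X, t))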


In Figure 11.5, we compare the FLD direction ûFLD against the first principal
components direction ûPCA (being the leading eigenvector of the total sample
covariance matrix S , ignoring the class labels). The FLD direction provides
an excellent separation of the data, while the PCA direction is not useful in
that respect. A few comments are in order. First, does this mean that PCA is
not useful as a preprocessing technique for classification? By no means. PCA
is widely (and rightly) used in this respect. We do not extract a single PCA
direction, but a number of them, feeding these features into a classifier further
down the line which makes use of the labels ti . The weakness of PCA in the
context of Figure 11.5, its independence from the labels {ti }, can be a strength
when it comes to preprocessing. There may be much more unlabeled than labeled
data, the former cannot be used by discriminative techniques like FLD. As noted
towards the end of Section 10.2.1, feature induction by PCA does not carry the
risk of overfitting, while care has to be taken with FLD.
9 This means that S_W is positive definite, so that the denominator u^T S_W u in J(u) is always positive.

If FLD extracts a single discriminative direction, what is the difference to linear


classification in general, say by the perceptron algorithm, linear or logistic re-
gression? Strictly speaking, FLD is not a classification technique. Given ûFLD ,
we still need to construct a discriminant function. A natural choice is the linear
function
ŷ_FLD(x) = û_FLD^T x + b_FLD,
where the bias parameter bFLD is chosen by minimizing the training error.
Viewed in this way, FLD is simply an alternative to perceptron learning or
logistic regression, which may be simpler to implement. On the other hand, we
will see below how to generalize FLD to linear discriminant analysis over C > 2
classes, which allows us to extract a discriminative subspace of dimension C − 1.
These features can be used with any nonlinear classifier.

11.2.1 Decomposition of Total Covariance


We derived FLD based on the intuition of maximizing the between-class scatter
matrix, while minimizing the within-class scatter matrix. How do these notions
relate to the total sample covariance matrix of the data,
S = n^{-1} Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)^T,   µ̂ = n^{-1} Σ_{i=1}^n x_i,

which PCA is based on? In this section, we show that S can be decomposed
as the sum of S W and a multiple of S B . Recall the definitions of S W and S B
from above, and set d = µ̂1 − µ̂0 . We plug
 
xi − µ̂ = xi − µ̂ti + µ̂ti − µ̂

into the equation for S . Expanding the squares, we see that the “cross-talk”
vanishes:
Σ_{i=1}^n (x_i − µ̂_{t_i})(µ̂_{t_i} − µ̂)^T = Σ_{k=0,1} Σ_{i=1}^n I_{t_i=k} (x_i − µ̂_k)(µ̂_k − µ̂)^T = 0,

since
Σ_{i=1}^n I_{t_i=k} (x_i − µ̂_k) = n_k (µ̂_k − µ̂_k) = 0.

Therefore,
S = n^{-1} Σ_{i=1}^n (x_i − µ̂_{t_i})(x_i − µ̂_{t_i})^T + n^{-1} Σ_{k=0,1} n_k (µ̂_k − µ̂)(µ̂_k − µ̂)^T.

The first part of this equation is just S W . Now, nµ̂ = n0 µ̂0 + n1 µ̂1 , so that

n (µ̂k − µ̂) = (n0 + n1 )µ̂k − n0 µ̂0 − n1 µ̂1 = (2k − 1)n1−k d,

therefore
n^{-1} Σ_{k=0,1} n_k (µ̂_k − µ̂)(µ̂_k − µ̂)^T = n^{-1} Σ_{k=0,1} (n_k (n_{1−k})² / n²) d d^T = α(1 − α) d d^T,

where α = n1 /n. All in all,

S = S W + α(1 − α)ddT , d = µ̂1 − µ̂0 .

This simple decomposition provides insight into the relationship between PCA
and FLD. Maximizing total variance is a good idea for data without class labels,
but once we know the assignment of training pattern to classes, we recognize
that a part of this variance helps with classification (variance between patterns
of different classes), while the remaining part hurts (variance of patterns within
each class) by creating overlap. When maximizing the former part, we have
to control the size of the latter in order to obtain a discriminative direction.
Note that the two parts of S are different in nature. The within-class scatter
matrix S W is a positive definite matrix just like S , while the between-class
scatter matrix S B is of rank one only. We will understand the significance of
this observation in Section 11.2.3.
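
The decomposition S = S_W + α(1 − α)dd^T is easy to verify numerically on any labeled dataset. A minimal sketch (assuming NumPy; data and names are ours):

import numpy as np

rng = np.random.default_rng(0)
X0 = rng.standard_normal((30, 3))
X1 = rng.standard_normal((70, 3)) + 2.0
X = np.vstack([X0, X1])
t = np.r_[np.zeros(30, dtype=int), np.ones(70, dtype=int)]
n, alpha = len(t), 70 / 100

mu0, mu1, mu = X0.mean(axis=0), X1.mean(axis=0), X.mean(axis=0)
S = (X - mu).T @ (X - mu) / n                       # total sample covariance
Xc = X - np.where(t[:, None] == 1, mu1, mu0)
S_W = Xc.T @ Xc / n                                 # within-class scatter
d = mu1 - mu0
print(np.allclose(S, S_W + alpha * (1 - alpha) * np.outer(d, d)))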

11.2.2 Relationship to Optimal Classification (*)


The attentive reader may spot a similarity between the FLD direction û_FLD ∝ S_W^{-1}(µ̂_1 − µ̂_0) and the optimal discriminant for two Gaussian classes with equal
covariance matrices, which we derived in Section 6.4.1. In order to understand
this relationship, we will analyze the following question in this section. If data
comes from two Gaussian classes P (x|t) = N (µt , Σ), t = 0, 1, moreover P (t =
1) = α ∈ (0, 1), and we determine the population FLD direction, how does it
relate to the Bayes optimal weight vector w∗ = Σ−1 (µ1 −µ0 )? Do they give rise
to the same discriminant functions, or is FLD suboptimal even if we compute
it based on true means and covariances?
The population FLD direction is proportional to w_FLD = Σ_W^{-1} d, where d = µ_1 − µ_0 (distance between true means) and

Σ_W = E[ (x − µ_t)(x − µ_t)^T ]

is the true within-class covariance matrix. Note the subtlety in this definition:
this is not the covariance of x, which would be

Σ = E[ (x − µ)(x − µ)^T ],

since the expectation is over (x, t), not just x. The Bayes optimal rule was
determined in Section 6.4.1, it comes with the weight vector w∗ = Σ−1 d. Since
Σ 6= ΣW , we have that w∗ 6= wFLD . However, we will see that w∗ and wFLD
are in fact proportional to each other. Since the length of the weight vector does
not matter, we end up with the same optimal discriminant.
From Section 11.2.1, we know that Σ = ΣW +α(1−α)ddT , where α(1−α) > 0.
Therefore,

Σ w∗ = d = (Σ − α(1 − α) d d^T) w_FLD,

so that

Σ w_FLD = d + α(1 − α)(w_FLD^T d) d = (1 + α(1 − α) d^T Σ_W^{-1} d) d = (1 + α(1 − α) d^T Σ_W^{-1} d) Σ w∗.

Here, we used that w_FLD^T d = d^T Σ_W^{-1} d > 0. This means that w_FLD is proportional to w∗, and FLD recovers the Bayes optimal discriminant in this case.

11.2.3 Multiple Classes


In this section, we generalize Fisher’s linear discriminant to multi-way classifica-
tion with C ≥ 2 classes. This general procedure is known as linear discriminant
analysis (LDA). On the way, we will learn about generalized eigenproblems and
simultaneous diagonalization, techniques which many other machine learning
methods are based on. We will also understand why the basic idea between
LDA for C classes is limited to extracting no more than C − 1 independent
linear features.
A natural way to derive LDA is to generalize the decomposition of the total data
covariance matrix S from Section 11.2.1. Assume that T = {0, . . . , C − 1}. In the remainder of this section, we write Σ_k short for Σ_{k∈T}. We assume
that C < min{d, n}. Our aim is to extract M linearly independent directions:
U ∈ Rd×M , where rk(U ) = M . We will specify M ≥ 1 below. The within-class
scatter matrix is generalized easily:
S_W = n^{-1} Σ_k n_k Σ̂_k = n^{-1} Σ_{i=1}^n (x_i − µ̂_{t_i})(x_i − µ̂_{t_i})^T.

At this point, you should go through the derivation in Section 11.2.1 and confirm
for yourself that
S = S_W + S_B,   S_B = n^{-1} Σ_k n_k d_k d_k^T,   d_k = µ̂_k − µ̂.

Here, µ̂k is the sample mean over the patterns from class k, and µ̂ is the overall
sample mean. What is the rank of S B ? Obviously, rk(S B ) ≤ C. We even have
rk(S B ) ≤ C − 1. Define
M = [ √(n_k/n) d_k ]_{k∈T} ∈ R^{d×C}.

Then, S B = M M T . Now, rk(M ) ≤ C − 1, since


M [ √(n_k/n) ]_{k∈T} = n^{-1} Σ_k n_k (µ̂_k − µ̂) = µ̂ − µ̂ = 0,

so the C columns of M are not linearly independent. We may assume that the
class means µ̂k are linearly independent, so that rk(M ) = C − 1 and rk(S B ) =
C − 1. Note that for the binary case C = 2, we recover the FLD situation of a
rank one between-class scatter matrix.
We need to choose U so that the covariance U T S B U is maximum, while the
covariance U T S W U is minimum. There are several ways to measure the size
of covariance, for example its trace (sum of eigenvalues) or its determinant
(product of eigenvalues), see Section 11.1.2. For the derivation of LDA, both
give the same result, so let us go with the trace. The situation for LDA is a bit
more complicated than for PCA, as we need to simultaneously deal with two

covariance matrices S B and S W here, not just with one. At this point, it is
important to understand that what we are after is not the matrix U , but rather
the M -dimensional subspace spanned by its columns. We can replace any U by
U R for an invertible R ∈ RM ×M . Also, the scale of U is arbitrary, so we can
fix it without loss of generality. In order to maximize the between-class scatter
tr U T S B U while controlling the within-class scatter, we can just fix the size of
the latter by imposing the constraint U T S W U = I. The linear discriminant
analysis (LDA) problem is:

max_{U ∈ R^{d×M}} tr U^T S_B U   s.t.   U^T S_W U = I.   (11.7)

This is a generalized eigenproblem. If we replaced the invertible matrix S W by


I, we would obtain the standard Rayleigh-Ritz characterization of the eigende-
composition of S B . We will see how to solve the LDA problem in Section 11.2.4.
Let us look into properties of the LDA problem and its solution. First, what
do we mean by talking about “the solution”? For some solution U and some
orthonormal Q ∈ RM ×M , the matrix U Q is a solution as well. Namely,

(UQ)^T S_W (UQ) = Q^T Q = I,
tr (UQ)^T S_B (UQ) = tr U^T S_B U QQ^T = tr U^T S_B U.

The same indeterminacy holds for PCA as well: U and U Q span the same
subspace. Second, note that a solution of (11.7) is not in general orthonormal.
In fact, U whitens the within-class scatter S W (Section 11.1.1). As we will see
in Section 11.2.4, more is true. For a solution U :

U T S W U = I, U T S B U diagonal.

The transformation U diagonalizes both S W and S B at the same time: it per-


forms a simultaneous diagonalization. The geometric picture provides insight
into the relationship between PCA and LDA. The former focusses on a single
matrix S , which we can diagonalize by an orthonormal transform U , consisting
of its eigenvectors. Alternatively, we can choose a non-orthogonal transform to
whiten S , say by rescaling the eigenvectors by the square root of eigenvalues.
In LDA, we have to deal with two matrices S W and S B , so are more restricted
in what we can do. Obviously, we cannot whiten both of them with one trans-
form in general. Also, it is not in general possible to diagonalize both of them
with a single orthonormal transform U . What we can do is to whiten S W and
diagonalize S B with a single general transform, and solutions to LDA are found
among such simultaneously diagonalizing transforms.
Finally, how to choose M , the number of independent LDA features? Ultimately
this is a model selection problem, but what is the largest M we can possibly
choose? Recall that S B = M M T , where M ∈ Rd×C , therefore

tr U T S B U = tr(U T M )T U T M .

Since rk(M ) = C − 1, the matrix U T M has at most rank C − 1. This means


that maximizing the criterion tr U T S B U cannot determine more than C − 1
independent directions (columns of U ). If M > C −1, there are always solutions

U to (11.7) with M − (C − 1) zero columns (a proof of this fact is given in


Section 11.2.4). This limitation of LDA is most clearly seen for C = 2. There,
S B ∝ ddT is a rank one matrix, and the criterion tr U T S B U clearly only
determines a single direction.

11.2.4 Techniques: Generalized Eigenproblems. Simultaneous Diagonalization (*)
The LDA problem (11.7) is an example of a generalized eigenproblem, defined by
two symmetric matrices S B and S W , where S W is positive definite. Such prob-
lems are ubiquitous in machine learning and applied statistics. In this section,
we discuss generalized eigenproblems in the context of simultaneous diagonal-
ization, before showing how they can be solved efficiently.
The equality-constrained form of the LDA problem (11.7) suggests the following
procedure:

• Whitening of S W : Determine some V ∈ Rd×d such that V T S W V = I.


Note that all whitening transforms are given by V Q, where Q ∈ Rd×d is
orthonormal.

• Diagonalization of S B : Find orthonormal Q ∈ Rd×d such that


(V Q)T S B V Q is diagonal. Output M columns of V Q corresponding to
the largest diagonal entries.

For the whitening step, we use the eigendecomposition S W = RΛRT , where


R are the eigenvectors. Then, V = RΛ^{-1/2} is a whitening transform. Now,
 
(V Q)T S B V Q = QT V T S B V Q,

so Q can be determined by the eigendecomposition V T S B V = Q Λ̃QT . We


manage to simultaneously diagonalize S W and S B by way of two standard
eigendecompositions. While we will shortly explore a more efficient method for
simultaneous diagonalization, the current approach has a simple geometrical in-
terpretation. First, we rotate S W into its eigenbasis, where it becomes diagonal.
Second, we scale it to become white. This step is crucial, since a further orthog-
onal transformation of V T S B V into its eigenbasis (diagonalization) leaves S W
whitened.
We noted in Section 11.2.3 that we are restricted to M ≤ C − 1 in the LDA
problem (11.7). We will see in a moment that a more efficient procedure can
take this reduced size into account. But first, we prove this fact. Suppose that
M ≥ C − 1. We show that we can always find a solution U of Section 11.2.3
whose trailing M − (C − 1) columns are zero. This means that only C − 1
columns can sensibly be determined by the data. Suppose that U is any solution
of (11.7). Since rk(U^T M) = r ≤ C − 1, its singular value decomposition is U^T M = E Λ F^T, where Λ ∈ R^{r×r} is diagonal and E ∈ R^{M×r} has orthonormal columns. Extend Q = [E, ∗] ∈ R^{M×M} to be orthonormal. Then, the last M − r rows
of (U Q)T M = QT U T M are zero, so we might as well blank out the trailing
M − r columns of U Q to attain a solution of (11.7) with trailing zeros.

Solving LDA in Practice (*)

In the remainder of this section, we show how to solve LDA efficiently in practice.
We will see that by combining the lessons learned in this chapter in a clever way,
we can get away with a small eigenproblem of size C × C, along with having
to solve C linear systems with S W , problems which can be solved much more
efficiently than a d × d eigendecomposition. First, since S W is positive definite,
it has a Cholesky decomposition S W = V V T (Section 4.2.2). Substituting
W = V T U , the LDA problem becomes

max_W tr W^T V^{-1} M M^T V^{-T} W   s.t.   W^T W = I.

This is the usual Rayleigh-Ritz characterization (Section 11.1.2), so the so-


lution is the (C − 1)-dimensional leading eigenspace of V −1 M M T V −T =
V −1 M (V −1 M )T .
At this point, we employ the trick discussed in Section 11.1.4. If X = V −1 M ,
then X X T has essentially the same eigendecomposition as

X T X = (V −1 M )T (V −1 M ) = M T (S W )−1 M ∈ RC×C .

Since rk(M ) = C − 1, this matrix has C − 1 positive eigenvalues. We have


derived the following procedure for solving LDA.

• Compute M ∈ Rd×C and solve for (S W )−1 M (C linear systems).


• Compute the eigendecomposition

M T (S W )−1 M = QΛQT , Q ∈ RC×(C−1) , QT Q = I.

According to Section 11.1.4, Ŵ = (V −1 M )QΛ−1/2 are the eigenvectors


for X X T , and

Û = V −T Ŵ = (S W )−1 M QΛ−1/2


solves the LDA problem. Here, we can reuse (S W )−1 M , which was com-
puted above.

With a bit more work, we can even get by with solving C − 1 linear systems
and a (C − 1) × (C − 1) eigendecomposition. For C = 2, this modified procedure
exactly recovers the FLD method derived at the beginning of this section.
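
A compact sketch of this procedure (assuming NumPy; function and variable names are ours). It solves C linear systems with S_W, computes a C × C eigendecomposition, and assembles Û = (S_W)^{-1} M Q Λ^{-1/2}:

import numpy as np

def lda_directions(X, t, C):
    # X: (n, d) data matrix, t: labels in {0,...,C-1}. Returns (d, C-1) LDA directions.
    n, d = X.shape
    mu = X.mean(axis=0)
    mus = np.stack([X[t == k].mean(axis=0) for k in range(C)])   # class means
    nks = np.array([(t == k).sum() for k in range(C)])
    Xc = X - mus[t]                                              # x_i - mu_{t_i}
    S_W = Xc.T @ Xc / n                                          # within-class scatter
    M = ((mus - mu) * np.sqrt(nks / n)[:, None]).T               # (d, C), so that S_B = M M^T
    SWinvM = np.linalg.solve(S_W, M)                             # C linear systems with S_W
    G = M.T @ SWinvM                                             # M^T (S_W)^{-1} M, size C x C
    lam, Q = np.linalg.eigh(G)
    lam, Q = lam[::-1][:C - 1], Q[:, ::-1][:, :C - 1]            # C-1 leading eigen-pairs
    return SWinvM @ Q / np.sqrt(lam)                             # U = (S_W)^{-1} M Q Lambda^{-1/2}

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 4)) + 3 * rng.standard_normal(4) for _ in range(3)])
t = np.repeat(np.arange(3), 50)
U = lda_directions(X, t, C=3)
print(U.shape)                                  # (4, 2): C - 1 = 2 discriminative directions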
Chapter 12

Unsupervised Learning

Machine learning is about inducing robust statistical relationships between vari-


ables from data in an automatic fashion. Variables can be of different kind and
structure and be dependent in many different ways. Many prominent machine
learning problems are of the supervised learning type, where a function from
some input point x to some target variable t is to be learned. The latter lives in
R (regression estimation), in {−1, +1} (binary classification), or in some finite
set (multi-way classification, ordinal regression, ranking). Datasets are labeled,
meaning that pairs (xi , ti ) are observed. In supervised learning, the machine
learning method plays the role of a student in a rote learning session: for x1 you
say t1 , for x2 you say t2 , and so on.
There is more to machine learning, and in this chapter we will begin to get an
idea about some fundamental concepts we have not touched so far, and which
in machine learning lingo are called unsupervised learning. They apply in situ-
ations where rote learning on carefully preprocessed and hand-labeled data is
not possible or would be too costly, or where fitting a function or classifier is
not the aim in the first place. For example, we may want to discover structure
in data {xi } per se, without obtaining any teaching signals. Or our goal may
be classification, but we have to deal with “raw” data, partly unlabeled, riddled
with missing attribute values, outliers and other distortions. In general, super-
vised learning cannot deal with such data, and it is up to us to clean it up.
Unsupervised learning techniques can help us in that respect.
In this chapter, we will mainly concentrate on clustering and mixture density
estimation, important instances of unsupervised learning, yet no more than
the tip of the iceberg. We will understand why it makes sense to augment
probabilistic models by variables which we can never observe. We will find that
maximum likelihood estimation has more to it than computing sample means,
covariances and counting words, is in fact powerful enough to drive general
unsupervised learning. Dimensionality reduction techniques such as principal
components analysis (Section 11.1) are instances of unsupervised learning as
well, even though their modern interpretation in terms of density estimation [43]
is out of scope of this course. In general, the foundation of statistical machine
learning on probabilistic modelling (Chapter 5, Chapter 6) gathers full steam
only with probabilistic unsupervised learning.


12.1 Clustering. K-Means Algorithm


Pretty much from the day you were born, you are faced with having to process
a multitude of sensory input data, which you need to make sense of to survive
and to thrive. You do get some teaching signals from parents, friends, and school
teachers later on, but most of the time you do not. How do you do it? While
human learning is imperfectly understood, a key element to it must be concep-
tual grouping. We organize our world view by putting patterns into distinct or
overlapping bins, the personal labels of which we typically invent ourselves. If
some grouping scheme works well, it can be adopted as consensus, but that does
not mean there has to be any physical “reality” to it. Examples include colours,
spoken words, or names of biological species. A first step to understanding and
organizing data is to cluster it.
For simplicity, we restrict ourselves to non-overlapping clustering. Given a set of
points {xi | i = 1, . . . , n}, a clustering is an assignment of datapoints to K > 1
groups, a partition of the dataset into K clusters. As we will see below, it makes
perfect sense to allow the clustering to be probabilistic, in the sense that each
xi belongs to cluster k with a certain probability which can be different from 0
and 1. However, in the current section, we restrict our attention to deterministic
clustering, where each xi belongs to precisely one of the K clusters. We can
formalize the assignment by introducing additional variables ti ∈ {1, . . . , K}, one
for each pattern xi . A clustering of {xi } is defined by an instantiation of {ti },
in the sense that pattern xi belongs to cluster ti under the assignment. We have
deliberately used the same notation as for K-way classification: clustering is
classification, with the twist that we never get to see any of the labels. Before
we get into any details, a central point to understand about clustering should be
clear from the analogies above. Unlike classification with label data provided,
clustering is a fundamentally ill-posed problem. There is no “best” solution
independent of any assumptions. In order to comprehend the results you get
with a certain clustering scheme, you need to understand the assumptions it is
based on. It is good practice to run several different clustering schemes on your
data in order to get a balanced picture. Notwithstanding such conceptual issues,
the following questions have to be addressed by a clustering scheme.

• How to score the “quality” of cluster assignment {ti }, given our assump-
tions?
• How to find an assignment of maximum score among the combinatorial
set of all possible clusterings?
• How to choose the number K of clusters?

A large number of answers to these questions have been proposed, all com-
ing with strengths and weaknesses. Among the more basic concepts, two are
most widely used: agglomerative and divisive clustering. Both are hierarchi-
cal clustering principles, in that not only a single K-clustering is produced,
but a tree of nested clusterings at different granularities (number of groups).
They start with a user-supplied distance function d(x, x0 ) and the general as-
sumption that distance values between two patterns in the same cluster should
be smaller than distance values between two patterns in different clusters.

Another degree of freedom is the extension of distance between two points


to distance between two sets of points A and B. Frequently used extensions
include d(A, B) = minx∈A,x0 ∈B d(x, x0 ), d(A, B) = maxx∈A,x0 ∈B d(x, x0 ), or
d(A, B) = d(µA , µB ), where µS is the empirical mean of points in S. Agglom-
erative clustering works bottom up. It starts with K = n and each xi forming
its own group. In successive rounds, the two groups with the smallest distance
are combined to form a new group. In contrast, divisive clustering is top down,
starting with K = 1 and all xi in a single group. In each round, one of the
larger remaining groups is split along a boundary of largest pairwise distances.
It is typically somewhat simpler to implement agglomerative schemes. On the
other hand, divisive clustering can be reduced to combinatorial problems such
as minimum cut, which can be solved efficiently. If only a few large clusters are
sought, it can be more efficient than a bottom up scheme.

K-Means Clustering

In the remainder of this section, we concentrate on a widely used clustering


scheme known as K-means clustering or vector quantization. K-means does not
produce hierarchical groupings, the number of clusters K has to be specified
up front. During the algorithm, we do not only maintain an assignment {ti }
between datapoints xi and clusters, but also a set of prototype vectors µk , one
for each group. Intuitively, µk represents the k-th group as its center of mass.
K-means is typically based on the Euclidean distance between vectors (after
preprocessing), and we will concentrate on this case. The method is driven by
the following two requirements on {ti } and {µk }:

• Each prototype vector µk should be the mean of the datapoints xi assigned


to class k:
µ_k = n_k^{-1} Σ_{i=1}^n I_{t_i=k} x_i,   n_k = Σ_{i=1}^n I_{t_i=k}.
For this reason, µk is also called cluster center.
• Each datapoint xi should be assigned to the group whose prototype vector
µk is closest to xi in Euclidean distance:

‖x_i − µ_{t_i}‖ = min_{k=1,...,K} ‖x_i − µ_k‖.

Notice the “chicken-and-egg” structure of these requirements: what we want


for one group of variables depends on what the other group is doing. If we fix
the assignment {ti }, the cluster centers µk are obtained by maximum likelihood
estimation, independently for each group. On the other hand, if we fix the cluster
centers, the ti are assigned by nearest neighbour classification. As we are not
given any of these variables up front, our first attempt should be an iterative
strategy. We initialize the µk at random, say by placing them on top of K
randomly selected datapoints. Then, we iterate over rounds of updating {ti },
then updating {µk } according to the requirements. We stop once the assignment
does not change anymore. This is the K-means algorithm.
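
A minimal sketch of the algorithm just described (assuming NumPy; names are ours):

import numpy as np

def kmeans(X, K, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)].copy()   # centers on K random datapoints
    t = np.full(n, -1)
    for _ in range(num_iters):
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        t_new = dist.argmin(axis=1)                        # assign each point to nearest center
        if np.array_equal(t_new, t):
            break                                          # assignment unchanged: stop
        t = t_new
        for k in range(K):
            if np.any(t == k):
                mu[k] = X[t == k].mean(axis=0)             # re-estimate centers as class means
    return t, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])
t, mu = kmeans(X, K=3)
print(mu)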
In Figure 12.1, we apply K-means to some data. Empirically, the algorithm
always seems to converge to an assignment which fulfils the nearest neighbour


Figure 12.1: Illustration of K-means algorithm on data shown in (a), K = 2. The


top row shows the first iteration, starting from the initial cluster centers in (a).
(b): Each datapoint is assigned to the nearest cluster center. (c): The centers are
re-estimated as empirical means of their data. (d)–(i): Further iterations until
convergence. Figure from [5], used with permission.

criterion. However, when we run it multiple times, we can end up with different
solutions, depending on how the initialization was done. Since each assignment
we find satisfies our assumption, we would have to be more specific about how
to compare two different outcomes of K-means. In Section 12.1.1, we will devise
an energy function which scores assignments {ti }, and which K-means descends
on.

All in all, K-means is a simple and efficient algorithm, which is widely used
for data analysis and within preprocessing pipelines of machine learning appli-
cations. Its main attraction is that it reduces to nearest neighbour search, a
computational primitive which is very well studied, and for which highly effi-
cient algorithms and data structures are known. On the other hand, the non-
uniqueness of solutions can be a significant problem in practice, necessitating

restarts with different initial conditions. Also, K-means is not a robust algo-
rithm in general: small changes in the data {xi } can imply large changes in the
clustering {ti }. Some of these problems are due to the hard cluster assignments
we are forced to make in every single iteration. Finally, a good value for K is
hard to select automatically from data. We will address some of these issues
in the remainder of this chapter, when we devise a probabilistic foundation for
K-means with “soft” assignments.

12.1.1 Analysis of the K-Means Algorithm


In this section, we obtain a more detailed understanding about the K-means
clustering algorithm. We will postulate an energy function over assignments {ti }
and prototype vectors {µk } and prove that each iteration of K-means decreases
this function. In fact, we will show that K-means terminates in finitely many
iterations. Viewed as a function of {µk } only, this energy is not convex, which
provides a rationale for the non-uniqueness of K-means in practice.
For conciseness, denote t = [ti ] and µ = [µk ]. An energy function for K-means
is
φ(t, µ) = Σ_{i=1}^n Σ_{k=1}^K I_{t_i=k} ‖x_i − µ_k‖².

We will show that φ(t, µ) cannot increase with an iteration of K-means. More-
over, whenever an iteration does not produce a decrease, a target assignment is
reached. This means that the algorithm is finitely convergent. Namely, both t
and µ can only ever attain a finite number of different values. This is clear for
t. Moreover, each µk is the empirical average over a subset of the datapoints
xi , which are fixed up front.
Suppose we are at (t, µ). We first update t → t′, then µ → µ′, so that at the end of an iteration, each µ_k is the sample average over the patterns assigned to class k. Denote φ_0 = φ(t, µ), φ_1 = φ(t′, µ), φ_2 = φ(t′, µ′). We also assume that if for any i,

‖x_i − µ_{t_i}‖² = min_k ‖x_i − µ_k‖²,

then t′_i = t_i (no change). First,


φ_0 − φ_1 = Σ_{i=1}^n ( ‖x_i − µ_{t_i}‖² − ‖x_i − µ_{t′_i}‖² ) = Σ_{i=1}^n ( ‖x_i − µ_{t_i}‖² − min_k ‖x_i − µ_k‖² ) ≥ 0,

and φ_1 = φ_0 if and only if t = t′. Second,


µ′_k = (n′_k)^{-1} Σ_{i=1}^n I_{t′_i=k} x_i,   n′_k = Σ_{i=1}^n I_{t′_i=k}.

As so often before, we expand the quadratic

‖x_i − µ_k‖² = ‖(x_i − µ′_k) + (µ′_k − µ_k)‖²,



making use of the vanishing cross-talk Σ_i I_{t′_i=k} (x_i − µ′_k) = 0, so that

φ_1 − φ_2 = Σ_{k=1}^K Σ_{i=1}^n I_{t′_i=k} ( ‖x_i − µ_k‖² − ‖x_i − µ′_k‖² ) = Σ_{k=1}^K n′_k ‖µ′_k − µ_k‖²,

and φ_2 = φ_1 if and only if µ′_k = µ_k for all k. All in all, φ_2 ≤ φ_0. Moreover, if φ_2 = φ_0, then t′ = t and K-means stops.
In practice, K-means tends to converge in few rounds. However, it is not neces-
sarily a well-behaved algorithm. Unfortunately, it does not always converge to
a global minimum solution of φ(t, µ). Finding such a global minimum can be a
hard problem.

12.2 Density Estimation. Mixture Models


Clustering methods such as K-means do not have an immediate interpretation
in terms of probabilistic modelling. They are formulated in terms of distances
rather than likelihood functions, and they use hard assignments rather than
posterior probabilities during optimization. However, already our notation in
terms of (xi , ti ) and µk , as well as the use of Euclidean squared distances kxi −
µk k2 suggests some link to generative models with Gaussian class-conditional
densities we learned about in Section 6.4. Consider a generative model defined
in terms of class-conditionals p(x|t) = N (x|µt , I) and P (t = k) = 1/K, where
k ∈ {1, . . . , K}. It encodes the assumption that a datapoint xi is sampled by
first drawing its group label ti uniformly from {1, . . . , K}, then the pattern from
the spherical Gaussian N (xi |µti , I). Given this model, we can understand the
single steps of K-means as operations we already know. First, the update of
µk is the maximum likelihood estimator for a Gaussian mean, restricted to the
points xi assigned to the k-th group. Second, the update of ti is a maximum a
posteriori assignment for fixed model parameters {µk }:

t_i = argmin_k ‖x_i − µ_k‖² = argmax_k N(x_i|µ_k, I) P(t = k) = argmax_k P(t_i = k|x_i).

There is a marked difference between our clustering situation and generative


classification in Section 6.4: we do not know the label values ti here. Maximum
likelihood estimation does not work here, because we are missing half of the
data! Does it? How about the likelihood function for {xi } only?
Π_{i=1}^n p(x_i) = Π_{i=1}^n ( Σ_{k=1}^K p(x_i|t_i = k) P(t_i = k) ).

Maybe we should adjust the parameters µk and P (t = k) by maximizing this


likelihood function. Ironically, it is made up of these normalization constants
p(xi ) which we happily dropped in all developments up to now.
In order to analyze and understand data beyond estimating some classification
or regression function, we can try to build a model of the data-generating den-
sity itself. We can then fit model parameters by maximizing the likelihood. This
is called density estimation, the basic concept behind “unsupervised learning”.

Did we not do all that before in Chapter 6, under the umbrella of generative
modelling? True, but we will apply this principle to more complex models in
this chapter, thereby unleashing its power. For the models in Chapter 6, vari-
ables were either observed or could be estimated by simple formulae such as
empirical mean, empirical covariance, or count ratios. Here, we augment models
by additional latent variables, such as our group indicators ti , with the aim of
making the model more expressive and realistic.


Figure 12.2: Illustration of different density estimation techniques, applied to


the dataset shown in the top middle panel. Top left: A single Gaussian does not
provide a good fit of the shape of the data. Top right: Kernel density estimator
with kernel width h = 2 (see text). Good fit which comes at a high cost (one
Gaussian kernel is placed on each of the 100 datapoints). Bottom middle: Gaus-
sian mixture model with 3 components, fitted to the data by the EM algorithm
(result shown bottom left). Excellent fit at moderate cost.

12.2.1 Mixture Models

Given the data in Figure 12.2 (top middle), what type of model should we
use for density estimation? The simplest choice would be a single Gaussian
N (x|µ, Σ). However, the maximum likelihood fit is not good in this case: it
is unlikely that this data comes from a single Gaussian (Figure 12.2, top left).
Another simple idea is known as kernel density estimation. We place a Gaussian
function N (x|xi , h2 I) on each datapoint, then estimate the density as
p̂(x|h) = (1/n) Σ_{i=1}^n N(x|x_i, h² I).

The kernel width h > 0 is the only free parameter, its choice is a model se-
lection problem (Chapter 10). The kernel density estimator is an instance of

a nonparametric1 statistical technique. As seen in Figure 12.2 (top right), this


density estimator can provide a good fit, even though adjusting the kernel width
h can be difficult. For example, if h is overly small, p̂(x|h) is a set of peaks at
the datapoints, the equivalent of over-fitting for density estimation. In contrast,
if h is too large, fine but significant details are smoothed away. A more serious
problem is the very high cost of evaluating p̂(x|h) on future data. In fact, we
have to store the whole dataset to represent p̂(x|h), and each evaluation requires
the evaluation of n Euclidean distances.
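
A sketch of the kernel density estimator with spherical Gaussian kernels (assuming NumPy; names are ours). Note that each evaluation indeed touches all n datapoints:

import numpy as np

def kde(Xq, X, h):
    # X: (n, d) data, Xq: (m, d) query points, h: kernel width. Returns p_hat(x|h) at Xq.
    n, d = X.shape
    sq = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)       # (m, n) squared distances
    log_kernel = -0.5 * sq / h**2 - 0.5 * d * np.log(2 * np.pi * h**2)
    return np.exp(log_kernel).mean(axis=1)                         # average of n Gaussian bumps

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, (50, 1)), rng.normal(3, 1, (50, 1))])
print(kde(np.array([[-3.0], [0.0], [3.0]]), X, h=1.0))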
A compromise between a single Gaussian and one Gaussian on each datapoint
is to use K  n Gaussians, yet allow their parameters to be adjusted at will.
In particular, their means µk do not have to coincide with datapoints, and they
may have different covariance matrices Σk . This is a Gaussian mixture model
(GMM)
p(x) = Σ_{k=1}^K p(x|t = k) P(t = k) = Σ_{k=1}^K N(x|µ_k, Σ_k) P(t = k).

Here, the p(x|t) are called mixture components and P (t) is called prior distri-
bution. The free parameters of this model are {(µk , Σk )} and {P (t = k)}. If
x is high-dimensional, it is common to use spherical (Σk = σk2 I) or diagonal
covariance matrices. Our probabilistic viewpoint of K-means developed above
corresponds to a GMM with Σk = I. As seen in Figure 12.2 (bottom middle),
a GMM with three components provides an excellent2 fit for our data. On the
other hand, it can be evaluated cheaply on future data.
Mixture models are ubiquitous in machine learning and applied statistics. For
data in Rd , whenever a single Gaussian does not look like an optimal choice,
the next choice must be a GMM with few components. Much of their popular-
ity stems from the simple and intuitive expectation maximization algorithm for
parameter fitting, which we will learn about shortly. For discrete data, mixtures
of multinomials or of independent Bernoulli (binary) distributions are widely
used. There are many variations of the theme, such as hierarchical mixtures,
parameter tying, and many more out of scope of this course. The AutoClass
code3 can be used to setup mixture models over data of many different attribute
types and to run expectation maximization. Mixture models have profound im-
pact on many applications. For example, modern large vocabulary continuous
speech recognition systems are based on Gaussian mixture densities.
Given a mixture model, we can fit its parameters to data {xi } by maximum
likelihood estimation. For the GMM introduced at the beginning of this section,
the log likelihood is
L(µ, π) = Σ_{i=1}^n log Σ_{k=1}^K π_k N(x_i|µ_k, I),   the inner sum being p(x_i),

1 Other examples for nonparametric techniques are nearest neighbour classification (Sec-

tion 2.1) and kernel methods such as the support vector machine (Chapter 9).
2 Not too surprisingly, as the true data generating distribution was a Gaussian mixture

with three components in this case (not shown).


3 ti.arc.nasa.gov/tech/rse/synthesis-projects-applications/autoclass/autoclass-c/

where πk = P (t = k) and µk is the mean of the k-th mixture component.


This log likelihood is different from what we encountered so far. The sum over
k is inside the logarithm, and we cannot solve for µ and π directly anymore.
Such functions are called (log) marginal likelihoods. Still, let us try to take the
gradient w.r.t. µk and see how far we get. Recall the definition of the posterior:

P(t_i = k|x_i) = p(x_i|t_i = k) P(t_i = k) / p(x_i) = π_k N(x_i|µ_k, I) / p(x_i).

Let us concentrate on the i-th term:


∇_{µ_k} log p(x_i) = ∇_{µ_k} p(x_i) / p(x_i) = ∇_{µ_k} p(x_i, t_i = k) / p(x_i) = ∇_{µ_k} e^{log(π_k N(x_i|µ_k, I))} / p(x_i)
= π_k N(x_i|µ_k, I) ∇_{µ_k} log N(x_i|µ_k, I) / p(x_i)
= P(t_i = k|x_i) ∇_{µ_k} log N(x_i|µ_k, I) = P(t_i = k|x_i)(x_i − µ_k).

Setting this equal to zero and solving for µk , we obtain


µ_k = n_k^{-1} Σ_{i=1}^n P(t_i = k|x_i) x_i,   n_k = Σ_{i=1}^n P(t_i = k|x_i).   (12.1)

The update equation for the mean µk is an empirical average over the data-
points xi , as usual in ML estimation. However, each datapoint is weighted by
its posterior probability of belonging to the k-th class. If xi lies halfway between
class 1 and 2, then half of it contributes to µ1 and µ2 respectively. This is the
soft group assignment we are after. A similar derivation provides us with the
update for πk . However, we need to take into account that π is a distribution,
therefore make use of the technique detailed in Section 6.5.3. You should confirm
for yourself that the update is
π_k = n_k / n,   n_k = Σ_{i=1}^n P(t_i = k|x_i).   (12.2)

We sum up the contributions of each point xi to group k.


There is something not quite right here. When we “solved” the gradient equation
for µk , we ignored the fact that the posterior P (ti = k|xi ) depends on µk as
well. What we have really done is to derive a set of coupled equations, where
parameters we are after appear free on the left side and hidden in the posteriors
on the right side. Nevertheless, it seems sensible to iterate these equations as
follows:

• Compute all posterior distributions [P (ti = k|xi )]k , i = 1, . . . , n.


• Update {µk } and π as posterior-weighted averages, according to (12.1)
and (12.2).

This is an instance of the expectation maximization (EM) algorithm for Gaussian


mixture models. We will establish convergence results for EM in Section 12.3.3.
It provably converges to a local maximum of the log marginal likelihood function

[Figure 12.3 panels: true generating components (π_1 = 0.36, π_2 = 0.36, π_3 = 0.27), followed by the EM fit after iterations 1, 5, 10, 15 and 20.]

Figure 12.3: Expectation maximization algorithm applied to Gaussian mixture


model with three components and general covariance matrices.

L(µ, π). Since L is not in general a convex function, this is all we can hope for.
Iterations of EM applied to a Gaussian mixture model are shown in Figure 12.3.
Properties and generalization of EM are discussed in the next section. Let us
close by making the link between EM and K-means precise. Recall that Σk = I
and πk = 1/K, and the group means µk are the sole parameters to learn.

K-means Clustering                          Expectation Maximization
µ_k ← n_k^{-1} Σ_i I_{t_i=k} x_i            µ_k ← n_k^{-1} Σ_i Q_i(t_i=k) x_i
n_k ← Σ_i I_{t_i=k}                         n_k ← Σ_i Q_i(t_i=k)
t_i ← argmax_k P(t_i=k|x_i)                 Q_i(t_i=k) ← P(t_i=k|x_i)

K-means can be seen as a hard version of EM, or EM as a soft version of


K-means. The main difference is that the posterior probabilities in EM are
replaced by the indicator distributions [I{ti =k} ]k , where ti is the maximizer of
the posterior P (ti = k|xi ). K-means follows a “winner takes all” approach, in
that xi is counted fully towards µti instead of being shared between components
according to the posterior probability.
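
For completeness, here is a minimal sketch of the EM updates (12.1) and (12.2) for the unit-covariance Gaussian mixture model above (assuming NumPy and SciPy; names are ours). The E step computes the posteriors Q_i in the log domain for numerical stability; the M step applies the posterior-weighted averages:

import numpy as np
from scipy.special import logsumexp

def em_gmm(X, K, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)].copy()     # init means on random datapoints
    log_pi = np.full(K, -np.log(K))
    for _ in range(num_iters):
        # E step: Q_i(t_i = k) = P(t_i = k | x_i)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # (n, K)
        log_joint = log_pi - 0.5 * sq - 0.5 * d * np.log(2 * np.pi)   # log pi_k N(x_i|mu_k, I)
        Q = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M step: posterior-weighted means and mixing weights, (12.1) and (12.2)
        nk = Q.sum(axis=0)
        mu = (Q.T @ X) / nk[:, None]
        log_pi = np.log(nk / n)
    return mu, np.exp(log_pi)

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)) + c for c in ([0, 0], [4, 4], [0, 4])])
mu, pi = em_gmm(X, K=3)
print(mu)
print(pi)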

12.3 Latent Variable Models. Expectation Maximization
A mixture model is an instance of a latent variable model. For every observed
variable xi , we augment our model by a latent (or hidden) variable ti . Why
do we bother with variables whose value we can never observe? Because they
allow us, in an economic and intuitive way, to construct complex models from
simple well-understood ingredients. Gaussians have many interesting properties
(Section 6.3), but in terms of expressiveness they are no match to Gaussian
mixtures. Mixture models are only the tip of the latent variable model iceberg.

We can create hierarchical mixtures in order to represent sharing of information


at different levels. We can represent rotations, scaling, translations and other
distortions which may affect our data by latent variables, then use the EM algo-
rithm in order to learn classifiers which are invariant to these transformations.
We can learn from partially incomplete data by assigning latent variables to
missing entries. While advanced latent variable models are out of scope of this
course, they are founded on no more than the general principles laid out here.
We begin with deriving the EM algorithm for general latent variables models,
then give an example involving missing data before analysing the convergence
properties of EM.

12.3.1 The Expectation Maximization Algorithm


In this section, we derive the expectation maximization (EM) algorithm in full
generality. Our latent variable model is given by P (x, h|θ). Here, x collects
observed variables, h hidden variables, and θ are the parameters to be learned.
We frequently encounter the particular model structure
P (x, h|θ) = P (x|h, θ)P (h|θ),
but this does not have to be the case. In our GMM example above, x ← x,
h ← t, and θ ← ({µk }, π). Moreover,
P (x|h, θ) = P (x|t, {µk }) = N (x|µt , I), P (h|θ) = P (t|π) = πt .
Given some training data {xi }, we postulate latent variables hi , one paired with
each xi . The log marginal likelihood function is
L(θ) = log Π_{i=1}^n Σ_{h_i} P(x_i, h_i|θ) = Σ_{i=1}^n log Σ_{h_i} P(x_i, h_i|θ).
Here, Σ_{h_i} denotes a sum over all possible values of the discrete variable h_i. Our derivation goes through just as well for continuous h_i, if we replace the sum by an integral ∫ . . . dh_i.
We can derive the EM criterion just as in Section 12.2.1, by taking the gradient
of L w.r.t. θ. First, since L(θ) is a sum of terms of equal form,
L(θ) = Σ_{i=1}^n log P(x_i|θ),

we can concentrate our derivation on a single pattern xi and sum up results at


the end. For fixed i:
∇_θ log P(x_i|θ) = Σ_{h_i} ∇_θ P(x_i, h_i|θ) / P(x_i|θ) = Σ_{h_i} P(h_i|x_i, θ) ∇_θ log P(x_i, h_i|θ).

The gradient is the expected value of ∇θ log P (xi , hi |θ), where hi ∼


P (hi |xi , θ). The stationary equations are complicated by the fact that the pos-
terior distribution P (hi |xi , θ) itself depends on θ. We can solve these equations
by iterative decoupling. In each iteration, we compute the posteriors:
Qi (hi ) ← P (hi |xi , θ).

This is called the E step (“E” for “expectation”). The Qi (·) are distribution
parameters, en par with θ. In particular, they do not depend on θ during the
second part of the EM iteration, the M step (“M” for “maximization”). The
gradient we consider there is

EQi [∇θ log P (xi , hi |θ)] = ∇θ {Ei (θ; Qi ) := EQi [log P (xi , hi |θ)]} .

Solving the stationary equation for θ amounts to maximizing the surrogate


function
E(θ; {Q_i}) = Σ_{i=1}^n E_i(θ; Q_i) = Σ_{i=1}^n E_{Q_i}[log P(x_i, h_i|θ)]

w.r.t. θ, for the Qi (·) determined in the E step. To sum up, we initialize4 θ to
some sensible values, then iterate over the following two steps:

• E step: Compute posterior distributions

Qi (hi ) ← P (hi |xi , θ), i = 1, . . . , n.

In practice, this amounts to accumulating posterior expectations in terms


of which we can represent the Ei (θ; Qi ) functions.

• M step: Maximize the surrogate criterion


E(θ; {Q_i}) = Σ_{i=1}^n E_i(θ; Q_i) = Σ_{i=1}^n E_{Q_i}[log P(x_i, h_i|θ)]

w.r.t. θ. Given the E step statistics, this step is typically not more difficult
to do than maximizing the complete data log likelihood.

It is a valid question to ask why the M step computations should be simpler


to do than optimizing L(θ) directly. For some latent variable models, the M
step optimization can be complicated, and in such cases using EM is not rec-
ommended (see end of Section 12.3.3). But for many latent variable models,
the M step maximization can be done in closed form. The Gaussian mixture
model case can serve as an example, which is a typical rather than a special
case. Recall the comparison between K-means and EM in Section 12.2.1. Sup-
pose we knew all the group assignments ti . Then, the update for θ consists of
summing up the datapoints xi according to the corresponding indicators. Since
we don’t know the ti , we use EM instead. Importantly, the M step updates look
exactly the same as the updates for known ti , except that we sum over posterior
probabilities instead of indicator values.
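
To make this concrete, the following is a minimal NumPy sketch of EM for the unit-covariance Gaussian mixture model used as the running example, with θ = ({µk}, π). The function name, the initialization heuristic and the toy data are illustrative only, not prescribed by the text; the Gaussian normalizing constant is dropped in the E step since it cancels in the responsibilities.

    import numpy as np

    def em_gmm_unit_cov(X, K, num_iters=50, seed=0):
        """EM for a Gaussian mixture with unit covariance: P(x, t) = pi_t N(x | mu_t, I)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        mu = X[rng.choice(n, size=K, replace=False)]   # init means at random datapoints
        pi = np.full(K, 1.0 / K)
        for _ in range(num_iters):
            # E step: responsibilities Q_i(k) = P(t_i = k | x_i, theta), up to the
            # constant (2 pi)^(-d/2), which cancels in the normalization below
            log_joint = (np.log(pi)[None, :]
                         - 0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1))
            log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
            Q = np.exp(log_joint)
            Q /= Q.sum(axis=1, keepdims=True)                   # n x K responsibilities
            # M step: same updates as for known assignments, indicators -> posteriors
            Nk = Q.sum(axis=0)
            mu = (Q.T @ X) / Nk[:, None]
            pi = Nk / n
        return mu, pi, Q

    # Toy usage: two well-separated clusters in two dimensions
    X = np.vstack([np.random.randn(100, 2) - 3.0, np.random.randn(100, 2) + 3.0])
    mu, pi, Q = em_gmm_unit_cov(X, K=2)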
Some questions are still open with our derivation of EM. Does it always con-
verge? If so, what does it optimize? We will give answers to these questions in
Section 12.3.3, after looking at an application of EM to missing data.
4 It does matter how you initialize EM; you need to pick a domain-dependent heuristic.
For mixture models, it is often a good idea to assign component means µk to randomly chosen
datapoints and to set (co)variance parameters to large values (say, to the total (co)variance
of the whole dataset).

12.3.2 Missing Data


Real-world datasets are seldom complete. With high-dimensional input points,
some attribute values may be missing for many, if not most of the cases. Faced
with such data, we can throw out every incomplete case, which leaves us with
far less data. We can use heuristics to fill in the missing values, risking that
this skews learning and predictions. Alternatively, we can use latent variable
models. Before we start, we should note that the latent variable treatment of
missing data we advocate here is based on the assumption that the pattern
of which attribute values are missing does not carry information: values are
missing at random. This assumption is not always justifiable. If your data is
a set of medical records of patients, missing entries may be due to a doctor
deciding against certain tests based on insight about the patient. If so, then
these entries are not missing at random. Or a measurement device may fail
to output certain attribute values, because they lie out of range or saturate a
sensor. If the pattern of missing values is not random, we may want to take into
account its structure in a more complex model.
Recall the naive Bayes document classifier from Section 6.5. Let us try to im-
prove upon the bag-of-words assumption by using a bigram model for text. Based
on a conditional distribution [pa|b ], a, b ∈ {1, . . . , M } words in the dictionary,
    p_{a|b} ≥ 0, \quad \sum_{a=1}^M p_{a|b} = 1,

the likelihood for a document x is given by


    P(x) = \prod_{j=1}^N p_{x_j | x_{j-1}}, \quad x = [x_1, \ldots, x_N].

Here, x0 = ∅ and [pa|∅ ] is an additional distribution for the first word, which we
assume to be known. For complete data, we would write
    P(x) = \prod_{a=1}^M (p_{a|∅})^{I\{x_1 = a\}} \left( \prod_{b=1}^M (p_{a|b})^{φ_{a|b}(x)} \right), \quad
    φ_{a|b}(x) = \sum_{j=2}^N I\{x_j = a, x_{j-1} = b\},     (12.3)
then estimate pa|b by maximum likelihood, accumulating counts over the docu-
ments assigned to each class.
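
As a point of reference for the EM treatment below, the complete-data maximum likelihood estimate simply accumulates the bigram counts φ_{a|b}(x) and normalizes per conditioning word. A minimal sketch, assuming dictionary words are encoded as integers 0, …, M−1 (the function name and encoding are illustrative):

    import numpy as np

    def fit_bigram_ml(docs, M):
        """Maximum likelihood bigram probabilities p_{a|b} from fully observed documents.

        docs: list of integer arrays with entries in {0, ..., M-1}.
        Returns a matrix P with P[a, b] = p_{a|b}.
        """
        counts = np.zeros((M, M))                  # counts[a, b] accumulates phi_{a|b}
        for x in docs:
            for j in range(1, len(x)):
                counts[x[j], x[j - 1]] += 1.0      # word a = x_j follows word b = x_{j-1}
        col_sums = counts.sum(axis=0, keepdims=True)
        return counts / np.maximum(col_sums, 1.0)  # normalize each column; guard empty columns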
However, what do we do if some words xj are missing in our training documents?
A simple idea would be to break up documents at missing word locations, to
treat all completely observed parts as independent documents per se. This would
probably be a good working solution in this case, even though it would wreak
havoc with a structured5 document model. But let us work out a principled
solution based on latent variables and EM, which in the case of a substantial
fraction of missing words may well make a difference even in this simple setup.
Our training corpus consists of documents xi = [xij] ∈ {1, . . . , M}^{Ni}. For no-
tational simplicity, we drop the document index i and concentrate on a single
5 For example, on top of the bigram probabilities, we may impose a higher order structure

(title, abstract, introduction, main, references), or we may want to model document length.

document x = [xj ] of length N . The “missing at random” assumption is implicit


in our derivation: the fact that some word is missing in some document does
not depend on the document identity within the corpus, neither on the position
within the document. Purely for simplicity, we will also assume6 that no two
consecutive words can be missing: if xj is missing, then xj−1 and xj+1 are ob-
served. To avoid trivialities, we also assume that the first word x1 is observed.
If the last word xN is missing, we simply strip it off, so it is no restriction to
assume that xN is observed. We declare all existing words in x as observed vari-
ables and all missing words as latent variables. In the notation of Section 12.3.1,
these would be x (observed) and h (latent), but it is easier to stick with x for
the document, identifying missing spots by H ⊂ {1, . . . , N }. Therefore, entries
xj with j ∈ H are missing, while entries xj with j ∉ H are observed. Note
that H for different documents need not be the same, and we allow for H = ∅.
Denote the observed index set by O = {1, . . . , N} \ H; moreover, write xO = [xj]_{j∈O} for
the observed, xH = [xj ]j∈H for the missing part of the document. The marginal
likelihood P(x_O) over the observed data is obtained by starting from the complete
likelihood P(x) as in (12.3), then summing over x_H: P(x_O) = \sum_{x_H} P(x).
Our log marginal likelihood criterion is L(θ) = log P (xO ), where θ = [pa|b ]
consists of M conditional distributions over {1, . . . , M }.

Figure 12.4: Graphical model for four consecutive words x1 , x2 , x3 , x4 . Here,


O = {x1 , x3 , x4 } are observed, while H = {x2 } are latent. We can use the EM
algorithm in order to learn from the observed data only.

What does EM look like for this missing data example? In the E step, we
need to compute the posterior distributions

Q(xH ) ← P (xH |xO , θ),

one for each document. To fix ideas, consider the four-word example in Fig-
ure 12.4. Here, H = {2}, while O = {1, 3, 4}. The posterior distribution is

    P(x_2 | x_1, x_3, x_4) = \frac{P(x_1, x_2, x_3, x_4)}{\sum_{x_2'} P(x_1, x_2', x_3, x_4)}
    = \frac{p_{x_4|x_3} p_{x_3|x_2} p_{x_2|x_1} p_{x_1|∅}}{\sum_{x_2'} p_{x_4|x_3} p_{x_3|x_2'} p_{x_2'|x_1} p_{x_1|∅}}.

The first point to note is that all terms in the numerator which do not depend
on x2 , appear in the denominator as well, and they cancel each other out:
    P(x_2 | x_1, x_3, x_4) = \frac{p_{x_3|x_2} p_{x_2|x_1}}{\sum_{x_2'} p_{x_3|x_2'} p_{x_2'|x_1}}.

6 It is straightforward to remove this assumption, but the derivation becomes a bit more

tedious without revealing new ideas.



What happens in the general case? No matter how many other observed words
occur elsewhere: if only a single word xj is missing, cancellation happens all the
same:

    P(x_j | x_O) = \frac{p_{x_{j+1}|x_j} p_{x_j|x_{j-1}}}{\sum_{x_j'} p_{x_{j+1}|x_j'} p_{x_j'|x_{j-1}}}.     (12.4)

The same formula holds even if other words are missing as well. To understand
the following general argument, it helps to write down some small examples (do
that!). Namely,
    P(x_j | x_O) = \frac{P(x_j, x_O)}{P(x_O)}, \quad P(x_j, x_O) = \sum_{x_H \setminus x_j} P(x).

By our assumptions, if j ∈ H, then x_{j±1} are observed. If we denote
H_{<j} = {k ∈ H | k < j} and H_{>j} = {k ∈ H | k > j}, then

    P(x_j, x_O) = C_{>j} \, p_{x_{j+1}|x_j} p_{x_j|x_{j-1}} \, C_{<j},

    C_{<j} = \sum_{x_{H_{<j}}} P(x_1, \ldots, x_{j-1}), \quad C_{>j} = \sum_{x_{H_{>j}}} P(x_{j+1}, \ldots, x_N).

Since C<j and C>j do not depend on xj , they cancel out in (12.4). We have
shown that

    Q(x_H) ← P(x_H | x_O, θ) = \prod_{j ∈ H} P(x_j | x_O, θ),

where P (xj |xO , θ) is computed as (12.4). A naive7 way to do the E step com-
putations costs O(|H| M ) per document.
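
Since under our assumptions a missing word only ever has observed neighbours, the posterior (12.4) is a purely local computation over the M candidate values. A small sketch, reusing the matrix encoding P[a, b] = p_{a|b} from the previous snippet (names are illustrative):

    import numpy as np

    def posterior_missing_word(P, x_prev, x_next):
        """Posterior Q(x_j = a) of a missing word given its observed neighbours,
        as in (12.4): proportional to p_{x_next | a} * p_{a | x_prev}."""
        q = P[x_next, :] * P[:, x_prev]   # unnormalized posterior over all candidates a
        return q / q.sum()                # normalize; cost O(M) per missing word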
For the M step, we follow the mantra laid out in Section 12.3.1. The surrogate
criterion E(θ; Q) for our document is obtained by averaging
    \log P(x) ≐ \sum_{a=1}^M \sum_{b=1}^M φ_{a|b}(x) \log p_{a|b}

over Q(x_H) (here, ≐ denotes equality up to a constant independent of θ). All
we have to do is to average the indicators φa|b (x), then update the pa|b in the
same way as before, based on these average statistics. A convenient way to write
the average statistics is to extend the Q distribution over the whole document
x. To this end, we deviate from our convention above and denote the observed
values by x̃O instead of xO . Then,
    Q(x) = Q(x_H) \prod_{j ∈ O} I\{x_j = x̃_j\}.

Taking an expectation w.r.t. Q(x) means plugging in x̃O for xO , then taking
the expectation over Q(xH ). Given this definition, the M step surrogate is
    E(θ; Q) ≐ \sum_{a=1}^M \sum_{b=1}^M Q_{a|b} \log p_{a|b},

    Q_{a|b} = E_Q[φ_{a|b}(x)] = \sum_{j=2}^N Q(x_j = a) \, Q(x_{j-1} = b).
7 In practice, for any given b, p_{a|b} ≈ 0 for most a ∈ {1, . . . , M}. This approximate sparsity
of the distributions can be used to speed up accumulations dramatically.

Please check for yourself how this computation would be done efficiently in
practice. For example, at least one of the factors in Q(xj = a)Q(xj−1 = b) is
an indicator. If most of the words are observed, it may be fastest to first do the
usual accumulation over observed pairs (xj , xj−1 ), followed by adding terms for
the missing xj . Finally, according to Section 6.5.3, the M step updates are

    p_{a|b} = \frac{Q_{a|b}}{\sum_{a'} Q_{a'|b}}, \quad a, b ∈ \{1, \ldots, M\}.
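
In code, the expected counts Q_{a|b} for one document can be accumulated in a single pass: observed pairs contribute ordinary indicator counts, while the posterior of a missing word (computed as in the previous snippet) is spread over its M candidate values. A sketch under the assumptions above, no two consecutive words missing and the first word observed (names are illustrative):

    import numpy as np

    def accumulate_bigram_stats(doc, missing, Q_post, M):
        """Expected counts Q_{a|b} for one document.

        doc: integer array (entries at missing positions are never read),
        missing: set of missing positions j (0-based; position 0 is observed),
        Q_post: dict mapping each missing position j to its posterior over {0, ..., M-1}.
        """
        Q = np.zeros((M, M))
        for j in range(1, len(doc)):
            if j in missing:                   # x_j latent, x_{j-1} observed
                Q[:, doc[j - 1]] += Q_post[j]
            elif (j - 1) in missing:           # x_{j-1} latent, x_j observed
                Q[doc[j], :] += Q_post[j - 1]
            else:                              # both observed: ordinary count
                Q[doc[j], doc[j - 1]] += 1.0
        return Q

    # M step over a corpus: sum the Q matrices of all documents, then normalize
    # each column, p_{a|b} = Q_{a|b} / sum_{a'} Q_{a'|b}.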

12.3.3 Convergence of Expectation Maximization


In this section, we show that under mild assumptions, the EM algorithm con-
verges to a local maximum of the log marginal likelihood. We also comment on
the relationship between EM and alternative nonlinear optimization techniques.
Recall that the log marginal likelihood for a general latent variable model is
    L(θ) = \sum_{i=1}^n \log P(x_i | θ), \quad P(x_i | θ) = \sum_{h_i} P(x_i, h_i | θ).

Here, the latent variable h is discrete. Our derivation applies without changes
for continuous (or mixed) latent variables as well, we only have to replace sums
by integrals. During EM, we maintain distributions Qi (hi ) over hi , one for each
datapoint, writing Q for {Qi }. Our argument is based on the auxiliary criterion
    φ(θ; Q) = \sum_{i=1}^n \left( E_{Q_i}[\log P(x_i, h_i | θ)] + E_{Q_i}[-\log Q_i(h_i)] \right).

Here,

    H[Q_i(h_i)] = E_{Q_i}[-\log Q_i(h_i)] = \sum_{h_i} Q_i(h_i)(-\log Q_i(h_i))

is the entropy of Qi , a measure of the amount of uncertainty in hi ∼ Qi [10].


Note that
    φ(θ; Q) = E(θ; Q) + \sum_{i=1}^n H[Q_i(h_i)],

where E(θ; Q) is the surrogate criterion defined in Section 12.3.1. The entropic
part is added to the energy E for technical reasons. It does not depend on θ,
so the M step is not influenced.
In order to establish EM convergence, we show that L(θ) cannot decrease with
an EM iteration. Moreover, if it does not increase, we must be at a local maxi-
mum. A key step will be the bound

φ(θ; Q) ≤ L(θ) for all θ, Qi (hi ), (12.5)

where φ is maximized w.r.t. the Qi in the E step and increased sufficiently


w.r.t. θ in the M step. Our arguments are illustrated in Figure 12.5. A general
requirement for EM convergence is that the log marginal likelihood L(θ) is
upper bounded, which may require additional assumptions on the parameters.
For example, the log likelihood for a Gaussian mixture model with different


Figure 12.5: Illustration of one iteration of the EM algorithm. (a) The EM


criterion φ(θ; Q) is a lower bound on the log marginal likelihood L(θ). (b)
In the E step, we update Q so to equate φ(θ; Q) and L(θ) for the current
parameters θ. (c) In the M step, we update θ → θ new so to maximize φ(θ; Q).
Since L(θ new ) ≥ φ(θ new ; Q), this update increases the marginal likelihood as
well.
Figure inspired by [5], Figures 9.11 to 9.13.

covariances for each component is not upper bounded per se. We can place one
component on top of a datapoint and shrink the variance to zero in order to
obtain infinite likelihood! Such degenerate solutions are avoided by constraining
the variances by some positive lower bound. In the remainder of this section,
we assume that L(θ) is upper bounded.
We begin by relating L(θ) and φ(θ; Q) for arbitrary distributions Qi (hi ). Pick
any i ∈ {1, . . . , n}. For the following, recall the definition of the posterior
P (hi |xi , θ):

    \log P(x_i | θ) = E_{Q_i}[\log P(x_i | θ)] = E_{Q_i}\!\left[ \log \frac{P(x_i, h_i | θ)}{P(h_i | x_i, θ)} \right]
    = E_{Q_i}\!\left[ \log \frac{P(x_i, h_i | θ) \, Q_i(h_i)}{P(h_i | x_i, θ) \, Q_i(h_i)} \right]
    = E_{Q_i}\!\left[ \log \frac{P(x_i, h_i | θ)}{Q_i(h_i)} + \log \frac{Q_i(h_i)}{P(h_i | x_i, θ)} \right]
    = E_{Q_i}[\log P(x_i, h_i | θ)] + H[Q_i(h_i)] + D[Q_i(h_i) \,\|\, P(h_i | x_i, θ)].

Here, D[Q_i(h_i) ‖ P(h_i | x_i, θ)] is the relative entropy from Section 6.5.3 (make
sure you recall the derivations in that section, we will need them here). This
means that
    φ(θ; Q) = L(θ) − \sum_{i=1}^n D[Q_i(h_i) \,\|\, P(h_i | x_i, θ)]     (12.6)

for any θ and any distributions Qi (hi ). But we know that the relative entropy
between two distributions is nonnegative, and zero if and only if the distributions
are the same, so that (12.6) implies (12.5), with equality if and only if Qi (hi ) =

P (hi |xi , θ) for all i = 1, . . . , n. As this is what we do in the E step, it can


only increase φ(θ; Q). Moreover, the M step increases E(θ; Q), and therefore
φ(θ; Q), by definition. Since φ(θ; Q) ≤ L(θ), it is upper bounded as well, which
completes our convergence proof.
Moreover, if an iteration of EM leaves φ(θ; Q) the same, we have reached a
stationary point of the log marginal likelihood L(θ). Specifically, if Qi (hi ) =
P (hi |xi , θ) for all i = 1, . . . , n, then

∇θ L(θ) = ∂θ φ(θ; Q).

The right hand side are partial derivatives, the Qi do not depend on θ. There-
fore, a stationary point of the EM algorithm must be a stationary point of L(θ)
as well. We already derived the gradient of L(θ) in Section 12.3.1:
    ∇_θ \log \sum_{h_i} P(x_i, h_i | θ) = \frac{1}{P(x_i | θ)} \sum_{h_i} P(x_i, h_i | θ) \, ∇_θ \log P(x_i, h_i | θ)

    = E_{Q_i(h_i)}[∇_θ \log P(x_i, h_i | θ)] = ∂_θ \left( E_{Q_i}[\log P(x_i, h_i | θ)] + H[Q_i(h_i)] \right).

Summing over i = 1, . . . , n, we obtain the gradient identity.
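
Both the lower bound (12.6) and the gradient identity are easy to check numerically on a toy model. The sketch below uses a one-dimensional mixture with unit variances and fixed, known mixing weights, so that θ consists of the two means only; all concrete numbers in it are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.array([0.3, 0.7])                  # fixed, known mixing weights
    mu = np.array([-1.0, 2.0])                 # parameters theta: the two means
    x = rng.normal(size=20)                    # some toy data

    def log_joint(mu):
        # log P(x_i, h_i = k | theta) for all i, k; shape (n, K)
        return (np.log(pi)[None, :] - 0.5 * np.log(2 * np.pi)
                - 0.5 * (x[:, None] - mu[None, :]) ** 2)

    def L(mu):
        # log marginal likelihood: sum_i log sum_k P(x_i, k | theta)
        return np.log(np.exp(log_joint(mu)).sum(axis=1)).sum()

    def phi(mu, Q):
        # EM criterion: energy plus entropy, summed over datapoints
        lj = log_joint(mu)
        return (Q * (lj - np.log(Q))).sum()

    # phi <= L for arbitrary Q; the E step posterior makes the bound tight (12.6)
    Q_rand = rng.random((20, 2)); Q_rand /= Q_rand.sum(axis=1, keepdims=True)
    post = np.exp(log_joint(mu)); post /= post.sum(axis=1, keepdims=True)
    assert phi(mu, Q_rand) <= L(mu) + 1e-10
    assert abs(phi(mu, post) - L(mu)) < 1e-10

    # Gradient identity: dL/dmu_k = sum_i Q_i(k) (x_i - mu_k), with Q_i the posterior
    grad_em = (post * (x[:, None] - mu[None, :])).sum(axis=0)
    eps = 1e-6
    grad_fd = np.array([(L(mu + eps * np.eye(2)[k]) - L(mu - eps * np.eye(2)[k])) / (2 * eps)
                        for k in range(2)])
    assert np.allclose(grad_em, grad_fd, atol=1e-5)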

To EM or not to EM

This concludes our discussion of the EM algorithm, a simple and often surpris-
ingly efficient method for finding a stationary point (local maximum) of the log
likelihood of observed data. As discussed in Section 12.3.1, EM is particularly
attractive if the M step computations can be done in closed form, by accumulat-
ing sufficient statistics over E step posteriors rather than counts. However, it is
important to note that EM is not always the best algorithm to solve maxθ L(θ)
locally. We briefly discussed nonlinear gradient-based optimizers in Section 3.4.2.
As seen above, computing the gradient ∇θ L(θ) comes at exactly the same cost
as a single E step in EM. When comparing EM with alternatives, we should
focus on two points:

• Can we solve the M step in closed form, or do we need iterative techniques


there as well?

• How many EM iterations do we require until convergence, versus how


many iterations for an alternative optimizer?

To address the second point first, EM is generally reported to converge rapidly


to a solution of low accuracy, but can be very slow to attain medium to high
accuracy. If it is important to attain high accuracy, EM is not a good choice.
However, in the author’s opinion, it is the first point which decides for or against
EM. If the M step can be done in closed form, it is hard to argue against the
simplicity of EM, in particular since a low accuracy solution is typically reached
in few iterations. On the other hand, if the M step requires iterative optimization
as well, or even worse, necessitates some additional bounding, EM should not
be used, since proper nonlinear optimization codes are almost always faster.
An infamous example is maximum likelihood learning of conditional random

fields, or log-linear undirected models in general. Most of the early work used
variants of EM, such as iterative proportional fitting. In these algorithms, we
obtain closed form M step updates only after additional crude bounding. These
approaches have vanished entirely today, as modern optimizers such as scaled
conjugate gradients, limited memory quasi-Newton or truncated Newton run
several orders of magnitude faster. Another poor idea is to use EM in order
to compute principal components analysis (PCA) directions (Section 11.1): the
Lanczos algorithm is much faster in practice (Section 11.1.4), and good code is
publicly available.
A final comment concerns the E step computations, which are required in order
to compute ∇θ L(θ) just as well. For the Gaussian mixture models with a mod-
erate number K of components, the E step posteriors are cheap to compute. But
in general, this computation can be hard to do if h is large and has dependent
components. If you find yourself in such a situation, you need to consider ap-
proximate posterior computations. It is possible to extend the EM algorithm to
allow for such. The resulting variational EM algorithm will still be convergent,
but it does not in general maximize the exact log likelihood anymore.
Appendix A

Lagrange Multipliers and Lagrangian Duality

We have derived the dual formulation of the soft margin SVM problem in
Section 9.3. In this section, we provide a much more general picture on the
underlying principle of Lagrange duality, which plays a central role in modern
machine learning, far beyond support vector machines. We will introduce the
framework in two stages, first seeking to generalize the first-order stationary
condition to constrained problems by way of a Lagrangian function, then ex-
posing the duality embedded in this function, leading to a dual optimization
problem. Finally, we will rederive the soft margin SVM dual from this more
general perspective. This section is slightly more advanced and can be skipped
in a first reading.
Lagrange duality is a powerful framework which helps in solving constrained op-
timization problems with continuously differentiable objective. We will restrict
ourselves to linear constraints, as this is all we need in this course, but the frame-
work is more generally applicable. There are two aspects to the Lagrangian tech-
nique. The first is a generalization of the famous first order optimality condition
∇x f (x) = 0 to the constrained case, by introducing new variables (multipliers)
and the Lagrangian function. This is in fact what Lagrange did. It often helps to
fence in optimal solutions by exploring stationary points of the Lagrangian, but
does not provide a method to find them. The second aspect is Lagrange duality.
The nontrivial stationary points of the Lagrangian are saddlepoints, which can
be narrowed down naturally via two optimization problems (primal and dual).
At least for convex optimization problems, Lagrange duality directly leads to
algorithms for finding optimal solutions. Modern optimization textbooks often
start with the duality and present the optimality condition in passing, but we
will follow Lagrange and develop the Lagrangian from first principles.
The optimization problem we will be interested in here is

    p̃∗ = \min_x f(x), \quad subj. to g_j(x) = 0 ∀j, h_k(x) ≤ 0 ∀k,     (A.1)

where gj (x) = (aj )T x − bj , hk (x) = (ck )T x − ek , and x ∈ Rp . It will be called


the primal problem below. Each gj (x) = 0 is called linear equality constraint,


each hk (x) ≤ 0 is a linear inequality constraint. The set of those x which fulfil
all constraints is called the feasible set, and its members x are called feasible.
For an unconstrained problem, all x ∈ Rp are feasible. The feasible set in our
case is a convex polytope (Section 9.1.1). The number p̃∗ is called the value of
the primal problem. We will assume that the feasible set is not empty, and that
p̃∗ ∈ R, in particular p̃∗ > −∞.

Optimality Conditions

Consider an unconstrained optimization problem minx f (x), where f (x) is con-


tinuously differentiable. If x∗ is a local minimum point, then ∇x f = 0 (Sec-
tion 2.4.1). We have used this first order1 necessary condition many times in
this course already. But suppose we add linear constraints on x. Then, this
optimality condition does not work anymore. For example, x ∈ R2 ,

    \min_x \{f(x) = x_1\}, \quad subj. to x_1 = 1

has optimal solutions x∗ = [1, α]^T, α ∈ R, but ∇_{x∗} f = δ_1 ≠ 0. In order to


derive the Lagrangian generalization of ∇x f = 0, we proceed as in Section 2.4.1.
Imagine you are a mountaineer who needs to get down, as it gets dark. But there
are straight paths you cannot stray from (equality constraints), or fences and
rivers you cannot cross (inequality constraints). Are there directions along which
you can descend without violating any constraints? Let us start with a single
equality constraint:

    \min_x f(x), \quad subj. to g(x) = a^T x − b = 0.

Suppose we are at the feasible point x, so that g(x) = 0. Let us first determine
along which directions d we may move at all, without violating the constraint:

g(x + εd) = aT (x + εd) − b = g(x) + εaT d = εaT d.

A legal direction d to move along must be orthogonal to a: aT d = 0. Since


∇x g = a, we must have (∇x g)T d = 0. Next, we use the Taylor expansion of f
at x:
f (x + εd) = f (x) + ε(∇x f )T d + O(ε2 ).
If there is some direction d such that (∇x g)T d = 0 and (∇x f )T d < 0, we can
descend in f (x) while staying feasible. A useful optimality condition must imply

(∇x g)T d = 0 ⇒ (∇x f )T d ≥ 0.

Now, if (∇x g)T d = 0 and (∇x f )T d > 0, we can descend along −d, so we need

(∇x g)T d = 0 ⇒ (∇x f )T d = 0.

This condition is fulfilled only if ∇x f is parallel to ∇x g or zero:

∇x f = −λ∇x g, ∇x f + λ∇x g = 0, λ ∈ R.
1 First order in this context means that only the gradient (first derivatives) are used. In

contrast, second order conditions look at the Hessian as well (Section 3.4).

Let us see whether this works for our example above. f (x) = x1 , a = δ 1 ,
b = 1, so that ∇x f = δ 1 = ∇x g, and the condition holds for λ = −1. It is
not very useful in this example, since it holds for any x ∈ R2 , but it certainly
is a necessary condition for optimality. We illustrate Lagrange’s condition in
Figure A.1, top.


Figure A.1: To understand Lagrange’s first-order conditions, we can visualize


the set of descent directions A = {d | dT (∇f ) < 0} (open halfspace; blue) and
the set B of feasible directions (red). The conditions ensure that the intersection
A ∩ B is empty: there is no feasible descent direction.
Top: For a linear equality constraint g(x) = 0, B is a line with normal vector
∇g. A∩B = ∅ if ∇f and ∇g are parallel. Bottom: For an active linear inequality
constraint h(x) ≤ 0, B is a closed halfspace. In this case, it is not enough for
∇f and ∇h to be parallel, they must also point in opposite directions.

Next, consider a single inequality constraint:

    \min_x f(x), \quad subj. to h(x) = c^T x − e ≤ 0.

Two things can happen now at x. Either, h(x) < 0, so that h(x + εd) < 0
for small ε, no matter what d. The constraint is inactive, and we can simply
pretend it is not there. The optimality condition is ∇x f = 0 then. Or, h(x) = 0:
the constraint is active. Then,

h(x + εd) = h(x) + εcT d = εcT d,



which is nonpositive if and only if cT d ≤ 0. For an active constraint, we must


have
(∇x h)T d ≤ 0 ⇒ (∇x f )T d ≥ 0.
This implies that ∇x f = −α∇x h, or ∇x f + α∇x h = 0, where α ≥ 0. The
condition is visualized in Figure A.1, bottom. It works for inactive constraints
as well if α = 0. A succinct way to write the optimality condition is

∇x f + α∇x h = 0, α ≥ 0, αh(x) = 0.

Our primal problem (A.1) has several equality and inequality constraints. What
do we do then? Our conditions all have the same form, so why not add them
up and divide by the number of constraints:
    ∇_x f + \sum_j λ_j ∇_x g_j + \sum_k α_k ∇_x h_k = 0, \quad α_k ≥ 0, \; α_k h_k(x) = 0 \; ∀k.

These conditions read as follows. If x is a local minimum point of the constrained


problem, then there are some values for λ, α such that (x, λ, α) fulfil the
conditions. But did we not lose something by just adding the single conditions?
Let us check. Suppose our optimality conditions hold for (x, λ, α). Sort the
inequality constraints into inactive ones (hk (x) < 0, αk = 0) and active ones
(hk (x) = 0), and let ka run over the latter only. Let d 6= 0 be any feasible
direction, meaning that dT (∇x gj ) = 0 for all j and dT (∇x hka ) ≤ 0 for all
active ka . Then,
    0 = d^T \Big( ∇_x f + \sum_j λ_j ∇_x g_j + \sum_k α_k ∇_x h_k \Big)
      = d^T \Big( ∇_x f + \sum_j λ_j ∇_x g_j + \sum_{k_a} α_{k_a} ∇_x h_{k_a} \Big)
      = d^T (∇_x f) + \sum_{k_a} \underbrace{α_{k_a}}_{≥ 0} \underbrace{d^T (∇_x h_{k_a})}_{≤ 0}
      \;⇒\; d^T (∇_x f) ≥ 0.

It works! If (x, λ, α) fulfils our conditions and d is a feasible direction, moving


along d does not lead to descent at least to first order. We can write the condition
in a nicer way by pulling ∇x outside. Let us introduce the Lagrangian
    L(x, λ, α) = f(x) + \sum_j λ_j g_j(x) + \sum_k α_k h_k(x), \quad α ⪰ 0.

Here, α ⪰ 0 is short for αk ≥ 0 for all k. The Lagrangian is a function not only
of x, but also of λ, α. These additional variables are called Lagrange multipliers.
The complete optimality conditions for (x, λ, α) read
    \frac{∂L}{∂x} = 0, \quad \frac{∂L}{∂λ_j} = 0 \; ∀j,
    \quad α_k ≥ 0, \; \frac{∂L}{∂α_k} ≤ 0, \; α_k \frac{∂L}{∂α_k} = 0 \; ∀k.     (A.2)
Here, we used that ∂L/∂λj = gj (x) and ∂L/∂αk = hk (x). Note how all con-
ditions are expressed in terms of derivatives of the Lagrangian. A worthy re-
placement for ∇x f = 0 indeed. A brief way of describing (A.2) is that (x, λ, α)
constitutes a stationary point of the Lagrangian.

Lagrange Duality

We found that in order to lift the optimality condition ∇x f = 0 to the con-


strained case, we require additional Lagrange multipliers λ, α as well as an
extended criterion L(x, λ, α). Can this Lagrangian do more for us? Recall the
primal problem (A.1). We can express it in terms of the Lagrangian:

    p̃∗ = \min_x \max_{λ, α ⪰ 0} L(x, λ, α).     (A.3)

Note that the minimization over x is unconstrained, much in contrast to the


constrained minimization in (A.1). Let us do the inner maximization for a given x.
Either x is feasible or not. In the former case, gj (x) = 0 for all j, and we can
pick any λ. Also, hk (x) ≤ 0, so that αk hk (x) ≤ 0, since αk ≥ 0. The best we
can do is to set αk = 0. For a feasible x, the inner maximization gives f (x).
Now, suppose x is not feasible. Then, at least one constraint must be violated.
We show that the inner maximum2 is +∞ in this case. If gj (x) ≠ 0, we can
send λ_j → sgn(g_j(x)) · ∞. If h_k(x) > 0, we can send α_k → ∞. Therefore,

    \max_{λ, α ⪰ 0} L(x, λ, α) = \begin{cases} f(x) & x \text{ feasible} \\ +∞ & x \text{ not feasible} \end{cases},

whose minimum over x is precisely the primal problem (A.1). What does that
mean? The Lagrangian may have stationary points of many kinds, but as our
goal is to solve the primal problem, we should be only interested in its saddle-
points of the type (A.3).
These are not the only saddlepoints, we might just as well look at

    d̃∗ = \max_{λ, α ⪰ 0} \min_x L(x, λ, α).     (A.4)

Here, the unconstrained minimization over x is inside, the maximization over


Lagrange multipliers is outside. This is called the dual problem. How does it
relate to the primal problem (A.3)? First of all, we always have

d˜∗ ≤ p̃∗ .

Namely, for each fixed λ, α ⪰ 0:

    \min_x L(x, λ, α) ≤ \min_x \max_{λ', α' ⪰ 0} L(x, λ', α').

The dual value d˜∗ bounds the primal value p̃∗ from below, a fact which is called
weak duality. Why is this useful? It turns out that in many situations, the dual
problem is easier to solve than the primal problem. For example, it is easily
confirmed that (A.4) is a concave3 maximization problem, no matter what the
primal problem is. Even for very hard primal problems, the dual (A.4) can
often be solved easily and provides a lower bound d˜∗ on the primal value p̃∗ .
This technique is used frequently in theoretical computer science. Moreover,
2 We explicitly allow minima to be −∞ and maxima to be +∞, with the understanding

that −∞ < v < +∞ for all v ∈ R.


3 Namely, Φ_D(λ, α) = \min_x L(x, λ, α) is a minimum of affine linear functions, therefore
concave.

there may be far fewer dual variables (λ, α) than primal ones (x), which is
what happens for the soft margin SVM problem (Section 9.3).
Let us look at an example, illustrated in Figure A.2. The primal problem is

    \min_x \tfrac{1}{2} x^2, \quad subj. to x ≤ −1.

The primal value is p̃∗ = 1/2, attained at x∗ = −1. The Lagrangian is

    L(x, α) = \tfrac{1}{2} x^2 + α(x + 1), \quad α ≥ 0.

For any α ≥ 0:

    \frac{∂L}{∂x} = x + α,
so that x∗ (α) = −α minimizes L(x, α). The dual function is

    g(α) = L(x∗(α), α) = α − \tfrac{1}{2} α^2.

Its maximum is d̃∗ = 1/2, attained at α∗ = 1, and x∗(α∗) = −1 is the solution


to the primal problem.
We are almost there now, but need one more step. It is fine to know that the
dual may be easier to solve, but after all we want to solve the primal. Now,
under additional conditions on the optimization problem (A.1), we can ensure
that
d˜∗ = p̃∗ ,
meaning that primal and dual are simply two different formulations of the same
problem. This property is called strong duality. The condition we need for (A.1)
is that f (x) is a convex function. In short, for any primal problem (A.1) with
linear constraints and convex objective, strong duality holds. The primal and
dual values are the same, and their respective optimal points agree:

    \max_{λ, α ⪰ 0} L(x∗, λ, α) = L(x∗, λ∗, α∗) = \min_x L(x, λ∗, α∗).

A proof of this statement can be found in [2, ch. 3.4]. The practical meaning
of this exercise is as follows. If we know that strong duality holds, we can solve
our primal problem as follows. First, we solve the dual problem, giving rise to
λ∗, α∗ ⪰ 0, and the value d̃∗. Then, x∗ = argmin_x L(x, λ∗, α∗) is an optimal
point for the primal problem, whose value is p̃∗ = d˜∗ . Even better, some modern
primal-dual optimizers use primal and dual problems in an interleaved fashion,
the gap between their current values is used to monitor progress.
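
The recipe “maximize the dual, then recover the primal point as the minimizer of L(x, λ∗, α∗)” can be tried directly on the toy problem from the example above, min x²/2 subject to x ≤ −1. A small sketch; the use of scipy's bounded scalar minimizer is just one convenient choice:

    from scipy.optimize import minimize_scalar

    # Toy problem: min 1/2 x^2  subject to  x <= -1.
    # Lagrangian L(x, alpha) = 1/2 x^2 + alpha (x + 1); minimizing over x gives
    # x*(alpha) = -alpha and the dual function g(alpha) = alpha - 1/2 alpha^2.

    def neg_dual(alpha):
        return -(alpha - 0.5 * alpha ** 2)

    res = minimize_scalar(neg_dual, bounds=(0.0, 10.0), method="bounded")
    alpha_star = res.x                  # ~ 1.0, the dual maximizer
    x_star = -alpha_star                # recover the primal point argmin_x L(x, alpha*)
    d_star = -res.fun                   # dual value ~ 0.5
    p_star = 0.5 * x_star ** 2          # primal value ~ 0.5; strong duality: p* = d*
    print(alpha_star, x_star, d_star, p_star)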

A.1 Soft Margin SVM Revisited


In this section, we illustrate Lagrange duality by rederiving the soft margin
SVM dual problem previously obtained in Section 9.3 by a specific route. Recall


Figure A.2: Example for primal and dual function for a simple QP. The primal
criterion is f (x) = x2 /2, subject to x ≤ −1. Shown are Lagrangian curves
L(x, α) for various values of α (dashed), as well as their minimum points x∗ (α)
(circles). Note that L(x, 0) coincides with f(x). The largest minimum value
g(α) = L(x∗(α), α) is attained at α∗ = 1.
Since x∗(α) = −α, we can plot the “inversion” of the dual g(α), namely α ↦
g(−α), in the same figure. Note how g(α) ≤ f (x) in respective feasible regions
x ≤ −1 and α ≥ 0, and that they coincide at the saddle point x∗ = −1, α∗ = 1.

the notation from there. The Lagrangian is


    L = \tfrac{1}{2} ‖w‖^2 + C \sum_j ξ_j + \sum_i α_i (1 − t_i y_i − ξ_i) − \sum_j ν_j ξ_j
      = \tfrac{1}{2} ‖w‖^2 + α^T (1 − T y) + ξ^T (C 1 − α − ν).
Make sure to understand the vectorization before moving on. The primal vari-
ables are (w, b, ξ), the dual variables are α ⪰ 0 and ν ⪰ 0, both in R^n. The
criterion of the dual problem is

    Φ_D(α, ν) = \min_{w, b, ξ} L.

The inner minimization w.r.t. w and b works in the same way as in Section 9.3.
Setting the gradient equal to zero results in
    w = \sum_{i=1}^n α_i t_i φ(x_i), \quad α^T t = \sum_{i=1}^n α_i t_i = 0.

Finally,
∇ξ L = C1 − α − ν = 0 ⇒ ν = C1 − α.

This allows us to eliminate the dual variables ν. But be careful: ν ⪰ 0 implies
the additional constraints α ⪯ C1. Plugging all of this in, we obtain the dual
problem (9.7), as well as the kernel expansion (9.8).
problem (9.7), as well as the kernel expansion (9.8).
Next, we can use the Lagrange optimality conditions (also sometimes called
Karush-Kuhn-Tucker conditions) in order to reproduce the classification of pat-
terns (xi , ti ) derived in Section 9.3. Suppose we are at a saddlepoint, dropping
the “*” subscripts. The optimality conditions are αi ∈ [0, C], νi ≥ 0, ξi ≥ 0,
and
(1 − ti yi − ξi )αi = 0, νi ξi = 0 = (C − αi )ξi .
Three different things can happen:

• αi = 0, so that ξi = 0 (no slack) and 1 − ti yi ≤ 0. Not a support vector.


• αi ∈ (0, C), so that ξi = 0 and 1 − ti yi = 0. An essential support vector.

• αi = C, so that ξi ≥ 0 and 1 − ti yi − ξi = 0, which implies 1 − ti yi ≥ 0. A


bound support vector.
Bibliography

[1] P. Bartlett and A. Tewari. Sparseness vs estimating conditional probabil-


ities: Some asymptotic results. In Conference on Computational Learning
Theory 17, pages 564–578. Springer, 2004.
[2] D. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
[3] P. Billingsley. Probability and Measure. John Wiley & Sons, 3rd edition,
1995.
[4] C. Bishop. Neural Networks for Pattern Recognition. Oxford University
Press, 1st edition, 1995.
[5] C. Bishop. Pattern Recognition and Machine Learning. Springer, 1st edi-
tion, 2006.
[6] L. Bottou. Online learning and stochastic approximations. In D. Saad,
editor, On-Line Learning in Neural Networks. Cambridge University Press,
1998.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University
Press, 2004.
[8] K. L. Chung. A Course in Probability Theory. Academic Press, 2nd edition,
1974.
[9] C. Cortes and V. Vapnik. Support vector networks. Machine Learning,
20:273–297, 1995.
[10] Thomas Cover and Joy Thomas. Elements of Information Theory. John
Wiley & Sons, 1st edition, 1991.
[11] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern
Recognition. Applications of Mathematics: Stochastic Modelling and Ap-
plied Probability. Springer, 1st edition, 1996.
[12] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley &
Sons, 2nd edition, 2000.
[13] William H. Feller. An Introduction to Probability Theory and its Applica-
tions, volume 1. John Wiley & Sons, 3rd edition, 1968.
[14] William H. Feller. An Introduction to Probability Theory and its Applica-
tions, volume 2. John Wiley & Sons, 2nd edition, 1971.


[15] Roger Fletcher. Practical Methods of Optimization: Unconstrained Opti-


mization, volume 1. John Wiley & Sons, 1980.

[16] P. Gill, W. Murray, and M. Wright. Practical Optimization. Academic


Press, 1981.

[17] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins Univer-
sity Press, 3rd edition, 1996.

[18] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford


University Press, 3rd edition, 2001.

[19] C. Grinstead and J. Snell. Introduction to Probability. American Mathe-


matical Society, 2nd edition, 1997.

[20] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical


Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition,
2009.

[21] David Haussler. Convolution kernels on discrete structures. Technical Re-


port UCSC-CRL-99-10, University of California, Santa Cruz, July 1999.
See http://www.cse.ucsc.edu/~haussler/pubs.html.

[22] J. Hertz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural


Computation. Addison-Wesley, 1991.

[23] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press,


1st edition, 1985.

[24] Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy dis-
crimination. In Solla et al. [40], pages 470–476.

[25] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and


Techniques. MIT Press, 1st edition, 2009.

[26] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process


methods: The informative vector machine. In S. Becker, S. Thrun, and
K. Obermayer, editors, Advances in Neural Information Processing Systems
15, pages 609–616. MIT Press, 2003.

[27] D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 2nd


edition, 1984.

[28] D. MacKay. Information Theory, Inference, and Learning Algorithms.


Cambridge University Press, 2003.

[29] I. Nabney. Netlab: Algorithms for Pattern Recognition. Advances in Pattern


Recognition. Springer, 1st edition, 2001.

[30] C. Paige and M. Saunders. LSQR: An algorithm for sparse linear equations
and sparse least squares. ACM Transactions on Mathematical Software,
8(1):43–71, 1982.

[31] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann,


1988.

[32] J. Platt. Fast training of support vector machines using sequential minimal
optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances
in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press,
1998.
[33] J. Platt. Probabilistic outputs for support vector machines and comparisons
to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf,
and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT
Press, 1999.

[34] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine


Learning. MIT Press, 2006.
[35] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge Uni-
versity Press, 1996.

[36] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 1st edition,
2002.
[37] M. Seeger. Bayesian model selection for support vector machines, Gaussian
processes and other kernel classifiers. In Solla et al. [40], pages 603–609.
[38] M. Seeger. Gaussian processes for machine learning. International Journal
of Neural Systems, 14(2):69–106, 2004.
[39] G. Simmons. Calculus with Analytic Geometry. McGraw-Hill, 2nd edition,
1996.
[40] S. Solla, T. Leen, and K.-R. Müller, editors. Advances in Neural Informa-
tion Processing Systems 12. MIT Press, 2000.
[41] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 1st
edition, 2008.
[42] G. Strang. Introduction to Linear Algebra. Wellesley – Cambridge Press,
4th edition, 2009.

[43] M. Tipping and C. Bishop. Probabilistic principal component analysis.


Journal of Roy. Stat. Soc. B, 61(3):611–622, 1999.
[44] C. Williams. Computation with infinite neural networks. Neural Compu-
tation, 10(5):1203–1216, 1998.

[45] Christopher K. I. Williams. Prediction with Gaussian processes: From


linear regression to linear prediction and beyond. In M. I. Jordan, editor,
Learning in Graphical Models. Kluwer, 1997.
