COMPILED BY RANJAY KRISHNA

COMPUTER VISION:
FOUNDATIONS AND
APPLICATIONS

STANFORD UNIVERSITY
Copyright © 2017 Compiled by Ranjay Krishna

published by stanford university

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in com-
pliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/
LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an “as is” basis, without warranties or conditions of any kind, either
express or implied. See the License for the specific language governing permissions and limitations under
the License.

First printing, December 2017


Contents

Introduction to Computer Vision 17

Color 25

Linear Algebra Primer 31

Pixels and Filters 45

Edge Detection 55

Features and Fitting 71

Feature Descriptors 81

Image Resizing 89

Semantic Segmentation 101

Clustering 107

Object recognition 121



Dimensionality Reduction 133

Face Identification 139

Visual Bag of Words 151

Object Detection from Deformable Parts 161

Semantic Hierarchies and Fine Grained Recognition 181

Motion 189

Tracking 201

Deep Learning 209

Bibliography 211
List of Figures

1 Computer vision at the intersection of multiple scientific fields 17


2 Computer vision at the intersection of multiple scientific fields 20

3 Mixing two lights produces colors that lie along a straight line in color
space. Mixing three lights produces colors that lie within the trian-
gle they define in color space. 27
4 Representation of RBG primaries and corresponding matching func-
tions. The matching functions are the amounts of primaries needed
to match the monochromatic test color at the wavelength shown on
the horizontal scale. Source: https://en.wikipedia.org/wiki/CIE_
1931_color_space 28
5 Source: https://en.wikipedia.org/wiki/CIE_1931_color_space 28
6 General source: https://en.wikipedia.org/wiki/HSL_and_HSV 29
7 Example of two photos, one unbalanced, and one with incorrect white
balancing. Source: http://www.cambridgeincolour.com/tutorials/
white-balance.htme 30

8 Illustrations of different pixel densities. Taken from the accompany-


ing lecture slides. (Slide 14, slide credit Ulas Bagci) 46
9 The image is sampled at two vertical positions, sampling a patch of
sky and sampling a patch of grass. The corresponding histograms
are shown to the right. Adapted from the accompanying lecture slide
(Slide 23, slide credit Dr. Mubarak Shah) 46
10 Graphical representation of a system’s mapping of f to g 48
11 An example decomposition of a signal into an impulse function. (Adapted
from http://www.songho.ca/dsp/convolution/convolution.html) 51
12 An impulse function is sent through a system to create an impulse
response function. Adapted from http://www.songho.ca/dsp/convolution/convolution.html 52

13 Hubel and Wiesel's experiment. 55


14 Examples of Biederman's outlines. 56
15 The visualization of the images and outcomes. 57
16 An image with intensity function and first derivative. Source: Lec-
ture 5, slide 66 58

17 The gradient vector directions. Source: Lecture 5, slide 67 59


18 The gradients as applied to the image of a tiger. Source: Lecture 5,
slide 68 59
19 The derivative of an edge in a noisy image. Source: Steve Seitz; Lec-
ture 5, slide 70 60
20 Smoothing with different filters and filter sizes. Source: Steve Seitz;
Lecture 5, slide 75 60
21 Sample problems of bad edge detectors. Source: Lecture 5, slide 84 61

22 The corner of a curve appears at two scales. Note that the circular
window on the right curve captures the entire corner, while the same-
sized window on the left curve does not. Instead, we must choose
a much larger circular window on the left curve to get the same in-
formation. Source: Lecture 7, slide 12. 83
23 Two plots of the response of f (window) as a function of window size
for Images 1 and 2, where Image 2 is similar to Image 1 but scaled
by 1/2. Source: Lecture 7, slide 15. 83
24 On the left: pyramid of Gaussians of different σ’s and different im-
age sizes. On the right: difference of adjacent Gaussians. Source: http://aishack.in/tutorials/sift-
scale-invariant-feature-transform-log-approximation/ 84
25 Given a coordinate in x-y-scale space (denoted by the black X), ex-
amine its 26 neighbors (denoted by the green circles) to determine
if the original coordinate is a local extrema. Source: Lecture 7, slide
22 84
26 This figure shows the percentage of correctly matched keypoints as
a function of the width of the descriptor and of the number of his-
togram bins. [1] 86
27 Here we see a visual example of keeping track of the magnitudes of
the gradients for each gradient direction. Source: Lecture 7, Slide 60. 87
28 HoG applied to a bicycle. Source: Lecture 7, Slide 65. 87

29 The effects of different pixel removal methods on the image quality. 91


30 The red line indicates the location of the optimal seam with the least
energy. Source: Lecture 7-22 92
31 An example of energy function used in seam carving algorithm. Source:
Lecture 7-24 92
32 The process of using relation recursion for computing the seam cost.
Source: Lecture 7-(24-27) 93
33 Using backtrack to find the optimal seam. Source: Lecture 7-(28-31) 93

34 Importance Measurement by function method. Source: Lecture 7-11. 101


35 Importance Measurement by more sophisticated methods such as
eye tracking. 102
36 Optical illusions regarding the problem of image segmentation. 102

37 Output segmentation. Source: lecture 12, slide 4 107


38 Examples of gestalt factors. Source: Forsyth & Ponce 108
39 Source: Forsyth & Ponce 108
40 Source: Forsyth & Ponce 109
41 Source: Forsyth & Ponce 109
42 Agglomerative clustering on sample input, and resulting dendrogram 112
43 Image segmentation example using single link measurement of near-
est clusters. Source: lecture 12, slide 46 113
44 Image segmentation example using complete link measurement of
nearest clusters. Source: lecture 12, slide 47 114
45 Image segmentation example using average link measurement of near-
est clusters. Source: lecture 12, slide 48 114
46 Image segmentation example using k-means. The picture on the top
left has three distinct colors, but the bottom picture has Gaussian noise.
Source: lecture 11, slide 8, slide credit: Kristen Grauman 115
47 Clustering for summarization. Source: lecture 11, slide 11 116
48 Visualization of k-means clustering. Source: lecture 11, slide 15 116
49 Image of a panda. Source: lecture 11, slide 19 117
50 Results of mean-shift clustering. Source: Y. Ukrainitz and B. Sarel 119

51 Right: Original Image, Left: Compressed Image 137


52 Relative Error as Function of PCA dimensions 138

53 The region occupied by images of faces is a small subspace of the to-


tal space of images. Source: Lecture 13, slide 14 140
54 Faces and Eigenfaces. Source: Lecture 13, slide 29 141
55 The reconstructed face after projection. Source: Lecture 13, slide 25 141
56 Eigenvalues sorted in descending order of magnitude. Source: Lec-
ture 13, slide 26 143
57 Reconstructed faces with varying number of eigenfaces. Source: Lec-
ture 13, slide 27 143
58 Eigenfaces expressing happiness. Source: Lecture 13, slide 33 144
59 Eigenfaces expressing disgust. Source: Lecture 13, slide 34 144
60 PCA vs. LDA. Source: Lecture 13, slide 41 145
61 Between Class Scatter vs. Within Class Scatter. Source: Lecture 13,
slide 43 145
62 Variation in Facial Expression, Eyewear, and Lighting. 148
63 Eigenface vs. Fisherface. Source: Lecture 13, slide 61 149

64 An example of an object detection algorithm detecting the categories


of a person, dog, and a chair 161
65 An example of a labeled ILSVR test image. 163
66 Object segmentation used for the COCO challenge. 163
67 Yellow boxes represent ground truth while green boxes are predic-
tions. 164

68 Example classifications using an overlap threshold of 0.5. (a) True


positive, because ground truth (yellow box) and prediction (green
box) overlap is more than 0.5. (b) False positive, since the prediction
boxes (green) do not overlap with any ground truth boxes. (c) False
negative, since the ground truth box (yellow) is not detected by the
model. 164
69 Summary chart of classifications, with green being counts you want
to maximize and red being counts you want to minimize. 165
70 Predictions are green boxes while ground truth is yellow. All the ground
truths are correctly predicted, making recall perfect. However, pre-
cision is low, since there are many false positives. 165
71 Faster-RCNN model is the best of the three models since it has the
most area under the curve. 166
72 Consider the problem of detecting people in an image. (a) - (c) slid-
ing window across image and at each position classifying window
as not containing a person. (b) window over person and classifying
window as containing a person. Image source: Flickr user neilalder-
ney123 167
73 An image of a person along with its respective HOG descriptor. Note
that the last two images are the HOG descriptor weighted by the pos-
itive and negative weights of the classifier using them. The outline
of the person is very visible in the weighted descriptors. Image source:
Dalal and Triggs. 168
74 The average face image above is created by averaging 31 aligned face
images of the same size. The HOG descriptor of this average face im-
age can then be used as a template for detecting faces in images. 168
75 Consider, again, the problem of detecting people, except this time
our sliding window is much smaller. (a) The template and sliding
window are still large enough to detect the smaller, distant person.
(b) The person in the foreground is a lot bigger than our window size
and is not being detected. 169
76 Using a feature pyramid of different image resizings allows the ob-
ject template to match with objects that might have originally been
bigger or much smaller than the the template. Image source: Lecture
15, Slide 40 169
77 An illustration of Fischler and Elschlager’s deformable face model 170
78 On the left is a visualization of the star model, with x1 as the root,
and on the right is an example of person detection with a star model
for deformations 171
79 A global HOG filter for a person and a more detailed HOG filter for
the head 171
80 Deformable Parts model of a car with multiple orientations 172
81 Deformable Parts model of a bike with original image shown 172
82 Illustration of HOG pyramid for deformable parts detector 173

83 Illustration of full DPM Pipeline 175


84 Step 2 of the detection pipeline 176
85 Step 3 of the detection pipeline 177
86 Response scores 178
87 Typical errors for the DPM model 178
88 DPM extension: fully-connected shape model 179

89 Global Internet Traffic. Source: Lecture 16 183


90 Example Image Recognition Problem. Source: Lecture 16 184
91 ImageNet Confusion Matrix. Source: Lecture 16 184
92 Example Semantic Hierarchy. Source: Lecture 16 185
93 Bubble Method. Source: Lecture 16 186
94 Bubble Heatmaps. Source: Lecture 16 187

95 In the aperture problem, the line appears to have moved to the right
when only in the context of the frame, but the true motion of the line
was down and to the right. The aperture problem is a result of op-
tical flow being unable to represent motion along an edge–an issue
that can lead to other errors in motion estimation as well. 190
96 Conditions for a solvable matrix A T A may be interpreted as differ-
ent edge regions depending on the relation between λ1 and λ2 . Cor-
ner regions produce more optimal conditions. 192
97 Example of regions with large λ1 and small λ2 (left), small λ1 and
small λ2 (center, low texture region), large λ1 and large λ2 (right, high
texture region) 193

98 Courtesy of Jean-Yves Bouguet, Vision Lab, California Institute
of Technology 203
99 Frames from fish-tracking video, courtesy of Kanade 203
100 Frames from man-tracking video, courtesy of Kanade 204
101 Types of 2D transformations. 204
102 Translation 205
List of Tables

Dedicated to those who wish to learn everyday.


Introduction

This document contains topics that were taught in CS131 Computer


Vision: Foundations and Applications. All the chapters are a work in
progress and will continue to evolve and change.
Each chapter was originally written by a group of students who
were taking the class. The aim was to crowdsource the generation
of comprehensive notes for the class that the entire class can use to
learn. By open sourcing the notes, we hope to make this resource
available for all those interested in computer vision.
Below is the list of authors who contributed to each chapter:
Introduction to Computer Vision: Olivier Moindrot
Color: John McNelly, Alexander Haigh, Madeline Saviano, Scott
Kazmierowicz, Cameron Van de Graaf
Linear Algebra Primer: Liangcheng Tao, Vivian Hoang-Dung
Nguyen, Roma Dziembaj, Sona Allahverdiyeva
Pixels and Filters: Brian Hicks, Alec Arshavsky, Sam Trautwein,
Christine Phan, James Ortiz
Edge Detection: Stephen Konz, Shivaal Roy, Charlotte Munger,
Christina Ramsey, Alex Iyabor Jr
Features and Fitting: Winston Wang, Antonio Tan-Torres, Hesam
Hamledari
Feature Descriptors: Trevor Danielson, Wesley Olmsted, Kelsey
Wang, Ben Barnett
Image Resizing: Harrison Caruthers, Diego Celis, Claire Huang,
Curtis Ogren, Junwon Park
Semantic Segmentation: Mason Swofford, Rachel Gardner, Yue
Zhang, Shawn Fenerin
Clustering: Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie
Nehoran, Maneekwan Toyungyernsub
Object recognition: Darrith Bin Phan, Zahra Abdullah, Kevin Cul-
berg, Caitlin Go
Dimensionality Reduction: Kyu seo Ahn, Jason Lin, Mandy Lu,
Liam Neath, Jintian Liang
Face Identification: JR Cabansag, Yuxing Chen, Jonathan Griffin,
Dunchadhn Lyons, George Preudhomme

Visual Bag of Words: Megha Srivastava, Jessica Taylor, Shubhang


Desai, Samuel Premutico, Zhefan Wang
Object Detection from Deformable Parts: David R. Morales, Austin
O. Narcomey, Minh-An Quinn, Guilherme Reis, Omar Solis
Semantic Hierarchies and Fine Grained Recognition: Tyler Dammann,
Patrick Mogan, Qiqi Ren, Cody Stocker, Evin Yang
Motion: Kevin Chavez, Ben Cohen-Wang, Garrick Fernandez, Noah
Jackson, Will Lauer
Tracking: Khaled Jedoui, Jamie Xie, Michelle Guo, Nikhil Cheerla
Introduction to Computer Vision

What is computer vision?

Definition
Two definitions of computer vision Computer vision can be defined as
a scientific field that extracts information out of digital images. The type of
information gained from an image can range from identification to space
measurements for navigation to augmented reality applications.
Another way to define computer vision is through its applications.
Computer vision is building algorithms that can understand the content
of images and use it for other applications. We will see the different
domains where computer vision is applied in more detail in the last
section.

A bit of history The origins of computer vision go back to an MIT
undergraduate summer project in 1966 [Seymour A. Papert. The Summer
Vision Project. 1966]. It was believed at the time that computer vision
could be solved in one summer, but we now have a 50-year-old scientific
field which is still far from being solved.

Figure 1: Computer vision at


the intersection of multiple
scientific fields

An interdisciplinary field
Computer vision brings together a large set of disciplines. Neuro-
science can help computer vision by first understanding human
vision, as we will see later on. Computer vision can be seen as a part
of computer science, and algorithm theory or machine learning are
essential for developing computer vision algorithms.
We will show in this class how all the fields in figure 1 are con-
nected, and how computer vision draws inspiration and techniques
from them.

A hard problem
Computer vision has not been solved in 50 years, and is still a very
hard problem. It’s something that we humans do unconsciously but
that is genuinely hard for computers.

Poetry harder than chess The IBM supercomputer Deep Blue defeated
the world chess champion Garry Kasparov for the first time in 1997.
Today we still struggle to create algorithms that output well-formed
sentences, let alone poems. The gap between these two domains
shows that what humans call intelligence is often not a good criterion
for assessing the difficulty of a computer task. Deep Blue won through
brute-force search among millions of possibilities and was not more
intelligent than Kasparov.

Vision harder than 3D modeling Today it is easier to create a 3D model
of an object to millimeter precision than to build an algorithm that
recognizes chairs. Object recognition is still a very difficult problem,
although we are approaching human accuracy.

Why is it so hard? Computer vision is hard because there is a huge
gap between pixels and meaning. What the computer sees in a
200 × 200 RGB image is a set of 120,000 values. The road from these
numbers to meaningful information is very difficult. Arguably, the
human brain's visual cortex solves an equally difficult problem: understand-
ing images that are projected on our retina and converted to neuron
signals. The next section will show how studying the brain can help
computer vision.

Understanding human vision

A first idea to solve computer vision is to understand how human


vision works, and transfer this knowledge to computers.

Definition of vision
Be it a computer or an animal, vision comes down to two compo-
nents.
First, a sensing device captures as much detail from an image
as possible. The eye will capture light coming through the iris and
project it to the retina, where specialized cells will transmit informa-
tion to the brain through neurons. A camera captures images in a
similar way and transmits pixels to the computer. In this respect, cameras
are better than humans as they can see infrared, see farther away, or
see with more precision.
Second, the interpreting device has to process the information
and extract meaning from it. The human brain solves this in multiple
steps in different regions of the brain. Computer vision still lags
behind human performance in this domain.

The human visual system


In 1962, Hubel and Wiesel [David H. Hubel and Torsten N. Wiesel.
Receptive fields, binocular interaction and functional architecture in the
cat's visual cortex. The Journal of Physiology, 160(1):106–154, 1962] tried
to understand the visual system of a cat by recording from neurons while
showing the cat bright lines. They found that some specialized neurons
fired only when the line was in a particular spot on the retina or if it had
a certain orientation. [More information in this blog post.]
Their research led to the beginning of a scientific journey to under-
stand the human visual system, which is still active today.
They were awarded the Nobel Prize in Physiology and Medicine
in 1981 for their work. After the announcement, Dr. Hubel said:

There has been a myth that the brain cannot understand itself. It is
compared to a man trying to lift himself by his own bootstraps. We feel
that is nonsense. The brain can be studied just as the kidney can.

How good is the human visual system?


Speed The human visual system is very efficient. As recognizing
threats and reacting to them quickly was paramount to survival,
evolution has perfected the visual system of mammals over millions of
years.
The speed of the human visual system has been measured [Simon
Thorpe, Denise Fize, and Catherine Marlot. Speed of processing in the
human visual system. Nature, 381(6582):520, 1996] at around 150 ms to
recognize an animal in a natural scene. Figure 2 shows how the brain's
responses to images of animals and non-animals diverge after around
150 ms.

Fooling humans However, this speed is obtained at the price of some
drawbacks. Changing small, irrelevant parts of an image such as a
water reflection or the background can go unnoticed because the human
brain focuses on the important parts of an image [Ronald A. Rensink,
J. Kevin O'Regan, and James J. Clark. On the failure to detect changes
in scenes across brief interruptions. Visual Cognition, 7(1-3):127–145,
2000].

Figure 2: Computer vision at


the intersection of multiple
scientific fields

If the signal is very close to the background, it can be difficult to


detect and segment the relevant part of the image.

Context Humans use context all the time to infer clues about images.
Prior knowledge is one of the most difficult tools to incorporate
into computer vision. Humans use context to know where to focus
on an image and to know what to expect at certain positions. Context
also helps the brain compensate for colors in shadows.
However, context can also be used to fool the human brain.

Lessons from nature


Imitating birds did not lead humans to planes. Plainly copying
nature is not the best way or the most efficient way to learn how
to fly. But studying birds made us understand aerodynamics, and
understanding concepts like lift allowed us to build planes.
The same might be true for intelligence. Even though it is not
possible with today's technology, simulating a full human brain
to create intelligence might still not be the best way to get there.
However, neuroscientists hope to gain insights into what may be the
concepts behind vision, language, and other forms of intelligence.

Extracting information from images

We can divide the information gained from images in computer
vision into two categories: measurements and semantic information.

Vision as a measurement device


Robots navigating in an unknown location need to be able to scan
their surroundings to compute the best path. Using computer vision,
we can measure the space around a robot and create a map of its
environment.
Stereo cameras give depth information, like our two eyes, through
triangulation. Stereo vision is a big field of computer vision and there
is a lot of research seeking to create a precise depth map given stereo
images.
If we increase the number of viewpoints to cover all the sides of
an object, we can create a 3D surface representing the object [Anders
Heyden and Marc Pollefeys. Multiple view geometry. Emerging Topics in
Computer Vision, pages 45–107, 2005]. An even more challenging idea is
to reconstruct the 3D model of a monument from all the results of a
Google image search for this monument [Michael Goesele, Noah Snavely,
Brian Curless, Hugues Hoppe, and Steven M. Seitz. Multi-view stereo for
community photo collections. In ICCV 2007, pages 1–8. IEEE, 2007].
There is also research in grasping, where computer vision can help
understand the 3D geometry of an object so that a robot can grasp it.
Through the camera of the robot, we could recognize and find the
handle of the object and infer its shape, to then enable the robot to
find a good grasping position [Ashutosh Saxena, Justin Driemeyer, and
Andrew Y. Ng. Robotic grasping of novel objects using vision. The
International Journal of Robotics Research, 27(2):157–173, 2008].

A source of semantic information

On top of measurement information, an image contains a very dense
amount of semantic information. We can label objects in an image,
label the whole scene, recognize people, and recognize actions, gestures,
and faces.
Medical images also contain a lot of semantic information. Com-
puter vision can be helpful for a diagnosis based on images of skin
cells for instance, to decide if they are cancerous or not.

Applications of computer vision

Cameras are everywhere and the number of images uploaded to the
internet is growing exponentially. We have images on Instagram,
videos on YouTube, feeds of security cameras, medical and scien-
tific images... Computer vision is essential because we need to sort
through these images and enable computers to understand their
content. Here is a non-exhaustive list of applications of computer
vision.

Special effects Shape and motion capture are new techniques used
in movies like Avatar to animate digital characters by recording the
movements played by a human actor. In order to do that, we have to

find the exact positions of markers on the actor’s face in a 3D space,


and then recreate them on the digital avatar.

3D urban modeling Taking pictures with a drone over a city can be


used to render a 3D model of the city. Computer vision is used to
combine all the photos into a single 3D model.

Scene recognition It is possible to recognize the location where a
photo was taken. For instance, a photo of a landmark can be com-
pared to billions of photos on Google to find the best matches. We can
then identify the best match and deduce the location of the photo.

Face detection Face detection has been used for multiple years in
cameras to take better pictures and focus on the faces. Smile detec-
tion can allow a camera to take pictures automatically when the
subject is smiling. Face recognition is more difficult than face de-
tection, but with the scale of today’s data, companies like Facebook
are able to get very good performance. Finally, we can also use com-
puter vision for biometrics, using unique iris pattern recognition or
fingerprints.

Optical Character Recognition One of the oldest successful applica-


tions of computer vision is to recognize characters and numbers. This
can be used to read zipcodes, or license plates.

Mobile visual search With computer vision, we can do a search on


Google using an image as the query.

Self-driving cars Autonomous driving is one of the hottest applica-


tions of computer vision. Companies like Tesla, Google or General
Motors compete to be the first to build a fully autonomous car.

Automatic checkout Amazon Go is a new kind of store that has no
checkout. With computer vision, algorithms detect exactly which
products you take and charge you as you walk out of the store [see
their video here].

Vision-based interaction Microsoft’s Kinect captures movement in


real time and allows players to interact directly with a game through
moves.

Augmented Reality AR is also a very hot field right now, and multi-
ple companies are competing to provide the best mobile AR platform.
Apple released ARKit in June and there are already impressive applica-
tions [check out the different apps].

Virtual Reality VR is using similar computer vision techniques


as AR. The algorithm needs to know the position of a user, and
the positions of all the objects around. As the user moves around,
everything needs to be updated in a realistic and smooth way.
Color

Physics of Color

What is color?
Color is the result of interaction between physical light in the envi-
ronment and our visual system. It is a psychological property of our
visual experience when we look at objects and lights, not a physical
property of those objects or lights [Stephen E. Palmer. Vision Science:
Photons to Phenomenology. MIT Press, 1999].
Color and light
White light is composed of almost equal energy in all wavelengths of
the visible spectrum.

Electromagnetic Spectrum
Light is made up of waves of different wavelengths. The visible spec-
trum of light ranges from 400 nm to 700 nm, and humans are most
sensitive to light with wavelengths in the middle of this spectrum.
Humans see only visible light because the Sun, due to its temperature,
emits more yellow light than light of any other color.

Visible light
Planck's Law for blackbody radiation estimates the wavelengths
of electromagnetic radiation emitted by a star, based on its surface
temperature. For instance, since the surface of the sun is around
5800 K, the peak of the sun's emitted light lies in the visible region.

The Physics of Light


Any source of light can be completely described physically by its
spectrum (i.e., the amount of energy emitted, per unit time, at each
wavelength in the 400-700 nm range). Surfaces have reflectance spectra:
reflected light is concentrated in a certain part of the visible spectrum.
For example, bananas reflect mostly yellow light, and tomatoes reflect
mostly red light.

Interaction of light and surfaces


Reflected color is the result of the interaction of the light source spec-
trum with the surface reflectance. As a rule, definitions and units are
stated "per unit wavelength", and terms are assumed to be "spec-
tral" (i.e., referring to a spectrum of light, not a single wavelength).
The color signal is quantified as Illumination × Reflectance = Color Signal
[Brian A. Wandell. Foundations of Vision. Sinauer Associates, 1995].

Human Encoding of Color

As mentioned in the previous section, color is not a mere physical


property of light - rather, the phenomenon of color arises via the
interaction between light and the human visual system.

Rods and Cones


When we look at a scene, light first enters our eyes through the
pupil before reaching the retina. The retina is primarily composed
of two types of light-sensitive cells: rods and cones, named for their
appearance under a microscope. Rods are the more numerous of
the two and are highly sensitive, making them ideal for detecting
objects in low-light conditions. However, they do not encode any
color information. Cones, on the other hand, are less numerous and
less sensitive, but they are useful for distinguishing between objects
in high light conditions. They also allow us to perceive colors by a
mechanism discussed below.

Cones and Color


A crucial difference between rods and cones is that the latter come
in three different types, each characterized by a unique response
curve to different wavelengths of light. Each response curve peaks at
a unique wavelength, namely 440, 530, and 560 nm, aligning with the
colors of blue, green, and red. Both cones and rods act as
filters, whose output can be interpreted as the multiplication of
each response curve by the spectrum, integrated over all wavelengths.
While the information encoded by the resulting three numbers is
sufficient for most tasks, some information is lost in the compression
from spectrum to electrical impulse in the retina. This implies that
some subsets of spectra will be erroneously perceived as identical;
such spectra are called metamers.
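
The Python sketch below illustrates this "cones as filters" idea: each cone's response is the spectrum multiplied by that cone's sensitivity curve and summed (integrated) over wavelength. The Gaussian sensitivity curves and the flat test spectrum are illustrative placeholders, not measured data.

import numpy as np

wavelengths = np.arange(400, 701)                  # visible range, in nm

def gaussian(peak, width=40.0):
    # Hypothetical bell-shaped sensitivity curve peaking at `peak` nm.
    return np.exp(-((wavelengths - peak) ** 2) / (2 * width ** 2))

sensitivities = {"S": gaussian(440), "M": gaussian(530), "L": gaussian(560)}
spectrum = np.ones_like(wavelengths, dtype=float)  # a flat "white" test spectrum

# Discrete integral (1 nm steps): all the color information kept is 3 numbers.
responses = {name: float(np.sum(curve * spectrum))
             for name, curve in sensitivities.items()}
print(responses)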

Color Matching
Since we are interested in designing systems that provide a consistent
visual experience across viewers, it is helpful to understand the
minimal colors that can be combined to create the experience of
any perceivable color. An experiment from Wandell’s Foundations
of Vision (Sinauer Assoc., 1995) demonstrates that most people
report being able to recreate the color of a given test light by tuning
three experimental lights of differing colors. The only condition
is that each of the three lights must be a primary color. Moreover, the
experiment showed that for the same test light and primaries, the
majority of people select similar weights, though color-blind people
are an exception. Finally, this experiment validates the trichromatic
theory of color - the proposition that three numbers are sufficient for
encoding color - which dates from Thomas Young’s writings in the
1700s.

Color Spaces

Definition
Color space, also known as the color model (or color system), is an
abstract mathematical model which describes the range of colors as
tuples of numbers, typically as 3 or 4 values or color components (e.g.
RGB). A color space may be arbitrary or structured mathematically.
Most color models map to an absolute and globally understood
system of color interpretation.

Linear Color Spaces

Linear color spaces are defined by a choice of three primaries; the
coordinates of a color are given by the weights of the primaries used
to match it.

Figure 3: Mixing two lights


produces colors that lie along a
straight line in color space. Mix-
ing three lights produces colors
that lie within the triangle they
define in color space.

• RGB Space

– Primary colors are monochromatic lights (for monitors, they


correspond to the three types of phosphors)

Figure 4: Representation of
RBG primaries and correspond-
ing matching functions. The
matching functions are the
amounts of primaries needed to
match the monochromatic test
color at the wavelength shown
on the horizontal scale. Source:
https://en.wikipedia.org/
wiki/CIE_1931_color_space

– Subtractive matching is required for certain wavelengths of light

• CIE XYZ Color Space

– Primaries are imaginary, but matching functions are everywhere


positive
– The Y parameter corresponds to brightness or luminance of a
color
– Related to RGB space by linear transformation, upholding
Grassmann’s Law

Figure 5: Source: https:


//en.wikipedia.org/wiki/
CIE_1931_color_space
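
As a concrete illustration of the linear-transformation relationship between RGB and CIE XYZ mentioned above, the short Python sketch below converts a linear RGB color to XYZ with a 3 × 3 matrix and back. The matrix values are the commonly quoted linear-sRGB (D65) transform; treat the exact numbers as illustrative rather than authoritative.

import numpy as np

RGB_TO_XYZ = np.array([
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
])

rgb = np.array([1.0, 0.0, 0.0])          # pure red, in linear RGB
xyz = RGB_TO_XYZ @ rgb                   # change of primaries is a matrix product
rgb_back = np.linalg.inv(RGB_TO_XYZ) @ xyz
print(xyz, rgb_back)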

Nonlinear Color Spaces: HSV


• Designed to reflect more traditional and intuitive color mixing
models (e.g. paint mixing)

• Based on how colors are organized and conceptualized in human


vision

• Dimensions are: Hue, Saturation, and Value (Intensity)



Figure 6: General source:


https://en.wikipedia.org/
wiki/HSL_and_HSV

White Balancing

Definition

White balance is the process of adjusting the image data received
by sensors to properly render neutral colors (white, gray, etc.). This
adjustment is performed automatically by digital cameras (with custom
settings for different light), while film cameras offer several filters and
film types for different shooting conditions.

Importance of White Balancing

Unadjusted images have an unnatural color "cast" for a few reasons,


making white balancing very important:

1. The sensors in cameras or film are different from those in our eyes

2. Different display media render images differently, which must be


accounted for

3. The viewing conditions when the image was taken are usually
different from the image viewing conditions


Von Kries Method

Von Kries' method for white balancing is to scale each channel by a
"gain factor" to match the appearance of a gray neutral object.
In practice, the best way to achieve this is the Gray Card Method:
hold up a neutral (gray or white) card and determine the values of each
channel. If we find that the card has RGB values (r_w, g_w, b_w), then we
scale each channel of the image by 1/r_w, 1/g_w, and 1/b_w.
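
A minimal Python sketch of this gray-card scaling (and of the gray-world variant described below) might look as follows; `image` and `card_rgb` are hypothetical placeholders for an H × W × 3 float image and the measured card values.

import numpy as np

def gray_card_white_balance(image, card_rgb):
    # Scale each channel by 1/r_w, 1/g_w, 1/b_w so the card becomes neutral.
    card_rgb = np.asarray(card_rgb, dtype=float)
    return image / card_rgb

def gray_world_white_balance(image):
    # Gray World variant: use the per-channel means of the image as the
    # "card" values, assuming the average pixel is gray.
    return gray_card_white_balance(image, image.reshape(-1, 3).mean(axis=0))

# Example usage with random placeholder data:
image = np.random.rand(4, 4, 3)
balanced = gray_card_white_balance(image, card_rgb=[0.8, 0.7, 0.6])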

Figure 7: Example of two


photos, one unbalanced, and
one with incorrect white
balancing. Source: http:
//www.cambridgeincolour.com/
tutorials/white-balance.htme

Other Methods
Without gray cards, we need to guess which pixels correspond to
white objects. Several methods attempt to achieve this, including
statistical and machine learning models (which are beyond the scope
of this class).

Gray World Assumption Under this model, we assume that the
average pixel value in the photo, (r_ave, g_ave, b_ave), is gray, and we
scale the image pixels by 1/r_ave, 1/g_ave, and 1/b_ave.

Brightest Pixel Assumption This works on non-saturated images and
assumes that image highlights usually have the color of the light
source (which is usually white). So, it corrects white balance by
weighting each channel inversely proportionally to the values of the
brightest pixels.

Gamut Mapping The Gamut of an image is the set of all pixel colors
displayed in an image (in mathematical terms, this is a "convex hull"
and a subset of all possible color combinations). We can then apply a
transformation to the image that maps the gamut of the image to the
gamut of a "standard" image under white light.

Other Uses of Color in Computer Vision


Color plays a critical role in skin detection, nude detection, and
image segmentation, among other applications.
Linear Algebra Primer

Vectors and Matrices

Definition Vectors and matrices are simply collections of ordered


numbers that represent something: movements in space, scaling
factors, pixel brightness, etc.

Vectors

A column vector v ∈ R^{n×1} where

$$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

A row vector v^T ∈ R^{1×n} where $v^T = \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix}$. T denotes
the transpose operation, which flips a matrix over its diagonal, switch-
ing the row and column indices of the matrix.
As a note, CS 131 will use column vectors as the default. You can
transpose vectors in Python: for a vector v, use v.T.

Uses of Vectors There are two main uses of vectors. Firstly, vectors
can represent an offset in 2 or 3 dimensional space. In this case, a
point is a vector from the origin. Secondly, vectors can represent data
(such as pixels, gradients at an image keypoint, etc.). In this use case,
the vectors do not have a geometric interpretation but calculations
like "distance" can still have value.

Matrices
A matrix A ∈ R^{m×n} is an array of numbers with m rows and n
columns:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \dots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn} \end{bmatrix}$$

If m = n, we say A is square.

An image is represented in Python as a matrix of pixel brightness
values, indexed as [y, x] with the origin at the upper-left corner of the
image. Grayscale images have one number per pixel and are stored as
an m × n matrix. Color images have 3 numbers per pixel - red, green,
and blue brightness (RGB) - and are thus stored as an m × n × 3 matrix.
Multidimensional matrices like these are called tensors.

Basic Matrix Operations


"# " # " #
a b 1 2 a+1 b+2
Addition + =
c d 3 4 c+3 d+4
You
" can
# only add
" a matrix with
# matching dimensions or a scalar.
a b a+7 b+7
+7 =
c d c+7 d+7

" # " #
a b 3a 3b
Scaling ∗3 =
c d 3c 3d

Norm
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$
More formally, a norm is any function f : Rn → R that satisfies
four properties:

1. Non-negativity: for all x ∈ Rn , f ( x ) ≥ 0

2. Definiteness: f ( x ) = 0 if and only if x = 0

3. Homogeneity: for all x ∈ Rn , t ∈ R, f (tx ) = |t| f ( x )

4. Triangle Inequality: for all x, y ∈ Rn , f ( x + y) ≤ f ( x ) + f (y)

Example Norms:

One Norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$

Infinity Norm: $\|x\|_\infty = \max_i |x_i|$



General p-Norm: $\|x\|_p = \left( \sum_{i=1}^{n} x_i^p \right)^{1/p}$

Matrix (Frobenius) Norm: $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}$
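
These definitions can be checked numerically with NumPy, whose np.linalg.norm implements the p-norms and the Frobenius norm; the vector and matrix below are arbitrary examples.

import numpy as np

x = np.array([3.0, -4.0, 12.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l2   = np.sqrt(np.sum(x ** 2))            # two-norm, by the definition above
l1   = np.sum(np.abs(x))                  # one-norm
linf = np.max(np.abs(x))                  # infinity-norm
fro  = np.sqrt(np.trace(A.T @ A))         # Frobenius norm via tr(A^T A)

assert np.isclose(l2,   np.linalg.norm(x))
assert np.isclose(l1,   np.linalg.norm(x, 1))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
assert np.isclose(fro,  np.linalg.norm(A, 'fro'))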

Inner or Dot Product Multiply corresponding entries of two vectors
and add up the result. x · y equals ‖x‖ ‖y‖ cos(the angle between x and y).
Thus if y is a unit vector then x · y gives the length of x which lies in
the direction of y.

One-dimensional dot product: $x^T y = x \cdot y = \begin{bmatrix} x_1 & \dots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^{n} x_i y_i$ (a scalar)

Multiplication The product of two matrices A ∈ R^{m×n}, B ∈ R^{n×p}
results in C = AB ∈ R^{m×p} where $C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$.

$$C = AB = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} \begin{bmatrix} b_1 & b_2 & \dots & b_p \end{bmatrix} = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \dots & a_1^T b_p \\ a_2^T b_1 & a_2^T b_2 & \dots & a_2^T b_p \\ \vdots & \vdots & \ddots & \vdots \\ a_m^T b_1 & a_m^T b_2 & \dots & a_m^T b_p \end{bmatrix}$$

Each entry of the matrix product is made by taking the dot product
of the corresponding row in the left matrix with the corresponding
column in the right one.
Matrix multiplication is

1. Associative: (AB)C = A(BC)

2. Distributive: (A + B)C = AC + BC

3. Not Commutative: Generally AB ≠ BA. For example, if A ∈
R^{m×n}, B ∈ R^{n×q}, the product BA does not even exist if m and q are
not equal!

Powers By convention, we can refer to the matrix product AA as


A2 and AAA as A3 , etc. Obviously only square matrices can be
multiplied that way.

Transpose Flip matrix so row 1 becomes column 1:

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}$$

A useful identity: $(ABC)^T = C^T B^T A^T$.



Determinant det(A) returns a scalar. It represents the area (or vol-
ume) of the parallelogram described by the vectors in the rows of the
matrix.
For $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, det(A) = ad − bc.
Properties:

• det(AB) = det(BA)

• det(A^{-1}) = 1/det(A)

• det(A^T) = det(A)

• det(A) = 0 if and only if A is singular.

Trace tr(A) is the sum of diagonal elements. For $A = \begin{bmatrix} 1 & 3 \\ 5 & 7 \end{bmatrix}$,
tr(A) = 1 + 7 = 8.
Properties:

• tr(AB) = tr(BA)

• tr(A + B) = tr(A) + tr(B)

Special Matrices

• Identity Matrix: a square matrix with 1's along the diagonal and
0's elsewhere. Multiplying I by another matrix gives back that matrix.
Example identity matrix:
$$I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

• Diagonal Matrix: a square matrix with numbers along the diago-
nal and 0's elsewhere. Multiplying a diagonal matrix by another matrix
scales the rows of that matrix. Example diagonal matrix:
$$\begin{bmatrix} 3 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 2.5 \end{bmatrix}$$

• Symmetric Matrix: A^T = A. Example:
$$\begin{bmatrix} 1 & 2 & 5 \\ 2 & 1 & 7 \\ 5 & 7 & 1 \end{bmatrix}$$

• Skew-symmetric Matrix: A^T = −A. Example:
$$\begin{bmatrix} 0 & -2 & -5 \\ 2 & 0 & -7 \\ 5 & 7 & 0 \end{bmatrix}$$

Vectors and Matrices Recap

Vector
A column vector v ∈ R^{n×1} where

$$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

A row vector v^T ∈ R^{1×n} where

$$v^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}$$

T denotes the transpose operation.


The norm is
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Formally, the norm can also be defined as any function f : R^n → R
that satisfies 4 properties:

• Non-negativity: For all x ∈ IRn , f ( x ) ≥ 0

• Definiteness: f ( x ) = 0 if and only if x = 0

• Homogeneity: For all x ∈ IRn , t ∈ IR, f (tx ) = |t| f ( x )

• Triangle inequality: For all x, y ∈ IRn , f ( x + y) ≤ f ( x ) + f (y)

Projection A projection is an inner product (dot product) of vectors.


If B is a unit vector, then A · B gives the length of A which lies in the
direction of B.

Matrix
A matrix A ∈ R^{m×n} is an array of numbers with size m by n.

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ \vdots & & & \ddots & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn} \end{bmatrix}$$

If m = n, we say that A is square.

An Application of Matrices Grayscale images have one number per
pixel and are stored as an m × n matrix. Color images have 3 numbers
per pixel: red, green, and blue.

Transformation Matrices

Matrices can be used to transform vectors in useful ways through
multiplication: x′ = Ax. The simplest application of this is scaling,
i.e., multiplying the vector by a scaling matrix with scalars on its
diagonal.
We can also use matrices to rotate vectors. When we multiply a
matrix and a vector, the resulting x coordinate is the original vector
dotted with the first row.
In order to rotate a vector by an angle θ, counter-clockwise:
x′ = x cos θ − y sin θ and
y′ = x sin θ + y cos θ
Therefore, we multiply the vector by the matrix
$$R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
which gives us P′ = RP.
We can also use multiple matrices to transform a point. For exam-
ple, p′ = R₂R₁Sp. The transformations are applied one after another,
from right to left. In our example, this would be (R₂(R₁(Sp))).
In order to translate vectors, we have to implement a somewhat
hacky solution of adding a "1" at the very end of the vector. This way,
using these "homogeneous coordinates", we can also translate vectors.
The multiplication works out so the rightmost column of our matrix
gets added to the respective coordinates. A homogeneous matrix will
have [0 0 1] in the bottom row to ensure that the elements get added
correctly, and the resulting vector has a '1' at the bottom.
By convention, in homogeneous coordinates, we divide the result
by its last coordinate after doing matrix multiplication.
 
$$\begin{bmatrix} x \\ y \\ 7 \end{bmatrix} \rightarrow \begin{bmatrix} x/7 \\ y/7 \\ 1 \end{bmatrix}$$

So to obtain the result of P(x, y) → P′ = (s_x x, s_y y),
we first write P = (x, y) → (x, y, 1) and then P′ = (s_x x, s_y y) →
(s_x x, s_y y, 1), so we can then do the matrix multiplication S·P. Note
that scaling then translating is not the same as translating then
scaling. In other words, T∗S∗P ≠ S∗T∗P.
Any rotation matrix R belongs to the category of normal matrices,
and it satisfies interesting properties. For example, R·R^T = I and
det(R) = 1.

The rows of a rotation matrix are always mutually perpendicular
(a.k.a. orthogonal) unit vectors; this is what allows it to satisfy
the unique properties mentioned above.
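
A small NumPy sketch of these transformations in homogeneous coordinates is given below; the angle, scale factors, translation, and point are arbitrary examples, and it also shows that translating and scaling do not commute.

import numpy as np

theta, sx, sy, tx, ty = np.pi / 2, 2.0, 2.0, 1.0, 3.0

R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])   # rotation by theta
S = np.array([[sx, 0,  0],
              [0,  sy, 0],
              [0,  0,  1]])                           # scaling
T = np.array([[1, 0, tx],
              [0, 1, ty],
              [0, 0, 1]])                             # translation

p = np.array([1.0, 0.0, 1.0])   # point (1, 0) with a homogeneous "1" appended
print(T @ S @ p)                # scale first, then translate -> (3, 3, 1)
print(S @ T @ p)                # translate first, then scale -> (4, 6, 1), different
print(R @ p)                    # 90-degree rotation sends (1, 0) to (0, 1)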

Matrix Inverse

Given a matrix A, its inverse A−1 is a matrix such that:

AA−1 = A−1 A = I

where I is the identity matrix of the same size.


An example of a matrix inverse is:
$$\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}^{-1} = \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{3} \end{bmatrix}$$

A matrix does not necessarily have an inverse. If A−1 exists, A is


known as invertible or non-singular.
Some useful identities for matrices that are invertible are:

• $(A^{-1})^{-1} = A$

• $(AB)^{-1} = B^{-1} A^{-1}$

• $A^{-T} \triangleq (A^T)^{-1} = (A^{-1})^T$

Pseudoinverse
Frequently in linear algebra problems, you want to solve the equa-
tion AX = B for X. You would like to compute A−1 and multiply
both sides to get X = A−1 B. In python, this command would be:
np.linalg.inv(A)*B.
However, for large floating-point matrices, calculating inverses can
be very expensive and possibly inaccurate. An inverse might not even
exist for A. What should we do?
Luckily, we have what is known as a pseudoinverse. This other
matrix can be used to solve AX = B. In Python, np.linalg.solve(A, B)
solves the system directly for square, non-singular A, and
np.linalg.lstsq(A, B) uses the pseudoinverse idea to find the closest
solution when no exact solution to AX = B exists.
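
A minimal sketch of these options in NumPy follows; the matrices and right-hand sides are arbitrary examples.

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.array([9.0, 8.0])

x_inv   = np.linalg.inv(A) @ B        # works, but expensive/fragile in general
x_solve = np.linalg.solve(A, B)       # preferred for square, non-singular A
assert np.allclose(x_inv, x_solve)

# For rectangular or singular A, the pseudoinverse gives the least-squares
# (closest) solution to AX = B.
A_tall = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b_tall = np.array([1.0, 2.0, 2.0])
x_lstsq = np.linalg.lstsq(A_tall, b_tall, rcond=None)[0]
x_pinv  = np.linalg.pinv(A_tall) @ b_tall
assert np.allclose(x_lstsq, x_pinv)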

Matrix Rank

• The rank of a transformation matrix A tells you how many dimen-


sions it transforms a matrix to.

• col-rank(A) = maximum number of linearly independent column


vectors of A

• row-rank(A) = maximum number of linearly independent row


vectors of A. Column rank always equals row rank.

• For transformation matrices, the rank tells you the dimensions of


the output.

• For instance, if rank of A is 1, then the transformation p0 = Ap


points onto a line.

• Full-rank matrix: an m × m matrix whose rank is m.

• Singular matrix: an m × m matrix whose rank is less than m, because at
least one dimension is getting collapsed (there is no way to tell what the
input was from the result), so its inverse does not exist. Inverses also do
not exist for non-square matrices.

Eigenvalues and Eigenvectors (SVD)

Definitions
An eigenvector x of a linear transformation A is a non-zero vector that,
when A is applied to it, does not change its direction. Applying A to
the eigenvector scales the eigenvector by a scalar value λ, called an
eigenvalue.
The following equation describes the relationship between eigen-
values and eigenvectors:

Ax = λx,   x ≠ 0

Finding eigenvectors and eigenvalues


If we want to find the eigenvalues of A, we can manipulate the above
definition as follows:

Ax = λx,   x ≠ 0
Ax = λIx,   x ≠ 0
(λI − A)x = 0,   x ≠ 0
Since we are looking for non-zero x, we can equivalently write the
above relation as:

|λI − A| = 0
Solving this equation for λ gives the eigenvalues of A, and these
can be substituted back into the original equation to find the corre-
sponding eigenvectors.
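
In NumPy, np.linalg.eig returns the eigenvalues and eigenvectors directly; the small sketch below, with an arbitrary example matrix, checks the defining relation Ax = λx for each eigenpair.

import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the eigenvectors
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)         # A x = lambda x
print(eigenvalues)                             # for this A: 2 and 3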

Properties
• The trace of A is equal to the sum of its eigenvalues:
$$\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$$

• The determinant of A is equal to the product of its eigenvalues:
$$|A| = \prod_{i=1}^{n} \lambda_i$$

• The rank of A is equal to the number of non-zero eigenvalues of A.

• The eigenvalues of a diagonal matrix D = diag(d_1, ..., d_n) are just
the diagonal entries d_1, ..., d_n.

Spectral Theory
Definitions

• An eigenpair is the pair of an eigenvalue and its associated eigen-


vector.

• An eigenspace of A associated with λ is the space of vectors x where:

(A − λI)x = 0

• The spectrum of A is the set of all its eigenvalues:

σ(A) = {λ ∈ C : λI − A is singular}

where C denotes the set of complex numbers.

• The spectral radius of A is the magnitude of its largest magnitude


eigenvalue:
ρ( A) = max {|λ1 |, . . . , |λn |}

Theorem: Spectral radius bound The spectral radius is bounded by the
infinity norm of a matrix:

$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}$$

Proof:
$$|\lambda|^k \|v\| = \|\lambda^k v\| = \|A^k v\|$$
By the Cauchy–Schwarz inequality ($\|uv\| \leq \|u\| \cdot \|v\|$):
$$|\lambda|^k \|v\| \leq \|A^k\| \cdot \|v\|$$
Since v ≠ 0:
$$|\lambda|^k \leq \|A^k\|$$
And we thus arrive at:
$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}$$

Diagonalization
An n × n matrix A is diagonalizable if it has n linearly independent
eigenvectors.
Most square matrices are diagonalizable

• Normal matrices are diagonalizable


Note: Normal matrices are matrices that satisfy:

A∗ A = AA∗

Where A∗ is the complex conjugate of A

• Matrices with n distinct eigenvalues are diagonalizable

Lemma: Eigenvectors associated with distinct eigenvalues are


linearly independent.
To diagonalize the matrix A, consider its eigenvalues and eigen-
vectors. We can construct matrices D and V, where D is the diagonal
matrix of the eigenvalues of A, and V is the matrix of corresponding
eigenvectors:
 
$$D = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix}, \qquad V = \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix}$$

Since we know that:


AV = VD

We can diagonalize A by:

A = VDV −1

If all eigenvalues are unique, then V is orthogonal. Since the inverse


of an orthogonal matrix is its transpose, we can write the diagonaliza-
tion as:
A = VDV T

Symmetric Matrices
If A is symmetric, then all its eigenvalues are real, and its eigenvec-
tors are orthonormal. Recalling the above diagonalization equation,
we can diagonalize A by:

A = VDV T

Using the above relation, we can also write the following relation-
ship: given y = V^T x,
$$x^T A x = x^T V D V^T x = y^T D y = \sum_{i=1}^{n} \lambda_i y_i^2$$

Thus, if we want to do the following maximization:
$$\max_{x \in \mathbb{R}^n} (x^T A x) \quad \text{subject to } \|x\|_2^2 = 1$$

then the maximizing x can be found by finding the eigenvector
corresponding to the largest eigenvalue of A.
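
The following NumPy sketch illustrates this claim numerically for an arbitrary symmetric example matrix: the quadratic form at the top eigenvector equals the largest eigenvalue, and random unit vectors never exceed it.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                     # symmetric example

eigenvalues, eigenvectors = np.linalg.eigh(A)  # eigh is for symmetric matrices
v_max = eigenvectors[:, np.argmax(eigenvalues)]
print(v_max @ A @ v_max, eigenvalues.max())    # these two values agree

# Random unit vectors never beat the top eigenvector.
x = rng.normal(size=(1000, 2))
x /= np.linalg.norm(x, axis=1, keepdims=True)
assert np.all(np.einsum('ij,jk,ik->i', x, A, x) <= eigenvalues.max() + 1e-9)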

Applications
Some applications of eigenvalues and eigenvectors include, but are
not limited to:

• PageRank

• Schroedinger’s equation

• Principal component analysis (PCA)

Matrix Calculus

The Gradient
If a function f : R^{m×n} → R takes as input a matrix A of size (m × n)
and returns a real value, then the gradient of f is

$$\nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}$$

Every entry in the matrix is:

$$\nabla_A f(A)_{ij} = \frac{\partial f(A)}{\partial A_{ij}}$$

The size of $\nabla_A f(A)$ is always the same as the size of A. Thus, if A
is a vector x:

$$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}$$

The Gradient: Properties

• ∇ x ( f ( x ) + g( x )) = ∇ x f ( x ) + ∇ x g( x )

• For t ∈ IR, ∇ x (t f ( x )) = t∇ x f ( x )

The Hessian

The Hessian matrix with respect to x can be written as $\nabla_x^2 f(x)$ or as
H. It is an n × n matrix of partial derivatives:

$$\nabla_x^2 f(x) \in \mathbb{R}^{n \times n} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}$$

Every entry in the matrix is:

$$\nabla_x^2 f(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$$

It's important to note that the Hessian is the gradient of every
entry of the gradient of the vector. For instance, the first column of
the Hessian is the gradient of $\frac{\partial f(x)}{\partial x_1}$.

The Hessian: Properties

Schwarz’s theorem: The order of partial derivatives doesn’t matter so


long as the second derivative exists and is continuous.
Thus, the Hessian is always symmetric:

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}$$

Example Calculations
Example Gradient Calculation For x ∈ R^n, let f(x) = b^T x for some
known vector b ∈ R^n.

$$f(x) = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

Thus,
$$f(x) = \sum_{i=1}^{n} b_i x_i$$
$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k$$

Therefore, we can conclude that $\nabla_x b^T x = b$.

Example Hessian Calculation Consider the quadratic function f(x) =
x^T A x (with A symmetric, so that A_{ij} = A_{ji}).
$$f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j$$
$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j$$
$$= \frac{\partial}{\partial x_k} \Big[ \sum_{i \neq k} \sum_{j \neq k} A_{ij} x_i x_j + \sum_{i \neq k} A_{ik} x_i x_k + \sum_{j \neq k} A_{kj} x_k x_j + A_{kk} x_k^2 \Big]$$
$$= \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k$$
$$= \sum_{i=1}^{n} A_{ik} x_i + \sum_{j=1}^{n} A_{kj} x_j = 2 \sum_{i=1}^{n} A_{ki} x_i$$
$$\frac{\partial^2 f(x)}{\partial x_k \partial x_l} = \frac{\partial}{\partial x_k} \Big[ \frac{\partial f(x)}{\partial x_l} \Big] = \frac{\partial}{\partial x_k} \Big[ 2 \sum_{i=1}^{n} A_{li} x_i \Big] = 2 A_{lk} = 2 A_{kl}$$
Thus,
$$\nabla_x^2 f(x) = 2A$$
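
These two results ($\nabla_x b^T x = b$ and, for symmetric A, a gradient of 2Ax and hence a Hessian of 2A) can be verified numerically with central finite differences; b, A, and x0 below are arbitrary examples.

import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Central finite-difference approximation of the gradient of f at x.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

b = np.array([1.0, -2.0, 3.0])
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])          # symmetric example
x0 = np.array([0.5, -1.0, 2.0])

assert np.allclose(numerical_gradient(lambda x: b @ x, x0), b)                         # grad of b^T x is b
assert np.allclose(numerical_gradient(lambda x: x @ A @ x, x0), 2 * A @ x0, atol=1e-4) # grad of x^T A x is 2Ax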
Pixels and Filters

Image Sampling and Quantization

Image Types

Binary Images contain pixels that are either black (0) or white (1).

Grayscale Images have a wider range of intensity than black and


white. Each pixel is a shade of gray with pixel values ranging be-
tween 0 (black) and 255 (white).

Color Images have multiple color channels; each color image can
be represented in different color models (e.g., RGB, LAB, HSV). For
example, an image in the RGB model consists of red, green, and blue
channels. Each pixel in a channel has intensity values ranging from
0 to 255. Note that this range depends on the choice of color
model. A color image is usually represented by a 3D tensor (Width x
Length x 3), where the 3 channels correspond to the color model, such
as RGB (Red-Green-Blue), LAB (Lightness-A-B), or HSV (Hue-Saturation-
Value).

Sampling and Resolution

Images are samples: they are not continuous; they consist of discrete
pixels of a certain size and density. This can lead to errors (or grain-
iness) because pixel intensities can only be measured with a certain
resolution and must be approximated.

Resolution is a sampling parameter, defined in dots per inch (DPI).


The standard DPI value for screens is 72 DPI.

Pixels are quantized: all pixels (or channels of a pixel) take
one of a set number of values (usually [0, 255]). Quantization and
sampling lose information due to their finite precision.

Figure 8: Illustrations of dif-


ferent pixel densities. Taken
from the accompanying lecture
slides. (Slide 14, slide credit
Ulas Bagci)

Image Histograms

Histograms measure the frequency of brightness within the image:
how many times a particular pixel value appears in an image.

Figure 9: The image is sam-


pled at two vertical positions,
sampling a patch of sky and
sampling a patch of grass. The
corresponding histograms are
shown to the right. Adapted
from the accompanying lecture
slide (Slide 23, slide credit Dr.
Mubarak Shah)

Histograms help us detect particular features in images, for exam-


ple:

• Sky: smooth coloration denotes consistency in image, consistent


with the image of a sky.

• Grass: a jagged histogram shows wide ranging variety in col-


oration, consistent with the shadows of a grass field.

• Faces: the color composition of a face will be displayed in a his-


togram.

An image histogram measures the frequency of certain grayscale


intensities in an image. Histograms can be used to provide a quantifi-
able description of what things look like; this can be used as input to
classifiers.
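
A minimal NumPy sketch of an image histogram is shown below; the image is random placeholder data standing in for a real grayscale image.

import numpy as np

image = np.random.randint(0, 256, size=(100, 100), dtype=np.uint8)

counts, bin_edges = np.histogram(image, bins=256, range=(0, 256))
# counts[v] is how many pixels have intensity v; a smooth, concentrated
# histogram suggests a uniform region (sky), a spread-out one suggests
# texture (grass).
print(counts.argmax(), counts.max())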

Images as Functions

Most images that we deal with in computer vision are digital, which
means that they are discrete representations of the photographed
scenes. This discretization is achieved through the sampling of
2-dimensional space onto a regular grid, eventually producing a
representation of the image as a matrix of integer values.
When dealing with images, we can imagine the image matrix as
infinitely tall and wide. However, the displayed image is only a finite
subset of this infinite matrix. Having employed such definition of
images, we can write them as coordinates in a matrix:

$$\begin{bmatrix} \ddots & \vdots & \vdots & \vdots & \\ \cdots & f[-1,1] & f[0,1] & f[1,1] & \cdots \\ \cdots & f[-1,0] & f[0,0] & f[1,0] & \cdots \\ \cdots & f[-1,-1] & f[0,-1] & f[1,-1] & \cdots \\ & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

An image can also be treated as a function f : R² → R^N. When
we do so, we write f[n, m], where f[n, m] is the intensity of the pixel at
position (m, n). Note that we use square brackets, rather than the typical
parentheses, to denote discrete functions.
When we treat an image as a function, it is defined over a rectan-
gle with finite range. For example, the following function f returns
the (grayscale) intensity of a single pixel in an image located between
a and b horizontally and c and d vertically.

f : [ a, b] × [c, d] → [0, 255] (Grayscale Pixel Intensity)

The set of values [ a, b] × [c, d] is known as the domain support, and


contains all values that are valid inputs to the function f , while
[0, 255] (in this case) is the range defining the set of possible outputs.
An image can also be treated as a function mapping R² → R³. For
example, the RGB intensities of a given pixel can be written as the
function g below:

$$g[x, y] = \begin{bmatrix} r[x, y] \\ g[x, y] \\ b[x, y] \end{bmatrix} \quad \text{(Color Pixel Intensity)}$$

where r, g, b : [a, b] × [c, d] → [0, 255].
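
In code, evaluating these functions amounts to indexing an array; the sketch below uses random placeholder data for a grayscale and a color image.

import numpy as np

gray  = np.random.randint(0, 256, size=(240, 320), dtype=np.uint8)     # f : grid -> [0, 255]
color = np.random.randint(0, 256, size=(240, 320, 3), dtype=np.uint8)  # g : grid -> [0, 255]^3

row, col = 100, 50
intensity = gray[row, col]        # a single grayscale intensity
r, g, b = color[row, col]         # the three channel values at one pixel
print(intensity, r, g, b)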

Linear Systems (Filters)

The term filtering refers to a process that forms a new image, the pixel
values of which are transformations of the original pixel values. In

general, the purpose in applying filters is the extraction of useful


information (e.g., edge detection) or the adjustment of an image’s
visual properties (e.g., de-noising).

f[n, m] --S--> g[n, m]

Figure 10: Graphical representation of a system's mapping of f to g
Filters are examples of systems, which are units that convert an
input function f [m, n] to an output (or response) function g[m, n]
where m, n are the independent variables. When dealing with images,
m, n are the representation of a spatial position in the image.
Notationally, S is referred to as the system operator, which maps a member of the set of possible inputs f[m, n] to a member of the set of possible outputs g[m, n]. When using notation involving S, we can write

S[f] = g
S{f[m, n]} = g[m, n]
f[m, n] --S--> g[m, n]

Examples of Filters
Moving Average One intuitive example of a filter is the moving
average. This filter sets the value of a pixel to be the average of its neighboring pixels (e.g., the nine pixels in a 3 × 3 neighborhood, when applying a 3 × 3 filter). Mathematically, we can represent this as

g[m, n] = (1/9) ∑_{i=−1}^{1} ∑_{j=−1}^{1} f[m − i, n − j]        (Weighted Average)

This weighted average filter serves to smooth out the sharper edges
of the image, creating a blurred or smoothed effect.
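
A minimal sketch of this box filter in Python, assuming NumPy and SciPy are available (the function name and border handling are illustrative choices):

    import numpy as np
    from scipy.signal import convolve2d

    def moving_average(image, size=3):
        # Box filter: every output pixel is the mean of a size-by-size neighborhood.
        kernel = np.ones((size, size)) / (size * size)
        # mode='same' keeps the output the same shape as the input;
        # boundary='symm' mirrors the image at the borders instead of zero-padding.
        return convolve2d(image.astype(float), kernel, mode='same', boundary='symm')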

Image Segmentation We can also use filters to perform rudimentary


image segmentation based on a simple threshold system. In this case,
the filter sets the value of a pixel either to an extremely high or an
extremely low value, depending on whether or not it meets the
threshold t. Mathematically, we write this as

g[m, n] = 255 if f[m, n] ≥ t, and 0 otherwise        (Threshold)

This basic image segmentation filter divides an image’s pixels into


binary classifications of bright regions and dark regions, depending on whether or not f[m, n] ≥ t.
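
A short sketch of this threshold filter (NumPy assumed; the threshold value is whatever t is chosen to be):

    import numpy as np

    def threshold(image, t):
        # 255 where the pixel meets the threshold, 0 elsewhere.
        return np.where(image >= t, 255, 0).astype(np.uint8)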

Properties of Systems
When discussing specific systems, it is useful to describe their prop-
erties. The following includes a list of properties that a system may
possess. However, not all systems will have all (or any) of these prop-
erties. In other words, these are potential characteristics of individual
systems, not traits of systems in general.

Amplitude Properties

• Additivity: A system is additive if it satisfies the equation

S[ f i [m, n] + f j [m, n]] = S[ f i [m, n]] + S[ f j [m, n]]

• Homogeneity: A system is homogeneous if it satisfies the equation

S[α f_i[n, m]] = α S[f_i[n, m]]

• Superposition: A system has the property of superposition if it


satisfies the equation

S[α f_i[n, m] + β f_j[n, m]] = α S[f_i[n, m]] + β S[f_j[n, m]]

• Stability: A system is stable if it satisfies the inequality

| f [n, m]| ≤ k =⇒ | g[n, m]| ≤ ck

for some c.

• Invertibility: A system is invertible if it satisfies the equation

S −1 [S[ f [n, m]]] = f [n, m]

Spatial Properties

• Causality: A system is causal if for m < m0 and n < n0

f [m, n] = 0 =⇒ g[m, n] = 0

• Shift Invariance: A system is shift invariant if

f[m − m₀, n − n₀] --S--> g[m − m₀, n − n₀]

Linear Systems
A linear system is a system that satisfies the property of superposition.
When we employ a linear system for filtering; we create a new image
whose pixels are weighted sums of the original pixel values, using
50 computer vision: foundations and applications

the same set of weights for each pixel. A linear shift-invariant system is
a linear system that is also shift invariant.
Linear systems also have what is known as an impulse response. To
determine the impulse response of a system S , consider first δ2 [m, n].
This is a function defined as follows

δ₂[m, n] = 1 if m = 0 and n = 0, and 0 otherwise

The impulse response r is then simply

r = S[δ2 ]

A simple linear shift-invariant system is a system that shifts the


pixels of an image, based on the shifting property of the delta func-
tion.
f[m, n] = ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} f[i, j] δ₂[m − i, n − j]

We can then use the superposition property to write any linear shift-invariant system as a weighted sum of such shifting systems:

α₁ ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} f[i, j] δ₂[m − i − m₁, n − j − n₁] + α₂ ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} f[i, j] δ₂[m − i − m₂, n − j − n₂] + …

where each term shifts the image by a different offset (m_k, n_k). We then define the filter h of a linear shift-invariant system as

h[m, n] = α₁ δ₂[m − m₁, n − n₁] + α₂ δ₂[m − m₂, n − n₂] + …

Impulse response (for all linear systems): the delta function δ[n, m] has value 1 at exactly one pixel; passing it through the system gives back a response h[n, m]. A shifted delta function gives back a correspondingly shifted response.
Linear shift invariant systems (LSI): for example, a moving average filter is a summation of shifted impulse responses.

• Systems that satisfy the superposition property

• Have an impulse response: h[n, m] = S[δ₂[n, m]]

• Discrete convolution: f[n, m] ∗ h[n, m] (a weighted sum of shifted copies of the impulse response, with weights given by the original function)

Convolution and Correlation

Convolution
The easiest way to think of convolution is as a system that uses
information from neighboring pixels to filter the target pixel. A good
example of this is a moving average, or a box filter discussed earlier.

Convolution is represented by ∗; for example, f[n, m] ∗ h[n, m] represents the convolution of a function f with an impulse response h (a weighted sum of shifted copies of the impulse response).
Convolution allows us to compute the output of passing any input
signal through a system simply by considering the impulse response
of the system. To understand what this means, we must first under-
stand how to break up a signal into a set of impulse functions.
The impulse (delta) function, δ[n], is defined to be 1 at n = 0
and 0 elsewhere. As seen in the image below, any signal can be
decomposed as the weighted sum of impulse functions.

Figure 11: An example decomposition of a signal into an impulse function. (Adapted from http://www.songho.ca/dsp/convolution/convolution.html)

x [ n ] = x [0] δ [ n ] + x [1] δ [ n − 1] + x [2] δ [ n − 2]

More generally, an arbitrary signal x can be written as x[n] = ∑_{k=−∞}^{∞} x[k] δ[n − k].
The impulse response of a system, h[n], is defined as the output
resulting from passing an impulse function into a system. When a
system is linear, scaling the impulse function results in the scaling
of the impulse response by the same amount. Moreover, when the
system is shift-invariant, shifting the impulse function shifts the
impulse response by the same amount. The image below illustrates
these properties for a signal consisting of 3 components.
Figure 12: An impulse function is sent through a system to create an impulse response function. Adapted from http://www.songho.ca/dsp/convolution/convolution.html

More generally, for an arbitrary input signal x[n] = ∑_{k=−∞}^{∞} x[k] δ[n − k] passed into a linear, shift-invariant system, the output is y[n] = ∑_{k=−∞}^{∞} x[k] h[n − k], i.e. the convolution of the signal x with the impulse response h.
Convolution can also be done in 2 dimensions. For example, when an arbitrary 2D signal x[n, m] = ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} x[i, j] δ[n − i, m − j] is passed into a linear, shift-invariant system, the output is y[n, m] = ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} x[i, j] h[n − i, m − j], i.e. the convolution of the signal x with the impulse response h in 2 dimensions.
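
The 1D case can be sketched directly from this formula; the implementation below is an illustrative NumPy version (not the only way to write it) and checks itself against np.convolve:

    import numpy as np

    def convolve_1d(x, h):
        # Direct implementation of y[n] = sum_k x[k] * h[n - k] for finite signals.
        y = np.zeros(len(x) + len(h) - 1)
        for k, xk in enumerate(x):
            # Each input sample adds a scaled, shifted copy of the impulse response.
            y[k:k + len(h)] += xk * h
        return y

    x = np.array([1.0, 2.0, 3.0])
    h = np.array([0.5, 0.5])   # impulse response of a two-tap averaging system
    assert np.allclose(convolve_1d(x, h), np.convolve(x, h))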

Correlation
Cross correlation is the same as convolution, except that the filter ker-
nel is not flipped. Two-dimensional cross correlation is represented
as:
r[k, l] = ∑_{m=−∞}^{∞} ∑_{n=−∞}^{∞} f[m + k, n + l] g[m, n]
It can be used to find known features in images by using a kernel
that contains target features.

Linear Systems
Linear Systems (Filters) form new images, the pixels of which are a
weighted sum of select groupings of the original pixels. The use of
different patterns and weights amplifies different features within the
original image. A system S is a linear system If and Only If it satisfies
the Superposition Property of systems:

S[α f i [n, m] + β f j [h, m]] = αS[ f i [n, m]] + βS[ f j [h, m]]

As previously introduced in the last lecture, the process of apply-


ing a filter to an input image is referred to as convolution.

LSI (Linear Shift Invariant Systems) and the impulse response


A shift invariant system is one in which shifting the input also shifts
the output an equal amount. A linear shift-invariant (LSI) system’s is
characterized by its response to an impulse; this response is known
as the impulse response. The impulse response helps us to understand
the output of an LSI system for any given input signal.

Why are Convolutions Flipped?


Proof of 2D cross correlation’s commutativity:

f [n, m] ∗ ∗h[n, m] = ∑ ∑ f [k, l ] · h[n − k, m − l ]


k l
let N = n − k, M = m − l so k = n − N and l = m − M
= ∑ ∑ f [n − N, m − M] · h[ N, M]
k l

= ∑ ∑ h[ N, M] · f [n − N, m − M]
N M
= h[n, m] ∗ ∗ f [n, m]

Example Apply kernel k to matrix M, where M is an impulse image:

M = [ 0 0 0 ]
    [ 0 1 0 ]
    [ 0 0 0 ]

k = [ x_{1,1}  x_{1,2}  x_{1,3} ]
    [ x_{2,1}  x_{2,2}  x_{2,3} ]
    [ x_{3,1}  x_{3,2}  x_{3,3} ]

Sliding the kernel over M without flipping it gives

M ∗ k = [ x_{3,3}  x_{3,2}  x_{3,1} ]
        [ x_{2,3}  x_{2,2}  x_{2,1} ]
        [ x_{1,3}  x_{1,2}  x_{1,1} ]

Here, the output of filtering an impulse is expected to reproduce the kernel (the impulse response). Instead, the output is equal to the kernel flipped in the x and y directions. To rectify this, convolution flips the kernel at the initial step to ensure the correct output form.
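
The same observation can be reproduced numerically; the snippet below is a small check using SciPy (the kernel values are arbitrary stand-ins for the x_{i,j} entries above):

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    impulse = np.zeros((3, 3))
    impulse[1, 1] = 1.0                                    # the matrix M: a unit impulse
    kernel = np.arange(1, 10, dtype=float).reshape(3, 3)   # stands in for the x_{i,j}

    # Correlation slides the kernel as-is, so the impulse returns a flipped kernel.
    corr = correlate2d(impulse, kernel, mode='same')
    # Convolution flips the kernel first, so the impulse reproduces the kernel itself.
    conv = convolve2d(impulse, kernel, mode='same')

    assert np.allclose(corr, kernel[::-1, ::-1])
    assert np.allclose(conv, kernel)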

Convolution vs Cross Correlation


A convolution is an integral that expresses the amount of overlap of
one function as it is shifted over another function. We can think of
the convolution as a filtering operation.
Correlation calculates a similarity measure for two input signals (e.g., two image patches). The output of correlation reaches a maximum when the two signals match best. Correlation can be thought of as a measure of the relatedness of two signals.
Edge Detection

Edge Detection in Mammals

Hubel & Wiesel


In a series of experiments conducted by Hubel and Wiesel¹³, the neuron responses in a cat's brain were recorded to observe how each part of the brain responds to different stimuli. They revealed that the cat's neurons were most excited by edges at different orientations; namely, certain neurons correlated to specific orientations of edges or to edges that moved in specific directions. Having one of these edges moving in its field of vision would cause that particular neuron to fire excitedly.

¹³ DH Hubel and TN Wiesel. Receptive fields of optic nerve fibres in the spider monkey. The Journal of physiology, 154(3): 572–580, 1960

Figure 13: The Hubel and Wiesel's experiment.

Biederman
Biederman investigated the rate at which humans can recognize the
object they’re looking at. To test this, he drew outlines of common
and recognizable objects and split them into two halves, with each
line segment divided in only one of the halves. These outlines were then shown to participants to test whether they could recognize the original objects while seeing only half of the original outline.
Surprisingly, he observed no difference in terms of the speed
with which people recognized the objects. It was easy for them to
recognize an object via only parts of its edges. This study benefits
computer vision by providing an insight: even if only part of the
original image is shown, a system theoretically should still be able to
recognize the whole object or scene.

Figure 14: The examples of the Biederman's outlines.

Walther, Chai, Caddigan, Beck & Fei-Fei


In similar experiments, another group of researchers took two varia-
tions of the same image - the original color image and the outline of
that image - to test color image vs line image recognition in humans.
They traced recognition through different layers, each one a different level of processing in the visual cortex of the brain. They found that
lower layers could recognize the scenes faster via the line image, but
as it moved up the layers of the brain (which encode increasingly
higher level concepts), the color drawings were much more helpful
in allowing people to recognize the scene as compared to the line
drawing. This is believed to have happened because lower layers are
better at recognizing pieces, like edges, while higher layers are better
at recognizing concepts (like "human," "chair," or "tiger").

Edge Detection for Computer Vision

The goal of edge detection is to identify sudden changes (discontinu-


ities) in an image. Intuitively, most semantic and shape information
from the image can be encoded in its edges.
The edges help us extract information, recognize objects, and
recover geometry and viewpoint. They arise due to discontinuities in
surface normal, depth, surface color, and illumination .

Figure 15: The visualization of the images and outcomes.

Types of Discrete Derivative in 1D

There are three main types of derivatives that can be applied on


pixels. Their formulas and corresponding filters are:
Backward:
df/dx = f(x) − f(x − 1) = f′(x)        filter: [0, 1, −1]

Forward:
df/dx = f(x) − f(x + 1) = f′(x)        filter: [−1, 1, 0]

Central:
df/dx = f(x + 1) − f(x − 1) = f′(x)        filter: [1, 0, −1]

Discrete Derivative in 2D

Gradient vector:
∇f(x, y) = [ f_x, f_y ]ᵀ

Gradient magnitude:
|∇f(x, y)| = √( f_x² + f_y² )

Gradient direction:
θ = tan⁻¹( (df/dy) / (df/dx) )

Example
The gradient at a matrix index can be approximated using neighbor-
ing pixels based on the central discrete derivative equation expanded
to 2D. A filter like

(1/3) · [ −1 0 1 ]
        [ −1 0 1 ]
        [ −1 0 1 ]

when overlapped on top of a pixel x[m, n], such that the center of the filter is located at x[m, n] (shown with its neighbors below),

[ ...      ...           ...          ...          ... ]
[ ...  x_{m−1,n−1}   x_{m−1,n}   x_{m−1,n+1}   ... ]
[ ...  x_{m,n−1}     x_{m,n}     x_{m,n+1}     ... ]
[ ...  x_{m+1,n−1}   x_{m+1,n}   x_{m+1,n+1}   ... ]
[ ...      ...           ...          ...          ... ]

produces an output of

(1/3) [ (x_{m−1,n+1} − x_{m−1,n−1}) + (x_{m,n+1} − x_{m,n−1}) + (x_{m+1,n+1} − x_{m+1,n−1}) ]

which is equivalent to an approximation of the gradient in the horizontal (n) direction at pixel (m, n). This filter responds to intensity changes in the horizontal direction (i.e., it highlights vertical edges), and a separate kernel is required for changes in the vertical direction.
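
A sketch of applying such derivative filters with NumPy/SciPy (correlate2d applies the filter exactly as written, without the flip that convolution performs; function and variable names are illustrative):

    import numpy as np
    from scipy.signal import correlate2d

    # Central-difference filters for the horizontal (n) and vertical (m) directions.
    kernel_x = np.array([[-1, 0, 1],
                         [-1, 0, 1],
                         [-1, 0, 1]]) / 3.0
    kernel_y = kernel_x.T

    def gradients(image):
        gx = correlate2d(image.astype(float), kernel_x, mode='same', boundary='symm')
        gy = correlate2d(image.astype(float), kernel_y, mode='same', boundary='symm')
        magnitude = np.sqrt(gx ** 2 + gy ** 2)
        direction = np.arctan2(gy, gx)
        return gx, gy, magnitude, direction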

Simple Edge Detectors

Characterizing Edges
Characterizing edges (i.e., characterizing them properly so they can
be recognized) is an important first step in detecting edges. For our
purposes, we will define an edge as a place of rapid change in the image intensity function. If we plot the intensity function along the horizontal scanline, we see that the edges correspond to the extrema of the derivative. Therefore, noticing sharp changes along this plot will
likely give us edges.

Figure 16: An image with intensity function and first derivative. Source: Lecture 5, slide 66

Image Gradient
The gradient of an image has been defined as the following:

∇f = [ ∂f/∂x, ∂f/∂y ]

while its direction has been defined as:

θ = tan⁻¹( (∂f/∂y) / (∂f/∂x) )

Figure 17: The gradient vector directions. Source: Lecture 5, slide 67

The gradient vectors point in the direction of the most rapid increase in intensity. At a vertical edge, for example, the most rapid change of intensity occurs in the x-direction.

Figure 18: The gradients as applied to the image of a tiger. Source: Lecture 5, slide 68

Effects of Noise
If there is excessive noise in an image, the partial derivatives will not
be effective for identifying the edges.
In order to account for the noise, the images must first be smoothed.
This is a process in which pixel values are recalculated so that they
more closely resemble their neighbors. The smoothing is achieved by
means of convolving the image with a filter (e.g., gaussian kernel).
There are, of course, some concerns to keep in mind when smooth-
ing an image. The image smoothing does remove noise, but it also
blurs the edges; the use of large filters can result in the loss of edges
and the finer details of the image.

Figure 19: The derivative of an edge in a noisy image. Source: Steve Seitz; Lecture 5, slide 70

Figure 20: Smoothing with different filters and filter sizes. Source: Steve Seitz; Lecture 5, slide 75

The image smoothing facilitates the process of edge detection. After smoothing an image f, the search for peaks is initiated by calculating f ∗ (d/dx)g, with smoothing kernel g.

Gaussian Blur

The Gaussian blur is the result of blurring an image by a Gaussian


function to reduce image noise. It is a low-pass filter, meaning it
attenuates high frequency signals. You will generally use Gaussian
blurring as a preliminary step.
One dimension:

G(x) = (1 / (√(2π) σ)) e^( −x² / (2σ²) )

Two dimensions:

G(x, y) = (1 / (2πσ²)) e^( −(x² + y²) / (2σ²) )
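
A minimal sketch of building and applying a sampled Gaussian kernel (NumPy/SciPy assumed; the kernel size and σ below are illustrative):

    import numpy as np
    from scipy.signal import convolve2d

    def gaussian_kernel(size=5, sigma=1.0):
        # Sampled, normalized 2D Gaussian centered on the kernel.
        ax = np.arange(size) - (size - 1) / 2.0
        xx, yy = np.meshgrid(ax, ax)
        kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
        return kernel / kernel.sum()

    def gaussian_blur(image, size=5, sigma=1.0):
        return convolve2d(image.astype(float), gaussian_kernel(size, sigma),
                          mode='same', boundary='symm')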

Designing a Good Edge Detector

An optimal edge detector must have certain qualities:

1. Good detection

• It must minimize the probability of detecting false positives


(which are spurious edges, generally caused by noise) and
false negatives (missing real edges, which can be caused by
smoothing, among other things). If it detects something as an
edge, it should be an edge.

2. Good localization

• The detected edges must be as close as possible to the actual


edges in the original image. The detector must identify where
the edges occur and pinpoint the exact location of the edge;
it must also be consistent in determining which pixels are
involved in each edge.

3. Single response

• It must minimize the number of local maxima around the true


edge (returning one point only for each true edge point). It
should tell you that there is one very specific edge instead of
splitting one edge into multiple detected edges. In other words,
only the real border of the edge is captured; other possibilities
are suppressed.

Figure 21: Sample problems of bad edge detectors. Source: Lecture 5, slide 84

Edge Detection

Motivation for Edge Detection


Edge detection is extremely relevant for mammalian eyes. Certain
neurons within the brain are adept at recognizing straight lines.
The information from these neurons is put together in the brain for
recognition of objects. In fact, edges are so useful for recognition in

humans, line drawings are almost as recognizable as the original image (Fig. 1). We would like to be able to extract information, recognize objects, and recover the geometry and viewpoint of an image.

Fig. 1. Certain areas of the brain react to edges; the line drawings are as recognizable as the original image; image source:¹⁴

¹⁴ Dirk B. Walther, Barry Chai, Eamon Caddigan, Diane M. Beck, and Li Fei-Fei. Simple line drawings suffice for functional mri decoding of natural scene categories. Proceedings of the National Academy of Sciences, 108(23): 9661–9666, 2011

Edge Basics

There are four possible sources of edges in an image: surface normal discontinuity (surface changes direction sharply), depth discontinuity (one surface behind another), surface color discontinuity (single surface changes color), and illumination discontinuity (shadows/lighting). These discontinuities are demonstrated in Fig. 2a; different types of edges can be seen in Fig. 2b:

Fig. 2. Different types of edges due to discontinuities in surface


color, surface depth, and surface normal (source: lecture notes)

Edges occur in images when the magnitude of the gradient is high

Finding the Gradient


In order to find the gradient, we must first find the derivatives in
both the x and y directions

Discrete Derivatives
df(x)/dx = lim_{δx→0} ( f(x) − f(x − δx) ) / δx = f′(x)
df(x)/dx = ( f(x) − f(x − 1) ) / 1 = f′(x)
df(x)/dx = f(x) − f(x − 1) = f′(x)
It is also possible to take the derivative three different ways

• Backward: f′(x) = f(x) − f(x − 1)

• Forward: f′(x) = f(x + 1) − f(x)

• Central: f′(x) = ( f(x + 1) − f(x − 1) ) / 2

Each of these can also be represented as a filter (convolving the filter with the image gives the derivative):

• Backward: f′(x) = f(x) − f(x − 1) → [0, 1, −1]

• Forward: f′(x) = f(x) − f(x + 1) → [−1, 1, 0]

• Central: f′(x) = f(x + 1) − f(x − 1) → [1, 0, −1]

The gradient (∇f) can be calculated as follows:

∇f(x, y) = [ ∂f(x, y)/∂x, ∂f(x, y)/∂y ]ᵀ = [ f_x, f_y ]ᵀ

We can also calculate the magnitude and the angle of the gradient:

|∇f(x, y)| = √( f_x² + f_y² )
θ = tan⁻¹( f_y / f_x )


Reducing noise
Noise will interfere with the gradient, making it impossible to find
edges using the simple method, even though the edges are still
detectable to the eye. The solution is to first smooth the image.

Let f be the image and g be the smoothing kernel. Thus, in order


to find the smoothed gradient, we must calculate (1D example):

(d/dx)( f ∗ g )

By the derivative theorem of convolution:

(d/dx)( f ∗ g ) = f ∗ (d/dx)g
This simplification saves us one operation. Smoothing removes noise
but blurs edges. Smoothing with different kernel sizes can detect
edges at different scales

Sobel Edge Detector


This algorithm utilizes two 3 × 3 kernels which, once convolved with the image, approximate the x and y derivatives of the original image.

G_x = [ 1  0  −1 ]        G_y = [  1   2   1 ]
      [ 2  0  −2 ]              [  0   0   0 ]
      [ 1  0  −1 ]              [ −1  −2  −1 ]

These matrices represent the result of smoothing and differentiation:

G_x = [ 1  0  −1 ]   =   [ 1 ]
      [ 2  0  −2 ]       [ 2 ]  ·  [ 1  0  −1 ]
      [ 1  0  −1 ]       [ 1 ]

The Sobel Filter has many problems, including poor localization. The
Sobel Filter also favors horizontal and vertical edges over oblique
edges
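
A sketch of applying the Sobel kernels to estimate gradient magnitude and direction (NumPy/SciPy assumed; correlate2d applies the kernels as written):

    import numpy as np
    from scipy.signal import correlate2d

    sobel_x = np.array([[1, 0, -1],
                        [2, 0, -2],
                        [1, 0, -1]], dtype=float)
    sobel_y = np.array([[1, 2, 1],
                        [0, 0, 0],
                        [-1, -2, -1]], dtype=float)

    def sobel_edges(image):
        gx = correlate2d(image.astype(float), sobel_x, mode='same', boundary='symm')
        gy = correlate2d(image.astype(float), sobel_y, mode='same', boundary='symm')
        return np.sqrt(gx ** 2 + gy ** 2), np.arctan2(gy, gx)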

Canny Edge Detector


The Canny Edge Detector has five algorithmic steps:

• Suppress noise

• Compute gradient magnitude and direction

• Apply non-maximum suppression

• Hysteresis thresholding

• Connectivity analysis to detect edges

Suppress noise We can both suppress noise and compute the deriva-
tives in the x and y directions using a method similar to the Sobel
filter.

Compute gradient magnitude and direction From above,

|∇f(x, y)| = √( f_x² + f_y² )

θ = tan⁻¹( f_y / f_x )


Apply non-maximum suppression The purpose of this portion of the


algorithm is to make sure the edges are specific. Thus, we assume
that the edge occurs when the gradient reaches a maximum. We
suppress any pixels that have a non-maximum gradient.
Basically, if the pixel is not the largest of the three pixels in the
direction and opposite the direction of its gradient, it is set to 0.
Furthermore, all gradients are rounded to the nearest 45°.

Hysteresis thresholding All remaining pixels are subjected to hys-


teresis thresholding. This part uses two values, for the high and
low thresholds. Every pixel with a value above the high threshold
is marked as a strong edge. Every pixel below the low threshold is
set to 0. Every pixel between the two thresholds is marked as a weak
edge

Connectivity analysis to detect edges The final step is connecting the


edges. All strong edge pixels are edges. For weak edge pixels, only
the weak edge pixels that are linked to strong edge pixels are edges.
The part uses BFS or DFS to find all the edges.
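
In practice these five steps are usually run through an existing implementation; the sketch below uses OpenCV's cv2.Canny (the file name, blur size, and the two hysteresis thresholds are illustrative placeholders):

    import cv2

    image = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder path
    blurred = cv2.GaussianBlur(image, (5, 5), 1.4)           # step 1: suppress noise
    edges = cv2.Canny(blurred, 100, 200)                     # remaining steps are internal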

Hough Transforms

Intro to Hough Transform

Hough Transform is a way to detect particular structures in images,


namely lines. However, the Hough transform can be used to detect any structure whose parametric equation is known. It gives a robust detector under noise and partial occlusion.

Goal of Hough Transform for detecting lines

Hough transform can be used to detect lines in images. To do this,


we want to locate sets of pixels that make up straight lines in the
image. This works to detect lines in an image after an edge detector
is applied to get the pixels of just the edges (and thus we find which
sets of those pixels make up straight lines).

Detecting lines using Hough Transform in a,b space

Say we have a point (x_i, y_i). There are infinitely many lines that could pass through this point. We can define a line that passes through this pixel (x_i, y_i) as

y_i = a · x_i + b

Using this, we can transform each pixel into (a, b) space by re-writing this equation as:

b = −a · x_i + y_i

This equation represents a line in (a, b) space, and each (a, b) point on the line represents a possible line passing through our point (x_i, y_i). Thus, for each pixel (x_i, y_i) in our set of edge pixels, we transform it into (a, b) space to get a line.

Fig. 3. The transformation from the original space to the Hough


space; source: lecture slides
The intersection of lines in (a, b) space represents the (a, b) values that comprise a line y_i = a · x_i + b passing through those points.
Example: Say we have two points (x_1, y_1) = (1, 1) and (x_2, y_2) = (2, 3). We transform these points into (a, b) space with the lines b = −a · 1 + 1 and b = −a · 2 + 3. Solving for the intersection of these two lines gives us a = 2 and b = −1. This intersection point in (a, b) space gives us the values for the line that goes through both points in (x, y) space.

Fig. 4. The lines passing through a point in the original space;


source: lecture slides

Accumulator Cells

In order to get the "best" lines, we quantize the a, b space into cells.
For each line in our a, b space, we add a "vote" or a count to each cell
that it passes through. We do this for each line, so at the end, the
cells with the most "votes" have the most intersections and therefore
should correspond to real lines in our image.
The algorithm for Hough transform in a, b space is as follows:

Fig. 5. The Hough transform algorithm; source: lecture slides

Hough transform in ρ, θ space


A problem with using a, b space to represent lines is that they are
limited and cannot represent vertical lines. To solve this, we use polar
coordinates to represent lines. For a pixel (x_i, y_i), we transform it using the equation:

x_i · cos θ + y_i · sin θ = ρ

In ρ, θ space, points are not represented as lines but instead as sine


wave-like functions.

Fig. 6. The Hough transform in ρ, θ space; source: lecture slides


The intersection of these functions in ρ, θ space still correspond to
the ρ, θ that comprise a line passing through those points.
So, for each pixel xi , yi we transform it into a function in ρ, θ space.
We apply the same accumulator cell algorithm to count the most
intersections of functions.
In this case, we quantize our ρ, θ space into cells, and add a "vote"
to each cell that our function passes through. The cells with the most
votes are the most likely real lines in our image.
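
A minimal sketch of this voting scheme in NumPy (the bin counts and the binary edge map are illustrative assumptions; peaks of the returned accumulator correspond to detected lines):

    import numpy as np

    def hough_lines(edge_map, num_thetas=180, num_rhos=200):
        # Accumulate votes in (rho, theta) space for a binary edge map.
        h, w = edge_map.shape
        diag = np.sqrt(h ** 2 + w ** 2)
        thetas = np.linspace(-np.pi / 2, np.pi / 2, num_thetas)
        rhos = np.linspace(-diag, diag, num_rhos)
        accumulator = np.zeros((num_rhos, num_thetas), dtype=int)

        ys, xs = np.nonzero(edge_map)            # coordinates of edge pixels
        for x, y in zip(xs, ys):
            for t_idx, theta in enumerate(thetas):
                rho = x * np.cos(theta) + y * np.sin(theta)
                r_idx = np.argmin(np.abs(rhos - rho))   # nearest rho cell
                accumulator[r_idx, t_idx] += 1          # cast a vote
        return accumulator, rhos, thetas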

Concluding Remarks
The advantages of the Hough transform are that it is conceptually simple (just transforming and finding intersections in Hough space). It is also fairly easy to implement, and it can handle missing and occluded data well. Another advantage is that it can find structures other than lines, as long as the structure has a parametric equation.
Some disadvantages include that it gets more computationally complex the more parameters you have. It can also only look for one kind of structure (so not lines and circles together). The length and the position of a line segment cannot be detected by this. It can be fooled by "apparent" lines, and co-linear line segments cannot be separated.

RANSAC

With the increase in model complexity (i.e., the number of parame-


ters), the Hough transform loses its effectiveness; this section elab-
orates on the design of the RAndom Sample Consensus (RANSAC)
technique¹⁵ which provides a computationally efficient means of fitting models in images. We begin with an introduction of the RANSAC's basic idea and then introduce its algorithm.

¹⁵ Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981
Features and Fitting

Introduction

This lecture covers edge detection, Hough transformations, and


RANSAC. The detection of edges provides meaningful semantic
information that facilitate the understanding of an image. This can
help analyzing the shape of elements, extracting image features, and
understanding changes in the properties of depicted scenes such as
discontinuity in depth, type of material, and illumination, to name
a few. We will explore the application of Sobel and Canny edge de-
tection techniques. The next section introduces the Hough transform,
used for the detection of parametric models in images; for example, the detection of straight lines, defined by two parameters, is made
possible by the Hough transform. Furthermore, this technique can be
generalized to detect other shapes (e.g., circles). However, as we will
see, the use of Hough transform is not effective in fitting models with
a high number of parameters. To address this model fitting problem,
the random sampling consensus (RANSAC) is introduced in the last
section; this non-deterministic approach repeatedly samples subsets
of data, uses them to fit the model, and classifies the remaining data
points as "inliers" or "outliers" depending on how well they can be
explained by the fitted model (i.e., their proximity to the model). The
result is used for a final selection of data points used in achieving
the final model fit. A general comparison of RANSAC and Hough
transform is also provided in the last section.

Introduction to RANSAC Basics


The RANSAC algorithm is used for estimating the parameters
of models in images (i.e., model fitting). The basic idea behind
RANSAC is to solve the fitting problem many times using randomly
selected minimal subsets of the data and choosing the best perform-
ing fit. To achieve this, RANSAC tries to iteratively identify the data
points that correspond to the model we are trying to fit.
Fig. 7a illustrates an example where a linear model (i.e., 2 parame-

ters) is to be fitted to the data; while the majority of data points fit a
linear model, the two points in the top right corner can significantly
affect the accuracy of overall fit (if they are included in the fit). The
RANSAC algorithm aims to address this challenge by identifying the
"inliers" and "outliers" in the data.
RANSAC randomly selects samples of the data, with the assumption that if enough samples are chosen, there will be a low probability that all of the samples provide a bad fit.

Fig. 7. a) the outliers are detected by RANSAC to improve parameter estimation; b) RANSAC application for image stitching; sources (¹⁶ and Derek Hoiem)

¹⁶ Simon JD Prince. Computer vision: models, learning, and inference. Cambridge University Press, 2012
Applications
The RANSAC algorithm can be used to estimate parameters of
different models; this is proven beneficial in image stitching (Fig. 6b),
outlier detection, lane detection (linear model estimation), and stereo
camera calculations.

The Algorithm
The RANSAC algorithm iteratively samples minimal subsets of the original data (e.g., 2 points for line estimation); the model is fitted
to each sample, and the number of "inliers" corresponding to this
fit is calculated; this includes the data points that are close to the
fitted model. The points closer than a threshold (e.g., 2 standard
deviation, or a pre-determined number of pixels) are considered
"inliers". The fitted model is considered good if a big fraction of the
data is considered as "inliers" for that fit. In the case of a good fit, the
model is re-fitted using all the inliers, and the outliers are discarded.
This process is repeated, and model estimates with a big enough
fraction of inliers (e.g., bigger than a pre-specified threshold) are
compared to choose the best-performing fit. Fig. 8 illustrates this
process for a linear model and its three samples. The third sample
(Fig. 8c) provides the best fit as it includes the most number of

inliers.

Fig. 8. The demonstration of the RANSAC algorithm for a linear model estimation and three random samples; source (¹⁷)

¹⁷ Simon JD Prince. Computer vision: models, learning, and inference. Cambridge University Press, 2012

Fig. 9. The pseudocode for RANSAC; source (¹⁸)

¹⁸ David Forsyth and Jean Ponce. Computer vision: a modern approach. Upper Saddle River, NJ; London: Prentice Hall, 2011

The RANSAC is detailed in Fig. 9. The major steps included in the RANSAC loop:

1. Randomly select a seed group from data.

2. Perform parameter estimation using the selected seed group.

3. Identify the inliers (points close to the estimated model).

4. If there exists a sufficiently large number of inliers, re-estimate the model using all inliers.

5. Repeat steps 1-4 and finally keep the estimate with the most inliers and the best fit (a minimal sketch of this loop is given below).
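
The sketch below is one illustrative NumPy implementation of this loop for a 2D line y = a·x + b (the iteration count, distance threshold, and minimum inlier count are assumptions to be tuned per problem):

    import numpy as np

    def ransac_line(points, num_iters=100, threshold=1.0, min_inliers=10):
        # points: an (N, 2) array of (x, y) coordinates.
        best_inliers = np.zeros(len(points), dtype=bool)
        for _ in range(num_iters):
            # 1. Randomly select a minimal seed group (2 points define a line).
            i, j = np.random.choice(len(points), size=2, replace=False)
            (x1, y1), (x2, y2) = points[i], points[j]
            if np.isclose(x1, x2):
                continue                      # skip degenerate (vertical) samples
            # 2. Fit the candidate line to the seed group.
            a = (y2 - y1) / (x2 - x1)
            b = y1 - a * x1
            # 3. Identify the inliers: points close to the candidate line.
            residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
            inliers = residuals < threshold
            if inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # 4. Re-estimate with a least-squares fit over all inliers.
        if best_inliers.sum() >= min_inliers:
            a, b = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], deg=1)
            return a, b, best_inliers
        return None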

How Many Samples are Needed?


The RANSAC is a non-deterministic model fitting approach; this
means that the number of samples need to be large enough to pro-
vide a high confidence estimate of parameters. The number of re-

quired samples depends on 1) the number of fitted parameters and


2) the amount of noise. Fig. 10 lists the minimum number of sam-
ples needed based on p=0.99 and variations of sample size (i.e., the
number of parameters) and the fraction of outliers (i.e., noise). More
samples are needed for estimating bigger models and noisier data. A
large enough number of samples (k) need to be chosen such that a
low probability of only seeing bad samples (Pf ) is guaranteed:

P_f = (1 − Wⁿ)ᵏ = 1 − p

where W and n are respectively the fraction of inliers and the number of points needed for model fitting. The minimum number of samples:

k = log(1 − p) / log(1 − Wⁿ)

Fig. 10. The number of samples for different choices of noise population and model size (source: David Lowe)
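
This formula can be evaluated directly; a small illustrative helper in Python:

    import numpy as np

    def ransac_num_samples(inlier_fraction, sample_size, p=0.99):
        # k = log(1 - p) / log(1 - W^n), rounded up to the next whole sample.
        return int(np.ceil(np.log(1 - p) / np.log(1 - inlier_fraction ** sample_size)))

    # e.g., fitting a line (n = 2) when half of the data are outliers (W = 0.5):
    k = ransac_num_samples(inlier_fraction=0.5, sample_size=2)   # 17 samples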

Advantages, Limitations, and Considerations


The advantages of the RANSAC lie in its simple implementation and
wide application range in the model fitting domain. Other advan-
tages include its computational efficiency; the sampling approach
provides a better alternative to solving the problem for all possible
combinations of features.
In some cases, it’d be more efficient to use the Hough transform
instead of the RANSAC:

1. The number of parameters are small; for example, linear model


estimation (2 parameters) can be achieved efficiently using Hough

transform, while image stitching requires a more computationally


frugal approach such as RANSAC.

2. If the noise population is high; as we saw earlier, increase in noise


requires a more extensive sampling approach (higher number of
samples), increasing computation cost. Increased noise reduces the
chances of correct parameter estimation and the accuracy of inlier
classification.

The poor performance on highly noisy data is a primary limitation of the RANSAC; this is especially crucial as real-world problems have a high rate of outliers.

RANSAC

Goal
RANdom SAmple Consensus (RANSAC) is used for model fitting
in images (e.g., line detection); it can be extremely useful for object
identification, among other applications. It is often more effective
than pure edge detection which is prone to several limitations: edges
containing extra points due to the noise/clutter, certain parts of the
edges being left out, and the existence of noise in measured edges’
orientation.

Motivation
One of the primary advantages of RANSAC is that it is relatively
efficient and accurate even when the number of parameters is high.
However, it should be noted that RANSAC is likely to fail or produce
inaccurate results in images with a relatively large amount of noise.

General Approach
The intuition for RANSAC is that by randomly sampling a group of
points in an edge and applying a line of best fit to those points many
times, we have a high probability of finding a line that fits the points
very well. Below is the general process "RANSAC loop":

1. Randomly select a seed group of points from the overall group of


points you are trying to fit with a line.

2. Compute a line of best fit among the seed group. For example, if
the seed group is only 2 distinct points, then it is clear to see that
there is only one line that passes through both points, which can
be determined with relative ease from the points’ coordinates.

3. Find the number of inliers to this line by iterating over each point in the data set and calculating its distance from the line; if it is less than a (predetermined) threshold value, it counts as an inlier. Otherwise, it is counted as an outlier.

4. If the number of inliers is sufficiently large, conclude that this


line is "good", and that at least one line exists that includes the
inliers tallied in the previous step. To further improve the line,
re-compute the line using a least-squares estimate using all of
the inliers that were within the threshold distance. Keep this
transformation as the line that best approximates the data.

Drawbacks
The biggest drawback to RANSAC is its inefficient handling of highly
noisy data; an increase in the fraction of outliers in a given data set
results in an increase in the number of samples required for model
fitting (e.g., line of best fit). More importantly, the noisier an image is,
the less likely it is for a line to ever be considered sufficiently good at
fitting the data. This is a significant problem because most real world
problems have a relatively large proportion of noise/outliers.

Local Invariant Features

Motivation
The local invariant image features and their descriptors are used in
a wide range of computer vision applications; they include, but are
not limited to, object detection, classification, tracking, motion esti-
mation, panorama stitching, and image registration. The previously discussed methods, such as cross-correlation, are not effective and robust in many such applications. This method works by finding local,
distinctive structures within an image (i.e., features), and it describes
each feature using the surrounding region (e.g., a small patch cen-
tered on the detected feature). This "local" representation of image
features (as opposed to a "global" one, such as cross-correlation)
provides a more robust means of addressing the above-mentioned
computer vision problems; such strategy is invariant to object rota-
tions, point of view translations, and scale changes.

General Approach
The general approach for employing local invariant features is de-
tailed below:

1. Find and define a set of distinctive keypoints.



2. Define a local region around the keypoint.

3. Extract and normalize the regional content from the area.

4. Compute a local descriptor from the normalized region (i.e., a


function of pixel intensity).

5. Match local descriptors.

Figure 1: The local features are identified within similar pictures;


descriptors are calculated for normalized patches and compared to
each other for quantifying the similarity. Source: Lecture 6 slides.

Requirements
Good local features should have the following properties:

1. Repeatability: given the same object or scene under different image


conditions such as lighting or change in viewpoint, a high amount
of features should be detectable in both images being compared.
In other words, the feature detection should be invariant to rota-
tions and viewpoint changes and robustly handle lighting changes,
noise, and blur.

2. Locality: features should be local to avoid issues caused by occlu-


sion and clutter.

3. Quantity: enough features should be chosen to ensure the accuracy


of techniques relying on descriptors such as object detection.

4. Distinctiveness: results need to contain "salient" image features


that show a large amount of variation; this ensures that the de-
tected features can be distinguished from each other and properly
matched between different images.

5. Efficiency: feature matching in new images should be conducive to


real-time applications.

Keypoint Localization

Motivation
The goal of keypoint localization is to detect features consistently
and repeatedly, to allow for more precise localization, and to find
interesting content within the image.

General Approach
We will look for corners since they are repeatable and distinctive in
the majority of images. To find corners, we look for large changes in
intensity in all directions. To provide context, a "flat" region would
not have change in any direction, and an edge would have no change
along the direction of the edge. We will find these corners using the
Harris technique.

Harris Detector
The intuition behind Harris detector is that if a window (w) slides
over an image, the change in the intensity of pixel values caused
by the shift is highest at corners. This is because change in pixel
intensity is observed in both directions (x and y) at corners, while it
is limited to only one direction at the edges, and it is negligible at flat
image regions. To calculate the change of intensity due to the shift
[u, v]:
E(u, v) = ∑_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²

To find corners, we must maximize this function. Taylor Expansion is


then used in the process to get the following equation:

E(u, v) = [u  v] M [u  v]ᵀ

where we have M defined as:

M = ∑_{x,y} w(x, y)  [ I_x I_x   I_x I_y ]
                     [ I_x I_y   I_y I_y ]

This matrix reveals:

M = [ ∑ I_x I_x   ∑ I_x I_y ]
    [ ∑ I_x I_y   ∑ I_y I_y ]

which, written in its eigenbasis, becomes

[ λ₁   0 ]
[ 0   λ₂ ]

Corners have both large and similar eigenvalues, whereas edges


have one significantly greater eigenvalue, and flat regions have
two small eigenvalues. Because the calculation of eigenvalues is
computationally intensive, an alternative approach is employed
where the Corner Response Function (θ) is used to compute a score
for each window:
θ = det(M) − α · trace(M)²
where α is a constant around 0.04 − 0.06.

Figure 2: The visualization of theta thresholds from Corner Response


Function (θ) indicating corner, edge, or flat regions. Source: Lecture 6
slides

This is not rotation invariant. To allow for rotation invariance, we


will smooth with a Gaussian, where the Gaussian already performs the weighted sum:

M = g(σ) ∗ [ I_x I_x   I_x I_y ]
           [ I_x I_y   I_y I_y ]
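
A sketch of computing this Gaussian-weighted corner response with NumPy/SciPy (the derivative filters, σ, and α below are illustrative choices):

    import numpy as np
    from scipy.ndimage import sobel, gaussian_filter

    def harris_response(image, sigma=1.0, alpha=0.05):
        ix = sobel(image.astype(float), axis=1)   # horizontal derivative I_x
        iy = sobel(image.astype(float), axis=0)   # vertical derivative I_y
        # Gaussian-weighted sums of gradient products: the entries of M.
        ixx = gaussian_filter(ix * ix, sigma)
        iyy = gaussian_filter(iy * iy, sigma)
        ixy = gaussian_filter(ix * iy, sigma)
        # Corner response theta = det(M) - alpha * trace(M)^2 at every pixel.
        det_m = ixx * iyy - ixy ** 2
        trace_m = ixx + iyy
        return det_m - alpha * trace_m ** 2

Pixels whose response exceeds a chosen threshold (and that are local maxima) are kept as corners.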
Illustrated below is an example of keypoints identified by the Harris detector.

Figure 3: An example of Harris keypoint detection on an image.


Source: Lecture 6 slides
Feature Descriptors

Scale Invariant Keypoint Detection

Motivation

Earlier we used the Harris detector to find keypoints or corners. This


detector used small windows in order to maintain good locality.
Since the Harris detector uses this small window, if an image is
scaled up, the window now no longer sees the same gradients it had
with a smaller image.

Figure 4: Harris detector windows on an increased scale. Source:¹⁹

¹⁹ OpenCV 3.1.0 Documentation: https://docs.opencv.org/3.1.0/. URL https://docs.opencv.org/3.1.0/sift_scale_invariant.jpg

Looking at the above image, we can see how the three windows on the right no longer see the sharp gradient in the x and y directions. All three windows are, therefore, classified as edges. In order to address this problem, we need to normalize the scale of the detector.

Solution

We design a function to be scale-invariant, meaning that the function


values for the corresponding regions are the same regardless of the
scale. We can use a circle to represent this scalable function. A point
on the circle is a function of the region size of the circle’s radius.

General Approach
We can find the local maximum of a function. Relative to the local
maximum, the region size should be the same regardless of the scale.
This also means that the region size is co-variant with the image scale.
A "good" function results in a single and distinct local maximum.
In general, we should use a function that responds well to stark
contrasts in intensity.

Figure 5: The examples of different functions for finding local


maximums. Source: Lecture 7 slides.

The function is defined as: f = kernel ∗ image. Two such kernels


include the Laplacian and the Difference of Gaussians (DoG).

L = σ2 ( Gxx ( x, y, σ ) + Gyy ( x, y, σ ))

DoG = G ( x, y, kσ ) − G ( x, y, σ )

G(x, y, σ) = (1 / (2πσ²)) e^( −(x² + y²) / (2σ²) )
Both these kernels are scale and rotation invariant.

Scale invariant keypoint detection

Motivation
Thus far, we have covered the detection of keypoints in single images,
but broader applications require such detections across similar
images at vastly different scales. For example, we might want to
search for pedestrians from the video feed of an autonomous vehicle
without the prior knowledge of the pedestrians’ sizes. Similarly, we
might want to stitch a panorama using photos taken at different
scales. In both cases, we need to independently detect the same
keypoints at those different scales.

General methodology
Currently, we use windows (e.g., in Harris Corner Detection) to
detect keypoints. Using identically sized windows will not enable
the detection of the same keypoints across different-sized images
Figure 22: The corner of a curve appears at two scales. Note that the circular window on the right curve captures the entire corner, while the same-sized window on the left curve does not. Instead, we must choose a much larger circular window on the left curve to get the same information. Source: Lecture 7, slide 12.

(Figure 1). However, if the windows are appropriately scaled, the same content can be captured.
How do we independently find the correctly scaled windows for each image? We need to describe what it means to "capture the same content" in a scale-invariant way. More specifically, consider a function, f(window), that takes in a region of content and outputs the same value for all scales of that region.
Now consider two similar images at different scales. We can
independently vary window size for each of the images and plot the
response of f (window) as a function of window size:

Figure 23: Two plots of the response of f(window) as a function of window size for Images 1 and 2, where Image 2 is similar to Image 1 but scaled by 1/2. Source: Lecture 7, slide 15.

Within each of the two plots, we can independently identify local extrema as keypoints. The window sizes that correspond to those extrema (in the case of Figure 2, s₁ and s₂) provide us with the scale difference between the two images.

Average intensity One candidate for such a function f (window) is


the average intensity of the pixels within the window; this is because
average intensity does not change as we scale the window up or
down.
However, the average intensity is not great at capturing contrast
or sharp changes within a window; this makes it harder to find clear
extrema when comparing f across two images. To capture contrast,
we need to bring in derivatives into the mix.

Difference of Gaussians Another candidate would be to use the


Difference of Gaussians method.
Consider an image I. First, the I is repeatedly convolved with
Gaussian filters of different σ’s. These convolutions are repeated
with scaled down (i.e., down-sampled) versions of I. This results
in the pyramid of Gaussians of different σ’s and different image

sizes (Figure 3). The adjacent Gaussian-convolved images are then


subtracted to calculate their difference of Gaussians (DOG):

DOG (σ) = ( G (kσ ) − G (σ )) ∗ I

Figure 24: On the left: pyramid of Gaussians of different σ's and different image sizes. On the right: difference of adjacent Gaussians. Source: http://aishack.in/tutorials/sift-scale-invariant-feature-transform-log-approximation/
Intuitively, these differences of Gaussians capture details about I
at different scales. More specifically, the difference of two Gaussians
σ1 and σ2 remove all details that appear at both σ1 and σ2 and keep
only those details that appear between σ1 and σ2 . The differences of
Gaussians for small and large σ’s respectively capture fine and coarse
details.
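
A sketch of one octave of this difference-of-Gaussians stack (SciPy assumed; the base σ, the ratio k, and the number of levels are illustrative):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_stack(image, sigma=1.6, k=2 ** 0.5, num_levels=4):
        # DoG_i = G(k^(i+1) * sigma) * I  -  G(k^i * sigma) * I
        blurred = [gaussian_filter(image.astype(float), sigma * k ** i)
                   for i in range(num_levels + 1)]
        return [blurred[i + 1] - blurred[i] for i in range(num_levels)]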
Given the differences of Gaussians pyramid in x-y-scale space, we
can now identify local extrema within that 3D space to identify both
keypoints and their associated scales. We compare a given coordinate
against its 26 neighbors (in 3D space) and deem it an extrema if it is
smaller or larger than all neighbors (see Figure 4).

Figure 25: Given a coordinate in x-y-scale space (denoted by the black X), examine its 26 neighbors (denoted by the green circles) to determine if the original coordinate is a local extrema. Source: Lecture 7, slide 22

Harris-Laplacian A third candidate is to use the Harris-Laplacian method [2], shown to be more effective at scale-invariant keypoint detection than the DoG but also potentially more computationally expensive.
Consider again an image I. First, we create multiple scales of I and
run the Harris detector on each to localize keypoints per scale level.
We then select the keypoints that maximize the Laplacian across all
the scales.

Transition to scale-invariant descriptors: Now that we have several


methods to detect consistent keypoints across multiple scales, we

can move on to developing methods to describe those keypoints in a


scale-invariant manner, so that they can be matched.

SIFT: an image region descriptor

Invariant Local Features


Point descriptor should be invariant and distinctive. To achieve
robustness of point descriptors, we transform image content into
local feature coordinates that are invariant to translation, rotation,
scale, and other imaging parameters .
The advantages of invariant local features include:

• Locality: features describe parts and are robust to occlusion and


clutter (no prior segmentation).

• Distinctiveness: features are identifiable from a large database of


objects.

• Quantity: many features can be generated for even small objects.

• Efficiency: close to real-time performance.

• Extensibility: can easily be extended to wide range of differing


feature types, each adding robustness to changes.

Scale invariance

• The only reasonable scale-space kernel is a Gaussian. (Koenderink,


1984; Lindeberg, 1994)

• An efficient choice is to detect peaks in the difference of Gaussian


pyramid (Burt & Adelson, 1983; Crowley & Parker, 1984 - but
examining more scales)

• Difference-of-Gaussian with constant ratio of scales is a close


approximation to Lindeberg’s scale-normalized Laplacian (can be
shown from the heat diffusion equation)

Rotation invariance Given a keypoint and its scale from DoG,

1. Smooth (blur) the image associated with the keypoint’s scale.

2. Calculate the image gradients over the keypoint neighborhood.

3. Rotate the gradient directions and locations by negative keypoint


orientation. In other words, describe all features relative to the
orientation.

SIFT descriptor formation Using precise gradient locations is frag-


ile, so we want to produce a similar descriptor while allowing for
generalization. We create an array of orientation histograms and
place gradients into local orientation histograms of 8 orientation bins.
Dividing gradients into 8 bins is recommended, and this number of
bins was found to exhibit best performance through experimentation.
More concretely,

1. Create an array of orientation histograms

2. Put rotated gradients into local orientation histograms, where each


gradient contributes to the nearby histograms based on distance;
gradients far from center are scaled down. The SIFT authors [3]
found that the best results were achieved using 8 orientation bins
per histogram and a 4x4 histogram array (Figure 2).

3. Compare each vector between two images to find the matching


keypoints.

4. To add robustness to illumination changes in high contrast photos,


normalize the vectors before the comparison. This mitigates the
unreliable 3D illumination effects such as glare that are caused
by the very large image gradients; this is achieved by clamping
the values in vector to under 0.2 (an experimentally tuned value)
before normalizing again.
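
In practice, SIFT keypoints and descriptors are usually obtained from an existing implementation; the sketch below uses OpenCV (it assumes a build of OpenCV that ships SIFT, e.g. version 4.4 or later, and the image file names are placeholders):

    import cv2

    img1 = cv2.imread('scene_small.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder paths
    img2 = cv2.imread('scene_large.jpg', cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, desc1 = sift.detectAndCompute(img1, None)   # keypoints and 128-d descriptors
    kp2, desc2 = sift.detectAndCompute(img2, None)

    # Match descriptors with a brute-force matcher and a ratio test.
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(desc1, desc2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]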

Figure 26: This figure shows the percentage of correctly matched keypoints as a function of the width of the descriptor and of the number of histogram bins. [1]

HoG: Another image region descriptor

Histogram of Oriented Gradients


The histogram of oriented gradients (HOG) [4] descriptor finds an object within an image that pops out, i.e., an object that can be discriminated. The general algorithm for HOG proceeds as follows:

1. Divide the image window into small spatial regions or cells.

2. For each cell, accumulate a local histogram; group gradient di-


rections into evenly-spaced bins, and allocate the magnitude of
a pixel’s gradients into the appropriate bin corresponding to the
gradient direction (Figure 6).

3. Normalize the local histograms over a larger region, called a


"block", comprised of a number of cells.

Figure 27: Here we see a visual


example of keeping track of the
magnitudes of the gradients for
each gradient direction. Source:
Lecture 7, Slide 60.
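
A minimal sketch of steps 1 and 2 (gradient magnitudes voted into per-cell orientation histograms); the cell size and bin count are illustrative, and block normalization is left as a comment:

    import numpy as np

    def hog_cell_histograms(magnitude, direction, cell_size=8, num_bins=9):
        h, w = magnitude.shape
        n_rows, n_cols = h // cell_size, w // cell_size
        histograms = np.zeros((n_rows, n_cols, num_bins))
        # Map each (unsigned) gradient direction in [0, pi) to an orientation bin.
        bins = ((direction % np.pi) / np.pi * num_bins).astype(int) % num_bins
        for i in range(n_rows):
            for j in range(n_cols):
                rs, cs = i * cell_size, j * cell_size
                cell_mag = magnitude[rs:rs + cell_size, cs:cs + cell_size]
                cell_bin = bins[rs:rs + cell_size, cs:cs + cell_size]
                for b in range(num_bins):
                    histograms[i, j, b] = cell_mag[cell_bin == b].sum()
        # Block normalization (e.g., over 2x2 groups of cells) would follow here.
        return histograms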

There are a few downsides to using HOG:

1. Large variations and ranges when detecting

2. Very slow

3. Not very organized when the backgrounds have different illumina-


tions

Despite these downsides, HOG can be quite effective. Note the


results of applying HOG in the image below.

Figure 28: HoG applied to a


bicycle. Source: Lecture 7, Slide
65.

Difference between HoG and SIFT


There are some minor differences between the two. HOG is used
over an entire image to find gradients. SIFT is used for key point
matching. The SIFT histograms orient towards the natural positive
gradient while HOG does not. HOG uses neighborhood bins, while
SIFT uses weights to compute varying descriptors.
Image Resizing

Image resizing with seam carving

Because there are different screen sizes, we need to resize content


according to display capabilities. Normally, we try to force content
to fill up any kind of display by stretching or shrinking an image.
However, this produces less than desirable results. So what is the
solution? Content-aware re-targeting operators.
Retargeting means that we take an input and “re-target” to a
different shape or size. Imagine input as being an image of size n × m and the desired output as an image of size n′ × m′. The idea behind
retargeting is to
1. adhere to geometric constraints (e.g., aspect ratio),

2. preserve important content and structures, and

3. limit artifacts.
However, what is considered “important” is very subjective, for what
may be important to one observer may not be important to another.

Pixel energy
A way to decide what is considered “important” is using saliency
measures. There are many different types of saliency measures,
but the concept is the same: each pixel p has a certain amount of
“energy” that can be represented by the function E( p).
The concept is that pixels with higher energy values are more
salient, or more important, than pixels with lower energy values.
What actually goes into the heart of E is up to the beholder.
A good example is to use the gradient magnitude of pixel p to
heavily influence E( p), for this usually indicates an edge. Since
humans are particularly receptive to edges, this is a part of the image
that is potentially valuable and interesting, compared to something
that has a low gradient magnitude. As a result, this preserves strong
contours and is overall simple enough to produce nice results. This
example of E for image I could be represented as
E(I) = |∂I/∂x| + |∂I/∂y|

Seam carving

Let’s assume that we have an input image with resolution m × n and


we are looking for an output image m × n′, where n′ < n. How do we
know what pixels to delete? We can use this concept of pixel energy
to identify paths of adjacent pixels, or seams, that have the lowest
combined pixel energy to remove from the image.
Note that the seams need not be strictly rows and columns. In
fact, most of the time seams are curves that go through an image
horizontally or vertically. A seam is horizontal if it reaches from
the bottom edge to the top edge of an image. Similarly, a seam is
vertical if it reaches from the left edge to the right edge of an image.
However, seams are always laid out in a way such that there is only
one pixel per row if the seam is vertical, or only one pixel per column
if the seam is horizontal.
In essence, a seam avoids all important parts of an image when
choosing what to remove from the image so as to cause the least
disruption to the image when removed. There are more things to
consider regarding seam carving and use cases that can be improved
with similar techniques, but this is the core idea of how seams oper-
ate.

References

[1] David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.

[2] Mikolajczyk, Krystian. "Detection of Local Features Invariant to Affine Transformations." Perception Group, Institut National Polytechnique de Grenoble, INRIA Grenoble Rhône-Alpes, 2002.

[3] Lowe, David G. "Object recognition from local scale-invariant features." Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on. Vol. 2. IEEE, 1999.

[4] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005.

Seam Carving

Basic Idea
The human vision is more sensitive to the edges in an images. Thus,
a simple but effective solution is to remove contents from smoother
areas and preserve the more informative image regions that con-
tain edges; this is achieved using a gradient-based energy function,
defined as:
E(I) = |∂I/∂x| + |∂I/∂y|
Unimportant contents are, therefore, pixels with smaller values of
energy function.

Pixel Removal
There exist different approaches for removing the unimportant pixels,
and each can lead to different visual results. The figure below demon-
strates three examples of such approaches; the first two (i.e., optimal
least-energy pixel and row removal) are observed to negatively affect
the image quality. The last one (i.e., least-energy column removal), on
the other hand, works significantly better, but it still causes plenty of
artifacts in the new image. An alternative solution is presented in the
next section.

1. Removing all pixels with less energy.

2. Removing the rows of pixels with the least energy.

3. Removing the columns of pixels with the least energy.

Figure 29: The effects of differ-


ent pixel removal methods on
the image quality.

A Seam
1. A seam is defined as a connected path of pixels from top to bottom
(or left to right). For a top-to-bottom seam, we pick exactly one pixel
from each row. The mathematical definition is

   s^x = {s_i^x}_{i=1}^{n} = {(x(i), i)}_{i=1}^{n}, s.t. ∀i, |x(i) − x(i−1)| ≤ 1

   s^y = {s_j^y}_{j=1}^{m} = {(j, y(j))}_{j=1}^{m}, s.t. ∀j, |y(j) − y(j−1)| ≤ 1

2. The optimal seam is the seam which minimizes the energy func-
tion, based on pixel gradients.
s* = argmin_s E(s), where E(I) = |∂I/∂x| + |∂I/∂y|

Figure 30: The red line indicates


the location of the optimal seam
with the least energy. Source:
Lecture 7-22

3. A recursion relation can be used to find the optimal seam. If
M(i, j) is defined as the minimal energy cost of a seam going
through pixel (i, j), the recursion relation is

   M(i, j) = E(i, j) + min(M(i − 1, j − 1), M(i − 1, j), M(i − 1, j + 1))

   This problem can be solved efficiently with dynamic programming
in O(snm) time, with s = 3 in the original algorithm. Given the energy
function values (Figure 31), the recursion relation yields the cumulative
cost map illustrated in Figure 32.

Figure 31: An example of en-


ergy function used in seam
carving algorithm. Source:
Lecture 7-24

4. To extract the optimal seam, a backtracking method is used. It
starts from the pixel in the bottom row with the lowest cumulative
cost, and it then traces back up toward the top of the image (i.e., the
first image row).

Seam Carving Algorithms


The algorithm below runs in O((n − n′)mn) time: in each loop, the updates of
E, s, and im each take O(mn). For vertical resizing, the image can be
transposed so that the same algorithm applies.

Figure 32: The process of using the recursion relation for computing the seam cost. Source: Lecture 7-(24-27)

Figure 33: Using backtracking to find the optimal seam. Source: Lecture 7-(28-31)

The average energy of the image increases as the seam carving
algorithm removes low-energy pixels. The described seam carving
algorithm can be used to modify the aspect ratio, to achieve object
removal, and to perform image resizing. The process is the same if an
image is flipped. When resizing, both horizontal and vertical seams
need to be removed. The order in which seams are removed in the
two directions can itself be solved by dynamic programming.
Specifically, the recurrence relation is:

T(r, c) = min(T(r − 1, c) + E(s^x(I_{(n−r−1)×(m−c)})), T(r, c − 1) + E(s^y(I_{(n−r)×(m−c−1)})))

For more information, refer to the SIGGRAPH paper on seam carving 20. 20 Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. ACM Trans. Graph, 26(3):10, 2007

Algorithm 1: Seam-Carving
1: im ← original image of size m × n
2: n′ ← desired image width
3:
4: Do (n − n′) times:
5: E ← Compute energy map on im
6: s ← Find optimal seam in E
7: im ← Remove s from im
8:
9: return im
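To make the dynamic-programming step concrete, here is a minimal NumPy sketch of the loop above for a grayscale image; the cumulative cost map M and the backtracking step follow the recursion relation given earlier (all function and variable names are ours):

```python
import numpy as np

def find_vertical_seam(E):
    """Dynamic program: M(i, j) = E(i, j) + min of its three upper neighbors."""
    h, w = E.shape
    M = E.astype(float).copy()
    for i in range(1, h):
        left = np.roll(M[i - 1], 1);  left[0] = np.inf      # M(i-1, j-1)
        right = np.roll(M[i - 1], -1); right[-1] = np.inf   # M(i-1, j+1)
        M[i] += np.minimum(np.minimum(left, M[i - 1]), right)
    # Backtrack from the lowest-cost pixel in the bottom row up to the top row.
    seam = np.zeros(h, dtype=int)
    seam[-1] = int(np.argmin(M[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam[i] = lo + int(np.argmin(M[i, lo:hi]))
    return seam

def seam_carve(im, n_prime):
    """Remove vertical seams from a grayscale image until it is n_prime columns wide."""
    im = im.astype(float)
    while im.shape[1] > n_prime:
        dy, dx = np.gradient(im)
        E = np.abs(dx) + np.abs(dy)          # gradient-magnitude energy map
        seam = find_vertical_seam(E)
        h, w = im.shape
        mask = np.ones((h, w), dtype=bool)
        mask[np.arange(h), seam] = False     # drop exactly one pixel per row
        im = im[mask].reshape(h, w - 1)
    return im
```

For horizontal seams, the same sketch can be applied to the transposed image, as noted above.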

Figure 8: The Sea off Satta, Hiroshige woodblock print (Public


Domain)

Advanced Seam Carving

Image Expansion
A similar approach can be employed to increase the size of images.
By expanding the least important areas of the image (as indicated
by our seams), the image dimensions can be increased without
impacting the main content. A naive approach is to iteratively find
and duplicate the lowest-energy seam. However, this produces results
like those depicted below:

Figure 9: Duplicating the least-energy seam is not a good strategy for


image expansion 21 21
https://commons.wikimedia.org/wiki/file:broadway_tow
a

On the right side of the image (Figure 9), one seam has been duplicated
repeatedly. This happens because the program retrieves the same
(least-energy) seam every time. A more effective implementation is to find
the first k seams at once and duplicate each of them:

Figure 10: A more effective image expansion strategy using the first k
low-energy seams 22 22
https://commons.wikimedia.org/wiki/file:broadway_tow
b

As observed above, the second image expansion strategy significantly
improves the quality of the resized image. Note, however, that
this method can only enlarge the image by up to a factor of 2 (since there
simply aren't enough seams to duplicate). For very dramatic enlargements,
you can instead iteratively enlarge by a factor of 1.4-1.5x.

Multi-Size Image Representation

While we’ve seen that image resizing can be very effective, it is


still very computationally intensive. In practice, images are stored
alongside a representation of their seams to forgo re-calculation
of the seams and streamline image resizing on devices. These seam
representations are the same dimensions as the image, but instead of
pixel intensities, they have numbered paths ordering seams based on
their energy values (ranging from least-energy to highest-energy). In
order to calculate these seams in the pre-processing step, the image
energy must be recalculated after each seam removal (this is because
of the changes in the cost function). See below for an example of such
a representation:

Figure 11: Examples of pre-computed seams used for fast image resizing on devices 23. 23 Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. ACM Trans. Graph, 26(3):10, 2007

Given this representation, a program seeking to remove k seams from
the original image can remove the pixels corresponding to those
labeled 1 through k in the seam image (at right).

Object Removal
By allowing users to specify which areas of the image to give high
or low energy, we can use seam carving to specifically preserve or
remove certain objects. The algorithm chooses seams specifically so
that they pass through the given object (in green below).

Figure 12: The seam carving algorithm can be used for object
removal by assigning low energy values to part of the image 24. 24 Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. ACM Trans. Graph, 26(3):10, 2007

Limitations

Seam carving is an effective method for image resizing. However,


there are limitations: 1) the primary limitation is that there are lower and upper
limits to its effectiveness when the size of the image is changed drastically;

2) it can fail to recognize features that are important in the context of an
object yet have low energy. Let us consider the image below.

Figure 13: The limitations of seam carving; source: CS131 Lecture 7,


slide 71

While the flat and smooth image regions (i.e., with low gradients)
are important to the image, they are removed; for example, this
includes the woman’s cheeks and forehead. While these regions are
low in energy, they are important features to human perception and
should be preserved. To address such limitations, the energy function
can be modified to consider additional information. For example, a
face detector can be used to identify important contents (i.e., human
faces) or other constraints can be applied by the users.

Figure 14: Seam Carving for Content-Aware Image Resizing 25. 25 Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. In ACM Transactions on Graphics (TOG), volume 26, page 10. ACM, 2007

Forward Energy

When we carve seams, we are removing the lowest energy pixels


and preserving the highest energy pixels. Hence, the average image
energy is increased and can lead to artifacts and jagged edges. One
way to avoid this issue is to instead focus on removing seams that

insert the least energy into the image. This approach is known as the
forward energy; our original accumulated cost matrix is modified
by adding the forward energy from corresponding new neighbors
as seen in the image below. The originally introduced method is
referred to as "backward" energy.

Figure 15: The forward energy calculations; source: Lecture 7-71

The figure below compares the performance of backward and


forward computations for image resizing. The traditional "backward"
energy approach resulted in jagged edges appearing along the han-
dle of the wrench and the stem of the plant. The "forward" energy
approach, on the other hand, minimizes the added energy, and the
smoothness of the edges is maintained.

Figure 16: Forward energy approach for content-aware image


resizing 26. 26 Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. In ACM Transactions on Graphics (TOG), volume 26, page 10. ACM, 2007

Seam-Carving in Videos

Having explored the powerful capabilities of seam carving for images, the
question remains as to how it can be applied to videos. Videos are
significantly more difficult to resize, and we face two primary issues.
First, let us consider a one-minute window of a film recorded at
30fps. In that duration, we have 1800 frames. If our seam carving

algorithm takes a full minute to process an image, it would take us 30


hours to completely process the video.
The second issue is with temporal coherency. One can consider
the intuitive and naive approach of seam carving to process each
frame independently. However, this does not necessarily preserve
the important content with respect to the relation of consecutive
frames. Since the human eye is particularly sensitive to movement,
the failure to consider the context across images can create a poor
looking video with noticeable distortion from frame to frame. There
is no coherency to changes across frames. A more effective approach
is to consider the video as a three dimensional space where each
vertical plane is an image from the video. The lowest energy 2D seam
can be calculated throughout the entire video’s length. This produces
much better results, but it still faces limitations.

Figure 17: Improved Seam Carving for Video Retargeting 27. 27 Michael Rubinstein, Ariel Shamir, and Shai Avidan. Improved seam carving for video retargeting. In ACM Transactions on Graphics (TOG), volume 27, page 16. ACM, 2008

This 3D representation gives us the same capabilities as 2D seam carving, such as object removal and frame resizing.
Semantic Segmentation

Introduction

The devices used for displaying images and videos are of different
sizes and shapes. The optimal image/video viewing configuration,
therefore, varies across devices and screen sizes. For this reason, im-
age resizing is of great importance in computer vision. The intuitive
idea is to rescale or crop the original image to fit the new device, but
that often causes artifacts or even loss of important content in images.
This lecture discussed the techniques used for resizing images while
preserving important content and limiting the artifacts.

Problem Statement
Input an image of size n × m and return an image of desired size
n′ × m′ which will be a good representative of the original image. The
expectations are:
1. The new image should adhere to device geometric constraints.

2. The new image should preserve the important content and struc-
tures.

3. The new image should have limited artifacts.

Importance Measures
1. A function, S : p → [0, 1], is used to determine which parts in
an image are important; then, different operators can be used
to resize the image. One solution is to use an optimal cropping
window to extract the most important contents, but this may result
in the loss of important contents.

Figure 34: Importance Mea-


surement by function method.
Source: Lecture 7-11.

2. There are also more sophisticated techniques for measuring the


image regions with higher degree of important content; they
include, but are not limited to, attention models, eye tracking (gaze
studies) 28, and face detectors. 28 Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2106-2113. IEEE, 2009.

Figure 35: Importance Measurement by more sophisticated methods such as eye tracking.

Segmentation

In computer vision, we are often interested in identifying groups of
pixels that go together. We call this problem image segmentation.
Humans perform image segmentation by intuition. For instance, two
people looking at the same optical illusion 29 might see different
things, all depending on how their brain segments the images. In the
image below, you might see zebras, or you might see a lion.
29 https://www.weirdoptics.com/hidden-lion-visual-optical-illusion

Figure 36: Optical illusions


regarding the problem of image
segmentation.

One of the motivations behind image segmentation is the separa-


tion of an image into coherent objects. Here are two examples of this
type of segmentation:

Figure 19: Human segmentation of example images; source: Svetlana


Lazebnik

We might also want to segment an image into many groups based


off nearby pixels being similar. We call these groups "superpixels."
Superpixels allow us to treat many individual pixels as one cluster,
and therefore enable faster computations. Here is an example of an
image segmented by superpixels:

Figure 20: The superpixels allow faster computations by clustering


pixels 30. 30 Xiaofeng Ren and Jitendra Malik. Learning a classification model for segmentation. In null, page 10. IEEE, 2003

Superpixel segmentation and other forms of segmentation can

help in feature support. We can treat the groups of pixels as one


feature and garner information about the image from them. Image
segmentation is also beneficial for some common photo effects such
as background removal. If we can properly segment an image, we
will be able to only keep the groups we want and remove the others.
While segmentation is clearly useful and has many practical
applications, there is no one way to segment an image, and we must
compare different segmentation algorithms to find our optimal
solution. The images are prone to under-segmentation and over-
segmentation if they respectively have very few or an excessive
number of groups. However, even a properly segmented photo can
have multiple different possible groupings.
To tackle the problem of how to segment an image, we will think
of segmentation as clustering. By clustering, we are able to group
similar data points together and represent them with one singular
value. This again aids in our ability to manipulate the image or
extract features from it. However, we must decide a few important
issues:

1. How do we determine if two pixels, patches, or images are simi-


lar?

2. How do we compute an overall grouping from pairwise similari-


ties?

Clustering algorithms have different answers for these questions; they


will be discussed in depth in the next notes.
In general, the two broad categories of clustering algorithms are
top down and bottom up. The top-down clustering approach groups
pixels and patches together if they lie on the same visual entity.
The bottom-up approach groups pixels together if they are locally
coherent.
We may also use certain human-recognizable visual patterns for
our clustering algorithm. Some example patterns include grouping
similar objects together or using symmetry to aid in segmentation.
In some instances, we can also look at "common fate." Common fate
means that a group of objects appear to be moving together, so they
share the same "fate." Here is an example of camels, which we can
group by their common fate.

Figure 21: Common fate provides visual cues for the segmentation
problem; source: Arthus-Bertrand (via F. Durand)

We can also illustrate common fate with this optical illusion. This
illusion, called the Müller-Lyer illusion, tricks us into thinking the
bottom line segment is longer than the top line segment, even though
they are actually the same length (disregarding the four mini-tails).

Figure 22: Common fate results in an optical illusion; source: Simon


Barthelme

Another way we can group objects is by proximity. With proximity,


we group objects with what they appear to be close to. For instance,
in this image, we might group the three people in the foreground
together.

Figure 23: Proximity can aid the image segmentation; source: Kristen
Grauman
Clustering

Clustering and Segmentation

Image segmentation is a task in computer vision; it aims to identify


groups of pixels and image regions that are similar and belong
together. Different similarity measures can be used for grouping
pixels; this includes texture and color features. An instance of image
segmentation is illustrated below. In Figure 1, the objective is to
group all the pixels that make up the tiger, the grass, the sky, and the
sand. The resulting segmentation can be observed in Figure 2.

Figure 37: Output segmentation.


Source: lecture 12, slide 4

Other examples of image segmentation include grouping video


frames into shots and separating an image’s foreground and back-
ground. By identifying the groups of pixels that belong together, an
image can be broken down into distinct objects. In addition to its
immediate use for object recognition, this can also allow for greater
efficiency in any further processing.

Gestalt School and Factors

Many computer vision algorithms draw from the areas outside of


computer science, and image segmentation is no different. Computer
vision researchers drew inspiration from the field of psychology,
specifically the Gestalt Theory of Visual Perception. At a very high
level, this theory states that "the whole is greater than the sum of
its parts." The relationships between parts can yield new properties
and features. This theory defines Gestalt factors, properties that can
define groups in images. Below are examples of such factors.
Here is an example of another factor; while Figure 4 does not ap-

Figure 38: Examples of gestalt


factors. Source: Forsyth &
Ponce

pear to have meaningful visual content, the addition of the overlaying


gray lines (Figure 5) on the same image provides visual cues as to the
pixel groupings and image content.

Figure 39: Source: Forsyth &


Ponce

We can now more clearly see that the image is a few 9’s occluded
by the gray lines. This is an example of a continuity through occlu-
sion cue. The gray lines give us a cue that the black pixels are not
separate and should in fact be grouped together. By grouping the
black pixels together and not perceiving them as separate objects, our
brain is able to recognize that this picture contains a few digits.
Below is another example:
What do you see? Do you see a cup or do you see the side of 2
heads facing each other? Either option is correct, depending on your
perspective. This variation in perception is due to what we identify as
the foreground and the background. If we identify the black pixels as

Figure 40: Source: Forsyth &


Ponce

Figure 41: Source: Forsyth &


Ponce

the foreground, we see a cup. But, if we identify the white pixels as


the foreground, we see 2 faces.
This is an overview of some of the factors that affect humans'
perception of pixel/image-region grouping within visual data. One
thing to note, however, is that while these factors may be intuitive,
it is hard to translate them into working algorithms.
How do we perform image segmentation? What algorithms do we
use? One way to think about segmentation is clustering. We have a
few pixels and we want to assign each to a cluster. In the following
sections, different methods of clustering will be detailed.

Agglomerative Clustering

Clustering is an unsupervised learning technique where several data


points, x1 , ..., xn , each of which are in R D , are grouped together into
clusters without knowing the correct label/cluster. Agglomerative
clustering is one of the commonly used techniques for clustering.
The general idea behind agglomerative clustering is to look at
similarities between points to decide how these points should be
grouped in a sensible manner. Before we discuss the details of the
algorithm, we must first decide how we determine similarity.

Distance Measures
We measure the similarity between objects by determining the dis-
tance between them: the smaller the distance, the higher the degree
of similarity. There are several possible distance functions, but it is
hard to determine what makes a good distance metric, so usually the
focus is placed on standard, well-researched distance metrics such as
the two detailed below.

Euclidean Distance One common choice is a similarity measure closely
related to the Euclidean distance: the (unnormalized) dot product between
two data points x and x′, which takes into account both the angle and the
magnitude. We can write it as:

sim(x, x′) = x^T x′   (1)

This measure does not normalize the vectors, so their
magnitude is factored into the similarity calculations.

Cosine Similarity Measure Another common solution is the cosine
similarity measure; it only accounts for the angle between two data
points, x and x′. Note that unlike the Euclidean distance, the cosine
measure only represents similarity, not distance. This means that the
similarity between a data point x and itself equals 1. As its name
implies, this measure relies on the cosine of the angle between the two
points, as found by:

sim(x, x′) = cos(θ)   (2)
           = (x^T x′) / (||x|| · ||x′||)   (3)
           = (x^T x′) / (√(x^T x) · √(x′^T x′))   (4)

The division by the magnitudes of the vectors results in the nor-


malization of the distance metric, and it ensure that the measure is
only dependent on the angle between the objects.
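As a small illustration, here is a minimal NumPy sketch of both measures (function names are ours):

```python
import numpy as np

def dot_similarity(x, xp):
    """Unnormalized similarity x^T x'; depends on both angle and magnitude."""
    return float(np.dot(x, xp))

def cosine_similarity(x, xp):
    """cos(theta) between x and x'; normalization removes the effect of magnitude."""
    return float(np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp)))
```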

Desirable Clustering Properties

Now that the potential distance metrics are defined, the next step
is to choose a clustering technique. There are various properties of
clustering methods that we might want to consider when choosing
specific techniques:

1. Scalable - in terms of compute power & memory

2. Different data types - algorithm should support arbitrary data


being in Rd for all d

3. Input parameters - The parameter tuning for the algorithm should


not be difficult. The algorithm is more useful if it does not heavily
rely on our accurate understanding of the data.

4. Interpretable - we should be able to interpret the results.

5. Constraints - The algorithm should effectively use the prede-


fined constraints (e.g., we know two points should be in the same
cluster, or they shouldn’t belong together).

The following sections cover the implementation of the agglomera-


tive clustering and its benefits and drawbacks.

Agglomerative Clustering Implementation

Agglomerative clustering computes the similarities among data
points and groups the closest points together. The newly created groups
can further be merged with other groups that are close to them; this
iterative process generates bigger and bigger groups until only one
group remains. This creates a hierarchy, best viewed as a
dendrogram.
We can visualize this in the following diagram, which shows data
points and the results of an agglomerative clustering algorithm.
The first picture shows all the data points, and pictures 2 to 5 show
various steps in the clustering algorithm. Step 2 groups the two red
points together; step 3 groups the two green points together; step 4
groups the previous green group and the nearby blue point into a
new orange group; and step 5 groups all the points together into one
large group. This creates the dendrogram in picture 6.

Figure 42: Agglomerative clus-


tering on sample input, and
resulting dendrogram

Algorithm Our general algorithm is based on this intuition and has
four main steps (a short SciPy sketch follows the list):

1. Initialize each point as its own cluster

2. Find the most similar pair of clusters

3. Merge the similar pair of clusters into a parent cluster

4. Repeat steps 2 & 3 until we have only 1 cluster.
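As a rough sketch of this procedure in practice, SciPy's hierarchical clustering utilities build the same kind of dendrogram; the choice of linkage (single, complete, or average) is discussed in the next section, and the dataset here is random placeholder data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 2)                          # 100 data points in R^2
Z = linkage(X, method='average')                    # iteratively merge the closest clusters
labels = fcluster(Z, t=0.5, criterion='distance')   # cut the dendrogram at a distance threshold
```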

Questions Although agglomerative clustering is a powerful tech-


nique, various factors should be taken into consideration when
implementing it. For instance:

1. How do we define similarity between clusters? How do we measure the


distance between two clusters?

We can measure the distance between two clusters in a variety of


different ways including the average distance between points, the
minimum distance between points in the clusters, the distance
between the means of each cluster, and the maximum distance
between points in the clusters. The method used for measuring the
distance between clusters can highly affect the results.

2. How many clusters do we chose?

When we create the dendrogram, we can decide how many clus-


ters we want based on a distance threshold. Alternatively, we can
cut the dendrogram horizontally at its different levels to create as
many clusters as we want.

Different measures of nearest clusters


There are three main models we can use to determine the distance
between cluster points as we segment our dataset: single, complete,
and average.

1. Single link:
Distance is calculated with the formula:

d(C_i, C_j) = min_{x∈C_i, x′∈C_j} d(x, x′)   (5)

With single linkage, we cluster by utilizing the minimum distance


between points in the two clusters.
This is also known as a minimum spanning tree.
We can stop clustering once we pass a threshold for the acceptable
distance between clusters.
This algorithm tends to produce long, skinny clusters (since it
is very easy to link far away points to be in the same cluster as
we only care about the point in that cluster with the minimum
distance).

Figure 43: Image segmentation


example using single link mea-
surement of nearest clusters.
Source: lecture 12, slide 46

2. Complete link:
Distance is calculated with the formula:

d(C_i, C_j) = max_{x∈C_i, x′∈C_j} d(x, x′)   (6)

With complete linkage, we cluster by utilizing the maximum


distance between points in the two clusters.
This algorithm tends to produce compact and tight clusters of
roughly equal diameter (since it favors having all the points close
together).

3. Average link:
Distance is calculated with the formula:
d(C_i, C_j) = ( ∑_{x∈C_i, x′∈C_j} d(x, x′) ) / ( |C_i| · |C_j| )   (7)

Figure 44: Image segmentation


example using complete link
measurement of nearest clus-
ters. Source: lecture 12, slide
47

With average linkage, we cluster by utilizing the average distance


between points in the two clusters.
This model is robust against noise because the distance does not
depend on a single pair of points, unlike single and complete
links, where the result can be affected by artifacts in the data.

Figure 45: Image segmentation


example using average link
measurement of nearest clus-
ters. Source: lecture 12, slide
48

Agglomerative clustering conclusions


The positive characteristics of agglomerative clustering:
1. Simple to implement and apply

2. Cluster shape adapts to dataset

3. Results in a hierarchy of clusters

4. No need to specify number of clusters at initialization


In terms of its negative characteristics:
1. Can return imbalanced clusters

2. Threshold value for number of clusters must be specified

3. Does not scale well, with a runtime of O(n³)

4. Greedy merging can get stuck at local minima



K-Means Clustering

Another algorithm is k-means clustering. It identifies a fixed number


of cluster "centers" as representatives of their clusters, and labels
each point according to the center it is closest to. A major difference
between k-means and agglomerative clustering is that k-means
requires the input of a target number of clusters to run the algorithm.

Image Segmentation Example

At the top left of figure 11, we have an image with three distinct
color regions, so segmenting the image using color intensity can be
achieved by assigning each color intensity, shown on the top right,
to a different cluster. In the bottom left image, however, the image is
cluttered with noise. To segment the image, we can use k-means.

Figure 46: Image segmentation


example using k-means. The
picture on the top left has three
distinct colors, but the bottom
picture has Gaussian noise.
Source: lecture 11, slide 8, slide
credit: Kristen Grauman

Using k-means, the objective here is to identify three cluster cen-


ters as the representative intensities, and label each pixel according
to its closest center. The best cluster centers are those that minimize
Sum of Square Distance between all points and their nearest cluster
center ci :

SSD = ∑_{i∈clusters} ∑_{x∈cluster_i} (x − c_i)²   (8)

When we are using k-means to summarize a dataset, the goal is


to minimize the variance in the data points that are assigned to each
cluster. We would like to preserve as much information as possible
given a certain number of clusters. This is demonstrated by the
equation below.

Figure 47: Clustering for sum-


marization. Source: lecture 11,
slide 11

Algorithm

Finding the cluster centers and group memberships of points can be


thought of as a "chicken and egg" problem. If we knew the cluster
centers, we could allocate points to groups by assigning each point
to the closest center. On the other hand, if we knew the group mem-
berships, we could find the centers by computing the mean of each
group. Therefore, we alternate between the tasks.
In order to find the centers and group memberships, we start by
initializing k cluster centers, usually by assigning them randomly. We
then run through an iterative process that computes group member-
ships and cluster centers for a certain number of iterations or until
the values of the cluster centers converge. The process is outlined
below (a short NumPy sketch follows these steps):

1. Initialize cluster centers c1 , ..., cK . Usually, these centers are ran-


domly chosen data points.

2. Assign each point in the dataset to the closest center. As in ag-


glomerative clustering, we can use the Euclidean distance or the
cosine distance measure to compute the distance to each center.

3. Update the cluster centers to be the mean of the points in the


cluster’s group.

4. Repeat Steps 2-3 until the value of the cluster centers stops chang-
ing or the algorithm has reached the maximum number of itera-
tions.
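Here is a minimal NumPy sketch of these four steps (function and variable names are ours, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random data points
    for _ in range(n_iters):
        # step 2: assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # step 4: stop on convergence
            break
        centers = new_centers
    return centers, labels
```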

Figure 48: Visualization of


k-means clustering. Source:
lecture 11, slide 15

Output

Each time it is run, k-means converges to a local minimum solution.


Additionally, since the centers are initialized randomly, each run of
the algorithm may return a different result. Therefore, initializing
multiple runs of k-means and then choosing the most representative
clustering yields the best results. The best clustering can be measured
by minimizing the sum of the square distances to the centers or the
variance of each cluster. K-means works best with spherical data.

Segmentation as Clustering

As with the example in section 4.1, clustering can be used to segment


an image. While color intensity alone can be effective in situations
like Figure 11, others such as Figure 14 may require us to define a
feature space, in which we choose which features of pixels should
be used as input to the clustering. In other words, the choice of
feature space directly affects the calculation of the similarity measure
between data points; the creative choice of feature space enables us
to "represent" the data points in a way that clusters are more easily
distinguishable from each other.

Figure 49: Image of a panda.


Source: lecture 11, slide 19

In addition to pixel intensity, examples of pixel groupings using


an image feature space include RGB color similarities, texture similar-
ities, and pixel positions. Clustering based on color similarities can
be modeled using separate features for red, green, and blue. Texture
can be measured by the similarities of pixels after applying specific
filters. Position features include the coordinates of pixels within an
image. Both intensity and position can be used together to group
pixels based on similarity and proximity.

K-Means++
The k-means method is appealing due to its speed and simplicity, but
not its accuracy. By augmenting it with a better way of choosing the
initial seeds, the arbitrarily bad clusterings that standard k-means
sometimes produces can be avoided. The algorithm for choosing the
initial seeds in k-means++ is outlined as follows (a short NumPy
sketch follows the list):

1. Randomly choose a starting center from the data points

2. Compute the distance D(x) from each data point x to the nearest
center that has already been chosen. Using a weighted probability
distribution, a new data point is then chosen as the next center
with probability proportional to D(x)²

3. Repeat the previous step until k centers have been chosen and
then proceed with the usual k-means clustering process as the
initial seeds have been selected

It has been shown that k-means++ is O(log k)-competitive.
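As promised, here is a minimal NumPy sketch of this seeding step, assuming squared Euclidean distance is used for the weighting (names are ours):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # step 1: first center chosen at random
    while len(centers) < k:
        # step 2: squared distance from every point to its nearest chosen center
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()                            # sample proportionally to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)                             # step 3: hand these to standard k-means
```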

Evaluation of clusters
The clustering results can be evaluated in various ways. For example,
there is an internal evaluation measure, which involves giving a
single quality score the results. External evaluation, on the other
hand, compares the clustering results to an existing true classification.
More qualitatively, we can evaluate the results of clustering based on
its generative measure: how well is the reconstruction of points from
the clusters or is the center of the cluster a good representation of the
data. Another evaluation method is a discriminative method where
we evaluate how well clusters correspond to the labels. We check if
the clusters are able to separate things that should be separated. This
measure can only be used in a supervised setting, as there are no
labels associated with unsupervised learning.

Pros & Cons


There are advantages and disadvantages associated with k-means
clustering technique:
Pros

1. Simple and easy to implement

2. Fast for low-dimensional data

3. Good representation of the data (cluster centers minimize the


conditional variance)

Cons

1. Doesn’t identify outliers

2. Need to specify k value, which is unknown

3. Cannot handle non-globular data of different sizes and densities

4. Restricted to data which has the notion of a center

5. Converge to a local minimum of the objective function instead


and is not guaranteed to converge to the global minimum of the
objective function

In order to choose the number of clusters or the k value, the objective


function can be plotted against different k values. An abrupt change
in the objective function at a certain k value is suggestive of that
specific number of clusters in the data. This technique is called
"knee-finding" or "elbow-finding".

Mean-shift Clustering

Mean-shift clustering is yet another clustering algorithm. At its


essence, mean-shift clustering is about finding the densest areas in
our feature space. This algorithm has four main steps:

1. Initialize random seed, and window W

2. Calculate the center of gravity (the "mean") of W: ∑_{x∈W} x H(x)

3. Shift the search window to the mean

4. Repeat step 2 until convergence

One way to mentally visualize this algorithm is to picture each data


point as a marble. If each marble is attracted to areas of high density,
all the marbles will eventually converge onto 1 or more centers.
We can also try to visualize the algorithm via this picture: In this

Figure 50: Results of mean-shift


clustering. Source: Y. Ukrainitz
and B. Sarel

picture, we see the algorithm will generate 2 clusters. All the data
points on the left converge onto one center and all the data points on
the right converge onto a different center.

To learn more about mean-shift clustering please refer to the next


set of notes.
Object recognition

Mean-Shift

Previously, we discussed the applications of agglomerative clustering


and k-means algorithm for image segmentation. This lecture focuses
on a mean shift, a mode-seeking technique that analyzes a density
function to identify its maxima; one of the primary application
of mean-shift algorithm is image segmentation. In the following
sections, the different steps involved in the mean-shift algorithm are
detailed.

Optimizations
To correctly assign points to clusters, a window must be initialized
at each point and shifted to the most dense area. This procedure can
result in a large number of redundant or very similar computations.
It is possible to improve the speed of the algorithm by computing
window shifts in parallel or by reducing the number of windows that
must be shifted over the data; this is achieved using a method called
"basin of attraction".

Parallelization The computations required to shift different windows


are independent of each other and can be split across multiple pro-
cessors and computed simultaneously. By parallelizing mean-shift
over multiple processors or machines, it is possible to achieve a large
speed increase without any loss to accuracy.

Basin of Attraction It is very likely that points close to the path


and stopping points of the window will be assigned to the same
cluster. Thus, these points can be assigned early without the need for
calculating mean-shift operations and initializing windows at their
location; this reduces the computation time.
After each update to the window location, nearby points are as-
signed using the following methods:
Assign all points within a radius r of the shift end point to the

same cluster. This is because the window has just been shifted to an
area of higher density, thus it is likely that points close to this area all
belong to the same cluster

Figure 1: Adding points within radius r of shift end point


Assign all points within radius c of the path of the shift to the
same cluster. Imagine initializing a window at one of the points near
the path of the shift. Because all windows are shifted in the direction
of the most dense area, it is likely that this window will be shifted in
the same direction and the point will eventually be assigned to the
same cluster. It is common to select the radius c such that c ≤ r.

Figure 2: Adding points within radius c of window path


There is a trade off in selecting the values for r and c. The smaller
the values, the fewer nearby points are assigned early and the higher

the computation cost will be; however, the resulting cluster assign-
ments will have less error relative to mean-shift computed without this
method. The larger the values, the more nearby points are assigned,
resulting in larger speed increases, but also the possibility that the
final cluster assignments will be less faithful to standard mean-shift.

Technical Details

In order to correctly shift the window you must first identify a


nearby area with highest density to calculate the shift vector. This
can be accomplished using the multivariate kernel density estimate,
a means of estimating the probability density function of a random
variable.
Given n data points x ∈ Rd , the multivariate kernel density
estimate, using a radially symmetric (Comaniciu & Meer, 2002)
kernel K ( x ), is given by

f̂_K(x) = (1 / (n h^d)) ∑_{i=1}^{n} K((x − x_i) / h),   (9)

where h is known as the bandwidth parameter which defines the


radius of the kernel. The radially symmetric kernel, K ( x ) is given by

K(x) = c_k k(||x||²),   (10)

where ck represents a normalization constant.


Selecting the appropriate h is crucial to achieving accurate density
estimation. Selecting a value that is too small results in a small radius
and can be subject to noise in the data. Selecting a value that is too
large includes many far away points and results in fewer clusters.
The resulting derivative of the multivariate kernel density estimate
is given by

∇f̂(x) = (2 c_{k,d} / (n h^{d+2})) [ ∑_{i=1}^{n} g(||(x − x_i)/h||²) ] [ ( ∑_{i=1}^{n} x_i g(||(x − x_i)/h||²) ) / ( ∑_{i=1}^{n} g(||(x − x_i)/h||²) ) − x ],   (11)

where g(x) = −K′(x) denotes the derivative of the selected kernel profile.

The first term in equation (11), (2 c_{k,d} / (n h^{d+2})) ∑_{i=1}^{n} g(||(x − x_i)/h||²), is proportional to the density estimate at x. The second term,

( ∑_{i=1}^{n} x_i g(||(x − x_i)/h||²) ) / ( ∑_{i=1}^{n} g(||(x − x_i)/h||²) ) − x,

is the mean-shift vector that points towards the direction of maximum density.

Mean-Shift Procedure
From a given point x^t, carry out the following steps in order to reach
the center of the cluster (a short NumPy sketch of this update, using a
Gaussian kernel, follows the kernel functions below).

1. Compute the mean-shift vector m [the second term from equation (11) above]:

   m(x) = ( ∑_{i=1}^{n} x_i g(||(x − x_i)/h||²) ) / ( ∑_{i=1}^{n} g(||(x − x_i)/h||²) ) − x   (12)

2. Translate the density window using the mean-shift vector:

   x_i^{t+1} = x_i^t + m(x_i^t)   (13)

3. Repeat steps 1 and 2 until convergence:

   ∇f(x_i) = 0   (14)

Kernel Functions
A kernel, K ( x ) is a non-negative function that integrates to 1 over all
values of x. These requirements ensure that kernel density estimation
will result in a probability density function.
Examples of popular kernel functions include:

• Uniform (rectangular)
1
– K(x) = 2 where | x | ≤ 1 and 0 otherwise

• Gaussian
1 2
– K ( x ) = √ 1 e− 2 u

• Epanechnikov (parabolic)

– K ( x ) = 34 (1 − x2 ) where | x | ≤ 1 and 0 otherwise
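As promised above, here is a minimal NumPy sketch of the mean-shift update for a single starting point, assuming a Gaussian kernel (so that g is again Gaussian up to constants; names are ours):

```python
import numpy as np

def mean_shift_point(x, X, h, n_iters=100, tol=1e-5):
    """Iterate x <- x + m(x) until the shift is negligible (Gaussian kernel, bandwidth h)."""
    for _ in range(n_iters):
        w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))   # g(||(x - x_i)/h||^2)
        shift = (w[:, None] * X).sum(axis=0) / w.sum() - x       # mean-shift vector m(x)
        x = x + shift
        if np.linalg.norm(shift) < tol:
            break
    return x   # converges to a mode; starting points that reach the same mode share a cluster
```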

Mean-Shift Conclusions
The mean-shift algorithm has many pros and cons to consider.
Positives of mean-shift:

• Very general and application independent.

• Model-free. Mean-shift does not assume any prior shape of data


clusters.

• Only depends on a single parameter defining the window size.


Additional parameters will be required (r and c) if basin of attrac-
tion is applied.

• Finds a variable number of modes. The modes of a distribution


function are the local maximums and the locations of these maxi-
mums are where the clusters will be centered.

• Robust to outliers in the data. Mean-shift won’t try to force out-


liers to an existing cluster if they are further than the window size
away from all other points.

Negatives of mean-shift:

• Output depends on window size and determining the appropriate


window size is not trivial.

• Computationally relatively expensive to compute the mean-shift


from all data points.

• Does not scale well with dimension of feature space.

Object recognition

Object recognition tasks


The field of object recognition can be broken down into a number of
different visual recognition tasks.

Classification Classification involves teaching computers to properly


label images based on the categories of object. It focuses on answer-
ing the questions, "is this image of a particular object?" and "does this
image contain a particular object?" Typically, classification questions
have yes or no answers. For example, typical classification questions
could ask if the image contains a building or if the image is of a lake.

Image search This recognition task involves searching a collection


of photos for photos of a particular object; an examples is Google
Photos.

Organizing photo collections Object recognition can help organize


photo collections based upon location of the photos, similarity of ac-
tivities, the people who appear in the photos, and other recognizable
objects.

Detection Detection focuses on the question, "where is a particu-


lar object in the image?" Traditional detection methods search for a
bounding box that contains the object in question. Using segmenta-
tion techniques, the object can also be more specifically selected from
the pixels in the image, for what is called accurate localization.

Detection can also be targeted at finding geometric and semantic


attributes. For example, detection tasks include asking questions such
as "is the traffic light red?", "what angle do two buildings make with
one another?". Other common attributes found by detection include
the distance objects are from one another, and what view of the object
the image has.

Single Instance Recognition Instead of searching to generally catego-


rize objects, single instance recognition seeks to recognize a particular
object or landmark in images. For example, we may want to deter-
mine if the picture contains the Golden Gate Bridge or just a bridge.
We may want to find a box of a specific brand of cereal.

Activity or event recognition Activity or event recognition is used to


detect what is occurring in a photo. For example, we can ask what
people are doing or if the event is a wedding.

Challenges
The challenges to building good object recognition methods include
both image and category complications. Since computers can only
see the pixel values of images, object recognition methods must
account for a lot of variance.

Category numbers Studies have shown that humans can recognize


roughly 10,000 to 30,000 categories of objects. Currently, the best
visual recognition methods can work with about 200 categories for
detection, and 1,000 for classification. Many researchers are working
to build image datasets to improve object recognition; performing
recognition on higher number of categories is the subject of many
competitions!

Viewpoint variation There are many potential ways to view an object,


and the change in viewpoint can lead an object to look very different.

Illumination Different levels of light, particularly low light or a


different light direction, will cause shadows to shift and the details of
an object to become obscured.

Scale Objects belonging to one category can come in a variety of


sizes. If a classification is only trained on a particular size of object,
then it will fail to recognize the same object in a different size. One
way to solve this bias would be to figure out a way to make better
datasets that account for scale variation.

Deformation Objects can change form and look very different, while
still being considered the same object. For example, a person can be
photographed in a number of poses, but is still considered a person if
they’re bending over or if their arms are crossed.

Occlusion Objects may be occluded, which could hide aspects of


their characteristic geometry. While occlusion may be an excellent
tool for artists (see Magritte), it is a challenge for computers.

Background clutter Similarities between the texture, color, and shapes


in the background and the foreground can make it difficult to de-
tect an object by blending in the object and/or causing it to appear
differently.

Intra-class variation There can be significant shape variations even


within one class of objects. For example, everything from a barstool
to a lounge chair, and a beanbag to an abstract artistic bench can be
considered a chair.

K-nearest neighbors

Supervised learning
We can use machine learning to learn how to label an image based
upon its features. The goal of supervised learning is to use an exist-
ing data set to find the following equation:

y = f (x) (15)

In the above equation, y is the output, f is the prediction function,


and x is a set of features from an image.
There are two phases of supervised learning: the train and the
test phase. In the train phase, we fit an equation, f , to the training
data set {( x1 , y1 )...}, which matches sets of image features to known
identifier labels. f can be estimated by minimizing the prediction
error, the difference between y and f (x).
In the second phase, the test phase, we evaluate our equation,
y = f ( x ), using a new test example that has never been seen by f
before. We can compare f (x), and its expected output, y, to evaluate
if the prediction function works on a new image. If the prediction
function does not work on new images, it’s not very useful!
One way to use supervised machine learning is to learn the deci-
sion rules for a classification task. A decision rule divides input space
into decision regions that are separated by decision boundaries.

Figure 3: Decision regions and decision boundaries (in red) for three
categories, R1, R2, and R3, over a feature space with two dimensions.

Nearest neighbor classifier

The nearest neighbor classifier is an algorithm for assigning cate-


gories to objects based upon their nearest neighbor. We assign a new
test data point the same label as its nearest training data point.
Nearest neighbors are found using the Euclidean distance between
features. For a set of training data, where X n and X m are the n-th and
the m-th data points, the distance equation is written as:
Dist(X^n, X^m) = ||X^n − X^m||_2 = √( ∑_i (X_i^n − X_i^m)² )   (16)

The definition of the nearest neighbors classifier allows for com-


plex decision boundaries that are shaped around each data point in
the training set.

K-nearest neighbors classifier

We can enhance the nearest neighbors classifier by looking at the


k nearest neighbors as opposed to just the nearest neighbor. The
k-nearest neighbor classifier calculates the k nearest neighbors, and
labels a new object by a score calculated over this set of nearest
neighbors. A commonly used metric is to assign a new object to
the category that a majority of their nearest neighbors belong to.

Heuristics are used to break ties, and are evaluated based upon what
works the best.
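Here is a minimal NumPy sketch of majority-vote k-NN classification (names are ours; integer class labels are assumed, and ties are broken arbitrarily by np.bincount):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Label x by a majority vote over its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest training points
    return np.bincount(y_train[nearest]).argmax()    # most common label among them
```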

Figure 4: For the + data point, the circle drawn around it contains its
k nearest neighbors, for k = 5. Since three out of five of its nearest
neighbors are green circles, our test data point will be classified as a
green circle.

Like the nearest neighbor classifier, the k-nearest neighbor al-


gorithm allows for complex decision boundaries using a relatively
simple equation!

Pros of using k-nearest neighbors


K-NN is a very simple algorithm, which makes it a good one to try
out at first. In addition, K-NN has very flexible decision boundaries.
Another reason to use K-NN is that with infinite examples, 1-NN
provably has error that is at most twice Bayes optimal error (proof is
out of scope for this class).

Problems with k-nearest neighbors


Choosing the value of k It’s important to choose the proper value of
K when using the K-nearest neighbor algorithm. If the value of K is
too small, then the algorithm will be too sensitive to noise points. If

the value of K is too high, then the neighborhood may include points
from other classes. Similarly, as K increases, the decision boundaries
become smoother, and for smaller values of K, there are
more, smaller regions to consider.

Figure 1: Comparing the result of choosing K=1 vs. K=15

Solution: Cross validate!


One way to find the proper value of K is to create a holdout cross
validation/development set from the training set. From there, we
would adjust the hyperparameters (e.g., K) to maximize accuracy on
the development set. We could then choose a different validation split
from the training set and repeat until we find the best value of K.
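As a minimal sketch of this holdout procedure (names are ours; a split of the original training data into training and validation subsets is assumed):

```python
import numpy as np

def choose_k(X_train, y_train, X_val, y_val, candidate_ks=(1, 3, 5, 7, 9, 15)):
    """Pick the k with the best accuracy on a held-out validation split."""
    def predict(x, k):
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        return np.bincount(y_train[nearest]).argmax()
    accuracies = {}
    for k in candidate_ks:
        preds = np.array([predict(x, k) for x in X_val])
        accuracies[k] = np.mean(preds == y_val)
    return max(accuracies, key=accuracies.get)
```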

Euclidean measurement Another issue that can arise with the K-


nearest neighbor algorithm is that using the Euclidean measure
might provide counter-intuitive results. See example below:

Figure 3: Different values, but using the Euclidean measure equates


them to be the same

Solution: Normalize!
Normalizing the vectors to unit length will guarantee that the
proper Euclidean measurement is used.

Curse of dimensionality When using the K-nearest neighbor algo-


rithm, one issue we need to keep in mind is how to deal with larger

dimensions. When the dimensions grow, we need to cover a larger


space to find the nearest neighbor. This makes it take longer and
longer to calculate the nearest points, and the algorithm will run
slower. This also means we will need to have a greater number of
examples to train on. Currently, there is no best solution to the di-
mensionality problem.

Figure 4: Larger dimensions require long calculation times

Problem: Assume there are 5000 points uniformly distributed in


the unit hypercube, and we want to apply 5-NN (Nearest Neighbor).
Suppose our query point is at the origin:
• In 1 dimension, we must go a distance of 5/5000 = 0.001 on average
to capture the 5 nearest neighbors

• In 2 dimensions, we must go (0.001)^{1/2} ≈ 0.032 to get a square that contains
0.001 of the volume

• In d dimensions, we must go (0.001)^{1/d}

Note: K-nearest neighbors is just one of the many classifiers to


choose from. That being said, it’s not always easy to choose the
"best" model for your data. It’s best to think about how well you can
generalize your model (i.e., how well does your model generalize
from the data it was trained on to a new set?).

Bias-variance trade-off
The key to generalization is to achieve a low generalization error by
finding the right number/type of parameters. There are two main
components of generalization error: bias and variance. Bias is defined
as how much the average model over all the training sets differs
from the true model. Variance is defined as how much the models
estimated from different training sets differ from each other. We need

to find the right balance between bias and variance, hence, the bias-
variance trade-off. Models with too few parameters are inaccurate
because of a large bias (not enough flexibility). Similarly, models with
too many parameters are inaccurate because of a large variance (too
much sensitivity to the sample). Two types of incorrect fitting are
listed below:

Underfitting: The model is too "simple" to represent all of the


relevant class characteristics.

• High bias and low variance

• High training error and high test error

Overfitting: The model is too "complex" and fits irrelevant character-


istics (noise) in the data.

• Low bias and high variance

• Low training error and high test error

Figure 5: A graph that depicts the trade-off between bias and


variance.
Dimensionality Reduction

Overview and Motivation

Overview
Dimensionality reduction is a process for reducing the number of
features used in an analysis or a prediction model. This enhances
the performance of computer vision and machine learning-based
approaches and enables us to represent the data in a more efficient
way. There are several methods commonly used in dimensionality
reduction. The two main methods covered in this lecture are Singular
Value Decomposition (SVD) and Principal Component Analysis
(PCA).

Motivation
Dimension reduction benefits models for a number of reasons.

1. Reduction in computational cost can be achieved. In many data


sets, most of the variance can be explained by a relatively small
number of input variables and their linear combinations. Focusing
on these key components using dimensionality reduction, we can
reduce the computational cost without losing much granularity in
the data.

2. Reduce the effects of the "curse of dimensionality". In
lecture 11 we learned that as we increase the dimension of a
feature space, the number of data points needed to "fill in"
that space with the same density explodes exponentially. That is
to say, the more dimensions used in a machine learning algorithm,
the more examples are needed for learning and the longer it takes
the algorithm to analyze the same number of data points. By
performing dimensionality reduction, we can mitigate the effects
of this "curse of dimensionality".

3. Compress data. By reducing the dimensionality of an image, we


can dramatically reduce the data storage requirements.

In such cases, the computational cost per data point may be re-
duced by many orders of magnitude with a procedure like SVD.

Singular Value Decomposition

Overview
Intuitively, Singular Value Decomposition (SVD) is a procedure that
allows one to represent the data in a new sub-feature space, such
that the majority of variations in the data is captured; this is achieved
by "rotating the axes" of the original feature space to form new axes
which are linear combinations of the original axes/features (e.g. age,
income, gender, etc. of a customer). These "new axes" are
useful because they systematically break down the variance in the
data points (how widely the data points are distributed) based on
each direction's contribution to the variance in the data:
The result of this process is a ranked list of "directions" in the
feature space ordered from most variance to least. The directions
along which there is greatest variance are referred to as the "princi-
pal components" (of variation in the data); by focusing on the data
distribution along these dimensions, one can capture most of the in-
formation represented in the original feature space without having to
deal with a high number of dimensions in the original space (but see
below on the difference between feature selection and dimensionality
reduction).

Technical Details of Singular Value Decomposition


• SVD represents any matrix A as a product of three matrices:
A = UΣV^T, where U and V are rotation matrices, and Σ is a
diagonal scaling matrix.

• For many readers, it may be sufficient to extract SVD values


by writing: [U, S, V] = numpy.linalg.svd(A). However, the un-
derpinnings of how SVD is computed is useful for later topics.
Computers typically compute SVD by taking the following steps:

– Compute the eigenvectors of AA^T. These vectors are the
columns of U. The square roots of the eigenvalues are the
singular values (the entries of Σ).

– Compute the eigenvectors of A T A. These vectors are columns of


V (or rows of V T )

• Since SVD relies on eigenvector computations, which are typically
fast, SVD can be performed quite quickly, even for large matrices.

• A more detailed, geometric explanation of SVD may be found


here32. A short numerical sketch of the decomposition follows.
32: http://www.ams.org/samplings/feature-column/fcarc-svd
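A small numerical sketch of these relationships, assuming NumPy is available; the matrix A below is an arbitrary example, not data from the text:

    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])   # arbitrary 3x2 example

    # Full decomposition: A = U * diag(S) * Vt
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # The singular values are the square roots of the eigenvalues of A^T A
    eigvals = np.linalg.eigvalsh(A.T @ A)                 # ascending order
    print(np.allclose(np.sort(S ** 2), eigvals))          # True

    # Reconstructing A from the three factors recovers the original matrix
    A_rebuilt = U @ np.diag(S) @ Vt
    print(np.allclose(A, A_rebuilt))                      # True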

Applications of Singular Value Decomposition

• One of the most utilized applications of SVD is the computation of
matrix inverses. If an arbitrary matrix A can be decomposed
as A = UΣV^T, the pseudo-inverse of A may be defined as
A+ = V Σ^(-1) U^T. Although this inverse is an approximation, it
allows one to calculate the inverses of many non-square matri-
ces. MacAusland (2014) discusses the mathematical basis of this
inverse, which is named the Moore-Penrose inverse, after its cre-
ators33. Unsurprisingly, a large variety of matrix problems can be
solved by utilizing this approach. (A short sketch of this computation
appears after this list.)
33: R. MacAusland. The Moore-Penrose inverse and least squares. 2014

• Singular Value Decomposition can also be used to compute the


Principal Components of a matrix. Principal Components are heav-
ily utilized in various data analysis and machine learning routines,
hence SVD is typically a core routine within many programs.
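As a sketch of the first application (assuming NumPy), the pseudo-inverse can be assembled directly from the SVD factors and checked against numpy.linalg.pinv; the matrix A is again just an arbitrary example:

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # non-square example

    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Moore-Penrose pseudo-inverse built from the SVD factors:
    # A+ = V * diag(1/S) * U^T (singular values assumed non-zero here)
    A_pinv = Vt.T @ np.diag(1.0 / S) @ U.T

    print(np.allclose(A_pinv, np.linalg.pinv(A)))         # True
    print(np.allclose(A_pinv @ A, np.eye(2)))             # acts as a left inverse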

Principal Component Analysis

What are Principal Components?

Continuing with the SVD example we have above, notice that col-
umn 1 of U gets scaled by the first value from Σ.

The resulting scaled column then gets scaled by row 1 of V^T to pro-
duce a contribution to the columns of A, which is denoted A_partial.
Each product of (column i of U) * (value i from Σ) * (row i of V^T) pro-
duces a rank-one component of the final A.

In this process we are building the matrix A as a linear combi-
nation of the columns of U. As seen above, if we use all columns of
U, we rebuild A perfectly. However, in real-world data, we can use
only the first few columns of U to get a good approximation of A.
This arises due to the properties of Σ. Σ is a diagonal matrix where
the largest value is in the top left corner, and the rest of the values on
the diagonal decrease as you move down and to the right. Thus, the first few
columns of U contribute the largest weight towards A. These first few
columns of U are called principal components.
However, not all matrices can be easily compressed as in the
previous example. One way to evaluate the feasibility is Principal
Component Analysis. From a high level standpoint, we want to see
if it’s possible to remove dimensions that don’t contribute much to
the final image. We achieve this by analyzing the covariance matrix.
Although the value of covariance doesn’t matter as much, the sign
of covariance does, with positive indicating positive correlation
and negative indicating negative correlation. A covariance of zero
indicates the two variables are uncorrelated.

Performing Principal Component Analysis

Principal Component Analysis can be performed using the sklearn


package: sklearn.decomposition.PCA. However, it was alluded
to earlier that SVD can be used to perform Principal Component
Analysis. A non-formal approach is outlined below:

1. Format your data into an m × n matrix X, where m denotes the number
of samples and n represents the number of features or variables
corresponding to a single sample.

2. Center the matrix X by subtracting the mean and dividing by the
standard deviation along each column (feature) of X.

3. Diagonalizing X using SVD yields: X = UΣV T



4. The eigenvectors are the principal directions, and the projections of
the data onto these axes are the principal components. This ultimately
means we want to compute XV.

5. Since V holds the eigenvectors and is thus orthonormal, XV =
UΣV^T V = UΣ

6. Step (5) implies we simply need the columns of UΣ, both matrices
that are surfaced by SVD (a short sketch of this recipe follows the list)
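A minimal sketch of this recipe, assuming NumPy and (only for cross-checking) scikit-learn; the random matrix X is a placeholder for real data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                  # 100 samples, 5 features

    # Steps 1-2: center (and standardize) each column
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 3: SVD of the centered data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Steps 4-6: the principal components are the columns of U * diag(S),
    # which is the same as the projection Xc @ V
    components = U * S

    # Cross-check against sklearn (individual components may differ in sign)
    skl = PCA(n_components=5).fit_transform(Xc)
    print(np.allclose(np.abs(components), np.abs(skl)))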

Detailed explanations that elucidate the reasoning behind the


above steps are discussed by Moore (1981) and can be found on
numerous websites online.

Applications of Principal Components


• PCA has been extensively used in image compression. Much of
the information captured within an image matrix can be extracted
using matrices of lower ranks. This allows large images to be
compressed without significant loss of quality. An example of PCA
based compression, using only the first 16 principal components, is
shown below:

Figure 51: Right: Original


Image, Left: Compressed Image

With just the first 16 principal components, an image that closely


resembles the original image can be reconstructed. The relative
error as a result of the dimensions used for PCA for the above
image is shown below:

• Web search engines also utilize PCA. There are billions of pages on
the Internet that may have a non-trivial relation to a provided
search phrase. Companies such as Google, Bing and Yahoo
typically narrow the search space by only considering a small
subset of this search matrix, which can be extracted using PCA34.
This is critical for timely and efficient searches, and it speaks to the
power of SVD.
34: Snasel V., Abdulla H.D. Search result clustering using a singular value decomposition (SVD). 2009

Figure 52: Relative Error as


Function of PCA dimensions

In essence, PCA represents data samples as weights on various


components - allowing one to essentially represent the difference
between samples. This can significantly reduce data redundancy and
can make algorithms used in a variety of industries more efficient
and insightful!
Face Identification

Introduction to Facial Recognition

Neuroscience Background
In the 1960’s and 1970’s, neuroscientists discovered that depending
on the angle of observation, certain brain neurons fire when look-
ing at a face. More recently, they have come to believe that an area
of the brain known as the Fusiform Face Area (FFA) is primarily
responsible for reacting to faces. These advances in the biological
understanding of facial recognition have been mirrored by similar
advances in computer vision, as new techniques have attempted to
come closer to the standard of human facial recognition.

Applications
Computer facial recognition has a wide range of applications:
• Digital Photography: Identifying specific faces in an image allows
programs to respond uniquely to different individuals, such as
centering the image focus on a particular individual or improving
aesthetics through various image operations (blur, saturation, etc).

• Surveillance: By recognizing the faces of specific individuals, we


can use surveillance cameras to detect when they enter a location.

• Album Organization: If we can recognize a specific person, we can


group images in which they appear.

• Person tracking: If we can recognize a specific person, we can track


their location through frames of video (useful for surveillance or
for robots in the home).

• Emotions and Expressions: By detecting emotions or facial expres-


sions, we can build smart devices that interact with us based on
mood.

• Security and Warfare: If we can recognize a specific person, we can


identify potential threats in drone images and video.

• Teleconferencing: If we can recognize specific people, then telecon-


ferencing applications could automatically provide information to
users about who they are communicating with.

A Key Distinction: Detection vs. Recognition


While face detection determines whether an image contains faces and
where in the image they are, face recognition determines to whom a
detected face belongs (i.e., identifying the person).

Space of Faces
If we consider an m × n image of a face, that image can be represented
by a point in high dimensional space (Rmn ). But relatively few high-
dimensional vectors consist of valid face images (images can contain
much more than just faces), and thus the region that an arbitrary face
image could fall into is a relatively small subspace. The task is to
effectively model this subspace of face images.

Figure 53: The region occupied


by images of faces is a small
subspace of the total space of
images. Source: Lecture 13,
slide 14

In order to model this subspace or "face-space" we compute the k-


dimensional subspace such that the projection of the data points onto
the subspace has the largest variance among all k-dimensional sub-
spaces. This low-dimensional subspace captures the key appearance
characteristics of faces.

The Eigenfaces Algorithm

Key Ideas and Assumptions


• Assume that most face images lie on a low-dimensional subspace
determined by the first k directions of maximum variance.

• Use Principal Component Analysis (PCA) to determine the
vectors or "eigenfaces" that span that subspace.

• Represent all face images in the dataset as linear combinations of


eigenfaces, where eigenfaces are defined as the principal compo-
nents of SVD decomposition.

What are eigenfaces? "Eigenfaces" are the visual representations of


the eigenvectors in the directions of maximum variance. They often
resemble generic-looking faces.

Figure 54: Faces and Eigenfaces.


Source: Lecture 13, slide 29

Training Algorithm

Figure 55: The reconstructed


face after projection. Source:
Lecture 13, slide 25

Why can we do this? Empirically, the eigenvalues (variance along


eigenvectors) drop rapidly with the number of principal components,
which is why we can reduce dimensionality without much loss of
information.

Reconstruction and Error We only select the top k eigenfaces, which


reduces the dimensionality. Fewer eigenfaces result in more informa-
tion loss, and hence less discrimination between faces.

Algorithm 2: Eigenfaces Training Algorithm


1: Align training images x1 , ..., xn
2: Compute the average face:

µ = (1/N) ∑_i x_i
3: Compute the difference image (the centered data matrix):

X_c = X − µ1^T = X − (1/N) X 1 1^T = X (I − (1/N) 1 1^T)
4: Compute the covariance matrix:

Σ = (1/N) X_c X_c^T
5: Compute the eigenvectors of the covariance matrix Σ using PCA
(Principal Component Analysis)
6: Compute each training image xi ’s projections as

xi → ( xi c · φ1 , xi c · φ2 , ..., xi c · φk ) ≡ ( a1 , a2 , ..., ak )

where φi is the i’th-highest ranked eigenvector


7: The reconstructed face xi ≈ µ + a1 · φ1 + ... + ak · φk
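A minimal sketch of the training steps above, assuming NumPy and that the aligned training images have already been flattened into the rows of a matrix X; the function name and variables are illustrative only:

    import numpy as np

    def train_eigenfaces(X, k):
        """X: N x d matrix of N flattened, aligned face images; k: number of eigenfaces."""
        mu = X.mean(axis=0)                     # step 2: average face
        Xc = X - mu                             # step 3: centered data
        # steps 4-5: eigenvectors of the covariance, obtained via SVD of the
        # centered data; the rows of Vt are the eigenfaces
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        eigenfaces = Vt[:k]                     # k x d
        # step 6: project each centered training image onto the eigenfaces
        projections = Xc @ eigenfaces.T         # N x k coefficients (a_1 ... a_k)
        return mu, eigenfaces, projections

    # step 7: a face can then be approximately reconstructed as
    # x_i ~ mu + projections[i] @ eigenfaces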

Testing Algorithm

Advantages

• This method is completely knowledge-free – it does not know


anything about faces, expressions, etc.

• It is a non-iterative (fast), globally-optimal solution.

Disadvantages

• This technique requires carefully controlled data.

Algorithm 3: Eigenfaces Testing Algorithm


1: Take query image t
2: Project onto eigenvectors:

t → ((t − µ) · φ1 , (t − µ) · φ2 , ..., (t − µ) · φk ) ≡ (w1 , w2 , ..., wk )

3: Compare projection w with all N training projections. Use eu-


clidean distance and nearest-neighbors algorithm to output a
label
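Continuing the hypothetical sketch above, testing reduces to projecting the query image and running nearest neighbors on the stored projections:

    import numpy as np

    def recognize(t, mu, eigenfaces, projections, labels):
        """t: flattened query image; labels[i] is the identity of training image i."""
        w = (t - mu) @ eigenfaces.T                      # project onto eigenvectors
        dists = np.linalg.norm(projections - w, axis=1)  # Euclidean distances
        return labels[np.argmin(dists)]                  # 1-nearest-neighbor label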

Figure 56: Eigenvalues sorted in


descending order of magnitude.
Source: Lecture 13, slide 26

Figure 57: Reconstructed faces


with varying number of eigen-
faces. Source: Lecture 13, slide
27

1. All faces must be centered in the frame. Otherwise the results


may be noisy.
2. The images must be the same size.
3. There is some sensitivity to the face angle.

• Method is completely knowledge free.

1. It makes no effort to preserve class distinctions.


2. PCA doesn’t take into account who it is trying to represent
in this lower dimensional space (it doesn’t take into account
the labels associated with the faces). Therefore, it might map
different faces near the same subspace, making it difficult for
classifiers to distinguish between them.

• PCA projection is optimal for reconstruction from a low dimen-


sional basis but may not be optimal for discrimination (the algo-
rithm does not attempt to preserve class distinctions).

Beyond Facial Recognition: Expressions and Emotions


This technique also generalizes beyond simple facial recognition
and can be used to detect expressions and emotions. The subspaces
would therefore represent happiness, disgust, or other potential
expressions, and the algorithm would remain unchanged.

Figure 58: Eigenfaces express-


ing happiness. Source: Lecture
13, slide 33

Figure 59: Eigenfaces express-


ing disgust. Source: Lecture 13,
slide 34

Linear Discriminant Analysis

PCA vs. LDA


PCA and LDA are similar in that both reduce the dimensions of
a sample. However, PCA projections don’t consider the labels of
the classes. An alternative approach is to move away from PCA
toward an algorithm that is optimal for classification (as opposed
to reconstruction). Linear Discriminant Analysis (LDA) finds a
projection that keeps different classes far away from each other.
• PCA maintains maximum variance.

• LDA allows for class discrimination by finding a projection that


maximizes scatter between classes and minimizes scatter within
classes.
The difference between PCA and LDA projections is demonstrated
in the figure above. PCA preserves the maximum variance and maps
the points of the classes along the line with the positive slope, which
makes it difficult to distinguish a points’ class. Meanwhile, LDA
maps the points onto the line with the negative slope, which results
in points being located close to other points in their class and far
from points in the opposite class.

Figure 60: PCA vs. LDA.


Source: Lecture 13, slide 41

General Idea
LDA operates using two values: between class scatter and within
class scatter. Between class scatter is concerned with the distance
between different class clusters, whereas within class scatter refers to
the distance between points of a class. LDA maximizes the between-
class scatter and minimizes the within-class scatter.

Figure 61: Between Class Scat-


ter vs. Within Class Scatter.
Source: Lecture 13, slide 43

Mathematical Formulation of LDA with 2 Variables


We want to find a projection w that maps points with classes 0 and
1 in the space x ∈ Rn to a new space z ∈ Rm , such that z = w T x.
Ideally, m < n, and our projection should maximize the function:

J(w) = (S_B when projected onto w) / (S_W when projected onto w)
In this equation, SB represents between class scatter and SW rep-

resents the within-class scatter. Let us then define a variable µi that


represents the mean of a class’ points:

µ i = E X |Y [ X | Y = i ]
Let us also define a variable Σi that represents the covariance
matrix of a class:

Σi = EX |Y [( X − µi )( X − µi ) T |Y = i ]
Using these values, we can redefine our variables SB and SW to be:

SB = (µ1 − µ0 )2 = (µ1 − µ0 )(µ1 − µ0 ) T


SW = (Σ1 + Σ0 )
Plugging these new values of SB and SW back into J (w), we get:

J(w) = ((µ_1 − µ_0)^2 when projected onto w) / ((Σ_1 + Σ_0) when projected onto w) = (w^T (µ_1 − µ_0)(µ_1 − µ_0)^T w) / (w^T (Σ_1 + Σ_0) w)

We can maximize J (w) by maximizing the numerator, w T (µ1 −


µ0 )(µ1 − µ0 ) T w, and keeping its denominator, w T (Σ1 + Σ0 )w constant.
In other words:

max w T (µ1 − µ0 )(µ1 − µ0 ) T w subject to w T (Σ1 + Σ0 )w = K

Using Lagrange multipliers, we can define the Lagrangian as:

L = w T SB w − λ(w T SW w − K ) = w T (SB − λSW )w + K


We must then maximize L with respect to λ and w. We can do so
by taking its gradient with respect to w and finding where the critical
points are:

∇w L = 2(SB − λSW )w = 0
Using this equation, we get that the critical points are located at:

SB w = λSW w
This is a generalized eigenvector problem. In the case where
S_W^(-1) = (Σ_1 + Σ_0)^(-1) exists, we obtain:

S_W^(-1) S_B w = λw
We can then plug in our definition of SB to get:

S_W^(-1) (µ_1 − µ_0)(µ_1 − µ_0)^T w = λw

Notice that (µ1 − µ0 ) T w is a scalar, and thus we can represent it by


a term α such that:

S_W^(-1) (µ_1 − µ_0) = (λ/α) w
The magnitude of w does not matter, so we can represent our
projection w as:

w* = S_W^(-1) (µ_1 − µ_0) = (Σ_1 + Σ_0)^(-1) (µ_1 − µ_0)
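A minimal sketch of this closed-form two-class solution, assuming NumPy; X0 and X1 are placeholder matrices holding one sample per row for each class:

    import numpy as np

    def lda_two_class(X0, X1):
        """Return the Fisher projection w* for two classes of row-vector samples."""
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        # within-class scatter S_W = Sigma_1 + Sigma_0 (covariance of each class)
        Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
        # w* = S_W^(-1) (mu_1 - mu_0); solve the linear system instead of inverting
        return np.linalg.solve(Sw, mu1 - mu0)

    # Projecting a sample x onto the discriminant line is then simply z = w @ x.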

LDA with N Variables and C Classes


Preliminaries Variables:

• N sample images: { x1 , · · · , x N }

• C classes: {Y1 , Y2 , · · · , YC }. Each of the N sample images is associ-


ated with one class in {Y1 , Y2 , · · · , YC }.

• Average of each class: the mean for class i is µ_i = (1/N_i) ∑_{x_k ∈ Y_i} x_k

• Average of all data: µ = (1/N) ∑_{k=1}^{N} x_k

Scatter Matrices:

• Scatter of class i: S_i = ∑_{x_k ∈ Y_i} (x_k − µ_i)(x_k − µ_i)^T

• Within class scatter: S_W = ∑_{i=1}^{c} S_i

• Between class scatter: S_B = ∑_{i=1}^{c} N_i (µ_i − µ)(µ_i − µ)^T

Mathematical Formulation We want to learn a projection W such that


it converts all the points from x ∈ Rm to a new space z ∈ Rn , where
in general m and n are unknown:

z = W^T x,  x ∈ R^m, z ∈ R^n

After the projection, the between-class scatter is S̃_B = W^T S_B W, where
W and S_B are calculated from our original dataset. The within-class
scatter is S̃_W = W^T S_W W. So the objective becomes:

W_opt = argmax_W ( |S̃_B| / |S̃_W| ) = argmax_W ( |W^T S_B W| / |W^T S_W W| )

Again, after applying Lagrange multipliers we obtain a generalized


eigenvector problem, where we have:

SB wi = λi SW wi , i = 1, . . . , m

Note that Rank(SB ) and Rank(SW ) are limited by the number of


classes (C) and the number of sample images (N):

Rank(SB ) ≤ C − 1

Rank(SW ) ≤ N − C

And therefore the rank of Wopt is limited as well.

Results: Eigenface vs. Fisherface


Belhumeur, Hespanha, Kriegman performed an experiment comparing
the recognition error rates of PCA to LDA ("Eigenface vs Fisherface")
using a dataset of 160 images of 10 distinct people. The images
contained significant variation in lighting, facial expressions, and
eye-wear. Error rates were determined using the "leaving-one-out"
strategy, where a single image was classified by removing that image
from the dataset and training on the other 159 images, at which point
classification was done on the left-out image with a nearest-neighbors
classifier. This process was repeated over all 160 images in order to
determine an error rate35.
35: P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997

Figure 62: Variation in Facial Expression, Eyewear, and Lighting.

Error rates for the two algorithms (and a variation of standard


Eigenface) are plotted in the figure above. Eigenface’s error rate
actually improves by removing the first three principal components.
Fisherface, as shown in the figure above, gives the lowest error rate.

Figure 63: Eigenface vs. Fisher-


face. Source: Lecture 13, slide
61
Visual Bag of Words

Introduction

In this lecture, we learn another approach to recognition. To recog-


nize objects in images, we need to first represent them in the form
of feature vectors. Feature vectors are mathematical representations
of an image’s important features. These feature vectors, for example,
can be the raw color values of the image or contain information about
the position of the pixel in the image as we have seen and imple-
mented in Homework 5. We then create a space representation of the
image to view the image values in a lower dimensional space. Every
image is then converted into a set of coefficients and projected into
the PCA space. The transformed data is classified using a classifier.
Some examples of such classifiers include K-means and HAC. This
process of going from an image to a useful representation of the
image in a lower dimensional space can be achieved in many ways.
In this lecture, we discuss another approach entitled Visual Bag of
Words.

Idea of Bag of Words


The idea behind "Bag of Words" is a way to simplify object rep-
resentation as a collection of their subparts for purposes such as
classification. The model originated in natural language processing,
where we consider texts such as documents, paragraphs, and sen-
tences as collections of words - effectively "bags" of words. Consider
a paragraph - a list of words and their frequencies can be considered
a "bag of words" that represents the particular paragraph, which we
can then use as a representation of the paragraph for tasks such as
sentiment analysis, spam detection, and topic modeling.
Although "Bag of Words" appears to be associated with language,
the idea of simplifying complex objects into collections of their sub-
parts can be extended to different types of objects. In Computer
Vision, we can consider an image to be a collection of image fea-
tures. By incorporating frequency counts of these features, we can

apply the "Bag of Words" model towards images and use this for
prediction tasks such as image classification and face detection.
There are two main steps for the "Bag of Words" method when
applied to computer vision, and these will further be explored in the
Outline section below.
1. Build a "dictionary" or "vocabulary" of features across many
images - what kinds of common features exist in images? We can
consider, for example, color scheme of the room, parts of faces such
as eyes, and different types of objects.
2. Given new images, represent them as histograms of the features
we had collected - frequencies of the visual "words" in the vocabulary
we have built.

Origins
The origins of applying the "Bag of Words" model to images comes
from Texture Recognition and, as previously mentioned, Document
Representation.
1. Textures consist of repeated elements, called textons - for ex-
ample, a net consists of repeated holes and a brick wall consists of
repeated brick pieces. If we were to consider each texton a feature,
then each image could be represented as a histogram across these
features - where the texton in the texture of the image would have
high frequency in the histogram. Images with multiple textures,
therefore, can be represented by histograms with high values for
multiple features.
2. Documents consist of words which can be considered their
features. Thus, every document is represented by a histogram across
the words in the dictionary - one would expect, for example, the
document of George Bush’s state of the union address in 2001 to
contain high relative frequencies for "economy", "Iraq", "army", etc.
Thus, a "bag of words" can be viewed as a histogram representing
frequencies across a vocabulary developed over a set of images or
documents - new data then can be represented with this model and
used for prediction tasks.

Algorithm Summary

Let’s describe in detail how the Bag of Words algorithm can be


applied to a large set of images.

Extracting Interesting Features


The first step in the Bag-of-Words algorithm is extracting the features
of the images. We eventually use these features to find the most com-

mon features across our dataset of images. We can choose any type
of feature we want. For example, we can sim-
ply split our images into a grid and grab the subimages as features
(shown below). Or, we can use corner detection or SIFT features as
our features.

Using grid of subimages as features36
36: Ranjay Krishna, Juan Carlos Niebles. Lecture: Visual bag of words. http://vision.stanford.edu/teaching/cs131_fall1718/files/14_BoW_bayes.pdf

Learning Visual Vocabulary

Once we have our features, we must turn this large feature set into a
small set of "themes". These "themes" are analogous to the "words"
in the Natural Language Processing version of the algorithm. As
mentioned above, in the Computer Vision application, the "words"
are called textons.
To find textons, we simply cluster our features. We can use any
clustering technique (K-Means is most common, but Mean Shift or
HAC may also work) to cluster the features. We then use the centers
of each cluster as the textons. Our set of textons is known as a visual
vocabulary. An example of a visual vocabulary is given below.

Example of a visual vocabulary37
37: Ranjay Krishna, Juan Carlos Niebles. Lecture: Visual bag of words. http://vision.stanford.edu/teaching/cs131_fall1718/files/14_BoW_bayes.pdf

Quantize Features

A synonym for texton in this case is codevector. In other words, the


center of each of our features cluster is a codevector. Altogether, our
set of codevectors form a codebook. We can use this codebook to
quantize features: we extract features from new images using the
same method we used to extract features from our dataset, and then
use our codebook to map the feature vectors to the indexes of the
closest codevectors.
The size of our codebook (which is exactly equivalent to the number
of clusters in our clustering algorithm) is an important hyperparame-
ter. If it's too small, then our codevectors are not representative of the
underlying data. If it’s too large, then the codebook will start to over-
fit the underlying data. We must be conscious of this when picking
the K value for K-Means (if, of course, we decide to use K-Means as
our clustering algorithm).
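A minimal sketch of building a codebook and quantizing new features with K-Means, assuming NumPy and scikit-learn; the random descriptor arrays stand in for real local features such as SIFT:

    import numpy as np
    from sklearn.cluster import KMeans

    # all_descriptors: stack of local feature vectors from many training
    # images, one descriptor per row -- here a random placeholder
    rng = np.random.default_rng(0)
    all_descriptors = rng.normal(size=(5000, 128))

    k = 50                                           # codebook size (hyperparameter)
    codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)

    def bag_of_words(descriptors, codebook, k):
        """Map an image's descriptors to a normalized histogram over codevectors."""
        words = codebook.predict(descriptors)        # index of the closest codevector
        hist = np.bincount(words, minlength=k).astype(float)
        return hist / hist.sum()

    new_image_descriptors = rng.normal(size=(300, 128))
    print(bag_of_words(new_image_descriptors, codebook, k).shape)   # (50,)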

Represent Images by Frequencies


Once we have built our codebook, we can use it to do interesting
things. First, we can represent every image in our dataset as a his-
togram of codevector frequencies (shown below). We use feature
quantization to accomplish this. Then, we have two options, de-

pending on our type of problem. If we have a supervised learning


problem (i.e. our data has labels), we can train a classifier on the
histograms. This classifier will then be trained on the appearance of
the textons and hence will be a robust way to distinguish between
classes. If we have an unsupervised learning problem (i.e. our data
does not have labels), we can further cluster the histograms to find
visual themes/groups within our dataset.

Representing our images as a histogram of texton frequencies38
38: Ranjay Krishna, Juan Carlos Niebles. Lecture: Visual bag of words. http://vision.stanford.edu/teaching/cs131_fall1718/files/14_BoW_bayes.pdf

We can create our visual vocabulary from a different dataset than
the dataset that we are interested in classifying/clustering, and so
long as our first dataset is representative of the second, this algorithm
will be successful.

Large-Scale Image Search


Large-scale image matching is one of the ways that the Bag-of-words
model has been useful. Given a large database, which can hold tens
of thousands of object instances, how can one match an images to
this database?
The Bag-of-words model can help build the database. First, fea-
tures can be extracted from the database images. Then we can learn a
vocabulary using k-means (typical k:100,000). Next we compute the
weights for each word. Going back to the word dictionary example,
weighting the words can help us decrease the importance of certain
words. If we are trying to find the topic of a document, we can give
words like "the", "a", and "is" low weights since they are likely to be
common between documents and used frequently within a document.
With images we can do the same, giving useless features low weights
and the more important features higher weights. Once the features
have been weighted, we can create an inverted file mapping words to
images.

Term Frequency Inverse Document Frequency (TF-IDF) scoring
weights each word by its inverse document frequency.
The inverse document frequency (IDF) of a word j can be found by

IDF_j = log( NumDocs / NumDocs in which j appears )

To compute the value of bin j in image I:

Bin_j = (frequency of word j in I) ∗ IDF_j
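A minimal sketch of this weighting, assuming NumPy; histograms is a hypothetical matrix with one raw visual-word count histogram per database image:

    import numpy as np

    def tfidf_weight(histograms):
        """histograms: num_images x num_words matrix of raw word counts."""
        num_images = histograms.shape[0]
        # number of images in which each word appears at least once
        doc_freq = np.count_nonzero(histograms > 0, axis=0)
        idf = np.log(num_images / np.maximum(doc_freq, 1))   # avoid divide-by-zero
        return histograms * idf                               # Bin_j = freq_j * IDF_j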

We can create an inverted file that holds the mapping of words to


documents to quickly compute the similarity between a new image
and all of the images in the database. If we have images that have
around 1000 features, and a database of around 100,000 visual words,
each histogram will be extremely sparse. We would only consider
images whose bins overlap with the new image.
Large-scale image search works well for CD covers and movie
posters, and real-time performance is possible. The downside for
the large scale image search is that the performance of the algorithm
degrades as the database grows. Using the Bag-of-Words model for
this problem sometimes results in noisy image similarities due to
quantization error and imperfect feature detection.39
39: Song Cao and Noah Snavely. Learning to match images in large-scale collections.

Spatial Pyramid Matching

Motivation

So far, we have not exploited the spatial information. But there is a


simple yet smart method to incorporate the spatial information in the
model: spatial pyramid matching.

Pyramids

A pyramid is built by using multiple copies of the source image.


Each level in the pyramid is 1/4 of the size of the previous level. The
lowest level is of the highest resolution and the highest level is of
the lowest resolution. If illustrated graphically, the entire multi-scale
representation looks like a pyramid, with the original image on the
bottom and each cycle’s resulting smaller image stacked one atop the
other.40
40: Wikipedia. Pyramid (image processing). https://en.wikipedia.org/wiki/Pyramid_(image_processing). [Online; accessed 15-Nov-2017]

Bag of Words + Pyramids

Bag of Words alone doesn’t discriminate if a patch was obtained


from the top, middle or bottom of the image because it doesn’t save
any spatial information. Spatial pyramid matching partitions the
image into increasingly fine sub-regions and allows us to compute
histograms (BoW) of local features inside each sub-region.41
41: Tomas Mardones. How should we understand the Spatial Pyramid Matching? https://www.quora.com/How-should-we-understand-the-Spatial-Pyramid-Match [Online; accessed 15-Nov-2017]
If the BoWs of the upper part of the image contain "sky visual
words", the BoWs in the middle "vegetation and mountains visual
words" and the BoWs at the bottom "mountains visual words", then it
is very likely that the image scene category is "mountains".
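A minimal sketch of the idea, assuming NumPy and a quantization step like the bag_of_words sketch earlier (here the codevector index of each feature is already given in words). The original formulation also weights the levels; that detail is omitted:

    import numpy as np

    def spatial_pyramid(keypoints_xy, words, image_w, image_h, k, levels=3):
        """keypoints_xy: integer (x, y) location of each quantized feature;
        words: codevector index of each feature; k: codebook size."""
        pyramid = []
        for level in range(levels):                 # level l has 2^l x 2^l cells
            cells = 2 ** level
            for cy in range(cells):
                for cx in range(cells):
                    in_cell = ((keypoints_xy[:, 0] * cells // image_w == cx) &
                               (keypoints_xy[:, 1] * cells // image_h == cy))
                    hist = np.bincount(words[in_cell], minlength=k)
                    pyramid.append(hist)
        return np.concatenate(pyramid)              # concatenated per-cell histograms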

Some results

Strong features (i.e., a larger vocabulary size) are better than weaker
features (i.e., a smaller vocabulary size). Notice also that, as expected,
incorporating pyramid matching always generates better results than
single-level feature extraction. This is exactly what we expect
because, under the same circumstances, the pyramid approach encodes
more information (i.e., spatial information) than the single-level approach
does.

Naive Bayes

Basic Idea

Once we have produced a visual word histogram, we can use Naive


Bayes to classify the histogram. To do so, we simply measure
whether a given visual word is present or absent, and assume the

presence/absence of one visual word to be conditionally independent


of each other visual word given an object class.

Consider some visual word histogram X, where xi is the count of


visual word i in the histogram. Since we are only interested in the presence
or absence of word i, we have x_i ∈ {0, 1}.

Prior
P(c) denotes the probability of encountering one object class versus
others. For all m object classes, we then have
∑_{c=1}^{m} P(c) = 1

For an image represented by histogram x, and some object class c, we


can compute
P(x|c) = ∏_{i=1}^{m} P(x_i|c)

Posterior
Using the prior equation, we can now calculate the probability than
the image represented by histogram x belongs to class category c
using Bayes Theorem
P(c|x) = P(c) P(x|c) / ∑_{c'} P(c') P(x|c')
Expanding the numerator and denominator, we can rewrite the
previous equation as
P(c|x) = P(c) ∏_{i=1}^{m} P(x_i|c) / ∑_{c'} P(c') ∏_{i=1}^{m} P(x_i|c')

Classification
In order to classify the image represented by histogram x, we simply
find the class c∗ that maximizes the previous equation:

c∗ = argmax c P(c| x )

Since we end up multiplying together a large number of very small


probabilities, we will likely run into unstable values as they approach
0. As a result, we use logs to calculate probabilities:

c∗ = argmax c logP(c| x )

Now consider two classes c1 and c2 :

P(c_1|x) = P(c_1) ∏_{i=1}^{m} P(x_i|c_1) / ∑_{c'} P(c') ∏_{i=1}^{m} P(x_i|c')

and

P(c_2|x) = P(c_2) ∏_{i=1}^{m} P(x_i|c_2) / ∑_{c'} P(c') ∏_{i=1}^{m} P(x_i|c')
Since the denominators are identical, we can ignore them when calculat-
ing the maximum. Thus
P(c_1|x) ∝ P(c_1) ∏_{i=1}^{m} P(x_i|c_1)

and

P(c_2|x) ∝ P(c_2) ∏_{i=1}^{m} P(x_i|c_2)

and for the general class c:

P(c|x) ∝ P(c) ∏_{i=1}^{m} P(x_i|c)

and using logs:


log P(c|x) ∝ log P(c) + ∑_{i=1}^{m} log P(x_i|c)

Now, classification becomes

c* = argmax_c P(c|x)

c* = argmax_c log P(c|x)

c* = argmax_c [ log P(c) + ∑_{i=1}^{m} log P(x_i|c) ]
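A minimal sketch of this classifier over binary presence/absence vectors, assuming NumPy; the priors and conditional probabilities are placeholders that would normally be estimated from training histograms:

    import numpy as np

    def naive_bayes_classify(x, log_prior, log_cond):
        """x: binary presence/absence vector over the m visual words;
        log_prior[c] = log P(c);
        log_cond[c, i] = log P(x_i = 1 | c) for each class c and word i."""
        # log P(c | x), up to a constant: log P(c) + sum_i log P(x_i | c)
        log_absent = np.log1p(-np.exp(log_cond))          # log P(x_i = 0 | c)
        scores = log_prior + (x * log_cond + (1 - x) * log_absent).sum(axis=1)
        return np.argmax(scores)                           # c* = argmax_c log P(c|x)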
Object Detection from Deformable Parts

Introduction to Object Detection

Previously, we introduced methods for detecting objects in an image;


in this lecture, we describe methods that detect and localize generic
objects in images from various categories such as cars and people.
The categories that are detected depend on the application domain.
For example, self-driving cars need to detect other cars and traffic
signs.

Figure 64: An example of an


object detection algorithm
detecting the categories of a
person, dog, and a chair

Challenges
Object detection, however, faces many challenges. The challenges in-
clude the varying illumination conditions, changes in the viewpoints,
object deformations, and intra-class variability; this makes objects of
the same category appear different and makes it difficult to correctly
detect and classify objects. In addition, the algorithms introduced
herein only give the 2D location of the object in the image and not
the 3D location. For example, the algorithms cannot determine if an

object is in front or behind another object. Additionally, the following


object detectors do not provide the boundary of an object it finds; the
object detector just provides a bounding box of where the object was
found.

Current Object Detection Benchmarks

Today, object detection has practically been addressed for certain


applications such as face detection. To evaluate the performance
of an object detector, researchers use standardized object detection
benchmarks. Benchmarks are used to make sure we are moving
forward and performing better with new research.

PASCAL VOC

The first widely used benchmark was the PASCAL VOC Challenge
42, or the Pattern Analysis, Statistical Modeling, and Computational
Learning Visual Object Classes challenge.
42: M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.h
The PASCAL VOC challenge was used from 2005 to 2012 and tested 20 categories. PASCAL
was regarded as a high quality benchmark because its test categories
had high variability within each category. Each test image also had
bounding boxes for all objects of interest like cars, people, cats, etc.
PASCAL also had annual classification, detection, and segmentation
challenges.

ImageNet Large Scale Visual Recognition Challenge

The benchmark that replaced PASCAL is the ImageNet Large Scale


Visual Recognition Challenge (ILSVR)43. The ILSVR Challenge tested

200 categories of objects, significantly more than what PASCAL


tested, had more variability in the object types, and had many objects
in a single image.

Common Objects in Context

Another benchmark that is still used today is the Common Objects


in Context (COCO) challenge44. The COCO challenge tests 80 cat-

egories, but in addition to testing bounding boxes of locations of


objects, it also tests object segmentation, which are detailed bounding
areas of an object. Creating the test images for the dataset used in the
COCO challenge is very time consuming because each object that is
tested needs to be traced almost perfectly.

Figure 65: An example of a


labeled ILSVR test image.

Figure 66: Object segmentation


used for the COCO challenge.

Evaluating Object Detection

When evaluating an object detection algorithm, we want to compare


the predictions with ground truth. The ground truth is provided by
humans who manually classify and locate objects in the images.
When comparing predictions with ground truth, there are four
different possibilities:
1. True Positive (TP)
True positives are objects that both the algorithm (prediction) and
annotator (ground truth) locate. In order to be more robust to slight
variations between the prediction and ground truth, predictions are
considered true positives when the overlap between the prediction
and ground truth is greater than 0.5 (Figure 68a). The overlap be-
tween the prediction and ground truth is defined as the intersection
over the union of the prediction and ground truth. True positives are
also sometimes referred to as hits.
2. False Positive (FP)
False positives are objects that the algorithm (prediction) locates
but the annotator (ground truth) does not locate. More formally, false
positives occur where the overlap of the prediction and ground truth

Figure 67: Yellow boxes repre-


sent ground truth while green
boxes are predictions.

is less than 0.5 (Figure 68b). False positives are also referred to as
false alarms.
3. False Negative (FN)
False negatives are ground truth objects that our model does not
find (Figure 68b). These can also be referred to as misses.
4. True Negative (TN)
True negatives are anywhere our algorithm didn't produce a
box and the annotator did not provide a box. True negatives are also
called correct rejections.

Figure 68: Example classifications using an overlap threshold of 0.5. (a) True positive, because ground truth (yellow box) and prediction (green box) overlap is more than 0.5. (b) False positive, since the prediction boxes (green) do not overlap with any ground truth boxes. (c) False negative, since the ground truth box (yellow) is not detected by the model.

In general, we want to minimize false positives and false negatives
while maximizing true positives and true negatives (Figure 69).
Using the counts of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN), we can calculate two mea-
sures: precision and recall.

Precision = TP / (TP + FP)
Precision can be thought of as the fraction of correct object predic-

Figure 69: Summary chart of


classifications, with green being
counts you want to maximize
and red being counts you want
to minimize.

tions among all objects detected by the model.

Recall = TP / (TP + FN)
Recall can be thought of as the fraction of ground truth objects that
are correctly detected by the model.
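A minimal sketch of these quantities in plain Python, with boxes written as (x1, y1, x2, y2); the 0.5 overlap threshold matches the convention above:

    def iou(box_a, box_b):
        """Intersection over union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)

    def precision_recall(tp, fp, fn):
        precision = tp / float(tp + fp)
        recall = tp / float(tp + fn)
        return precision, recall

    # A prediction counts as a true positive when
    # iou(prediction, ground_truth) > 0.5.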

Figure 70: Predictions are green


boxes while ground truth is
yellow. All the ground truths
are correctly predicted, making
recall perfect. However, pre-
cision is low, since there are
many false positives.

For every threshold we use to define true positives (In Figure 68


the overlap threshold was set to 0.5), we can measure the precision
and recall. Using that information, we can create a Precision-Recall
curve (PR curve). Generally, we want to maximize both precision and
recall. Therefore, for a perfect model, precision would be 1 and recall
would be 1, for all thresholds. When comparing different models and

parameters, we can compare the PR curves. The better models have


more area under the curve.

Figure 71: Faster-RCNN model


is the best of the three models
since it has the most area under
the curve.

However, depending on the application, we may want specific


values for precision and recall. Therefore, you can choose the best
model by fixing, say the recall, and finding the model that has the
best precision at that recall.
We can also use the counts of TP, FP, TN, and FN to see how our
model is making errors.

A Simple Sliding Window Detector

The detection can be treated as a classification problem. Instead


of attempting to produce the location of objects in an image by
processing the entire image at once, slide a window over the image
and classify each position of the window as either containing an
object or not (Figure 72).

Feature Extraction and Object Representation


Dalal and Triggs45 showed the effectiveness of using Histograms of

Oriented Gradient (HOG) descriptors for human detection. Although


their feature extraction methods were focused on human detection,
they can be applied for detecting various objects.
Recall HOG descriptors from lecture 8. An image window is
divided into blocks; the magnitude of the gradients of the pixels in
each block are accumulated into bins according to the direction of the
gradients. These local histograms of the blocks are then normalized
and concatenated to produce a feature representation of the image
window.
Figure 73 shows an example of transforming an image into a HOG
feature space. Producing a prototypical representation of an object

Figure 72: Consider the problem of detecting people in an image. (a) - (c) sliding window across image and at each position classifying window as not containing a person. (b) window over person and classifying window as containing a person. Image source: Flickr user neilalderney123

would then involve considering many image windows labeled with


containing that object. One approach to creating this representation
would be to train a classifier on the HOG descriptors of these many
labeled image windows and then proceed to use the trained classifier
to classify the windows in images of interest. In their aim to improve
human detection, for example, Dalal and Triggs46 train a linear

Support Vector Machine on the HOG descriptors of image windows


containing people.
A more simple approach, and the approach that will be assumed
below, is that of averaging the window images containing an object
of interest and then extracting the HOG descriptor of that average
image to create a template for the object (Figure 74).

Figure 73: An image of a per-


son along with its respective
HOG descriptor. Note that the
last two images are the HOG
descriptor weighted by the
positive and negative weights
of the classifier using them. The
outline of the person is very vis-
ible in the weighted descriptors.
Image source: Dalal and Triggs.

Figure 74: The average face


image above is created by av-
eraging 31 aligned face images
of the same size. The HOG
descriptor of this average face
image can then be used as a
template for detecting faces in
images.

Classifying Windows

Now that we have a method for extracting useful features from an


image window and for constructing object templates, we can proceed
to detecting objects in images. The idea is to compare each window
with the object template and search for matches. That is, the object
template itself acts as a filter that slides across the image. At each
position, the HOG descriptor of the window being compared is ex-
tracted and a similarity score between the two HOG descriptors is
computed. If the similarity score at some location is above a prede-
fined detection threshold, then an object can be said to have been
detected in the window at that location.
The similarity score can be as simple as the dot product of the
window HOG descriptor and the template HOG descriptor. This is

the scoring method that is assumed in following sections.


As effective as the method seems so far, its success is very limited
by the size of the sliding template window. For example, consider the
case where objects are larger than the template being used to detect
them (Figure 75).

Figure 75: Consider, again, the


problem of detecting people,
except this time our sliding win-
dow is much smaller. (a) The
template and sliding window
are still large enough to detect
the smaller, distant person. (b)
The person in the foreground is
a lot bigger than our window
size and is not being detected.
Multi Scale Sliding Window
To account for variations in size of the objects being detected, mul-
tiple scalings of the image are considered. A feature pyramid
(Figure 76) of different image resizings is created. The sliding win-
dow technique is then applied as usual over all the pyramid
levels. The window that produces the highest similarity score out of
the resizings is used as the location of the detected object.
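A minimal sketch of this multi-scale search, assuming NumPy and scikit-image for HOG and rescaling; template_hog is assumed to have been computed with the same HOG parameters and window size, and the loops are illustrative rather than optimized:

    import numpy as np
    from skimage.transform import rescale
    from skimage.feature import hog

    def multiscale_detect(image, template_hog, win_h, win_w, step=8,
                          scales=(1.0, 0.75, 0.5)):
        """Slide a HOG template over several rescalings of a grayscale image
        and return the best-scoring window as (score, scale, row, col)."""
        best = (-np.inf, None, None, None)
        for s in scales:
            img = rescale(image, s)
            for r in range(0, img.shape[0] - win_h, step):
                for c in range(0, img.shape[1] - win_w, step):
                    window = img[r:r + win_h, c:c + win_w]
                    score = hog(window) @ template_hog      # dot-product similarity
                    if score > best[0]:
                        best = (score, s, r, c)
        return best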

Figure 76: Using a feature pyra-


mid of different image resizings
allows the object template to
match with objects that might
have originally been bigger
or much smaller than the
template. Image source: Lecture
15, Slide 40

The Deformable Parts Model (DPM)

The simple sliding window detector is not robust to small changes


in shape (such as a face where the eyebrows are raised or the eyes
are further apart, or cars where the wheels may be farther apart or

the car may be longer) so we want a new detection model that can
handle these situations. Recall the bag of words approach, in which
we represent a paragraph as a set of words, or an image as a set of
image parts. We can apply a similar idea here and detect an object
by its parts instead of detecting the whole singular object. Even if
the shape is slightly altered, all of the parts will be present and in
approximately the correct position with some minor variance.

Early Deformation Model for Face Detection

In 1973, Fischler and Elschlager developed a deformable parts model


for facial recognition47: the parts of the face (such as eyes, nose,

and mouth) are detected individually, and there are spring-like


connections between each part that allows each part to move slightly,
relative to the other parts, but still largely conform to the typical
configuration of a face. This allows a face to be detected even if the
eyes are farther apart or a change in orientation pushes some parts
closer together, since each part has some flexibility in its relative
location in the face. See Figure 77 for an illustration.

Figure 77: An illustration of


Fischler and Elschlager’s de-
formable face model

More formally, the springs indicate that there is a desired relative


position between two parts, and just like a spring stretches or com-
presses, we allow some deviations from that desired position, but
apply an increasing penalty for larger deviations, much like a spring
pulls back harder and harder as it is stretched further.

More General Deformable Parts Models

The deformable model depicted in Figure 77 is one specific de-


formable model for face detection, where Fischler and Elschlager
chose springs between parts that worked well with their methods
for face detection. For a more general deformable parts model, one
popular approach is the star-shaped model, in which we have some
detector as the root and we have springs between every other part
and the root (see Figure 78a for an illustration).

Figure 78: On the left is a vi-


sualization of the star model,
with x1 as the root, and on the
right is an example of person
detection with a star model for
deformations


In face-detection, for example, we could have a detector for the


entire body as the root and thus define the location of all body parts
relative to the location of entire body. This is shown in Figure 78b,
where the cyan box is the location of the bounding box from global
person detection (which we use as the root) and the yellow boxes are
the bounding boxes resulting from the detection of each part.
This means that the head should be near the top-center of the
global location of the person, the feet should be near the bottom,
the right arm should be near the left of the image (if the detector is for a
front-facing person), etc. In this class we will assume that we already
know which parts to use (such as head, arms, and legs if we are
detecting people), but it is possible to learn the parts for optimal
object detection through machine learning.

Examples of Deformable Parts Models


It is typical to use a global detector for the desired object (such as a
person or a bike) as the root, and to use smaller and more detailed
filters to detect each part. As a simple example, see Figure 79

Figure 79: A global HOG fil-


ter for a person and a more
detailed HOG filter for the head

It is also common to use a multi-component model for multiple


orientations of an object, in which we have a global filter and multi-
ple parts filters for each orientation. A deformable model will protect
against the changing positions of parts due to mild rotation, but for
more significant rotations such as 90 degrees, all of the parts looks
different and require different detectors for each rotation. As an ex-

ample, in figure 80, each row corresponds to an orientation. Also,


the left column is a global car detector for a particular orientation,
the middle column contains all of the finer detailed parts filters
for that orientation, and the right column shows the deformation
penalties for each part in that orientation (where darker grey in the
center is smaller penalties and white further from the center is larger
penalties).

Figure 80: Deformable Parts


model of a car with multiple
orientations

As another example, consider 2 orientations for bike detection. A


bicycle looks very different from the front and from the side so we
want multiple detectors. Here we can see the parts identified in the
original image alongside the deformable model.

Figure 81: Deformable Parts


model of a bike with original
image shown

Calculating Score for Deformable Parts Models


To implement a deformable parts model we need a formal way
to calculate score. To do this we will calculate a global score from
the global object detector, and then calculate a score for each part,
determined by its deformation penalty. The final score is the global
score minus all of the deformation penalties, such that an object that
is detected strongly but has many of the parts very far from where
they should be will be highly penalized.
To express this more formally we will first define some variables:
The entire model with n parts is encompassed by an (n+2) tuple:

( F0 , P1 , P2 , ...Pn , b)

where F0 is the root filter, P1 is the model for the first part, and b is a
bias term. Breaking it down further, each part’s model Pi is defined
by a tuple
( Fi , vi , di )

where Fi is the filter for the i-th part, vi is the "anchor" position for
part i relative to the root position, and di defines the deformation
cost for each possible placement of the part relative to the anchor
position.
We can calculate the location of the global filter and each part filter
with a HOG pyramid (see Figure 82). We run the global HOG filter
and each part’s HOG filter over the image at multiple scales so that
the model is robust to changes in scale. The location of each filter is
the location where we see the strongest response. Since we are taking
the location of responses across multiple scales we have to take care
that our description of the location of each part is scale-invariant
(one way this can be done is by scaling the maximum response map
for each part up to the original image size and then taking scale-
invariant location).

Figure 82: Illustration of HOG


pyramid for deformable parts
detector

We calculate the detection score as


∑_{i=0}^{n} F_i · φ(p_i, H) − ∑_{i=1}^{n} d_i · (dx_i, dy_i, dx_i^2, dy_i^2)

Taken as a whole, this means that we are finding the detection score


of the global root and all the parts and subtracting all of the deforma-
tion penalties.
The left term represents the sum of the scores for the global
filter and each part filter (note that this is identical to the simpler
Dalal and Triggs48 or sliding window method explained previously).

Recall that the score for each filter is the inner product of the filter
(as a vector) and φ( pi , H ) (defined as the HOG feature vector of a
window defined by position pi of the filter). Note that the windows
can be visualized in the HOG pyramid in Figure 82: the window for
the root is the cyan bounding box, and the window for each of the
parts is the yellow bounding box corresponding to that part. We are
taking the HOG feature vector of the portion of the image enclosed in
these bounding boxes and seeing how well it matches with the HOG
features of the template for that part.
Returning to the score formula, the right term represents the sum
of the deformation penalties for each part. We have di representing
the weights of each penalty for part i, corresponding to quantities
dxi (the distance in x direction from the anchor point where the part
should be), dyi (the distance in y direction from the anchor point
where the part should be), as well as dxi2 and dy2i . As an example, if
di = (0, 0, 1, 0), then the deformation penalty for part i is the square
of the distance in the x direction of that part from the anchor point.
All other measures of distance are ignored.
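A minimal sketch of this scoring rule, assuming NumPy and that the filter responses F_i · φ(p_i, H) at the chosen placements have already been computed; all inputs are hypothetical:

    import numpy as np

    def dpm_score(filter_scores, displacements, deform_weights):
        """filter_scores: responses of the root (index 0) and the n part filters;
        displacements[i] = (dx, dy) of part i+1 from its anchor position;
        deform_weights[i] = d_i, the weights on (dx, dy, dx^2, dy^2)."""
        score = float(np.sum(filter_scores))              # root + part appearance terms
        for (dx, dy), d in zip(displacements, deform_weights):
            penalty = np.dot(d, [dx, dy, dx ** 2, dy ** 2])
            score -= penalty                               # subtract deformation cost
        return score

    # Example: d_i = (0, 0, 1, 0) penalizes only the squared x-displacement.
    print(dpm_score([2.0, 1.5], [(3, 0)], [np.array([0, 0, 1, 0])]))  # 3.5 - 9 = -5.5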

The DPM Detection Pipeline

The Deformable Parts Model detection pipeline has several stages.


We must first use the global filter to detect an object. Then, the parts
filters are used to calculate the overall score of that detection.
1. Generate copies of the original image at different resolutions (so
that the fixed window size can capture objects at varied scales);
store the HOGs for these different resolutions, as that is what the
filters will be applied to.

2. Apply the global filter to these images. Upon a detection by the
global filter, apply the parts filters. This step represents the section
of the pipeline depicted in Fig. 84, and contributes the term:

$$\sum_{i=0}^{n} F_i \cdot \phi(p_i, H)$$
Figure 83: Illustration of full DPM pipeline

3. Having applied the parts filter, we now calculate the spatial costs
(i.e., a measure of the deformation of the parts with regards to the
global):

$$\sum_{i=1}^{n} d_i \cdot (dx_i,\, dy_i,\, dx_i^2,\, dy_i^2)$$

4. For every detection, we now sum these components to calculate
the detection score:

$$\sum_{i=0}^{n} F_i \cdot \phi(p_i, H) \;-\; \sum_{i=1}^{n} d_i \cdot (dx_i,\, dy_i,\, dx_i^2,\, dy_i^2)$$

5. These scores then represent the strength of the detection of the
given object at each coordinate of the image; hence, we can plot
the response scores throughout the image:

Figure 84: Step 2 of the detection pipeline

DPM Detection Results

The Deformable Parts Model makes some key assumptions: an object


is defined by a relationship between a global representation (e.g., a
car) and representations of parts (e.g. wheels, or a bumper); that the
strength of this detection increases with the decrease in deformation
between the root and the parts; and that if a high response score is
achieved, that object is indeed present (regardless of the potential
presence of different categories of objects). Hence, DPM is vulnerable
to error in situations where these assumptions are violated:
Note how in the top image on the right, DPM has successfully
found sections matching the parts filters: wheels and a windshield.
DPM has also successfully found one true positive for the global
filter (the car in the background). However, DPM assumes that these
Figure 85: Step 3 of the detection pipeline

parts are related to each other because they are spatially close (i.e.,
they fit the deformation model), and that they correspond to the car
identified as the global filter – when in reality, there is not one but
two cars in the image providing the parts.
Similarly, with the bottom image, DPM indeed detects an object
very close to a car. However, since it does not take into account that
the object is even closer to being a bus than a car, and does not take
into account features explicitly not present in a car (e.g., the raised
roof spelling "School bus"), DPM results in a wrong detection.

Figure 86: Response scores

Figure 87: Typical errors for the DPM model

DPM Summary

Approach

• Manually selected set of parts: a specific detector is trained for each part

• Spatial model is trained on the part activations

• Joint likelihood is evaluated for these activations

Advantages

• Parts have an intuitive meaning

• Standard detection approaches can be used for each part

• Works well for specific categories

Disadvantages

• Parts need to be selected manually

• Semantically motivated parts sometimes don't have a simple
appearance distribution

• No guarantee that some important part hasn't been missed

• When switching to another category, the model has to be rebuilt


from scratch
Notably, the Deformable Parts Model was state-of-the-art 3-4 years
ago, but has since fallen out of favor. Its "fully-connected" extension,
pictured below, in particular is no longer used in practice:

Figure 88: DPM extension: fully-connected shape model
Semantic Hierarchies and Fine Grained Recognition

Introduction

One overarching goal of computer vision is to determine not only an
object's class, but also other relevant information about that object.
Motivating examples include:

• Determining the number of calories in a dish of sushi from its image

• Using the image of a tiger to identify the safest course of action

• Given an image of a chair, finding out where we can buy it online

• Using images of mushrooms to identify whether they are edible

In general, computer vision should be able to give helpful information
about an object beyond just basic characteristics.

What can computers recognize today?

Computer vision algorithms can easily detect and recognize faces in


images.
Object identification software such as Google Goggles can find
very specific kinds of objects (e.g. logos, landmarks, books, text,
contact info, and artwork), but can only find exact matches. Finding a
generic object is much harder for today’s systems.

What’s next to work on?

A principal milestone in computer vision yet to be reached is universal
class recognition. That is, we would like an algorithm that could,
for an image of any object, identify its class. Examples include:

• For an image of an orange mug, we would like coffee mugs to be
the result, but image search found similar images by looking for
something orange in the center, not mugs.

• Given an image of someone with a gas pump, we'd like to recognize
that the gas pump is in the picture, but Google image search
gives us images of people standing in the same area of the picture,
maybe wearing the same color clothing.

The PASCAL VOC challenge (Everingham et al.) taught models how to
classify 20 classes of objects. However, a particular model can only recognize
a finite number of object classes. A model cannot recognize an ar-
bitrary object, such as a coffee mug, if it is outside the model’s set
of training classes. We want to be able to recognize everything, but
there are a lot of “things”, and the number of “things” people care
about keeps increasing over time, so it is hard to decide which ones
to focus on. In addition, there is not even an agreed-upon number of “things”
in the universe – there are 60K+ product categories on eBay, 80K+
English nouns on WordNet, 3.5M+ unique tags on Flickr, and 4.1M+
articles on Wikipedia.

Big Data from the Internet

Because of the sheer amount of available data, the Internet should be


able to teach us a huge number of things. Internet traffic has been
steadily increasing, and the vast majority (86%) is visual data.
Visual data is known as the dark matter of the Internet because
we can’t autoclassify or auto-analyze it. Trying to categorize a pho-
toshopped “pitbullfrog” shows that machines struggle to generalize,
improvise, and extrapolate with the ease that humans do. Thus, im-
ages are not enough; just as important is leveraging the crowd, which
can provide answers to questions that machines would otherwise
have difficulty tackling themselves, such as in the following example:
Vitally, behind the big data is all of the users who are contributing
their domain knowledge. When an individual cannot find the answer
to a problem, it is easy to find others that can help. Image recognition
systems, on the other hand, do not necessarily achieve this right now.
The Internet gives an environment in which these two resources can
be combined.

ImageNet and Confusion Matrices

Since models are limited by the number of the classes in the training
set, one possible method of advancing universal class recognition is
to create datasets with many more than the 20 classes the PASCAL
VOC provided.
ImageNet is a large image database created by Jia Deng in 2009
that has 22,000 categories and 14 million images.

Figure 89: Global Internet Traffic. Source: Lecture 16

It has become so important to computer vision that most projects now
train on ImageNet


before tackling other challenges. Other well-known databases in-
clude: Caltech101 (9K images, 101 categories), LabelMe (30k images),
SUN (131K images).
Deng applied four top classification methods from PASCAL VOC
to ImageNet, but found that increasing the number of categories
on which the models trained decreased their accuracy greatly. He
plotted his findings as the following confusion matrix:
A confusion matrix plots categories on the x and y axis and mea-
sures the degrees to which objects in a category are correctly or
incorrectly classified. If we had a perfect detector, the only bright
areas would be along the center diagonal line; this means everything
is correctly classified (i.e., no misclassification).
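As a small illustration, the sketch below accumulates a confusion matrix from
true and predicted labels (plain NumPy; the toy label arrays are made up).
Row i, column j counts how often an example of true class i was predicted as
class j, so a perfect classifier fills only the diagonal.

import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes; class 2 is often confused with class 1.
true_labels      = [0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 0, 1, 1, 1, 2, 1]
print(confusion_matrix(true_labels, predicted_labels, num_classes=3))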
Figure 90: Example Image Recognition Problem. Source: Lecture 16

Figure 91: ImageNet Confusion Matrix. Source: Lecture 16

We see from this matrix that models are more prone to making
classification errors when presented with items whose categories are
very similar; for example, models struggle with discerning between
different types of aquatic birds, while it can more easily categorize
birds versus dogs. In other words, distinguishing between “fine-
grained” categories is very difficult, since the distance between those
categories is much smaller than between larger categories.

Challenges and Solutions

Semantic Hierarchy
One method for solving the issue of correctly classifying similar
classes without making wrong guesses is the idea of a “semantic
hierarchy”. An example can be seen in Figure 92.

Figure 92: Example Semantic Hierarchy. Source: Lecture 16

Essentially, a tree is


created, where every child is a more specific subclass of the parent.
As the category gets more specific, uncertainty increases. The system
will attempt to be as specific as possible (as low down on the tree
as possible) without an unacceptable probability of misclassifying
the object. This concept is called “hedging” - the system tries to
identify where in the uncertainty tree to make a guess such that it is
as informative as possible with few mistakes.
To formally define this problem, we will assume that the training
and test set have the same distribution. In addition, we will assume
that we have access to a base classifier g that gives a posterior prob-
ability on the hierarchy. We then define a reward function R( f ). We
make R( f ) to have higher values at points farther down the hierar-
chical tree (when it classifies objects more specifically). Then, we also
define an expected accuracy function A( f ). This function decreases
as we move down the tree (since we are less certain about more
specific guesses). Our problem statement is then to maximize R( f )
subject to the constraint that A( f ) ≥ 1 − e, where e is a predetermined
constant that represents the allowable error of our classifier over all
of our examples.
However, potential non-optimal solutions (in terms of R( f )) can
arise given this method. In order to fix this, we can define a global,
fixed, scalar parameter λ ≥ 0. For each node, we add λ to the reward
value of that node, then normalize the posterior distribution. The
process is then as follows:

1. Pick a λ.

2. Find the decision rule f with λ.



3. Measure the performance on the validation set.

4. Check if A ≈ 1 − e. Repeat from step one if necessary.

We can use binary search to find the best λ quickly.
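A sketch of this search is given below. The accuracy_on_validation function is
hypothetical: it stands in for steps 2-3, i.e., building the decision rule f
for a given λ and measuring its accuracy on the validation set. The search
bounds and tolerance are illustrative, and the sketch assumes (as in the
hedging formulation) that accuracy does not decrease as λ grows, since larger
λ pushes predictions toward safer, more generic nodes.

def find_lambda(accuracy_on_validation, target_accuracy,
                lo=0.0, hi=100.0, tol=1e-3, max_iters=50):
    """Binary search for a lambda whose decision rule satisfies A(f) >= 1 - e."""
    for _ in range(max_iters):
        mid = 0.5 * (lo + hi)
        if accuracy_on_validation(mid) >= target_accuracy:
            hi = mid   # constraint met: try a smaller lambda (more reward)
        else:
            lo = mid   # too many mistakes: hedge harder with a larger lambda
        if hi - lo < tol:
            break
    return hi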

Fine-grained Classes
Existing work selects features from all possible locations in an image,
but it can fail to find the right feature. For example, the difference
between the Cardigan Welsh Corgi and the Pembroke Welsh Corgi
is the tail. The computer may be unable to discover that this is the
distinguishing feature. The easiest way to solve this type of problem
is through crowd-sourcing.
What seems like a simple solution, however, is actually difficult
– what is the best method for asking a crowd which features distin-
guish classes of images?
Bubble study: In order to detect the difference between smiling
and neutral faces, we display only small bubbles of an image to
people in an experiment to determine which bubbles allow them
to detect when the image is of a person smiling. An example of
how the bubble method works can be seen in Figure 93. However,
this method is costly and time consuming. Another idea is to ask

people to annotate images themselves (to simply point out where the
important features are). However, this method has no way of easily
assuring the quality of these annotations.

Figure 93: Bubble Method. Source: Lecture 16

Crowd-sourced bubble
games: online bubble games are similar in nature to the bubble study,
but are fun and have monetary rewards for players. Notice that this
solves both problems of the above methods – it is cost effective to
have people play, while quality is assured by the reward system of
the game. Two examples of games include the following:

• Peekaboom: Peekaboom was a two-player game in which one


player reveals parts of an image, while the other player attempts
to guess what the image represents. This game worked well for
finding important parts of an image, but was not good for distin-
guishing between fine-grained classes (since often the important
parts of the classes are the same).

• The Bubbles Game: This game, created by Jia Deng, is an online


game in which players are given example images of two classes.
Then, a blurred-out image is shown, and the player attempts to
guess which class the blurred image belongs to, while revealing
only small parts of the image. The less of the image revealed, the
more reward is given to the player. This game is much better for
finding differences between fine-grained classes, and it led to
bubble heatmaps that pinpointed the features that distinguished
fine-grained classes, as seen in Figure 94. The creators of the game
pooled bubbles from multiple images in a category together, and
the resulting information was provided to a model, which performed
with higher precision than other models at the time.

Figure 94: Bubble Heatmaps. Source: Lecture 16
Motion

Introduction

In this class so far we have learned a variety of tools that enable


us to detect key points, recognize objects, and use segmentation
in images. However, in many cases we want to be able to perform
similar operations on video. Specifically, we are often interested
not only in the location of certain objects, but also the movement of
these objects over time. This lecture focuses on how we can apply
previously covered techniques along with new methods to effectively
track the motion of pixels across many images, with applications in
areas such as self-driving cars, robots, and security systems to name
a few.

Optical Flow and Key Assumptions

Optical Flow
Put simply, optical flow is the movement of pixels over time. The
goal of optical flow is to generate a motion vector for each pixel
in an image between t0 and t1 by looking at two images I0 and I1 .
By computing a motion vector field between each successive frame
in a video, we can track the flow of objects, or, more accurately,
"brightness patterns" over extended periods of time. However, it
is important to note that while optical flow aims to represent the
motion of image patterns, it is limited to representing the apparent
motion of these patterns. This nuanced difference is explained more
in depth in the Assumptions and Limitations section.

Assumptions and Limitations


Apparent Motion Given a two dimensional image, optical flow can
only represent the apparent motion of brightness patterns, meaning
that the movement vectors of optical flow can be the result of a
variety of actions. For instance, variable lighting can cause strong

motion vectors on static objects, and movement into or out of the


frame cannot be captured by the 2D motion vectors of optical flow.
One example of an issue poorly dealt with by optical flow is the
aperture problem.

Figure 95: In the aperture problem, the line appears to have moved
to the right when viewed only in the context of the frame, but the
true motion of the line was down and to the right. The aperture
problem is a result of optical flow being unable to represent motion
along an edge, an issue that can lead to other errors in motion
estimation as well.

Brightness Consistency As optical flow can only represent apparent


motion, to correctly track the motion of points on an image we must
assume that these points remain at the same brightness between
frames. The brightness consistency equation is as follows:
I ( x, y, t − 1) = I ( x + u( x, y), y + v( x, y), t)

where u(x,y) represents the horizontal motion of a point and v(x,y)


represents the vertical motion.

Small Motion Optical flow assumes that points do not move very
far between consecutive images. This is often a safe assumption, as
videos are typically comprised of 20+ frames per second, so motion
between individual frames is small. However, in cases where the
object is very fast or close to the camera this assumption can still
prove to be untrue. To understand why this assumption is necessary,
we must consider the Brightness Consistency equation defined
above. When trying to solve this equation, it is useful to linearize the
right side using a Taylor expansion. This yields

I ( x + u( x, y), y + v( x, y), t) ≈ I ( x, y, t − 1) + Ix · u( x, y) + Iy · v( x, y) + It

Linearizing in this way allows us to solve for the u and v motion


vectors we want, but in this case we have only included the first order
Taylor series terms. When motion is large between frames, these
terms do a poor job of capturing the entire motion, thus leading to
inaccurate u, v. More information about higher order derivative


constraints can be found in references [1], page 12.

Spatial Coherence Spatial coherence is the assumption that nearby


pixels will move together, typically because they are part of the
same object. To see why this assumption is necessary, consider the
equation for optical flow as defined above

I ( x + u( x, y), y + v( x, y), t) ≈ I ( x, y, t − 1) + Ix · u( x, y) + Iy · v( x, y) + It

I ( x + u( x, y), y + v( x, y), t) − I ( x, y, t − 1) = Ix · u( x, y) + Iy · v( x, y) + It
Giving us
Ix · u + Iy · v + It ≈ 0
∇ I · [u v] T + It = 0
Ignoring the meaning of this derivation for the moment, it is clear
that we do not have enough equations to find both u and v at every
single pixel. Assuming that pixels move together allows us to use
many more equations with the same [u v], making it possible to solve
for the motion of pixels in this neighborhood.

Lucas-Kanade

Recovering image motion given by (u, v) in the above equation


requires at least two equations per pixel. To achieve this, the Lucas-
Kanade technique (Lucas and Kanade, "An iterative image registration
technique with an application to stereo vision", 1981) relies on an
additional constraint: spatial coherence.
The spatial coherence constraint is applied to a pixel using a
window of size k × k. The assumption is that the neighboring pixels in
this window will have the same (u, v). For example, in a 5×5 window
the following equations apply:

$$0 = I_t(p_i) + \nabla I(p_i) \cdot [u\ v]^T, \qquad i = 1, \dots, 25$$

$$A = \begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_{25}) & I_y(p_{25}) \end{bmatrix}, \qquad d = \begin{bmatrix} u \\ v \end{bmatrix}, \qquad b = -\begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_{25}) \end{bmatrix}$$
This produces an overly-constrained system of linear equations of
the form Ad = b. Using a least squares method for solving over-
constrained systems, we reduce the problem to solving for d in
$(A^T A)d = A^T b$. More explicitly, the system to solve reduces to

$$\underbrace{\begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_y I_x & \sum I_y I_y \end{bmatrix}}_{A^T A} \begin{bmatrix} u \\ v \end{bmatrix} = - \underbrace{\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}}_{A^T b}$$
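As an illustration, here is a minimal NumPy sketch that solves this 2 × 2
system for a single window. The gradient computation via np.gradient, the
window size, and the eigenvalue thresholds (which implement the conditions
discussed next) are simplifications chosen for brevity, not values from any
particular implementation.

import numpy as np

def lucas_kanade_window(prev_frame, next_frame, x, y, half_window=7):
    """Estimate (u, v) for the pixel (x, y) from one k x k window."""
    # Spatial gradients from the first frame, temporal gradient between frames.
    Iy_full, Ix_full = np.gradient(prev_frame.astype(float))
    It_full = next_frame.astype(float) - prev_frame.astype(float)

    win = (slice(y - half_window, y + half_window + 1),
           slice(x - half_window, x + half_window + 1))
    Ix, Iy, It = Ix_full[win].ravel(), Iy_full[win].ravel(), It_full[win].ravel()

    A = np.stack([Ix, Iy], axis=1)           # (k*k, 2) matrix of gradients
    ATA = A.T @ A
    ATb = -A.T @ It

    # Only solve when A^T A is well-conditioned (flat regions and pure edges
    # are rejected, for the reasons discussed below).
    eigvals = np.linalg.eigvalsh(ATA)
    if eigvals.min() < 1e-2 or eigvals.max() > 100 * max(eigvals.min(), 1e-12):
        return None
    u, v = np.linalg.solve(ATA, ATb)
    return u, v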

Condition for an Existing Solution


In order to solve the system, the following conditions should hold:

• $A^T A$ should be invertible

• $A^T A$ should not be too small due to noise:
the eigenvalues λ1 and λ2 of $A^T A$ should not be too small

• $A^T A$ should be well-conditioned,
i.e., λ1 /λ2 should not be too large (for λ1 > λ2 )

Geometric Interpretation
It should be evident that the least squares system of equations above
produces a second moment matrix $M = A^T A$. In fact, this is the
Harris matrix for corner detection.

$$A^T A = \begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_y I_x & \sum I_y I_y \end{bmatrix} = \sum \begin{bmatrix} I_x \\ I_y \end{bmatrix} \begin{bmatrix} I_x & I_y \end{bmatrix} = \sum \nabla I (\nabla I)^T = M$$

We can relate the conditions above for solving the motion field [u v]
to tracking corners detected by the Harris matrix M. In particular, the
eigenvectors and eigenvalues of M = A T A relate to the direction and
magnitude of a possible edge in a region.
Using this interpretation, it is apparent that an ideal region for
Lucas-Kanade optical flow estimation is a corner. Visually, if λ1 and
λ2 are too small this means the region is too "flat". If λ1 >> λ2 ,
the method suffers from the aperture problem, and may fail to solve
for correct optical flow.

Figure 96: Conditions for a solvable matrix $A^T A$ may be interpreted
as different edge regions depending on the relation between λ1 and
λ2 . Corner regions produce more optimal conditions.
Figure 97: Example of regions with large λ1 and small λ2 (left), small
λ1 and small λ2 (center, low-texture region), large λ1 and large λ2
(right, high-texture region)

Error in Lucas-Kanade
The Lucas-Kanade method is constrained under the assumptions of
optical flow. Supposing that A T A is easily invertible and that there is
not much noise in the image, errors may still arise when:

• Brightness constancy is not satisfied, meaning that a pixel may


change intensity from different time steps.

• The motion is not small or does not change gradually over time.

• Spatial coherence is not satisfied, meaning neighboring pixels do
not move alike. This may arise due to an inappropriately sized
window (choosing a bad k).

Improving Accuracy
From the many assumptions made above, Lucas-Kanade can improve
its accuracy by including the higher order terms previously dropped
in the Taylor expansion approximation for the brightness constancy
equation. This loosens the assumptions of small motion and more
accurately reflects optical flow. Now, the problem to be solved is:

I ( x + u, y + v) = I ( x, y) + Ix u + Iy v + higher order terms − It−1 ( x, y)

This is a polynomial root finding problem and can be solved with


an iterative approach using Newton's method.
In summary, the refined Iterative Lucas-Kanade Algorithm may be
applied as:

1. Estimate velocity at each pixel by solving Lucas-Kanade equations.

2. Warp I (t − 1) towards I (t) using the estimated flow field and


image warping techniques.

3. Repeat until convergence.


Horn-Schunck

Horn-Schunck Method for Optical Flow

The Horn-Schunck method for computing optical flow formulates
flow as the following global energy function, which should be
minimized with respect to u(x, y) and v(x, y):

$$E = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha^2 \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx\, dy$$

The first term of this energy function reflects the brightness constancy
assumption, which states that the brightness of each pixel remains
the same between frames, though the location of the pixel may
change. According to this assumption, Ix u + Iy v + It should be
zero. The square of this value is included in the energy function to
ensure that this value is as close to zero as possible, and thus u and v
comply with the brightness constancy assumption.
The second term of this energy function is a smoothness term: it
penalizes the squared magnitudes of the gradients of u and v, and so
encourages flow fields in which nearby points move in a similar way,
with only small spatial changes in the flow. The regularization constant α
is included to control smoothness, with larger values of α leading to
smoother flow.
To minimize the energy function, we take the derivative with
respect to u and v and set to zero. This yields the following two
equations
Ix ( Ix u + Iy v + It ) − α2 ∆u = 0
Iy ( Ix u + Iy v + It ) − α2 ∆v = 0
where $\Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}$ is the Laplace operator, which in practice
is computed as
$$\Delta u(x, y) = \bar{u}(x, y) - u(x, y)$$
where ū(x, y) is the weighted average of u in a neighborhood around
(x, y). Substituting this expression for the Laplacian in the two
equations above yields

( Ix2 + α2 )u + Ix Iy v = α2 ū − Ix It

Ix Iy u + ( Iy2 + α2 )v = α2 v̄ − Iy It
which is a linear equation in u and v for each pixel.

Iterative Horn-Schunck
Since the solution for u and v for each pixel ( x, y) depends on the op-
tical flow values in a neighborhood around ( x, y), to obtain accurate

values for u and v we must recalculate u and v iteratively once the


neighbors have been updated. We can iteratively solve for u and v
using
$$u^{k+1} = \bar{u}^k - \frac{I_x \left( I_x \bar{u}^k + I_y \bar{v}^k + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}$$

$$v^{k+1} = \bar{v}^k - \frac{I_y \left( I_x \bar{u}^k + I_y \bar{v}^k + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}$$

where ūk and v̄k are the values for ū and v̄ calculated during the k’th
iteration, and uk+1 and vk+1 are the updated values for u and v for
the next iteration.
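A minimal NumPy/SciPy sketch of this iteration is shown below. It assumes Ix,
Iy, It are precomputed gradient images of the same shape; the 3 × 3 averaging
kernel used for ū and v̄ and the fixed iteration count are illustrative
choices rather than prescriptions.

import numpy as np
from scipy.ndimage import convolve

def horn_schunck(Ix, Iy, It, alpha=1.0, num_iters=100):
    """Iteratively estimate a dense flow field (u, v)."""
    u = np.zeros_like(Ix, dtype=float)
    v = np.zeros_like(Ix, dtype=float)
    # Weighted-average kernel used to compute u_bar and v_bar.
    kernel = np.array([[1/12, 1/6, 1/12],
                       [1/6,  0.0, 1/6 ],
                       [1/12, 1/6, 1/12]])
    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(num_iters):
        u_bar = convolve(u, kernel)
        v_bar = convolve(v, kernel)
        t = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * t
        v = v_bar - Iy * t
    return u, v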

Smoothness Regularization
The smoothness regularization term ||∇u||2 + ||∇v||2 in the energy
function encourages minimizing change in optical flow between
nearby points. With this regularization term, in texture free regions
there is no optical flow, and on edges, points will flow to the nearest
points, solving the aperture problem.

Dense Optical Flow with Michael Black’s Method


Michael Black extended the Horn-Schunck method by replacing the
regularization term ||∇u||2 + ||∇v||2 , which is a quadratic function of
the magnitudes of the gradients of u and v, with a robust penalty
function that grows more slowly for large values, so that sharp flow
discontinuities (for example, at motion boundaries) are not smoothed
away by the quadratic penalty.

Pyramids for Large Motion

Revisiting Lucas-Kanade, recall that one of our original assumptions


was that there would be small motion of points between consecutive
frames. This assumption causes the algorithm to fall apart when
dealing with large motion:

Notice in the graphic above, Lucas-Kanade can’t find a consistent


vector for the flow of the tree trunk. To correct for this, we can
apply Lucas-Kanade iteratively, starting from a lower-resolution
version of the image, similar to how we created image pyramids for
our sliding-window feature detector.

Now, when we try to find the flow vector, the small motion condi-
tion is fulfilled, as the downsampled pixels move less from frame to
consecutive frame than pixels in the higher resolution image. Here is
another example from the slides using Lucas-Kanade with pyramids:

Notice how the flow vectors now point mostly in the same direc-
tion, indicating that the tree trunk is moving in a consistent direction.
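This coarse-to-fine strategy is what OpenCV's pyramidal Lucas-Kanade
implements; the sketch below shows a typical call (the file names and
parameter values are illustrative). The maxLevel argument controls how many
pyramid levels are used: more levels mean the motion at the coarsest scale is
small enough to satisfy the small-motion assumption, even when the motion in
the full-resolution frames is large.

import cv2
import numpy as np

# Two consecutive grayscale frames (illustrative file names).
prev_gray = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Pick good points to track.
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                             qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade: maxLevel pyramid levels, iterative refinement.
p1, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, next_gray, p0, None,
    winSize=(21, 21), maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

flow_vectors = (p1 - p0)[status.ravel() == 1].reshape(-1, 2)
print("median flow:", np.median(flow_vectors, axis=0))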

Common Fate

We can gain more information about an image by analyzing it


through the lens of common fate, which in this context is the
idea that each pixel in a given segment of the image will move in
a similar manner. Our goal is to identify the image segments, or
"layers", that move together.

Identify Layers
We compute layers in an image by dividing the image into blocks and
grouping based on the similarity of their affine motion parameters.
For each block, we find the vector a that minimizes

Err ( a) = ∑[ Ix (a1 + a2 x + a3 y) + Iy (a4 + a5 x + a6 y) + It ]2


for all pixels (x, y) in each block.
The above equation is derived from two parts: (1) the brightness
constancy equation, and (2) the components of affine motion:

Ix u( x, y) + Iy v( x, y) + It ≈ 0

Ix , Iy , It are the gradients of the image with respect to the x direction,
y direction, and time, respectively. u( x, y) and v( x, y) are the
components of affine motion in the horizontal and vertical directions:

u( x, y) = a1 + a2 x + a3 y

v( x, y) = a4 + a5 x + a6 y
From there, we map our parameter vectors ai into motion param-
eter space and perform k-means clustering on the affine motion
parameter vectors.

The final centers of the k-means clustering are the parameters


a1 ...a6 that minimize the above error function, and the vectors ai
in each grouping correspond to the original blocks that should be


grouped in a single layer. Intuitively, layers should be comprised
of blocks that have similar parameters, as that implies their affine
motion is similar.
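A sketch of this grouping procedure is given below. The block size and the
number of layers are illustrative; Ix, Iy, It are assumed to be precomputed
gradient images, and scikit-learn's KMeans stands in for the clustering step.
Each block contributes one least-squares estimate of the affine parameters
a1..a6, and blocks whose parameter vectors fall in the same cluster are
assigned to the same layer.

import numpy as np
from sklearn.cluster import KMeans

def affine_params_per_block(Ix, Iy, It, block=16):
    """Least-squares affine motion parameters a1..a6 for each image block."""
    h, w = Ix.shape
    params = []
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            ys, xs = np.mgrid[by:by + block, bx:bx + block]
            ix = Ix[by:by + block, bx:bx + block].ravel()
            iy = Iy[by:by + block, bx:bx + block].ravel()
            it = It[by:by + block, bx:bx + block].ravel()
            x, y = xs.ravel(), ys.ravel()
            # Err(a) = sum [ Ix(a1 + a2 x + a3 y) + Iy(a4 + a5 x + a6 y) + It ]^2
            A = np.stack([ix, ix * x, ix * y, iy, iy * x, iy * y], axis=1)
            a, *_ = np.linalg.lstsq(A, -it, rcond=None)
            params.append(a)
    return np.array(params)

def group_into_layers(params, num_layers=3):
    """Cluster blocks with similar affine motion parameters into layers."""
    return KMeans(n_clusters=num_layers, n_init=10).fit_predict(params)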
Tracking

Introduction: What is tracking?

Definition
Visual tracking is the process of locating a moving object (or multiple
objects) over time in a sequence.

Objective
The objective of tracking is to associate target objects and estimate
target state over time in consecutive video frames.

Applications
Tracking has a variety of applications, some of which are:

• Human-computer interaction: eye tracking systems (Chandra et al., 2015)

• Security and surveillance (Hampapur et al., 2005)

• Augmented reality (Comport et al., 2006)

• Traffic control: road transportation systems (Rostamianfar et al., 2006)

• Medical imaging (Mountney et al., 2010)

Challenges

Tracking can be a time-consuming process due to the sheer amount of
data in video. In addition, tracking relies on object recognition
algorithms, which can become more challenging and prone to failure for
the following reasons:

• Variations due to geometric changes, like the scale of the tracked
object

• Changes due to illumination and other photometric aspects

• Occlusions in the image frame

• Motion that is non-linear

• Blurry or low-resolution videos might make the recognition fail

• Similar objects in the scene

Feature Tracking

Definition
Feature tracking is the detection of visual feature points (corners,
textured areas, ...) and tracking them over a sequence of frames
(images).

Challenges of feature tracking


• Figure out which features can be tracked

• Efficiently track across frames: some points may change
appearance over time (e.g., due to rotation, moving into shadows,
etc.)

• Drift: small errors can accumulate as appearance model is updated

• Points may appear or disappear: need to be able to add/delete


tracked points

What are good features to track?


What kinds of image regions can we detect easily and consistently?
We need a way to measure the "quality" of features from
just a single image. Intuitively, we want to avoid smooth regions
and edges. A solution for such a problem is to use Harris Corners.
Detecting Harris corners as our key points to track guarantees small
error sensitivity!
Once we detect the features we want to track, we can use optical
flow algorithms to solve our motion estimation problem and track
our features.

Example

Tracking methods
Simple Kanade–Lucas–Tomasi feature tracker The Kanade–Lucas–Tomasi
(KLT) feature tracker is an approach to feature extraction. KLT makes
use of spatial intensity information to direct the
search for the position that yields the best match. Its algorithm is:
Figure 98: Courtesy of Jean-Yves Bouguet, Vision Lab, California
Institute of Technology

1. Find a good point to track (a Harris corner)

• Harris corner points have sufficiently large eigenvalues, so the


optical flow equation is solvable.

2. For each Harris corner compute motion (translation or affine)


between consecutive frames.

3. Link motion vectors in successive frames to get a track for each


Harris point

• If the patch around the new point differs sufficiently from the
old point, we discard these points.

4. Introduce new Harris points by applying the Harris detector every
10-15 frames

5. Track new and old Harris points using steps 2-3
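A minimal OpenCV sketch of this loop is shown below, using
goodFeaturesToTrack with useHarrisDetector=True for step 1 and pyramidal
Lucas-Kanade (calcOpticalFlowPyrLK) for steps 2-3; the video filename, window
size, and re-detection interval are illustrative choices.

import cv2

cap = cv2.VideoCapture("input_video.mp4")   # illustrative filename
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

def detect_corners(gray):
    # Step 1: find good points to track (Harris corners).
    return cv2.goodFeaturesToTrack(gray, maxCorners=300, qualityLevel=0.01,
                                   minDistance=7, useHarrisDetector=True, k=0.04)

pts = detect_corners(prev_gray)
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Steps 2-3: compute each corner's motion between consecutive frames
    # and keep only the points whose patches were matched successfully.
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, pts, None, winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    for (x0, y0), (x1, y1) in zip(pts[good].reshape(-1, 2),
                                  new_pts[good].reshape(-1, 2)):
        cv2.arrowedLine(frame, (int(x0), int(y0)), (int(x1), int(y1)),
                        (0, 255, 0), 1)
    pts = new_pts[good].reshape(-1, 1, 2)

    # Steps 4-5: every 10-15 frames, re-run the Harris detector so new
    # points are introduced and lost ones are replaced, then keep tracking.
    frame_idx += 1
    if frame_idx % 10 == 0 or len(pts) == 0:
        pts = detect_corners(gray)

    prev_gray = gray
    cv2.imwrite("tracked_%04d.png" % frame_idx, frame)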

In the following frames from tracking videos, arrows represent the
tracking motion of the Harris corners.

Figure 99: Frames from fish-tracking video, courtesy of Kanade

2D Transformations

Types of 2D Transformations
There are several types of 2D transformations. Choosing the correct
2D transformations can depend on the camera (e.g. placement, move-
ment, and viewpoint) and objects. A number of 2D transformations
are shown in Figure 101. Examples of 2D transformations include:
Figure 100: Frames from man-tracking video, courtesy of Kanade

Figure 101: Types of 2D transformations.

• Translation Transformation. (e.g. Fixed overhead cameras)

• Similarity Transformation. (e.g. Fixed cameras of a basketball


game)

• Affine Transformation. (e.g. People in pedestrian detection)

• Projective Transformation. (e.g. Moving cameras)

In this section we will cover three common transformations: transla-


tion, similarity, and affine.

Translation
Translational motion is the motion by which a body shifts from one
point in space to another. Assume we have a simple point m with
coordinates ( x, y). Applying a translation motion on m shifts it from
(x, y) to (x′, y′) where
$$\begin{aligned} x' &= x + b_1 \\ y' &= y + b_2 \end{aligned} \tag{17}$$
We can write this as a matrix transformation using homogeneous
coordinates:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{18}$$
Let W be the above transformation defined as:
$$W(x; p) = \begin{pmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \end{pmatrix} \tag{19}$$
where the parameter vector is $p = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$.

Figure 102: Translation

Taking the partial derivative of W with respect to p we get:


$$\frac{\partial W}{\partial p}(x; p) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \tag{20}$$

This is called the Jacobian.

Similarity Motion

Similarity motion is a rigid motion that includes scaling and transla-


tion.
We can define the similarity as:

$$\begin{aligned} x' &= a x + b_1 \\ y' &= a y + b_2 \end{aligned} \tag{21}$$

The similarity transformation matrix W and parameters p are


defined as the following:
$$W = \begin{pmatrix} a & 0 & b_1 \\ 0 & a & b_2 \end{pmatrix} \qquad p = \begin{pmatrix} a & b_1 & b_2 \end{pmatrix}^T \tag{22}$$
The Jacobian of the similarity transformation is then:

$$\frac{\partial W}{\partial p}(x; p) = \begin{pmatrix} x & 1 & 0 \\ y & 0 & 1 \end{pmatrix} \tag{23}$$

Affine motion
Affine motion includes scaling, rotation, and translation. We can
express this as the following:

$$\begin{aligned} x' &= a_1 x + a_2 y + b_1 \\ y' &= a_3 x + a_4 y + b_2 \end{aligned} \tag{24}$$

The affine transformation can be described with the following


transformation matrix W and parameters p:
$$W = \begin{pmatrix} a_1 & a_2 & b_1 \\ a_3 & a_4 & b_2 \end{pmatrix} \qquad p = \begin{pmatrix} a_1 & a_2 & b_1 & a_3 & a_4 & b_2 \end{pmatrix}^T \tag{25}$$

Finally, the Jacobian for affine motion is the following:


$$\frac{\partial W}{\partial p}(x; p) = \begin{pmatrix} x & y & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & y & 1 \end{pmatrix} \tag{26}$$
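The small sketch below applies these warps to a point using homogeneous
coordinates and evaluates the affine Jacobian of Eq. (26). The helper names
and the example parameter values are made up for illustration; translation
and the scaled-translation similarity above are treated as special cases of
the affine parameters.

import numpy as np

def warp_affine_point(x, y, p):
    """Apply W(x; p) for the affine case, p = (a1, a2, b1, a3, a4, b2)."""
    a1, a2, b1, a3, a4, b2 = p
    W = np.array([[a1, a2, b1],
                  [a3, a4, b2]])
    return W @ np.array([x, y, 1.0])        # homogeneous coordinates

def affine_jacobian(x, y):
    """dW/dp evaluated at (x, y), matching Eq. (26)."""
    return np.array([[x, y, 1, 0, 0, 0],
                     [0, 0, 0, x, y, 1]], dtype=float)

def translation_params(b1, b2):    # p = (b1, b2) as an affine parameter vector
    return (1.0, 0.0, b1, 0.0, 1.0, b2)

def similarity_params(a, b1, b2):  # p = (a, b1, b2) as an affine parameter vector
    return (a, 0.0, b1, 0.0, a, b2)

print(warp_affine_point(2.0, 3.0, translation_params(5.0, -1.0)))     # [7. 2.]
print(warp_affine_point(2.0, 3.0, similarity_params(2.0, 0.0, 0.0)))  # [4. 6.]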

Iterative KLT tracker

Problem formulation
Given a video sequence, find the sequence of transforms that maps
each frame to the next frame. The method should be able to deal with
arbitrary types of motion, including object motion and camera/perspective
motion.

Approach
This approach differs from the simple KLT tracker by the way it links
frames: instead of using optical-flow to link motion vectors and track
motion, we directly solve for the relevant transforms using feature
data and linear approximations. This allows us to deal with more
complex (such as affine and projective) transforms and link objects
more robustly.
Steps:

1. First, use Harris corner detection to find the features to be tracked.



2. For each feature at location x = [ x, y] T : Choose a feature descrip-


tor and use it to create an initial template for that feature (likely
using nearby pixels): T ( x ).

3. Solve for the transform p that minimizes the error of the feature
description around x2 = W ( x; p) (your hypothesis for where the
feature's new location is) in the next frame. In other words, find the
p that minimizes
$$\sum_x \left[ T(W(x; p)) - T(x) \right]^2$$

4. Iteratively reapply this to link frames together, storing the coordi-


nates of the features as the transforms are continuously applied.
This should give you a measure of how objects move through
frames.

5. Just as before, every 10-15 frames introduce new Harris corners to


account for occlusion and "lost" features.

Math
We can in fact analytically derive an approximation method for
finding p (in Step 3). Assume that you have an initial guess for p, p0 ,
and p = p0 + ∆p.
Now,

$$E = \sum_x \left[ T(W(x; p)) - T(x) \right]^2 = \sum_x \left[ T(W(x; p_0 + \Delta p)) - T(x) \right]^2$$

But using the Taylor approximation, we see that this error term is
roughly equal to:

$$E \approx \sum_x \left[ T(W(x; p_0)) + \nabla T \frac{\partial W}{\partial p} \Delta p - T(x) \right]^2$$
To minimize this term, we take the derivative with respect to ∆p,
set it equal to 0, and solve for ∆p:

$$\frac{\partial E}{\partial \Delta p} \approx \sum_x \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ T(W(x; p_0)) + \nabla T \frac{\partial W}{\partial p} \Delta p - T(x) \right] = 0$$

$$\Delta p = H^{-1} \sum_x \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ T(x) - T(W(x; p_0)) \right]$$

where $H = \sum_x \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ \nabla T \frac{\partial W}{\partial p} \right]$.
By iteratively setting p0 = p0 + ∆p, we can eventually converge
on an accurate, error-minimizing value of p, which tells us what the
transform is.
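For the translation case, where ∂W/∂p is the identity, one such Gauss-Newton
update takes a particularly simple form, and the NumPy sketch below iterates
it. Here the second frame is sampled at the warped positions and plays the
role of T(W(x; p)) in the formula above; nearest-neighbor sampling, the fixed
iteration count, and the stopping threshold are simplifications for brevity.

import numpy as np

def align_translation(template, image, p0=(0.0, 0.0), num_iters=50, eps=1e-3):
    """Estimate a translation p aligning `template` to a patch of `image`.

    Minimizes sum_x [ I(x + p) - T(x) ]^2 with updates
    delta_p = H^{-1} sum_x [grad I(x + p)]^T [ T(x) - I(x + p) ].
    """
    th, tw = template.shape
    ys, xs = np.mgrid[0:th, 0:tw]
    p = np.array(p0, dtype=float)
    T = template.astype(float)
    I = image.astype(float)
    gy, gx = np.gradient(I)

    for _ in range(num_iters):
        # Sample the image and its gradients at the warped positions x + p.
        wx = np.clip(np.round(xs + p[0]).astype(int), 0, I.shape[1] - 1)
        wy = np.clip(np.round(ys + p[1]).astype(int), 0, I.shape[0] - 1)
        error = (T - I[wy, wx]).ravel()

        J = np.stack([gx[wy, wx].ravel(), gy[wy, wx].ravel()], axis=1)
        H = J.T @ J                      # 2 x 2; assumed invertible (a corner)
        delta_p = np.linalg.solve(H, J.T @ error)
        p += delta_p
        if np.linalg.norm(delta_p) < eps:
            break
    return p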
Link to Harris Corner Detection

For translation motion,
$$\frac{\partial W}{\partial p}(x; p) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
so it is easy to show that
$$H = \begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix}$$

However, the Harris corner detector assumes that whenever H has
large eigenvalues (i.e., it is stably invertible), then the point within
the window is a corner. Thus, corners tend to be good features for
computing translations, precisely because the resulting matrices
produced by corners are stably invertible.
Deep Learning
Bibliography

http://www.ams.org/samplings/feature-column/fcarc-svd.

https://commons.wikimedia.org/wiki/file:broadway_tower_edit.jpg.
a.

https://commons.wikimedia.org/wiki/file:broadway_tower_edit.jpg.
b.

https://www.weirdoptics.com/hidden-lion-visual-optical-illusion.

OpenCV 3.1.0 Documentation: https://docs.opencv.org/3.1.0/. URL
https://docs.opencv.org/3.1.0/sift_scale_invariant.jpg.

Snasel V. Abdulla H.D. Search result clustering using a singular value


decomposition (svd). 2009.

Shai Avidan and Ariel Shamir. Seam carving for content-aware image
resizing. In ACM Transactions on graphics (TOG), volume 26, page 10.
ACM, 2007.

S. Chandra, G. Sharma, S. Malhotra, D. Jha, and A. P. Mittal. Eye


tracking based human computer interaction: Applications and their
uses. pages 1–5, Dec 2015.

A. I. Comport, E. Marchand, M. Pressigout, and F. Chaumette. Real-


time markerless tracking for augmented reality: the virtual visual
servoing framework. IEEE Transactions on Visualization and Com-
puter Graphics, 12(4):615–628, July 2006. ISSN 1077-2626. doi:
10.1109/TVCG.2006.78.

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and


A. Zisserman. The PASCAL Visual Object Classes Chal-
lenge 2012 (VOC2012) Results. http://www.pascal-
network.org/challenges/VOC/voc2012/workshop/index.html.

Martin A Fischler and Robert C Bolles. Random sample consensus: a


paradigm for model fitting with applications to image analysis and
automated cartography. Communications of the ACM, 24(6):381–395,
1981.

David Forsyth and Jean Ponce. Computer vision: a modern approach.


Upper Saddle River, NJ; London: Prentice Hall, 2011.

Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and


Steven M Seitz. Multi-view stereo for community photo collec-
tions. In Computer Vision, 2007. ICCV 2007. IEEE 11th International
Conference on, pages 1–8. IEEE, 2007.

A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu,


H. Merkl, and S. Pankanti. Smart video surveillance: exploring
the concept of multiscale spatiotemporal tracking. IEEE Signal
Processing Magazine, 22(2):38–51, March 2005. ISSN 1053-5888. doi:
10.1109/MSP.2005.1406476.

Anders Heyden and Marc Pollefeys. Multiple view geometry. Emerging


topics in computer vision, pages 45–107, 2005.

David H Hubel and Torsten N Wiesel. Receptive fields, binocular


interaction and functional architecture in the cat’s visual cortex. The
Journal of physiology, 160(1):106–154, 1962.

DH Hubel and TN Wiesel. Receptive fields of optic nerve fibres in the


spider monkey. The Journal of physiology, 154(3):572–580, 1960.

Juan Carlos Niebles and Ranjay Krishna. Lecture: Visual bag of words.


http://vision.stanford.edu/teaching/cs131_fall1718/files/
14_BoW_bayes.pdf.

Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba.


Learning to predict where humans look. In Computer Vision, 2009
IEEE 12th international conference on, pages 2106–2113. IEEE, 2009.

Bruce D Lucas, Takeo Kanade, et al. An iterative image registration


technique with an application to stereo vision. 1981.

R. MacAusland. The moore-penrose inverse and least squares. 2014.

Tomas Mardones. How should we understand the Spa-


tial Pyramid Matching? https://www.quora.com/
How-should-we-understand-the-Spatial-Pyramid-Matching.
[Online; accessed 15-Nov-2017].

P. Mountney, D. Stoyanov, and G. Z. Yang. Three-dimensional tissue


deformation recovery and tracking. IEEE Signal Processing Magazine,
27(4):14–24, July 2010. ISSN 1053-5888.

P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. fisherfaces:


Recognition using class specific linear projection. IEEE Transactions
on pattern analysis and machine intelligence, 19(7):711–720, 1997.

Stephen E Palmer. Vision science: Photons to phenomenology. MIT press,


1999.

Seymour A Papert. The summer vision project. 1966.

Simon JD Prince. Computer vision: models, learning, and inference.


Cambridge University Press, 2012.

Xiaofeng Ren and Jitendra Malik. Learning a classification model for


segmentation. In null, page 10. IEEE, 2003.

Ronald A Rensink, J Kevin O’Regan, and James J Clark. On the


failure to detect changes in scenes across brief interruptions. Visual
cognition, 7(1-3):127–145, 2000.

O. Rostamianfar, F. Janabi-Sharifi, and I. Hassanzadeh. Visual tracking


system for dense traffic intersections. In 2006 Canadian Conference
on Electrical and Computer Engineering, pages 2000–2004, May 2006.
doi: 10.1109/CCECE.2006.277838.

Michael Rubinstein, Ariel Shamir, and Shai Avidan. Improved seam


carving for video retargeting. In ACM transactions on graphics (TOG),
volume 27, page 16. ACM, 2008.

Ashutosh Saxena, Justin Driemeyer, and Andrew Y Ng. Robotic


grasping of novel objects using vision. The International Journal of
Robotics Research, 27(2):157–173, 2008.

Ariel Shamir Shai Avidan. Seam carving for content-aware image


resizing. ACM Trans. Graph, 26(3):10, 2007.

Noah Snavely Song Cao. Learning to match images in large-scale


collections.

Simon Thorpe, Denise Fize, and Catherine Marlot. Speed of processing


in the human visual system. nature, 381(6582):520, 1996.

Dirk B. Walther, Barry Chai, Eamon Caddigan, Diane M. Beck, and


Li Fei-Fei. Simple line drawings suffice for functional mri decoding
of natural scene categories. Proceedings of the National Academy of
Sciences, 108(23):9661–9666, 2011.

Brian A Wandell. Foundations of vision. Sinauer Associates, 1995.

Wikipedia. Pyramid (image processing). https://en.wikipedia.


org/wiki/Pyramid_(image_processing). [Online; accessed 15-Nov-
2017].
