
Distance-Preserving Dimensionality Reduction (Wiley Interdisciplinary Reviews - Data Mining and Knowledge Discovery, Vol. 1, Issue 5) (2011)


Overview

Distance-preserving dimensionality reduction
Li Yang
Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
Correspondence to: li.yang@wmich.edu

This paper presents an overview of basic concepts and principles that deal with the problem of mapping high-dimensional data to a low-dimensional space such that distances between all or some pairs of data points are preserved. It introduces related techniques and systematizes today's methods into linear methods, methods using iterative optimization, methods preserving exact distances, methods using geodesic distances, and methods using alignments of local models. It discusses these methods by focusing on their basic ideas, by summarizing their common features and differences, and by comparing their strengths and weaknesses. This paper assumes no familiarity with dimensionality reduction. The main text should be readable by people with little technical background. Technical information on important algorithms is briefly presented in sidebars, the reading of which assumes basics in statistics and matrix computation. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1:369–380. DOI: 10.1002/widm.39

INTRODUCTION

Dimensionality reduction is an important step in data preprocessing for data mining and knowledge discovery. The problem is defined as follows: given a set of high-dimensional data points, project them to a low-dimensional space so that the resulting transformed data perform more efficiently and more effectively than the original data in further processing such as classification, clustering, indexing, and searching. Dimensionality reduction contributes to the preprocessing of high-dimensional data in two important ways: first, it gives a low-dimensional representation of high-dimensional data for the purpose of effective and efficient data analysis; second, it provides a way to visualize the data, enables human intervention in the data mining process, and helps to select appropriate data mining techniques and parameters for further processing. Dimensionality reduction is also believed to be fundamental to human perception: an image can be thought of as a point in a high-dimensional space; although the input dimensionality is very high (e.g., 4096 for a small image of size 64 × 64), the perceptually meaningful structure of a sequence of such images probably has far fewer independent degrees of freedom.

In principle, the existence of a data set is independent of the coordinate system within which the data set is represented. As a representation of the data set, data coordinates and data dimensionality do not directly reflect the information the data set conveys. We may visualize the situation with the analogy of a physical structure that has a strut between every pair of points. The structure is rigid and should not be distorted whether the data are represented by high-dimensional or by low-dimensional coordinates. The structure is what we are interested in, while the data coordinates are merely a representation. As measures of dissimilarity between data observations, distances between pairs of data points reflect fundamental information in the data set and should, in principle, be preserved by dimensionality reduction techniques. It is therefore the job of a distance-preserving dimensionality reduction method to represent the data in a low-dimensional space while preserving the distances between pairs of data points.

In reality, it is rarely possible to project a high-dimensional data set to a low-dimensional space without changing any distances. The problem of dimensionality reduction therefore becomes how to project the data into a low-dimensional space with the goal of preserving distances as well as possible. Distance-preserving methods for dimensionality reduction can be roughly classified into linear methods and nonlinear methods. Linear methods refer to processes that derive new dimensions of the data from linear transformations of the original dimensions. A celebrated linear method is principal component analysis (PCA),1 which maps data to a lower-dimensional space such that the variance of the data is maximized.

Nonlinear methods often have the explicit goal of preserving distances. One commonly used idea is to construct a cost function that measures the difference between the distances in the input space and the corresponding distances in the output space, and to project the data to a low-dimensional space by minimizing this cost function. Recently, research on manifold learning for dimensionality reduction has become active. One way to understand this is to assume that the data are located on a manifold in high-dimensional space and that geodesic distance along the manifold provides a better measure of dissimilarity between pairs of data points than Euclidean distance. In the analogy of a folded newspaper, the paper needs to be unfolded to reveal its contents.

The above constitute the major ideas used for distance-preserving dimensionality reduction. This paper introduces related techniques and systematizes the major methods using these ideas. Throughout the paper, we assume that all dimensions (variables, features) are numerical and do not consider the practical issues of how to define distance measures for categorical variables and how to reduce the dimensionality of data with mixed categorical and numerical variables. In addition to using data coordinates as input, algorithms should also be able to directly accept distances as input, in which case all we need to do is find a data configuration in Euclidean space of a fixed dimensionality such that the input distances are best preserved. These algorithms are often in duality with techniques that use data coordinates as input and are also considered dimensionality reduction techniques. They have applications in areas such as psychometrics and perceptual mapping, where the collected data are interpoint distances.

BOX 1: THE INPUT DATA

We are given n data observations {x1, ..., xn}, each a column vector xi = [xi1, ..., xip]^T representing a point in p-dimensional Euclidean space. The data set has a mean μ = E(xi) and a covariance matrix Σ (p × p) = E{(xi − μ)(xi − μ)^T}. For ease of representation, we denote the input data in matrix form as X (n × p) = [x1, ..., xn]^T. Assuming the data are centered, that is, μ = 0, the covariance matrix can be expressed as Σ = (1/n) X^T X. A popular measure of distance between xi and xj is the Minkowski distance, defined as dij = (|xi1 − xj1|^c + ··· + |xip − xjp|^c)^{1/c}, where c is a constant. When c = 2, it is the usual Euclidean distance; when c = 1, it is called the Manhattan or city-block distance.

TRADITIONAL METHODS

The two commonly used strategies for dimensionality reduction are feature selection and feature extraction. Feature selection reduces the dimensionality by selecting a subset of dimensions. Clearly, distance preservation is not an objective of feature selection. Feature extraction goes a step further by deriving new dimensions from the input data through transformations or combinations of the original dimensions. Distance-preserving methods for dimensionality reduction are feature extraction methods. Traditional distance-preserving methods can be classified into two categories: linear methods and nonlinear methods. A linear method is characterized by a projection matrix that linearly transforms input data points to a low-dimensional space. As an example, PCA has the distinction of being the optimal linear transformation such that the projected data in the subspace have the largest variances. From a distance-preservation perspective, it offers the best linear mapping in the sense that the sum of changes of squared distances is minimized. Nonlinear methods are often derived either through extensions of linear methods or by minimizing cost functions defined explicitly in terms of changes of distances. These methods often try to preserve all distances and usually end up with no distance preserved exactly. Nonlinear methods have also been developed to explicitly retain some exact distances.

Linear Methods

Linear methods include PCA; classical multidimensional scaling (classical MDS),2 which takes interpoint distances as input and is equivalent to PCA; higher-order generalizations such as independent component analysis (ICA)3; and random projection, in which the projection matrix is chosen randomly. In various fields, PCA is also known as the Karhunen–Loève transform or the Hotelling transform.

PCA offers an orthogonal linear transformation that projects the data to a low-dimensional space. It seeks to reduce the dimensionality of the data by finding a few orthogonal linear combinations of the original dimensions such that the greatest variance of any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on, as shown in Figure 1. PCA achieves this through eigenvalue decomposition of the covariance matrix. For many data sets, a few principal coordinates explain most of the data variance, so that the remaining coordinates can be disregarded with minimal loss of information.


FIGURE 1 | Principal components of a Gaussian distribution.

Classical MDS is a technique that takes interpoint distances as input and derives data coordinates through eigenvalue decomposition of the inner product matrix (Gram matrix) of all data points. Because distances are invariant to translations, we assume that the data set is centered at the origin in order to derive a unique inner product matrix. The inner product matrix is obtained from the input distances through a bidirectional centering process. Classical MDS is equivalent to PCA; in fact, when Euclidean distances are used as input, PCA and classical MDS produce identical results.

Being based on eigenvalue decomposition of the covariance matrix, PCA is a second-order method that projects the data onto uncorrelated dimensions. Going one step further, ICA4 has the goal of producing statistically independent dimensions. This is a stronger requirement involving higher-order statistics. For the problem to have a solution, the input data set must indeed be generated by a linear combination of independent components. In practice, ICA is usually implemented as an extra unitary transformation after the application of PCA.

Random projection5 is a simple and computationally efficient approximation to the above linear methods. In random projection, the linear transformation is chosen randomly. For distance preservation, we can constrain the projection matrix so that its columns have unit lengths. The key idea of random projection comes from the Johnson–Lindenstrauss Lemma: if data points in a vector space are projected onto a randomly selected subspace of suitably high dimension, the distances between the points are approximately preserved. Strictly speaking, random projection is not a projection because the projection matrix may not be orthogonal. When the dimensionality is high, however, vectors with random directions are likely to be nearly orthogonal to each other, and a random projection approximates an orthogonal projection well.
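The idea is simple enough to sketch in a few lines; the following is our own illustrative code (the function name and the unit-column normalization choice follow the description above, not any particular library).

```python
import numpy as np

def random_projection(X, q, seed=0):
    """Project rows of X (n x p) to q dimensions with a random matrix.

    Sketch of the idea described above: the columns of the random
    matrix are normalized to unit length; they are only approximately
    orthogonal, which is acceptable when p is large.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], q))  # random p x q matrix
    R /= np.linalg.norm(R, axis=0)            # unit-length columns
    return X @ R
```
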
Kernel PCA

Kernel principal component analysis (kernel PCA)6 is a nonlinear extension of PCA that uses a technique called the kernel method. It is equivalent to mapping the data to a very high (up to infinite) dimensional space, namely a reproducing kernel Hilbert space (RKHS), and applying PCA in the RKHS. According to Mercer's theorem, the inner product of two mapped points can be expressed as a kernel function of the two points. Therefore, any algorithm that depends solely on the inner products between pairs of input vectors can be transformed into a kernelized version by using a kernel function to replace the otherwise intractable inner product operation. The kernelized version is equivalent to the algorithm operating in an RKHS. Because the kernel function is used, however, the mapping to the RKHS is never computed explicitly. As we have discussed, classical MDS is such an algorithm: it involves only inner products of the input data. By replacing the inner product with a kernel operation, classical MDS is extended to a nonlinear version, which effectively performs PCA after a nonlinear mapping of the input data to the RKHS. Because of the nonlinear mapping, distance preservation is not an objective of kernel PCA, although PCA offers distance preservation in the RKHS.

Distance Preservation by Iterative Optimization

Being linear methods, PCA and classical MDS preserve distances while reducing data dimensionality under the assumption that the data are distributed on or close to a hyperplane in high-dimensional space. Such an assumption is too restrictive in many applications. Therefore, methods have been developed to generate nonlinear maps with the explicit objective of distance preservation.

Many of these methods belong to an area called multidimensional scaling. Multidimensional scaling originated from early studies in psychometrics, where researchers were interested in giving quantitative scores to subjects and dimensionalities to abstract concepts. It searches for a configuration of data points in a low-dimensional space such that the Euclidean distances between the points in that space match the original dissimilarities as closely as possible. Thus it is most often used to project data when only interpoint dissimilarities are available.


BOX 2: PCA AND CLASSICAL MDS

Assume the input data X is centered at the origin, that is, it has zero mean μ = 0. Its covariance matrix Σ = (1/n) X^T X can be decomposed as Σ = (1/n) V Λ V^T, where Λ = diag(λ1, ..., λp) is a diagonal matrix of the eigenvalues of X^T X = nΣ and V = [v1, ..., vp] is a matrix whose column vectors are the corresponding normalized eigenvectors. X can then be transformed into Y = XV. It is easy to verify that Y has zero mean μY = 0 and a diagonal covariance matrix ΣY = (1/n) Y^T Y = (1/n) Λ. Without loss of generality, we list the eigenvalues in nonincreasing order, that is, λ1 ≥ λ2 ≥ ... ≥ λp, so that the columns of Y have nonincreasing variances. To project the data to a q-dimensional space, we simply keep the first q coordinates, that is, the first q columns of Y, and discard the remaining coordinates. It can be shown that the subspace spanned by the first q eigenvectors has the smallest mean square deviation from X among all subspaces of dimension q. This procedure is well known as principal component analysis.

Classical MDS takes distances between pairs of data points as input and produces identical results to PCA if the input distances are Euclidean distances. Suppose we are given a matrix D (n × n) = {δij²}, where δij² = (xi − xj)^T (xi − xj) is the squared Euclidean distance between xi and xj. Assume that the input data X is centered at the origin, that is, (1/n) e^T X = 0, where e = [1, ..., 1]^T is a column vector of n ones. The inner product matrix (Gram matrix) G (n × n) = X X^T can then be obtained from D through a bidirectional centering process:

G = −(1/2) (I − (1/n) e e^T) D (I − (1/n) e e^T),

where I is the n × n identity matrix. By definition, G is symmetric and positive semidefinite and has rank at most p. In the same way as the decomposition of the covariance matrix used in PCA, G can be decomposed as G = U K U^T, where K = diag(κ1, ..., κp) is a diagonal matrix of eigenvalues of G and U = [u1, ..., up] is a matrix whose column vectors are the corresponding normalized eigenvectors. The input data X can then be recovered as Y = U K^{1/2}. Without loss of generality, we list the eigenvalues of G in nonincreasing order, κ1 ≥ κ2 ≥ ··· ≥ κp. To project the data to a q-dimensional space, we simply keep the first q columns of Y, which are given by [u1, ..., uq] diag(κ1^{1/2}, ..., κq^{1/2}). It can be shown that classical MDS gives an optimal linear solution in the sense that the sum of differences between the resulting squared Euclidean distances and the δij² is minimal. In fact, the squared distance dij² between xi and xj after the projection can be expressed as dij² = Σ_{l=1}^{p} κl (uli − ulj)², where uli and ulj are the ith and jth coordinates of the eigenvector ul, respectively. If the eigenvalues κ_{q+1}, ..., κ_p are very small, their contributions to dij² can be neglected. If the input data are indeed distributed on a q-dimensional hyperplane, in particular, κ_{q+1}, ..., κ_n are zero and the input distances are preserved exactly.

Classical MDS is equivalent to PCA and produces identical results when the input distances are Euclidean distances. This is because the nonzero eigenvalues of X X^T are the same as the nonzero eigenvalues of X^T X. In fact, if λ is an eigenvalue of X^T X and v is the corresponding eigenvector, that is, X^T X v = λv, then X X^T (Xv) = λ(Xv), which means that λ is also an eigenvalue of X X^T and Xv is the corresponding eigenvector. Therefore, nΣ and G share the same set of nonzero eigenvalues, that is, Λ = K. U can be obtained by normalizing the columns of XV, that is, U = X V Λ^{−1/2}. Conversely, V can be obtained by normalizing the columns of X^T U, that is, V = X^T U K^{−1/2}. In fact, we have Y = X V = U K^{1/2}.

ICA goes one step further than PCA and demands that features be statistically independent. In addition to the zero cross-correlation achieved by PCA, this demand is equivalent to the requirement that all higher-order cross-cumulants also be zero. In real-world applications, it is often sufficient to check up to the fourth-order cumulants. We may further assume that the data distribution is symmetric, making all odd-order cumulants zero. Therefore, a common method for ICA has two steps: the first step performs PCA on the input data; the second step computes a unitary transformation matrix so that the fourth-order cross-cumulants of the transformed data are close to zero. This problem amounts to diagonalizing a four-dimensional array of fourth-order cumulants. It is equivalent to an optimization problem that searches for a unitary transformation maximizing the sum of squares of the fourth-order auto-cumulants.
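To make the double-centering recipe in Box 2 concrete, here is a brief NumPy sketch (our own illustration) that recovers a point configuration from a matrix of squared Euclidean distances; names and defaults are ours.

```python
import numpy as np

def classical_mds(D2, q):
    """Classical MDS: embed points given squared distances D2 (n x n).

    Follows the steps in Box 2: double-center -D2/2 to obtain the Gram
    matrix, then keep the top-q eigenpairs. Sketch only.
    """
    n = len(D2)
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ D2 @ J                  # Gram matrix via double centering
    w, U = np.linalg.eigh(G)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:q]          # q largest eigenvalues
    return U[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # Y = U K^{1/2}
```

When D2 holds exact squared Euclidean distances of q-dimensional data, the embedding reproduces the original configuration up to rotation and translation.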

When the data coordinates are available, multidimensional scaling can also be used by calculating the interpoint distances and creating a low-dimensional configuration of points that best preserves them.

To minimize distance changes, a simple idea is to define an error measure of distance changes and then design a procedure that minimizes this measure. For the purpose of distance preservation, the error measure is usually defined as a weighted sum of differences between distances in the input space and the corresponding distances in the output space. Because of the complexity of the error measure, there is usually no closed-form solution to the minimization problem. A method using this idea therefore employs an iterative procedure to minimize the error measure. With regard to such a method, one has to decide how to define the error measure and how to minimize it. The basic questions are which distances to preserve and how techniques and algorithms work to preserve them.


Example algorithms include Kruskal's metric multidimensional scaling (MDS),7 Sammon's nonlinear mapping (NLM),8 and curvilinear component analysis (CCA).9 These algorithms work in similar ways. Each algorithm starts with an estimated configuration of points in the output space. It uses an error measure defined as a function of the differences between input distances and the corresponding output distances. Several optimization techniques can be used to iteratively reconfigure the coordinates of points so as to minimize the error measure; these include the gradient descent algorithm and its improvements such as Newton's algorithm and the conjugate-gradient algorithm. They are all iterative refinement algorithms. Each algorithm usually stops when the error measure falls below a user-defined threshold or the number of iterations exceeds a user-specified limit. The difference between them lies in the amount of adjustment made in each iteration and thus in the speed of convergence to the final result. These iterative optimization techniques share a well-known deficiency: they may stop at local optima. To find the global optimum, they are often combined with simulated annealing or genetic algorithms to escape from local optima. In many cases, they produce better preservation of distances (especially short distances) than linear methods.

BOX 3: KERNEL PCA

Given a data set X = [x1, ..., xn]^T, for each xi we make an implicit mapping φ(xi) into an RKHS. Denote Φ(X) = [φ(x1), ..., φ(xn)]^T. Following classical MDS, the inner product matrix Φ(X) Φ(X)^T in the RKHS can be decomposed as Φ(X) Φ(X)^T = U K U^T, where K = diag(κ1, ..., κn) is a diagonal matrix of eigenvalues and U = [u1, ..., un] is a matrix whose column vectors are the corresponding normalized eigenvectors. A nonlinear configuration of X is then given by Y = U K^{1/2}. Without loss of generality, let κ1 ≥ κ2 ≥ ... ≥ κn and let κ1, ..., κq be the first q dominant eigenvalues; a q-dimensional configuration of X is given by [u1, ..., uq] diag(κ1^{1/2}, ..., κq^{1/2}). Furthermore, let the covariance matrix Σ = (1/n) Φ(X)^T Φ(X) be decomposed into V Λ V^T in the RKHS. From our previous discussion of the equivalence between PCA and classical MDS, V can be expressed as V = Φ(X)^T U K^{−1/2}. Therefore, an arbitrary input data point x is projected to

V^T φ(x) = K^{−1/2} U^T Φ(X) φ(x) = K^{−1/2} U^T [φ(x1)^T φ(x), ..., φ(xn)^T φ(x)]^T.

Note that φ(·) appears only in pairs, so only inner products of the mapped data points are involved. By Mercer's theorem, φ(x)^T φ(xi) can be replaced by a kernel function K(x, xi). In this way, the inner product matrix Φ(X) Φ(X)^T and the projection of an arbitrary input data point can be computed without explicitly calculating φ(x).
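A rough numerical sketch of Box 3 follows (our own code, assuming a Gaussian/RBF kernel, which the box itself does not prescribe): build the kernel matrix, center it, and keep the leading eigenpairs.

```python
import numpy as np

def kernel_pca(X, q, gamma=1.0):
    """Kernel PCA sketch with an RBF kernel K(x, y) = exp(-gamma ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # kernel (Gram) matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                               # center the mapped data in the RKHS
    w, U = np.linalg.eigh(Kc)                    # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:q]
    return U[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # top-q eigenpairs, as in Box 3
```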
Among the distances to be preserved from a
point, one could select the distance to the nearest
neighbor among the mapped points. This can be done
to this method, one has to make decisions on how by constructing a minimal spanning tree, which is a
to define the error measure and how to minimize it. spanning tree that connects all the points and whose
The basic questions are which distances to preserve total length is a minimum. The other distances to
and how techniques and algorithms work to preserve preserve can be chosen from the distances to the sec-
them. ond/third nearest neighbors and/or distances to some
Example algorithms include Kruskal’s metric reference points. These choices provide us a few al-
multidimensional scaling (MDS),7 Sammon’s non- ternatives to display data.
linear mapping (NLM),8 and curvilinear component As an example, Figure 2 shows snapshots of
analysis (CCA).9 These algorithms work in similar three-dimensional visualization of the United Na-
ways: Each algorithm starts with an estimated con- tion statistical data of its member states. The input
figuration of points in the output space. It uses an data set comes from the United Nations InfoNation
error measure that is defined as a function of differ- Database and has eight dimensions: population, GDP
ences between input distances and the corresponding per capita, life expectancy, illiteracy rate, spending
output distances. Several optimization techniques can on education, and numbers of TV sets, phones, and
be used to iteratively reconfigure the coordinates of newspaper circulations per 1000 inhabitants. Only
points to minimize the error measure. These tech- 24 states whose populations exceed 50 million are
niques include the gradient descent algorithm and considered. Again, Euclidean distance was used. In
its improvements such as Newton’s algorithm and Figure 2(a), the initial point was chosen to be USA
conjugate-gradient algorithm. They are all iterative and each data point is projected so that distances to
refinement algorithms. Each algorithm usually stops its three nearest neighbors are preserved. It looks that
when the error measure falls below a user-defined the distance between USA and China is the great-
threshold or the number of iterations exceeds a user- est. However, this is an illusion because global dis-
specified limit. The difference between them is the tances are not preserved at all. Figure 2(b) shows the


As an example, Figure 2 shows snapshots of a three-dimensional visualization of United Nations statistical data on its member states. The input data set comes from the United Nations InfoNation database and has eight dimensions: population, GDP per capita, life expectancy, illiteracy rate, spending on education, and the numbers of TV sets, phones, and newspaper circulations per 1000 inhabitants. Only the 24 states whose populations exceed 50 million are considered. Again, Euclidean distance was used. In Figure 2(a), the initial point was chosen to be the USA, and each data point is projected so that the distances to its three nearest neighbors are preserved. It looks as though the distance between the USA and China is the greatest. However, this is an illusion, because global distances are not preserved at all. Figure 2(b) shows the result of using the USA and China as two reference points and gives a global view of the distances to them from all other states: each point preserves the distance to its mapped nearest neighbor and the distances to the points representing the USA and China. One can thus use this map to analyze the dissimilarities between the USA, China, and the other countries.

FIGURE 2 | Plots of the UN statistical data using (a) nearest neighbors; (b) two reference points.
USING GEODESIC DISTANCES

In many applications, nonlinear procedures produce better data projections. One way to understand this is to assume that real-world data sets are usually located on a q-dimensional, possibly nonlinear, manifold in p-dimensional feature space (q < p), in which case Euclidean distance is not a good measure of dissimilarity between a pair of data points. A significant recent development in dimensionality reduction is the use of geodesic distances. Intuitively, the geodesic distance between a pair of points on a manifold is the distance measured along the manifold. It provides the ground truth of the dissimilarity between a pair of data points. Therefore, distance-preserving techniques for dimensionality reduction should take geodesic distances as input and project the data to a q-dimensional space such that the geodesic distances are preserved by the Euclidean distances in that space.

In implementation, the geodesic distance between a pair of data points is usually estimated by the length of the shortest path between the pair on a neighborhood graph that connects every point to its neighboring points. As the number of data points increases, the length of the shortest path is expected to converge asymptotically to the true geodesic distance. A typical algorithm using geodesic distances consists of three steps: the first step constructs a neighborhood graph that spans all data points by connecting neighboring points; the second step estimates the geodesic distance between every pair of data points by calculating the length of the shortest path between the corresponding pair of vertices on the neighborhood graph, usually by applying Dijkstra's or Floyd's algorithm; the third step applies a traditional algorithm to map the data to a low-dimensional space such that the estimated geodesic distances are preserved by the Euclidean distances in that space. These three steps are illustrated in Figure 3. Example algorithms include Isomap,13 GeoNLM,14 and curvilinear distance analysis (CDA).15 They use the same approach to estimate geodesic distances; the differences among them reside in the third step: Isomap uses classical MDS, GeoNLM uses NLM, and CDA uses CCA to derive the final data configuration.

As long as good estimates of the geodesic distances are obtained, any traditional distance-preserving algorithm will perform well in projecting the data while preserving them. The success of these algorithms therefore depends on the first step, that is, on building a quality neighborhood graph from which geodesic distances can be estimated well. There are two simple ways to define whether two points are neighbors and are thus connected by an edge of the neighborhood graph: the first, the k-nearest-neighbor (k-NN) approach, defines two points as neighbors if one is among the k nearest neighbors of the other; the second (ε-neighbor) approach defines two points as neighbors if the distance between them is smaller than a user-defined threshold ε.
FIGURE 3 | Three steps for dimensionality reduction using geodesic distances: neighborhood graph construction (Euclidean distances → neighborhood graph), Dijkstra's or Floyd's algorithm (neighborhood graph → graph distances), and distance-preserving data embedding (graph distances → low-dimensional embedding).
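The three-step pipeline can be outlined in a few lines of code. The sketch below is our own illustration (an Isomap-like pipeline using a k-NN graph, SciPy's shortest-path routine, and classical MDS); it assumes the neighborhood graph is connected.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def geodesic_embedding(X, q, k=5):
    """Sketch of the three-step pipeline: k-NN graph, shortest-path
    (graph) distances, then classical MDS on the graph distances."""
    D = squareform(pdist(X))                    # pairwise Euclidean distances
    n = len(X)
    W = np.full((n, n), np.inf)                 # inf marks "no edge"
    nn = np.argsort(D, axis=1)[:, 1:k + 1]      # indices of the k nearest neighbors
    rows = np.repeat(np.arange(n), k)
    W[rows, nn.ravel()] = D[rows, nn.ravel()]
    W = np.minimum(W, W.T)                      # symmetrize the neighborhood graph
    G = shortest_path(W, method='D')            # Dijkstra's algorithm

    # Step 3: classical MDS on the squared graph distances (see Box 2).
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G**2) @ J
    evals, U = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:q]
    return U[:, idx] * np.sqrt(np.maximum(evals[idx], 0))
```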


BOX 4: DISTANCE PRESERVATION VIA ITERATIVE OPTIMIZATION

Let δij denote the original distance between xi and xj and let dij denote the corresponding distance after the mapping. An error measure can be defined to quantify the change of distances. In fact, classical MDS is an approach that minimizes the error measure Σ_{i<j} (δij² − dij²). It belongs to a group of methods for metric MDS, where each method defines its own error measure. Many error measures take the form E = Σ_{i<j} wij (δij − dij)² with various weights wij. A fundamental difference between E and the error measure of classical MDS is that E is no longer a quadratic form of the underlying xi and cannot be minimized by solving an eigenvalue problem. In general, we cannot obtain a closed-form solution and have to turn to iterative procedures to minimize E.

In Kruskal's MDS, one definition (normalized stress) of E is given as the normalized sum of squared differences of distances:

E_MDS = (1 / Σ_{i<j} δij²) Σ_{i<j} (δij − dij)².

Sammon's NLM defines E as

E_NLM = (1 / Σ_{i<j} δij) Σ_{i<j} (δij − dij)² / δij,

which is commonly referred to as Sammon's stress. The difference between E_MDS and E_NLM is that each term of E_NLM is normalized by δij. While changes of long distances dominate E_MDS, changes of shorter distances dominate E_NLM. In NLM, short distances thus have a higher priority to be preserved than long distances, which is the reason why NLM has the effect of unfolding a data manifold. Niemann and Weiss10 have further given a unified error measure, (1 / Σ_{i<j} δij^{c+2}) Σ_{i<j} (δij − dij)² δij^c, and a way to improve the convergence of NLM. This error measure becomes E_NLM when c = −1 and becomes E_MDS when c = 0. A value c < 0 favors the preservation of short distances; a value c ≥ 0 favors the preservation of long distances. To unfold strongly twisted patterns, CCA goes one step further by completely ignoring the preservation of long distances. Its error measure is defined as E_CCA = Σ_{i<j} (δij − dij)² F(dij, λ), where F(dij, λ) is a weight function with a user-defined parameter λ. F has to be a monotonically decreasing function of dij for CCA to favor the preservation of short distances. Usually, F is defined as a binary gate function that makes CCA ignore distances longer than λ entirely. CCA has been reported to successfully unfold highly twisted data manifolds. It also employs a new procedure for fast minimization and for escaping from local minima of E_CCA.

An error measure E is often minimized iteratively. Let Y denote the low-dimensional projection of X. The iteration starts from an initial estimate of Y, and each iteration has the form Y(new) = Y(old) + ΔY, where ΔY denotes the adjustment made in that step. In the gradient descent algorithm, ΔY is calculated as ΔY = −μ ∂E/∂Y evaluated at Y = Y(old), where μ > 0 is a user-defined constant parameter. In Newton's algorithm, μ is replaced by the inverse of the Hessian matrix of E computed at Y(old). Newton's algorithm requires more computation per iteration but converges faster than gradient descent.
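As a rough illustration of the iterative schemes above, the following sketch (our own simplified code, not any of the cited implementations) minimizes Sammon's stress E_NLM by plain gradient descent from a random initial configuration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sammon_sketch(X, q=2, iters=500, mu=0.1, seed=0):
    """Minimize Sammon's stress E_NLM (Box 4) by plain gradient descent.

    Illustrative only: a fixed step size mu is used, whereas practical
    implementations use Newton-style updates or a line search, and the
    configuration is often initialized with PCA rather than at random.
    Assumes no two input points coincide (all delta_ij > 0).
    """
    delta = squareform(pdist(X))                 # input distances delta_ij
    n = len(X)
    c = delta[np.triu_indices(n, 1)].sum()       # normalizing constant
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, q)) * 1e-2       # initial configuration
    eye = np.eye(n)
    for _ in range(iters):
        d = squareform(pdist(Y)) + eye           # +eye avoids 0/0 on the diagonal
        ratio = (delta - d + eye) / (d * (delta + eye))
        np.fill_diagonal(ratio, 0.0)             # drop self terms
        # gradient of E_NLM w.r.t. y_i: -(2/c) * sum_j ratio_ij * (y_i - y_j)
        diff = Y[:, None, :] - Y[None, :, :]
        grad = -(2.0 / c) * (ratio[:, :, None] * diff).sum(axis=1)
        Y -= mu * grad                           # gradient descent step
    return Y
```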

For illustration, Figure 4(a) shows a synthetic data set of 1000 points randomly distributed on a 4 × 1 rectangle, which is then wrapped into a three-dimensional Swiss roll. Figure 4(b) shows its 5-NN (k = 5) neighborhood graph superimposed on the data, and Figure 4(c) displays the neighborhood graph unwrapped in two dimensions. Figures 4(b) and (c) illustrate the shortest path between an example pair of data points A and B; the geodesic distance between A and B is estimated as the length of this path. Applying distance-preserving methods (such as classical MDS) to the estimated geodesic distances will unfold the data set.

Both the k-NN and the ε-neighbor approaches have problems in constructing neighborhood graphs for geodesic distance estimation. Neither approach guarantees the connectivity of the constructed neighborhood graph, especially when the data are not evenly distributed. How to choose a proper value of k or ε also becomes a difficult problem. If the parameter is chosen too small, the neighborhood graph will be disconnected; if it is chosen too large, a so-called 'short-circuit' problem occurs and the constructed neighborhood graph no longer follows the manifold. This is shown in Figure 4(b), where the points A and B may be directly connected by an edge if the parameter is chosen too large.

In principle, graph distances approach the corresponding geodesic distances as the number of data points increases, and data projection using geodesic distances is guaranteed to converge asymptotically to the true intrinsic structure of the data. In practice, however, the neighborhood graph may not offer a good estimate of a geodesic distance, especially a short one that spans only a few edges of the graph. For example, the constructed neighborhood graph shown in Figure 4(c) contains many holes and looks like a Swiss cheese. As illustrated by a path between points C and D in the figure, the geodesic distance between a pair of points on opposite sides of a hole is overestimated by the length of the path along the bank of the hole. This means that the calculated graph distances, especially the short ones that span a few edges of the graph, may not be good estimates of the corresponding geodesic distances. This may explain the different behavior of Isomap, GeoNLM, and CDA: the underlying methods NLM and CCA, used in GeoNLM and CDA, emphasize the preservation of short distances, some of which may be overestimated by detoured paths on a neighborhood graph with holes.


FIGURE 4 | (a) A Swiss-roll data set, (b) its 5-NN neighborhood graph, and (c) its two-dimensional projection.

ALIGNMENT APPROACHES

Another way to reduce the dimensionality of data distributed on a q-dimensional manifold is through direct alignment of local linear models. In principle, a q-dimensional smooth manifold locally resembles q-dimensional Euclidean space. A neighborhood can then be defined for each data point on the manifold, within which all points resume the behavior of points in q-dimensional Euclidean space. Therefore, q-dimensional local models can be derived in each neighborhood through linear methods. On the global scale, we are interested in data projection through an alignment of these local models. The local models are coordinated and aligned by affine transformations, one for each model, from its own variable space to a single global coordinate system. If the local models preserve distances, we expect the resulting global configuration to preserve local distances. The alignment is inherently an optimization problem that minimizes a predefined global alignment error. An exciting result is that, as long as the error can be expressed as a quadratic form of the resulting coordinates, it can be minimized by solving a sparse eigenvalue problem subject to constraints that make the problem well posed. The results of the alignment are globally optimized, and there is no problem with local optima. The underlying manifold may be highly nonlinear even though both the local models and the alignment process are linear.


FIGURE 5 | (a) Swiss-roll data superimposed with neighborhoods; (b) two-dimensional projection by alignment of local models.

For illustration, Figure 5(a) shows the Swiss-roll data set superimposed with a set of local neighborhoods, each of which is marked by a circle. The local model derived from each neighborhood can be transformed and aligned with the local models derived from the other neighborhoods in a two-dimensional space, as shown in Figure 5(b).

A celebrated approach to model alignment is locally linear embedding (LLE).16 Its local model approximates a data point by a weighted linear combination of its neighboring points. The local approximation problem, which is decoupled across data points, is a typical least-squares fitting problem. On the global scale, LLE finds a low-dimensional data configuration such that the weights used in the linear approximations of all data points are best preserved. Clearly, the local model used in LLE is not isometric, and it is not the goal of LLE to preserve any distance. However, LLE sets a foundation for other methods that use isometric local models. Computationally, the global optimization problem is to minimize a residual measure that has a normalized quadratic form in the resulting coordinates and can thus be solved analytically with an eigensolver.

The idea of using an eigensolver to derive a globally optimized analytical solution is so appealing that researchers have developed techniques using different local models. The Laplacian eigenmap17 solves a constrained optimization problem that minimizes a weighted sum of squared local distances between neighboring points. The physical interpretation of this error measure is that it reflects the average of the Laplacian operator on tangent spaces over the manifold. Another technique, Hessian eigenmaps,18 uses a similar idea but with a Hessian matrix replacing the Laplacian operator. The local models used in these approaches are isometric. Since these approaches are based on the intrinsic geometric structure of the manifold, they exhibit stability with respect to the manifold and offer less deformed results than those produced by LLE.
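To make the Laplacian eigenmap idea more concrete, here is a short sketch (our own illustrative code, with a heat-kernel weighting that is one common choice rather than the only one): build a symmetrized k-NN graph and take the smallest nontrivial generalized eigenvectors of the graph Laplacian.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def laplacian_eigenmap_sketch(X, q=2, k=10, t=1.0):
    """Laplacian eigenmap sketch: k-NN graph with heat-kernel weights,
    then the smallest nontrivial solutions of L y = lambda D y.
    Assumes the graph has no isolated vertices."""
    dist = squareform(pdist(X))
    n = len(X)
    W = np.zeros((n, n))
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]
    for i in range(n):
        W[i, nn[i]] = np.exp(-dist[i, nn[i]]**2 / t)   # heat-kernel weights
    W = np.maximum(W, W.T)                             # symmetrize the graph
    D = np.diag(W.sum(axis=1))                         # degree matrix
    L = D - W                                          # graph Laplacian
    vals, vecs = eigh(L, D)                            # generalized eigenproblem
    return vecs[:, 1:q + 1]                            # skip the constant eigenvector
```
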
Recall that PCA and classical MDS are traditional linear methods for dimensionality reduction and can thus be used to derive local models. Because a smooth manifold locally resembles a Euclidean space, models derived by PCA or classical MDS are locally distance preserving and reflect the local geometry of the manifold. To align the local models, the alignment error can be defined as the sum of squared distances between the resulting points and the corresponding points on the transformed local models. This alignment error takes a quadratic form in the resulting coordinates and can be minimized with an eigensolver. Although such an alignment may not be globally distance preserving, it maps the data to a single global low-dimensional coordinate system with the property of local isometry.

Local tangent space alignment (LTSA)19 is an approach that takes this idea by using PCA to derive a low-dimensional local model at every data point and then aligning these models for dimensionality reduction. Another method, locally multidimensional scaling (LMDS),20 uses classical MDS to derive the local models. For model alignment, a simple observation about PCA and classical MDS is that they project all data points within a neighborhood in the same way, so the local model does not discriminate a data point from the other data points in its neighborhood.


LMDS takes advantage of this observation by deriving and aligning local models on a minimal set of overlapping neighborhoods. This set of overlapping neighborhoods can be obtained through a greedy approximation algorithm for the classical minimum set cover problem. Consequently, LMDS scales with the number of overlapping neighborhoods rather than with the number of data points. Another major difference between LTSA and LMDS concerns their requirements on the input data. Unlike PCA, which requires input data coordinates, classical MDS requires only local distances as input. This makes LMDS applicable in settings where each entity knows nothing beyond its own neighborhood and there is a demand to derive global information from local neighborhoods.

BOX 5: LOCALLY LINEAR EMBEDDING

Although LLE does not preserve distances, we present its basic idea because it inspired other alignment methods with isometric local models. LLE uses local linearity of the manifold to find a linear representation of each point in terms of its neighbors. Let W = {wij} be a matrix of weights, where each wij summarizes the contribution of xj to the reconstruction of xi. The first step of LLE finds W to minimize the reconstruction error Σ_i (xi − Σ_j wij xj)² subject to the constraints Σ_j wij = 1, wii = 0, and wij = 0 if xj is not within the neighborhood of xi. Note that the weights of each data point are decoupled from those of the other points and can be computed independently. The second step of LLE finds a new set of points {y1, ..., yn} in q-dimensional space such that the global alignment error Σ_i (yi − Σ_j wij yj)² is minimized. Both error measures have a similar form, but the variables in the first error measure are the wij, whereas the variables in the second are the yi. The second error measure defines a quadratic form in the yi. Subject to constraints that make the problem well posed, it can be minimized by solving a sparse n × n eigenvector problem, whose eigenvectors for the q smallest nonzero eigenvalues provide an ordered set of orthogonal coordinates centered at the origin. It should be noted that, similar to PCA and classical MDS, a solution can be found for all values of q simultaneously: the best one-dimensional projection is simply the first coordinate of the best two-dimensional projection, and so on. Other methods extend LLE with different local models and alignment functions.
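The two steps in Box 5 translate almost directly into code. The sketch below is our own illustration of the standard recipe (regularized local least squares for the weights, then the bottom eigenvectors of (I − W)^T (I − W)); the parameter values are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def lle_sketch(X, q=2, k=10, reg=1e-3):
    """Sketch of locally linear embedding as outlined in Box 5."""
    n = len(X)
    D = squareform(pdist(X))
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]          # k nearest neighbors of x_i
        Z = X[nbrs] - X[i]                        # neighbors shifted to the origin
        C = Z @ Z.T                               # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)        # regularization for stability
        w = np.linalg.solve(C, np.ones(k))        # solve the local least-squares problem
        W[i, nbrs] = w / w.sum()                  # enforce sum of weights = 1
    M = np.eye(n) - W
    M = M.T @ M                                   # matrix whose bottom eigenvectors
    vals, vecs = np.linalg.eigh(M)                # give the embedding coordinates
    return vecs[:, 1:q + 1]                       # skip the constant eigenvector
```
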
SUMMARY AND COMPARISON

The objective of dimensionality reduction is to project a set of high-dimensional data observations to a low-dimensional space such that the projection preserves essential information (such as distances between data observations) and performs better than the original data in further processing. We summarize the major approaches used for distance-preserving dimensionality reduction as follows:

• A simple approach is linear mapping, which is characterized by a projection matrix that linearly transforms high-dimensional points to a low-dimensional space. Example methods are PCA and classical MDS. The projection matrix is designed so that the variance of the resulting data points is maximized. It offers the optimal linear mapping for distance preservation in the sense that the sum of changes of squared distances is minimized. In addition, random projection offers a simple and efficient approach to linear mapping; it can project a small set of high-dimensional data points to a low-dimensional space in such a way that distances between data points are approximately preserved.

• Methods can be developed to explicitly minimize the change of distances before and after the projection. Usually, the minimization of such a cost function has no closed-form solution, so the cost function is minimized iteratively. Example methods include Kruskal's MDS, NLM, and CCA. A problem with iterative minimization is that it may terminate at a local minimum.

• Many methods take interpoint distances as input. In addition to the usual Euclidean distances, the input distances can be geodesic distances between data points along the manifold. The geodesic distance between a pair of data points is estimated by the length of the shortest path between the pair of points on a neighborhood graph, which connects every data point to its neighboring data points. Example methods include Isomap, GeoNLM, and CDA: Isomap applies classical MDS, GeoNLM applies NLM, and CDA applies CCA to the estimated geodesic distances.
378 
c 2011 John Wiley & Sons, Inc. Volume 1, September/October 2011
WIREs Data Mining and Knowledge Discovery Distance-preserving dimensionality reduction

• A manifold locally resembles a Euclidean space. Therefore, linear models can be built locally within local neighborhoods and then aligned on a global scale in the target space. The alignment process minimizes an error that is usually defined to have a quadratic form in the resulting coordinates and can therefore be elegantly minimized using an eigensolver. These methods include LLE and its derivatives using various local models and alignment mechanisms. If the local models preserve distances, as in LTSA and LMDS, the result of the global alignment offers local isometry.

• There are also other distance-preserving methods that do not fit into the above framework. These include the triangular mapping method and the tetrahedral mapping method, which preserve some exact distances.

The linear methods (PCA and classical MDS), Isomap (which uses classical MDS), and the methods using local model alignments share an important feature: they provide closed-form solutions by solving eigenvalue problems of specially constructed matrices. This feature comes from the simple fact that their error measures are defined as quadratic forms of the unknown data configurations, the minimization of which can be carried out with an eigensolver. Considered together, these methods are called spectral methods. Their computations are based on tractable optimization techniques such as shortest paths, least-squares fits, and matrix eigenvalue decomposition, and the solutions they give are globally optimized. The situation changes dramatically if the error measure cannot be expressed in a quadratic form, in which case we have to appeal to iterative procedures to minimize it. Indeed, the power of the mathematical tools at our disposal decides the scope of what we can do in this and many other research areas.

Dimensionality reduction using geodesic distances and dimensionality reduction using model alignments represent active areas of research. In both areas, methods take bottom-up approaches by assigning a local neighborhood to each data point and combining local information to obtain a global data configuration. However, they are fundamentally different in terms of the distances they preserve. By applying classical MDS to the estimated geodesic distance between every pair of data points, Isomap attempts to offer global isometry; as discussed, however, Isomap does not work well at preserving local distances. On the other hand, variations of LLE, such as LTSA and LMDS, try to maintain local isometry by applying distance-preserving local models within each neighborhood at the cost of giving up global isometry. In recent years, there have been attempts, such as maximum variance unfolding,21 that try to maximize the overall variance of the embedding while preserving distances within local neighborhoods. In fact, a data projection preserving global distances, such as the one attempted by Isomap, is possible only when the manifold is flat. A simple counterexample is data distributed on the surface of a sphere, which cannot be projected to a two-dimensional space with global distances preserved.

The success of both Isomap-like and LLE-like methods depends on how a neighborhood is assigned to each data point. The size of the neighborhood is a compromise: it must be large enough to build a connected neighborhood graph for Isomap or to allow good construction of local models for model alignment, but small enough for the data manifold within the neighborhood to have little or no curvature. Several approaches have been proposed22 to overcome these problems and build connected neighborhood graphs. The difficulty of choosing a proper neighborhood size and the tension between the preservation of global and local distances reflect a fundamental problem: what is global and what is local? This is a persistent question in pattern analysis and machine learning that may not be easily answered without domain-specific knowledge.

CONCLUSION

In this paper, we have introduced major ideas and important techniques for distance-preserving dimensionality reduction. These techniques and algorithms are classified into linear methods, methods using iterative optimization, methods using geodesic distances, and methods based on alignments of local models. We discussed these methods by focusing on their basic ideas and by summarizing their common features and differences.

Dimensionality reduction using geodesic distances and through model alignment represents an active area of research. In both areas, the assignment of local neighborhoods is of ultimate importance to the success of dimensionality reduction. There is a risk of overfitting or of a disconnected neighborhood graph if the neighborhood is too small, and a risk of mismatching the local geometry if the neighborhood is too large. One direct question is therefore how to make neighborhood sizes adaptive to local curvatures and data densities. More generally, studying data transformation and manipulation on manifolds opens a new arena of research for data processing, which invites ideas and theories from differential geometry. From an algorithmic point of view, other interesting topics include bidirectional mapping and incremental mapping of new data points; in fact, an incremental extension23 of Isomap already exists. We expect to see more extensions and improvements along these directions in the near future.


REFERENCES

1. Jolliffe I. Principal Component Analysis. 2nd ed. New York: Springer; 2002.
2. Cox T, Cox M. Multidimensional Scaling. 2nd ed. London: Chapman and Hall/CRC; 2000.
3. Stone JV. Independent Component Analysis: A Tutorial Introduction. Cambridge: The MIT Press; 2004.
4. Hyvärinen A, Karhunen J, Oja E. Independent Component Analysis. New York: Wiley-Interscience; 2001.
5. Bingham E, Mannila H. Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01). New York: ACM Press; 2001, 245–250.
6. Schölkopf B, Smola A, Müller K-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 1998, 10:1299–1319.
7. Kruskal JB, Wish M. Multidimensional Scaling. Beverly Hills, CA: Sage Publications; 1978.
8. Sammon JW. A nonlinear mapping for data structure analysis. IEEE Trans Comput 1969, C-18:401–409.
9. Demartines P, Hérault J. Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Trans Neural Netw 1997, 8:148–154.
10. Niemann H, Weiss J. A fast converging algorithm for nonlinear mapping of high-dimensional data onto a plane. IEEE Trans Comput 1979, C-28:142–147.
11. Lee RC, Slagle JR, Blum H. A triangulation method for the sequential mapping of points from N-space to two-space. IEEE Trans Comput 1977, 26:288–292.
12. Yang L. Distance-preserving projection of high-dimensional data for nonlinear dimensionality reduction. IEEE Trans Pattern Anal Machine Intell 2004, 26:1243–1246.
13. Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290:2319–2323.
14. Yang L. Sammon's nonlinear mapping using geodesic distances. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Vol. 2. Cambridge: IEEE Computer Society; 2004.
15. Lee JA, Lendasse A, Donckers N, Verleysen M. A robust nonlinear projection method. In: Proceedings of the 8th European Symposium on Artificial Neural Networks (ESANN 2000), Bruges; 2000, 13–20.
16. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290:2323–2326.
17. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 2003, 15:1373–1396.
18. Donoho DL, Grimes C. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci 2003, 100:5591–5596.
19. Zhang Z, Zha H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J Sci Comput 2005, 26:313–338.
20. Yang L. Alignment of overlapping locally scaled patches for multidimensional scaling and dimensionality reduction. IEEE Trans Pattern Anal Machine Intell 2008, 30:438–450.
21. Weinberger KQ, Sha F, Saul LK. Learning a kernel matrix for nonlinear dimensionality reduction. In: Proceedings of the 21st International Conference on Machine Learning (ICML-04). Banff: ACM Press; 2004, 839–846.
22. Yang L. Building connected neighborhood graphs for isometric data embedding. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'2005). Chicago, IL: ACM Press; 2005.
23. Law MH, Jain AK. Incremental nonlinear dimensionality reduction by manifold learning. IEEE Trans Pattern Anal Machine Intell 2006, 28:377–391.

FURTHER READING
Borg I, Groenen P. Modern Multidimensional Scaling: Theory and Applications. 2nd ed. New York: Springer; 2005.
Gorban AN, Kégl B, Wunsch DC, Zinovyev A, eds. Principal Manifolds for Data Visualization and Dimension Reduction.
Berlin: Springer; 2007.
Lee JA, Verleysen M. Nonlinear Dimensionality Reduction. New York: Springer; 2007.
