9780521190176
MASASHI SUGIYAMA
Tokyo Institute of Technology
TAIJI SUZUKI
The University of Tokyo
TAKAFUMI KANAMORI
Nagoya University
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town,
Singapore, São Paulo, Delhi, Mexico City
Contents
Foreword page ix
Preface xi
3 Moment Matching 39
3.1 Basic Framework 39
3.2 Finite-Order Approach 39
3.3 Infinite-Order Approach: KMM 43
3.4 Numerical Examples 44
3.5 Remarks 45
4 Probabilistic Classification 47
4.1 Basic Framework 47
4.2 Logistic Regression 48
4.3 Least-Squares Probabilistic Classifier 50
5 Density Fitting 56
5.1 Basic Framework 56
5.2 Implementations of KLIEP 57
5.3 Model Selection by Cross-Validation 64
5.4 Numerical Examples 65
5.5 Remarks 65
6 Density-Ratio Fitting 67
6.1 Basic Framework 67
6.2 Implementation of LSIF 68
6.3 Model Selection by Cross-Validation 70
6.4 Numerical Examples 73
6.5 Remarks 74
7 Unified Framework 75
7.1 Basic Framework 75
7.2 Existing Methods as Density-Ratio Fitting 77
7.3 Interpretation of Density-Ratio Fitting 81
7.4 Power Divergence for Robust Density-Ratio Estimation 84
7.5 Remarks 87
Part V Conclusions
17 Conclusions and Future Directions 303
Picture taken in Nagano, Japan, in the summer of 2009. From left to right, Taiji Suzuki,
Masashi Sugiyama, and Takafumi Kanamori.
Takimoto, Yuta Tsuboi, Kazuya Ueki, Paul von Bünau, Gordon Wichern, and
Makoto Yamada.
Finally, we thank the Ministry of Education, Culture, Sports, Science and
Technology; the Alexander von Humboldt Foundation; the Okawa Foundation;
Microsoft Institute for Japanese Academic Research Collaboration Collabora-
tive Research Project; IBM Faculty Award; Mathematisches Forschungsinstitut
Oberwolfach Research-in-Pairs Program; the Asian Office of Aerospace Research
and Development; Support Center for Advanced Telecommunications Technol-
ogy Research Foundation; and the Japan Science and Technology Agency for their
financial support.
Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori
Part I
Figure 1.1. Regression and classification tasks in supervised learning. The goal of regression is to learn the target function from training samples, while the goal of classification is to learn the decision boundary from training samples.
Figure 1.2. Research topics in supervised learning. The goal of model selection is to appropriately control the complexity of a function class from which a learned function is searched. The goal of active learning is to find good locations of training input points. The goal of dimensionality reduction is to find a suitable low-dimensional expression of training samples for predicting output values.
Figure 1.3. Research topics in unsupervised learning. The goal of visualization is to find a low-dimensional expression of samples that provides us some intuition behind the data. The goal of clustering is to group samples into several clusters. The goal of outlier detection is to identify “irregular” samples. The goal of independent component analysis is to extract original source signals from their mixed signals.
1993; Comon, 1994; Amari et al., 1996; Amari, 1998, 2000; Hyvärinen, 1999;
Cardoso, 1999; Lee et al., 1999; Hyvärinen et al., 2001; Bach and Jordan, 2002;
Hulle, 2008; Suzuki and Sugiyama, 2011).
Figure 1.4. Reinforcement learning. If the agent takes some action, then the next state and the reward are given from the environment. The goal of reinforcement learning is to let the agent learn the control policy that maximizes the long-term cumulative rewards.
reduction are highly useful for solving realistic reinforcement learning problems.
Thus, reinforcement learning can be regarded as a challenging application of
machine learning methods. Following the rapid advancement of machine learn-
ing techniques and computer environments in the last decade, reinforcement
learning algorithms have become capable of handling large-scale, complex, real-
world problems. For this reason, reinforcement learning has gathered considerable
attention recently in the machine learning community.
transfer learning. In econometrics, this problem has been studied under the
name of sample selection bias (Heckman, 1979).
When the training and test distributions have nothing in common, it is not
possible to learn anything about the test distribution from the training sam-
ples. Thus, some similarity between training and test distributions needs to be
assumed to make the discussion meaningful. Here we focus on the situation
called a covariate shift, where the training and test input distributions are dif-
ferent but the conditional distribution of outputs given inputs is common to the
training and test samples (Shimodaira, 2000). Note that “covariate” refers to an
input variable in statistics.
Under the covariate shift setup, the density ratio between the training
and test densities can be used as the measure of the “importance” of each
training sample in the test domain. This approach can be regarded as an appli-
cation of the importance sampling technique in statistics (Fishman, 1996).
By weighting the training loss function according to the importance value,
ordinary supervised learning techniques can be systematically modified so
that suitable theoretical properties such as consistency and asymptotic unbi-
asedness can be properly achieved even under covariate shift (Shimodaira,
2000; Zadrozny, 2004; Sugiyama and Müller, 2005; Sugiyama et al., 2007,
2008; Storkey and Sugiyama, 2007; Huang et al., 2007; Yamazaki et al., 2007;
Bickel et al., 2007; Kanamori et al., 2009; Quiñonero-Candela et al., 2009;
Sugiyama and Kawanabe, 2011).
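As a concrete illustration of this importance-weighting idea (a minimal sketch, not the book's implementation; the density-ratio weights are assumed to come from any of the estimators discussed later), one can weight each training loss term by an estimate of $r(x) = p_{\mathrm{test}}(x)/p_{\mathrm{train}}(x)$:

```python
import numpy as np

def importance_weighted_least_squares(X_tr, y_tr, weights, ridge=1e-3):
    """Fit a linear model by importance-weighted least squares under covariate shift.

    Each training loss term is weighted by an estimate of the density ratio
    r(x) = p_test(x) / p_train(x); the small ridge term is an ad hoc stabilizer.
    """
    W = np.diag(weights)                                   # importance weights r(x_i)
    A = X_tr.T @ W @ X_tr + ridge * np.eye(X_tr.shape[1])
    b = X_tr.T @ W @ y_tr
    return np.linalg.solve(A, b)                           # weighted ridge-regression solution

# Usage sketch: theta = importance_weighted_least_squares(X_tr, y_tr, r_hat(X_tr))
```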
Multi-task learning (Section 9.2): When one wants to solve many supervised
learning problems, each of which contains a small number of training sam-
ples, solving them simultaneously could be more promising than solving them
independently if the learning problems possess some similarity. The goal
of multi-task learning is to solve multiple related learning problems accu-
rately based on task similarity (Caruana et al., 1997; Baxter, 1997, 2000;
Ben-David et al., 2002; Bakker and Heskes, 2003; Ben-David and Schuller,
2003; Evgeniou and Pontil, 2004; Micchelli and Pontil, 2005; Yu et al., 2005;
Ando and Zhang, 2005; Xue et al., 2007; Kato et al., 2010).
The essence of multi-task learning is data sharing among different tasks,
which can also be carried out systematically by using the importance sampling
technique (Bickel et al., 2008). Thus, a multi-task learning problem can be
handled in the same way as non-stationarity adaptation in the density ratio
framework.
Outlier detection (Section 10.1): The goal of outlier detection is to identify out-
liers in a dataset. In principle, outlier-detection problems can be solved in
a supervised way by learning a classification rule between the outliers and
inliers based on outlier and inlier samples. However, because outlier patterns
are often so diverse and their tendency may change over time in practice, such
a supervised learning approach is not necessarily appropriate. Semi-supervised
approaches are also being explored (Gao et al., 2006a, 2006b) but they still
suffer from the same limitation. For this reason, it is common to tackle the
example, when the confidence of class prediction is low, one may decide to label
the test pattern manually. The confidence of class prediction can be obtained
by the class-posterior probability given the test patterns. A classification based
on the class-posterior probability is called probabilistic classification. Because
the class-posterior probability is expressed as the joint probability of patterns
and labels over the marginal probability of patterns, probabilistic classification
can be carried out by direct application of density-ratio estimators (Sugiyama,
2010).
Independence test: Given input–output samples, testing whether the input and
output are statistically independent is an important task in statistical data anal-
ysis. This problem is referred to as the independence test. A standard approach
to the independence test is to estimate the degree of independence between the
inputs and outputs. Mutual information estimators based on density ratios can
be used for this purpose (Sugiyama and Suzuki, 2011).
Variable selection: The goal of variable selection or feature selection in super-
vised learning is to find a subset of attributes of the original input vector that
are responsible for predicting the output values. This can be carried out by
measuring the independence between a subset of input features and outputs.
Thus, mutual information estimators based on density ratios can be used for
this purpose (Suzuki et al., 2009b).
Clustering: The goal of clustering is to group data samples into several dis-
joint clusters based on their pairwise similarity. This can be carried out by
determining the cluster labels (i.e., group indices) so that the dependence on
input patterns is maximized (Song et al., 2007a; Faivishevsky and Goldberger,
2010). Mutual information estimators based on density ratios can be used as a
dependence measure (Kimura and Sugiyama, 2011).
Object matching: The goal of object matching is to find a correspondence
between two sets of objects in different domains in an unsupervised way.
Object matching is typically formulated as finding a mapping from objects
in one domain to objects in the other domain so that the pairwise depen-
dency is maximized (Jebara, 2004; Quadrianto et al., 2010). Mutual information
estimators based on density ratios can be used as a dependence measure
(Yamada and Sugiyama, 2011a).
Dimensionality reduction (Section 11.2): The goal of dimensionality reduction
or feature extraction is the same as feature selection. However, instead of sub-
sets, linear combinations of input features are searched. Similarly to feature
solving a more general and thus difficult problem of estimating the data-generation
probability.
Vapnik’s principle in the context of density ratio estimation may be interpreted
as follows:
One should avoid estimating the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ when estimating the ratio $r^*(x)$.

This statement sounds reasonable because knowing the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ implies knowing the ratio $r^*(x)$. However, the opposite is not necessarily true, because the ratio $r^*(x)$ cannot be uniquely decomposed into the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ (Figure 1.5). Thus, directly estimating the ratio $r^*(x)$ would be a more sensible and promising approach than estimating the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ separately.
Following this idea, various direct density-ratio estimation methods have been
proposed so far, which are summarized in the following.
Moment matching (Chapter 3): Two distributions are equivalent if and only if all moments agree with each other, and the numerator density $p^*_{\mathrm{nu}}(x)$ can be expressed in terms of the ratio $r^*(x)$ as
$$p^*_{\mathrm{nu}}(x) = r^*(x)\,p^*_{\mathrm{de}}(x). \qquad (1.1)$$
Thus, if a density-ratio model $r(x)$ is learned so that the moments of $p^*_{\mathrm{nu}}(x)$ and $r(x)\,p^*_{\mathrm{de}}(x)$ agree with each other, a good approximation to the true density ratio $r^*(x)$ may be obtained. This is the basic idea of the moment-matching approach (see Figure 1.6).
Figure 1.5. Knowing the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ implies knowing their ratio $r^*(x) = p^*_{\mathrm{nu}}(x)/p^*_{\mathrm{de}}(x)$. However, the ratio $r^*(x)$ cannot be uniquely decomposed into the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$.
However, directly matching all (i.e., infinitely many) moments is not possi-
ble in reality, and hence one may choose a finite number of moments and match
them. Although this results in a computationally tractable algorithm, consis-
tency is not guaranteed (i.e., even in the limit of a large sample size, the optimal
solution cannot be necessarily obtained).
An alternative approach is to employ a universal reproducing kernel Hilbert
space (Steinwart, 2001); the Gaussian reproducing kernel Hilbert space is a
typical example. It was shown that mean matching in a universal reproduc-
ing kernel Hilbert space leads to a consistent estimator (Huang et al., 2007;
Quiñonero-Candela et al., 2009). This gives a computationally efficient and
consistent algorithm for density-ratio estimation.
Probabilistic classification (Chapter 4): An alternative approach to density-ratio estimation is to use a probabilistic classifier. A key fact is that the density ratio $r^*(x)$ can be expressed as
$$r^*(x) = \frac{p^*(y=\mathrm{de})}{p^*(y=\mathrm{nu})}\,\frac{p^*(y=\mathrm{nu}\,|\,x)}{p^*(y=\mathrm{de}\,|\,x)}.$$
Thus, if a probabilistic classifier that separates numerator samples and denominator samples can be obtained, the density ratio can be estimated (Qin, 1998; Cheng and Chu, 2004; Bickel et al., 2007). For example, a logistic regression classifier (Hastie et al., 2001) and a least-squares probabilistic classifier (LSPC; Sugiyama, 2010) can be employed for this purpose (see Figure 1.7). The logistic regression approach is shown to be optimal among a class of semi-parametric estimators under a correct model assumption (Qin, 1998), while an LSPC is computationally more efficient than logistic regression and is thus applicable to massive datasets.
Density fitting (Chapter 5): In the moment matching approach described previously, density ratios are estimated so that the moments of $p^*_{\mathrm{nu}}(x)$ and $r(x)\,p^*_{\mathrm{de}}(x)$ are matched via Eq. (1.1). On the other hand, in the density-fitting approach, density ratios are estimated so that the Kullback–Leibler divergence from $p^*_{\mathrm{nu}}(x)$ to $r(x)\,p^*_{\mathrm{de}}(x)$ is minimized. This framework is called the KL importance estimation procedure (KLIEP; Sugiyama et al., 2008). The “importance” is another name for the density ratio.
Figure 1.7. Probabilistic classification of samples drawn from $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$.
[Figure 1.9 depicts the organization of the book:
Part I, Density-Ratio Approach to Machine Learning: Chapter 1 (Introduction).
Part II, Methods of Density-Ratio Estimation: Chapter 2 (Density Estimation), Chapter 3 (Moment Matching), Chapter 4 (Probabilistic Classification), Chapter 5 (Density Fitting), Chapter 6 (Density-Ratio Fitting), Chapter 7 (Unified Framework), Chapter 8 (Direct Density-Ratio Estimation with Dimensionality Reduction).
Part III, Applications of Density Ratios in Machine Learning: Chapter 9 (Importance Sampling), Chapter 10 (Distribution Comparison), Chapter 11 (Mutual Information Estimation), Chapter 12 (Conditional Probability Estimation).
Part IV, Theoretical Analysis of Density-Ratio Estimation: Chapter 13 (Parametric Convergence Analysis), Chapter 14 (Non-Parametric Convergence Analysis), Chapter 15 (Parametric Two-Sample Test), Chapter 16 (Non-Parametric Numerical Stability Analysis).
Part V, Perspectives: Chapter 17 (Conclusions and Future Directions).]
Figure 1.9. Organization of chapters. Chapters 7, 8, 13, 14, 15, and 16 contain advanced materials, and thus beginners may skip these chapters at first.
Part II
Methods of Density-Ratio Estimation
In this part we discuss the problem of estimating the ratio of two probability density
functions from samples.
First we formulate the problem of density-ratio estimation. Let $\mathcal{X}$ ($\subset \mathbb{R}^d$) be the data domain, and suppose we are given independent and identically distributed (i.i.d.) samples $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ from a distribution with density $p^*_{\mathrm{nu}}(x)$ and i.i.d. samples $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$ from another distribution with density $p^*_{\mathrm{de}}(x)$:
$$\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}} \overset{\mathrm{i.i.d.}}{\sim} p^*_{\mathrm{nu}}(x) \quad\text{and}\quad \{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}} \overset{\mathrm{i.i.d.}}{\sim} p^*_{\mathrm{de}}(x).$$
We assume that $p^*_{\mathrm{de}}(x)$ is strictly positive over the domain $\mathcal{X}$, that is, $p^*_{\mathrm{de}}(x) > 0$ for all $x \in \mathcal{X}$. Our goal is to estimate the density ratio
$$r^*(x) := \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)}$$
from the samples $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$, where “nu” and “de” indicate “numerator” and “denominator,” respectively.
This part contains seven chapters. In Chapter 2, density-ratio estimation based
on a separate density estimation is explained. Parametric and nonparametric
density estimation methods are covered here. In the two-step approach of first
estimating the numerator and denominator densities and then plugging the esti-
mated densities into the ratio, the error caused by the second step (plug-in) is not
taken into account in the first step (density estimation). Thus, a more promising
approach would be a one-shot procedure of directly estimating the density ratio
without going through density estimation. Methods following this idea will be
explained in detail in the following chapters.
In Chapter 3, the framework of density-ratio estimation based on moment matching is explained. The basic idea of moment matching is to obtain a “transformation” function $r(x)$ that matches the moments of the denominator density $p^*_{\mathrm{de}}(x)$ with those of the numerator density $p^*_{\mathrm{nu}}(x)$. Along this line, finite-order and infinite-order moment matching methods are explained here.
[Figure 1.10 depicts the structure of Part II (Methods of Density-Ratio Estimation): Chapter 2 (Density Estimation), Chapter 3 (Moment Matching), Chapter 4 (Probabilistic Classification), Chapter 5 (Density Fitting), Chapter 6 (Density-Ratio Fitting), Chapter 7 (Unified Framework), and Chapter 8 (Direct Density-Ratio Estimation with Dimensionality Reduction).]
Figure 1.10. Structure of Part II. Chapters 7 and 8 contain advanced materials, and thus beginners may skip these chapters at first.
$$\{x_k\}_{k=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} p^*(x).$$
The goal of density estimation is to obtain an estimator $\hat{p}(x)$ of the true density $p^*(x)$ from $\{x_k\}_{k=1}^{n}$.
Here we describe two approaches to density-ratio estimation based on density estimation: the ratio of density estimators in Section 2.1.1 and uniformized density estimation in Section 2.1.2.
This density estimation approach would be an easy and handy method for density-ratio estimation. However, division by an estimated density $\hat{p}_{\mathrm{de}}(x)$ may magnify the estimation error included in $\hat{p}_{\mathrm{nu}}(x)$.
[Figure 2.2: the true denominator density $p^*_{\mathrm{de}}(x)$ on an interval $[a,b]$, its cumulative distribution $P^*_{\mathrm{de}}(x)$, and the empirical cumulative distribution $\hat{P}_{\mathrm{de}}(x)$ constructed from the sorted denominator samples $x^{\mathrm{de}}_1,\ldots,x^{\mathrm{de}}_5$.]
where we assume that $x^{\mathrm{de}}_1 \le \cdots \le x^{\mathrm{de}}_{n_{\mathrm{de}}}$ without loss of generality (see Figure 2.2).
Since no division by an estimated quantity is included in the uniformized density
estimation approach, it is expected to be more accurate than the ratio of density esti-
mations explained in Section 2.1.1. However, as shown in Ćwik and Mielniczuk
(1989) and Chen et al. (2009), these two approaches possess the same non-
parametric convergence rate.
2.1.3 Summary
As shown in the previous subsections, density estimation can be used for estimating density ratios. In the rest of this chapter, we consider a standard density estimation problem of estimating $p^*(x)$ from its i.i.d. samples $\{x_k\}_{k=1}^{n}$ and review two approaches to density estimation: the parametric approach (Section 2.2) and the non-parametric approach (Section 2.3).
Then the ML density estimator $\hat{p}_{\mathrm{ML}}(x)$ is given by $\hat{p}_{\mathrm{ML}}(x) = p(x;\hat{\theta}_{\mathrm{ML}})$.
When the number $n$ of samples is large, the likelihood often takes an extremely small value. To avoid any numerical instability caused by this, the log-likelihood is often used instead:
$$\log L(\theta) = \sum_{k=1}^{n} \log p(x_k;\theta). \qquad (2.1)$$
Because the log function is monotone increasing, the same ML estimator can be obtained by maximizing the log-likelihood, which may be numerically more preferable:
$$\hat{\theta}_{\mathrm{ML}} = \mathop{\mathrm{argmax}}_{\theta\in\Theta}\ \log L(\theta).$$
Under some mild assumptions, the ML estimator was shown to possess the
following properties:
Consistency: The ML estimator converges in probability to the optimal parameter
in the model (i.e., the projection of the true probability density function onto
the model) under the Kullback–Leibler (KL) divergence (see Section 2.2.3).
Asymptotic unbiasedness: The expectation of the ML estimator converges to the optimal parameter in the model.
Asymptotic efficiency: The ML estimator has the smallest variance among all asymptotically unbiased estimators.1
Thus, the ML approach would be a useful method for parametric density estimation.
However, the previously mentioned asymptotic properties do not necessarily imply
that the ML estimator is accurate in small sample cases.
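As a concrete illustration (a minimal sketch, not taken from the book), the following fits a one-dimensional Gaussian model by maximizing the log-likelihood of Eq. (2.1); for the Gaussian model the maximizer is available in closed form:

```python
import numpy as np

def gaussian_ml_fit(x):
    """ML estimation of a Gaussian model p(x; mu, sigma^2).

    Maximizing sum_k log p(x_k; theta) over theta = (mu, sigma^2) yields
    the sample mean and the (biased) sample variance.
    """
    mu_ml = np.mean(x)
    var_ml = np.mean((x - mu_ml) ** 2)   # note: divides by n, not n - 1
    return mu_ml, var_ml

def gaussian_log_likelihood(x, mu, var):
    """Log-likelihood (2.1) of the Gaussian model at given parameters."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))
```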
Based on this expression, the Bayesian predictive distribution $\hat{p}_{\mathrm{Bayes}}(x)$ is given as
$$\hat{p}_{\mathrm{Bayes}}(x) := \int p(x|\theta)\,p(\theta|\mathcal{D})\,\mathrm{d}\theta = \frac{\int p(x|\theta)\prod_{k=1}^{n} p(x_k|\theta)\,\pi(\theta)\,\mathrm{d}\theta}{\int \prod_{k'=1}^{n} p(x_{k'}|\theta)\,\pi(\theta)\,\mathrm{d}\theta}. \qquad (2.3)$$
Thus, the parametric model $p(x|\theta)$ is averaged over the posterior probability $p(\theta|\mathcal{D})$. This averaging nature would be a peculiarity of Bayes estimation: $\hat{p}_{\mathrm{Bayes}}(x)$ does not necessarily belong to the original parametric model $p(x|\theta)$ because a parametric model often forms a curved manifold in the space of probability densities (Amari and Nagaoka, 2000). See Figure 2.3 for illustration.
Bayes estimation was shown to be a powerful alternative to ML density estimation. However, its computation is often cumbersome as a result of the integrals involved in Eq. (2.3). To ease this problem, an approximation method called maximum a posteriori (MAP) estimation comes in handy – the integral over the posterior probability $p(\theta|\mathcal{D})$ is approximated by the single maximizer of $p(\theta|\mathcal{D})$. Because the denominator $\int \prod_{k'=1}^{n} p(x_{k'}|\theta)\,\pi(\theta)\,\mathrm{d}\theta$ in Eq. (2.2) is independent of $\theta$, the posterior probability $p(\theta|\mathcal{D})$ can be expressed as
$$p(\theta|\mathcal{D}) \propto \prod_{k=1}^{n} p(x_k|\theta)\,\pi(\theta),$$
where $\propto$ means “proportional to.” Thus, the maximizer of the posterior probability $p(\theta|\mathcal{D})$ is given by
$$\hat{\theta}_{\mathrm{MAP}} := \mathop{\mathrm{argmax}}_{\theta\in\Theta}\ \prod_{k=1}^{n} p(x_k|\theta)\,\pi(\theta).$$
[Figure 2.3 shows the space of probability densities, the parametric model $p(x|\theta)$ as a curved manifold, the true density $p^*(x)$, the ML solution $\hat{p}_{\mathrm{ML}}(x) = p(x|\hat{\theta}_{\mathrm{ML}})$ on the manifold, and the Bayes solution $\hat{p}_{\mathrm{Bayes}}(x) = \int p(x|\theta)p(\theta|\mathcal{D})\,\mathrm{d}\theta$ off the manifold.]
Figure 2.3. The Bayes solution does not necessarily belong to the original parametric model.
Then the MAP solution $\hat{p}_{\mathrm{MAP}}(x)$ is given by $\hat{p}_{\mathrm{MAP}}(x) := p(x|\hat{\theta}_{\mathrm{MAP}})$. Because the integral is approximated by a single point, MAP estimation loses the peculiarity of Bayes estimation – the MAP solution is always a member of the parametric model.
The MAP estimator may also be obtained in the “log” domain, possibly in a numerically stable manner:
$$\hat{\theta}_{\mathrm{MAP}} = \mathop{\mathrm{argmax}}_{\theta\in\Theta}\ \left[\sum_{k=1}^{n} \log p(x_k|\theta) + \log \pi(\theta)\right].$$
Because the first term is the log-likelihood, the MAP solution can be regarded as a variant of the ML method with a “penalty” induced by the prior probability $\pi(\theta)$. For this reason, MAP estimation is also called penalized ML estimation.
2 As explained in Section 2.2.2, the estimated density is not necessarily included in the model in the
case of Bayes estimation.
approximation $\hat{p}(x)$:
$$\mathrm{KL}(p^*\|\hat{p}) := \int p^*(x)\log\frac{p^*(x)}{\hat{p}(x)}\,\mathrm{d}x. \qquad (2.4)$$
The use of the KL divergence as a performance measure would be natural in the ML framework because it is consistent with the log-likelihood. That is, the KL divergence from $p^*(x)$ to $p(x;\theta)$ can be regarded as the expectation of $\log\frac{p^*(x)}{p(x;\theta)}$ over the true density $p^*(x)$. If this expectation is approximated by the empirical average over i.i.d. samples $\{x_k\}_{k=1}^{n}$, we have
$$\widehat{\mathrm{KL}}(p^*\|\hat{p}) := \frac{1}{n}\sum_{k=1}^{n}\log\frac{p^*(x_k)}{p(x_k;\theta)} = \frac{1}{n}\sum_{k=1}^{n}\log p^*(x_k) - \frac{1}{n}\sum_{k=1}^{n}\log p(x_k;\theta),$$
where the first term is the negative entropy of $p^*(x)$ and is independent of $\hat{p}(x)$.
Since we are interested in estimating the KL divergence as a function of $\hat{p}(x)$, the entropy term can be safely ignored. Thus, the model that minimizes the second term (denoted by $\mathrm{KL}'$) is now regarded as the best one:
$$\mathrm{KL}' := -\int p^*(x)\log\hat{p}(x)\,\mathrm{d}x. \qquad (2.6)$$
$\mathrm{KL}'$ may be estimated by the negative log-likelihood (2.1). However, simply using the negative log-likelihood as an approximation to $\mathrm{KL}'$ for model selection is not appropriate because the more complex the model is, the smaller the negative log-likelihood is. Thus, model selection based on the negative log-likelihood merely ends up always choosing the most complex model at hand, which is meaningless in practice.
[Figure 2.4: schematic of the AIC as a function of model complexity, composed of the negative log-likelihood, which decreases, and the dimension-of-parameters penalty, which increases with model complexity.]
$$\mathrm{AIC} := -\sum_{k=1}^{n}\log p(x_k|\hat{\theta}_{\mathrm{ML}}) + \dim\theta,$$
where $\dim\theta$ denotes the dimensionality of the parameter $\theta$. Since the first term of the AIC is the negative log-likelihood, the AIC can be regarded as an additive modification of the negative log-likelihood by the dimensionality of parameters. With this modification, the AIC avoids always choosing the most complex model (Figure 2.4). The implementation of the AIC is very simple, and thus is
practically useful. Various generalizations and extensions of the AIC have been
proposed by several researchers (e.g., Takeuchi, 1976; Shibata, 1989; Murata et al.,
1994; Konishi and Kitagawa, 1996; Ishiguro et al., 1997).
This approach is called the empirical Bayes method or type II maximum likelihood estimation. $p(\mathcal{D};\beta)$ is called the evidence or the marginal likelihood (Bishop, 2006).
The marginal likelihood can be expressed as
$$p(\mathcal{D};\beta) = \int \prod_{k=1}^{n} p(x_k|\theta)\,\pi(\theta;\beta)\,\mathrm{d}\theta,$$
3 Although the criterion multiplied by 2 is the original AIC, we omitted the factor 2 for simplicity.
which can be directly used for model selection. However, computing the marginal
likelihood is often cumbersome because of the integral. To ease this prob-
lem, the Laplace approximation is commonly used in practice (MacKay, 2003).
The Laplace approximation is based on the second-order Taylor expansion of
log p(D; β) around the MAP solution θ MAP . The Laplace approximation of the
log-marginal-likelihood is given by
$$\log p(\mathcal{D};\beta) \approx \sum_{k=1}^{n}\log p(x_k|\hat{\theta}_{\mathrm{MAP}}) + \log\pi(\hat{\theta}_{\mathrm{MAP}};\beta) + \dim\theta\,\frac{\log(2\pi)}{2} - \frac{1}{2}\log\det(-H), \qquad (2.7)$$
where $\det(\cdot)$ denotes the determinant of a matrix. $H$ is the matrix of size $\dim\theta \times \dim\theta$ with the $(\ell,\ell')$-th element given by
$$H_{\ell,\ell'} := \left.\frac{\partial}{\partial\theta_\ell}\frac{\partial}{\partial\theta_{\ell'}}\left[\sum_{k=1}^{n}\log p(x_k|\theta) + \log\pi(\theta;\beta)\right]\right|_{\theta=\hat{\theta}_{\mathrm{MAP}}},$$
where $\theta_\ell$ denotes the $\ell$-th element of $\theta$. Note that $H$ corresponds to the Hessian matrix of $\log p(\theta|\mathcal{D})$.
If higher order terms are ignored in Eq. (2.7), a much simpler criterion can be obtained. The negative of the remaining lower order terms is called the Bayesian information criterion (BIC; Schwarz, 1978):
$$\mathrm{BIC} := -\sum_{k=1}^{n}\log p(x_k|\hat{\theta}_{\mathrm{ML}}) + \dim\theta\,\frac{\log n}{2}.$$
It is noteworthy that the first term of the BIC is the negative log-likelihood, which
is the same as the AIC. Thus, the difference between the AIC and the BIC is only
the second term. However, because the framework of the AIC and the BIC is
completely different, one cannot simply conclude that one is better than the other
(Shibata, 1981).
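For illustration (a minimal sketch, not from the book), the AIC and BIC above can be computed for a fitted one-dimensional Gaussian model as follows; the factor 2 is omitted, following the convention of footnote 3:

```python
import numpy as np

def aic_bic_gaussian(x):
    """Compute AIC and BIC (without the conventional factor 2) for a 1-D Gaussian model."""
    n = len(x)
    mu = np.mean(x)
    var = np.mean((x - mu) ** 2)                            # ML estimates
    loglik = np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))
    dim_theta = 2                                           # parameters: mu and var
    aic = -loglik + dim_theta
    bic = -loglik + dim_theta * np.log(n) / 2
    return aic, bic
```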
[Figure 2.5: (a) notation for a local region $R$ with volume $V$ around a point $x'$ in the domain $\mathcal{X}$, where $P$ is the probability mass of $p^*(x)$ in $R$, and (b) the rectangle approximation of $p^*(x)$ within $R$.]
Similarly, for $m$ being the number of samples falling into the region $R$ among $\{x_k\}_{k=1}^{n}$, $P$ can be approximated as $P \approx m/n$. Together, the probability density $p^*(x')$ at a point $x'$ in the region $R$ can be approximated as $p^*(x') \approx m/(nV)$.
The approximation quality depends on the choice of the local region R (and
thus V and m). In the following, we describe two approaches to determining R:
kernel density estimation in Section 2.3.1 and nearest neighbor density estima-
tion in Section 2.3.2. Then the issue of hyperparameter selection is discussed in
Section 2.3.3.
where, for $x = (x^{(1)},\ldots,x^{(d)})^\top$, $W(x)$ is the Parzen window function defined as
$$W(x) = \begin{cases} 1 & \text{if } \max(|x^{(1)}|,\ldots,|x^{(d)}|) \le 1/2,\\ 0 & \text{otherwise.}\end{cases}$$
This method is called Parzen window estimation.
A drawback of Parzen window estimation is its discrete nature caused by the
Parzen window function. Kernel density estimation (KDE) is a “smooth” variant
of Parzen window estimation, where the Parzen window function is replaced by a
smooth kernel function such as the Gaussian kernel:
$$K(x,x') = \exp\left(-\frac{\|x-x'\|^2}{2\sigma^2}\right).$$
The KDE solution for the Gaussian kernel is given by
$$\hat{p}_{\mathrm{KDE}}(x) := \frac{1}{n(2\pi\sigma^2)^{d/2}}\sum_{k=1}^{n}K(x,x_k).$$
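A minimal NumPy sketch of this Gaussian KDE (an illustration, not the book's code; inputs are assumed to be 2-D arrays of shape (number of points, d)):

```python
import numpy as np

def kde_gaussian(x_query, x_samples, sigma):
    """KDE: p_hat(x) = 1 / (n (2 pi sigma^2)^(d/2)) * sum_k K(x, x_k).

    x_query: array of shape (m, d); x_samples: array of shape (n, d).
    """
    n, d = x_samples.shape
    sq_dists = ((x_query[:, None, :] - x_samples[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dists / (2 * sigma ** 2))      # Gaussian kernel values
    return K.sum(axis=1) / (n * (2 * np.pi * sigma ** 2) ** (d / 2))
```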
Since $\Gamma(t) = (t-1)!$ for any positive integer $t$, the gamma function can be regarded as an extension of the factorial to real numbers. The radius $\tau$ is set to the distance to the $m$-th closest sample to the center (i.e., the hypersphere has the minimum radius with $m$ samples being contained). Then NNDE is expressed as
$$\hat{p}_{\mathrm{NNDE}}(x) = \frac{m\,\Gamma\!\left(\frac{d}{2}+1\right)}{n\,\pi^{d/2}\,\tau^{d}}.$$
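A minimal sketch of this nearest-neighbor density estimator (an illustration under the same shape assumptions as above; it uses SciPy's gamma function, which is an assumed dependency):

```python
import numpy as np
from scipy.special import gamma

def nnde(x_query, x_samples, m):
    """Nearest-neighbor density estimate using the distance to the m-th nearest sample."""
    n, d = x_samples.shape
    dists = np.sqrt(((x_query[:, None, :] - x_samples[None, :, :]) ** 2).sum(axis=2))
    tau = np.sort(dists, axis=1)[:, m - 1]        # radius of the smallest ball containing m samples
    return m * gamma(d / 2 + 1) / (n * np.pi ** (d / 2) * tau ** d)
```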
2.3.3 Cross-Validation
The performance of KDE and NNDE depends on the choice of hyperparameters
such as the kernel width σ in KDE or the number m of nearest neighbors in NNDE.
For choosing the hyperparameter values appropriately, it is important to evaluate
how good a candidate the hyperparameter value is. As a performance measure, let
us again use the KL divergence defined by Eq. (2.4).
Cross-validation (CV) is a general method of estimating the second term of the KL divergence [$\mathrm{KL}'$ in Eq. (2.6)] as follows: First, the samples $\mathcal{D} = \{x_k\}_{k=1}^{n}$ are divided into $T$ disjoint subsets $\{\mathcal{D}_t\}_{t=1}^{T}$ of approximately the same size. Then a density estimator $\hat{p}_t(x)$ is obtained from $\mathcal{D}\backslash\mathcal{D}_t$ (i.e., all samples without $\mathcal{D}_t$), and its log-likelihood for the hold-out samples $\mathcal{D}_t$ is computed (see Figure 2.6):
$$\widehat{\mathrm{KL}}_t := \frac{1}{|\mathcal{D}_t|}\sum_{x\in\mathcal{D}_t}\log\hat{p}_t(x).$$
This procedure is repeated for $t = 1,\ldots,T$, and the average of the above hold-out log-likelihood over all $t$ is computed as
$$\widehat{\mathrm{KL}} := \frac{1}{T}\sum_{t=1}^{T}\widehat{\mathrm{KL}}_t.$$
Figure 2.6. Cross-validation. Hold-out estimation is carried out for all subsets, and the average hold-out log-likelihood is output as an estimator of $\mathrm{KL}'$.
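A minimal sketch of this likelihood cross-validation for choosing the KDE bandwidth, reusing the kde_gaussian helper sketched earlier (the fold construction and candidate grid are illustrative choices):

```python
import numpy as np

def cv_select_sigma(x_samples, sigma_candidates, T=5):
    """Choose the Gaussian width that maximizes the average hold-out log-likelihood."""
    n = len(x_samples)
    folds = np.array_split(np.random.permutation(n), T)
    scores = []
    for sigma in sigma_candidates:
        fold_scores = []
        for t in range(T):
            test_idx = folds[t]
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            p_hat = kde_gaussian(x_samples[test_idx], x_samples[train_idx], sigma)
            fold_scores.append(np.mean(np.log(np.maximum(p_hat, 1e-300))))
        scores.append(np.mean(fold_scores))
    return sigma_candidates[int(np.argmax(scores))]
```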
The Gaussian widths $\sigma_{\mathrm{nu}}$ and $\sigma_{\mathrm{de}}$ are chosen by 5-fold cross-validation (see Section 2.3.3). The density functions estimated by KDE, $\hat{p}_{\mathrm{nu}}(x)$ and $\hat{p}_{\mathrm{de}}(x)$, are illustrated in Figure 2.7(a), and the density-ratio function estimated by KDE, $\hat{r}(x)$, is plotted in Figure 2.7(b). The results show that KDE gives reasonably good approximations to the true density functions and thus to the true density-ratio function.
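Putting the pieces together, here is a minimal sketch of the KDE-based density-ratio estimator (the ratio-of-density-estimators approach of Section 2.1.1), reusing the hypothetical kde_gaussian helper above:

```python
import numpy as np

def kde_density_ratio(x_query, x_nu, x_de, sigma_nu, sigma_de):
    """Estimate r(x) = p_nu(x) / p_de(x) by taking the ratio of two KDEs.

    Division by the estimated denominator density may magnify estimation errors,
    which is the main weakness of this two-step approach.
    """
    p_nu_hat = kde_gaussian(x_query, x_nu, sigma_nu)
    p_de_hat = kde_gaussian(x_query, x_de, sigma_de)
    return p_nu_hat / p_de_hat
```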
[Figure 2.7: (a) true and estimated densities $p^*_{\mathrm{nu}}(x)$, $p^*_{\mathrm{de}}(x)$, $\hat{p}_{\mathrm{nu}}(x)$, $\hat{p}_{\mathrm{de}}(x)$, and (b) true and estimated density ratios $r^*(x)$ and $\hat{r}(x)$.]
2.5 Remarks
Density-ratio estimation based on density estimation would be an easy and conve-
nient approach: just estimating the numerator and denominator densities separately
from their samples and taking the ratio of the estimated densities. However, its
two-step structure may not be suitable because density estimation in the first step
is carried out without regard to the second step of taking their ratio. For example,
optimal model/hyperparameter selection in density estimation is not necessarily
the best choice for density-ratio estimation. Furthermore, the approximation error
produced in the density estimation step can be increased by taking the ratio – divi-
sion by an estimated quantity often makes an estimator unreliable. This problem
seems to be critical when the dimension of the input domain is high, because den-
sity values tend to be small in high-dimensional cases, and thus its reciprocal is
vulnerable to a small error.
To overcome these limitations of the density estimation approach, one-shot
procedures of directly estimating the density ratio without going through density
estimation would be sensible and more promising. Methods following this idea
will be described in the following chapters.
It was advocated that one should avoid solving more difficult intermediate
problems when solving a target problem (Vapnik, 1998). This statement is some-
times referred to as Vapnik’s principle, and the support vector machine (SVM;
Cortes and Vapnik, 1995; Vapnik, 1998; Schölkopf and Smola, 2002) would be
a successful example of this principle – instead of estimating a data-generation
model, SVM directly models the decision boundary, which is sufficient for pattern
recognition. Density estimation (estimating the data-generation model) is a more
general and thus difficult problem than pattern recognition (learning the decision
boundary).
If we followed Vapnik's principle, directly estimating the ratio $r^*(x)$ would be more promising than estimating the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$, because knowing the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ implies knowing the ratio $r^*(x)$, but not vice versa; the ratio $r^*(x)$ cannot be uniquely decomposed into the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ (see Figure 1.5).
In Kanamori et al. (2010), the pros and cons of the two-step density estimation
approach and one-shot direct density ratio estimation approaches were theoreti-
cally investigated in the parametric framework. In a nutshell, the theoretical results
showed the following:
• The two-step density estimation approach is more accurate than one-shot
direct density-ratio estimation approaches when correctly specified density
models (i.e., the true densities are included in the parametric models) are
available.
• The one-shot direct density-ratio estimation approach shown in Chapter 5
is more accurate than the two-step density estimation approach when
density/density-ratio models are misspecified.
These theoretical results will be explained in more detail in Chapter 13.
Since correctly specified density models may not be available in practice, the one-shot direct density-ratio estimation approaches explained in the following chapters would be more promising than the two-step density estimation approach described in this chapter.
3
Moment Matching
Note that two distributions are equivalent if and only if all moments (i.e., for
k = 1, 2, . . .) agree with each other.
The moment matching approach to density-ratio estimation tries to match the moments of $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ via a “transformation” function $r(x)$ (Qin, 1998). More specifically, using the true density ratio $r^*(x)$, $p^*_{\mathrm{nu}}(x)$ can be expressed as
$$p^*_{\mathrm{nu}}(x) = r^*(x)\,p^*_{\mathrm{de}}(x).$$
Thus, for a density-ratio model $r(x)$, matching the moments of $p^*_{\mathrm{nu}}(x)$ and $r(x)\,p^*_{\mathrm{de}}(x)$ gives a good approximation to the true density ratio $r^*(x)$. A schematic illustration of the moment-matching approach is described in Figure 3.1.
Figure 3.1. The idea of moment matching. The moments of $r(x)\,p^*_{\mathrm{de}}(x)$ are matched with those of $p^*_{\mathrm{nu}}(x)$.
density ratio values only at sample points, is considered. In Section 3.2.2, the case
of induction, that is, estimating the entire density ratio function, is considered.
where $\|\cdot\|$ denotes the Euclidean norm. Its non-linear variant can be obtained using some non-linear function $\phi(x): \mathbb{R}^d \to \mathbb{R}^t$ as
$$\mathop{\mathrm{argmin}}_{r}\ \left\|\int\phi(x)r(x)p^*_{\mathrm{de}}(x)\,\mathrm{d}x - \int\phi(x)p^*_{\mathrm{nu}}(x)\,\mathrm{d}x\right\|^2,$$
where
$$\mathrm{MM}'(r) := \left\|\int\phi(x)r(x)p^*_{\mathrm{de}}(x)\,\mathrm{d}x - \int\phi(x)p^*_{\mathrm{nu}}(x)\,\mathrm{d}x\right\|^2.$$
“MM” stands for “moment matching.” Let us ignore the irrelevant constant in $\mathrm{MM}'(r)$ and define the rest as $\mathrm{MM}(r)$:
$$\mathrm{MM}(r) := \left\|\int\phi(x)r(x)p^*_{\mathrm{de}}(x)\,\mathrm{d}x\right\|^2 - 2\left\langle\int\phi(x)r(x)p^*_{\mathrm{de}}(x)\,\mathrm{d}x,\ \int\phi(x)p^*_{\mathrm{nu}}(x)\,\mathrm{d}x\right\rangle.$$
In practice, the expectations over $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ in $\mathrm{MM}(r)$ are replaced by sample averages. That is, for an $n_{\mathrm{de}}$-dimensional vector
$$\boldsymbol{r}^*_{\mathrm{de}} := (r^*(x^{\mathrm{de}}_1),\ldots,r^*(x^{\mathrm{de}}_{n_{\mathrm{de}}}))^\top,$$
the estimator is obtained as
$$\hat{\boldsymbol{r}}_{\mathrm{de}} := \mathop{\mathrm{argmin}}_{\boldsymbol{r}\in\mathbb{R}^{n_{\mathrm{de}}}}\ \widehat{\mathrm{MM}}(\boldsymbol{r}), \qquad (3.1)$$
where
$$\widehat{\mathrm{MM}}(\boldsymbol{r}) := \frac{1}{n_{\mathrm{de}}^2}\,\boldsymbol{r}^\top\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{de}}\boldsymbol{r} - \frac{2}{n_{\mathrm{de}}n_{\mathrm{nu}}}\,\boldsymbol{r}^\top\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{nu}}\mathbf{1}_{n_{\mathrm{nu}}}. \qquad (3.2)$$
Here, $\mathbf{1}_n$ denotes the $n$-dimensional vector with all ones, and $\Phi_{\mathrm{nu}}$ and $\Phi_{\mathrm{de}}$ are the $t \times n_{\mathrm{nu}}$ and $t \times n_{\mathrm{de}}$ design matrices defined by
$$\Phi_{\mathrm{nu}} := (\phi(x^{\mathrm{nu}}_1),\ldots,\phi(x^{\mathrm{nu}}_{n_{\mathrm{nu}}})) \quad\text{and}\quad \Phi_{\mathrm{de}} := (\phi(x^{\mathrm{de}}_1),\ldots,\phi(x^{\mathrm{de}}_{n_{\mathrm{de}}})),
respectively. Taking the derivative of the objective function (3.2) with respect to $\boldsymbol{r}$ and setting it to zero, we have
$$\frac{2}{n_{\mathrm{de}}^2}\,\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{de}}\boldsymbol{r} - \frac{2}{n_{\mathrm{de}}n_{\mathrm{nu}}}\,\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{nu}}\mathbf{1}_{n_{\mathrm{nu}}} = \mathbf{0}_{n_{\mathrm{de}}},$$
where $\mathbf{0}_{n_{\mathrm{de}}}$ denotes the $n_{\mathrm{de}}$-dimensional vector with all zeros. Solving this equation with respect to $\boldsymbol{r}$, one can obtain the solution analytically as
$$\hat{\boldsymbol{r}}_{\mathrm{de}} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}}\,(\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{de}})^{-1}\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{nu}}\mathbf{1}_{n_{\mathrm{nu}}}.$$
One may add a normalization constraint $\frac{1}{n_{\mathrm{de}}}\mathbf{1}_{n_{\mathrm{de}}}^\top\boldsymbol{r} = 1$ to the optimization problem (3.1). Then the optimization problem becomes a convex linearly constrained quadratic program. Because there is no known method for obtaining the analytic-form solution for general convex linearly constrained quadratic programs, a numerical solver may be needed to compute the solution. Furthermore, a non-negativity constraint $\boldsymbol{r} \ge \mathbf{0}_{n_{\mathrm{de}}}$ and/or an upper bound for a positive constant $B$, that is, $\boldsymbol{r} \le B\mathbf{1}_{n_{\mathrm{de}}}$, may also be incorporated in the optimization problem (3.1), where inequalities for vectors are applied in the element-wise manner. Even with these modifications, the optimization problem is still a convex, linearly constrained quadratic program. Therefore, its solution can be computed numerically by standard optimization software.
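A minimal NumPy sketch of the unconstrained fixed-design estimator above (the analytic-form solution); the feature map and the small ridge term added for numerical stability are illustrative choices, not the book's:

```python
import numpy as np

def finite_order_mm(x_nu, x_de, phi, eps=1e-8):
    """Fixed-design finite-order moment matching.

    Returns density-ratio estimates at the denominator points:
        r_de = (n_de / n_nu) (Phi_de^T Phi_de)^{-1} Phi_de^T Phi_nu 1,
    where phi maps an array of shape (n, d) to features of shape (n, t).
    """
    Phi_nu = phi(x_nu)                                # rows are phi(x_i^nu)
    Phi_de = phi(x_de)                                # rows are phi(x_j^de)
    n_nu, n_de = len(x_nu), len(x_de)
    A = Phi_de @ Phi_de.T + eps * np.eye(n_de)        # Phi_de^T Phi_de in the book's (t x n) convention
    b = Phi_de @ Phi_nu.T @ np.ones(n_nu)
    return (n_de / n_nu) * np.linalg.solve(A, b)
```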
3.2.2 Induction
The fixed-design method described in the previous section gives estimates of the density-ratio values only at the denominator sample points $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$. Here we describe a finite-order moment matching method under the induction setup, where the entire density-ratio function $r^*(x)$ is estimated.
We use the following linear density-ratio model for density-ratio function learning:
$$r(x) = \sum_{\ell=1}^{b}\theta_\ell\,\psi_\ell(x) = \psi(x)^\top\theta,$$
so that the ratio values at the denominator points can be written as
$$(r(x^{\mathrm{de}}_1),\ldots,r(x^{\mathrm{de}}_{n_{\mathrm{de}}}))^\top = \Psi_{\mathrm{de}}^\top\theta, \quad \Psi_{\mathrm{de}} := (\psi(x^{\mathrm{de}}_1),\ldots,\psi(x^{\mathrm{de}}_{n_{\mathrm{de}}})). \qquad (3.3)$$
Taking the derivative of the above objective function with respect to $\theta$ and setting it to zero, we have the solution $\hat\theta$ analytically as
$$\hat\theta = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}}\left(\Psi_{\mathrm{de}}\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{de}}\Psi_{\mathrm{de}}^\top\right)^{-1}\Psi_{\mathrm{de}}\Phi_{\mathrm{de}}^\top\Phi_{\mathrm{nu}}\mathbf{1}_{n_{\mathrm{nu}}}.$$
One may include a normalization constraint, a non-negativity constraint (given that the basis functions are non-negative), and a regularization constraint in the optimization problem (3.4):
$$\frac{1}{n_{\mathrm{de}}}\mathbf{1}_{n_{\mathrm{de}}}^\top\Psi_{\mathrm{de}}^\top\theta = 1, \quad \theta \ge \mathbf{0}_b, \quad\text{and}\quad \theta \le B\mathbf{1}_b.$$
Then the optimization problem becomes a convex, linearly constrained quadratic program whose solution can be obtained by a standard numerical solver.
The upper-bound parameter $B$, which works as a regularizer, may be optimized by cross-validation (CV). That is, the numerator and denominator samples $\mathcal{D}^{\mathrm{nu}} = \{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\mathcal{D}^{\mathrm{de}} = \{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$ are first divided into $T$ disjoint subsets $\{\mathcal{D}^{\mathrm{nu}}_t\}_{t=1}^{T}$ and $\{\mathcal{D}^{\mathrm{de}}_t\}_{t=1}^{T}$, respectively. Then a density-ratio estimator $\hat{r}_t(x)$ is obtained from $\mathcal{D}^{\mathrm{nu}}\backslash\mathcal{D}^{\mathrm{nu}}_t$ and $\mathcal{D}^{\mathrm{de}}\backslash\mathcal{D}^{\mathrm{de}}_t$ (i.e., all samples without $\mathcal{D}^{\mathrm{nu}}_t$ and $\mathcal{D}^{\mathrm{de}}_t$), and its moment matching error is computed for the hold-out samples $\mathcal{D}^{\mathrm{nu}}_t$ and $\mathcal{D}^{\mathrm{de}}_t$:
$$\widehat{\mathrm{MM}}_t(\hat{r}) := \left\|\frac{1}{|\mathcal{D}^{\mathrm{de}}_t|}\sum_{x^{\mathrm{de}}\in\mathcal{D}^{\mathrm{de}}_t}\phi(x^{\mathrm{de}})\hat{r}_t(x^{\mathrm{de}})\right\|^2 - \frac{2}{|\mathcal{D}^{\mathrm{de}}_t||\mathcal{D}^{\mathrm{nu}}_t|}\left\langle\sum_{x^{\mathrm{de}}\in\mathcal{D}^{\mathrm{de}}_t}\phi(x^{\mathrm{de}})\hat{r}_t(x^{\mathrm{de}}),\ \sum_{x^{\mathrm{nu}}\in\mathcal{D}^{\mathrm{nu}}_t}\phi(x^{\mathrm{nu}})\right\rangle,$$
where $|\mathcal{D}|$ denotes the number of elements in the set $\mathcal{D}$. This procedure is repeated for $t = 1,\ldots,T$, and the average of the hold-out moment matching error over all $t$ is computed as
$$\widehat{\mathrm{MM}} := \frac{1}{T}\sum_{t=1}^{T}\widehat{\mathrm{MM}}_t.$$
respectively. In the same way as the finite-order case, the solution can be obtained analytically as
$$\hat{\boldsymbol{r}}_{\mathrm{de}} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}}\,K_{\mathrm{de,de}}^{-1}\,K_{\mathrm{de,nu}}\mathbf{1}_{n_{\mathrm{nu}}}.$$
If necessary, one may include a non-negativity constraint, a normalization constraint, and an upper bound in the same way as the finite-order case. Then the solution can be obtained numerically by solving a convex linearly constrained quadratic programming problem.
3.3.2 Induction
For a linear density-ratio model
[Figure 3.2: (a) the true densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$, and (b) the true density ratio $r^*(x)$ and the KMM estimates $\hat{r}(x^{\mathrm{de}})$ at the denominator points.]
Let $n_{\mathrm{nu}} = n_{\mathrm{de}} = 200$. For density-ratio estimation, we use the fixed-design KMM algorithm with the Gaussian kernel:
$$K(x,x') = \exp\left(-\frac{(x-x')^2}{2\sigma^2}\right).$$
The Gaussian kernel width $\sigma$ is set to the median distance between all samples, which is a popular heuristic in kernel methods (Schölkopf and Smola, 2002). The density-ratio function estimated by KMM, $\hat{r}(x)$, is described in Figure 3.2(b), showing that KMM gives a reasonably good approximation to the true density-ratio function $r^*(x)$.
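A minimal NumPy sketch of the fixed-design KMM solution above, with the median-distance heuristic for the Gaussian width (an illustration; the ridge term added for numerical stability is not part of the book's formula):

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kmm_fixed_design(x_nu, x_de, ridge=1e-6):
    """Density-ratio estimates at denominator points: (n_de/n_nu) K_dede^{-1} K_denu 1."""
    X = np.vstack([x_nu, x_de])
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    sigma = np.median(dists[dists > 0])                   # median-distance heuristic
    K_dede = gaussian_kernel_matrix(x_de, x_de, sigma)
    K_denu = gaussian_kernel_matrix(x_de, x_nu, sigma)
    n_nu, n_de = len(x_nu), len(x_de)
    rhs = K_denu @ np.ones(n_nu)
    return (n_de / n_nu) * np.linalg.solve(K_dede + ridge * np.eye(n_de), rhs)
```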
3.5 Remarks
Density-ratio estimation by moment matching can successfully avoid density
estimation.
The finite-order moment matching method (Section 3.2) is simple and compu-
tationally efficient, if the number of matching moments is kept reasonably small.
However, the finite-order approach is not necessarily consistent, that is, in the limit
of a large sample size, the solution does not necessarily converge to the true density
ratio. On the other hand, the infinite-order moment matching method (Section 3.3),
kernel mean matching (KMM), can efficiently match all the moments by making
use of universal reproducing kernels. Indeed, KMM has the excellent theoretical
property that it is consistent (Huang et al., 2007; Gretton et al., 2009). However,
KMM has a limitation in model selection – there is no known method for determin-
ing the kernel parameter (i.e., the Gaussian kernel width in the case of Gaussian
kernels). A popular heuristic of setting the Gaussian width to the median distance
between samples (Schölkopf and Smola, 2002) would be useful in some cases, but
this may not always be reasonable.
In this chapter moment matching was performed in terms of the squared norm,
which led to an analytic-form solution (if no constraint is imposed). As shown in
Kanamori et al. (2011b), moment matching can be generalized systematically to
various divergences. This will be explained in detail in Chapter 16.
4
Probabilistic Classification
Figure 4.1. Probabilistic classification of samples drawn from $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$.
The prior ratio $p^*(y=-1)/p^*(y=+1)$ may be approximated simply by the ratio of the sample sizes:
$$\frac{p^*(y=-1)}{p^*(y=+1)} \approx \frac{n_{\mathrm{de}}/(n_{\mathrm{nu}}+n_{\mathrm{de}})}{n_{\mathrm{nu}}/(n_{\mathrm{nu}}+n_{\mathrm{de}})} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}}.$$
The “class”-posterior probability $p^*(y|x)$ may be approximated by separating $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$ using a probabilistic classifier. Thus, given an estimator of the class-posterior probability, $\hat{p}(y|x)$, a density-ratio estimator $\hat{r}(x)$ can be constructed as
$$\hat{r}(x) = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}}\,\frac{\hat{p}(y=+1|x)}{\hat{p}(y=-1|x)}. \qquad (4.1)$$
A practical advantage of the probabilistic classification approach is its easy
implementability. Indeed, one can directly use standard probabilistic classification
algorithms for density-ratio estimation. In the following, we describe represen-
tative classification algorithms including logistic regression, the least-squares
probabilistic classifier, and support vector machines. For notational brevity, let
$n := n_{\mathrm{nu}} + n_{\mathrm{de}}$, and we consider a set of paired samples $\{(x_k, y_k)\}_{k=1}^{n}$, where
$$(x_1,\ldots,x_n) := (x^{\mathrm{nu}}_1,\ldots,x^{\mathrm{nu}}_{n_{\mathrm{nu}}},x^{\mathrm{de}}_1,\ldots,x^{\mathrm{de}}_{n_{\mathrm{de}}}),$$
where $\lambda\theta^\top\theta$ is a penalty term included for regularization purposes. In this optimization problem, $\psi(x_k)^\top\theta$ can be regarded as an estimator of $y_k$:
$$\hat{y}_k = \psi(x_k)^\top\theta.$$
Thus, in logistic regression, the log-loss is employed for measuring the loss of estimating $y_k$ by $\hat{y}_k$:
$$\mathrm{loss}(y_k,\hat{y}_k) := \log\left(1 + \exp(-y_k\hat{y}_k)\right),$$
where $y_k\hat{y}_k$ is called the margin for the sample $x_k$ (Schapire et al., 1998). A profile of the log-loss function is illustrated in Figure 4.2.
Because the objective function in Eq. (4.2) is convex, the global optimal solution can be obtained by a standard non-linear optimization technique such as the gradient descent method or (quasi-)Newton methods (Hastie et al., 2001; Minka, 2007). A logistic regression model classifies a new input sample $x$ by choosing the most probable class as
$$\hat{y} = \mathop{\mathrm{argmax}}_{y=\pm 1}\ p(y|x;\hat\theta).$$
[Figure 4.2: profiles of the 0/1-loss, the logistic loss, and the hinge loss as functions of the margin $y\hat{y}$.]
parameter vector. The class label y takes a value in {1, . . . , c}, where c denotes the
number of classes.
The basic idea of the LSPC is to express the class-posterior probability $p^*(y|x)$ in terms of the equivalent density-ratio expression:
$$p^*(y|x) = \frac{p^*(x,y)}{p^*(x)}. \qquad (4.3)$$
Then the density-ratio estimation method called unconstrained least-squares importance fitting (uLSIF; Kanamori et al., 2009; see also Section 6.2.2) is used for estimating this density ratio.
As explained in Section 12.2, the solution of the LSPC can be obtained by solving the following system of linear equations:
$$(\hat{H} + \lambda I_b)\,\hat\theta = \hat{h},$$
where $\lambda\ (\ge 0)$ is the regularization parameter and $I_b$ is the $b$-dimensional identity matrix.
To assure that the LSPC produces a probability, the outputs are normalized
and negative outputs are rounded up to zero (Yamada et al., 2011a); thus, the final
LSPC solution is given by
[Figure 4.3: SVM separation of samples drawn from $p^*_{\mathrm{de}}(x)$ and $p^*_{\mathrm{nu}}(x)$ by the hyperplane $w^\top x - w_0 = 0$, with margin $1/\|w\|$.]
Specifically, the soft-margin SVM learns the hyperplane parameters by solving
$$\min_{w\in\mathbb{R}^d,\,w_0\in\mathbb{R},\,\{\xi_k\}_{k=1}^{n}}\ \sum_{k=1}^{n}\xi_k + \lambda w^\top w \quad\text{s.t.}\quad y_k(w^\top x_k - w_0) \ge 1-\xi_k \ \text{ and }\ \xi_k\ge 0 \ \text{ for } k=1,\ldots,n,$$
whose Lagrange dual is given by
$$\max_{\{\theta_k\}_{k=1}^{n}}\ \sum_{k=1}^{n}\theta_k - \frac{1}{2}\sum_{k,k'=1}^{n}\theta_k\theta_{k'}\,y_ky_{k'}\,x_k^\top x_{k'} \quad\text{s.t.}\quad 0\le\theta_k\le\frac{1}{\lambda}\ \text{ for } k=1,\ldots,n,\ \text{ and }\ \sum_{k=1}^{n}\theta_ky_k=0,$$
where $\{\theta_k\}_{k=1}^{n}$ are Lagrange multipliers. This formulation allows one to obtain a non-linear method by replacing the inner product $x^\top x'$ with a reproducing kernel function $K(x,x')$:
$$\max_{\theta\in\mathbb{R}^{n}}\ \sum_{k=1}^{n}\theta_k - \frac{1}{2}\sum_{k,k'=1}^{n}\theta_k\theta_{k'}\,y_ky_{k'}\,K(x_k,x_{k'}) \quad\text{s.t.}\quad 0\le\theta_k\le\frac{1}{\lambda}\ \text{ for } k=1,\ldots,n,\ \text{ and }\ \sum_{k=1}^{n}\theta_ky_k=0.$$
The SVM was shown to converge to the Bayes optimal classifier as the number
of training samples tends to infinity (Lin, 2002). However, the SVM does not
give probabilistic outputs and thus it cannot be directly employed for density-
ratio estimation. Platt (2000) and Wu et al. (2004) proposed heuristic methods
for computing the class-posterior probability from the SVM solution, by which
density ratios may be approximated.
$$\widehat{\mathrm{ME}}_t := \frac{1}{|\mathcal{D}^{\mathrm{nu}}_t|}\sum_{x^{\mathrm{nu}}\in\mathcal{D}^{\mathrm{nu}}_t} I\!\left(\mathop{\mathrm{argmax}}_{y=\pm 1}\ \hat{p}_{\mathcal{D}^{\mathrm{nu}}_t,\mathcal{D}^{\mathrm{de}}_t}(y|x^{\mathrm{nu}}) \ne +1\right) + \frac{1}{|\mathcal{D}^{\mathrm{de}}_t|}\sum_{x^{\mathrm{de}}\in\mathcal{D}^{\mathrm{de}}_t} I\!\left(\mathop{\mathrm{argmax}}_{y=\pm 1}\ \hat{p}_{\mathcal{D}^{\mathrm{nu}}_t,\mathcal{D}^{\mathrm{de}}_t}(y|x^{\mathrm{de}}) \ne -1\right).$$
This procedure is repeated for $t = 1,\ldots,T$, and the average hold-out misclassification error is computed as
$$\widehat{\mathrm{ME}} := \frac{1}{T}\sum_{t=1}^{T}\widehat{\mathrm{ME}}_t.$$
Then the model that minimizes $\widehat{\mathrm{ME}}$ is chosen.
1 For probabilistic classifiers, a hold-out likelihood may also be used as the error metric in cross-
validation.
where N(x; µ, σ 2 ) denotes the Gaussian density with mean µ and variance σ 2 .
The true densities are plotted in Figure 4.4(a), while the true density ratio r ∗ (x) is
plotted in Figure 4.4(c).
Let nnu = nde = 200, n = nnu + nde = 400, and
For density-ratio estimation, we use logistic regression with the Gaussian kernel:
$$p(y|x) = \frac{1}{1 + \exp\left(-y\sum_{\ell=1}^{n}\theta_\ell\exp\left(-\frac{(x-x_\ell)^2}{2\sigma^2}\right)\right)}.$$
The Gaussian width σ and the regularization parameter λ are chosen by 5-fold
cross-validation (see Section 4.5). The true class-posterior probabilities and their
estimates obtained by logistic regression are described in Figure 4.4(b), and
the density-ratio function estimated by logistic regression, r(x), is described in
Figure 4.4(c). The results show that reasonably good approximations were obtained
by the logistic regression approach.
4.7 Remarks
Density-ratio estimation by probabilistic classification can successfully avoid den-
sity estimation by casting the problem of density ratio estimation as the problem
of learning the class-posterior probability. The availability of cross-validation for
model selection is an advantage of the probabilistic classification approach over the
infinite-order moment matching approach described in Section 3.3. Furthermore,
the probabilistic classification approach is convenient practically because existing
software packages of probabilistic classifiers can be used directly for density-ratio
estimation.
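As a concrete illustration of this approach (a minimal sketch using scikit-learn, assumed to be installed; not the book's implementation), one can train a classifier to separate numerator and denominator samples and plug its class-posterior estimates into Eq. (4.1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probabilistic_classification_ratio(x_nu, x_de, x_query):
    """Estimate r(x) = (n_de/n_nu) * p_hat(y=+1|x) / p_hat(y=-1|x), cf. Eq. (4.1)."""
    X = np.vstack([x_nu, x_de])
    y = np.concatenate([np.ones(len(x_nu)), -np.ones(len(x_de))])   # +1: numerator, -1: denominator
    clf = LogisticRegression().fit(X, y)
    proba = clf.predict_proba(x_query)                 # columns ordered as clf.classes_
    p_pos = proba[:, list(clf.classes_).index(1.0)]
    p_neg = 1.0 - p_pos
    return (len(x_de) / len(x_nu)) * p_pos / np.clip(p_neg, 1e-12, None)
```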
The probabilistic classification approach with logistic regression actually has a
superior theoretical property: If the logistic regression model is correctly specified,
the logistic regression approach gives the optimal estimator among a class of semi-
parametric estimators in the sense that the asymptotic variance is minimal (Qin,
1998). However, when the model is misspecified (which would be the case in practice), this strong theoretical property no longer holds, and the density-fitting approach explained in Chapter 5 is preferable (Kanamori et al., 2010). Details of these theoretical analyses will be explained in Chapter 13.
[Figure 4.4: (a) the true densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$; (b) the true and estimated class-posterior probabilities $p^*(\mathrm{nu}|x)$, $p^*(\mathrm{de}|x)$, $\hat{p}(\mathrm{nu}|x)$, and $\hat{p}(\mathrm{de}|x)$; (c) the true and estimated density ratios $r^*(x)$ and $\hat{r}(x)$.]
Then the numerator density $p^*_{\mathrm{nu}}(x)$ may be modeled by
$$p_{\mathrm{nu}}(x) = r(x)\,p^*_{\mathrm{de}}(x).$$
Now let us consider the KL divergence from $p^*_{\mathrm{nu}}(x)$ to $p_{\mathrm{nu}}(x)$:
$$\mathrm{KL}(p^*_{\mathrm{nu}}\|p_{\mathrm{nu}}) := \int p^*_{\mathrm{nu}}(x)\log\frac{p^*_{\mathrm{nu}}(x)}{p_{\mathrm{nu}}(x)}\,\mathrm{d}x = C - \mathrm{KL}'(r),$$
where $C := \int p^*_{\mathrm{nu}}(x)\log\frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)}\,\mathrm{d}x$ is a constant irrelevant to $r$ and $\mathrm{KL}'(r)$ is the relevant part:
$$\mathrm{KL}'(r) := \int p^*_{\mathrm{nu}}(x)\log r(x)\,\mathrm{d}x.$$
An empirical approximation $\widehat{\mathrm{KL}}'(r)$ of $\mathrm{KL}'(r)$ is given by
$$\widehat{\mathrm{KL}}'(r) := \frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}}\log r(x^{\mathrm{nu}}_i).$$
We assume that the basis functions are non-negative. Then the KLIEP optimization problem for the linear model is expressed as follows (Sugiyama et al., 2008):
$$\max_{\theta\in\mathbb{R}^b}\ \frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}}\log\left(\psi(x^{\mathrm{nu}}_i)^\top\theta\right) \quad\text{s.t.}\quad \overline{\psi}_{\mathrm{de}}^{\,\top}\theta = 1 \ \text{ and }\ \theta\ge\mathbf{0}_b,$$
where $\overline{\psi}_{\mathrm{de}} := \frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}}\psi(x^{\mathrm{de}}_j)$.
Because this optimization problem is convex (i.e., the objective function to
be maximized is concave and the feasible set is convex), there exists the unique
global optimum solution. Furthermore, the KLIEP solution tends to be sparse, that
is, many parameters take exactly zero. Such sparsity would contribute to reducing
the computation time when computing the estimated density-ratio values.
A pseudo code of the KLIEP for linear models is described in Figure 5.1. As can be confirmed from the pseudo code, the denominator samples $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$ appear only in terms of the basis-transformed mean $\overline{\psi}_{\mathrm{de}}$. Thus, the KLIEP is computationally efficient even when the number $n_{\mathrm{de}}$ of denominator samples is very large.
The performance of the KLIEP depends on the choice of the basis functions $\psi(x)$. As explained below, the use of the following Gaussian kernel model is reasonable:
$$r(x) = \sum_{\ell=1}^{n_{\mathrm{nu}}}\theta_\ell\,K(x,x^{\mathrm{nu}}_\ell), \qquad (5.2)$$
Input: Data samples $\mathcal{D}^{\mathrm{nu}} = \{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\mathcal{D}^{\mathrm{de}} = \{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$, and basis functions $\psi(x)$
Output: Density-ratio estimator $\hat{r}(x)$

$\Psi_{\mathrm{nu}} \leftarrow (\psi(x^{\mathrm{nu}}_1),\ldots,\psi(x^{\mathrm{nu}}_{n_{\mathrm{nu}}}))^\top$;
$\overline{\psi}_{\mathrm{de}} \leftarrow \frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}}\psi(x^{\mathrm{de}}_j)$;
Initialize $\theta\ (>\mathbf{0}_b)$ and $\varepsilon\ (0<\varepsilon\ll 1)$;
Repeat until convergence
  $\theta \leftarrow \theta + \varepsilon\,\Psi_{\mathrm{nu}}^\top(\mathbf{1}_{n_{\mathrm{nu}}}\ ./\ \Psi_{\mathrm{nu}}\theta)$; % Gradient ascent
  $\theta \leftarrow \theta + (1-\overline{\psi}_{\mathrm{de}}^{\,\top}\theta)\,\overline{\psi}_{\mathrm{de}}/(\overline{\psi}_{\mathrm{de}}^{\,\top}\overline{\psi}_{\mathrm{de}})$; % Constraint satisfaction
  $\theta \leftarrow \max(\mathbf{0}_b,\theta)$; % Constraint satisfaction
  $\theta \leftarrow \theta/(\overline{\psi}_{\mathrm{de}}^{\,\top}\theta)$; % Constraint satisfaction
end
$\hat{r}(x) \leftarrow \psi(x)^\top\theta$;

Figure 5.1. Pseudo code of the KLIEP. “./” indicates the element-wise division and “$\top$” denotes the transpose. Inequalities and the “max” operation for vectors are applied in the element-wise manner.
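A minimal NumPy sketch of this KLIEP procedure for the Gaussian-kernel model (5.2) (an illustration under the update rules above; the step size, iteration count, and fixed number of iterations in place of a convergence check are ad hoc choices, not the book's):

```python
import numpy as np

def kliep(x_nu, x_de, sigma, eps=1e-3, n_iter=2000):
    """KLIEP with Gaussian kernels centered at the numerator samples (2-D input arrays)."""
    def design(X, centers):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    Psi_nu = design(x_nu, x_nu)                        # (n_nu, b), b = n_nu
    psi_de_bar = design(x_de, x_nu).mean(axis=0)       # basis-transformed denominator mean
    theta = np.ones(Psi_nu.shape[1])
    for _ in range(n_iter):
        theta = theta + eps * Psi_nu.T @ (1.0 / (Psi_nu @ theta))   # gradient ascent
        theta = theta + (1 - psi_de_bar @ theta) * psi_de_bar / (psi_de_bar @ psi_de_bar)
        theta = np.maximum(0.0, theta)                               # non-negativity
        theta = theta / (psi_de_bar @ theta)                         # normalization
    return lambda x: design(np.atleast_2d(x), x_nu) @ theta          # r_hat(x)
```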
density ratio
$$r^*(x) = \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)}$$
tends to take large values if $p^*_{\mathrm{nu}}(x)$ is large and $p^*_{\mathrm{de}}(x)$ is small. Conversely, $r^*(x)$ tends to be small (i.e., close to zero) if $p^*_{\mathrm{nu}}(x)$ is small and $p^*_{\mathrm{de}}(x)$ is large. When a non-negative function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large. On the other hand, only a small number of kernels would be enough in the region where the output of the target function is close to zero (see Figure 5.2). Following this heuristic, many kernels are allocated in the region where $p^*_{\mathrm{nu}}(x)$ takes large values, which can be achieved by setting the Gaussian centers to $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$.
nnu
Alternatively, we may locate (nnu + nde ) Gaussian kernels at both {x nu i }i=1 and
n
{x de
j }j =1 . However, this seems not to further improve the accuracy but slightly
de
increases the computational cost. When nnu is very large, using all the numerator
nnu
samples {x nu i }i=1 as Gaussian centers only is already computationally expensive.
nnu
To ease this problem, a subset of {x nu i }i=1 may be chosen in practice as Gaussian
centers for computational efficiency, that is,
r(x) = θ K(x, c ),
=1
nnu
where {c }b =1 are template points randomly chosen from {x nu i }i=1 without
replacement, and b (∈ {1, . . . , nnu }) is a prefixed number.
A MATLAB® implementation of the entire KLIEP algorithm (includ-
ing model selection by cross-validation; see Section 5.3) is available from
http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/KLIEP/.
The KLIEP methods for linear/kernel models are referred to as linear KLIEP
(L-KLIEP) and kernel KLIEP (K-KLIEP), respectively.
By definition, outputs of the log-linear model $r(x;\theta)$ are non-negative for all $x$. Thus, we do not need the non-negativity constraint on the parameter. Then the KLIEP optimization criterion is expressed as
$$\max_{\theta\in\mathbb{R}^b}\ J(\theta),$$
where
$$J(\theta) := \overline{\psi}_{\mathrm{nu}}^{\,\top}\theta - \log\left(\frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}}\exp\left(\psi(x^{\mathrm{de}}_j)^\top\theta\right)\right), \qquad \overline{\psi}_{\mathrm{nu}} := \frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}}\psi(x^{\mathrm{nu}}_i).$$
This is an unconstrained convex optimization problem, and thus the global optimal solution can be obtained by, for example, the gradient method or the (quasi-)Newton method. The gradient vector $\nabla J(\theta)$ and the Hessian matrix $\nabla\nabla J(\theta)$ of the objective function $J$ are given by
$$\nabla J(\theta) = \overline{\psi}_{\mathrm{nu}} - \zeta(\theta),$$
$$\nabla\nabla J(\theta) = -\frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}}\psi(x^{\mathrm{de}}_j)\psi(x^{\mathrm{de}}_j)^\top r(x^{\mathrm{de}}_j;\theta) + \zeta(\theta)\zeta(\theta)^\top,$$
where $\zeta(\theta) := \frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}}\psi(x^{\mathrm{de}}_j)\,r(x^{\mathrm{de}}_j;\theta)$. Because the numerator samples $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ appear only in terms of the basis-transformed mean $\overline{\psi}_{\mathrm{nu}}$, the KLIEP for log-linear models is computationally efficient even when the number $n_{\mathrm{nu}}$ of numerator samples is large (cf. the KLIEP for linear/kernel models is computationally efficient when $n_{\mathrm{de}}$ is large; see Section 5.2.1).
The KLIEP method for log-linear models is called the log-linear KLIEP (LL-KLIEP).
where $c$ is the number of mixing components, $\{\theta_k\}_{k=1}^{c}$ are mixing coefficients, $\{\mu_k\}_{k=1}^{c}$ are means of Gaussian functions, $\{\Sigma_k\}_{k=1}^{c}$ are covariance matrices of Gaussian functions, and $N(x;\mu,\Sigma)$ denotes the multi-dimensional Gaussian density with mean $\mu$ and covariance matrix $\Sigma$:
$$N(x;\mu,\Sigma) := \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right), \qquad (5.4)$$
where $\det(\cdot)$ denotes the determinant of a matrix. Note that $\Sigma$ should be positive definite, that is, all the eigenvalues of $\Sigma$ should be strictly positive.
For the Gaussian mixture model (5.3), the KLIEP optimization problem is expressed as
$$\max_{\{\theta_k,\mu_k,\Sigma_k\}_{k=1}^{c}}\ \frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}}\log\left(\sum_{k=1}^{c}\theta_k N(x^{\mathrm{nu}}_i;\mu_k,\Sigma_k)\right)$$
$$\text{s.t.}\quad \frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}}\sum_{k=1}^{c}\theta_k N(x^{\mathrm{de}}_j;\mu_k,\Sigma_k) = 1, \qquad \theta_k \ge 0 \ \text{ and }\ \Sigma_k \succ O \ \text{ for } k=1,\ldots,c,$$
Cross-validation (see Section 5.3) may be used for determining the number of mixing components $c$ and the regularization parameter $\lambda$ (see the pseudo code in Figure 5.3).
The KLIEP method for Gaussian mixture models is called the Gaussian-mixture KLIEP (GM-KLIEP).
[Figure 5.3: pseudo code of the GM-KLIEP. Given samples $\mathcal{D}^{\mathrm{nu}} = \{x^{\mathrm{nu}}_i\ (\in\mathbb{R}^d)\}_{i=1}^{n_{\mathrm{nu}}}$ and $\mathcal{D}^{\mathrm{de}} = \{x^{\mathrm{de}}_j\ (\in\mathbb{R}^d)\}_{j=1}^{n_{\mathrm{de}}}$, the mixing coefficients $\theta_k$, means $\mu_k$, and covariance matrices $\Sigma_k$ are updated by fixed-point iteration using per-sample weights $\gamma_{k,i}$ and $\beta_{k,j}$, with a regularization term $\lambda I_d$ added to each covariance update, until the objective $\frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}}\log\sum_{k=1}^{c}\theta_k N(x^{\mathrm{nu}}_i;\mu_k,\Sigma_k)$ converges.]
where c is the number of mixing components and {θk }ck=1 are mixing coefficients.
$N(x;\mu,\sigma^2,W)$ is a PPCA model defined by
$$N(x;\mu,\sigma^2,W) = \frac{1}{(2\pi\sigma^2)^{d/2}\det(C)^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^\top C^{-1}(x-\mu)\right),$$
where $\det(\cdot)$ denotes the determinant of a matrix, $\mu$ is the mean of the Gaussian function, $\sigma^2$ is the variance of the Gaussian function, $W$ is a $d\times m$ ‘projection’ matrix onto an $m$-dimensional latent space (where $m \le d$), and $C = WW^\top + \sigma^2 I_d$.
Then the KLIEP optimization criterion is expressed as
$$\max_{\{\theta_k,\mu_k,\sigma_k^2,W_k\}_{k=1}^{c}}\ \frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}}\log\left(\sum_{k=1}^{c}\theta_k N(x^{\mathrm{nu}}_i;\mu_k,\sigma_k^2,W_k)\right)$$
$$\text{s.t.}\quad \frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}}\sum_{k=1}^{c}\theta_k N(x^{\mathrm{de}}_j;\mu_k,\sigma_k^2,W_k) = 1, \qquad \theta_k \ge 0 \ \text{ for } k=1,\ldots,c.$$
This optimization is non-convex, and thus we may use the fixed-point iteration to
obtain a local optimal solution in the same way as the GM-KLIEP (Yamada et al.,
2010a).
When the dimensionality of the latent space m is equal to the entire dimensional-
ity d, PPCA models are reduced to ordinary Gaussian models. Thus, PPCA models
can be regarded as an extension of Gaussian models to (locally) rank-deficient
data. The KLIEP method for PPCA mixture models is called the PPCA-mixture
KLIEP (PM-KLIEP). The PM-KLIEP is suitable for learning locally rank-deficient
density-ratio functions. On the other hand, methods for accurately learning globally
rank-deficient density-ratio functions are explained in Chapter 8.
$$\widehat{\mathrm{KL}}_t := \frac{1}{|\mathcal{D}^{\mathrm{nu}}_t|}\sum_{x^{\mathrm{nu}}\in\mathcal{D}^{\mathrm{nu}}_t}\log\hat{r}_t(x^{\mathrm{nu}}).$$
This procedure is repeated for $t = 1,\ldots,T$, and the average over all $t$ is computed as
$$\widehat{\mathrm{KL}} := \frac{1}{T}\sum_{t=1}^{T}\widehat{\mathrm{KL}}_t.$$
Then the model that maximizes $\widehat{\mathrm{KL}}$ is chosen.
A pseudo code of CV for KLIEP is summarized in Figure 5.4.
Input: Data samples $\mathcal{D}^{\mathrm{nu}} = \{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\mathcal{D}^{\mathrm{de}} = \{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$, and a set of basis function candidates $\{\psi_m(x)\}_{m=1}^{M}$.
Output: Density-ratio estimator $\hat{r}(x)$.

Split $\mathcal{D}^{\mathrm{nu}}$ into $T$ disjoint subsets $\{\mathcal{D}^{\mathrm{nu}}_t\}_{t=1}^{T}$;
for each model candidate $m = 1,\ldots,M$
  for each split $t = 1,\ldots,T$
    $\hat{r}_t(x) \leftarrow \mathrm{KLIEP}(\mathcal{D}^{\mathrm{nu}}\backslash\mathcal{D}^{\mathrm{nu}}_t,\ \mathcal{D}^{\mathrm{de}},\ \psi_m(x))$;
    $\widehat{\mathrm{KL}}_t(m) \leftarrow \frac{1}{|\mathcal{D}^{\mathrm{nu}}_t|}\sum_{x\in\mathcal{D}^{\mathrm{nu}}_t}\log\hat{r}_t(x)$;
  end
  $\widehat{\mathrm{KL}}(m) \leftarrow \frac{1}{T}\sum_{t=1}^{T}\widehat{\mathrm{KL}}_t(m)$;
end
$\hat{m} \leftarrow \mathop{\mathrm{argmax}}_m\ \widehat{\mathrm{KL}}(m)$;
$\hat{r}(x) \leftarrow \mathrm{KLIEP}(\mathcal{D}^{\mathrm{nu}},\ \mathcal{D}^{\mathrm{de}},\ \psi_{\hat{m}}(x))$;
$$p^*_{\mathrm{nu}}(x) = N(x;1,1^2) \quad\text{and}\quad p^*_{\mathrm{de}}(x) = N(x;0,2^2),$$
where $N(x;\mu,\sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. The true densities are plotted in Figure 5.5(a), while the true density ratio $r^*(x)$ is plotted in Figure 5.5(b).
Let $n_{\mathrm{nu}} = n_{\mathrm{de}} = 200$. For density-ratio estimation, we use the KLIEP algorithm with the following Gaussian kernel model:
$$r(x) = \sum_{\ell=1}^{n_{\mathrm{nu}}}\theta_\ell\exp\left(-\frac{(x-x^{\mathrm{nu}}_\ell)^2}{2\sigma^2}\right).$$
The Gaussian width $\sigma$ is chosen by 5-fold cross-validation (see Section 5.3). The density-ratio function estimated by the KLIEP algorithm, $\hat{r}(x)$, is described in Figure 5.5(b), which shows that the KLIEP gives a reasonably good approximation to the true density-ratio function $r^*(x)$.
5.5 Remarks
Density-ratio estimation by density fitting under the KL divergence allows one to
avoid density estimation when estimating density ratios (Section 5.1). Furthermore,
cross-validation with respect to the KL divergence is available for model selection,
as described in Section 5.3.
0.4 2.5
p*nu(x) r*(x)
0.35 r (x)
p*de(x) 2
0.3
0.25 1.5
0.2
1
0.15
0.1
0.5
0.05
0 0
−5 0 5 −5 0 5
x x
(a) True densities (b) True and estimated density ratios
67
68 6 Density-Ratio Fitting
min r(x de 2
j ) − r(x nu
i ) .
r 2nde j =1 nnu i=1
h := ∗
pnu (x)ψ(x)dx, and
h := ψ(x nu
i ). (6.2)
nnu i=1
1
b θ = θ 1 := |θ |.
=1
The term 1b θ works as the 1 -regularizer if it is combined with the non-negativity
constraint. Then the optimization problem is expressed as
1
min θ H θ − h θ + λ1b θ s.t. θ ≥ 0b ,
θ∈Rb 2
where λ (≥ 0) is the regularization parameter. We refer to this method as a con-
strained LSIF (cLSIF). The cLSIF optimization problem is a convex quadratic
program, and thus the unique global optimal solution can be obtained by a standard
optimization software.
We can also use the 2 -regularizer θ θ , instead of the 1 -regularizer 1
b θ , with-
out changing the computational property (i.e., the optimization problem is still
a convex quadratic program). However, the 1 -regularizer is more advantageous
due to its sparsity-inducing property, that is, many parameters take exactly zero
(Williams, 1995; Tibshirani, 1996; Chen et al., 1998). Furthermore, as explained
in Section 6.3.2, the use of the 1 -regularizer allows one to compute the entire
regularization path efficiently (Best, 1982; Efron et al., 2004; Hastie et al., 2004),
which highly improves the computational cost in the model selection phase.
6.3.1 Cross-Validation
nnu
More specifically, the numerator and denominator samples D nu = {x nu i }i=1 and
de de nde nu T de T
D = {x j }j =1 are divided into T disjoint subsets {Dt }t=1 and {Dt }t=1 , respec-
rt (x) is obtained using D nu \Dtnu and D de \Dtde
tively. Then a density-ratio estimator
nu de
(i.e., all samples without Dt and Dt ), and its SQ value for the hold-out samples
Dtnu and Dtde is computed:
1
1
t :=
SQ rt (x nu )2 − de
rt (x de ).
nu
2|Dt | nu nu |Dt | de de
x ∈Dt x ∈Dt
is chosen.
Then the model that minimizes SQ
ˆ )
θ(λ3
θ(λ
ˆ 1)
θ(λ
ˆ 2)
θ(λ
ˆ 0) = 0b
Figure 6.1. Regularization path tracking of cLSIF. The solution θ(λ) is shown to be
piecewise-linear in the parameter space as a function of λ. Starting from λ = ∞, the
trajectory of the solution is traced as λ is decreased to zero. When λ ≥ λ0 for some λ0 ≥ 0,
the solution stays at the origin 0b . When λ gets smaller than λ0 , the solution departs from
the origin. As λ is further decreased, for some λ1 such that 0 ≤ λ1 ≤ λ0 , the solution goes
straight to
θ (λ1 ) with a constant ‘speed’. Then the solution path changes its direction and,
for some λ2 such that 0 ≤ λ2 ≤ λ1 , the solution heads straight for θ(λ2 ) with a constant
speed as λ is further decreased. This process is repeated until λ reaches zero.
and
Input: H h % see Eqs. (6.1) and (6.2) for their definitions
Output: Entire regularization path
θ (λ) for λ ≥ 0
τ ←− 0; k ←− argmax i {
hi | i = 1, . . . , b}; λτ ←−
hk ;
A ←− {1, . . . , b}\{k};
θ (λτ ) ←− 0b ;
While λτ > 0
E ←− O|A|×b ;
For i = 1, . . . , |A|
Ei, |
ji ←− 1; % A = {j1 , . . . , j|A| j1 < · · · < j|A|
}
end
H − E −1 h −1 1b
G ←− ; u ←− G
; v ←− G ;
−E O|A|×| A| 0|A|
0|A|
Figure 6.2. Pseudo code for computing the entire regularization path of cLSIF. When the
computation of G −1 is numerically unstable, small positive diagonals may be added to H
for stabilization.
et al., 2009). Thanks to this property, the computational complexity for performing
LOOCV is the same order as just computing a single solution.
nnu de nde
In the current setup, two sets of samples, {x nui }i=1 and {x j }j =1 , generally have
different sample sizes. For i = 1, . . . , n, where n := min(nnu , nde ), suppose that the
i-th numerator sample x nu de
i and the i-th denominator sample x i are held out at
the same time in the LOOCV procedure; the remaining numerator or numerator
samples for i > n are assumed to be always used for density-ratio estimation.
6.4 Numerical Examples 73
Note that the order of numerical samples can be changed without sacrificing the
computational advantages.
Let
ri (x) be a density-ratio estimate obtained without the i-th numerator sample
x nu
i and the i-th denominator sample x de i . Then the LOOCV score is expressed as
n
1
1
LOOCV = ri (x de
( i )) 2
−
r (x
i i
nu
) .
n i=1 2
A−1 uvA−1
(A+ uv )−1 = A−1 − .
1 + vA−1 u
Efficient approximation schemes of LOOCV have been investigated under
asymptotic setups (Stone, 1974; Larsen and Hansen, 1996). On the other hand,
for uLSIF, the LOOCV score can be computed exactly, which follows the same
line as that for ridge regression (Hoerl and Kennard, 1970; Wahba, 1990).
MATLAB® and R implementations of uLSIF are available from http://sugi
yama-www.cs.titech.ac.jp/˜sugi/software/uLSIF/ and http://www.math.cm.is.na
goya-u.ac.jp/˜kanamori/software/LSIF/.
where N(x; µ, σ 2 ) denotes the Gaussian density with mean µ and variance σ 2 .
The true densities are plotted in Figure 6.3(a), while the true density ratio r ∗ (x) is
plotted in Figure 6.3(b).
Let nnu = nde = 200. For density-ratio estimation, we use the uLSIF algorithm
with the following Gaussian kernel model:
nnu
(x − x nu )2
r(x) = θ exp − .
=1
2σ 2
0.4 2.5
p*nu(x) r*(x)
0.35
p*de(x) 2 r (x)
0.3
0.25 1.5
0.2
1
0.15
0.1
0.5
0.05
0 0
−5 0 5 −5 0 5
x x
(a) True densities (b) True and estimated density ratios
6.5 Remarks
The least-squares density-ratio fitting methods for linear/kernel models are
computationally more advantageous than alternative approaches. Indeed, the
constrained method (cLSIF) with the 1 -regularizer is equipped with a
regularization-path-tracking algorithm (Section 6.3.2). Furthermore, the uncon-
strained method (uLSIF) allows one to compute the density-ratio estimator
analytically (Section 6.2.2); the leave-one-out cross-validation score can also be
computed in a closed form (Section 6.3.3). Thus, the overall computation of uLSIF
including model selection is highly efficient.
The fact that uLSIF has an analytic-form solution is actually very use-
ful beyond its computational efficiency. When one wants to optimize some
criterion defined using a density-ratio estimator, for example, mutual informa-
tion (Cover and Thomas, 2006) or the Pearson divergence (Pearson, 1900), the
analytic-form solution of uLSIF allows one to compute the derivative of the tar-
get criterion analytically. Then one can develop, for example, gradient-based and
(quasi-) Newton algorithms for optimization. This property can be successfully
utilized, for example, in identifying the central subspace in sufficient dimension
reduction (Section 11.2), finding independent components in independent compo-
nent analysis (Section 11.3), and identifying the heterodistributional subspace in
direct density-ratio estimation with dimensionality reduction (Section 8.2).
Asymptotic convergence behavior of cLSIF and uLSIF in the parametric setup
was elucidated in Kanamori et al. (2009), which is explained in Section 13.2.
The asymptotic non-parametric convergence rate of uLSIF was studied in
Kanamori et al. (2011b), which is explained in Section 14.3. Also, uLSIF was
shown to be numerically stable and reliable under condition number analysis
(Kanamori et al., 2011c), which will be detailed in Chapter 16.
7
Unified Framework
75
76 7 Unified Framework
f(t*)
BRf′ (t*||t)
f (t) ∂f (t)(t* – t)
t t*
A motivation for this choice is that the BR divergence allows one to directly obtain
an empirical approximation for any f . Let us extract a relevant part of the BR
divergence, BR f , as
% &
BR f r ∗ r = C − pde ∗
(x)f (r(x))dx − pde ∗
(x)∂f (r(x))r ∗ (x)dx
∗
+ pde (x)∂f (r(x))r(x)dx
= C + BR f (r) ,
∗
where C := pde (x)f (r ∗ (x))dx is a constant independent of r and
∗ ∗
BR f (r) := pde (x)∂f (r(x))r(x)dx − pde (x)f (r(x))dx
∗
− pnu (x)∂f (r(x))dx. (7.2)
− ∂f (r(x nu
i )). (7.3)
nnu i=1
When
1
f (t) = (t − 1)2 ,
2
BR (7.1) is reduced to the squared (SQ) distance:
1
SQ (t ∗ t) := (t ∗ − t)2 .
2
Following Eqs. (7.2) and (7.3), let us denote SQ without an irrelevant constant
(r), respectively:
term by SQ (r) and its empirical approximation by SQ
1 ∗
SQ (r) := pde (x)r(x)2 dx − pnu∗
(x)r(x)dx,
2
nde nnu
1
de 2 1
r(x) = θ K(x, x de
), (7.4)
=1
= 1 K 2de,de and
H h=
1
K de,nu 1nnu ,
nde nnu
7.2 Existing Methods as Density-Ratio Fitting 79
where 1nnu denotes the nnu -dimensional vector with all ones, and K de,de and K de,nu
are the nde × nde and nde × nnu matrices defined by
[K de,de ]j ,j = K(x de de de nu
j , x j ) and [K de,nu ]j ,i = K(x j , x i ).
Then the (unregularized) uLSIF solution (see Section 6.2.2 for details), −1
θ =H h,
is expressed as
nde −2
θ= K K de,nu 1nnu . (7.5)
nnu de,de
On the other hand, let us consider an inductive variant of KMM for the kernel
model (7.4) (see Section 3.3.2). For the density-ratio model (7.4), the matrix de
defined by Eq. (3.3) is expressed as de = K de,de . Then the KMM solution (see
Section 3.3.2 for details),
nde
θ= ( de K de,de de )−1 de K de,nu 1nnu ,
nnu
is reduced to Eq. (7.5).
Thus, KMM and uLSIF share the same solution. However, it is important to
note that the optimization criteria of the two approaches differ. As will be shown
in Chapter 16, this fact makes a significant difference in numerical stability.
More specifically, the density-ratio fitting method has a smaller condition number
than the generalized moment-matching method (Kanamori et al., 2011c). Thus,
the density-ratio fitting approach is numerically more stable and computationally
more efficient than the moment matching method if solutions are computed by
numerical algorithms such as quasi-Newton methods.
The density-ratio model r is learned so that B KL(r) is minimized.
Equation (7.6) is a generalized expression of logistic regression (Qin, 1998).
Indeed, the ordinary logistic regression formulation can be obtained from Eq. (7.6)
as follows: Let us consider a set of paired samples {(x k , yk )}nk=1 , where, for n =
nnu + nde ,
(x 1 , . . . , x n ) := (x nu nu de de
1 , . . . , x nnu , x 1 , . . . , x nde ),
r(x) = exp(ψ(x) θ ).
Then B KL is expressed as
n
1
de
% &
B KL (θ ) = log 1 + exp(ψ(x de
j ) θ)
nde j =1
nnu
1
% &
+ log 1 + exp(−ψ(x nu
i ) θ)
nnu i=1
n
% % &&
= log 1 + exp −yk ψ(x k ) θ .
k=1
When
f (t) = t log t − t,
U KL (r) := r(x de
j ) − log r(x nu
i ). (7.7)
nde j =1 nnu i=1
n
1
de
r(x de
j ) = 1.
nde j =1
For the KL divergence where f (t) = t log t, the conjugate dual function is given by
f ∗ (u) = exp(u − 1). For the PE divergence where f (t) = (t − 1)2 /2, the conjugate
dual function is given by f ∗ (u) = u2 /2 + u.
Substituting Eq. (7.9) into Eq. (7.8), we have the following lower bound
(Keziou, 2003a):
∗ ∗
ASCf (pnu pde ) = − inf ASCf (g),
g
where
ASCf (g) := f ∗ ∗
(g(x))pde (x)dx − ∗
g(x)pnu (x)dx. (7.10)
According to the variational principle (Jordan et al., 1999), the infimum of ASCf
is attained at g such that
∗
pnu (x)
∂f ∗ (g(x)) = ∗ = r ∗ (x).
pde (x)
7.3 Interpretation of Density-Ratio Fitting 83
Thus, minimizing ASCf (g) yields the true density-ratio function r ∗ (x).
In the following we derive a more explicit expression of ASCf (g). For some
g, there exists r such that g = ∂f (r). Then f ∗ (g) is expressed as
. /
f ∗ (g) = sup s∂f (r) − f (s) .
s
According to the variational principle, the supremum in the right-hand side of the
equation is attained at s = r. Thus we have
Then the lower bound ASCf (g) defined by Eq. (7.10) can be expressed as
, -
ASCf (g) = pde∗
(x) r(x)∂f (r(x)) − f (r(x)) dx
∗
− ∂f (r(x))pnu (x)dx.
This is equivalent to the criterion BR f defined by Eq. (7.2) in Section 7.1. Thus,
density-ratio fitting under the BR divergence can be interpreted as divergence
estimation under the ASC divergence.
Taking the derivative of the above criterion with respect to parameters in the
density-ratio model r and equating it to zero, we have the following estimation
equation:
∗ 2 ∗
pde (x)r(x)∇r(x)∂ f (r(x))dx − pnu (x)∇r(x)∂ 2 f (r(x))dx = 0b ,
where ∇ denotes the differential operator with respect to parameters in the density-
ratio model r, and b is the number of parameters. This implies that putting
φ(x) = ∇r(x)∂ 2 f (r(x))
in Eq. (7.11) gives the same estimation equation as density-ratio fitting, resulting
in the same optimal solution.
7.4.1 Derivation
For α > 0, let
t 1+α − t
f (t) = .
α
Then BR (7.1) is reduced to the BA divergence:
t ∗ (t α − (t ∗ )α )
BAα (t ∗ t) := t α (t − t ∗ ) − .
α
Following Eqs. (7.2) and (7.3), let us denote BAα without an irrelevant constant
term by BAα (r) and its empirical approximation by BAα (r), respectively:
∗ 1 1
BAα (r) := pde (x)r(x)α+1 dx − 1 + ∗
pnu (x)r(x)α dx + ,
α α
nde nnu
1
de α+1 1 1
1
BAα (r) := r(x j ) − 1+ r(x nu α
i ) + .
nde j =1 α nnu i=1 α
tα − 1
lim = log t
α→0 α
implies that the BA divergence tends to the UKL divergence (see Section 7.2.4)
as α → 0:
nde nnu
1
1
lim
BAα (r) = r(x de
j ) − log r(x nu
i ) = UKL (r) .
α→0 nde j =1 nnu i=1
7.4.2 Robustness
α (r) with respect to parameters included in the
Let us take the derivative of BA
density-ratio model r and equate it to zero. Then we have the following estimation
equation:
nde nnu
1
1
r(x de
j ) α
∇r(x de
j ) − r(x nu
i )
α−1
∇r(x nu
i ) = 0b , (7.12)
nde j =1 nnu i=1
nde nnu
1
1
∇r(x de
j ) − r(x nu −1 nu
i ) ∇r(x i ) = 0b .
nde j =1 nnu i=1
Comparing this with Eq. (7.12), we see that the BA method can be regarded
as a weighted version of the KLIEP according to r(x de α nu α
j ) and r(x i ) . When
r(x de nu
j ) and r(x i ) are less than 1, the BA method downweights the effect of those
samples. Thus, “outlying” samples relative to the density-ratio model r tend to
have less influence on parameter estimation, which will lead to robust estimators
(Basu et al., 1998).
Since LSIF corresponds to α = 1, it is more robust against outliers than the
KLIEP (which corresponds to α → 0) in the above sense, and BA with α > 1
would be even more robust.
86 7 Unified Framework
where N(x; µ, σ 2 ) denotes the Gaussian density with mean µ and variance σ 2 .
We draw nnu = nde = 300 samples from each density, which are illustrated in
Figure 7.3(b).
100
pnu(x) xnu (nnu = 300)
1 pde(x)
50
0.8
0
0.6 −3 −2 −1 0 1 2 3
100
0.4 xde (nde = 300)
50
0.2
0 0
−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
x x
(a) Numerator and denominator (b) Numerator and denominator
density functions sample points
3
r* (x)
rBA (x)
2.5 0
rBA (x)
1
2 rBA (x)
2
rBA (x)
3
1.5
0.5
0
−3 −2 −1 0 1 2 3
x
(c) True and learned density-ratio functions
r(x) = θ K(x, x nu
),
=1
where K(x, x ) is the Gaussian kernel with kernel width σ :
x − x 2
K(x, x ) = exp − .
2σ 2
s.t. θ ≥ 0b . (7.13)
Note that this optimization problem is convex for 0 < α ≤ 1. In our implementa-
tion, we solve the above optimization problem by gradient projection, that is, the
parameters are iteratively updated by gradient descent with respect to the objective
function, and the solution is projected back to the feasible region by rounding up
negative parameters to zero. Before solving the optimization problem (7.13), we
run uLSIF (see Section 6.2.2) and obtain cross-validation estimates of the Gaus-
sian width σ and the regularization parameter λ. We then fix the Gaussian width
and the regularization parameter in the BA method to these values and solve the
optimization problem (7.13) by gradient-projection with θ = 1b /b as the initial
solution.
Figure 7.3(c) shows the true and estimated density-ratio functions by the BA
methods for α = 0, 1, 2, 3. The true density-ratio function has two peaks: higher one
at x = 0 and a lower one at around x = 1.2. The graph shows that, as α increases,
the estimated density-ratio functions tend to focus on approximating the higher
peak and ignore the lower peak.
7.5 Remarks
In this chapter we introduced a framework of density-ratio estimation by density-
ratio fitting under the Bregman divergence. This is a natural extension of the
least-squares approach described in Chapter 6 and includes various existing
approaches such as kernel mean matching (see Section 3.3), logistic regres-
sion (see Section 4.2), Kullback–Leibler importance estimation procedure (see
Section 5.1), and least-squares importance fitting (see Section 6.1). We also gave
88 7 Unified Framework
89
90 8 Direct Density-Ratio Estimation with Dimensionality Reduction
x nu nu nu
i = Aui + Bvi and x de de de
j = Auj + Bvj .
∗ ∗
Thus, pnu (x) and pde (x) are expressed as
∗ ∗
pnu (x) = c pnu (u)p ∗ (v) and pde
∗ ∗
(x) = c pde (u)p ∗ (v), (8.1)
where c is the Jacobian between the observation x and (u, v). We call the ranges
of A and B – denoted by R(A) and R(B), respectively – the heterodistributional
subspace and the homodistributional subspace, respectively. Note that R(A) and
R(B) are not generally orthogonal to each other (see Figure 8.1).
Under this decomposability assumption, the density ratio is simplified as
∗ ∗
pnu (x) c pnu (u)p ∗ (v) pnu
∗
(u)
r ∗ (x) = ∗ = ∗ ∗
= ∗ = r ∗ (u). (8.2)
pde (x) c pde (u)p (v) pde (u)
This means that the density ratio does not have to be estimated in the entire d-
dimensional space, but only in the heterodistributional subspace of dimensionality
m (≤ d).
Now we want to extract the heterodistributional components unu de
i and uj from
x nu de
i and x j , which allows estimation of the density ratio only in R(A) via Eq. (8.2).
As illustrated in Figure 8.1, the oblique projection of x nu de
i and x j onto R(A) along
nu de
R(B) gives ui and uj .
p*(υ) p*(υ)
p*nu(u ) p*de(u )
R (A) R(A)
p*nu(x ) p*de(u )
R (U) R (U)
(a) p*nu(x ) ∝ p*nu(u ) p* (υ) (b) p*de(x ) ∝ p*de(u ) p* (υ)
U and V for Aand B, respectively; that is, U is an m×d matrix and V is a (d −m)×d
matrix such that they are bi-orthogonal to each other:
where Om×m denotes the m × m matrix with all zeros. Thus, R(B) and R(U )
are orthogonal to each other and R(A) and R(V ) are orthogonal to each other.
When R(A) and R(B) are orthogonal to each other, R(U ) agrees with R(A) and
R(V ) agrees with R(B). However, they are different in general, as illustrated in
Figure 8.1.
The relation between A and B and the relation between U and V can be
∗ ∗
characterized in terms of the covariance matrix (of either pnu (x) or pde (x)) as
UA = I m and VB = I d−m ,
where I m denotes the m-dimensional identity matrix. Then the oblique projection
matrices PR(A),R(B) and PR(B),R(A) can be expressed as
unu nu de de nu nu de de
i = U x i , uj = U x j , vi = Vx i , and vj = Vx j .
Basic Idea
A key observation in identifying the heterodistributional subspace is that the exis-
tence of distributional difference can be checked whether samples from the two
distributions can be separated from each other. That is, if the samples of one dis-
tribution could be distinguished from the samples of the other distribution, the
two distributions would be different; otherwise, the distributions are similar. We
employ this idea for finding the heterodistributional subspace. Let us denote the
samples projected onto the heterodistributional subspace by
nu nnu n
{unu nu de de de de
i | ui = U x i }i=1 and {uj | uj = U x j }j =1 .
8.1 Discriminant Analysis Approach 93
nnu n
Then our goal is to find the matrix U such that {unu de de
i }i=1 and {uj }j =1 are maxi-
mally separated from each other. To achieve this goal, we may use any supervised
dimensionality reduction methods.
Among various supervised dimensionality reduction methods (e.g., Hastieand Tibshirani,
1996a, 1996b; Fukumizu et al., 2004; Goldberger et al., 2005; Globerson and Roweis,
2006), we decided to use local Fisher discriminant analysis (LFDA; Sugiyama,
2007), which is an extension of the classical Fisher discriminant analysis (FDA;
Fisher, 1936). LFDA has various practically useful properties; for example, there
is no limitation on the dimensionality of the reduced subspace,1 it works well
even when the data have a multimodal structure (such as separate clusters), it is
robust against outliers, its solution can be analytically computed using eigenvalue
decomposition in a numerically stable and computationally efficient manner, and
its experimental performance was shown to be better than competitive methods. In
the following we briefly review the LFDA method, and show how it can be used
for finding the heterodistributional subspace.
Let us consider a set of binary-labeled training samples
{(x k , yk ) | x k ∈ Rd , yk ∈ {+1, −1}}nk=1 ,
and reduce the dimensionality of x k by T x k , where T is an m × d transformation
matrix. Effectively, the training samples {(x k , yk )}nk=1 correspond to the following
setup: For n = nnu + nde ,
(x 1 , . . . , x n ) = (x nu nu de de
1 , . . . , x nnu , x 1 , . . . , x nde ),
µ := x k , µ+ := x k , and µ− := xk .
n k=1 n+ k:y =+1 n− k:y =−1
k k
(b) (w)
Let S and S be the between-class scatter matrix and the within-class scatter
matrix, defined as
S (b) := n+ (µ+ − µ)(µ+ − µ) + n− (µ− − µ)(µ− − µ) ,
1 FDA can only find a subspace with dimensionality less than the number of classes (Fukunaga, 1990).
Thus, in the binary classification scenario that we are dealing with here, the maximum dimensionality
of the subspace that FDA can find is only one.
94 8 Direct Density-Ratio Estimation with Dimensionality Reduction
That is, FDA seeks a transformation matrix T such that between-class scatter is
maximized and within-class scatter is minimized in the embedding space Rm .
Let {φ l }dl=1 be the generalized eigenvectors associated with the generalized
eigenvalues {λl }dl=1 of the following generalized eigenvalue problem:
S (b) φ = λS (w) φ.
Without loss of generality, we assume that the generalized eigenvalues are sorted as
λ1 ≥ · · · ≥ λd . Then a solution T FDA is analytically given as T FDA = (φ 1 | · · · |φ m )
(e.g., Duda et al., 2001).
FDA works very well if the samples in each class have Gaussian distributions
with a common covariance structure. However, it tends to give undesired results
if the samples in a class form several separate clusters or there are outliers. Fur-
thermore, the between-class scatter matrix S (b) is known to have rank one in the
current setup (see e.g., Fukunaga, 1990), implying that we can obtain only a single
meaningful feature φ 1 through the FDA criterion; the remaining features {φ l }dl=2
found by FDA are arbitrary in the null space of S (b) . This is an essential limitation
of FDA in dimensionality reduction.
where
1/n − 1/n+ if yk = yk = +1,
(b)
Wk,k := 1/n − 1/n − if yk = yk = −1,
1/n if yk # = yk ,
1/n+ if yk = yk = +1,
(w)
Wk,k := 1/n − if yk = yk = −1,
0 if yk # = yk .
8.1 Discriminant Analysis Approach 95
Based on this pairwise expression, let us define the local between-class scatter
matrix S (lb) and the local within-class scatter matrix S (lw) as
n
1
(lb)
S (lb) := Wk,k (x k − x k )(x k − x k ) ,
2
k,k =1
n
1
(lw)
S (lw) := Wk,k (x k − x k )(x k − x k ) ,
2
k,k =1
where
Ak,k (1/n − 1/n+ ) if yk = yk = +1,
(lb)
Wk,k := A (1/n − 1/n− ) if yk = yk = −1,
k,k
1/n if yk # = yk ,
Ak,k /n+ if yk = yk = +1,
(lw)
Wk,k := Ak,k /n− if yk = yk = −1,
0 if yk # = yk .
Ak,k is the affinity value between x k and x k , for example, defined based on the
local scaling heuristic (Zelnik-Manor and Perona, 2005):
x k − x k 2
Ak,k := exp − .
ηk ηk
The definitions of S (lb) and S (lw) imply that LFDA seeks a transformation matrix
T such that nearby data pairs in the same class are made close to each other and
the data pairs in different classes are made apart from each other; far apart data
pairs in the same class are not imposed to be close to each other.
By this localization effect, LFDA can overcome the weakness of the original
FDA against clustered data and outliers. When Ak,k = 1 for all k, k (i.e., no
locality), S (lw) and S (lb) are reduced to S (w) and S (b) , respectively. Thus, LFDA
could be regarded as a localized variant of FDA. The between-class scatter matrix
S (b) in the original FDA has only rank one, while its local counterpart S (lb) in LFDA
usually has full rank with no multiplicity in eigenvalues (given n ≥ d). Therefore,
LFDA can be practically applied to dimensionality reduction in any dimensional
subspaces, which is a significant advantage over the original FDA.
96 8 Direct Density-Ratio Estimation with Dimensionality Reduction
A solution T LFDA can be computed in the same way as the original FDA; namely,
the LFDA solution is given as
T LFDA = (ϕ 1 | · · · |ϕ m ) ,
where {ϕ l }dl=1 are the generalized eigenvectors associated with the generalized
eigenvalues {γl }dl=1 of the following generalized eigenvalue problem:
Without loss of generality, we assume that the generalized eigenvalues are sorted
as γ1 ≥ · · · ≥ γd . Thus, LFDA is computationally as efficient as the original FDA.
A pseudo code of LFDA is summarized in Figure 8.2. A MATLAB® implemen-
tation of LFDA is available from http://sugiyama-www.cs.titech.ac.jp/˜sugi/soft
ware/LFDA/.
unu
nu for i = 1, . . . , nnu ,
i := U x i (8.5)
ude
de for j = 1, . . . , nde .
j := U x j (8.6)
nnu n
Input: Two sets of samples {x nu de de
i }i=1 and {x j }j =1 on R
d
m
% span({ ϕ l }l=1 ) = span({ϕ l }m
l=1 ) for m = 1, . . . , m
←− (
U ϕ m ) ;
ϕ 1 | · · · |
Figure 8.2. Pseudo code of LFDA. 1n denotes the n-dimensional vectors with all ones,
and diag(b) denotes the diagonal matrix with diagonal elements specified by a vector b.
+ λI b )−1
θ = (H h, (8.7)
98 8 Direct Density-Ratio Estimation with Dimensionality Reduction
nde nnu
1
:= 1
H ude
ψ(j )ψ(
u de
j ) and
h := unu
ψ(i ).
nde j =1 nnu i=1
m nnu n
unu
{ i (∈ R )}i=1 and { ude m
j (∈ R )}j =1 are dimensionality-reduced samples given by
de
is computed as a function of the reduced dimensionality m, and the one that mini-
mizes the LOOCV score is chosen; we may use k-fold cross-validation instead of
LOOCV.
The D3 procedure explained here is referred to as D3 -LFDA/uLSIF, which effec-
tively combines LFDA and uLSIF, both of which have analytic-form solutions. The
pseudo code of D3 -LFDA/uLSIF is described in Figure 8.3.
nnu n
Input: Two sets of samples {x nu de de
i }i=1 and {x j }j =1 on R
d
subspace search (LHSS) is introduced for searching a subspace such that the Pear-
son divergence between two marginal distributions is maximized (Sugiyama et al.,
2011b).
An advantage of the LHSS method is that the subspace search (divergence
estimation within a subspace) is carried out also using the density-ratio estimation
method uLSIF (see Section 6.2.2). Thus, the two steps in the D3 procedure (first
identifying the heterodistributional subspace and then estimating the density ratio
within the subspace) are merged into a single step. Thanks to this, the final density-
ratio estimator can be obtained automatically without additional computation. This
single-shot density-ratio estimation procedure is called D3 via LHSS (D3 -LHSS).
A summary of the density-ratio estimation methods is given in Figure 8.4.
* (x)
pnu
from samples
{x }n i.i.d.
nu
i ~ p * (x)
nu
i =1 nu
Goal: Estimate density ratio r * (x) = n i.i.d.
* (x)
pde
{x } ~ p *(x)
de
j
de
j =1 de
Homodistributional subspace
pnu (v|u) = pde (v|u) = p(v|u)
(m –d) x d
m –d
V∈ x∈ d
v∈
m×d
U∈
u∈ m
Heterodistributional subspace
pnu (u) ≠ pde (u)
We use the Pearson (PE) divergence (Pearson, 1900) as our criterion for evalu-
ating the discrepancy between two distributions. PE is a squared-loss variant of the
∗
Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951). PE from pnu (x)
∗
to pde (x) is defined and expressed as
∗ 2
∗ ∗ 1 pnu (x) ∗
PE[pnu (x), pde (x)] := ∗ − 1 pde (x)dx
2 pde (x)
∗
1 pnu (x) ∗ 1
= ∗ pnu (x)dx − .
2 pde (x) 2
∗ ∗ ∗ ∗
PE[pnu (x), pde (x)] vanishes if and only if pnu (x) = pde (x).
The following lemma (called the data-processing inequality) characterizes the
heterodistributional subspace in terms of PE.
Lemma 8.1 (Sugiyama et al., 2011b). Let
∗ 2
∗ ∗ 1 pnu (u) ∗
PE[pnu (u), pde (u)] = ∗ − 1 pde (u)du
2 pde (u)
∗
1 pnu (u) ∗ 1
= ∗ pnu (u)du − . (8.8)
2 pde (u) 2
Then
∗ ∗ ∗ ∗
PE[pnu (x), pde (x)] − PE[pnu (u), pde (u)]
∗
1 pnu (x) pnu ∗
(u) 2 ∗
= ∗ − ∗ pde (x)dx ≥ 0. (8.9)
2 pde (x) pde (u)
Equation (8.9) is non-negative, and it vanishes if and only if
∗ ∗
pnu (v|u) = pde (v|u). (8.10)
∗ ∗
Because PE[pnu (x), pde (x)] is a constant with respect to U , maximizing
∗ ∗
PE[pnu (u), pde (u)] with respect to U leads to Eq. (8.10) (Figure 8.6). That
8.2 Divergence Maximization Approach 103
nu (x ), p*
PE [ p* de
(x )]
(constant)
∗ ∗
Figure 8.6. Because PE[pnu (x), pde (x)] is a constant, minimizing
1
, pnu
∗ (x) pnu
-
∗ (u) 2
∗ ∗ ∗
2 p∗ (x)
− p∗ (u) pde (x)dx is equivalent to maximizing PE[pnu (u), pde (u)].
de de
nnu
nu ∗ ∗ 1
1
PE[p (u), pde (u)] := r(unu
i )− .
2nnu i=1 2
2 As proved in Sugiyama et al. (2011b), the data-processing inequality holds not only for PE, but
also for any f -divergence (Ali and Silvey, 1966; Csiszár, 1967). Thus, the characterization of the
heterodistributional subspace is not limited to PE, but is applicable to all f -divergences.
104 8 Direct Density-Ratio Estimation with Dimensionality Reduction
b b
∂ PE h 1
∂ H
∂ ,
= θ − θ θ , (8.11)
∂U =1
∂U 2 ∂U
, =1
where
θ is the uLSIF solution given by Eq. (8.7) and
nnu
∂
h 1
∂ψ (unu
i )
= ,
∂U nnu i=1 ∂U
nde
,
∂H 1
∂ψ (ude
j ) de de
∂ψ (ude
j )
= ψ (uj ) + ψ (uj ) ,
∂U nde j =1 ∂U ∂U
∂ψ (u) 1
= − 2 (u − c )(x − c ) ψ (u).
∂U σ
c (∈ Rd ) is a pre-image of c (∈ Rm ), that is, c = U c . Note that {
θ }b =1 in
Eq. (8.11) depend on U through H and h in Eq. (8.7), which was taken into
account when deriving the gradient. A plain gradient update rule is then given as
∂ PE
U ←− U + t ,
∂U
where t (> 0) is a learning rate. t may be chosen in practice by some approximate
line search method such as Armijo’s rule (Patriksson, 1999) or a backtracking line
search (Boyd and Vandenberghe, 2004).
A naive gradient update does not necessarily fulfill the orthonormality U U =
I m , where I m is the m-dimensional identity matrix. Thus, after every gradient
step, U needs to be orthonormalized, for example, by the Gram–Schmidt process
(Golub and Loan, 1996) to guarantee its orthonormality. However, this may be
rather time-consuming.
On a manifold, it is known that, not the ordinary gradient, but the natural gradient
(Amari, 1998) gives the steepest direction. The natural gradient ∇ PE at U is the
∂ PE d
projection of the ordinary gradient ∂U onto the tangent space of Sm (R) at U .
If the tangent space is equipped with the canonical metric, that is, for any G
and G in the tangent space
2 3 1
G, G = tr(G G ), (8.12)
2
8.2 Divergence Maximization Approach 105
where “exp” for a matrix denotes the matrix exponential, that is, for a square
matrix C,
∞
1 k
exp(C) := T . (8.13)
k=0
k!
Thus, a line search along the geodesic in the natural gradient direction is equivalent
to finding a maximizer from {U t | t ≥ 0}. More details of the geometric structure
of the Stiefel manifold can be found in Nishimori and Akaho (2005).
A natural gradient update rule is then given as U ←− U t , where t (> 0) is
the learning rate. Since the orthonormality of U is automatically satisfied in the
natural gradient method, it is computationally more efficient than the plain gra-
dient method. However, optimizing the m × d matrix U is still computationally
expensive.
Givens Rotation
Another simple strategy for optimizing U is to rotate the matrix in the plane spanned
by two coordinate axes (which is called the Givens rotations; see Golub and Loan,
1996). That is, a two-dimensional subspace spanned by the i-th and j -th variables
is randomly chosen, and the matrix U is rotated within this subspace:
(i,j )
U ←− Rθ U,
(i,j )
where Rθ is the rotation matrix by angle θ within the subspace spanned by the
(i,j )
i-th and j -th variables. Rθ is equal to the identity matrix except that its elements
(i, i), (i, j ), (j , i), and (j , j ) form a two-dimensional rotation matrix:
(i,j ) (i,j )
[Rθ ]i,i [Rθ ]i,j cos θ sin θ
(i,j ) (i,j ) = .
[Rθ ]j ,i [Rθ ]j ,j − sin θ cos θ
Subspace Rotation
Because we are searching for a subspace, rotation within the subspace does not
(see Figure 8.7). This implies that the
have any influence on the objective value PE
number of parameters to be optimized in the gradient algorithm can be reduced.
For a skew-symmetric matrix M (∈ Rd×d ), that is, M = −M , rotation of U
can be expressed as (Plumbley, 2005)
% & U
I m Om,(d−m) exp(M ) ,
V
where Od,d is the d ×d matrix with all zeros, and exp(M ) is the matrix exponential
of M [see Eq. (8.13)]. M = Od,d (i.e., exp(Od,d ) = I d ) corresponds to no rotation.
An update formula of U through the matrix M is as follows.
Let us adopt Eq. (8.12) as the inner product in the space of skew-symmetric
matrices. Then the following lemma holds.
Lemma 8.2 (Sugiyama et al., 2011b). The derivative of PE with respect to M at
M = Od,d is given by
∂ PE Om,m
∂ PE
V
= ∂U . (8.14)
∂M M =O
−( ∂∂U
PE
V ) O(d−m),(d−m)
d,d
The block structure of Eq. (8.14) has an intuitive explanation: The non-zero off-
diagonal blocks correspond to the rotation angles between the heterodistributional
On
subspace and its orthogonal complement that affect the objective function PE.
the other hand, the derivative of the rotation within the two subspaces vanishes
because this does not change the objective value. Thus, the only variables to be
optimized are the angles corresponding to the non-zero off-diagonal blocks ∂∂U
PE
V ,
which includes only m(d − m) variables. In contrast, the plain/natural gradient
algorithms optimize the matrix U consisting of md variables. Thus, when m is
Rotation across
the subspace
Rotation within
the subspace
Heterodistributional
subspace
Figure 8.7. In the heterodistributional subspace search, only rotation that changes the
subspace matters (the solid arrow); rotation within the subspace (dotted arrow) can be
ignored because this does not change the subspace. Similarly, rotation within the
orthogonal complement of the heterodistributional subspace can also be ignored (not
depicted in the figure).
8.2 Divergence Maximization Approach 107
large, the subspace rotation approach may be computationally more efficient than
the plain/natural gradient algorithms.
The gradient ascent update rule of M is given by
∂ PE
M ←− t ,
∂M M =O
d,d
r(x) = x),
θ ψ (U
=1
nnu n
Input: Two sets of samples {x nu de de
i }i=1 and {x j }j =1 on R
d
able to detect the subspace in which the two distributions are different, but
the samples are not really separable.
Here, through numerical examples, we illustrate these weaknesses of D3 -
LFDA/uLSIF as well as how D3 -LHSS can overcome them.
Let us consider two-dimensional examples (i.e., d = 2), and suppose that the
∗ ∗
two densities pnu (x) and pde (x) are different only in the one-dimensional subspace
(i.e., m = 1) spanned by (1, 0) :
∗ ∗ ∗ ∗
pnu (x) = p(v|u)pnu (u) and pde (x) = p(v|u)pde (u),
where x = (x (1) , x (2) ) = (u, v) . Let nnu = nde = 1000. The following three
datasets are used:
Rather-separate dataset (Figure 8.9):
where N(u; µ, σ 2 ) denotes the Gaussian density with mean µ and variance
σ 2 with respect to u. This is an easy and simple dataset for the purpose of
illustrating the usefulness of the idea of D3 .
Highly-overlapped dataset (Figure 8.10):
In this dataset, the conditional distribution p(v|u) is common, but the marginal
∗ ∗
distributions pnu (v) and pde (v) are different. Because v is not independent of
u, this dataset would be out of scope for D3 -LFDA/uLSIF.
The true heterodistributional subspace for the rather-separate dataset is depicted
by the dotted line in Figure 8.9(a); the solid line and the dashed line depict the
heterodistributional subspace found by LHSS and LFDA with reduced dimen-
sionality m = 1, respectively. This graph shows that LHSS and LFDA both give
very good estimates of the true heterodistributional subspace. In Figure 8.9(c)–(e),
density-ratio functions estimated by the plain uLSIF without dimensionality reduc-
tion, D3 -LFDA/uLSIF, and D3 -LHSS for the rather-separate dataset are depicted.
These graphs show that both D3 -LHSS and D3 -LFDA/uLSIF give much better
110 8 Direct Density-Ratio Estimation with Dimensionality Reduction
6
x de
4 x nu
True Subspace
LHSS Subspace
2 LFDA Subspace
x(2)
0
−2
−4
−6
−6 −4 −2 0 2 4 6
(a) Heterodistributional subspace
3 3
2 2
5 5
1 1
0 0 0 0
−5 −5
0 −5 x(2) 0 −5 x(2)
5 5
x(1) x(1)
(b) r *(x) (c) ˆ
r (x) by plain uLSIF
3 3
2 5 2 5
1 1
0 0 0 0
−5 −5
0 −5 0 −5
x(2) 5 x(2)
5
x(1) x(1)
ˆ by D3-LFDA/uLSIF
(d) r(x) (e) ˆ
r(x) by D3-LHSS
estimates of the density-ratio function [see Figure 8.9(b) for the profile of the true
density-ratio function] than the plain uLSIF without dimensionality reduction.
Thus, the usefulness of D3 was illustrated.
For the highly-overlapped dataset (Figure 8.10), LHSS gives a reasonable
estimate of the heterodistributional subspace, while LFDA is highly erroneous
due to less separability. As a result, the density-ratio function obtained by D3 -
LFDA/uLSIF does not reflect the true redundant structure appropriately. On the
other hand, D3 -LHSS still works well.
8.3 Numerical Examples 111
6
x de
4 x nu
True Subspace
2 LHSS Subspace
LFDA Subspace
x(2)
0
−2
−4
−6
−6 −4 −2 0 2 4 6
x(1)
(a) Heterodistributional subspace
2
2
5 5
1 1
0 0 0 0
–5 –5
0 –5 0 –5
x(2) x(2)
5 5
x(1) x (1)
2
1
5 1
5
0.5
0 0 0 0
–5 –5
0 0 –5
–5 x(2) x(2)
5 5
x(1) x(1)
Finally, for the dependent dataset (Figure 8.11), LHSS gives an accurate esti-
mate of the heterodistributional subspace. However, LFDA gives a highly biased
∗ ∗
solution because the marginal distributions pnu (v) and pde (v) are no longer com-
mon in the dependent dataset. Consequently, the density-ratio function obtained
by D3 -LFDA/uLSIF is highly erroneous. In contrast, D3 -LHSS still works very
well for the dependent dataset.
The experimental results for the highly-overlapped and dependent datasets
illustrated typical failure modes of LFDA, and LHSS was shown to be able to
successfully overcome these weaknesses of LFDA. Note, however, that one can
112 8 Direct Density-Ratio Estimation with Dimensionality Reduction
6
x de
4 x nu
True Subspace
2 LHSS Subspace
LFDA Subspace
x(2)
0
−2
−4
−6
−6 −4 −2 0 2 4 6
x(1)
(a) Heterodistributional subspace
3 3
2 5 2 5
1 1
0 0 0 0
–5 –5
0 –5 0 –5
5 x(2) 5 x(2)
x(1) x(1)
(b) r *(x) (c) ˆ
r(x) by plain uLSIF
2
3
5 2 5
1
1
0 0 0 0
–5 –5
0 –5 0 –5
5 x(2) 5 x(2)
x(1) x(1)
Plain uLSIF
0.6 D3−LFDA/uLSIF
D3−LHSS
0.5
0.4
Error
0.3
0.2
0.1
0
2 3 4 5 6 7 8 9 10
Entire dimensionality d
(a) Density-ratio estimation error
100 100
10 10
90 90
Dimensionality chosen by CV
9
Dimensionality chosen by CV
9
80 80
8 8
70 70
7 7
60 60
6 6
50 50
5 5
40 40
4 4
30 30
3 3
20 20
2 2
10 10
1 1
0 0
2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10
Entire dimensionality d Entire dimensionality d
(b) Choice of dimentionality by (c) Choice of dimentionality by
D3-LHSS D3-LFDA/uLSIF
Figure 8.12. Experimental results for rather-separate dataset. (a) Density-ratio estimation
error (8.15) averaged over 100 runs as a function of the entire data dimensionality d. The
best method in terms of the mean error and comparable methods according to the t-test at
the significance level 1% are specified by “◦”; the other methods are specified by “×”. (b)
The dimensionality of the heterodistributional subspace chosen by CV in LHSS. (c) The
dimensionality of the heterodistributional subspace chosen by CV in LFDA.
114 8 Direct Density-Ratio Estimation with Dimensionality Reduction
0.25
Plain uLSIF
D3−LFDA/uLSIF
0.2 D3−LHSS
0.15
Error
0.1
0.05
0
2 34 5 6 7 8 9 10
Entire dimensionality d
(a) Density-ratio estimation error
100 100
Dimensionality chosen by CV
Dimensionality chosen by CV
10 10
90 90
9 9
80 80
8 8
70 70
7 7
60 60
6 6
50 50
5 5
40 40
4 4
30 30
3 3
20 20
2 2
10 10
1 1
0 0
2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10
Entire dimensionality d Entire dimensionality d
(b) Choice of dimentionality by (c) Choice of dimentionality by
D3-LHSS D3-LFDA/uLSIF
Plain uLSIF
D3−LFDA/uLSIF
0.5 D3−LHSS
0.4
Error
0.3
0.2
0.1
0
2 3 4 5 6 7 8 9 10
Entire dimensionality d
(a) Density-ratio estimation error
100 100
Dimensionality chosen by CV
10 10
Dimensionality chosen by CV
90 90
9 9
80 80
8 8
70 70
7 7
60 60
6 6
50 50
5 5
40 40
4 4
30 30
3 3
20 20
2 2
10 10
1 1
0 0
2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10
Entire dimensionality d Entire dimensionality d
(b) Choice of dimentionality by (c) Choice of dimentionality by
D3-LHSS D3-LFDA/uLSIF
Figure 8.14. Experimental results for dependent dataset. (a) Density-ratio estimation
error (8.15) averaged over 100 runs as a function of the entire data dimensionality d. The
best method in terms of the mean error and comparable methods according to the t-test at
the significance level 1% are specified by “◦”; the other methods are specified by “×”. (b)
The dimensionality of the heterodistributional subspace chosen by CV in LHSS. (c) The
dimensionality of the heterodistributional subspace chosen by CV in LFDA.
8.4 Remarks
In this chapter we explained two approaches to estimating density ratios in high-
dimensional spaces called direct density-ratio estimation with dimensionality
reduction (D3 ). The basic idea of D3 was to identify a subspace called the heterodis-
tributional subspace, in which two distributions (corresponding to the numerator
and denominator of the density ratio) are different.
In the first approach introduced in Section 8.1, the heterodistributional sub-
space was identified by finding a subspace in which samples drawn from the
two distributions are maximally separated from each other. To this end, super-
vised dimensionality reduction methods such as local Fisher discriminant analysis
(LFDA; Sugiyama, 2007) were utilized. This approach, called D3 -LFDA/uLSIF, is
computationally very efficient because analytic-form solutions are available. It has
116 8 Direct Density-Ratio Estimation with Dimensionality Reduction
been shown to work well when the components inside and outside the heterodis-
tributional subspace are statistically independent and samples drawn from the
two distributions are highly separable from each other in the heterodistributional
subspace.
However, violation of these conditions can cause significant performance degra-
dation, as numerically illustrated in Section 8.3. This drawback can be overcome
in principle by finding a subspace such that the two conditional distributions
are similar to each other in its complementary subspace. To implement this idea,
the heterodistributional subspace was characterized as the subspace in which the
two marginal distributions are maximally different under Pearson divergence
(Lemma 8.1). Based on this lemma, an algorithm for finding the heterodistri-
butional subspace called the least-squares hetero-distributional subspace search
(LHSS) was introduced in Section 8.2 as an alternative approach to D3 .
Because a density-ratio estimation method is utilized during a heterodistri-
butional subspace search in the LHSS procedure, an additional density-ratio
estimation step is not needed after a heterodistributional subspace search. Thus, the
two steps in the first approach (heterodistributional subspace search followed by
density-ratio estimation in the identified subspace) were merged into a single step in
the second approach (see Figure 8.4). The single-shot procedure, called D3 -LHSS,
was shown to be able to overcome the limitations of the D3 -LFDA/uLSIF approach
through experiments in Section 8.3, although it is computationally more expensive
than D3 -LFDA/uLSIF. Thus, improving the computation cost for a heterodistri-
butional subspace search is an important future work. A computationally efficient
way to find a heterodistributional subspace is studied in Yamada and Sugiyama
(2011b).
Part III
In this part we show how density-ratio estimation methods can be used for solving
various machine learning problems.
In the context of importance sampling (Fishman, 1996), where the expecta-
tion over one distribution is computed by the importance-weighted expectation
over another distribution, density ratios play an essential role. In Chapter 9,
the importance sampling technique is applied to non-stationarity/domain adap-
tation in the semi-supervised learning setup (Shimodaira, 2000; Zadrozny, 2004;
Sugiyama and Müller, 2005; Storkey and Sugiyama, 2007; Sugiyama et al., 2007;
Quiñonero-Candela et al., 2009; Sugiyama and Kawanabe, 2011). It is also shown
that the same importance-weighting idea can be used for solving multi-task
learning (Bickel et al., 2008).
Another major usage of density ratios is distribution comparisons. In
Chapter 10, two methods of distribution comparison based on density-ratio estima-
tion are described: inlier-base outlier detection, where distributions are compared
in a pointwise manner (Smola et al., 2009; Hido et al., 2011), and two-sample
tests, where the overall difference between distributions is compared within the
framework of hypothesis testing (Sugiyama et al., 2011c).
In Chapter 11 we show that density-ratio methods allow one to accu-
rately estimate mutual information (Suzuki et al., 2008, 2009a). Mutual infor-
mation is a key quantity in information theory (Cover and Thomas, 2006),
and it can be used for detecting statistical independence between ran-
dom variables. Mutual information estimators have various applications
in machine learning, including independence tests (Sugiyama and Suzuki,
2011), variable selection (Suzuki et al., 2009b), supervised dimensional-
ity reduction (Suzuki and Sugiyama, 2010), independent component anal-
ysis (Suzuki and Sugiyama, 2011), clustering (Kimura and Sugiyama, 2011;
Sugiyama et al., 2011d), object matching (Yamada and Sugiyama, 2011a), and
causal inference (Yamada and Sugiyama, 2010).
Finally, in Chapter 12, density-ratio methods are applied to conditional prob-
ability estimation. When the output variable is continuous, this corresponds
118 Applications of Density Ratios in Machine Learning
119
120 9 Importance Sampling
9.1.1 Introduction
The goal of supervised learning is to infer an unknown input–output dependency
from training samples, by which output values for unseen test input points can be
predicted (see Section 1.1.1). When developing a method of supervised learning,
it is commonly assumed that the input points in the training set and the input points
used for testing follow the same probability distribution (Wahba, 1990; Bishop,
1995; Vapnik, 1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola,
2002). However, this common assumption is not fulfilled, for example, when
the area outside of the training region is extrapolated or when the training input
points are designed by an active learning (a.k.a. experimental design) algo-
rithm (Wiens, 2000; Kanamori and Shimodaira, 2003; Sugiyama, 2006; Kanamori,
2007; Sugiyama and Nakajima, 2009).
Situations where training and test input points follow different probability dis-
tributions but the conditional distributions of output values given input points
are unchanged are called covariate shifts1 (Shimodaira, 2000). In this section we
introduce covariate shift adaptation techniques based on density-ratio estimation.
Under covariate shifts, standard learning techniques such as maximum like-
lihood estimation are biased. It was shown that the bias caused by a covariate
shift can be asymptotically canceled by weighting the loss function according
to the importance – the ratio of test and training input densities (Shimodaira,
2000; Zadrozny, 2004; Sugiyama and Müller, 2005; Sugiyama et al., 2007;
Quiñonero-Candela et al., 2009; Sugiyama and Kawanabe, 2011). Similarly, stan-
dard model selection criteria such as cross-validation (Stone, 1974; Wahba, 1990)
and Akaike’s information criterion (Akaike, 1974) lose their unbiasedness under
covariate shifts. It was shown that proper unbiasedness can also be recovered
by modifying the methods based on importance weighting (Shimodaira, 2000;
Zadrozny, 2004; Sugiyama and Müller, 2005; Sugiyama et al., 2007).
Examples of successful real-world applications of covariate shift adapta-
tions include brain–computer interfaces (Sugiyama et al., 2007; Y. Li et al., 2010),
robot control (Hachiya et al., 2009; Akiyama et al., 2010; Hachiya et al., 2011b),
speaker identification (Yamada et al., 2010b), audio tagging (Wichern et al., 2010),
age prediction from face images (Ueki et al., 2011), wafer alignment in semi-
conductor exposure apparatus (Sugiyama and Nakajima, 2009), human activity
1 Note that the term “covariate” refers to an input variable in statistics. Thus, a “covariate shift”
indicates a situation where input-data distributions shift.
9.1 Covariate Shift Adaptation 121
recognition from accelerometric data (Hachiya et al., 2011a), and natural language
processing (Tsuboi et al., 2009). Details of those real-world applications as well as
technical details of covariate-shift adaptation techniques are covered extensively
in Sugiyama and Kawanabe (2011).
be the training samples, where x tri is a training input point drawn from a probability
distribution with density ptr∗ (x), and yitr is a training output value following a con-
ditional probability distribution with conditional density p ∗ (y|x = x tri ). p ∗ (y|x)
may be regarded as the superposition of the true output f ∗ (x) and noise @:
y = f ∗ (x) + @.
f (x 1tr )
tr
f (x ) ⫹⑀ 1 f (x ntrtr )
y 2tr
f (x ) y 1tr ⫹⑀ trntr
tr
⫹⑀ 2 y ntrtr
f (x 2tr )
where Exte ∼pte∗ (x) denotes the expectation over x te drawn from pte∗ (x) and
Ey te ∼p∗ (y|x=xte ) denotes the expectation over y te drawn from p ∗ (y|x = x te ).
The term loss( y , y) denotes the loss function that measures the discrepancy
between the true output value y and its estimate y . When the output domain Y
is continuous, the problem is called regression and the squared loss is a standard
choice:
loss( y − y)2 .
y , y) = ( (9.1)
On the other hand, when the output domain Y is binary categories (i.e., Y =
{+1, −1}), the problem is called binary classification and the 0/1-loss is a typical
choice:
5
0 if sgn( y ) = y,
loss(y , y) =
1 otherwise,
where sgn(y) = +1 if y ≥ 0 and sgn(y) = −1 if y < 0. Note that the generalization
error with the 0/1-loss corresponds to the misclassification rate.
We use a parametric function f (x; θ) for learning, where θ is a parameter. A
model f (x; θ ) is said to be correctly specified if there exists a parameter θ ∗ such
that f (x; θ ∗ ) = f ∗ (x); otherwise the model is said to be misspecified. In practice,
the model used for learning is misspecified to a greater or less extent because we
do not generally have enough prior knowledge for correctly specifying the model.
Thus, learning theories specialized to correctly specified models are less useful in
practice; it is important to explicitly consider the case where our model at hand is
misspecified to some extent when developing machine learning algorithms.
In standard supervised learning theories (Wahba, 1990; Bishop, 1995; Vapnik,
1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola, 2002), the test
input point x te is assumed to follow the same probability distribution as the training
input point x tr , that is, ptr∗ (x) = pte∗ (x). On the other hand, in this section we con-
sider the situation called a covariate shift (Shimodaira, 2000); that is, the training
input point x tr and the test input point x te have different probability densities (i.e.,
ptr∗ (x) # = pte∗ (x)). Under a covariate shift, most of the standard learning techniques
do not work well due to differing distributions. In the following, we introduce
importance sampling techniques for mitigating the influence of covariate shifts.
θ ERM := argmin tr tr
loss(f (x i ; θ ), yi ) .
θ ntr i=1
θ IWERM converges to θ ∗ under a covariate shift, even if the model is misspecified
(Shimodaira, 2000). In practice, IWERM may be regularized, for example, by
slightly flattening the importance weight and/or adding a penalty term as
ntr ∗ γ
1
pte (x tri )
argmin loss(f (x tri ; θ ), yitr ) + λθ θ , (9.2)
θ ntr i=1 ptr∗ (x tri )
Numerical Examples
Here we illustrate the behavior of IWERM using toy regression and classification
datasets.
First, let us consider a one-dimensional regression problem. Let the learning
target function be
f ∗ (x) = sinc(x),
where N(x; µ, σ 2 ) denotes the Gaussian density with mean µ and variance σ 2 .
As illustrated in Figure 9.2, we are considering a (weak) extrapolation problem
because the training input points are distributed on the left-hand side of the input
domain and the test input points are distributed on the right-hand side.
We create the training output value {yitr }ni=1
tr
as yitr = f ∗ (xitr ) + @itr , where {@itr }ni=1
tr
2
are i.i.d. noise drawn from N (@; 0, (1/4) ). Let the number of training samples be
ntr = 150, and use the following linear model for function approximation:
f (x; θ) = θ1 x + θ2 .
Training 1
1.5 Test
Importance
0.5
1
0
0.5 f *(x)
−0.5 f(x)
Training
0 Test
−0.5 0 0.5 1 1.5 2 2.5 3 −0.5 0 0.5 1 1.5 2 2.5 3
x x
(a) Input data densities (b) γ = 0
1 1
0.5 0.5
0 0
−0.5 −0.5
Figure 9.2. An illustrative regression example with covariate shifts. (a) The probability
density functions of the training and test input points and their ratio (i.e., the importance).
(b)–(d) The learning target function f ∗ (x) (solid line), training samples (◦), a learned
function f(x) (dashed line), and test samples (×).
9.1 Covariate Shift Adaptation 125
large test error under a covariate shift. Figure 9.2(d) depicts the learned function
for γ = 1, which tends to approximate the test output values well. However, it tends
to have a larger variance than the approximator obtained by γ = 0. Figure 9.2(c)
depicts a learned function for γ = 0.5, which yields an even better estimation of
the test output values for this particular data realization.
Next let us consider a binary classification problem on a two-dimensional input
space. For x = (x (1) , x (2) ) , let the class-posterior probabilities given input x be
1% % &&
p ∗ (y = +1|x) = 1 + tanh x (1) + min(0, x (2) ) ,
2
p (y = −1|x) = 1 − p∗ (y = +1|x).
∗
The optimal decision boundary, that is, a set of all x such that
1
p∗ (y = +1|x) = p∗ (y = −1|x) = ,
2
is illustrated in Figure 9.3(a).
Let the training and test input densities be
∗ 1 −2 1 0 1 2 1 0
ptr (x) = N x; , + N x; , ,
2 3 0 4 2 3 0 4
∗ 1 0 1 0 1 4 1 0
pte (x) = N x; , + N x; , ,
2 −1 0 1 2 −1 0 1
where N(x; µ, ) is the multivariate Gaussian density with mean µ and covari-
ance matrix . This setup implies that we are considering a (weak) extrapolation
problem. Contours of the training and test input densities are illustrated in
Figure 9.3(a).
3 Training
2 2
0.03
0.0
1
2
0.02
0 0
0.0
0.06
−1 Test
4
0.
0.0 06 −2
2
0.0
−2 4
−3
−4 −2 0 2 4 6 −8 −6 −4 −2 0 2 4 6
(a) Optimal decision boundary (the thick (b) Optimal decision boundary (solid line)
solid line) and contours of training and and learned boundaries (dashed line).
test input densities (thin solid lines). “o” and “×” denote the positive and negative
training samples, while “ ” and “+” denote
the positive and negative test samples.
Let the number of training samples be ntr = 500; we then create training
input points {x tri }ni=1
tr
following ptr∗ (x) and each training output label yitr follow-
∗ tr
ing p (y|x = x i ). Similarly, let the number of test samples be nte = 500 and
nte
create nte test input points {x te ∗ te
j }j =1 following pte (x) and each test output label yj
∗ te
following p (y|x = x j ). We use the following linear model for learning:
Importance-Weighted Cross-Validation
One of the popular techniques for estimating the generalization error is CV (Stone,
1974; Wahba, 1990), which has been shown to give an almost unbiased esti-
mate of the generalization error with finite samples (Luntz and Brailovsky, 1969;
9.1 Covariate Shift Adaptation 127
where fi (x) is a function learned from Z\(x tri , yitr ) [i.e., without (x tri , yitr )]. It was
proved that LOOIWCV gives an almost unbiased estimate of the generalization
error even under covariate shifts (Sugiyama et al., 2007). More precisely, LOOI-
WCV for ntr training samples gives an unbiased estimate of the generalization
error for ntr − 1 training samples:
0 1
E{xtr }ntr E{ytr }ntr G LOOIWCV = E tr ntr E tr ntr [G ]
i i=1 i i=1 {x } {y }
i i=1 i i=1
Numerical Examples
Here we illustrate the behavior of IWCV using the same toy datasets as in
Section 9.1.3.
Let us continue the one-dimensional regression simulation in Section 9.1.3.
As illustrated in Figure 9.2, IWLS with a flattening parameter γ = 0.5 appears
to work well for that particular realization of data samples. However, the best
value of γ depends on the realization of samples. To systematically investigate
this issue, let us run the simulation 1000 times with different random seeds; that is,
in each run, input–output pairs $\{(x^{\mathrm{tr}}_i, \epsilon^{\mathrm{tr}}_i)\}_{i=1}^{n_{\mathrm{tr}}}$ are randomly drawn, and the scores of 10-fold IWCV and 10-fold ordinary CV are calculated for $\gamma = 0, 0.1, 0.2, \ldots, 1$.
The means and standard deviations of the generalization error G and its estimate
by each method are depicted as functions of γ in Figure 9.4. The graphs show that
IWCV gives accurate estimates of the generalization error, while ordinary CV is
heavily biased.
[Figure 9.4 panels: (b) IWCV score; (c) ordinary CV score; horizontal axes: γ.]
Figure 9.4. Generalization error and its estimates obtained by IWCV and ordinary CV as
functions of the flattening parameter γ in IWLS for the regression examples in Figure 9.2.
The thick dashed curves in the bottom graphs depict the true generalization error for a
clear comparison.
9.1.5 Remarks
In standard supervised learning theories (Wahba, 1990; Bishop, 1995; Vapnik,
1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola, 2002), test
input points are assumed to follow the same probability distribution as training
[Figure 9.5 panels: (a) true generalization error; (b) IWCV score; (c) ordinary CV score; horizontal axes: γ.]
Figure 9.5. The generalization error G (i.e., the misclassification rate) and its estimates
obtained by IWCV and ordinary CV as functions of the flattening parameter γ in IWFDA
for the toy classification examples in Figure 9.3. The thick dashed curves in the bottom
graphs depict the true generalization error for clear comparison.
Privacy-preserving data mining is a novel paradigm in the area of data mining aimed at performing some
data-processing operation (the most fundamental one would be to compute the
mean of data) with the data kept confidential to the public. The density-ratio
methods explained in Part II will play a central role in this line of research.
9.2.1 Introduction
Multi-task learning deals with a situation where multiple related learning tasks
exist. The rationale behind multi-task learning is that, rather than solving
such related learning tasks separately, solving them simultaneously by shar-
ing some common information behind the tasks may improve the prediction
accuracy (Caruana et al., 1997; Baxter, 1997, 2000; Ben-David et al., 2002;
Bakker and Heskes, 2003; Ben-David and Schuller, 2003; Evgeniou and Pontil,
2004; Micchelli and Pontil, 2005; Yu et al., 2005; Ando and Zhang, 2005;
Xue et al., 2007; Bonilla et al., 2008; Kato et al., 2010; Simm et al., 2011).
In Section 9.2.3 we describe multi-task learning methods that explicitly share
training samples with other tasks. We first introduce a naive approach that merely
borrows training samples from related tasks. This approach is useful if the other
tasks are very similar to the target task. On the other hand, if the other tasks are similar to, but substantially different from, the target task, the use of importance sampling to absorb the difference in distributions would be technically more sound (Bickel et al.,
2008).
Another line of research tries to implicitly share data samples in different tasks.
If prior knowledge that some parameters can be shared across different tasks (e.g.,
the class-wise variance of the data is common to all tasks), such a parametric
form can be utilized for improving the prediction accuracy. However, such “hard”
data-sharing models may not always be appropriate in practice.
More general approaches that do not require data-sharing models assume
the common prior distribution across different tasks in the Bayesian frame-
work (Yu et al., 2005; Xue et al., 2007; Bonilla et al., 2008) or constrain the solutions of different tasks to be close to each other in the regularization framework
(Evgeniou and Pontil, 2004; Lapedriza et al., 2007; Kato et al., 2010; Simm et al.,
2011). These implicit data-sharing approaches are more flexible than the
explicit data-sharing approach. The regularization-based approach is described
in Section 9.2.3.
$$\{(x_k, y_k, t_k)\}_{k=1}^{n},$$
where $t_k$ ($\in \{1, \ldots, m\}$) denotes the index of the task to which the input–output sample $(x_k, y_k)$ belongs. The training input point $x_k$ ($\in \mathcal{X} \subset \mathbb{R}^d$) is drawn from a probability distribution with density $p^*_{t_k}(x)$, and the training output value $y_k$ ($\in \mathcal{Y} \subset \mathbb{R}$) follows a conditional probability distribution with conditional density $p^*_{t_k}(y \mid x = x_k)$. Let $p^*_t(y \mid x)$ denote the conditional density for the $t$-th task, which may be regarded as the superposition of the true output $f^*_t(x)$ and noise $\epsilon$:
$$y = f^*_t(x) + \epsilon.$$
Let $p^*_t(x, y)$ be the joint density of input $x$ and output $y$ for the $t$-th task.
$$\min_{\theta^{(t)}} \sum_{k=1}^{n} \mathrm{loss}\bigl(f(x_k; \theta^{(t)}), y_k\bigr), \tag{9.3}$$
where $f(x; \theta^{(t)})$ is a model of the $t$-th target function $f^*_t(x)$ and $\mathrm{loss}(\widehat{y}, y)$ is the loss function that measures the discrepancy between the true output value $y$ and its estimate $\widehat{y}$, for example, the squared loss (9.1).
Such a naive data-sharing approach is useful when the number of training
samples for each task is very small. For example, in an application of multi-task
learning to optical surface profiling (Sugiyama et al., 2006), three parameters in
the model
should be learned from only a single training sample. Because this is an ill-posed
problem, directly solving this task may not provide a useful solution. In such appli-
cations, borrowing data samples from “vicinity” tasks is essential, and the above
simple multi-task approach was shown to work well. See Yokota et al. (2009),
Kurihara et al. (2010), and Mori et al. (2011) for further developments along this
line of research, such as how the vicinity tasks are chosen and how model selection
is carried out.
for mutual data sharing. For estimating the density ratios $p^*_t(x, y)/p^*_{t'}(x, y)$ we can use any of the various density-ratio estimators described in Part II. Among them, an approach based on probabilistic classification described in Chapter 4 is particularly useful in the context of multi-task learning, because density-ratio estimators for multiple density ratios $\{p^*_t(x, y)/p^*_{t'}(x, y)\}_{t,t'=1}^{m}$ can be obtained simultaneously using multi-class probabilistic classifiers (Bickel et al., 2008).
Although the importance-based multi-task learning approach is highly flexible,
estimating the importance weights over both x and y is a hard problem. Thus,
this approach could be unreliable if the number of training samples in each task is
limited.
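As a rough illustration of the probabilistic-classification route, the following sketch estimates the ratios $p^*_t(x, y)/p^*_{t'}(x, y)$ from a single multi-class classifier over task indices via Bayes' rule. scikit-learn's logistic regression is assumed here purely for convenience, and the function name and interface are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def multitask_density_ratios(X, y, task, t, t_prime):
    """Estimate r(x, y) = p_t(x, y) / p_{t'}(x, y) at every training sample
    via multi-class probabilistic classification of the task index.

    X : (n, d) inputs,  y : (n,) outputs,  task : (n,) task indices
    """
    Z = np.c_[X, y]                        # classify the joint variable (x, y)
    clf = LogisticRegression(max_iter=1000).fit(Z, task)
    proba = clf.predict_proba(Z)           # P(task | x, y)
    classes = list(clf.classes_)
    n_t = np.sum(task == t)
    n_tp = np.sum(task == t_prime)
    # Bayes' rule:  p_t(x,y) / p_{t'}(x,y)
    #             = [P(t | x,y) / P(t' | x,y)] * [P(t') / P(t)]
    return (proba[:, classes.index(t)] / proba[:, classes.index(t_prime)]) \
           * (n_tp / n_t)
```

Because one classifier provides the posteriors for all tasks simultaneously, every pairwise ratio can be read off from the same fitted model.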
Basic Formulation
The parameter for task $t$ ($\in \{1, \ldots, m\}$) is decomposed into a common part $\theta^{(0)}$ that is shared by all tasks and an individual part $\theta^{(t)}$ that can be different for each task (see Figure 9.6):
$$\theta^{(0)} + \theta^{(t)}.$$
Then the parameters for all the tasks are learned simultaneously as
$$\min_{\{\theta^{(t)}\}_{t=0}^{m}} \left[\, \sum_{k=1}^{n} \mathrm{loss}\bigl(f(x_k; \theta^{(0)} + \theta^{(t_k)}), y_k\bigr) + \frac{\lambda}{2}\bigl\|\theta^{(0)}\bigr\|^2 + \frac{\gamma}{2m}\sum_{t=1}^{m}\bigl\|\theta^{(t)}\bigr\|^2 \right], \tag{9.5}$$
Figure 9.6. Decomposition of model parameters into the common part θ (0) and individual
parts θ (1) and θ (2) .
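For the squared loss, problem (9.5) is a convex quadratic and reduces to one large ridge-regression system over the stacked parameter vector $(\theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(m)})$. The following NumPy sketch (with illustrative function and variable names) solves it directly under that assumption:

```python
import numpy as np

def multitask_ridge(Psi, y, task, n_tasks, lam=1.0, gamma=1.0):
    """Minimize  sum_k (psi(x_k)^T (theta0 + theta_{t_k}) - y_k)^2
                 + (lam/2) ||theta0||^2 + (gamma/(2m)) sum_t ||theta_t||^2.

    Psi  : (n, b) matrix of basis-function values psi(x_k)
    task : (n,) integer task indices in {0, ..., n_tasks - 1}
    Returns theta0 of shape (b,) and thetas of shape (n_tasks, b).
    """
    n, b = Psi.shape
    m = n_tasks
    # Stacked design matrix: columns [theta0 | theta_1 | ... | theta_m]
    Phi = np.zeros((n, (m + 1) * b))
    Phi[:, :b] = Psi
    for k in range(n):
        t = task[k]
        Phi[k, (t + 1) * b:(t + 2) * b] = Psi[k]
    # Diagonal regularizer matching the penalty terms above
    reg = np.concatenate([np.full(b, lam / 2.0),
                          np.full(m * b, gamma / (2.0 * m))])
    Theta = np.linalg.solve(Phi.T @ Phi + np.diag(reg), Phi.T @ y)
    return Theta[:b], Theta[b:].reshape(m, b)
```

Taking λ large drives the common part to zero (independent learning), while taking γ large drives the individual parts to zero (a single combined model), matching the baselines discussed later in this section.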
2004; Kato et al., 2010) and a probabilistic classifier such as logistic regres-
sion (Lapedriza et al., 2007), and both were shown to improve the prediction
accuracy. Although optimization techniques for the support vector machine have been studied extensively and substantially improved (Platt, 1999; Joachims, 1999; Chang and Lin, 2001; Collobert and Bengio, 2001; Suykens et al.,
2002; Rifkin et al., 2003; Tsang et al., 2005; Fung and Mangasarian, 2005;
Fan et al., 2005; Tang and Zhang, 2006; Joachims, 2006; Teo et al., 2007;
Franc and Sonnenburg, 2009), training logistic regression classifiers for large-
scale problems is still computationally expensive (Hastie et al., 2001; Minka,
2007).
This is a convex optimization problem, and the final solution is given analytically (Simm et al., 2011), where
$$A_{k,k'} := \left(\frac{\gamma}{m\lambda} + \delta_{t_k, t_{k'}}\right)\psi(x_k)^\top \psi(x_{k'}) + \frac{\gamma n}{m}\,\delta_{k,k'},$$
$$g_k(x, t) := \left(\frac{\gamma}{m\lambda} + \delta_{t, t_k}\right)\psi(x)^\top \psi(x_k),$$
and $\delta_{t,t'}$ denotes the Kronecker delta:
$$\delta_{t,t'} = \begin{cases}1 & \text{if } t = t',\\ 0 & \text{otherwise.}\end{cases}$$
Numerical Examples
Here we evaluate experimentally the performance of multi-task learning methods.
In the first set of experiments, we use the UMIST face recognition dataset
(Graham and Allinson, 1998), which contains images of 20 different people, 575 images in total. Each face image was cropped to 112 × 92 (= 10304) pixels, and each pixel takes an 8-bit intensity value from 0 to 255.
The database contains 4 female subjects among the 20 subjects. In this experi-
ment, a male subject is chosen from the 16 male subjects for each of the 4 female
subjects, and we construct 4 binary classification tasks between male (class +1)
and female (class −1). We expect that multi-task learning captures some common
structure behind the different male–female classifiers. As inputs, the raw pixel
values of the grayscale images are directly used, that is, $x \in \mathbb{R}^{10304}$. The training
images are chosen randomly from the images of the target male and female sub-
jects, and the rest of the images are used as test samples. In each task, the numbers
of male and female samples are set to be equal for both training and testing.
We compare the classification accuracies and computation times of the multi-
task LSPC (MT-LSPC) method with the multi-task kernel logistic regression
(MT-KLR) method (Lapedriza et al., 2007) as functions of the number of train-
ing samples. As baselines, we also include their single-task counterparts: i-LSPC,
i-KLR, c-LSPC, and c-KLR, where the “i” indicates “independent,” meaning that
each task is treated independently and a classifier is trained for each task using
only samples of that task [this corresponds to setting λ in Eq. (9.5) large enough].
On the other hand, “c” denotes “combined,” meaning that all tasks are combined
together and a single common classifier is trained using samples from all tasks
[this corresponds to setting γ in Eq. (9.5) large enough].
In all six methods, MT-LSPC, MT-KLR, i-LSPC, i-KLR, c-LSPC, and c-KLR,
the Gaussian kernel
$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
is adopted as the basis function. Five-fold cross-validation (CV) with respect to
the classification accuracy is used to choose the regularization parameter λ and
the Gaussian kernel bandwidth σ . Additionally, for MT-LSPC and MT-KLR, the
multi-task parameter γ is also selected based on CV.
All the methods are implemented using MATLAB® . KLR solutions are numer-
ically computed by the limited-memory Broyden–Fletcher–Goldfarb–Shanno
(L-BFGS) method using the minFunc package (Schmidt, 2005). The experi-
ments are repeated 50 times with different random seeds, and the mean accuracy
and computation time are evaluated. The classification accuracy is summarized
in Figure 9.7, showing that both multi-task learning methods significantly out-
perform the single-task learning counterparts. The accuracies of MT-LSPC and
MT-KLR are comparable to each other. Figure 9.7 also summarizes the computation times, showing that the LSPC methods are two to three times faster than the KLR methods.
In the second set of experiments, we use the Landmine image classification
dataset (Xue et al., 2007). The Landmine dataset consists of 29 binary classification
tasks about various landmine fields. Each input sample x is a nine-dimensional
feature vector corresponding to a region of landmine fields, and the binary class
y corresponds to whether there is a landmine in that region. The feature vectors
are extracted from radar images and are the concatenation of four moment-based features,
three correlation-based features, one energy ratio feature, and one spatial variance
feature (see Xue et al., 2007, for details). The goal is to estimate whether a test
landmine field contains landmines based on the region features. In the 29 landmine
classification tasks, the first 15 tasks are highly foliated and the last 14 tasks
are regions that are bare earth or desert. All 15 highly foliated regions and the
first 2 tasks from the bare earth regions are used for experiments. We completely
reverse the class labels in the latter two datasets and evaluate the robustness of the
multi-task methods against noisy tasks.
We again compare the performance of MT-LSPC, MT-KLR, i-LSPC, i-KLR,
c-LSPC, and c-KLR. The experimental setup is the same as the previous UMIST
experiments, except that, instead of the classification accuracy, the area under the ROC curve (AUC) is adopted as the performance measure.
[Figure 9.7 panels: (a) mean classification accuracy and (b) computation time, each plotted against the number of samples per task for MT-LSPC, MT-KLR, i-LSPC, i-KLR, c-LSPC, and c-KLR.]
Figure 9.7. Experimental results for the UMIST dataset. (a) Mean accuracy over 50 runs.
“◦” indicates the best performing method or a tie with the best performance (by t-test with
1% level of significance). “×” indicates that the method is weaker than the best one. (b)
The computation time (in seconds).
[Figure 9.8 panels: (a) mean AUC and (b) computation time, each plotted against the number of samples per task for MT-LSPC, MT-KLR, i-LSPC, i-KLR, c-LSPC, and c-KLR.]
Figure 9.8. Experimental results for the Landmine dataset. (a) Mean AUC score over 50
runs. “◦” indicates the best performing method or a tie with the best performance (by
t-test with 1% level of significance). “×” indicates that the method is weaker than the best
one. (b) The computation time (in seconds).
4 To be consistent with the above performance measure, CV is also performed with respect to the AUC
score. Because the landmine datasets are highly imbalanced, the validation data in the CV procedure
can contain no landmine sample, which causes an inappropriate choice of tuning parameters. To
avoid this problem, all estimated class-posterior probabilities from different tasks are combined and
a single AUC score is calculated in the CV procedure, instead of merely taking the mean of the AUC
scores over all tasks.
9.2.5 Remarks
Learning from a small number of training samples has been an important challenge
in the machine learning community. Multi-task learning tries to overcome this
difficulty by utilizing information brought by other related tasks. In this section
we have described various multi-task learning methods by naive sample sharing,
adaptive sample sharing with importance sampling, and implicit sample sharing
using regularization.
The naive sample sharing without importance weighting is useful for overcom-
ing the ill-posedness of learning problems under very small sample sizes. The
adaptive sample sharing based on importance sampling would be theoretically
more sound, but estimating importance weights defined over both input and output
can induce another technical challenge. The implicit data sharing based on regular-
ization is a heuristic, but it can be useful in practice due to its simple formulation.
However, this makes optimization more challenging since all the task parameters
need to be learned at the same time. For regression, one may use ridge regression
(Hoerl and Kennard, 1970) as a building block for multi-task learning due to its
computational efficiency. For classification, the support vector machine (SVM)
and least-squares probabilistic classifier (LSPC) would produce computationally
efficient multi-task learning algorithms.
10
Distribution Comparison
10.1.1 Introduction
The goal of outlier detection (a.k.a. anomaly detection, novelty detection, or
one-class classification) is to find uncommon instances (“outliers”) in a given
dataset. Outlier detection has been used in various applications such as defect
detection from behavior patterns of industrial machines (Fujimaki et al., 2005;
Ide and Kashima, 2004), intrusion detection in network systems (Yamanishi et al.,
2004), and topic detection in news documents (Manevitz and Yousef, 2002).
Recent studies include finding unusual patterns in time series (Yankov et al.,
2008), discovery of spatio-temporal changes in time-evolving graphs (Chan et al.,
2008), self-propagating worms detection in information systems (Jiang and Zhu,
2009), and identification of inconsistent records in construction equipment data
(Fan et al., 2009). Because outlier detection is useful in various applications, it
has been an active research topic in statistics, machine learning, and data mining
communities for decades (Hodge and Austin, 2004).
A standard outlier-detection problem falls into the category of unsupervised
learning (see Section 1.1.2), due to a lack of prior knowledge about the “anoma-
lous data”. In contrast, Gao et al. (2006a, 2006b) addressed the problem of
semi-supervised outlier detection where some examples of outliers and inliers
are available as a training set. The semi-supervised outlier detection methods per-
form better than unsupervised methods thanks to additional label information.
However, such outlier samples for training are not always available in practice.
Furthermore, the type of outliers may be diverse, and thus the semi-supervised
methods – learning from known types of outliers – are not necessarily useful in
detecting unknown types of outliers.
In this section we address the problem of inlier-based outlier detection where
examples of inliers are available. More formally, the inlier-based outlier-detection
problem is to find outlier instances in the test set based on the training set consisting
only of inlier instances. The setting of inlier-based outlier detection is more prac-
tical than the semi-supervised setting because inlier samples are often available
in abundance. For example, in defect detection of industrial machines, we know
that there was no outlier (i.e., no defect) in the past because no failure has been
observed in the machinery. Therefore, it is reasonable to separate the measurement
data into a training set consisting only of inlier samples observed in the past and
the test set consisting of recent samples from which we would like to find outliers.
As opposed to supervised learning, the outlier detection problem is often vague
and it may not be possible to universally define what the outliers are. Here we
consider a statistical framework and regard instances with low probability densities
as outliers. In light of inlier-based outlier detection, outliers may be identified via
density estimation of inlier samples. However, density estimation is known to be
a hard task particularly in high-dimensional problems, and thus outlier detection
via density estimation may not work well in practice.
To avoid density estimation, one can use a one-class support vector machine
(OSVM; Schölkopf et al., 2001) or support vector data description (SVDD;
Tax and Duin, 2004), which finds an inlier region containing a certain fraction
of training instances; samples outside the inlier region are regarded as outliers.
However, these methods cannot make use of inlier information available in the
inlier-based settings. Furthermore, the solutions of OSVM and SVDD depend
heavily on the choice of tuning parameters (e.g., the Gaussian kernel bandwidth),
and there seems to be no reasonable method to appropriately determine the values
of such tuning parameters.
10.1.2 Formulation
Suppose we have two sets of samples: training samples $\{x^{\mathrm{tr}}_i\}_{i=1}^{n_{\mathrm{tr}}}$ and test samples $\{x^{\mathrm{te}}_j\}_{j=1}^{n_{\mathrm{te}}}$ in $\mathbb{R}^d$. The training samples $\{x^{\mathrm{tr}}_i\}_{i=1}^{n_{\mathrm{tr}}}$ are all inliers, while the test samples $\{x^{\mathrm{te}}_j\}_{j=1}^{n_{\mathrm{te}}}$ can contain some outliers. The goal of outlier detection here is to identify outliers in the test set based on the training set consisting only of inliers. More formally, we want to assign a suitable outlier score to each test sample: the larger the outlier score, the more plausible it is that the sample is an outlier.
Let us consider a statistical framework of the inlier-based outlier-detection problem: Suppose training samples $\{x^{\mathrm{tr}}_i\}_{i=1}^{n_{\mathrm{tr}}}$ are drawn independently from a training data distribution with density $p^*_{\mathrm{tr}}(x)$, and test samples $\{x^{\mathrm{te}}_j\}_{j=1}^{n_{\mathrm{te}}}$ are drawn independently from a test data distribution with strictly positive density $p^*_{\mathrm{te}}(x)$. Within this statistical framework, test samples with low training data densities are regarded as outliers. However, the true density $p^*_{\mathrm{tr}}(x)$ is not accessible in practice, and estimating densities is known to be a hard problem. Therefore, merely using the training data density as an outlier score may not be reliable in practice.
So instead we employ the ratio of the training and test data densities as an outlier score$^1$:
$$r^*(x) = \frac{p^*_{\mathrm{tr}}(x)}{p^*_{\mathrm{te}}(x)}.$$
If no outlier sample exists in the test set (i.e., the training and test data densities
are equivalent), the density-ratio value is one. On the other hand, the density-ratio
value tends to be small in the regions where the training data density is low and
the test data density is high. Thus, samples with small density-ratio values are plausibly outliers.
1 Note that the definition of the density-ratio is inverted compared with the one used in covariate shift
adaptation (see Chapter 9). We chose this definition because the test data domain may be wider than
the training data domain in the context of inlier-based outlier detection.
One may suspect that this density-ratio approach is not suitable when there exist only a small number of outliers, because a small number of outliers cannot increase the value of $p^*_{\mathrm{te}}(x)$ significantly. However, this is not a problem, because outliers are drawn from a region with small $p^*_{\mathrm{tr}}(x)$, and therefore even a small change in $p^*_{\mathrm{te}}(x)$ significantly reduces the density-ratio value. For example, let the increase of $p^*_{\mathrm{te}}(x)$ be $\epsilon = 0.01$; then $\frac{1}{1+\epsilon} \approx 1$, but $\frac{0.001}{0.001+\epsilon} \ll 1$. Thus, the density ratio $r^*(x)$ would be a suitable outlier score.
A MATLAB® implementation of inlier-based outlier detection based on KLIEP
(see Section 5.2.1), called maximum likelihood outlier detection (MLOD), is avail-
able from http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/MLOD/. Similarly,
a MATLAB® implementation of inlier-based outlier detection based on uLSIF
(see Section 6.2.2), called least-squares outlier detection (LSOD), is available
from http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/LSOD/.
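The following is a compact NumPy sketch of the uLSIF-style computation underlying such inlier-based outlier scores (it is not the MLOD/LSOD packages themselves). The kernel width, regularization parameter, and number of centers below are placeholders that would be chosen by cross-validation, as described in Part II.

```python
import numpy as np

def gauss_kernel(X, C, sigma):
    """Gaussian kernel matrix K[i, l] = exp(-||X_i - C_l||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ulsif_outlier_scores(X_tr, X_te, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """uLSIF-style estimate of r(x) = p_tr(x) / p_te(x) evaluated on the test set.
    Smaller scores indicate more plausible outliers."""
    rng = np.random.default_rng(seed)
    centers = X_tr[rng.choice(len(X_tr), size=min(n_centers, len(X_tr)),
                              replace=False)]
    Psi_tr = gauss_kernel(X_tr, centers, sigma)   # basis values on inliers
    Psi_te = gauss_kernel(X_te, centers, sigma)   # basis values on the test set
    H = Psi_te.T @ Psi_te / len(X_te)             # denominator (test) moments
    h = Psi_tr.mean(axis=0)                       # numerator (training) moments
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    theta = np.maximum(theta, 0.0)                # round negative coefficients up to zero
    return Psi_te @ theta
```

Ranking the test samples by these scores in ascending order then yields the outlier candidates.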
$$(x_1, \ldots, x_n) = (x^{\mathrm{tr}}_1, \ldots, x^{\mathrm{tr}}_{n_{\mathrm{tr}}}, x^{\mathrm{te}}_1, \ldots, x^{\mathrm{te}}_{n_{\mathrm{te}}}),$$
the hold-out samples (see Section 2.3.1 for details). Note that this CV procedure corresponds to choosing $\sigma$ such that the Kullback–Leibler divergence from $p^*(x)$ to $\widehat{p}(x)$ is minimized. The estimated density values could be used directly as an outlier score. A variation of the KDE approach has been studied in Latecki et al. (2007), where local outliers are detected from multi-modal datasets.
However, KDE is known to suffer from the curse of dimensionality (Vapnik,
1998), and therefore the KDE-based outlier-detection method may not be reliable
in practice.
The density-ratio can also be estimated by KDE, that is, first estimating the
training and test data densities separately and then taking the ratio of the estimated
densities (see Section 2.3). However, the estimation error tends to be accumulated
in this two-step procedure, and thus this approach is not reliable.
using the ratio of the average distance from the nearest neighbors as
$$\mathrm{LOF}_T(x) = \frac{1}{T}\sum_{t=1}^{T}\frac{\mathrm{LRD}_T(\mathrm{nearest}_t(x))}{\mathrm{LRD}_T(x)},$$
where $\mathrm{nearest}_t(x)$ represents the $t$-th nearest neighbor of $x$ and $\mathrm{LRD}_T(x)$ denotes the inverse of the average distance from the $T$ nearest neighbors of $x$ (LRD stands for local reachability density). If $x$ lies around a high-density region and its nearest neighbor samples are close to each other in the high-density region, $\mathrm{LRD}_T(x)$ tends to become much smaller than $\mathrm{LRD}_T(\mathrm{nearest}_t(x))$ for every $t$. Then $\mathrm{LOF}_T(x)$ takes a large value and $x$ is regarded as a local outlier.
Although the LOF values seem to be a reasonable outlier measure, its perfor-
mance strongly depends on the choice of the locality parameter T . Unfortunately,
there is no systematic method to select an appropriate value for T , and thus subjec-
tive tuning is necessary, which is not reliable in unsupervised outlier detection. In
addition, the computational cost of the LOF score is expensive because it involves
a number of nearest neighbor search procedures.
10.1.4 Experiments
Here, the performances of various outlier-detection methods are compared
experimentally. In all the experiments, the statistical language environment R
(R Development Core Team, 2009) is used. The comparison includes the following
density-ratio estimators:
In addition, the native outlier-detection methods KDE, OSVM, and LOF reviewed in Section 10.1.3 are included in the comparison. A package implementing the limited-memory Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method, called optim, is used for computing the KLR solution, and a quadratic program solver called ipop contained in the kernlab package (Karatzoglou et al., 2004) is used for computing the KMM solution. The ksvm function contained in the kernlab package is used as an OSVM implementation, and the lofactor function included in the dprep package (Fernandez, 2005) is used as an LOF implementation.
Twelve datasets taken from the IDA Benchmark Repository (Rätsch et al.,
2001) are used for experiments. Note that they are originally binary classifica-
tion datasets – here, the positive samples are regarded as inliers and the negative
samples are treated as outliers. All the negative samples are removed from the
training set; that is, the training set contains only inlier samples. In contrast, a
fraction ρ of randomly chosen negative samples are retained in the test set; that is,
the test set includes all inlier samples and some outliers.
When evaluating the performance of outlier-detection algorithms, it is important to take into account both the detection rate (the fraction of true outliers that an outlier-detection algorithm can find) and the false-detection rate (the fraction of true inliers that it misjudges as outliers). Because there is a trade-off between these two quantities, the area under the ROC curve (AUC; Bradley, 1997) is adopted as the performance metric.
The AUC values of the density-ratio–based methods (KMM, KLR, KLIEP,
and uLSIF) and other methods (KDE, OSVM, and LOF) are compared. All the
tuning parameters included in KLR, KLIEP, uLSIF, and KDE are chosen based
on cross-validation (CV) from a wide range of values. CV is not available to
KMM, OSVM, and LOF; the Gaussian kernel width in KMM and OSVM is set to
the median distance between samples, which was shown to be a useful heuristic
(Schölkopf and Smola, 2002). For KMM, the other tuning parameters are fixed to
$B = 1000$ and $\epsilon = (\sqrt{n_{\mathrm{te}}} - 1)/\sqrt{n_{\mathrm{te}}}$, following Huang et al. (2007). For OSVM,
the tuning parameter is set to ν = 0.1. The number of basis functions in KLIEP and
uLSIF is fixed to b = 100 (note that b can also be optimized by CV if necessary).
For LOF, three candidate values 5, 30, and 50 are tested as the number T of nearest
neighbors.
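For reference, the median-distance heuristic used above for KMM and OSVM can be computed as in the following small sketch; the subsampling and the function name are our own choices.

```python
import numpy as np

def median_distance(X, max_samples=500, seed=0):
    """Median pairwise Euclidean distance between (a subsample of) the rows of X,
    a common heuristic value for the Gaussian kernel width."""
    rng = np.random.default_rng(seed)
    if len(X) > max_samples:
        X = X[rng.choice(len(X), size=max_samples, replace=False)]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    iu = np.triu_indices(len(X), k=1)
    return float(np.median(np.sqrt(d2[iu])))
```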
The mean AUC values over 20 trials as well as the computation time are sum-
marized in Table 10.1, where the computation time is normalized so that uLSIF
is one. Because the types of outliers may be diverse depending on the datasets,
no single method may consistently outperform the others for all the datasets. To
evaluate the overall performance, the average AUC values over all datasets are
also described at the bottom of Table 10.1.
The results show that KLIEP is the most accurate on the whole, and uLSIF
follows with a small margin. Because KLIEP can provide a more sensitive outlier
score than uLSIF (see Section 7.4.2), this would be a reasonable result. On the
other hand, uLSIF is computationally much more efficient than KLIEP. KLR works
reasonably well overall, but it performs poorly for some datasets such as the splice,
twonorm, and waveform datasets, and the average AUC performance is not as good
as uLSIF and KLIEP.
KMM and OSVM are not comparable to uLSIF in terms of AUC, and they are
computationally less efficient. Note that we also tested KMM and OSVM with
several different Gaussian widths and experimentally found that the heuristic of
using the median sample distance as the Gaussian kernel width works reasonably
well in this experiment. Thus, the AUC values of KMM and OSVM are close to
their optimal values.
LOF with large T is shown to work well, although it is not clear whether the
heuristic of simply using large T is always appropriate. In fact, the average AUC
values of LOF are slightly higher for T = 30 than T = 50, and there is no systematic
way to choose the optimal value for T . LOF is very slow because nearest neighbor
search is computationally expensive.
Table 10.1. Mean AUC values over 20 trials for the benchmark datasets

Dataset    ρ     uLSIF  KLIEP  KLR    KMM    OSVM   LOF    LOF    LOF    KDE
                 (CV)   (CV)   (CV)   (med)  (med)  (T=5)  (T=30) (T=50) (CV)
banana     0.01  0.851  0.815  0.447  0.578  0.360  0.838  0.915  0.919  0.934
banana     0.02  0.858  0.824  0.428  0.644  0.412  0.813  0.918  0.920  0.927
banana     0.05  0.869  0.851  0.435  0.761  0.467  0.786  0.907  0.909  0.923
b-cancer   0.01  0.463  0.480  0.627  0.576  0.508  0.546  0.488  0.463  0.400
b-cancer   0.02  0.463  0.480  0.627  0.576  0.506  0.521  0.445  0.428  0.400
b-cancer   0.05  0.463  0.480  0.627  0.576  0.498  0.549  0.480  0.452  0.400
diabetes   0.01  0.558  0.615  0.599  0.574  0.563  0.513  0.403  0.390  0.425
diabetes   0.02  0.558  0.615  0.599  0.574  0.563  0.526  0.453  0.434  0.425
diabetes   0.05  0.532  0.590  0.636  0.547  0.545  0.536  0.461  0.447  0.435
f-solar    0.01  0.416  0.485  0.438  0.494  0.522  0.480  0.441  0.385  0.378
f-solar    0.02  0.426  0.456  0.432  0.480  0.550  0.442  0.406  0.343  0.374
f-solar    0.05  0.442  0.479  0.432  0.532  0.576  0.455  0.417  0.370  0.346
german     0.01  0.574  0.572  0.556  0.529  0.535  0.526  0.559  0.552  0.561
german     0.02  0.574  0.572  0.556  0.529  0.535  0.553  0.549  0.544  0.561
german     0.05  0.564  0.555  0.540  0.532  0.530  0.548  0.571  0.555  0.547
heart      0.01  0.659  0.647  0.833  0.623  0.681  0.407  0.659  0.739  0.638
heart      0.02  0.659  0.647  0.833  0.623  0.678  0.428  0.668  0.746  0.638
heart      0.05  0.659  0.647  0.833  0.623  0.681  0.440  0.666  0.749  0.638
satimage   0.01  0.812  0.828  0.600  0.813  0.540  0.909  0.930  0.896  0.916
satimage   0.02  0.829  0.847  0.632  0.861  0.548  0.785  0.919  0.880  0.898
satimage   0.05  0.841  0.858  0.715  0.893  0.536  0.712  0.895  0.868  0.892
splice     0.01  0.713  0.748  0.368  0.541  0.737  0.765  0.778  0.768  0.845
splice     0.02  0.754  0.765  0.343  0.588  0.744  0.761  0.793  0.783  0.848
splice     0.05  0.734  0.764  0.377  0.643  0.723  0.764  0.785  0.777  0.849
thyroid    0.01  0.534  0.720  0.745  0.681  0.504  0.259  0.111  0.071  0.256
thyroid    0.02  0.534  0.720  0.745  0.681  0.505  0.259  0.111  0.071  0.256
thyroid    0.05  0.534  0.720  0.745  0.681  0.485  0.259  0.111  0.071  0.256
titanic    0.01  0.525  0.534  0.602  0.502  0.456  0.520  0.525  0.525  0.461
titanic    0.02  0.496  0.498  0.659  0.513  0.526  0.492  0.503  0.503  0.472
titanic    0.05  0.526  0.521  0.644  0.538  0.505  0.499  0.512  0.512  0.433
twonorm    0.01  0.905  0.902  0.161  0.439  0.846  0.812  0.889  0.897  0.875
twonorm    0.02  0.896  0.889  0.197  0.572  0.821  0.803  0.892  0.901  0.858
twonorm    0.05  0.905  0.903  0.396  0.754  0.781  0.765  0.858  0.874  0.807
waveform   0.01  0.890  0.881  0.243  0.477  0.861  0.724  0.887  0.889  0.861
waveform   0.02  0.901  0.890  0.181  0.602  0.817  0.690  0.887  0.890  0.861
waveform   0.05  0.885  0.873  0.236  0.757  0.798  0.705  0.847  0.874  0.831
Average          0.661  0.685  0.530  0.608  0.596  0.594  0.629  0.622  0.623
KDE sometimes works reasonably well, but the performance fluctuates depend-
ing on the dataset. Therefore, its average AUC value is not comparable to uLSIF
and KLIEP.
Overall, the uLSIF-based and KLIEP-based methods are shown to be promising
in inlier-based outlier detection.
10.1.5 Remarks
In this section we discussed the problem of inlier-based outlier detection. Because
inlier information can be taken into account in this approach, it tends to outperform
unsupervised outlier-detection methods (if such inlier information is available).
Furthermore, thanks to the density-ratio formulation, model selection is possible
by cross-validation over the density-ratio approximation error. This is a significant
advantage over purely unsupervised approaches. An inlier-based outlier-detection
method based on the hinge-loss was also studied (see Smola et al., 2009).
The goal of change detection (a.k.a. event detection) is to identify time points
at which properties of time series data change (Basseville and Nikiforov, 1993;
Brodsky and Darkhovsky, 1993; Guralnik and Srivastava, 1999; Gustafsson,
2000; Yamanishi and Takeuchi, 2002; Ide and Kashima, 2004; Kifer et al., 2004).
Change detection covers a broad range of real-world problems such as fraud
detection in cellular systems (Murad and Pinkas, 1999; Bolton and Hand, 2002),
intrusion detection in computer networks (Yamanishi et al., 2004), irregular-
motion detection in vision systems (Ke et al., 2007), signal segmentation in data
streams (Basseville and Nikiforov, 1993), and fault detection in engineering sys-
tems (Fujimaki et al., 2005). If vectorial samples are extracted from time series
data in a sliding-window manner, one can apply density-ratio methods to change
detection (Kawahara and Sugiyama, 2011).
10.2.1 Introduction
The two-sample test is useful in various practically important learning scenarios:
• When learning is performed in a non-stationary environment, for exam-
ple, in brain–computer interfaces (Sugiyama et al., 2007) and robot control
(Hachiya et al., 2009), testing the homogeneity of the data-generating distri-
butions allows one to determine whether some adaptation scheme should be
used or not. When the distributions are not significantly different, one can
avoid using data-intensive non-stationarity adaptation techniques (such as
covariate shift adaptation explained in Section 9.1). This can significantly
contribute to stabilizing the performance.
• When multiple sets of data samples are available for learning, for exam-
ple, biological experimental results obtained from different laboratories
(Borgwardt et al., 2006), the homogeneity test allows one to make a deci-
sion as to whether all the datasets are analyzed jointly as a single dataset or
they should be treated separately.
• Similarly, one can use the homogeneity test for deciding whether multi-task
learning methods (Caruana et al., 1997, see also Section 9.2) are employed.
The rationale behind multi-task learning is that when several related learning
tasks are provided, solving them simultaneously can give better solutions than
solving them individually. However, when the tasks are not similar to each
other, using multi-task learning techniques can degrade the performance.
Thus, it is important to avoid using multi-task learning methods when the
tasks are not similar to each other. This may be achieved by testing the
homogeneity of the datasets.
The t-test (Student, 1908) is a classical method for testing homogeneity that
compares the means of two Gaussian distributions with a common variance; its
multi-variate extension also exists (Hotelling, 1951). Although the t-test is a fun-
damental method for comparing the means, its range of application is limited to
Gaussian distributions with a common variance, which may not be fulfilled in
practical applications.
The Kolmogorov–Smirnov test and the Wald–Wolfowitz runs test are classi-
cal non-parametric methods for the two-sample problem; their multi-dimensional
variants have also been developed (Bickel, 1969; Friedman and Rafsky, 1979).
Since then, different types of non-parametric test methods have been studied (e.g.,
Anderson et al., 1994; Li, 1996).
Recently, a non-parametric extension of the t-test called maximum mean dis-
crepancy (MMD) was proposed (Borgwardt et al., 2006; Gretton et al., 2007). The
MMD compares the means of two distributions in a universal reproducing kernel
Hilbert space (universal RKHS; Steinwart, 2001) – the Gaussian kernel is a typical
example that induces a universal RKHS. The MMD does not require a restrictive
parametric assumption, and hence it could be a flexible alternative to the t-test. The
MMD was shown experimentally to outperform alternative homogeneity tests such
as the generalized Kolmogorov–Smirnov test (Friedman and Rafsky, 1979), the
generalized Wald–Wolfowitz test (Friedman and Rafsky, 1979), the Hall–Tajvidi
test (Hall and Tajvidi, 2002), and the Biau–Györfi test (Biau and Györfi, 2005).
The performance of the MMD depends on the choice of universal RKHSs (e.g.,
the Gaussian bandwidth in the case of Gaussian RKHSs). Thus, the universal
RKHS should be chosen carefully for obtaining good performance. The Gaussian
RKHS with bandwidth set to the median distance between samples has been a pop-
ular heuristic in practice (Borgwardt et al., 2006; Gretton et al., 2007). Recently, a
novel idea of using the universal RKHS (or the Gaussian widths) yielding the max-
imum MMD value was introduced and shown to work well (Sriperumbudur et al.,
2009).
Another approach to the two-sample problem is to evaluate a divergence
between two distributions. The divergence-based approach is advantageous in that
cross-validation over the divergence functional is available for optimizing tuning
parameters in a data-dependent manner. A typical choice of the divergence func-
tional would be the f -divergences (Ali and Silvey, 1966; Csiszár, 1967), which
include the Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951) and
the Pearson (PE) divergence (Pearson, 1900) as special cases.
Various methods for estimating the divergence functional have been studied
so far (e.g., Darbellay and Vajda, 1999; Wang et al., 2005; Silva and Narayanan,
2007; Pérez-Cruz, 2008). Among them, approaches based on density-ratio esti-
mation have been shown to be promising both theoretically and experimentally
(Sugiyama et al., 2008; Gretton et al., 2009; Kanamori et al., 2009; Nguyen et al.,
2010).
A parametric density-ratio estimator based on logistic regression (Qin,
1998; Cheng and Chu, 2004) has been applied to the homogeneity test
(Keziou and Leoni-Aubin, 2005). Although the density-ratio estimator based on
logistic regression was proved to achieve the smallest asymptotic variance among
a class of semi-parametric estimators (Qin, 1998, see also Section 13.3), this the-
oretical guarantee is valid only when the parametric model is correctly specified
(i.e., the target density-ratio is included in the parametric model at hand). However,
when this unrealistic assumption is violated, a divergence-based density-ratio esti-
mator (Sugiyama et al., 2008; Nguyen et al., 2010) was shown to perform better
(Kanamori et al., 2010).
Among various divergence-based density-ratio estimators, unconstrained least-
squares importance fitting (uLSIF) was demonstrated to be accurate and compu-
tationally efficient (Kanamori et al., 2009; see also Section 6.2.2). Furthermore,
uLSIF was proved to possess the optimal non-parametric convergence rate
(Kanamori et al., 2011b; see also Section 14.3) and optimal numerical stabil-
ity (Kanamori et al., 2011c; see also Chapter 16). In this section, we describe a
method for testing the homogeneity based on uLSIF.
Similarly to the MMD, the uLSIF-based homogeneity test processes data sam-
ples only through kernel functions. Thus, the uLSIF-based method can be used
for testing the homogeneity of non-vectorial structured objects such as strings,
trees, and graphs by employing kernel functions defined for such structured
data (Lodhi et al., 2002; Duffy and Collins, 2002; Kashima and Koyanagi, 2002;
Kondor and Lafferty, 2002; Kashima et al., 2003; Gärtner et al., 2003; Gärtner,
2003). This is an advantage over traditional two-sample tests.
$$\widehat{r}(x) = \sum_{\ell=1}^{b} \widehat{\theta}_\ell\, \psi_\ell(x),$$
$$\widehat{\mathrm{PE}}(X, X') := \frac{1}{2n}\sum_{i=1}^{n} \widehat{r}(x_i) - \frac{1}{n'}\sum_{j=1}^{n'} \widehat{r}(x'_j) + \frac{1}{2} = \frac{1}{2}\widehat{h}^\top\widehat{\theta} - \widehat{h}'^\top\widehat{\theta} + \frac{1}{2},$$
where $\widehat{h}' := \frac{1}{n'}\sum_{j=1}^{n'} \psi(x'_j)$, with $\psi(x) = (\psi_1(x), \ldots, \psi_b(x))^\top$.
Note that $\widehat{\mathrm{PE}}(X, X')$ can take a negative value, although the true $\mathrm{PE}(P^*, {P'}^*)$ is non-negative by definition. Thus, the estimation accuracy of $\widehat{\mathrm{PE}}(X, X')$ could be improved by rounding up a negative estimate to zero. However, we do not employ this rounding-up strategy here, because we are interested in the relative ranking of the divergence estimates, as explained in Section 10.2.4.
Permutation Test
The two-sample test described here is based on the permutation test
(Efron and Tibshirani, 1993).
We first run the uLSIF-based PE divergence estimation procedure using the original datasets $X$ and $X'$ and obtain a PE divergence estimate $\widehat{\mathrm{PE}}(X, X')$. Next, we randomly permute the $|X \cup X'|$ samples, assign the first $|X|$ samples to a set $\tilde{X}$, and assign the remaining $|X'|$ samples to another set $\tilde{X}'$ (see Figure 10.2). Then we run the uLSIF-based PE divergence estimation procedure again using the randomly shuffled datasets $\tilde{X}$ and $\tilde{X}'$ and obtain a PE divergence estimate $\widehat{\mathrm{PE}}(\tilde{X}, \tilde{X}')$. Note that $\tilde{X}$ and $\tilde{X}'$ can be regarded as being drawn from the same distribution. Thus, $\widehat{\mathrm{PE}}(\tilde{X}, \tilde{X}')$ would take a value close to zero. This random shuffling procedure is repeated many times, and the distribution of $\widehat{\mathrm{PE}}(\tilde{X}, \tilde{X}')$ under the null hypothesis (i.e., the two distributions are the same) is constructed. Finally, the p-value is approximated by evaluating the relative ranking of $\widehat{\mathrm{PE}}(X, X')$ in the distribution of $\widehat{\mathrm{PE}}(\tilde{X}, \tilde{X}')$.
[Figure 10.2: the pooled samples $X \cup X'$ are randomly re-assigned to the shuffled sets $\tilde{X}$ and $\tilde{X}'$.]
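The permutation procedure itself is independent of the particular divergence estimator. A minimal NumPy sketch, which takes any estimator (for example, the uLSIF-based $\widehat{\mathrm{PE}}$ above) as a black box, is given below; the function name and the number of permutations are illustrative.

```python
import numpy as np

def permutation_test(X, Xp, divergence_estimate, n_perm=200, seed=0):
    """Permutation test for homogeneity based on a divergence estimator.

    X, Xp               : 2-D sample arrays of shapes (n, d) and (n', d)
    divergence_estimate : callable (X, Xp) -> float, e.g., a uLSIF-based
                          Pearson-divergence estimate between the two samples
    Returns the approximate p-value: the fraction of permuted splits whose
    divergence estimate exceeds the one computed from the original split.
    """
    rng = np.random.default_rng(seed)
    stat = divergence_estimate(X, Xp)
    pooled = np.vstack([X, Xp])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if divergence_estimate(pooled[idx[:n]], pooled[idx[n:]]) > stat:
            count += 1
    return count / n_perm
```

Because the permutations are independent of each other, the loop parallelizes trivially.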
Numerical Examples
Let the number of samples be $n = n' = 500$, and
$$X = \{x_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P^* = N(0, 1),$$
$$X' = \{x'_j\}_{j=1}^{n'} \overset{\text{i.i.d.}}{\sim} {P'}^* = N(\mu, \sigma^2),$$
where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. We consider the following four setups:
(a) $(\mu, \sigma) = (0, 1.3)$: ${P'}^*$ has a larger standard deviation than $P^*$.
(b) $(\mu, \sigma) = (0, 0.7)$: ${P'}^*$ has a smaller standard deviation than $P^*$.
(c) $(\mu, \sigma) = (0.3, 1)$: $P^*$ and ${P'}^*$ have different means.
(d) $(\mu, \sigma) = (0, 1)$: $P^*$ and ${P'}^*$ are the same.
[Figure 10.4 panels: (a) (µ, σ) = (0, 1.3); (b) (µ, σ) = (0, 0.7); (c) (µ, σ) = (0.3, 1); (d) (µ, σ) = (0, 1).]
Figure 10.4. Histograms of $\widehat{\mathrm{PE}}(\tilde{X}, \tilde{X}')$ (i.e., shuffled datasets) for the toy dataset. “×” indicates the value of $\widehat{\mathrm{PE}}(X, X')$ (i.e., the original datasets).
Figure 10.4 depicts histograms of $\widehat{\mathrm{PE}}(\tilde{X}, \tilde{X}')$ (i.e., shuffled datasets), showing that the profiles of the null distribution (i.e., the two distributions are the same) are rather similar to each other for the four cases. The values of $\widehat{\mathrm{PE}}(X, X')$ (i.e., the original datasets) are also plotted in Figure 10.4 using the “×” symbol on the horizontal axis, showing that the p-values tend to be small when $P^* \neq {P'}^*$ and the p-value is large when $P^* = {P'}^*$. This is desirable behavior for a test.
Figure 10.5 depicts the means and standard deviations of p-values over 100 runs as functions of the sample size $n$ ($= n'$), indicated by “plain.” The graphs show that, when $P^* \neq {P'}^*$, the p-values tend to decrease as $n$ increases. On the other hand, when $P^* = {P'}^*$, the p-values are almost unchanged and kept at relatively large values.
Figure 10.6 depicts the rate of accepting the null hypothesis (i.e., $P^* = {P'}^*$) over 100 runs when the significance level is set to 0.05 (i.e., the rate of p-values larger than 0.05). The graphs show that, when $P^* \neq {P'}^*$, the null hypothesis tends to be more frequently rejected as $n$ increases. On the other hand, when $P^* = {P'}^*$, the null hypothesis is almost always accepted. Thus, the LSTT was shown to work properly for these toy datasets.
[Figure 10.5 panels: (a) (µ, σ) = (0, 1.3); (b) (µ, σ) = (0, 0.7); (c) (µ, σ) = (0.3, 1); (d) (µ, σ) = (0, 1); each panel plots the “plain,” “reciprocal,” and “adaptive” methods against the sample size n.]
Figure 10.5. Means and standard deviations of p-values for the toy dataset.
[Figure 10.6 panels: (a) (µ, σ) = (0, 1.3); (b) (µ, σ) = (0, 0.7); (c) (µ, σ) = (0.3, 1); (d) (µ, σ) = (0, 1); each panel plots the “plain,” “reciprocal,” and “adaptive” methods against the sample size n.]
Figure 10.6. The rate of accepting the null hypothesis (i.e., $P^* = {P'}^*$) for the toy dataset under the significance level 0.05.
is also a density-ratio, assuming that $p^*(x) > 0$ for all $x$. This means that we can use uLSIF in two ways, estimating either the original density-ratio $r^*(x)$ or its reciprocal $1/r^*(x)$.
To illustrate this difference, we perform the same experiments as previously, but swap $X$ and $X'$. The obtained p-values and acceptance rates are also included in Figures 10.5 and 10.6 as “reciprocal.” In the experiments, we prefer to have smaller p-values when $P^* \neq {P'}^*$ and larger p-values when $P^* = {P'}^*$. The graphs show that, when $(\mu, \sigma) = (0, 1.3)$, estimating the inverted density-ratio gives slightly smaller p-values and a significantly lower acceptance rate. On the other hand, when $(\mu, \sigma) = (0, 0.7)$, reciprocal estimation yields larger p-values and a significantly higher acceptance rate. When $(\mu, \sigma) = (0.3, 1)$ and $(\mu, \sigma) = (0, 1)$, the “plain” and “reciprocal” methods result in similar p-values and thus similar acceptance rates.
Figure 10.5 showed that, when $P^* = {P'}^*$ [i.e., $(\mu, \sigma) = (0, 1)$], the p-values are large enough to accept the null hypothesis for both the plain and reciprocal approaches. Thus, the type I error (the rejection rate of correct null hypotheses, i.e., the two distributions are judged to be different when they are actually the same) would be sufficiently small for both approaches, as illustrated in Figure 10.6.
Based on this observation, a strategy of choosing a smaller p-value between
the plain and reciprocal approaches was proposed (Sugiyama et al., 2011c). This
allows one to reduce the type II error (the acceptance rate of incorrect null hypothe-
ses, i.e., two distributions are judged to be the same when they are actually
different), and thus the power of the test can be enhanced.
The experimental results of this adaptive method are also included in
Figures 10.5 and 10.6 as “adaptive.” The results show that p-values obtained by
the adaptive method are smaller than those obtained by the plain and reciprocal approaches. This provides significant performance improvement when $P^* \neq {P'}^*$.
On the other hand, smaller p-values can be problematic when P ∗ = P ∗ because
the acceptance rate can be lowered. However, as the experimental results show,
the p-values are still large enough to accept the null hypothesis, and thus there is
no critical performance degradation in this illustrative example.
A pseudo code of the “adaptive” LSTT method is summarized in Figures 10.7
and 10.8. Although the permutation test is computationally intensive, it can be
easily parallelized using multi-processors/cores.
A MATLAB® implementation of LSTT is available from http://sugiyama-www.cs.
titech.ac.jp/˜sugi/software/LSTT/.
Input: Two sets of samples $X = \{x_i\}_{i=1}^{n}$ and $X' = \{x'_j\}_{j=1}^{n'}$
Output: p-value $\widehat{p}$

$p_0 \leftarrow \widehat{\mathrm{PE}}(X, X')$;  $p'_0 \leftarrow \widehat{\mathrm{PE}}(X', X)$;
For $t = 1, \ldots, T$
  Randomly split $X \cup X'$ into $\tilde{X}$ of size $|X|$ and $\tilde{X}'$ of size $|X'|$;
  $p_t \leftarrow \widehat{\mathrm{PE}}(\tilde{X}, \tilde{X}')$;  $p'_t \leftarrow \widehat{\mathrm{PE}}(\tilde{X}', \tilde{X})$;
End
$p \leftarrow \frac{1}{T}\sum_{t=1}^{T} I(p_t > p_0)$;  $p' \leftarrow \frac{1}{T}\sum_{t=1}^{T} I(p'_t > p'_0)$;
$\widehat{p} \leftarrow \min(p, p')$;

Figure 10.7. Pseudo code of LSTT. Pseudo code of $\widehat{\mathrm{PE}}(X, X')$ is given in Figure 10.8. $I(c)$ denotes the indicator function, i.e., $I(c) = 1$ if the condition $c$ is true; otherwise $I(c) = 0$. When $|X| = |X'|$ (i.e., $n = n'$), $p'_t \leftarrow \widehat{\mathrm{PE}}(\tilde{X}', \tilde{X})$ may be replaced by $p'_t \leftarrow p_t$ because switching $\tilde{X}$ and $\tilde{X}'$ does not essentially affect the estimation of the PE divergence.
Input: Two sets of samples $X = \{x_i\}_{i=1}^{n}$ and $X' = \{x'_j\}_{j=1}^{n'}$
Output: PE divergence estimate $\widehat{\mathrm{PE}}(X, X')$
$$\mathrm{MMD}(\mathcal{H}, P^*, {P'}^*) = \sup_{\|f\|_{\mathcal{H}} \le 1}\left(\int \langle f, K(\cdot, x)\rangle_{\mathcal{H}}\, p^*(x)\mathrm{d}x - \int \langle f, K(\cdot, x)\rangle_{\mathcal{H}}\, {p'}^*(x)\mathrm{d}x\right)$$
$$= \sup_{\|f\|_{\mathcal{H}} \le 1}\left\langle f, \int K(\cdot, x)\, p^*(x)\mathrm{d}x - \int K(\cdot, x)\, {p'}^*(x)\mathrm{d}x\right\rangle_{\mathcal{H}}$$
$$= \left\| \int K(\cdot, x)\, p^*(x)\mathrm{d}x - \int K(\cdot, x)\, {p'}^*(x)\mathrm{d}x \right\|_{\mathcal{H}},$$
where the Cauchy–Schwarz inequality (Bachman and Narici, 2000) was used in the last equality. Furthermore, by using $K(x, x') = \langle K(\cdot, x), K(\cdot, x')\rangle_{\mathcal{H}}$, the squared MMD can be expressed as
$$\mathrm{MMD}^2(\mathcal{H}, P^*, {P'}^*) = \left\| \int K(\cdot, x)\, p^*(x)\mathrm{d}x - \int K(\cdot, x)\, {p'}^*(x)\mathrm{d}x \right\|_{\mathcal{H}}^2$$
$$= \iint K(x, \tilde{x})\, p^*(x) p^*(\tilde{x})\, \mathrm{d}x\mathrm{d}\tilde{x} + \iint K(x', \tilde{x}')\, {p'}^*(x') {p'}^*(\tilde{x}')\, \mathrm{d}x'\mathrm{d}\tilde{x}' - 2 \iint K(x, x')\, p^*(x) {p'}^*(x')\, \mathrm{d}x\mathrm{d}x'.$$
$$\widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X') := \frac{1}{n^2}\sum_{i,i'=1}^{n} K(x_i, x_{i'}) + \frac{1}{n'^2}\sum_{j,j'=1}^{n'} K(x'_j, x'_{j'}) - \frac{2}{nn'}\sum_{i=1}^{n}\sum_{j=1}^{n'} K(x_i, x'_j).$$
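For concreteness, the empirical squared MMD above can be computed in a few lines. The following NumPy sketch uses a Gaussian kernel with a user-supplied width; the width and the function name are illustrative, and the RKHS would in practice be chosen by the median heuristic or by maximizing the MMD value, as discussed below.

```python
import numpy as np

def mmd2(X, Xp, sigma=1.0):
    """Empirical (biased) squared MMD between samples X and Xp
    with a Gaussian kernel of width sigma."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    # Means over all kernel entries give the three double sums above
    return gram(X, X).mean() + gram(Xp, Xp).mean() - 2.0 * gram(X, Xp).mean()
```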
By the same permutation test procedure as the one described in Section 10.2.4, one can compute p-values for $\widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X')$. Furthermore, an asymptotic distribution of $\widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X')$ under $P^* = {P'}^*$ can be obtained explicitly (Borgwardt et al., 2006; Gretton et al., 2007). This allows one to compute the p-values without resorting to the computationally intensive permutation procedure, which is an advantage of MMD over LSTT.
$\widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X')$ depends on the choice of the universal RKHS $\mathcal{H}$. In the
original MMD papers (Borgwardt et al., 2006; Gretton et al., 2007), the Gaussian
RKHS with width set to the median distance between samples was used, which is a
popular heuristic in the kernel method community (Schölkopf and Smola, 2002).
Recently, an idea of using the universal RKHS yielding the maximum MMD value
has been introduced (Sriperumbudur et al., 2009). In the experiments, we use this
maximum MMD technique for choosing the universal RKHS, which was shown
to work better than the median heuristic.
[Figure 10.9 panels: (a) banana, (b) breast cancer, (c) diabetes, (d) flare solar, (e) german, (f) heart, (g) image, (h) ringnorm, (i) splice, (j) thyroid, (k) twonorm, (l) waveform; each panel plots the acceptance rates of LSTT and MMD under $P^* = {P'}^*$ and $P^* \neq {P'}^*$ against the relative sample size η.]
Figure 10.9. The rate of accepting the null hypothesis (i.e., $P^* = {P'}^*$) for IDA datasets under the significance level 0.05. η indicates the relative sample size used in the experiments.
for the thyroid dataset, and the two methods are comparable for the other datasets.
Overall, LSTT compares favorably with MMD in terms of the type II error (the
acceptance rate of incorrect null-hypotheses, i.e., two distributions are judged to
be the same when they are actually different).
10.2.7 Remarks
We explained a non-parametric method of testing homogeneity called the least-
squares two-sample test (LSTT). Through experiments, the LSTT was shown to
that is, X and Y are statistically independent. Therefore, MI can be used for
detecting the statistical independence of random variables.
A variant of MI based on the Pearson (PE) divergence (Pearson, 1900) is
given by
$$\mathrm{SMI}(X, Y) := \frac{1}{2}\iint p^*(x)p^*(y)\left(\frac{p^*(x, y)}{p^*(x)p^*(y)} - 1\right)^2 \mathrm{d}x\mathrm{d}y, \tag{11.2}$$
which is called the squared-loss mutual information (SMI). Similarly to MI, SMI
is zero if and only if X and Y are statistically independent. Thus, SMI can also be
used for detecting the statistical independence of random variables.
In this chapter we introduce methods of mutual information estimation based
on density ratios and show their applications in machine learning. In Section 11.1,
density-ratio methods of mutual information estimation are described. Then we
show how the mutual information estimators may be utilized for solving sufficient
dimension reduction (Suzuki and Sugiyama, 2010) in Section 11.2 and indepen-
dent component analysis (Suzuki and Sugiyama, 2011) in Section 11.3. Note that
1 To be precise, the density functions $p^*(x, y)$, $p^*(x)$, and $p^*(y)$ should have different names, e.g., $p^*_{XY}(x, y)$, $p^*_X(x)$, and $p^*_Y(y)$. However, to simplify the notation, we use the same name $p^*$ in this chapter.
the mutual information estimators can also be used for solving various machine
learning tasks such as independence testing (Sugiyama and Suzuki, 2011), vari-
able selection (Suzuki et al., 2009b), clustering (Kimura and Sugiyama, 2011;
Sugiyama et al., 2011d), object matching (Yamada and Sugiyama, 2011a), and
causality learning (Yamada and Sugiyama, 2010). For details, please refer to each
reference.
11.1.1 Introduction
A naive approach to estimating mutual information (MI) is to use nonpara-
metric density estimation methods such as kernel density estimation (KDE;
Fraser and Swinney, 1986); that is, the densities p ∗ (x, y), p ∗ (x), and p ∗ (y)
included in Eq. (11.1) are estimated separately from the samples, and the estimated
densities are used for approximating MI. The bandwidth of the kernel functions
may be optimized based on cross-validation (Härdle et al., 2004), and hence there
is no open tuning parameter in this approach. However, density estimation is
known to be a hard problem, and division by estimated densities tends to magnify
the estimation error. Therefore, the KDE-based method may not be reliable in
practice.
Another approach uses histogram-based density estimators with data-dependent
partitions. In the context of estimating the KL divergence, histogram-based
methods, which could be regarded as implicitly estimating the ratio $\frac{p^*(x, y)}{p^*(x)p^*(y)}$,
have been studied thoroughly and their consistency has been established
(Darbellay and Vajda, 1999; Wang et al., 2005; Silva and Narayanan, 2007). How-
ever, the rate of convergence seems to be unexplored at present, and such
histogram-based methods seriously suffer from the curse of dimensionality. Thus,
these methods may not be reliable in high-dimensional problems.
Based on the fact that MI can be expressed in terms of the entropies, the nearest
neighbor distance has been used for approximating MI (Kraskov et al., 2004). Such
a nearest neighbor approach was shown to perform better than the naive KDE-
based approach (Khan et al., 2007), given that the number k of nearest neighbors
is chosen appropriately – a small (large) k yields an estimator with a small (large)
bias and a large (small) variance. However, appropriately determining the value of
k so that the bias-variance trade-off is optimally controlled is not straightforward
or
$$\widehat{\mathrm{SMI}}(X, Y) := -\frac{1}{2n^2}\sum_{k,k'=1}^{n} \widehat{r}(x_k, y_{k'})^2 + \frac{1}{n}\sum_{k=1}^{n}\widehat{r}(x_k, y_k) - \frac{1}{2}, \tag{11.5}$$
where we use the fact that $\mathrm{SMI}(X, Y)$ can be equivalently expressed as
$$\mathrm{SMI}(X, Y) = \frac{1}{2}\iint p^*(x, y)\, r^*(x, y)\, \mathrm{d}x\mathrm{d}y - \frac{1}{2}$$
$$= -\frac{1}{2}\iint p^*(x)p^*(y)\, r^*(x, y)^2\, \mathrm{d}x\mathrm{d}y + \iint p^*(x, y)\, r^*(x, y)\, \mathrm{d}x\mathrm{d}y - \frac{1}{2}.$$
where $\{(u_\ell, v_\ell)\}_{\ell=1}^{b}$ are Gaussian centers chosen randomly from $\{(x_k, y_k)\}_{k=1}^{n}$. In the classification scenario where $y$ is categorical, the following Gaussian–delta kernel model would be useful:
$$\psi_\ell(x, y) = \exp\left(-\frac{\|x - u_\ell\|^2}{2\sigma^2}\right)\delta(y = v_\ell),$$
where$^2$
$$\delta(c) = \begin{cases}1 & \text{if the condition } c \text{ is true,}\\ 0 & \text{otherwise.}\end{cases}$$
To these kernel basis functions we may also add a constant basis function $\psi_0(x, y) = 1$.
A MATLAB® implementation of LSMI is available from http://sugiyama-
www.cs.titech.ac.jp/˜sugi/software/LSMI/.
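As a rough illustration of the LSMI-style computation (not the MATLAB® package itself), the following NumPy sketch fits the density-ratio model with Gaussian kernels in $x$ and $y$ and evaluates the SMI estimate of Eq. (11.5). The function name, defaults, and random center selection are our own choices; in practice $\sigma$ and $\lambda$ are chosen by cross-validation.

```python
import numpy as np

def lsmi(X, Y, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Least-squares estimate of the squared-loss mutual information (SMI).

    X : (n, d_x) and Y : (n, d_y) paired samples.
    Fits r(x, y) ~ p(x, y) / (p(x) p(y)) with product Gaussian kernel bases
    centered at randomly chosen paired samples.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    b = min(n_centers, n)
    idx = rng.choice(n, size=b, replace=False)
    Ux, Vy = X[idx], Y[idx]                      # kernel centers (u_l, v_l)

    def gauss(A, C, s):
        d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * s ** 2))

    Phi_x = gauss(X, Ux, sigma)                  # (n, b) kernels in x
    Phi_y = gauss(Y, Vy, sigma)                  # (n, b) kernels in y
    Phi = Phi_x * Phi_y                          # psi_l(x_k, y_k) on paired samples
    # H approximates E_{p(x)p(y)}[psi psi^T] using all n^2 pairs (x_k, y_{k'})
    H = (Phi_x.T @ Phi_x) * (Phi_y.T @ Phi_y) / n ** 2
    h = Phi.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(b), h)
    # SMI estimate corresponding to Eq. (11.5)
    return float(-0.5 * theta @ H @ theta + (Phi @ theta).mean() - 0.5)
```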
$\|z\|_z := \max\{\|x\|, \|y\|\}$, where $\|\cdot\|$ denotes the Euclidean norm. Let $\mathcal{N}_k(i)$ be the set of $k$-NN samples of $z_i = (x_i, y_i)$ with respect to the norm $\|\cdot\|_z$, and let
$$\epsilon_x(i) := \max\{\|x_i - x_j\| \mid (x_j, y_j) \in \mathcal{N}_k(i)\}, \qquad c_x(i) := |\{z_j \mid \|x_i - x_j\| \le \epsilon_x(i)\}|,$$
$$\epsilon_y(i) := \max\{\|y_i - y_j\| \mid (x_j, y_j) \in \mathcal{N}_k(i)\}, \qquad c_y(i) := |\{z_j \mid \|y_i - y_j\| \le \epsilon_y(i)\}|.$$
Then the NN-based MI estimator is given by
$$\widehat{\mathrm{MI}}_{(k\text{-NN})}(X, Y) := F(k) + F(n) - \frac{1}{k} - \frac{1}{n}\sum_{i=1}^{n}\bigl[F(c_x(i)) + F(c_y(i))\bigr],$$
where $H_{\mathrm{normal}}$ is the entropy of the normal distribution with a covariance matrix equal to that of the target distribution and $\kappa_{i,j,k}$ ($1 \le i, j, k \le d$) is the standardized third cumulant of the target distribution. An estimate of MI can be obtained via the MI–entropy identity as
$$\widehat{\mathrm{MI}}_{(\mathrm{EDGE})}(X, Y) := \widehat{H}(X) + \widehat{H}(Y) - \widehat{H}(X, Y).$$
where $\langle\cdot,\cdot\rangle_{\mathcal{F}}$ denotes the inner product in $\mathcal{F}$. Thus, $C_{xy}$ can be expressed as
$$C_{xy} := \iint \left(K(\cdot, x) - \int K(\cdot, \tilde{x})\, p^*(\tilde{x})\mathrm{d}\tilde{x}\right) \otimes \left(L(\cdot, y) - \int L(\cdot, \tilde{y})\, p^*(\tilde{y})\mathrm{d}\tilde{y}\right) p^*(x, y)\, \mathrm{d}x\mathrm{d}y,$$
where $\otimes$ denotes the tensor product, and we use the reproducing properties:
$$\widehat{\mathrm{HSIC}}(\mathcal{Z}) := \frac{1}{n^2}\sum_{i,i'=1}^{n} K(x_i, x_{i'}) L(y_i, y_{i'}) + \frac{1}{n^4}\sum_{i,i',j,j'=1}^{n} K(x_i, x_{i'}) L(y_j, y_{j'}) - \frac{2}{n^3}\sum_{i,j,k=1}^{n} K(x_i, x_k) L(y_j, y_k)$$
$$= \frac{1}{n^2}\,\mathrm{tr}(K\Gamma L\Gamma),$$
where $K_{i,i'} = K(x_i, x_{i'})$, $L_{j,j'} = L(y_j, y_{j'})$, $\Gamma = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$, $I_n$ denotes the $n$-dimensional identity matrix, and $\mathbf{1}_n$ denotes the $n$-dimensional vector with all ones. Note that $\Gamma$ corresponds to the “centering” matrix in the RKHS.
The value of $\widehat{\mathrm{HSIC}}$ depends on the choice of the universal RKHSs F and G. In the original HSIC papers (Gretton et al., 2005, 2008), the Gaussian RKHS with width set to
the median distance between samples was used, which is a popular heuristic in
the kernel method community (Schölkopf and Smola, 2002). However, there is
no theoretical justification for this. On the other hand, the density-ratio methods
are equipped with cross-validation, and thus all the tuning parameters such as the
Gaussian width and the regularization parameter can be optimized in an objective
and systematic way.
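As a concrete illustration, the empirical HSIC with Gaussian kernels and the median-distance heuristic can be computed in a few lines. The following sketch is only illustrative and is not the original authors' implementation.

```python
import numpy as np

def hsic(x, y, sigma_x=None, sigma_y=None):
    """Empirical HSIC: (1/n^2) tr(K Gamma L Gamma) with Gaussian kernels.
    If widths are not given, the median-distance heuristic is used."""
    n = len(x)
    x = x.reshape(n, -1)
    y = y.reshape(n, -1)
    dx2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    dy2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    if sigma_x is None:
        sigma_x = np.sqrt(np.median(dx2[dx2 > 0]))
    if sigma_y is None:
        sigma_y = np.sqrt(np.median(dy2[dy2 > 0]))
    K = np.exp(-dx2 / (2 * sigma_x**2))
    L = np.exp(-dy2 / (2 * sigma_y**2))
    Gamma = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ Gamma @ L @ Gamma) / n**2
```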
where N (µ, σ 2 ) denotes the normal distribution with mean µ and variance σ 2 .
(b) Non-linear dependence 1: Y has a quadratic dependence on X as
[Figure: scatter plots of samples from the four artificial datasets — (a) linear dependence, (b) non-linear dependence 1, (c) non-linear dependence 2, and (d) independence.]
Figure 11.2. MI approximation error measured by $|\mathrm{MI} - \widehat{\mathrm{MI}}|$ averaged over 100 trials as a function of the sample size $n$, for (a) linear dependence, (b) non-linear dependence 1, (c) non-linear dependence 2, and (d) independence. The symbol "◦" on a line means that the corresponding method is the best in terms of the average error or is judged to be comparable to the best method by the t-test at the 1% significance level.
Figure 11.2 shows that, on dataset (a), MLMI, KDE, KNN with k = 5, and EDGE perform well; on dataset (b), MLMI tends to outperform the other estimators; on dataset (c), MLMI and KNN with k = 5 show the best performance against the other methods; and on dataset (d), MLMI, EDGE, and KNN with k = 15 perform well.
KDE works moderately well on datasets (a)–(c) and performs poorly on dataset
(d). This instability may be ascribed to division by estimated densities, which tends
to magnify the estimation error. KNN seems to work well on all four datasets
if the value of k is chosen optimally. However, there is no systematic model
selection strategy for KNN (see Section 11.1.5), and hence KNN would be unre-
liable in practice. EDGE works well on datasets (a), (b), and (d), which possess high normality.3 However, on dataset (c), where the normality of the target
3 Note that, although the Edgeworth approximation is exact when the target distributions are precisely
normal, the EDGE method still suffers from some estimation error because cumulants are estimated
from samples.
distributions is low, the EDGE method performs poorly. In contrast, MLMI with
cross-validation performs reasonably well for all four datasets in a stable manner.
These experimental results show that MLMI nicely compensates for the
weaknesses of the other methods.
11.1.7 Remarks
In this section we described methods of mutual information approximation based
on density-ratio estimation. The density-ratio methods have several useful prop-
erties; for example, they are single-shot procedures, density estimation is not
involved, they are equipped with a cross-validation procedure for model selec-
tion, and the unique global solution can be computed efficiently. The numerical
experiments illustrate the usefulness of the density-ratio approaches.
11.2.1 Introduction
The purpose of dimension reduction in supervised learning is to find a low-
dimensional subspace of input features that has “sufficient” information for
predicting output values. Supervised dimension reduction methods can be catego-
rized broadly into two types – wrappers and filters (Guyon and Elisseeff, 2003).
The wrapper approach performs dimension reduction specifically for a particu-
lar predictor, while the filter approach is independent of the choice of successive
predictors.
If one wants to enhance the prediction accuracy, the wrapper approach is a
suitable choice because predictors’ characteristics can be taken into account in the
dimension reduction phase. On the other hand, if one wants to interpret dimension-
reduced features (e.g., in bioinformatics, computational chemistry, or brain-signal
analysis), the filter approach is more appropriate because the extracted features are
independent of the choice of successive predictors and therefore reliable in terms
of interpretability. Here we focus on the filter approach.
4 In principle, it is possible to choose the Gaussian width and the regularization parameter by cross-
validation (CV) over a successive predictor. However, this is not preferable for the following two
reasons. The first is a significant increase of the computational cost. When CV is used, the tuning
parameters in KDR (or HSIC) and hyperparameters in the target predictor (such as basis parameters
and the regularization parameter) should be optimized at the same time. This results in a deeply
nested CV procedure, and therefore this could be computationally very expensive. Another reason
is that features extracted based on CV over a successive predictor are no longer independent of
predictors. Thus, a merit of the filter approach (i.e., the obtained features are “reliable”) is lost.
Let X (⊂ Rd ) be the domain of input feature x, and let Y be the domain of output
data5 y. Let m (∈ {1, . . . , d}) be the dimensionality of the sufficient subspace. To
search the sufficient subspace, we utilize the Grassmann manifold $\mathrm{Gr}_m^d(\mathbb{R})$, which is the set of all $m$-dimensional subspaces in $\mathbb{R}^d$ defined as

$$\mathrm{Gr}_m^d(\mathbb{R}) := \{W \in \mathbb{R}^{m \times d} \mid W W^\top = I_m\}/\!\sim, \qquad (11.6)$$

where $^\top$ denotes the transpose, $I_m$ is the $m$-dimensional identity matrix, and "$/\!\sim$" means that matrices sharing the same range are regarded as equivalent.
Let W∗ be a projection matrix corresponding to a member of the Grassmann
manifold Gr dm , and let z ∗ (∈ Rm ) be the orthogonal projection of input x given
by W∗ :
z ∗ = W∗ x.
Suppose that z ∗ satisfies
y ⊥⊥ x | z ∗ . (11.7)
That is, given the projected feature $z^*$, the (remaining) feature $x$ is conditionally independent of output $y$ and thus can be discarded without sacrificing the predictability of $y$.
Suppose that we are given n i.i.d. paired samples
D = {(x k , yk ) | x k ∈ X , yk ∈ Y}nk=1
drawn from a joint distribution with density p∗ (x, y). The goal of SDR is, from
data D, to find a projection matrix whose range agrees with that of W∗ . For a
projection matrix W (∈ Gr dm (R)) we write z k = Wx k . We assume that the reduced
dimensionality m is known throughout this section.
The rationale behind the use of SMI in the context of SDR relies on the following
lemma.
Lemma 11.1 (Suzuki and Sugiyama, 2010) Let p ∗ (x, y|z), p ∗ (x|z), and p ∗ (y|z)
be conditional densities. Then we have
$$\mathrm{SMI}(X,Y) - \mathrm{SMI}(Z,Y) = \frac{1}{2}\int \left(1 - \frac{p^*(x, y|z)}{p^*(x|z)\,p^*(y|z)}\right)^2 \frac{p^*(y, z)^2\, p^*(x)}{p^*(z)^2\, p^*(y)}\,dx\,dy \ge 0.$$
Lemma 11.1 implies
SMI(X ,Y) ≥ SMI(Z,Y),
and the equality holds if and only if
p ∗ (x, y|z) = p∗ (x|z)p ∗ (y|z),
which is equivalent to Eq. (11.7). Thus, Eq. (11.7) can be achieved by maximizing
SMI(Z,Y) with respect to W; then the “sufficient” subspace can be identified.
$$\widehat{H}_{\ell,\ell'} = \frac{1}{n^2}\left(\sum_{k=1}^{n}\phi^{y}_{\ell}(y_k)\,\phi^{y}_{\ell'}(y_k)\right)\left(\sum_{k'=1}^{n}\phi^{z}_{\ell}(z_{k'})\,\phi^{z}_{\ell'}(z_{k'})\right),$$

$$\phi^{y}_{\ell}(y) = \begin{cases}\exp\!\left(-\frac{(y - u_\ell)^2}{2\sigma^2}\right) & \text{(regression)},\\ \delta(y = u_\ell) & \text{(classification)},\end{cases} \qquad \phi^{z}_{\ell}(z) = \exp\!\left(-\frac{\|z - v_\ell\|^2}{2\sigma^2}\right), \qquad \delta(c) = \begin{cases}1 & \text{if the condition } c \text{ is true},\\ 0 & \text{otherwise}.\end{cases}$$
where "exp" for a matrix denotes the matrix exponential; that is, for a square matrix $D$, $\exp(D) = \sum_{k=0}^{\infty}\frac{1}{k!}D^k$. $O_d$ is the $d \times d$ matrix with all zeros. Note that the derivative $\partial_t W_t|_{t=0}$ coincides with the natural gradient (11.8); see Edelman et al. (1998)
for a detailed derivation of the geodesic. Thus, line search along the geodesic in the
natural gradient direction is equivalent to finding the maximizer from {Wt | t ≥ 0}.
For choosing the step size of each gradient update, we may use some approximate
line search method such as Armijo’s rule (Patriksson, 1999) or backtracking line
search (Boyd and Vandenberghe, 2004).
The LSMI-based sufficient dimension reduction algorithm is called least-
squares dimension reduction (LSDR). The entire algorithm is summarized
in Figure 11.3. A MATLAB® implementation of LSDR is available from
http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/LSDR/.
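To convey the flavor of such a subspace search, the following simplified Python sketch maximizes an LSMI-style SMI estimate over projection matrices. Instead of the natural-gradient geodesic search described above, it uses plain numerical-gradient ascent followed by QR re-orthonormalization; it is only a toy stand-in for LSDR, with all hyperparameters fixed for simplicity.

```python
import numpy as np

def smi_hat(z, y, sigma=1.0, lam=0.1, b=50, seed=0):
    """Small LSMI-style SMI estimate (cf. Section 11.1.4) used as the objective."""
    n = len(z)
    rng = np.random.default_rng(seed)
    z = z.reshape(n, -1); y = y.reshape(n, -1)
    c = rng.choice(n, size=min(b, n), replace=False)
    Pz = np.exp(-((z[:, None, :] - z[c][None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
    Py = np.exp(-((y[:, None, :] - y[c][None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
    h = (Pz * Py).mean(axis=0)
    H = (Pz.T @ Pz) * (Py.T @ Py) / n**2
    theta = np.linalg.solve(H + lam * np.eye(len(c)), h)
    return 0.5 * h @ theta - 0.5

def subspace_search_sketch(x, y, m, steps=50, eta=0.1, fd=1e-4, seed=0):
    """Toy SMI maximization over W (m x d with orthonormal rows): numerical
    gradient ascent plus QR retraction, a simplified stand-in for LSDR."""
    d = x.shape[1]
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.normal(size=(d, m)))
    W = W.T                                     # orthonormal rows
    for _ in range(steps):
        base = smi_hat(x @ W.T, y)
        grad = np.zeros_like(W)
        for i in range(m):
            for j in range(d):
                Wp = W.copy(); Wp[i, j] += fd
                grad[i, j] = (smi_hat(x @ Wp.T, y) - base) / fd
        Q, _ = np.linalg.qr((W + eta * grad).T)  # retract back onto the manifold
        W = Q.T
    return W
```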
[Figure: example samples from the datasets — (a) linear dependence, (b) non-linear dependence 1, (c) non-linear dependence 2, (d) artificial data 1, (e) artificial data 2, and (f) artificial data 3.]
$x^{(i)}$ denotes the $i$-th element of the vector $x$, $N(x; \mu, \Sigma)$ denotes the multivariate normal density with mean $\mu$ and covariance matrix $\Sigma$, $\mathbf{0}_d$ denotes the $d$-dimensional vector with all zeros, and $I_d$ denotes the $d$-dimensional identity matrix. The optimal projection is given by $W^* = (1\ 0\ 0\ 0\ 0)$.
(b) Non-linear dependence 1: d = 5 and m = 1. y has a quadratic dependence
on x as

$$x \sim U\!\left(x; \left[-\tfrac{1}{2}, \tfrac{1}{2}\right]^5\right), \qquad y|x \sim \begin{cases} N\!\left(y; 0, \tfrac{1}{4}\right) & \text{if } |x^{(1)}| \le \tfrac{1}{6},\\ \tfrac{1}{2}N\!\left(y; 1, \tfrac{1}{4}\right) + \tfrac{1}{2}N\!\left(y; -1, \tfrac{1}{4}\right) & \text{otherwise}, \end{cases}$$
where U (x; S) denotes the uniform density on the set S. The optimal projection
is given by W∗ = (1 0 0 0 0).
(d) Artificial data 1: d = 4 and m = 2. y has a non-linear dependence on x as
$$y = \frac{x^{(1)}}{0.5 + (x^{(2)} + 1.5)^2} + (1 + x^{(2)})^2 + 0.4\,\epsilon,$$
In KDR and HSIC, the Gaussian width is set to the median sample distance, follow-
ing the suggestions in the original papers (Fukumizu et al., 2009; Gretton et al.,
2005). We use the dimension reduction package dr included in R for SIR, SAVE,
and pHd. The principal directions estimated by SIR, SAVE, and pHd do not nec-
essarily form an orthogonal system; that is, the matrix F, each row of which
corresponds to each principal direction, is not necessarily a projection matrix.
To recover a projection matrix $W$, we perform a singular value decomposition of $F$ as $F = V S U^\top$ and set $W = U^\top$.
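In code, this recovery step amounts to a single SVD call. The small sketch below assumes $F$ holds the estimated principal directions in its rows.

```python
import numpy as np

def to_projection_matrix(F):
    """Return an orthonormal matrix W (WW^T = I_m) whose rows span the same
    subspace as the rows of F, via F = V S U^T and W = U^T."""
    _, _, Ut = np.linalg.svd(F, full_matrices=False)
    return Ut

F = np.random.randn(2, 5)                 # two (non-orthogonal) directions in R^5
W = to_projection_matrix(F)
print(np.allclose(W @ W.T, np.eye(2)))    # True
```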
We evaluate the performance of each method by

$$\|\widehat{W}^\top\widehat{W} - W^{*\top}W^*\|_{\mathrm{Frobenius}}, \qquad (11.9)$$
11.2.5 Remarks
In this section we described a method of sufficient dimension reduction (SDR)
called least-squares dimension reduction (LSDR) that utilizes least-squares
mutual information (LSMI). The LSDR method is advantageous over other
approaches in several respects; for example, density estimation is not involved,
it is distribution-free, and model selection by cross-validation is available. The
numerical experiments show the usefulness of LSDR.
Although the LSMI solution is given analytically, sufficient subspace search
involves non-convex optimization and the natural gradient method is still com-
putationally demanding. Thus, improving the computation cost for a sufficient
subspace search is an important future work. A heuristic for LSMI maximization
is studied in Yamada et al. (2011c).
11.3.1 Introduction
Various approaches to evaluating the independence of random variables from sam-
ples have been explored so far. A naive approach is to estimate probability densities
based on parametric or non-parametric density estimation methods. However, find-
ing an appropriate parametric model is not easy without strong prior knowledge,
and non-parametric estimation is not accurate in high-dimensional problems. Thus,
this naive approach is not reliable in practice. Another approach is to approxi-
mate the negentropy (or negative entropy) based on the Gram–Charlier expansion
(Cardoso and Souloumiac, 1993; Comon, 1994; Amari et al., 1996) or the Edge-
worth expansion (Hulle, 2008). An advantage of this negentropy-based approach
is that the hard task of density estimation is not directly involved. However, these
expansion techniques are based on the assumption that the target density is close
to normal and violation of this assumption can cause large approximation errors.
These approaches are based on the probability densities of signals. Another
line of research that does not explicitly involve probability densities employs
where A is a d × d invertible matrix called the mixing matrix. The goal of ICA is,
given samples of the mixed signals {yk }nk=1 , to obtain a demixing matrix W that
recovers the original source signal x. We denote the demixed signal by z:
z = Wy.
The ideal solution is W = A−1 , but we can only recover the source signals up to
the permutation and scaling of components of x because of the non-identifiability
of the ICA setup (Hyvärinen et al., 2001).
Here we use LSMI (see Section 11.1.4) for SMI approximation. An LSMI-based
SMI estimator, $\widehat{\mathrm{SMI}}(Z^{(1)}, \ldots, Z^{(d)})$, is given as

$$\widehat{\mathrm{SMI}}(Z^{(1)}, \ldots, Z^{(d)}) = \frac{1}{2}\,\widehat{h}^\top\widehat{\theta} - \frac{1}{2},$$

where

$$\widehat{\theta} = (\widehat{H} + \lambda I_b)^{-1}\widehat{h}, \qquad \widehat{h}_\ell = \frac{1}{n}\sum_{k=1}^{n}\prod_{i=1}^{d}\exp\!\left(-\frac{(z_k^{(i)} - v_\ell^{(i)})^2}{2\sigma^2}\right),$$

$$\text{and} \qquad \widehat{H}_{\ell,\ell'} = \frac{1}{n^d}\prod_{i=1}^{d}\sum_{k=1}^{n}\exp\!\left(-\frac{(z_k^{(i)} - v_\ell^{(i)})^2 + (z_k^{(i)} - v_{\ell'}^{(i)})^2}{2\sigma^2}\right).$$
b is the number of basis functions used in LSMI, and I b denotes the b-dimensional
identity matrix. Note that θ and h are b-dimensional vectors and H is a b × b
matrix. λ (≥ 0) is the regularization parameter in LSMI, σ is the kernel width, and
$$\frac{\partial \widehat{h}_\ell}{\partial W_{i,i'}} = -\frac{1}{n\sigma^2}\sum_{k=1}^{n}(z_k^{(i)} - v_\ell^{(i)})(y_k^{(i')} - u_\ell^{(i')})\exp\!\left(-\frac{\|z_k - v_\ell\|^2}{2\sigma^2}\right),$$

$$\frac{\partial \widehat{H}_{\ell,\ell'}}{\partial W_{i,i'}} = \frac{1}{n^{d-1}}\prod_{i'' \ne i}\sum_{k=1}^{n}\exp\!\left(-\frac{(z_k^{(i'')} - v_\ell^{(i'')})^2 + (z_k^{(i'')} - v_{\ell'}^{(i'')})^2}{2\sigma^2}\right)$$
$$\times \left[-\frac{1}{n\sigma^2}\sum_{k=1}^{n}\Bigl((z_k^{(i)} - v_\ell^{(i)})(y_k^{(i')} - u_\ell^{(i')}) + (z_k^{(i)} - v_{\ell'}^{(i)})(y_k^{(i')} - u_{\ell'}^{(i')})\Bigr)\exp\!\left(-\frac{(z_k^{(i)} - v_\ell^{(i)})^2 + (z_k^{(i)} - v_{\ell'}^{(i)})^2}{2\sigma^2}\right)\right].$$
In ICA, the scale of components of z can be arbitrary. This implies that the
above gradient updating rule can lead to a solution with poor scaling, which is not
preferable from a numerical point of view. To avoid possible numerical instability,
we normalize W at each gradient iteration as
$$W_{i,i'} \longleftarrow \frac{W_{i,i'}}{\sqrt{\sum_{i''=1}^{d} W_{i,i''}^2}}. \qquad (11.13)$$
In practice, we may iteratively perform line searches along the gradient and opti-
mize the Gaussian width σ and the regularization parameter λ by cross-validation
(CV). A pseudo code of the PG-LICA algorithm is summarized in Figure 11.5.
Figure 11.5. The LICA algorithm with plain gradient descent (PG-LICA).
$$\widehat{C} := \frac{1}{n}\sum_{k=1}^{n}(y_k - \bar{y})(y_k - \bar{y})^\top \qquad \text{and} \qquad \bar{y} := \frac{1}{n}\sum_{k=1}^{n} y_k.$$
Then it can be shown that a demixing matrix that eliminates the second-order
correlation is an orthogonal matrix (Hyvärinen et al., 2001). Thus, for whitened
data, the search space of W can be restricted to the orthogonal group O(d) without
loss of generality.
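A minimal whitening sketch (centering followed by multiplication with $\widehat{C}^{-1/2}$, obtained from the eigendecomposition of the sample covariance) is shown below; after this preprocessing, the demixing matrix can indeed be sought within the orthogonal group.

```python
import numpy as np

def whiten(Y):
    """Whiten samples Y (n x d): subtract the mean and transform so that the
    sample covariance becomes the identity, using C_hat^{-1/2}."""
    Yc = Y - Y.mean(axis=0)
    C = Yc.T @ Yc / len(Y)                      # sample covariance C_hat
    evals, evecs = np.linalg.eigh(C)
    C_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return Yc @ C_inv_sqrt

# After whitening, the sample covariance is (numerically) the identity:
rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
Z = whiten(Y)
print(np.round(Z.T @ Z / len(Z), 3))
```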
The tangent space of $O(d)$ at $W$ is equal to the space of all matrices $U$ such that $W^\top U$ is skew symmetric, that is, $U^\top W = -W^\top U$. The steepest direction on this tangent space, which is called the natural gradient, is given as follows (Amari, 1998):

$$\widetilde{\nabla}\,\widehat{\mathrm{SMI}}(W) := \frac{1}{2}\left(\frac{\partial \widehat{\mathrm{SMI}}}{\partial W} - W\left(\frac{\partial \widehat{\mathrm{SMI}}}{\partial W}\right)^{\!\top} W\right), \qquad (11.15)$$
Figure 11.6. The LICA algorithm with natural gradient descent (NG-LICA).
We also use the demosig dataset available in the FastICA package6 for MATLAB® ,
and 10halo, Sergio7, Speech4, and c5signals datasets available in the ICALAB
signal processing benchmark datasets7 (Cichocki and Amari, 2003).
The Amari index (Amari et al., 1996) is used as the performance measure
(smaller is better):

$$\text{Amari index} := \frac{1}{2d(d-1)}\sum_{i,i'=1}^{d}\left(\frac{|o_{i,i'}|}{\max_{i''}|o_{i,i''}|} + \frac{|o_{i,i'}|}{\max_{i''}|o_{i'',i'}|}\right) - \frac{1}{d-1},$$
6 http://www.cis.hut.fi/projects/ica/fastica.
7 http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/benchmarks/.
eter settings are used. The hyperparameters σ and λ in LICA are chosen by 5-fold
CV from the 10 values in [0.1, 1] at regular intervals and the 10 values in [0.001, 1]
at regular intervals in log scale, respectively.
We randomly generate the mixing matrix A and source signals for artificial
datasets and compute the Amari index between the true A and W −1 for W
estimated
by each method. As training samples, we use the first n samples for the Sergio7
and c5signals datasets, and the n samples between the 1001st and (1000+n)-th
intervals for the 10halo and Speech4 datasets. We test n = 200 and 500.
The performance of each method is summarized in Table 11.4, depicting the
mean and standard deviation of the Amari index over 50 trials. NG-LICA shows
overall good performance. KICA tends to work reasonably well for datasets (a),
(b), (c), and demosig, but it performs poorly for the ICALAB datasets; this seems
to be caused by an inappropriate choice of the Gaussian kernel width and local
8 http://www.di.ens.fr/˜fbach/kernel-ica/index.htm.
9 http://www.cis.hut.fi/projects/ica/fastica.
10 http://perso.telecom-paristech.fr/ cardoso/guidesepsou.html.
optima. On the other hand, FICA and JADE tend to work reasonably well for the
ICALAB datasets but perform poorly for (a), (b), (c), and demosig; we conjecture
that the contrast functions in FICA and the fourth-order statistics in JADE do
not appropriately catch the non-Gaussianity of datasets (a), (b), (c), and demosig.
Overall, the LICA algorithm compares favorably with other methods.
11.3.5 Remarks
In this section we explained an ICA method based on squared-loss mutual infor-
mation. The method, called least-squares ICA (LICA), has several preferable
properties; for example, it is distribution-free and hyperparameter selection by
cross-validation is available.
Similarly to other ICA algorithms, the optimization problem involved in LICA
is non-convex. Thus, it is practically very important to develop good heuristics for
initialization and avoiding local optima in the gradient procedures, which is an open
research topic to be investigated. Moreover, although the SMI estimator is analytic,
the LICA algorithm is still computationally rather expensive because it requires
one to solve linear equations and perform cross-validations. The computational
issue needs to be addressed, for example, by vectorization and parallelization.
12
Conditional Probability Estimation
12.1.1 Introduction
Regression aims to estimate the conditional mean of output y given input x (see
Section 1.1.1). When the conditional density p ∗ (y|x) is unimodal and symmet-
ric, regression would be sufficient for analyzing the input–output dependency.
$$\widehat{h} := \frac{1}{n}\sum_{k=1}^{n}\psi(x_k, y_k).$$

Then the uLSIF solution is given analytically as follows (see Section 6.2.2):

$$\widehat{\theta} = (\widehat{H} + \lambda I_b)^{-1}\widehat{h},$$
and $C_{\mathrm{SQ}}$ is the constant defined by $C_{\mathrm{SQ}} := \frac{1}{2}\int p^*(y|x)\,p^*(x, y)\,dx\,dy$. The KL error for a conditional density estimator $\widehat{p}(y|x)$ is defined as

$$\mathrm{KL}_0 := \int p^*(x, y)\log\frac{p^*(x, y)}{\widehat{p}(y|x)\,p^*(x)}\,dx\,dy = \mathrm{KL} + C_{\mathrm{KL}},$$

where

$$\mathrm{KL} := -\int p^*(x, y)\log \widehat{p}(y|x)\,dx\,dy$$

and $C_{\mathrm{KL}}$ is the constant defined by $C_{\mathrm{KL}} := \int p^*(x, y)\log p^*(y|x)\,dx\,dy$. The smaller the value of SQ or KL, the better the performance of the conditional density estimator $\widehat{p}(y|x)$.
For the above performance measures, CV is carried out as follows. First, the
samples Z := {z k |z k = (x k , yk )}nk=1 are divided into K disjoint subsets {Zk }Kk=1 of
approximately the same size. Let p k be the conditional density estimator obtained
using Z\Zk (i.e., the estimator obtained without Zk ). Then the target error values
are approximated using the hold-out samples Zk as
$$\widehat{\mathrm{SQ}}_{Z_k} := \frac{1}{2|Z_k|}\sum_{\widetilde{x}\in Z_k}\int \widehat{p}_k(y|\widetilde{x})^2\,dy - \frac{1}{|Z_k|}\sum_{(\widetilde{x},\widetilde{y})\in Z_k}\widehat{p}_k(\widetilde{y}|\widetilde{x}),$$

$$\widehat{\mathrm{KL}}_{Z_k} := -\frac{1}{|Z_k|}\sum_{(\widetilde{x},\widetilde{y})\in Z_k}\log \widehat{p}_k(\widetilde{y}|\widetilde{x}),$$
where |Zk | denotes the number of elements in the set Zk . This procedure is repeated
for k = 1, . . . , K and their averages are computed:
$$\widehat{\mathrm{SQ}} := \frac{1}{K}\sum_{k=1}^{K}\widehat{\mathrm{SQ}}_{Z_k} \qquad \text{and} \qquad \widehat{\mathrm{KL}} := \frac{1}{K}\sum_{k=1}^{K}\widehat{\mathrm{KL}}_{Z_k}.$$

$\widehat{\mathrm{SQ}}$ and $\widehat{\mathrm{KL}}$ can be shown to be almost unbiased estimators of the true costs
SQ and KL, respectively, where the “almost”-ness comes from the fact that the
number of samples is reduced in the CV procedure as a result of data splitting
(Luntz and Brailovsky, 1969; Schölkopf and Smola, 2002).
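The KL-based cross-validation score above can be computed generically for any conditional density estimator exposing a fit/evaluate interface. The Python sketch below uses hypothetical `fit` and `pdf` callables supplied by the user; only the KL score is computed, since the SQ score additionally requires integrating the squared estimated density.

```python
import numpy as np

def cv_kl_score(X, Y, fit, K=5, seed=0):
    """K-fold CV estimate of the KL error of a conditional density estimator.
    `fit(X_tr, Y_tr)` must return a function pdf(y, x) giving p_hat(y|x);
    both callables are user-supplied (hypothetical interface)."""
    n = len(X)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    scores = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        pdf = fit(X[train], Y[train])
        # hold-out KL term: -(1/|Z_k|) sum log p_hat(y|x)
        scores.append(-np.mean([np.log(pdf(Y[i], X[i])) for i in test]))
    return np.mean(scores)

# Example with a deliberately crude estimator that ignores x
def gaussian_fit(X_tr, Y_tr):
    mu, var = Y_tr.mean(), Y_tr.var() + 1e-12
    return lambda y, x: np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200); Y = np.sin(X) + 0.1 * rng.normal(size=200)
print(cv_kl_score(X, Y, gaussian_fit))
```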
$$\widehat{p}(y|x) = \frac{1}{|I_{x,\epsilon}|}\sum_{k \in I_{x,\epsilon}} N(y; y_k, \sigma^2 I_{d_y}),$$

where $N(y; \mu, \Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance matrix $\Sigma$. The threshold $\epsilon$ and the bandwidth $\sigma$ may be chosen based on cross-validation (Härdle et al., 2004). $\epsilon$-KDE is simple and easy to use, but it may not be reliable in
high-dimensional problems. Slightly more sophisticated variants have been pro-
posed based on weighted kernel density estimation (Fan et al., 1996; Wolff et al.,
1999), but they may still share the same weaknesses.
$$\widehat{p}(y|x) = \sum_{\ell=1}^{b}\pi_\ell(x)\,N\!\left(y; \mu_\ell(x), \sigma_\ell^2(x)\,I_{d_y}\right),$$
Illustrative Examples
First we illustrate how LSCDE behaves using toy datasets.
Let dx = dy = 1. Inputs {xk }nk=1 are independently drawn from U (−1, 1), where
U (a, b) denotes the uniform distribution on (a, b). Outputs {yk }nk=1 are generated
by the following heteroscedastic noise model:
$$y_k = \mathrm{sinc}(2\pi x_k) + \frac{1}{8}\exp(1 - x_k)\cdot\epsilon_k.$$

We test the following three different distributions for $\{\epsilon_k\}_{k=1}^{n}$:

(a) Gaussian: $\epsilon_k \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$
(b) Bimodal: $\epsilon_k \overset{\mathrm{i.i.d.}}{\sim} \frac{1}{2}N\!\left(-1, \frac{4}{9}\right) + \frac{1}{2}N\!\left(1, \frac{4}{9}\right)$
(c) Skewed: $\epsilon_k \overset{\mathrm{i.i.d.}}{\sim} \frac{3}{4}N(0, 1) + \frac{1}{4}N\!\left(\frac{3}{2}, \frac{1}{9}\right)$

where "$\overset{\mathrm{i.i.d.}}{\sim}$" denotes "independent and identically distributed" and $N(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. See
Figure 12.1(a)–(c) for the true conditional densities and training samples of size n = 200. The estimated results are also depicted in Figure 12.1(a)–(c), illustrating that LSCDE captures heteroscedasticity, bimodality, and asymmetry well.
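The toy data above can be generated as follows (a small sketch; note that numpy's sinc is the normalized version, so $\mathrm{sinc}(2\pi x_k)$ corresponds to `np.sinc(2 * x)`).

```python
import numpy as np

def generate_toy_data(n=200, noise="gaussian", seed=0):
    """Heteroscedastic sinc data: y = sinc(2*pi*x) + (1/8) exp(1 - x) * eps."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=n)
    if noise == "gaussian":
        eps = rng.normal(0, 1, size=n)
    elif noise == "bimodal":
        s = rng.integers(0, 2, size=n)                       # mixture component
        eps = rng.normal(np.where(s == 0, -1.0, 1.0), 2.0 / 3.0, size=n)
    else:  # skewed
        s = rng.uniform(size=n) < 0.75
        eps = np.where(s, rng.normal(0, 1, size=n), rng.normal(1.5, 1.0 / 3.0, size=n))
    y = np.sinc(2 * x) + np.exp(1 - x) * eps / 8.0
    return x, y
```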
[Figure 12.1: true and estimated conditional densities. (a) Artificial dataset containing heteroscedastic Gaussian noise: $\epsilon_k \sim N(0,1)$. (b) Artificial dataset containing heteroscedastic bimodal Gaussian noise: $\epsilon_k \sim \frac{1}{2}N(-1, \frac{4}{9}) + \frac{1}{2}N(1, \frac{4}{9})$. (c) Artificial dataset containing heteroscedastic skewed Gaussian noise: $\epsilon_k \sim \frac{3}{4}N(0, 1) + \frac{1}{4}N(\frac{3}{2}, \frac{1}{9})$. (d) Relative spinal bone mineral density measurements on North American adolescents (Hastie et al., 2001), having a heteroscedastic asymmetric conditional distribution. (e) The durations of eruptions of the Old Faithful Geyser (Weisberg, 1985), having a bimodal conditional distribution.]
Benchmark Datasets
Next we apply LSCDE and the methods reviewed in Section 12.1.3 to the benchmark datasets that accompany the R package (R Development Core Team, 2009), and evaluate their experimental performance. See Table 12.1 for the list
of datasets.
In each dataset, 50% of the samples are chosen randomly for conditional den-
sity estimation and the rest are used for computing the estimation accuracy. The
accuracy of a conditional density estimator $\widehat{p}(y|x)$ is measured by the negative log-likelihood for test samples $\{\widetilde{z}_k \mid \widetilde{z}_k = (\widetilde{x}_k, \widetilde{y}_k)\}_{k=1}^{\widetilde{n}}$:

$$\mathrm{NLL} := -\frac{1}{\widetilde{n}}\sum_{k=1}^{\widetilde{n}}\log \widehat{p}(\widetilde{y}_k \mid \widetilde{x}_k). \qquad (12.4)$$
Thus, the smaller the value of NLL, the better the performance of the conditional
density estimator p (y|x).
We compare LSCDE, ε-KDE, MDN, and KQR. In addition, the ratio of ker-
nel density estimators (RKDE) is also tested, which estimates the density ratio
p∗ (x, y)/p ∗ (x) by first approximating the two densities p ∗ (x, y) and p ∗ (x) sep-
arately by kernel density estimation and then taking the ratio of the estimated
densities. For model selection, we use cross-validation based on the Kullback–
Leibler (KL) error (see Section 12.1.2), which is consistent with the above NLL.
In MDN, cross-validation over three tuning parameters (the number of Gaussian
components, the number of hidden units in the neural network, and the regular-
ization parameter; see Section 12.1.3) is unbearably slow, and hence the number
of Gaussian components is fixed to b = 3 and the other two tuning parameters are
chosen by cross-validation.
The experimental results are summarized in Table 12.1. ε-KDE is computationally very efficient, but it tends to perform rather poorly. MDN works well, but it is computationally highly demanding. KQR performs well overall and is computationally slightly more efficient than LSCDE. However, its solution-path tracking algorithm is numerically rather unstable, and solutions are not properly obtained for the engel and cpus datasets. RKDE performs poorly in all cases, implying that density-ratio estimation via density estimation is not reliable in practice.
Table 12.1. Experimental results on benchmark datasets (dy = 1). The averages and the standard
deviations of the NLL errors [see Eq. (12.4)] over 10 runs are described (smaller is better). The best
method in terms of the mean error and comparable methods according to the t-test at the 5%
significance level are specified by boldface. The mean computation time is normalized so that
LSCDE is one.
Dataset (n, d_x)    LSCDE    ε-KDE    MDN    KQR    RKDE
caution (50,2) 1.24 ± 0.29 1.25 ± 0.19 1.39 ± 0.18 1.73 ± 0.86 17.11 ± 0.25
ftcollinssnow (46,1) 1.48 ± 0.01 1.53 ± 0.05 1.48 ± 0.03 2.11 ± 0.44 46.06 ± 0.78
highway (19,11) 1.71 ± 0.41 2.24 ± 0.64 7.41 ± 1.22 5.69 ± 1.69 15.30 ± 0.76
heights (687,1) 1.29 ± 0.00 1.33 ± 0.01 1.30 ± 0.01 1.29 ± 0.00 54.79 ± 0.10
sniffer (62,4) 0.69 ± 0.16 0.96 ± 0.15 0.72 ± 0.09 0.68 ± 0.21 26.80 ± 0.58
snowgeese (22,2) 0.95 ± 0.10 1.35 ± 0.17 2.49 ± 1.02 2.96 ± 1.13 28.43 ± 1.02
ufc (117,4) 1.03 ± 0.01 1.40 ± 0.02 1.02 ± 0.06 1.02 ± 0.06 11.10 ± 0.49
birthwt (94,7) 1.43 ± 0.01 1.48 ± 0.01 1.46 ± 0.01 1.58 ± 0.05 15.95 ± 0.53
crabs (100,6) -0.07 ± 0.11 0.99 ± 0.09 -0.70 ± 0.35 -1.03 ± 0.16 12.60 ± 0.45
GAGurine (157,1) 0.45 ± 0.04 0.92 ± 0.05 0.57 ± 0.15 0.40 ± 0.08 53.43 ± 0.27
geyser (149,1) 1.03 ± 0.00 1.11 ± 0.02 1.23 ± 0.05 1.10 ± 0.02 53.49 ± 0.38
gilgais (182,8) 0.73 ± 0.05 1.35 ± 0.03 0.10 ± 0.04 0.45 ± 0.15 10.44 ± 0.50
topo (26,2) 0.93 ± 0.02 1.18 ± 0.09 2.11 ± 0.46 2.88 ± 0.85 10.80 ± 0.35
BostonHousing (253,13) 0.82 ± 0.05 1.03 ± 0.05 0.68 ± 0.06 0.48 ± 0.10 17.81 ± 0.25
CobarOre (19,2) 1.58 ± 0.06 1.65 ± 0.09 1.63 ± 0.08 6.33 ± 1.77 11.42 ± 0.51
engel (117,1) 0.69 ± 0.04 1.27 ± 0.05 0.71 ± 0.16 N.A. 52.83 ± 0.16
mcycle (66,1) 0.83 ± 0.03 1.25 ± 0.23 1.12 ± 0.10 0.72 ± 0.06 48.35 ± 0.79
BigMac2003 (34,9) 1.32 ± 0.11 1.29 ± 0.14 2.64 ± 0.84 1.35 ± 0.26 13.34 ± 0.52
UN3 (62,6) 1.42 ± 0.12 1.78 ± 0.14 1.32 ± 0.08 1.22 ± 0.13 11.43 ± 0.58
cpus (104,7) 1.04 ± 0.07 1.01 ± 0.10 -2.14 ± 0.13 N.A. 15.16 ± 0.72
12.1.5 Remarks
We described a density-ratio method for conditional density estimation called
LSCDE. Experiments on benchmark and robot-transition datasets demonstrated
the usefulness of LSCDE.
In LSCDE, a direct density-ratio estimation method based on the squared
distance called unconstrained least-squares importance fitting (uLSIF; see
Section 6.2.2) was applied to conditional density estimation. Similarly, applying
the direct density-ratio estimation method based on the KL importance estimation
procedure (KLIEP; see Chapter 5), we can obtain a log-loss variant of LSCDE. A
variant of the KLIEP method described in Section 5.2.2 uses a log-linear model (a.k.a. a maximum entropy model; Jaynes, 1957) for density-ratio estimation:

$$r(x, y) := \frac{\exp(\psi(x, y)^\top\theta)}{\int\exp(\psi(x, y')^\top\theta)\,dy'}.$$

Applying this log-linear KLIEP method to conditional density estimation is actually equivalent to maximum likelihood estimation of conditional densities for log-linear models1:

$$\max_{\theta}\ \sum_{k=1}^{n}\log\frac{\exp(\psi(x_k, y_k)^\top\theta)}{\int\exp(\psi(x_k, y)^\top\theta)\,dy}.$$
12.2.1 Introduction
The support vector machine (SVM; Cortes and Vapnik, 1995; Vapnik, 1998; see
also Section 4.4) is a popular method for classification. Various computationally
efficient algorithms for training SVMs with massive datasets have been developed
so far (e.g., Platt, 1999; Joachims, 1999; Chang and Lin, 2001; Collobert and
Bengio, 2001; Suykens et al., 2002; Rifkin et al., 2003; Tsang et al., 2005; Fung
and Mangasarian, 2005; Fan et al., 2005; Tang and Zhang, 2006; Joachims, 2006;
Teo et al., 2007; Franc and Sonnenburg, 2009; and many other software packages available
online). However, SVMs cannot provide the confidence of class prediction because
they only learn the decision boundaries between different classes. To cope with this
problem, several post-processing methods have been developed for approximately
computing the class-posterior probability (Platt, 2000; Wu et al., 2004).
On the other hand, logistic regression (LR; see Section 4.2) is a classification
algorithm that can naturally give the confidence of class prediction because it learns
the class-posterior probabilities (Hastie et al., 2001). Recently, various efficient
algorithms for training LR models specialized in sparse data have been developed
(Koh et al., 2007; Fan et al., 2008).
Applying the kernel trick to LR as is done in SVMs, one can easily obtain
a non-linear classifier with probabilistic outputs, called a kernel logistic regres-
sion (KLR). Because the kernel matrix is often dense (e.g., Gaussian kernels),
the state-of-the-art LR algorithms for sparse data are not applicable to KLR.
Thus, to train KLR classifiers, standard non-linear optimization techniques such as
Newton’s method (which results in iteratively reweighted least squares) and quasi-
Newton methods such as the Broyden–Fletcher–Goldfarb–Shanno method seem to
be commonly used in practice (Hastie et al., 2001; Minka, 2007). Although the per-
formances of these general-purpose non-linear optimization techniques have been
improved together with the evolution of computer environments in the last decade,
computing the KLR solution is still challenging when the number of training sam-
ples is large. In this section we give an alternative probabilistic classification
method that can be trained very efficiently.
$$\widehat{H} := \frac{1}{n}\sum_{k=1}^{n}\sum_{y=1}^{c}\psi(x_k, y)\,\psi(x_k, y)^\top \qquad \text{and} \qquad \widehat{h} := \frac{1}{n}\sum_{k=1}^{n}\psi(x_k, y_k).$$

Then the uLSIF solution is given analytically as follows (see Section 6.2.2):

$$\widehat{\theta} = (\widehat{H} + \lambda I_b)^{-1}\widehat{h}, \qquad (12.5)$$

$$\widehat{p}(y|x) = \frac{\max\!\left(0, \psi(x, y)^\top\widehat{\theta}\right)}{\sum_{y'=1}^{c}\max\!\left(0, \psi(x, y')^\top\widehat{\theta}\right)}.$$
$$p(y|x; \theta) = \sum_{y'=1}^{c}\sum_{\ell=1}^{n}\theta_\ell^{(y')} K(x, x_\ell, y, y'), \qquad (12.6)$$

$$p(y|x; \theta) = \sum_{y'=1}^{c}\sum_{\ell=1}^{n}\theta_\ell^{(y')} K(x, x_\ell)\,\delta_{y,y'},$$
where K(x, x ) is a kernel function for x and δy,y is the Kronecker delta; that is,
δy,y = 1 if y = y and δy,y = 0 otherwise. This model choice actually allows us
to speed up the computation of LSPC significantly because all of the calculations
can be carried out separately in a class-wise manner. Indeed, this above model for
class y is expressed as follows [see Figure 12.3(a)]:
$$p(y|x; \theta) = \sum_{\ell=1}^{n}\theta_\ell^{(y)} K(x, x_\ell). \qquad (12.7)$$
Figure 12.3. Gaussian kernel models for approximating class-posterior probabilities. (a)
Locating Gaussian kernels at all samples. (b) Heuristic of reducing the number of basis
functions – locate Gaussian kernels only at the samples of the target class.
Figure 12.4. Structure of matrix H for model (12.7) and model (12.9). The number of
classes is c = 3. Suppose training samples {(x k , yk )}nk=1 are sorted according to label y.
Colored blocks are non-zero and others are zeros. For model (12.7) consisting of c sets of
n basis functions, the matrix H becomes block-diagonal (with common block matrix H ),
and thus training can be carried out separately for each block. For model (12.9) consisting
of c sets of ny basis functions, the size of the target block is further reduced.
where $n_y$ is the number of training samples in class $y$ and $\{x_k^{(y)}\}_{k=1}^{n_y}$ are the training input samples in class $y$.
The rationale behind this model simplification is as follows. By definition, the
class-posterior probability p∗ (y|x) takes large values in the regions where samples
in class y are dense; conversely, p∗ (y|x) takes smaller values (i.e., close to zero)
in the regions where samples in class y are sparse. When a non-negative function
is approximated by a Gaussian kernel model, many kernels may be needed in the
region where the output of the target function is large; on the other hand, only a
small number of kernels would be enough in the region where the output of the
target function is close to zero. Following this heuristic, many kernels are allocated
in the region where p ∗ (y|x) takes large values, which can be achieved by Eq. (12.9).
This model simplification allows us to further reduce the computational cost
because the size of the target blocks in matrix H is further reduced, as illus-
trated in Figure 12.4(b). To learn the $n_y$-dimensional parameter vector $\theta^{(y)} = (\theta_1^{(y)}, \ldots, \theta_{n_y}^{(y)})^\top$ for each class $y$, we only need to solve the following system of $n_y$ linear equations:

$$(\widehat{H}^{(y)} + \lambda I_{n_y})\,\theta^{(y)} = \widehat{h}^{(y)}, \qquad (12.10)$$

where $\widehat{H}^{(y)}$ is the $n_y \times n_y$ matrix and $\widehat{h}^{(y)}$ is the $n_y$-dimensional vector defined as

$$\widehat{H}^{(y)}_{\ell,\ell'} := \frac{1}{n}\sum_{k=1}^{n}K(x_k, x_\ell^{(y)})\,K(x_k, x_{\ell'}^{(y)}), \qquad (12.11)$$

$$\widehat{h}^{(y)}_{\ell} := \frac{1}{n}\sum_{k=1}^{n_y}K(x_k^{(y)}, x_\ell^{(y)}).$$
Let $\widehat{\theta}^{(y)}$ be the solution of Eq. (12.10). Then the final solution is given by

$$\widehat{p}(y|x) = \frac{\max\!\left(0, \sum_{\ell=1}^{n_y}\widehat{\theta}_\ell^{(y)} K(x, x_\ell^{(y)})\right)}{\sum_{y'=1}^{c}\max\!\left(0, \sum_{\ell=1}^{n_{y'}}\widehat{\theta}_\ell^{(y')} K(x, x_\ell^{(y')})\right)}.$$
For the simplified model (12.9), the computational complexity for obtaining
the solution is O(cn2y n); when ny = n/c for all y (i.e., balanced classification),
this is equal to O(c−1 n3 ). Thus, this approach is computationally highly efficient
for multi-class problems.
A pseudo code of the simplest LSPC implementation for Gaussian kernels is
summarized in Figure 12.5. A MATLAB® implementation of LSPC is available
from http://sugiyama-www.cs.titech.ac.jp/˜sugi/software/LSPC/.
for y = 1, . . . , c
    Solve the linear equation $(\widehat{H}^{(y)} + \lambda I_{n_y})\,\theta^{(y)} = \widehat{h}^{(y)}$ and obtain $\widehat{\theta}^{(y)}$;
end

$$\widehat{p}(y|x) \longleftarrow \frac{\max\!\left(0, \sum_{\ell=1}^{n_y}\widehat{\theta}_\ell^{(y)} K(x, x_\ell^{(y)})\right)}{\sum_{y'=1}^{c}\max\!\left(0, \sum_{\ell=1}^{n_{y'}}\widehat{\theta}_\ell^{(y')} K(x, x_\ell^{(y')})\right)}$$
Figure 12.5. Pseudo code of LSPC for simplified model (12.9) with Gaussian kernel
(12.8).
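A compact Python sketch of LSPC with the simplified model (12.9) and Gaussian kernels follows. It is an illustrative re-implementation rather than the authors' MATLAB code; σ and λ are taken as fixed inputs here, whereas in practice they are chosen by cross-validation.

```python
import numpy as np

def lspc_fit(X, y, sigma=1.0, lam=0.1):
    """Train LSPC with the simplified model (12.9): for each class c, solve
    (H^(c) + lambda I) theta^(c) = h^(c) with Gaussian kernels centered at
    the samples of that class."""
    classes = np.unique(y)
    n = len(X)
    model = {"sigma": sigma, "classes": classes, "centers": [], "theta": []}
    for c in classes:
        Xc = X[y == c]                                  # class-c kernel centers
        ny = len(Xc)
        # K_all[k, l] = K(x_k, x_l^(c)) over all training samples x_k
        K_all = np.exp(-((X[:, None, :] - Xc[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
        H = K_all.T @ K_all / n                         # Eq. (12.11)
        h = K_all[y == c].sum(axis=0) / n
        theta = np.linalg.solve(H + lam * np.eye(ny), h)   # Eq. (12.10)
        model["centers"].append(Xc)
        model["theta"].append(theta)
    return model

def lspc_predict_proba(model, X):
    """Class-posterior estimates with the max(0, .) rounding and normalization."""
    sigma = model["sigma"]
    scores = []
    for Xc, theta in zip(model["centers"], model["theta"]):
        K = np.exp(-((X[:, None, :] - Xc[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
        scores.append(np.maximum(0.0, K @ theta))
    P = np.stack(scores, axis=1)
    return P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)

# Usage on a small two-class toy problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = lspc_fit(X, y)
print(lspc_predict_proba(model, np.array([[-1.0, -1.0], [1.0, 1.0]])))
```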
computing the step direction and a bracketing line search for a point satisfying
the strong Wolfe conditions to compute the step size.
When data are fed to learning algorithms, the input samples are normalized
in element-wise manner so that each element has mean zero and unit variance.
The Gaussian width σ and the regularization parameter λ for all the methods are
chosen based on 2-fold cross-validation from
$$\sigma \in \left\{\tfrac{1}{10}m,\ \tfrac{1}{5}m,\ \tfrac{1}{2}m,\ \tfrac{2}{3}m,\ m,\ \tfrac{3}{2}m,\ 2m,\ 5m,\ 10m\right\}, \qquad \lambda \in \{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\},$$

where $m := \mathrm{median}\left(\{\|x_k - x_{k'}\|\}_{k,k'=1}^{n}\right)$.
We evaluate the classification accuracy and computation time of each method
using the following multi-class classification datasets taken from the LIBSVM web
page (Chang and Lin, 2001):
satimage: Input dimensionality is 36 and the number of classes is 6.
letter: Input dimensionality is 16 and the number of classes is 26.
We investigate the classification accuracy and computation time of LSPC,
LSPC(full), and KLR. For given n and c, we randomly choose $n_y = \lfloor n/c \rfloor$ training samples from each class y, where $\lfloor t \rfloor$ is the largest integer not greater than t. In the
first set of experiments, we fix the number of classes c to the original number shown
above and change the number of training samples to n = 100, 200, 500, 1000, 2000.
In the second set of experiments, we fix the number of training samples to n = 1000
and change the number of classes c – only samples in the first c classes in the dataset
are used. The classification accuracy is evaluated using 100 test samples chosen
randomly from each class. The computation time is measured by the CPU compu-
tation time required for training each classifier when the Gaussian width and the
regularization parameter chosen by cross-validation are used.
The experimental results are summarized in Figures 12.6 and 12.7. The graphs
on the left in Figure 12.6 show the misclassification errors. When n is increased, the
misclassification error for all the methods tends to decrease and LSPC, LSPC(full),
and KLR perform similarly well. The graphs on the right in Figure 12.6 show the
computation times. When n is increased, the computation time tends to grow for
all the methods, but LSPC is faster than KLR by two orders of magnitude. The
graphs on the left in Figure 12.7 show that when c is increased, the misclassification
error tends to increase for all the methods, and LSPC, LSPC(full), and KLR behave
similarly well. The graphs on the right in Figure 12.7 show that when c is increased,
the computation time of KLR tends to grow, whereas that of LSPC is kept constant
or even tends to decrease slightly. This happens because the number of samples
in each class decreases when c is increased, and the computation time of LSPC
is governed by the number of samples in each class, not by the total number of
samples (see Section 12.2.2).
Overall, the computation of LSPC was shown to be faster than that of KLR by
two orders of magnitude, whereas LSPC and KLR were shown to be comparable to
each other in terms of classification accuracy. LSPC and LSPC(full) were shown
to possess similar classification performances, and thus LSPC with the simplified
model (12.9) would be more preferable in practice.
Figure 12.6. Misclassification rates (in percent, left) and computation times (in seconds, right) as functions of the number of training samples n. The two rows correspond to the satimage and letter datasets, respectively.
Figure 12.7. Misclassification rates (in percent, left) and computation times (in seconds, right) as functions of the number of classes c. From top to bottom, the graphs correspond to the 'mnist', 'usps', 'satimage', and 'letter' datasets.
12.2.4 Remarks
Recently, various efficient algorithms for computing the solution of logistic regres-
sions have been developed for high-dimensional sparse data (Koh et al., 2007;
Fan et al., 2008). However, for dense data, using standard non-linear optimization techniques such as Newton's method or quasi-Newton methods seems to be a common choice (Hastie et al., 2001; Minka, 2007). The performance of these
general-purpose non-linear optimizers has been improved in the last decade, but
computing the solution of logistic regressions for a large number of dense training
samples is still a challenging problem.
In this section we described a probabilistic classification algorithm called a
least-squares probabilistic classifier (LSPC). LSPC employs a linear combina-
tion of Gaussian kernels centered at training points for modeling the class-posterior
probability, and the parameters are learned by least-squares class-posterior fitting.
The notable advantages of LSPC are that its solution can be computed analytically
just by solving a system of linear equations and training can be carried out sepa-
rately in a class-wise manner. LSPC was shown experimentally to be faster than
kernel logistic regression (KLR) in computation time by two orders of magnitude,
with comparable accuracy.
$$\widehat{H}^{(y)} = \sum_{\ell=1}^{n_y}\gamma_\ell\,\psi_\ell\psi_\ell^\top,$$

where $\{\psi_\ell\}_{\ell=1}^{n_y}$ are the eigenvectors of $\widehat{H}^{(y)}$ associated with the eigenvalues $\{\gamma_\ell\}_{\ell=1}^{n_y}$. Then, the solution $\widehat{\theta}^{(y)}$ can be expressed as

$$\widehat{\theta}^{(y)} = (\widehat{H}^{(y)} + \lambda I_{n_y})^{-1}\widehat{h}^{(y)} = \sum_{\ell=1}^{n_y}\frac{\widehat{h}^{(y)\top}\psi_\ell}{\gamma_\ell + \lambda}\,\psi_\ell.$$

Because $(\widehat{h}^{(y)\top}\psi_\ell)\,\psi_\ell$ is common to all $\lambda$, the solution $\widehat{\theta}^{(y)}$ for all $\lambda$ can be computed efficiently by eigendecomposing the matrix $\widehat{H}^{(y)}$ once in advance. Although the eigendecomposition of $\widehat{H}^{(y)}$ may be computationally slightly more demanding than solving a system of linear equations of the same size, this approach would be useful, for example, when computing the solutions for various values of $\lambda$ in the cross-validation procedure.
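In code, the trick reads as follows (a small sketch): after one symmetric eigendecomposition of the matrix, the solution for every candidate λ is obtained by rescaling the coefficients $\widehat{h}^\top\psi_\ell$ by $1/(\gamma_\ell + \lambda)$.

```python
import numpy as np

def solutions_for_all_lambdas(H, h, lambdas):
    """Compute theta = (H + lambda I)^{-1} h for many lambda values from a
    single eigendecomposition H = sum_l gamma_l psi_l psi_l^T."""
    gamma, Psi = np.linalg.eigh(H)        # columns of Psi are the eigenvectors
    coeff = Psi.T @ h                     # h^T psi_l for each l
    return {lam: Psi @ (coeff / (gamma + lam)) for lam in lambdas}

# Consistency check against a direct solve
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)); H = A @ A.T; h = rng.normal(size=5)
sols = solutions_for_all_lambdas(H, h, [0.1, 1.0])
print(np.allclose(sols[0.1], np.linalg.solve(H + 0.1 * np.eye(5), h)))  # True
```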
When ny is large, we may further reduce the computational cost and memory
space by using only a subset of kernels. This would be a useful heuristic when
a large number of samples are used for training. Another option for reducing the
computation time when the number of samples is very large is the stochastic gra-
dient descent method (Amari, 1967). That is, starting from some initial parameter
value, gradient descent is carried out only for a randomly chosen single sample
in each iteration. Because our optimization problem is convex, convergence to
the global solution is guaranteed (in the probabilistic sense) by stochastic gradient
descent.
Part IV
distribution $P^*_{\mathrm{de}}$ with density $p^*_{\mathrm{de}}(x)$. We assume that $p^*_{\mathrm{de}}(x)$ is strictly positive. The goal is to estimate the density ratio $r^*(x) = p^*_{\mathrm{nu}}(x)/p^*_{\mathrm{de}}(x)$ based on the observed samples. In this chapter we focus on using a parametric model (i.e., a finite-dimensional model) for density-ratio estimation.
13.1.1 Preliminaries
We consider a linear parametric model with $b$ basis functions $\{\varphi_\ell \mid \ell = 1, \ldots, b\}$. Letting $\boldsymbol{\varphi}(x) := (\varphi_1(x), \ldots, \varphi_b(x))^\top$, we can express our parametric model $\mathcal{R}$ as

$$\mathcal{R} := \left\{\alpha^\top\boldsymbol{\varphi} \mid \alpha \ge 0\right\}.$$

Let $\widehat{\alpha}_n$ be the coefficient corresponding to $\widehat{r}_n$, that is, $\widehat{r}_n = \widehat{\alpha}_n^\top\boldsymbol{\varphi}$, and let $\alpha^*$ be the coefficient of the true density-ratio function:

$$r^* = \boldsymbol{\varphi}^\top\alpha^* = \frac{p^*_{\mathrm{nu}}}{p^*_{\mathrm{de}}}.$$

For any function $r$, let us define $P^*_{\mathrm{nu}}$, $P^*_{\mathrm{de}}$, $\widehat{P}_{\mathrm{nu}}$, and $\widehat{P}_{\mathrm{de}}$ as

$$P^*_{\mathrm{nu}} r := \int r(x)\,p^*_{\mathrm{nu}}(x)\,dx, \qquad P^*_{\mathrm{de}} r := \int r(x)\,p^*_{\mathrm{de}}(x)\,dx,$$
$$\widehat{P}_{\mathrm{nu}} r := \frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}} r(x^{\mathrm{nu}}_i), \qquad \widehat{P}_{\mathrm{de}} r := \frac{1}{n_{\mathrm{de}}}\sum_{j=1}^{n_{\mathrm{de}}} r(x^{\mathrm{de}}_j).$$
We define the (generalized) Hellinger distance with respect to $p^*_{\mathrm{de}}$ as

$$h_{P^*_{\mathrm{de}}}(r, r') := \left(\int \left(\sqrt{r(x)} - \sqrt{r'(x)}\right)^2 p^*_{\mathrm{de}}(x)\,dx\right)^{1/2}.$$

1. $0 < \eta_0 \le r^* \le \eta_1$ on the support of $p^*_{\mathrm{nu}}$.
2. $\exists\, \epsilon^*, \xi^* > 0$ such that $\int \varphi_\ell(x)\,p^*_{\mathrm{de}}(x)\,dx \ge \epsilon^*$ and $\|\varphi_\ell\|_\infty \le \xi^*$ $(\forall \varphi_\ell \in \mathcal{F})$.
3. $\int \boldsymbol{\varphi}(x)\boldsymbol{\varphi}(x)^\top p^*_{\mathrm{de}}(x)\,dx \succ O$ (positive definite).
4. The model contains the true density-ratio function: $r^* \in \mathcal{R}$.
Let $\psi(\alpha) := \log(\alpha^\top\boldsymbol{\varphi}(x))$. Note that if $P^*_{\mathrm{de}}(\boldsymbol{\varphi}\boldsymbol{\varphi}^\top) \succ O$ is satisfied, then we obtain the following inequality for all $\beta \ne 0$:

$$\beta^\top\nabla^2 P^*_{\mathrm{nu}}\psi(\alpha^*)\,\beta = \beta^\top\nabla P^*_{\mathrm{nu}}\left.\frac{\boldsymbol{\varphi}}{\alpha^\top\boldsymbol{\varphi}}\right|_{\alpha=\alpha^*}\beta = -\beta^\top P^*_{\mathrm{nu}}\frac{\boldsymbol{\varphi}\boldsymbol{\varphi}^\top}{(\boldsymbol{\varphi}^\top\alpha^*)^2}\,\beta = -\beta^\top P^*_{\mathrm{de}}\frac{\boldsymbol{\varphi}\boldsymbol{\varphi}^\top}{\boldsymbol{\varphi}^\top\alpha^*}\,\beta \le -\beta^\top P^*_{\mathrm{de}}(\boldsymbol{\varphi}\boldsymbol{\varphi}^\top)\,\beta/\eta_1 < 0.$$

Thus, $-\nabla^2 P^*_{\mathrm{nu}}\psi(\alpha^*)$ is also positive definite. Let

$$G^* := -\nabla^2 P^*_{\mathrm{nu}}\psi(\alpha^*) = P^*_{\mathrm{nu}}\left[\nabla\psi(\alpha^*)\nabla\psi(\alpha^*)^\top\right] \;(\succ O).$$
$$B(\alpha, \epsilon) := \{\alpha' \mid \|\alpha' - \alpha\| \le \epsilon\}.$$
Now $S$ and $S_n$ are convex polytopes, so that the approximating cones at $\alpha^*$ are also convex polytopes. That is,

$$C = \left\{\sum_{i=1}^{j}\lambda_i\mu_i + \sum_{i=j+1}^{b-1}\beta_i\mu_i \;\middle|\; \lambda_i \ge 0,\ \beta_i \in \mathbb{R}\right\}.$$
Then we obtain the asymptotic law of $\sqrt{n}\,(\widehat{\alpha}_n/c_n - \alpha^*)$.
$$\mu_i^\top G^*\alpha^* = 0 \qquad (i = 1, \ldots, b-1).$$
13.2.1 Preliminaries
Let us model the density ratio $r^*(x) = p^*_{\mathrm{nu}}(x)/p^*_{\mathrm{de}}(x)$ by the following linear model:

$$r(x; \theta) = \sum_{\ell=1}^{b}\theta_\ell\,\varphi_\ell(x),$$

and let $\widehat{h}$ be the $b$-dimensional vector with the $\ell$-th element

$$\widehat{h}_\ell = \frac{1}{n_{\mathrm{nu}}}\sum_{i=1}^{n_{\mathrm{nu}}}\varphi_\ell(x^{\mathrm{nu}}_i).$$

Similarly, the elements of the $b \times b$ matrix $H$ and the $b$-dimensional vector $h$ are defined as

$$H_{\ell,\ell'} = \int \varphi_\ell(x)\,\varphi_{\ell'}(x)\,p^*_{\mathrm{de}}(x)\,dx \qquad \text{and} \qquad h_\ell = \int \varphi_\ell(x)\,p^*_{\mathrm{nu}}(x)\,dx,$$

which are obtained as the infinite sample limits of $\widehat{H}$ and $\widehat{h}$, respectively.
Suppose that the linear model r(x; θ ) includes the true density ratio r ∗ (x). Then,
if the sample sizes nde and nnu tend to infinity, both the constrained LSIF (cLSIF;
Section 6.2.1) and the unconstrained LSIF (uLSIF; Section 6.2.2) estimators will
converge to r ∗ (x). In the following we present the asymptotic properties of these
estimators.
Let us consider the squared distance between two ratios, $r^*(x)$ and $r(x; \theta)$:

$$J_0(\theta) = \frac{1}{2}\int \left(r(x; \theta) - r^*(x)\right)^2 p^*_{\mathrm{de}}(x)\,dx = \frac{1}{2}\int r(x; \theta)^2\,p^*_{\mathrm{de}}(x)\,dx - \int r(x; \theta)\,p^*_{\mathrm{nu}}(x)\,dx + \frac{1}{2}\int r^*(x)^2\,p^*_{\mathrm{de}}(x)\,dx,$$

where the last term is a constant and therefore can be safely ignored. Let us denote the first two terms by $J$:

$$J(\theta) = \frac{1}{2}\int r(x; \theta)^2\,p^*_{\mathrm{de}}(x)\,dx - \int r(x; \theta)\,p^*_{\mathrm{nu}}(x)\,dx = \frac{1}{2}\sum_{\ell,\ell'=1}^{b}\theta_\ell\theta_{\ell'}\int\varphi_\ell(x)\,\varphi_{\ell'}(x)\,p^*_{\mathrm{de}}(x)\,dx - \sum_{\ell=1}^{b}\theta_\ell\int\varphi_\ell(x)\,p^*_{\mathrm{nu}}(x)\,dx = \frac{1}{2}\theta^\top H\theta - h^\top\theta.$$

The accuracy of the estimated ratio $r(x; \widehat{\theta})$ is measured by $J(\widehat{\theta})$.
For the active set A = {j1 , . . . , j|A| } with j1 < · · · < j|A| , let E be the |A| × b
indicator matrix with the $(i, j_i)$-th element

$$E_{i,j} = \begin{cases}1 & j = j_i \in A,\\ 0 & \text{otherwise},\end{cases}$$

and $A = H^{-1} - H^{-1}E^\top(EH^{-1}E^\top)^{-1}EH^{-1}$. For the functions $r(x)$ and $r'(x)$, let
$C_{r,r'}$ be the $b \times b$ covariance matrix with the $(\ell, \ell')$-th element being the covariance between $r(x)\varphi_\ell(x)$ and $r'(x)\varphi_{\ell'}(x)$ under the probability $p^*_{\mathrm{de}}(x)$, and let the functions $r^*(x)$ and $v(x)$ denote
In the following we also use C r ∗ ,r ∗ and C r ∗ ,v defined in the same way as was done
previously. Let
f (n) = ω(g(n))
denote that f (n) asymptotically dominates g(n); more precisely, for all C > 0,
there exists n0 such that
(a) The optimal solution of the problem (13.3) satisfies the strict complementarity
condition (Bertsekas et al., 2003).
(b) $n_{\mathrm{nu}}$ and $n_{\mathrm{de}}$ satisfy

$$n_{\mathrm{nu}} = \omega(n_{\mathrm{de}}^2). \qquad (13.4)$$
This theorem elucidates the learning curve (Amari et al., 1992) of cLSIF up to
the order of 1/nde .
$$B = \{\ell \mid \theta^{\circ}_\ell < 0,\ \ell = 1, \ldots, b\}.$$

Let $D$ be the $b$-dimensional diagonal matrix with the $\ell$-th diagonal element

$$D_{\ell,\ell} = \begin{cases}0 & \ell \in B,\\ 1 & \text{otherwise}.\end{cases}$$

Let

$$r^*(x) = \sum_{\ell=1}^{b}\theta^*_\ell\,\varphi_\ell(x) \qquad \text{and} \qquad v(x) = \sum_{\ell=1}^{b}\left[B_\lambda^{-1}D(H\theta^* - h)\right]_\ell\varphi_\ell(x),$$
(a) For the optimal solution of the problem (13.6), the condition $\theta^{\circ}_\ell \ne 0$ for $\ell = 1, \ldots, b$ holds.
(b) $n_{\mathrm{de}}$ and $n_{\mathrm{nu}}$ satisfy Eq. (13.4).

$$\mathbb{E}[J(\widehat{\theta})] = J(\theta^*) + \frac{1}{2n_{\mathrm{de}}}\,\mathrm{tr}\!\left(B_\lambda^{-1}DHDB_\lambda^{-1}C_{r^*,r^*} + 2B_\lambda^{-1}C_{r^*,v}\right) + o\!\left(\frac{1}{n_{\mathrm{de}}}\right).$$

Theorem 13.7 elucidates the learning curve of uLSIF up to the order of $n_{\mathrm{de}}^{-1}$.
$$\frac{1}{n_{\mathrm{de}}}\sum_{i=1}^{n_{\mathrm{de}}}\eta(x^{\mathrm{de}}_i; \theta)\,r(x^{\mathrm{de}}_i; \theta) = \frac{1}{n_{\mathrm{nu}}}\sum_{j=1}^{n_{\mathrm{nu}}}\eta(x^{\mathrm{nu}}_j; \theta). \qquad (13.7)$$
and hence $\theta = \theta^*$ is a solution of Eq. (13.8). Under a mild assumption, the estimator $\widehat{\theta}_\eta$ converges to $\theta^*$; that is, the estimator is statistically consistent. Qin (1998)
proved that the moment function defined by

$$\eta^*(x; \theta) = \frac{1}{1 + n_{\mathrm{nu}}/n_{\mathrm{de}}\cdot r(x; \theta)}\,\nabla\log r(x; \theta) \qquad (13.9)$$
is optimal, where ∇ log r(x; θ ) denotes the b-dimensional gradient vector of
log r(x; θ ). More precisely, the variance–covariance matrix V( θ η∗ ) of the estima-
tor
θ η∗ is asymptotically smaller than or equal to the variance–covariance matrix
V(θ η ) of the other estimator
θ η in the sense of the positive semi-definiteness of
the matrix. This fact is summarized as follows.
Theorem 13.8 (Theorem 3 in Qin, 1998) Suppose that the limit of nnu /nde con-
verges to a positive constant. For any vector-valued function η(x, θ ) with finite
variance under $p^*_{\mathrm{de}}$ and $p^*_{\mathrm{nu}}$, the difference of the asymptotic variance–covariance matrices, $V(\widehat{\theta}_\eta) - V(\widehat{\theta}_{\eta^*})$, is positive semi-definite. This means that the estimator with $\eta^*$ is optimal in the sense of the variance–covariance matrix of the estimator.
In the following we show that the optimal moment-matching estimator is
derived from the maximum likelihood estimator of the logistic regression models.
Let us assign a selector variable $y = \mathrm{nu}$ to samples drawn from $p^*_{\mathrm{nu}}(x)$ and $y = \mathrm{de}$ to samples drawn from $p^*_{\mathrm{de}}(x)$; that is, the two densities are written as

$$p^*_{\mathrm{nu}}(x) = q^*(x|y = \mathrm{nu}) \qquad \text{and} \qquad p^*_{\mathrm{de}}(x) = q^*(x|y = \mathrm{de}).$$
$(x^{\mathrm{de}}_i, \mathrm{de}),\ i = 1, \ldots, n_{\mathrm{de}}$, and $(x^{\mathrm{nu}}_j, \mathrm{nu}),\ j = 1, \ldots, n_{\mathrm{nu}}$, is observed, the maximum likelihood estimator based on the model (13.10) is the maximizer of the log-likelihood function,
$$L(\theta) = \sum_{i=1}^{n_{\mathrm{de}}}\log\frac{1}{1 + n_{\mathrm{nu}}/n_{\mathrm{de}}\cdot r(x^{\mathrm{de}}_i; \theta)} + \sum_{j=1}^{n_{\mathrm{nu}}}\log\frac{n_{\mathrm{nu}}/n_{\mathrm{de}}\cdot r(x^{\mathrm{nu}}_j; \theta)}{1 + n_{\mathrm{nu}}/n_{\mathrm{de}}\cdot r(x^{\mathrm{nu}}_j; \theta)},$$

where

$$r(x; \theta) = \exp\{\theta_0 + \theta_1^\top\phi(x)\}.$$
$$\sum_{i=1}^{n_{\mathrm{de}}}\frac{\rho\, r(x^{\mathrm{de}}_i; \theta)}{1 + \rho\, r(x^{\mathrm{de}}_i; \theta)}\,\nabla\log r(x^{\mathrm{de}}_i; \theta) = \sum_{j=1}^{n_{\mathrm{nu}}}\frac{1}{1 + \rho\, r(x^{\mathrm{nu}}_j; \theta)}\,\nabla\log r(x^{\mathrm{nu}}_j; \theta).$$
In our theoretical analysis, we use the expectation of $\mathrm{UKL}(p^*_{\mathrm{nu}}\|\widehat{r}\cdot p^*_{\mathrm{de}})$ over $\{x^{\mathrm{nu}}_i\}_{i=1}^{n}$ and $\{x^{\mathrm{de}}_j\}_{j=1}^{n}$ as the measure of accuracy of a density-ratio estimator $\widehat{r}(x)$:

$$J(\widehat{r}) := \mathbb{E}\left[\mathrm{UKL}(p^*_{\mathrm{nu}}\|\widehat{r}\cdot p^*_{\mathrm{de}})\right], \qquad (13.13)$$

where $\mathbb{E}$ denotes the expectation over $\{x^{\mathrm{nu}}_i\}_{i=1}^{n}$ and $\{x^{\mathrm{de}}_j\}_{j=1}^{n}$.
$$\widehat{\theta}_{\mathrm{nu}} := \mathop{\mathrm{argmax}}_{\theta_{\mathrm{nu}}\in\Theta_{\mathrm{nu}}}\left[\sum_{i=1}^{n}\log p_{\mathrm{nu}}(x^{\mathrm{nu}}_i; \theta_{\mathrm{nu}})\right], \qquad \widehat{\theta}_{\mathrm{de}} := \mathop{\mathrm{argmax}}_{\theta_{\mathrm{de}}\in\Theta_{\mathrm{de}}}\left[\sum_{j=1}^{n}\log p_{\mathrm{de}}(x^{\mathrm{de}}_j; \theta_{\mathrm{de}})\right].$$

Note that the maximum likelihood estimators $\widehat{\theta}_{\mathrm{nu}}$ and $\widehat{\theta}_{\mathrm{de}}$ minimize the empirical Kullback–Leibler divergences from the true densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ to their models $p_{\mathrm{nu}}(x; \theta_{\mathrm{nu}})$ and $p_{\mathrm{de}}(x; \theta_{\mathrm{de}})$, respectively:

$$\widehat{\theta}_{\mathrm{nu}} = \mathop{\mathrm{argmin}}_{\theta_{\mathrm{nu}}\in\Theta_{\mathrm{nu}}}\left[\frac{1}{n}\sum_{i=1}^{n}\log\frac{p^*_{\mathrm{nu}}(x^{\mathrm{nu}}_i)}{p_{\mathrm{nu}}(x^{\mathrm{nu}}_i; \theta_{\mathrm{nu}})}\right], \qquad \widehat{\theta}_{\mathrm{de}} = \mathop{\mathrm{argmin}}_{\theta_{\mathrm{de}}\in\Theta_{\mathrm{de}}}\left[\frac{1}{n}\sum_{j=1}^{n}\log\frac{p^*_{\mathrm{de}}(x^{\mathrm{de}}_j)}{p_{\mathrm{de}}(x^{\mathrm{de}}_j; \theta_{\mathrm{de}})}\right].$$
where the estimator is normalized so that $\frac{1}{n}\sum_{j=1}^{n}\widehat{r}_A(x^{\mathrm{de}}_j) = 1$.
Logistic Regression
Let us assign a selector variable $y = \mathrm{nu}$ to samples drawn from $p^*_{\mathrm{nu}}(x)$ and $y = \mathrm{de}$ to samples drawn from $p^*_{\mathrm{de}}(x)$; that is, the two densities are written as $p^*_{\mathrm{nu}}(x) = q^*(x|y = \mathrm{nu})$ and $p^*_{\mathrm{de}}(x) = q^*(x|y = \mathrm{de})$. Since

$$q^*(x|y = \mathrm{nu}) = \frac{q^*(y = \mathrm{nu}|x)\,q^*(x)}{q^*(y = \mathrm{nu})}, \qquad q^*(x|y = \mathrm{de}) = \frac{q^*(y = \mathrm{de}|x)\,q^*(x)}{q^*(y = \mathrm{de})},$$

the density ratio can be expressed in terms of $y$ as

$$r^*(x) = \frac{q^*(y = \mathrm{nu}|x)}{q^*(y = \mathrm{nu})}\,\frac{q^*(y = \mathrm{de})}{q^*(y = \mathrm{de}|x)} = \frac{q^*(y = \mathrm{nu}|x)}{q^*(y = \mathrm{de}|x)},$$

where we use $q^*(y = \mathrm{nu}) = q^*(y = \mathrm{de}) = 1/2$ based on the assumption that $n_{\mathrm{nu}} = n_{\mathrm{de}} = n$.

The conditional probability $q^*(y|x)$ could be approximated by discriminating $\{x^{\mathrm{nu}}_i\}_{i=1}^{n}$ from $\{x^{\mathrm{de}}_j\}_{j=1}^{n}$ using a logistic regression classifier; that is, for a non-negative parametric function $r(x; \theta)$, the conditional probabilities $q^*(y = \mathrm{nu}|x)$ and $q^*(y = \mathrm{de}|x)$ are modeled by

$$q(y = \mathrm{nu}|x; \theta) = \frac{r(x; \theta)}{1 + r(x; \theta)} \qquad \text{and} \qquad q(y = \mathrm{de}|x; \theta) = \frac{1}{1 + r(x; \theta)}.$$

Then the maximum likelihood estimator $\widehat{\theta}_B$ is computed from $\{x^{\mathrm{nu}}_i\}_{i=1}^{n}$ and $\{x^{\mathrm{de}}_j\}_{j=1}^{n}$ as

$$\widehat{\theta}_B := \mathop{\mathrm{argmax}}_{\theta\in\Theta}\left[\sum_{i=1}^{n}\log\frac{r(x^{\mathrm{nu}}_i; \theta)}{1 + r(x^{\mathrm{nu}}_i; \theta)} + \sum_{j=1}^{n}\log\frac{1}{1 + r(x^{\mathrm{de}}_j; \theta)}\right]. \qquad (13.14)$$
Note that the maximum likelihood estimator $\widehat{\theta}_B$ minimizes the empirical Kullback–Leibler divergence from the true density $q^*(x, y)$ to its estimator $q(y|x; \theta)\,q^*(x)$:

$$\widehat{\theta}_B = \mathop{\mathrm{argmin}}_{\theta\in\Theta}\left[\frac{1}{2n}\sum_{i=1}^{n}\log\frac{q^*(x^{\mathrm{nu}}_i, y = \mathrm{nu})}{q(y = \mathrm{nu}|x^{\mathrm{nu}}_i; \theta)\,q^*(x^{\mathrm{nu}}_i)} + \frac{1}{2n}\sum_{j=1}^{n}\log\frac{q^*(x^{\mathrm{de}}_j, y = \mathrm{de})}{q(y = \mathrm{de}|x^{\mathrm{de}}_j; \theta)\,q^*(x^{\mathrm{de}}_j)}\right].$$
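As a concrete illustration of this classification-based approach, the following sketch estimates the density ratio by discriminating numerator from denominator samples with an off-the-shelf logistic regression classifier (here scikit-learn's implementation on raw features). With equal sample sizes the prior-ratio correction factor equals one, but it is included for generality.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ratio_by_logistic_regression(x_nu, x_de, x_query):
    """Density-ratio estimation via probabilistic classification:
    r_hat(x) = (n_de / n_nu) * q(y=nu|x) / q(y=de|x)."""
    X = np.vstack([x_nu, x_de])
    y = np.concatenate([np.ones(len(x_nu)), np.zeros(len(x_de))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_nu = clf.predict_proba(x_query)[:, 1]
    return (len(x_de) / len(x_nu)) * p_nu / (1.0 - p_nu)

# Toy check with two shifted Gaussians
rng = np.random.default_rng(0)
x_nu = rng.normal(0.5, 1, size=(500, 1))
x_de = rng.normal(0.0, 1, size=(500, 1))
print(ratio_by_logistic_regression(x_nu, x_de, np.array([[0.0], [1.0]])))
```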
$$\widehat{\theta}_C := \mathop{\mathrm{argmax}}_{\theta\in\Theta}\left[\sum_{i=1}^{n}\log r(x^{\mathrm{nu}}_i; \theta) - \sum_{j=1}^{n} r(x^{\mathrm{de}}_j; \theta)\right]. \qquad (13.15)$$

Note that $\widehat{\theta}_C$ minimizes the empirical unnormalized Kullback–Leibler divergence from the true density $p^*_{\mathrm{nu}}(x)$ to its estimator $r(x; \theta)\,p^*_{\mathrm{de}}(x)$:

$$\widehat{\theta}_C = \mathop{\mathrm{argmin}}_{\theta\in\Theta}\left[\frac{1}{n}\sum_{i=1}^{n}\log\frac{p^*_{\mathrm{nu}}(x^{\mathrm{nu}}_i)}{r(x^{\mathrm{nu}}_i; \theta)\,p^*_{\mathrm{de}}(x^{\mathrm{nu}}_i)} - 1 + \frac{1}{n}\sum_{j=1}^{n} r(x^{\mathrm{de}}_j; \theta)\right].$$
Method (A): For the exponential model (13.16), the maximum likelihood estimators $\widehat{\theta}_{\mathrm{nu}}$ and $\widehat{\theta}_{\mathrm{de}}$ are given by

$$\widehat{\theta}_{\mathrm{nu}} = \mathop{\mathrm{argmax}}_{\theta\in\Theta}\left[\sum_{i=1}^{n}\theta^\top\xi(x^{\mathrm{nu}}_i) - n\varphi(\theta)\right], \qquad \widehat{\theta}_{\mathrm{de}} = \mathop{\mathrm{argmax}}_{\theta\in\Theta}\left[\sum_{j=1}^{n}\theta^\top\xi(x^{\mathrm{de}}_j) - n\varphi(\theta)\right],$$

where $\widehat{\theta}_A := \widehat{\theta}_{\mathrm{nu}} - \widehat{\theta}_{\mathrm{de}}$. One may use other estimators such as

$$\widehat{r}_A(x) = \exp\left\{\widehat{\theta}_A^\top\xi(x) - \varphi(\widehat{\theta}_{\mathrm{nu}}) + \varphi(\widehat{\theta}_{\mathrm{de}})\right\}$$
Method (B): For the exponential model (13.17), the optimization problem (13.14) is expressed as

$$(\widehat{\theta}_B, \widehat{\theta}_{B,0}) = \mathop{\mathrm{argmax}}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}\left[\sum_{i=1}^{n}\log\frac{r(x^{\mathrm{nu}}_i; \theta, \theta_0)}{1 + r(x^{\mathrm{nu}}_i; \theta, \theta_0)} + \sum_{j=1}^{n}\log\frac{1}{1 + r(x^{\mathrm{de}}_j; \theta, \theta_0)}\right]$$
$$= \mathop{\mathrm{argmax}}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}\left[\sum_{i=1}^{n}\log\frac{\exp\{\theta_0 + \theta^\top\xi(x^{\mathrm{nu}}_i)\}}{1 + \exp\{\theta_0 + \theta^\top\xi(x^{\mathrm{nu}}_i)\}} + \sum_{j=1}^{n}\log\frac{1}{1 + \exp\{\theta_0 + \theta^\top\xi(x^{\mathrm{de}}_j)\}}\right].$$
Method (C): For the exponential model (13.17), the optimization problem (13.15) is expressed as

$$(\widehat{\theta}_C, \widehat{\theta}_{C,0}) = \mathop{\mathrm{argmax}}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}\left[\frac{1}{n}\sum_{i=1}^{n}\log r(x^{\mathrm{nu}}_i; \theta, \theta_0) - \frac{1}{n}\sum_{j=1}^{n} r(x^{\mathrm{de}}_j; \theta, \theta_0)\right]$$
$$= \mathop{\mathrm{argmax}}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(\theta_0 + \theta^\top\xi(x^{\mathrm{nu}}_i)\right) - \frac{1}{n}\sum_{j=1}^{n}\exp\{\theta_0 + \theta^\top\xi(x^{\mathrm{de}}_j)\}\right].$$
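For completeness, method (C) for this log-linear model can be implemented directly by maximizing the objective above with a generic optimizer. The sketch below uses scipy's minimize on the negated objective, with ξ(x) taken to be the raw feature vector purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def fit_ratio_method_c(x_nu, x_de):
    """Method (C): maximize (1/n) sum (theta0 + theta^T xi(x_nu))
                         - (1/n) sum exp(theta0 + theta^T xi(x_de)),
    with xi(x) = x as an illustrative choice of sufficient statistic."""
    d = x_nu.shape[1]

    def neg_objective(params):
        theta0, theta = params[0], params[1:]
        lin_nu = theta0 + x_nu @ theta
        lin_de = theta0 + x_de @ theta
        return -(lin_nu.mean() - np.exp(lin_de).mean())

    res = minimize(neg_objective, np.zeros(d + 1), method="L-BFGS-B")
    theta0, theta = res.x[0], res.x[1:]
    return lambda x: np.exp(theta0 + x @ theta)      # r_hat(x; theta, theta0)

rng = np.random.default_rng(0)
x_nu = rng.normal(0.5, 1, size=(500, 1))
x_de = rng.normal(0.0, 1, size=(500, 1))
r_hat = fit_ratio_method_c(x_nu, x_de)
print(r_hat(np.array([[0.0], [1.0]])))
```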
Based on these lemmas, we compare the accuracy of the three methods. For the
accuracy of (A) and (B), we have the following theorem.
Theorem 13.12 $J(\widehat{r}_A) \le J(\widehat{r}_B)$ holds asymptotically.
Thus method (A) is more accurate than method (B) in terms of the expected
unnormalized Kullback–Leibler divergence (13.13). Theorem 13.12 may be
regarded as an extension of the result for binary classification (Efron, 1975): esti-
mating data-generating Gaussian densities by maximum likelihood estimation has
higher statistical efficiency than logistic regression in terms of the classification
error rate.
Next, we compare the accuracy of (B) and (C).
Theorem 13.13 $J(\widehat{r}_B) \le J(\widehat{r}_C)$ holds asymptotically.
Thus method (B) is more accurate than method (C) in terms of the expected
unnormalized Kullback–Leibler divergence (13.13). This inequality is a direct
consequence of Qin (1998) (see Section 13.3), where it was shown that method
(B) has the smallest asymptotic variance in a class of semi-parametric estimators.
It is easy to see that method (C) is included in the class.
Finally, we compare the accuracy of (A) and (C). From Theorems 13.12
and 13.13, we immediately have the following corollary.
Corollary 13.14 $J(\widehat{r}_A) \le J(\widehat{r}_C)$ holds asymptotically.
It was advocated that one should avoid solving more difficult intermedi-
ate problems when solving a target problem (Vapnik, 1998). This statement is
sometimes referred to as Vapnik’s principle, and the support vector machine
(Cortes and Vapnik, 1995) would be a successful example of this principle; instead
of estimating a data-generation model, it directly models the decision boundary,
which is sufficient for pattern recognition.
If we follow Vapnik's principle, directly estimating the density ratio $r^*(x)$ would be more promising than estimating the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$, because knowing $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ implies knowing $r^*(x)$, but not vice versa; indeed, $r^*(x)$ cannot be uniquely decomposed into $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$. Thus, at a glance,
Corollary 13.14 is counterintuitive. However, Corollary 13.14 would be reasonable
because method (C) does not make use of the knowledge that each density is
exponential, but only the knowledge that their ratio is exponential. Thus method
(A) can utilize the a priori model information more effectively. Thanks to the
additional knowledge that both the densities belong to the exponential model, the
intermediate problems (i.e., density estimation) were actually made easier in terms
of Vapnik’s principle.
First we study the convergence of method (A). Let $\bar{p}_{\mathrm{nu}}(x)$ and $\bar{p}_{\mathrm{de}}(x)$ be the projections of the true densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ onto the model $p(x; \theta)$ in terms of the Kullback–Leibler divergence (13.12): $\bar{p}_{\mathrm{nu}}(x) := p(x; \bar{\theta}_{\mathrm{nu}})$ and $\bar{p}_{\mathrm{de}}(x) := p(x; \bar{\theta}_{\mathrm{de}})$, where

$$\bar{\theta}_{\mathrm{nu}} := \mathop{\mathrm{argmin}}_{\theta\in\Theta}\int p^*_{\mathrm{nu}}(x)\log\frac{p^*_{\mathrm{nu}}(x)}{p(x; \theta)}\,dx, \qquad \bar{\theta}_{\mathrm{de}} := \mathop{\mathrm{argmin}}_{\theta\in\Theta}\int p^*_{\mathrm{de}}(x)\log\frac{p^*_{\mathrm{de}}(x)}{p(x; \theta)}\,dx.$$

This means that $\bar{p}_{\mathrm{nu}}(x)$ and $\bar{p}_{\mathrm{de}}(x)$ are the optimal approximations to $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ in the model $p(x; \theta)$ in terms of the Kullback–Leibler divergence. Let

$$\bar{r}_A(x) := \frac{\bar{p}_{\mathrm{nu}}(x)}{\bar{p}_{\mathrm{de}}(x)}.$$

Because the ratio of two exponential densities also belongs to the exponential model, there exists $\bar{\theta}_A \in \Theta$ such that $\bar{r}_A(x) = r(x; \bar{\theta}_A, \bar{\theta}_{A,0})$. Then we have the following lemma.

Lemma 13.15 $\widehat{r}_A$ converges in probability to $\bar{r}_A$ as $n \to \infty$.
Next we investigate the convergence of method (B). Let $q^*(x, y)$ be the joint probability defined as

$$q^*(x, y) = q^*(y|x)\times\frac{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}{2},$$

where $q^*(y|x)$ is the conditional probability of $y$ such that

$$q^*(y = \mathrm{nu}|x) = \frac{r^*(x)}{1 + r^*(x)} \qquad \text{and} \qquad q^*(y = \mathrm{de}|x) = \frac{1}{1 + r^*(x)}.$$

The model (13.19) is used to estimate $q^*(x, y)$, and let $\bar{q}(x, y)$ be the projection of the true density $q^*(x, y)$ onto the model (13.19) in terms of the Kullback–Leibler divergence (13.12): $\bar{q}(x, y) := q(y|x; \bar{\theta}_B, \bar{\theta}_{B,0})\,\frac{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}{2}$, where

$$(\bar{\theta}_B, \bar{\theta}_{B,0}) := \mathop{\mathrm{argmin}}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}\sum_{y\in\{\mathrm{nu},\mathrm{de}\}}\int q^*(x, y)\log\frac{q^*(y|x)}{q(y|x; \theta, \theta_0)}\,dx.$$

This means that $\bar{q}(x, y)$ is the optimal approximation to $q^*(x, y)$ in the model

$$q(y|x; \theta, \theta_0)\,\frac{p^*_{\mathrm{nu}}(x) + p^*_{\mathrm{de}}(x)}{2}$$
where

$$(\bar{\theta}_C, \bar{\theta}_{C,0}) := \mathop{\mathrm{argmin}}_{(\theta,\theta_0)\in\Theta\times\mathbb{R}}\left[\int p^*_{\mathrm{nu}}(x)\log\frac{r^*(x)}{r(x; \theta, \theta_0)}\,dx - 1 + \int p^*_{\mathrm{de}}(x)\,r(x; \theta, \theta_0)\,dx\right].$$

This means that $\bar{r}_C(x)$ is the optimal approximation to $r^*(x)$ in the model $r(x; \theta)$ in terms of the unnormalized Kullback–Leibler divergence. Then we have the following lemma.

Lemma 13.17 $\widehat{r}_C$ converges in probability to $\bar{r}_C$ as $n \to \infty$.
Based on these lemmas, we investigate the relation among the three methods.
Lemma 13.17 implies that method (C) is consistent with the optimal approximation $\bar{r}_C$. However, as we will show in the following, methods (A) and (B) are not consistent with the optimal approximation $\bar{r}_C$ in general. Let us measure the deviation of a density-ratio function $r'$ from $r$ by

$$D(r', r) := \int p^*_{\mathrm{de}}(x)\left(r'(x) - r(x)\right)^2 dx.$$

When the model is misspecified, $p^*_{\mathrm{de}}(x)\,\bar{r}_A(x)$ and $p^*_{\mathrm{de}}(x)\,\bar{r}_B(x)$ are not probability densities in general. Then Theorem 13.18 implies that methods (A) and (B) are not consistent with the optimal approximation $\bar{r}_C$.
Because model misspecification is a usual situation in practice, method (C) is
the most promising approach in density-ratio estimation.
Finally, for the consistency of method (A), we also have the following additional
result.
Corollary 13.19 If $p^*_{\mathrm{de}}(x)$ belongs to the exponential model (13.16), that is, there exists $\theta^*_{\mathrm{de}} \in \Theta$ such that $p^*_{\mathrm{de}}(x) = p(x; \theta^*_{\mathrm{de}})$, then $\bar{r}_A = \bar{r}_C$ holds even when $p^*_{\mathrm{nu}}(x)$ does not belong to the exponential model (13.16).

This corollary means that, as long as $p^*_{\mathrm{de}}(x)$ is correctly specified, method (A) is still consistent.
13.5 Remarks
In this chapter we analyzed the asymptotic properties of density-ratio estimators
under the parametric setup.
We first elucidated the consistency and asymptotic normality of KLIEP
(Chapter 5) in Section 13.1, and the asymptotic learning curve of cLSIF
(Section 6.2.1) and uLSIF (Section 6.2.2) in Section 13.2. All of the methods
were shown to achieve the $\sqrt{n}$-consistency, which is the optimal parametric
convergence rate.
In Section 13.3 we considered the moment-matching approach to density-ratio
estimation and introduced Qin’s (1998) result: The logistic regression method
(Chapter 4) achieves the minimum asymptotic variance under the correctly spec-
ified parametric setup. Thus, as long as the parametric model at hand is correctly
specified, use of the logistic regression method is recommended.
Finally, in Section 13.4, we theoretically compared the performance of three
density-ratio estimation methods:
(A) The density estimation method (Chapter 2)
(B) The logistic regression method (Chapter 4)
(C) The Kullback–Leibler divergence method (Chapter 5)
In Section 13.4.4, we first showed that when the numerator and denominator
densities are known to be members of the exponential family, (A) is better than (B)
and (B) is better than (C) in terms of the expected unnormalized Kullback–Leibler
divergence. This implies that when correctly specified parametric density models
are available for both the numerator and denominator densities, separate density
estimation is more promising than direct density-ratio estimation. This is because
direct density-ratio estimation cannot utilize the knowledge of each density, only
the knowledge of their ratio. However, once the model assumption is violated, (C)
is better than (A) and (B), as shown in Section 13.4.5. Thus, in practical situations
where no exact model is available, (C) would be the most promising approach to
density-ratio estimation.
In the next chapter we analyze statistical properties of density-ratio esti-
mation under the non-parametric setup, which requires considerably different
mathematical tools.
14
Non-Parametric Convergence Analysis
∗ ∗ ∗ ∗
Let pnu (x) and pde (x) be the probability densities for the distributions Pnu and Pde ,
∗ ∗ ∗
respectively. The goal is to estimate the density ratio r (x) = pnu (x)/pde (x) based
∗
on the observed samples, where pde (x) is assumed to be strictly positive. In this
chapter we focus on using a non-parametric model (i.e., an infinite-dimensional
model) for density-ratio estimation.
F / f 0→ f (x i ),
n i=1
236
14.1 Mathematical Preliminaries 237
where E denotes the expectation. To evaluate the uniform upper bound, quanti-
ties that represent the “complexity” of F are required. The covering numbers and
bracketing numbers are commonly used complexity measures that will be intro-
duced in Section 14.1.3. A key device to give the tail probability of the uniform
upper bound is Talagrand’s concentration inequality, which evaluates how the
uniform upper bound of the empirical process concentrates around its expectation.
Talagrand’s concentration inequality will be explained in Section 14.1.5.
14.1.1 Outline
First we give an outline for deriving the convergence rates of non-parametrically
estimated density ratios. Because the aim here is to give an intuitive explanation,
mathematical preciseness is sacrificed to some extent.
Let F be a model that is a class of measurable functions. Let x 1 , . . . , x n be
i.i.d. samples, and let P ∗ be an underlying probability measure generating x i . We
denote an empirical risk of a function f for a loss function by
n
∗ (f ) := 1
P (f (x k )).
n k=1
The true risk of f with respect to the true probability measure P ∗ is written as
P ∗ (f ) := (f (x))p∗ (x)dx.
This allows us to obtain a tighter bound compared with dealing with f and f ∗
independently. In the following we show an illustrative usage of the localization
technique.
First we suppose that there exists a (pseudo) distance d(f, f ∗ ) such that
where f− f ∗ 2,P is the L2 -distance with respect to the probability measure P ,
and we ignore the difference in constant factors. This assumption can be achieved,
for example, if there exists a constant δ > 0 such that
δ
(z) − (z0 ) ≥ (z − z0 ) (z0 ) + (z − z0 )2 .
2
For F := { (f ) | f ∈ F}, its complexity is usually measured by the covering
number or the bracketing number. Roughly speaking, the covering number and
the bracketing number of F are large/small if the model F is complex/simple.
To impose that the model F is not too complicated, we assume that there exists
0 < γ < 2 such that one of the following two conditions is satisfied:
% &
log N[] (@, F , L2 (P ∗ )) = O @ −γ ,
% &
sup log N (@, F , L2 (P )) = O @ −γ ,
P
where N[] (@, F , L2 (P )) denotes the bracketing number of F (see Definition 14.4)
and N(@, F , L2 (P )) denotes the covering number of F with respect to the norm
L2 (P ) (see Definition 14.3). In the last equation, the supremum is taken over all
discrete probability measures (see Lemma 14.5).
Under the previously mentioned conditions about the covering number or the
bracketing number, one can show that
1− γ2
∗ )( (f ) − (f ∗ )) ≤ Op √δ
sup (P ∗ − P . (14.3)
f ∈F :d(f ,f ∗ )≤δ n
Condition (14.1) is used to prove this bound. Roughly speaking, the right-hand
side comes from the so-called Dudley integral (see Lemmas 14.5 and 14.6):
δ; δA
1 1
√ log N[] (@, F , L2 (P ∗ ))d@ or √ sup log N (@, F , L2 (P ))d@.
n 0 n 0 P
This bound is called the fast learning rate (Koltchinskii, 2006; Bartlett et al., 2005).
Hoeffding’s inequality is simpler than Bernstein’s, and gives the tail bound of
'n
i=1 Yi of independent variables with range [0, 1] but unknown variance.
where “sup” is taken over all discrete probability measures. Then there exists a
constant CK,γ such that
2
∗ − P ∗ F ] ≤ CK,γ max{n−1/2 δ 1−γ /2 , n
E[P
− 2+γ
}.
By Lemma 3.4.2 of van der Vaart and Wellner (1996), we also have the
following lemma.
Lemma 14.6 Let F be a class of functions such that f ∞ ≤ 1 and supf ∈F Ef 2 ≤
δ 2 . Then there exists a constant C such that
∗ ∗ 2
E[P∗ − P ∗ F ] ≤ C J (F, δ,√L2 (P )) + J (F, δ, L2 (P )) ,
n δ2n
where δ;
J (F, δ, L2 (P ∗ )) := 1 + log N[] (F, @, L2 (P ∗ ))d@.
0
It should be noted that the inequality shown in Lemma 14.5 is a uniform ver-
sion of Hoeffding’s inequality, and that of Lemma 14.6 is a uniform version of
Bernstein’s inequality.
14.2.1 Preliminaries
For simplicity, we assume that the numbers of the numerator and denominator
samples are the same, that is, n = nnu = nde . We note that this assumption is just
for notational simplicity; in the absence of this assumption, the convergence rate
is determined solely by the sample size with the slower rate.
∗ ∗
For the two probability distributions pnu (x) and pde (x), we express the
expectation of a function r as
∗ ∗ ∗ ∗
Pnu r := r(x)pnu (x)dx and Pde r := r(x)pde (x)dx.
∗
In a similar fashion we define the empirical distributions of pnu ∗
and pde nu
by P ∗
∗
and Pde , that is,
n n
nu
∗ 1
nu de
∗ 1
de
P r := r(x i ) and P r := r(x j ).
n i=1 n j =1
RM
n := {r ∈ Rn | r∞ ≤ M} ⊂ RM .
Under the notations described above, the solution rn of (generalized) KLIEP is
given as
nu
rn := argmax P ∗
log (r) .
n
r∈R
For simplicity, we assume that the optimal solution can be determined uniquely.
∗
We define the (generalized) Hellinger distance with respect to pde as
,; ; -2 1/2
hPde∗ (r, r ) := ∗
r(x) − r (x) pde (x)dx ,
3. Let N(@, Q, · ) and N[] (@, Q, · ) be the @-covering number and the @-
bracketing number of Q with norm · , respectively (see Definitions 14.3
and 14.4). For some constants 0 < γ < 2 and K,
γ
M M
sup log N (@, R , L2 (P )) ≤ K , (14.6)
P @
where the supremum is taken over all finitely discrete probability measures
P , or γ
M
log N[] (@, RM , L2 (pde
∗
)) ≤ K . (14.7)
@
The lower bound of r ∗ that appears in the first assumption will be used to
ensure the existence of a Lipschitz continuous function that bounds the Hellinger
distance from the true density ratio. The bound of r ∗ is needed only on the support
244 14 Non-Parametric Convergence Analysis
∗ ∗
of pnu and pde . The third assumption controls the complexity of the model. By
this complexity assumption, we can bound the tail probability of the difference
between the empirical risk and the true risk uniformly over the function class RM .
Then we have the following convergence bound for KLIEP.
The technical advantage of using the Hellinger distance instead of the Kullback–
Leibler divergence is that the Hellinger distance is bounded from above by a
Lipschitz continuous function. On the other hand, the Kullback–Leibler divergence
is not Lipschitz continuous because log(x) diverges to −∞ as x → 0. This allows
us to utilize the uniform convergence results of the empirical processes.
Theorem 14.9 In addition to Assumption 14.7, if there exists rn∗ ∈ Rn such that,
∗ ∗ 2 ∗ ∗
for some constant c0 , r (x)/rn (x) ≤ c0 for all x on the support of pnu and pde ,
1
− 2+γ
then hPde∗ (r ∗ ,
rn ) = Op (n + hPde∗ (r ∗ , rn∗ )).
where P is a positive finite measure whose support is contained in [0, 1]. For a
measure P , we define rP (x) := K1 (x, x )dP (x ). According to Lemma 3.1 of
Ghosal and van der Vaart (2001), for every 0 < @n < 1/2 there exits a discrete
positive finite measure P on [0, 1] such that
Let us divide [0, 1] into bins with width @n . Then the number of sample points xinu
that fall in a bin is a binomial random variable. If exp(−η2 n@n /4)/@n → 0, then
14.2 Non-Parametric Convergence Analysis of KLIEP 245
converges to 1, where supp(P ) denotes the support of P . This holds because the
∗
density pnu (x) is bounded from below across the support.
It holds that
|1 − Pde
∗ ∗ de
r̃n | = |1 − P ∗ ∗
(r̃n − rP + rP − r ∗ + r ∗ )|
√
≤ O(@n ) + |1 − P de
∗ ∗
r | = Op (1/ n),
we have rn∗ − r̃n∗ ∞ = rn∗ ∞ |1− P ∗ r̃n∗ | = Op (1/√n). From the above discussion,
√ de √
we obtain rn∗ −r ∗ ∞ = Op (1/ n), which indicates that hPde∗ (rn∗ , r ∗ ) = Op (1/ n)
and that r ∗ /rn∗ ≤ c02 is satisfied with high probability.
For the bias term in Theorem 14.8, set @n = C log(n)/n for sufficiently large
C > 0 and replace r ∗ with cn r ∗ . Then we obtain γn = Op (log(n)/n).
As for the complexity of the model, a similar argument to Theorem 3.1 in
Ghosal and van der Vaart (2001) gives
M 2
log N (@, RM , · ∞ ) ≤ K log
@
for 0 < @ < M/2. This gives both conditions (14.6) and (14.7) of the third assump-
tion in Assumption 14.7 for arbitrary small γ > 0 (but the constant K depends
on γ ). Thus, the convergence rate is evaluated as hPde∗ (r ∗ ,
rn ) = Op (n−1/(2+γ ) ) for
arbitrary small γ > 0.
1 Here we refer to the Chernoff bound as follows: Let {X }n
i i=1 be independent random variables
taking values on 0 or 1. Then, for any δ > 0,
n
n n
Pr Xi < (1 − δ)
E[Xi ] < exp −δ 2 E[Xi ]/2 .
i=1 i=1 i=1
246 14 Non-Parametric Convergence Analysis
Then the following theorem and corollary establish the convergence rate of
rn .
Theorem 14.12 If there exist constants c0 and c1 such that rn∗ satisfies
r ∗ (x)
Pr c0 ≤ ∗ ≤ c1 , ∀x → 1 as n → ∞,
rn (x)
1
− 2+γ
rn , r ∗ ) = Op (n
then hPde∗ ( + hPde∗ (rn∗ , r ∗ )).
Corollary 14.13 If there exists N such that ∀n ≥ N , r ∗ ∈ Rn , then hPde∗ (
rn , r ∗ ) =
1
− 2+γ
Op (n ).
rn , rn∗ )
The main mathematical device used in Section 14.2.2 was to bound hPde∗ (
from above by the difference between the empirical mean and the expectation of
14.3 Convergence Analysis of KuLSIF 247
log(2rn∗ /(
rn + rn∗ )). On the other hand, in the proof of the above results, 2
rn /(
rn +
∗ ∗ ∗
rn ) was used instead of log(2rn /( rn + rn )), which enabled us to replace the lower
bound of r ∗ with the bounds of the ratio r ∗ /rn∗ . Then the convexity of Rn and the
bracketing number condition with respect to the Hellinger distance were utilized
to establish the above proof (see Section 7 of van de Geer, 2000). The complete
proof can be found in Suzuki et al. (2011).
14.3.1 Preliminaries
For simplicity, we assume that the numbers of numerator and denominator samples
are the same, that is, n = nnu = nde . We note that this assumption is just for
notational simplicity; in the absence of this assumption, the convergence rate is
determined solely by the sample size with a slower rate.
Given a probability distribution P and a random variable h(X), we denote
the expectation of h(X) under P by hdP or h(x)P (dx). Let · ∞ be
the infinity norm, and let · P be the L2 -norm under the probability P ,
that is, h2P = |h|2 dP . For a reproducing kernel Hilbert space (RKHS) R
(Schölkopf and Smola, 2002), the inner product and the norm on R are denoted
as ·, ·R and · R , respectively.
Let R be an RKHS endowed with the kernel K(x, x ), and let the estimated
density ratiorn be defined as the minimizer of the following minimization problem:
n
n
1 1 λ
g := argmin
r(x de 2
j ) − r(x nu 2
i ) + rR . (14.9)
r∈R 2n j =1 n i=1 2
(Rosenblatt, 1956), and the same fact may hold in non-parametric density-ratio
estimation.
By Mercer’s theorem (Mercer, 1909), the kernel K(x, x ) has the following
∗
spectrum decomposition with respect to pde :
∞
K(x, x ) = ek (x)µk ek (x ),
k=1
where {ek }∞ ∗ 2
k=1 is an orthogonal system in L2 (pde ); that is, Epde [ek ] = 1 and
∗
Assumption 14.15
1. supx∈Rd K(x, x) ≤ 1.
∗ ∗ ∗ ∗
2. The true density ratio pnu /pde is contained in R, that is, pnu /pde = r ∗ ∈ R.
3. There exists a constant 0 < γ < 1 such that the spectrum µk of the kernel
2
decays as µk ≤ ck − γ for some positive constant c.
for all f ∈ R. The third condition is important for controlling the complexity of
the model R. The main message of this condition is that the constant γ represents
the “complexity” of the model R. It is easy to see that the spectrum decays rapidly
if γ is small. Then the situation gets close to a finite-dimensional case, and thus
the model becomes simple. In fact, if the kernel is linear, that is, K(x, x ) = x x ,
then µk = 0 for all k > d (this corresponds to the situation where γ is arbitrarily
small). We note that there is a clear relation between the spectrum condition and
2
the covering number condition (Steinwart et al., 2009); that is, µk ∼ ck − γ if and
∗
only if N(@, BR , L2 (pde )) ∼ @ −γ , where BR is the unit ball in R. However, dealing
directly with the spectrum condition gives a tighter bound because the spectrum
condition allows us to avoid using the uniform bounds.
Then we have the following theorem.
, -2/(2+γ )
Theorem 14.16 Under Assumption 14.15, if we set λn = logn n , then we
have
1
∗ log n 2+γ
rn − r L2 (pde
∗ ) = Op ,
n
The above convergence rate agrees with the mini-max optimal rate up to the
log(n) factor. It is easy to see that, if γ is small (i.e., the model R is simple), the
convergence rate becomes faster.
250 14 Non-Parametric Convergence Analysis
Theorem 14.17 Assume that the constant function 1 is contained in R, that is,
1 ∈ R. Then, under Assumption 14.15, if we set λn = (log n/n)2/(2+γ ) , we have
, 2 1
-
− PE| = Op (log n/n) 2+γ + C (log n/n) 2+γ ,
|PE (14.10)
14.4 Remarks
In this chapter we investigated the non-parametric convergence rate of KLIEP and
uLSIF. In the case of a parametric estimation, the convergence rate was shown
to be n−1/2 (see Chapter 13). On the other hand, the convergence rate in a non-
parametric estimation was shown to be slightly slower than n−1/2 (depending on the
complexity of the function space used for estimation). Thus, parametric estimations
achieve faster convergence rates than non-parametric estimations. However, non-
parametric methods can handle infinite-dimensional models, which would be more
flexible and powerful than the parametric approach.
14.4 Remarks 251
The goal of a two-sample test is to, given two sets of samples, test whether the
probability distributions behind the samples are equivalent. In Section 10.2 we
described a practical two-sample method based on non-parametric density-ratio
estimation. In this chapter we study two-sample tests for parametric density-ratio
models.
After an introduction in Section 15.1, basic materials for parametric density-
ratio estimation and divergence estimation are summarized in Sections 15.2 and
15.3, respectively. Then we derive the optimal divergence estimator in the sense of
the asymptotic variance in Section 15.4 and give a two-sample test statistic based
on the optimal divergence estimator in Section 15.5. Finally, numerical examples
are shown in Section 15.6, and the chapter is concluded in Section 15.7.
15.1 Introduction
We study a two-sample homogeneity test under semi-parametric density-ratio mod-
els, where an estimator of density ratios is exploited to obtain a test statistic. For
∗ ∗
two probability densities pnu (x) and pde (x) over a probability space X , the density
∗
ratio r (x) is defined as the ratio of these densities, that is,
∗
pnu (x)
r ∗ (x) := ∗ .
pde (x)
Qin (1998) studied the inference problem of density ratios under retrospective
sampling plans and proved that in the sense of Godambe (1960), the estimating
function obtained from the prospective likelihood is optimal in a class of unbiased
estimating functions for semi-parametric density-ratio models. In a similar fash-
ion, a semi-parametric density-ratio estimators based on logistic regression were
studied in Cheng and Chu (2004).
Density-ratio estimation is closely related to the estimation of divergences. A
divergence is a discrepancy measure between pairs of multivariate probability
densities, and the Ali–Silvey–Csiszár (ASC) divergence (a.k.a. the f -divergence;
252
15.2 Estimation of Density Ratios 253
see Ali and Silvey, 1966; Csiszár, 1967) is a class of divergences based on the ratio
of two probability densities. For a strictly convex function f such that f (1) = 0,
∗ ∗
the ASC divergence from pnu (x) to pde (x) is defined as
∗
∗ ∗ ∗ pnu (x)
ASCf (pnu pde ) := pde (x)f ∗ dx. (15.1)
pde (x)
15.2.1 Formulation
Suppose that two sets of samples are independently generated from each
probability:
i.i.d. i.i.d.
x nu nu ∗ de de ∗
1 , . . . , x nnu ∼ pnu and x 1 , . . . , x nde ∼ pde .
Let us denote a model for the density ratio by r(x; θ ), where θ ∈ ⊂ Rd is the
parameter. We assume that the model is correctly specified; that is, there exists
θ ∗ ∈ such that the true density ratio is represented as
∗
pnu (x)
r ∗ (x) = ∗ = r(x; θ ∗ ).
pde (x)
The model for the density ratio r ∗ (x) is regarded as a semi-parametric model for
probability densities. That is, even if r(x; θ ∗ ) = pnu
∗ ∗
(x)/pde (x) is specified, there
∗ ∗
are still infinite degrees of freedom for the probability densities pnu and pde .
Qφ (θ) := r(x de
j ; θ )φ(x de
j ; θ ) − φ(x nu
i ; θ ).
nde j =1 nnu i=1
Because pnu∗
(x) = r(x; θ ∗ )pde
∗
(x), the expectation of Qφ (θ ) over observed samples
∗
vanishes at θ = θ . In addition, the estimation function Qφ (θ ) converges to its
expectation in the large sample limit. Thus, the estimator θ defined as a solution
of the estimating equation,
Qφ (
θ ) = 0,
has the statistical consistency under some mild assumption. See Qin (1998) for
details. In the following we give a sufficient condition for the consistency and the
asymptotic normality of θ.
The previously mentioned moment-matching framework includes various
density-ratio estimators (Keziou, 2003b; Keziou and Leoni-Aubin, 2005, 2008;
Nguyen et al., 2010; Sugiyama et al., 2008; Kanamori et al., 2009); that is, these
estimators with a finite-dimensional model r(x; θ) can all be represented as a
moment-matching estimator. However, these methods were originally intended
to be used with non-parametric kernel models. On the other hand, kernel density
estimators can also be exploited as another approach to density-ratio estimation
(Ćwik and Mielniczuk, 1989; Jacoba and Oliveirab, 1997; Bensaid and Fabre,
2007).
15.2 Estimation of Density Ratios 255
Let ρ and m be
nnu 1 nnu nde
ρ= and m = = ,
nde 1/nnu + 1/nde nnu + nde
and let U φ be the d-by-d matrix defined by
1 Similar assumptions for one-sample problems have been studied in Broniatowski and Keziou (2009).
256 15 Parametric Two-Sample Test
nde
1
p
sup φ(x j ; θ )r(x j ; θ ) − Ede [φ(x; θ )r(x; θ )]
de de
−→ 0.
θ∈ nde j =1
holds.
Note that item 15.1 in Assumption 15.1 and the triangle inequality lead to the
uniform convergence of Qφ :
p
sup Qφ (θ ) − E[Qφ (θ )] −→ 0.
θ∈
Similarly, the following assumptions are required for establishing the asymp-
totic normality of density-ratio estimation based on moment matching (see
Section 5 of van der Vaart, 1998, for details):
Qin (1998) showed that the prospective likelihood minimizes the asymptotic
variance in the class of moment-matching estimators. More precisely, for the
density-ratio model
r(x; θ) = exp{α + φ(x; β)}, θ = (α, β ) ∈ R × Rd−1 ,
the vector-valued function φ opt defined by
1
φ opt (x; θ ) = ∇ log r(x; θ ) (15.3)
1 + ρr(x; θ )
minimizes the asymptotic variance (15.2).
where the supremum is taken over all measurable functions and the supremum
is attained at w(x) = f (r ∗ (x)). Based on Eq. (15.5), one can consider the ASC
f by replacing the expectations with their empirical
divergence estimator ASC
averages:
5 nnu nde B
1
nu 1
∗ de
ASCf := sup f (r(x i ; θ )) − f (f (r(x j ; θ ))) . (15.6)
θ∈ nnu i=1 nde j =1
258 15 Parametric Two-Sample Test
∇(f (r(x nu
i ; θ ))) − r(x de de
i ; θ )∇(f (r(x j ; θ ))) = 0,
nnu i=1 nde j =1
Its empirical version provides the following estimate of the ASC divergence:
nde nnu
1
f = 1
ASC fde (r(x de
j ;
θ )) + fnu (r(x nu
i ; θ )), (15.7)
nde j =1 nnu i=1
15.4.1 Preliminaries
For the model r(x; θ ) and the function f (r), we assume the following conditions.
Assumption 15.3
1. The model r(x; θ) includes the constant function 1.
2. For any θ ∈ , 1 ∈ L[∇ log r(x; θ )] holds.
3. f is third-order differentiable and strictly convex, and it satisfies f (1) =
f (1) = 0.
Standard models of density ratios may satisfy items 1 and 2 of Assumption 15.3.
Furthermore, we assume the following conditions to justify the asymptotic
f defined in Eq. (15.7).
expansion of the estimator ASC
Assumption 15.4 (Asymptotic expansion of ASC f)
√
θ, m(
1. For the estimator θ − θ ∗ ) converges in distribution to a centered
multivariate normal distribution.
2. For the decomposition
suppose that
are finite. In the vicinity of θ ∗ , the second derivatives of fnu (r(x; θ )) and
∗
fde (r(x; θ)) with respect to θ are dominated by a pnu -integrable function
∗
and a pde -integrable function, respectively.
3. The expectation
exists.
Under Assumption 15.4, the delta method is available (see Section 3 of
van der Vaart, 1998, for details).
with φ(x; θ ) and the decomposition f (r) = fde (r) + rfnu (r), and the other is
the estimator ASCf defined by the density-ratio estimator with φ(x; θ) and the
decomposition f (r) = fde (r) + rfnu (r).
To compare the variances of these estimators, we consider the formula
f − ASCf ]
0 ≤ V[ASC
f ] − V[ASCf ] − 2 Cov[ ASC
= V[ASC f − ASCf , ASCf ].
f . Then we have
Suppose that the third term vanishes for any ASC
f]
V[ASCf ] ≤ V[ASC
f . This implies that the estimator ASCf is asymptotically optimal in
for any ASC
terms of the variance.
f − ASCf , ASCf ]. Let
In the following we compute the covariance Cov[ ASC
the vectors c(θ ), c(θ) ∈ Rd be
c(θ ) = Enu [{f (r(x; θ )) − fnu (r(x; θ ))}∇ log r(x; θ )],
c(θ ) = Enu [{f (r(x; θ )) − fnu (r(x; θ ))}∇ log r(x; θ )].
Then, under Assumptions 15.3 and 15.4, the following equality holds:
f − ASCf , ASCf ]
m(1 + ρ −1 ) · Cov[ ASC
.9 :
= Enu fnu (r ∗ ) − fnu (r ∗ ) + c U −1
φ
φ − c U −1
φ φ
/
× {f (r ∗ ) − (r ∗ + ρ −1 )(fnu (r ∗ ) + c U −1
φ
φ)} + o(1), (15.8)
where r ∗ denotes the true density ratio r ∗ (x) = r(x; θ ∗ ), and the functions in
Eq. (15.8) are evaluated at θ = θ ∗ .
f−
The next theorem shows a sufficient condition that the covariance Cov[ ASC
ASCf , ASCf ] vanishes.
Theorem 15.5 Suppose Assumptions 15.3 and 15.4 hold for the decomposition of
f , and suppose that φ(x; θ ), fnu (r(x; θ)), and fde (r(x; θ)) satisfy
for all θ ∈ . Then the estimator ASCf with φ(x; θ ) and the decomposition f (r) =
fde (r) + rfnu (r) satisfies
lim mV[ASCf ] ≤ lim mV[A SCf ].
m→∞ m→∞
This means that ASCf uniformly attains the minimum asymptotic variance in terms
of the ASC divergence estimation.
15.4 Optimal Estimator of ASC Divergence 261
0 1
= Enu {fnu (r(x; θ ∗ )) − fnu (r(x; θ ∗ ))}∇ log r(x; θ ∗ )
+ c U −1
φ
U φ − c U −1 φ Uφ
0 1
= Enu {fnu (r(x; θ ∗ )) − fnu (r(x; θ ∗ ))}∇ log r(x; θ ) + c − c
= 0.
Hence, when Eq. (15.9) holds, we have
f − ASCf , ASCf ] = o(1)
m(1 + ρ −1 ) · Cov[ ASC
f.
for any ASC
holds for all θ ∈ . Then the function φ = φ opt defined in Eq. (15.3) and the
decomposition f (r) = fde (r) + rfnu (r) satisfy the condition (15.9).
Proof We see that the condition (15.10) and the equality
(r(x; θ) + ρ −1 )φ opt (x; θ ) = ρ −1 ∇ log r(x; θ )
ensure the condition (15.9).
Based on Corollary 15.6, we see that the estimator defined from
f (r) ρf (r)
fde (r) = , fnu (r) = , and φ(x; θ ) = φ opt (x; θ) (15.11)
1 + ρr 1 + ρr
leads to an optimal estimator of the ASC divergence. In the optimal estimator, the
function f is decomposed according to the ratio of the logistic model, 1/(1 + ρr)
and ρr/(1 + ρr).
In the following we show another sufficient condition.
Corollary 15.7 Under Assumptions 15.3 and 15.4, suppose that, for the model
r(x; θ) and φ(x; θ ),
f (r(x; θ )) − (r(x; θ ) + ρ −1 )f (r(x; θ)) ∈ L[∇ log r(x; θ )]
262 15 Parametric Two-Sample Test
and
f (r(x; θ)) − fnu (r(x; θ)) ∈ L[φ(x; θ )]
hold for all θ ∈ . Then the decomposition f (r) = fde (r) + rfnu (r) and the
vector-valued function φ(x; θ ) satisfy Eq. (15.9).
Proof When
f (r(x; θ )) − fnu (r(x; θ )) ∈ L[φ(x; θ )],
there exists a vector b ∈ Rd such that
f (r(x; θ )) − fnu (r(x; θ)) = b φ(x; θ ).
Recall that
c(θ ) = Enu [{f (r(x; θ )) − fnu (r(x; θ))}∇ log r(x; θ )]
and
U φ (θ ) = Enu [φ∇ log r(x; θ ) ].
Then
c U −1
φ
= b
holds. Hence, we have
c U −1
φ
φ(x; θ ) = b φ(x; θ ) = f (r(x; θ )) − fnu (r(x; θ)),
and we can confirm that Eq. (15.9) holds under the assumption.
We consider the decomposition derived from the conjugate representation
f (r) = −f ∗ (f (r)) + rf (r).
That is,
fde (r) = −f ∗ (f (r)) and fnu (r) = f (r),
where f ∗ is the conjugate function of f . For the conjugate representation, the
second condition in Corollary 15.7 is always satisfied, because f (r) − fnu (r) = 0
holds. Then the decomposition based on the conjugate representation leads to an
optimal estimator when the model r(x; θ ) and the function f satisfy
f (r(x; θ )) − (r(x; θ) + ρ −1 )f (r(x; θ )). ∈ L[∇ log r(x; θ )]. (15.12)
Later we will show some more specific examples.
As shown previously, the conjugate representation leads to an optimal estima-
tor, if Eq. (15.12) holds. However, there exists a pair of the function f and the
model r(x; θ) that does not satisfy Eq. (15.12), as shown in Example 15.8 later. In
this case, the optimality of the estimator based on the conjugate representation is
not guaranteed. On the other hand, the decomposition (15.11) always leads to an
optimal estimator without specific conditions on f (r) and r(x; θ ), as long as the
asymptotic expansion is valid.
15.4 Optimal Estimator of ASC Divergence 263
with φ(x) = (φ1 (x), . . . , φd (x)) and φ1 (x) = 1. Then L[∇ log r(x; θ )] is spanned
by 1, φ2 (x), . . . , φd (x). The ASC divergence with
f (r) = − log r + r − 1
Then we can confirm that Eq. (15.10) is satisfied. Hence, the function φ = φ opt
and the above decomposition lead to an optimal estimator of the KL divergence.
We see that there is redundancy for the decomposition of f . Indeed, for any
constants c0 , c1 ∈ R, the function c0 + c1 log r(x; θ ) is included in L[∇ log r(x; θ )].
Hence the decomposition
r + c1 log r + c0
fnu (r) = and fde (r) = r − log r − 1 − rfnu (r)
r + ρ −1
with φ = φ opt also leads to an optimal estimator. The decomposition in Eq. (15.11)
is realized by setting c0 = −1 and c1 = −1.
Next we consider the conjugate expression of the KL divergence. For f (r) =
− log r + r − 1 and r(x; θ ) = exp{θ φ(x)}, we have
In general, the function exp{−θ φ(x)} is not represented by the linear combination
of φ1 (x), . . . , φd (x), and thus the condition in Corollary 15.7 does not hold. This
means that the conjugate expression of the KL divergence is not optimal in general.
Let us compare numerically the optimal estimator using Eq. (15.11) with the
estimator defined based on the conjugate representation of the KL divergence for
the model
r(x; θ ) = exp{α + βx}, θ = (α, β) ∈ R2 .
We estimate the KL divergence from N (0, 1) to N (µ, 1) for µ = 0.1, 0.5, 0.9,
1.3, and 1.7. The sample size is set to nnu = nde = 50, and the averaged values
f − ASCf )2 over 1000 runs are computed. Table 15.1
of the square error m(ASC
summarizes the numerical results, showing that the optimal estimator outperforms
the estimator using the conjugate representation.
We now give more examples.
264 15 Parametric Two-Sample Test
where y is the binary random variable taking nu or de; the joint probability of x
and y is defined as
∗ ρ ∗ 1
p(x, nu) = pnu (x) and p(x, de) = pde (x) .
1+ρ 1+ρ
∗ ∗
The equality pnu = pde implies that the conditional probability p(x|y) is inde-
∗ ∗
pendent of y. Thus, mutual information becomes zero if and only if pnu = pde
holds. For any moment-matching estimator, we can confirm that the following
decomposition satisfies the condition in Corollary 15.7:
1 1+ρ ρ r(1 + ρ)
fde (r) = log and fnu (r) = log . (15.14)
1+ρ 1 + ρr 1+ρ 1 + ρr
Note that this decomposition with the model r(x; θ) = exp{θ φ(x)} also sat-
isfies the condition in Corollary 15.6. As pointed out in Keziou and Leoni-Aubin
(2005, 2008), the decomposition above is derived from the conjugate expression
of Eq. (15.13). Thus, in this example, we are presenting that the above estimator
is also optimal in mutual information estimation.
15.5 Two-Sample Test Based on ASC Divergence Estimation 265
and thus L[∇ log r(x; θ )] includes the function of the form
c0 + c1 /r(x; θ ) for c0 , c1 ∈ R.
test, the null hypothesis may be rejected if ASC f > t, where the threshold t is
determined from the significance level.
f under the null
In this section we first derive an asymptotic distribution of ASC
hypothesis H0 in Eq. (15.15), that is,
∗
pnu (x)
∗ = r(x; θ ∗ ) = 1.
pde (x)
Then we give a test statistic based on the asymptotic distribution and analyze its
power (the acceptance rate of correct null-hypotheses, i.e., two distributions are
judged to be the same when they actually are the same).
with φ(x, β) ∈ R, Fokianos et al. (2001) pointed out that the asymptotic distribu-
tion of the empirical likelihood estimator α,
θ = ( β ) ∈ R × Rd−1 under the null
∗ ∗
hypothesis pnu = pde is given by
√ d
m(
β − β ∗ ) −→ Nd−1 ( 0, Vnu [∇β φ]−1 ),
15.5 Two-Sample Test Based on ASC Divergence Estimation 267
It is shown that, under the null hypothesis r ∗ (x) = 1, the statistic R con-
verges in distribution to the chi-square distribution with d − 1 degrees of
freedom (Keziou and Leoni-Aubin, 2008).
Note that the empirical likelihood-ratio test is closely related to mutual
information. Indeed, the mutual information estimator ASC f derived from
Eqs. (15.6) and (15.13) is related to R defined in Eq. (15.19) as follows
(Keziou and Leoni-Aubin, 2008):
f.
R = 2(nnu + nde )ASC
f attains the minimum asymptotic variance in
Example 15.9 guarantees that ASC
mutual information estimation.
where Pr(·) denotes the probability of an event and Y is the random variable
following the non-central chi-square distribution with d −1 degrees of freedom and
non-centrality parameter h M (θ ∗ )h. Moreover, the asymptotic power function of
the empirical likelihood-score test (15.18) is the same.
Theorem 15.13 implies that, under the local alternative (15.20), the power
f -based test does not depend on the choice of the ASC diver-
function of the ASC
gence, and that the empirical likelihood-score test has the same power as the
f -based test.
ASC
Next we consider the power function under model misspecification.
Theorem 15.14 (Kanamori et al., 2011a) We assume that the density ratio
∗ (m) ∗ ∗ (m)
pnu /pde is not realized by the model r(x; θ), and that pnu is represented as
∗ (m) ∗ sm (x) + εm
pnu (x) = pde (x) r(x; θ m ) + √ ,
m
where sm (x) satisfies Ede [sm (x)] = 0 and limm→∞ εm = ε. Suppose that all of
∗ (m)
the items in Assumption 15.12 hold except the definition of pnu (x). We further
assume some additional regularity conditions (see Kanamori et al., 2011a, for
details). Then, under the setup of the local alternatives, the power function of
the A SCf -based test is larger than or equal to that of the empirical likelihood-
score test.
15.6 Numerical Studies 269
Theorems 15.13 and 15.14 indicate that the ASC f -based test is more powerful
than the empirical likelihood-score test regardless of whether the model r(x; θ ) is
correct or slightly misspecified.
See Section 11.4.2 of Lehmann (1986) for a more detailed explanation on the
asymptotic theory under local alternatives.
15.6.1 Setup
We examine two ASC divergences for a two-sample homogeneity test. One is the
KL divergence defined by f (r) = r − 1 − log(r) as shown in Example 15.8, and
the test statistic is derived from the optimal choice (15.11). This is referred to as
the KL-based test. The other is mutual information defined by Eq. (15.13), and
the estimator ASC f is derived from the optimal decomposition (15.11) and the
moment-matching estimator φ = φ opt . This is referred to as the MI-based test.
In addition, the empirical likelihood-ratio test (15.19) is also attempted. As
shown in Keziou and Leoni-Aubin (2005, 2008), the statistic of the empirical
likelihood-ratio test is equivalent to the estimator of mutual information using the
conjugate representation (15.14) and the moment-matching estimator φ = φ opt .
Thus, the MI-based test and the empirical likelihood-ratio test share the same
moment-matching estimator, but the ways in which the function f is decomposed
are different.
We further compare these methods with the empirical likelihood-score test
(15.18) proposed by Fokianos et al. (2001), and the Hotelling T 2 -test. The null
∗ ∗ ∗ ∗
hypothesis of the test is H0 : pnu = pde , and the alternative is H1 : pnu # = pde . The
type I error (the rejection rate of the correct null hypotheses, i.e., two distributions
are judged to be different when they are actually the same) and the power function
(the acceptance rate of the correct null hypotheses, i.e., two distributions are judged
to be the same when they actually are the same) of these tests are numerically
computed.
∗ ∗
(b) The distributions of pnu and pde are given as the ten-dimensional normal
distribution N10 (0, I 10 ), where I 10 denotes the 10-dimensional identity matrix.
(c) Each element of the 10-dimensional vector x ∈ R10 is i.i.d. from the
t-distribution with 10 degrees of freedom.
with the (2k + 1)-dimensional parameter θ = (α, β1 , . . . , β2k ) . The sample size is
set to nnu = nde and is varied from 10 to 100 for one-dimensional random variables
and from 100 to 1000 for 10-dimensional random variables.
The significance level of the test is set to 0.05, and the type I error is averaged
over 1000 runs. For each of the above three cases, the averaged type I errors of
the KL-based test, the MI-based test, the empirical likelihood-ratio test, and the
empirical likelihood-score test are shown in Table 15.2.
For case (a), the type I error of the empirical likelihood-score test is larger than
the significance level even with a large sample size. On the other hand, the type I
errors of the KL-based test, the MI-based test, and the empirical likelihood-ratio
test are close to the significance level for a large sample size. For case (b), the type
I error of all methods converges to the significance level with a modest sample
size. For case (c), the type I error of the empirical likelihood-score test is larger
than the significance level, even with a large sample size. On the other hand, the
type I error of the other tests is close to the significance level with a moderate
sample size.
Table 15.2. Averaged type I errors over 1000 runs are shown as
functions of the number of samples. The significance level is set to
0.05. In the table, “KL,” “MI,” “Ratio,” and “Score” denote the
KL-based test, the MI-based test, the empirical likelihood-ratio test,
and the empirical likelihood-score test, respectively.
∗ de de
(B) pnu (x) is defined in the same way as (A), and the sample x de = (x(1) , . . . , x(10) )
∗
corresponding to pde is generated as
de
x( ) = σ × x( ) , = 1, . . . , 10, (15.23)
∗
where x = (x(1) , . . . , x(10) ) is drawn from pnu . That is, the scale parameter σ > 0
10 ∗ ∗
is multiplied to each element of x ∈ R . Hence, the null hypothesis pnu = pde
corresponds to σ = 1.
In both cases the sample size is set to nnu = nde = 500 and the density-ratio
∗ ∗
model (15.21) with k = 10 is used. When pnu and pde are the normal distributions,
the density-ratio model (15.21) includes the true density ratio. However, when they
are the t-distributions, the true ratio r ∗ (x) resides outside of the model (15.21). In
all simulations, the significance level is set to 0.05, and the power functions are
averaged over 1000 runs.
272 15 Parametric Two-Sample Test
Table 15.3 shows the numerical results for setup (A). The mean parameter µ
∗ ∗
in Eq. (15.22) is varied from −0.1 to 0.1. When both pnu and pde are the normal
distributions, the powers of the KL-based test, the MI-based test, the empirical
likelihood-ratio test, and the empirical likelihood-score test almost coincide with
each other. On the other hand, the power of the Hotelling T 2 -test is slightly larger
than the others. This is natural because the Hotelling T 2 -test was designed to work
well under a normal distribution. For the t-distribution with 5 degrees of freedom,
the power of the empirical likelihood-score test around µ = 0 is much larger than
the significance level, 0.05. This means that the empirical likelihood-score test
is not conservative and will lead to a high false-positive rate. The powers of the
MI-based test and the empirical likelihood-ratio test are close to the significance
level around µ = 0 and are comparable to the Hotelling T 2 -test.
Table 15.4 shows the averaged power functions for setup (B), where the scale
parameter σ in Eq. (15.23) is varied from 0.9 to 1.1. The results show that the
Hotelling T 2 -test completely fails because it relies on the difference of the means,
but the means are the same in setup (B). The results also show that the power
function of the empirical likelihood-score test takes the minimum value at σ less
than 1. Such a biased result is caused by the fact that the estimated variance, V nu ,
based on the empirical likelihood-score estimator, tends to take slightly smaller
values than the true variance. The powers of the MI-based test and the empirical
likelihood-ratio test are close to the significance level around σ = 1, whereas that
of the KL-based test is slightly larger than the significance level around σ = 1.
Overall, these numerical results show that when the model r(x; θ ) is speci-
fied correctly, the powers of the KL-based test, the MI-based test, the empirical
likelihood-ratio test, and the empirical likelihood-score test are highly comparable
to each other. This tendency agrees well with Theorem 15.13.
On the other hand, the empirical likelihood-score test has a large type I error
and its power is biased when the model is misspecified. The MI-based test and the
empirical likelihood-ratio test have comparable powers to the other methods, and
their type I error is well controlled.
15.7 Remarks
In this chapter we discussed a two-sample homogeneity test under the semi-
parametric density-ratio models. We first showed that the moment-matching
estimator introduced in Qin (1998) provides an optimal estimator of the ASC
divergence with appropriate decomposition of the function f . We then gave a test
statistic for a two-sample homogeneity test using the optimal ASC divergence esti-
f -based test does not depend
mator. We showed that the power function of the ASC
on the choice of the ASC divergence up to the first order under the local alternative
setup. Furthermore, the ASC f -based test and the empirical likelihood-score test
(Fokianos et al., 2001) were shown to have asymptotically the same power. For
misspecified density-ratio models, we showed that the ASC f -based test usually
has greater power than the empirical likelihood-score test.
In numerical studies, the MI-based test and the empirical likelihood-ratio
test gave the most reliable results. It is also notable that their powers were
comparable to that of the Hotelling T 2 -test even under the normal case. We
experimentally observed that the null distributions of the MI-based test and the
empirical likelihood-ratio test are approximated by the asymptotic distribution
more accurately than that of the KL-based test, although the first-order asymptotic
theory provided in Section 15.5 does not explain this empirical fact. Higher order
asymptotic theory may be needed to better understand this tendency.
Although we focused on estimators of the form (15.7) in this chapter, a variety
of estimators is available for divergence estimation. Along this line of research,
remaining future works are to study the optimal estimator among all estimators
of the ASC divergence and to specify how large the class of estimators (15.7) is
among all estimators.
16
Non-Parametric Numerical Stability Analysis
16.1 Preliminaries
In this section we describe the density-ratio estimators that will be analyzed in this
chapter.
First, let us briefly review the problem formulation and notation. Consider
∗ ∗
two probability densities pnu (x) and pde (x) on a probability space X . We assume
∗
pde (x) > 0 for all x ∈ X . Suppose that we are given two sets of i.i.d. samples,
i.i.d. i.i.d.
x nu nu ∗ de de ∗
1 , . . . , x nnu ∼ pnu and x 1 , . . . , x nde ∼ pde . (16.1)
∗ (x)
pnu
Our goal is to estimate the density ratio r ∗ (x) = ∗ (x)
pde
based on the observed
samples (16.1).
275
276 16 Non-Parametric Numerical Stability Analysis
1
1
min rj rj K(x de de
j , xj ) − rj K(x de nu
j , xi )
r1 ,...,rnde 2nde nnu i=1 j =1
j ,j =1
nde
1
s.t. rj − 1 ≤ @ and 0 ≤ r1 , . . . , rnde ≤ B. (16.2)
n
de j =1
∗ ∗
Let f : R → R be a convex function. Then the ASC divergence from pnu to pde
is defined as follows (Ali and Silvey, 1966; Csiszár, 1967):
∗
∗ ∗ pnu (x) ∗
ASCf (pnu pde ) := f p (x)dx. (16.3)
∗
pde (x) de
Let the conjugate dual function g of f be
See Section 12 of Rockafellar (1970) for details. Substituting Eq. (16.4) into
Eq. (16.3), we obtain the following expression of the ASC divergence:
∗ ∗ ∗ ∗
ASCf (pnu pde ) = − inf g(r(x))pde (x)dx − r(x)pnu (x)dx , (16.5)
r
where the infimum is taken over all measurable functions r : X → R. The infimum
is attained at the function r such that
∗
pnu (x)
∗ = g (r(x)), (16.6)
pde (x)
where g is the derivative of g. Approximating Eq. (16.5) with the empirical distri-
butions, we obtain the empirical loss function. This estimator is referred to as the
M-estimator of the density ratio. An M-estimator based on the Kullback–Leibler
divergence is derived from g(z) = −1 − log(−z).
When an RKHS R is employed as a statistical model, the M-estimator is
given by
nde nnu
1
1
λ
inf g(r(x de
j )) − r(x nu 2
i ) + rR , (16.7)
r∈R nde j =1 nnu i=1 2
where the regularization term λ2 r2R with the regularization parameter λ is intro-
duced to avoid overfitting. Using the solution r of the above optimization problem,
a density-ratio estimator is given by g (
r(x)) [see Eq. (16.6)].
The KuLSIF optimization problem is given as Eq. (16.7) with g(u) = u2 /2:
nde nnu
1
1
λ
min r(x de 2
j ) − r(x nu 2
i ) + rR . (16.8)
r∈R 2nde j =1 nnu i=1 2
r(x) = αi K(x, x de
j )+ βi K(x, x nu
i ),
j =1 i=1
inf 1 g(r(x de
j )) −
1
r(x nu
λ 2
i ) + rR , (16.11)
α1 ,...,αnde ∈R nde j =1 nnu i=1 2
'nde de 1 'nnu nu
where r(x) = j =1 αj K(x, x j ) + nnu λ i=1 K(x, x i ).
where the term independent of the parameter α is dropped. This is the optimization
criterion for KuLSIF.
For a positive definite matrix A, the solution of a linear equation Ax = b is given
as the minimizer of 12 x Ax − b x. Applying this fact to Eq. (16.9), we can obtain
the solution α for KuLSIF by solving the following optimization problem:
1 1 1
min α K de,de + λI nde α + 1 K α . (16.13)
α∈Rnde 2 nde nnu nde λ nnu de,nu
In the following, the estimator obtained by solving the optimization problem
(16.13) is referred to as reduced-KuLSIF (R-KuLSIF). Although KuLSIF and
R-KuLSIF share the same optimal solution, their loss functions are different
[cf. Eq.(16.12)]. As shown in the following, this difference yields significant
improvement of the numerical stability and computational efficiency.
Using the reproducing property of the kernel function K, we can express the above
equality in terms of φ(r ∗ ) as:
r ∗ (x)v(x)pde
∗
(x)dx − v(x)pnu ∗
(x)dx
= r ∗ (x)K(·, x), vR pde
∗ ∗
(x)dx − K(·, x), vR pnu (x)dx
∗ ∗ ∗
= K(·, x)r (x)pde (x)dx − K(·, x)pnu (x)dx, v
R
2 ∗
3
= φ(r ), v R = 0, (16.14)
where ·, ·R denotes the inner product in the RKHS R. Because Eq. (16.14) holds
for arbitrary v ∈ R, we have φ(r ∗ ) = 0.
The above expression implies that φ(r) is the Gâteaux derivative (see Section
4.2 of Zeidler, 1986) of LKuLSIF at r ∈ R; that is,
d
LKuLSIF (r + δ · v)δ=0 = φ(r), vR
dδ
holds for all v ∈ R. See Section 16.2.2 for the definition of the Gâteaux derivative.
Let DLKuLSIF (= φ) be the Gâteaux derivative of LKuLSIF over the RKHS R.
Then the equality LKMM (r) = 12 DLKuLSIF (r)2R holds. Note that a similar relation
also holds for the M-estimator based on the Kullback–Leibler divergence with
log-linear models (Tsuboi et al., 2009).
Now we illustrate the relation between KuLSIF and KMM by showing an
analogous optimization example in the Euclidean space. Let h : Rd → R be a
differentiable function, and consider the optimization problem minx∈Rd h(x). At
the optimal solution x ∗ , the extremal condition ∇h(x ∗ ) = 0 should hold, where
∇h is the gradient of h. Thus, instead of minimizing h, minimizing ∇h(x)2 also
provides the minimizer of h. This actually corresponds to the relation between
16.2 Relation between KuLSIF and KMM 281
In other words, to find the solution of the equation φ(r) = 0, KMM tries to minimize
the norm of φ(r). The “dual” expression of φ(r) = 0 is given as
∀
φ(r), vR = 0, v ∈ R. (16.15)
2. Let Lg-KMM : R → R be
1
Lg-KMM (r) = DLg (r)2R ,
2
∗
and suppose that r ∗ (x) = pnu
∗
(x)/pde (x) is represented by r ∗ = g (rg ) for
some rg ∈ R. Then, rg is the minimizer of both Lg and Lg-KMM .
The quadratic function g(z) = z2 /2 and a bounded kernel satisfy the assump-
tion of Theorem 16.4. Indeed, for any bounded kernel, the inequality |r(x)| ≤
282 16 Non-Parametric Numerical Stability Analysis
√
rR supx∈X K(x, x) < ∞ holds, and thus r 2 /2 and r are integrable with respect
∗
to the probability pde .
On the other hand, for the function g(z) = −1 − log(−z) [and g (z) = −1/z],
which leads to the M-estimator using the Kullback–Leibler divergence, the
Gâteaux derivative of Lg is given as
1 ∗ ∗
DLg (r) = − K(·, x) pde (x)dx − K(·, x)pnu (x)dx.
r(x)
In this case, the functional DLg (r) is not necessarily defined for all r ∈ R; for
example, DLg (r) does not exist for r = 0.
Nevertheless, the second statement in Theorem 16.4 is still valid for the
Kullback–Leibler divergence. Suppose that there exists an rg ∈ R such that
r ∗ = −1/rg . Then there exists a√positive constant c > 0 such that r ∗ ≥ c > 0,
because |rg (x)| ≤ rg R supx∈X K(x, x) < ∞ holds. The condition that r ∗ is
bounded below by a positive constant is assumed when the Kullback–Leibler
divergence is used in the estimator (Nguyen et al., 2010; Sugiyama et al., 2008).
On the other hand, for the quadratic function g(z) = z2 /2, such an assumption is
not required.
the relative error of the solution is given as follows (Section 2.2 of Demmel, 1997):
δx κ(A) δA δb
≤ + .
x 1 − κ(A)δA/A A b
Hence, smaller condition numbers are preferable in numerical computations.
is also likely to be large, and thus the numerical computation of S −1k ∇h(x k )
will not be reliable. This implies that the round-off error caused by nearly sin-
gular Hessian matrices significantly affects the accuracy of the quasi-Newton
methods. Consequently, S −1 k ∇h(x k ) is not guaranteed to be a proper descent
direction of the objective function h.
In optimization problems with large condition numbers, the numerical compu-
tation tends to be unreliable. To avoid numerical instability, the Hessian matrix is
often modified so that S k has a moderate condition number. For example, in the
optimization toolbox in MATLAB® , gradient descent methods are implemented
by the function fminunc. The default method in fminunc is the BFGS method
with an update through the Cholesky factorization of S k (not S −1 k ). Even if the
positive definiteness of S k is violated by the round-off error, the Cholesky fac-
torization immediately detects the negativity of the eigenvalues and the positive
definiteness of S k is recovered by adding a correction term. When the modified
Cholesky factorization is used, the condition number of S k is guaranteed to be
bounded above by some constant. See Moré and Sorensen (1984) for details.
Preliminaries
Let us consider the optimization problems of KuLSIF and KMM on an RKHS R
endowed with a kernel function K over a set X . Given the samples (16.1), the
optimization problems of KuLSIF and KMM are defined as
nde nnu
1
1
λ
(KuLSIF) min (r(x de 2
j )) − r(x nu 2
i ) + rR ,
r∈R 2nde j =1 nnu i=1 2
1
2
(KMM) min φ(r) + λr R ,
r∈R 2
'nde 1 'nnu
where φ(r) =1
nde
de de
j =1 K(·, x j )r(x i )− nnu
nu
i=1 K(·, x i ). Here, φ(r)+λr is the
Gâteaux derivative of the loss function for KuLSIF including the regularization
term. In the original KMM method, the density-ratio values at denominator samples
x de de
1 , . . . , x nde are optimized (Huang et al., 2007). Here we consider its inductive
∗
variant; that is, the entire density-ratio function r ∗ = pnu ∗
/pde on X is estimated
using the loss function of KMM.
According to Theorem 16.3, the optimal solution of equation (KuLSIF) is given
as a form of
nde nnu
1
de
r= αj K(·, x j ) + K(·, x nu
i ).
j =1
n nu λ i=1
Note that the optimal solution of equation (KMM) is also given by the same form.
Thus, the variables to be optimized in (KuLSIF) and (KMM) are α1 , . . . , αnde .
16.3 Condition Number Analysis 285
H KuLSIF is derived from Eq. (16.12), and H KMM is given by a direct computation
based on (KMM). Then we obtain
% 1 &
κ(H KuLSIF ) =κ(K de,de )κ K de,de + λI nde ,
nde
% 1 &2
κ(H KMM ) =κ(K de,de )κ K de,de + λI nde .
nde
Because the condition number is larger than or equal to one, the inequality
always holds. This implies that the convergence rate of KuLSIF will be faster than
that of KMM, when an iterative optimization algorithm is used.
In the R-KuLSIF (16.13), the Hessian matrix of the objective function is
given by
1
H R-KuLSIF = K de,de + λI nde ,
nde
and thus the condition number of H R-KuLSIF satisfies
Therefore, the R-KuLSIF has an advantage in the efficiency and the robustness of
numerical computation. This theoretical analysis will be illustrated numerically in
Section 16.5.
286 16 Non-Parametric Numerical Stability Analysis
Mini-Max Evaluation
We assume that a universal RKHS R endowed with a kernel function K on a
compact set X is used to estimate r ∗ . The M-estimator based on theASC divergence
is obtained by solving the problem (16.11). As shown in Eq. (16.17), the Hessian
matrix of the loss function at the optimal solution r is given by
1
K de,de Dg,r K de,de + λK de,de .
n
The condition number of the Hessian matrix is denoted by
1
κ0 (Dg,r ) := κ K de,de Dg,r K de,de + λK de,de .
n
16.4 Optimality of KuLSIF 287
In KuLSIF, the equality g = 1 holds, and thus the condition number is given by
κ0 (I nde ). In the following we analyze the relation between κ0 (I nde ) and κ0 (Dg,r ).
Theorem 16.5 (Mini-max evaluation) Suppose that R is a universal RKHS and
K de,de is non-singular, and let c be a positive constant. Then
holds, where the infimum is taken over all convex second-order continuously
differentiable functions such that g ((g )−1 (1)) = c.
∗
When r ∗ = 1 (i.e., pnu ∗
= pde ), the Hessian matrices are approximately the
same for any function g such that g ((g )−1 (1)) = c. More precisely, when λ = 0,
nnu = nde , and x nu de
i = x i (i = 1, . . . , nnu ) hold in Eq. (16.11), the estimator satisfies
de
g (r(x j )) = 1 and g (
r(x de
j )) = c for all j = 1, . . . , nde . Then the Hessian matrix
c 2
is equal to nnu K de,de . The constraint on g in Theorem 16.5 [i.e., g ((g )−1 (1)) = c]
works as a kind of calibration at the density ratio r ∗ = 1.
Under such a calibration, Theorem 16.5 shows that the quadratic function g(z) =
cz2 /2 is optimal in the mini-max sense. This feature is brought about by the fact that
the condition number of KuLSIF does not depend on the optimal solution. Because
both sides of Eq. (16.18) depend on the samples x de de
1 , . . . , x nde , KuLSIF achieves
the mini-max solution in terms of the condition number for each observation.
Probabilistic Evaluation
Next we study the probabilistic evaluation of the condition number. As shown in
Eq. (16.17), the Hessian matrix at the estimated function
g is given as
1
H g-div = K de,de Dg,r K de,de + λK de,de ,
n
r(x de
where the diagonal elements of Dg,r are given by g (
1 )), . . . , g (r(x de
nde )), and
r is the minimizer of Eq. (16.11). Let us define the random variable Tnde as
Tnde = max g (
r(x de
j )) ≥ 0. (16.19)
j =1,...,nde
Let Fnde be the distribution function of Tnde . More precisely, Tnde and Fnde depend
not only on nde but also on nnu through g . However, here we consider the case
where nnu is fixed to a finite number, or nnu may be a function of nde . Then, Tnde
and Fnde depend only on nde .
In the following we first compute the distribution of the condition number
κ(H g-div ). Then we investigate the relation between the function g and the distri-
bution of the condition number κ(H g-div ). We need to study the eigenvalues and
condition numbers of random matrices. For the Wishart distribution, the prob-
ability distribution of condition numbers has been investigated, for example, by
Edelman (1988) and Edelman and Sutton (2005). Recently, the condition numbers
of matrices perturbed by additive Gaussian noise have been investigated by the
288 16 Non-Parametric Numerical Stability Analysis
method called smoothed analysis (Sankar et al., 2006; Spielman and Teng, 2004;
Tao and Vu, 2007). However, the randomness involved in the matrix H g-div is
different from that in existing smoothed analysis studies.
Theorem 16.6 (Probabilistic evaluation) Let R be an RKHS endowed with a
kernel function K : X × X → R satisfying the following boundedness condition:
0 < inf
K(x, x ), sup K(x, x ) < ∞.
x,x ∈X x,x ∈X
Assume that the Gram matrix K de,de is almost surely positive definite in terms of
∗
the probability measure Pde . Suppose that there exist sequences snde and tnde such
that
where the probability Pr(·) is defined for the distribution of all samples
x nu nu de de
1 , . . . , x nnu , x 1 , . . . , x nde .
Remark. The Gaussian kernel on a compact set X meets the condition of The-
∗
orem 16.6 under a mild assumption on the probability pde . Suppose that X is
included in the ball {x ∈ R | x ≤ R}. Then, for K(x, x ) = exp{−γ x − x 2 }
d
2 ∗
with x, x ∈ X and γ > 0, we have e−4γ R ≤ K(x, x ) ≤ 1. If the distribution Pde of
de de
samples x 1 , . . . , x nde is absolutely continuous with respect to the Lebesgue mea-
sure, the Gram matrix of the Gaussian kernel is almost surely positive definite
because K de,de is positive definite if x de de
j # = x j for j # = j .
When g is the quadratic function g(z) = z2 /2, the distribution function Fnde
is given by Fnde (t) = 1[t ≥ 1], where 1[ · ] is the indicator function. Hence, the
sequence snde defined in Theorem 16.6 does not exist. Nevertheless, by choosing
tnde = 1, the upper bound of κ(H g-div ) with g(z) = z2 /2 is asymptotically given as
κ(K de,de )(1 + λ−1
nde ).
On the other hand, for the M-estimator with the Kullback–Leibler divergence
(Nguyen et al., 2010), the function g is defined by g(z) = −1 − log(−z), z <
0, and thus g (z) = 1/z2 holds. Then we have Tnde = maxj =1,...,nde ( r(x de −2
j )) .
de 2
However, we note that ( r(x j )) can take a very small value for the Kullback–
Leibler divergence, which yields that the order of Tnde is larger than a constant and
thus tnde diverges to infinity. This simple analysis indicates that KuLSIF is more
preferable than the M-estimator with the Kullback–Leibler divergence in the sense
of numerical stability and computational efficiency.
16.4 Optimality of KuLSIF 289
expect that r will converge to rg∗ when the sample size tends to infinity. Hence, the
approximation that g ( r(x de
j )), j = 1, . . . , nde are independent of each other will be
acceptable in the large sample limit. Under such an approximation, the distribution
function Fnde (t) is given by (F̄ (t))nde , where F̄ is the distribution function of each
r(x de
g ( j )). Based on this intuition, we have the following proposition.
Proposition 16.7 (Approximated bound) Suppose the kernel function $K$ and the
regularization parameter $\lambda = \lambda_{n_{\mathrm{de}}}$ satisfy the same condition as Theorem 16.6.
For the distribution function $F_{n_{\mathrm{de}}}$ of the random variable $T_{n_{\mathrm{de}}}$, we assume that
there exist distribution functions $\bar{F}_0$ and $\bar{F}_1$ such that the following conditions are
satisfied:

1. For large $n_{\mathrm{de}}$, the inequality $(\bar{F}_0(t))^{n_{\mathrm{de}}} \le F_{n_{\mathrm{de}}}(t) \le (\bar{F}_1(t))^{n_{\mathrm{de}}}$ holds for all $t > 0$.
2. Let $G_0 := 1 - \bar{F}_0$ and $G_1 := 1 - \bar{F}_1$. Then both $G_0(t)$ and $G_1(t)$ have
   inverse functions $G_0^{-1}(t)$ and $G_1^{-1}(t)$ for small $t > 0$. Furthermore,
   $\lim_{t \to +0} G_1^{-1}(t) = \infty$ holds.
Then, for any small $\eta > 0$ and any small $\nu > 0$, we have
$$\lim_{n_{\mathrm{de}} \to \infty} \Pr\Big( \big\{G_1^{-1}(n_{\mathrm{de}}^{-1+\eta})\big\}^{1-\nu} \le \kappa(H_{g\text{-div}}) \le \kappa(K_{\mathrm{de,de}})\big[1 + U\,\lambda_{n_{\mathrm{de}}}^{-1}\, G_0^{-1}(n_{\mathrm{de}}^{-1-\eta})\big] \Big) = 1;$$
that is, the inequality inside $\Pr(\cdot)$ holds with high probability for large $n_{\mathrm{de}}$.

In KuLSIF, the function $g$ is given as $g(z) = z^2/2$. Then the corresponding
distribution function of each diagonal element in $D_{g,g}$ is given by
$F_{\mathrm{KuLSIF}}(t) = 1[t \ge 1]$, and thus $G_{\mathrm{KuLSIF}}(t) = 1 - F_{\mathrm{KuLSIF}}(t) = 1[t < 1]$.
In all M-estimators except KuLSIF, the diagonal elements of $D_{g,g}$ can take various
positive values. We can regard the diagonal elements of $D_{g,g}$ as a typical realization
of random variables with the distribution function $F(t)$. When the distribution
function $F$ is close to $F_{\mathrm{KuLSIF}}$, the function $G = 1 - F$ is also close to $G_{\mathrm{KuLSIF}}$.
Then, as illustrated in Figure 16.1, the function $G^{-1}$ will take small values. As a
result, we can expect that the condition number of KuLSIF will be smaller than that
of the other M-estimators.
[Figure 16.1 plots $G(t)$ against $t$ for the functions $G_1$, $G_2$, and $G_{\mathrm{KuLSIF}}$.]
Figure 16.1. If the function $G_1(t)$ is closer to $G_{\mathrm{KuLSIF}}(t)$ ($= 0$) than $G_2(t)$ for large $t$, then
$G_1^{-1}(z)$ takes a smaller value than $G_2^{-1}(z)$ for small $z$.
Suppose that $F_{n_{\mathrm{de}}}(t)$ is approximated by $(\bar{F}_\gamma(t))^{n_{\mathrm{de}}}$. The distribution function
$F_{\mathrm{KuLSIF}}(t) = 1[t \ge 1]$ defined in Remark 16.4 is represented as $1[t \ge 1] =
\lim_{\gamma \to \infty} \bar{F}_\gamma(t)$ except at $t = 1$. Then, $G_\gamma(t) = 1 - \bar{F}_\gamma(t)$ is given by
$$G_\gamma(t) = \begin{cases} 1 & (0 \le t < 1), \\ t^{-\gamma} & (1 \le t). \end{cases}$$
Another choice of $\bar{F}_\gamma$ gives $G_\gamma(t) = 1 - \bar{F}_\gamma(t) = \dfrac{1}{1 + e^{\gamma(t-1)}}$ for $t \ge 0$. For small
$z$, the inverse function $G_\gamma^{-1}(z)$ is given as $G_\gamma^{-1}(z) = 1 + \dfrac{1}{\gamma}\log\dfrac{1-z}{z}$. Hence, for small
$\eta$ and small $\nu$, the inequality (16.21) is reduced to
$$\left\{\frac{1-\eta}{\gamma}\log\frac{n_{\mathrm{de}}}{2}\right\}^{1-\nu} \le \kappa(H_{g\text{-div}}) \le \kappa(K_{\mathrm{de,de}})\left[1 + \frac{U}{\lambda_{n_{\mathrm{nu}},n_{\mathrm{de}}}}\left(1 + \frac{1+\eta}{\gamma}\log n_{\mathrm{de}}\right)\right].$$
The upper and lower bounds in this inequality are monotone decreasing with
respect to $\gamma$.
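As a rough numerical check of this monotonicity, the following sketch evaluates the two bounds for the sigmoid-type choice of $\bar{F}_\gamma$ above. The values of $U$, $\lambda$, $\eta$, and $\nu$ are arbitrary illustrative settings, not taken from the book.

```python
import numpy as np

def g_inv_sigmoid(z, gamma):
    """Inverse of G_gamma(t) = 1 / (1 + exp(gamma * (t - 1))) for small z."""
    return 1.0 + np.log((1.0 - z) / z) / gamma

n_de = 1000
eta, nu = 0.05, 0.05
U, lam = 1.0, n_de ** -0.9   # illustrative kernel bound and regularization

for gamma in [1, 5, 20, 100]:
    lower = g_inv_sigmoid(n_de ** (-1 + eta), gamma) ** (1 - nu)
    # factor multiplying kappa(K_de,de) in the upper bound
    upper_factor = 1.0 + (U / lam) * g_inv_sigmoid(n_de ** (-1 - eta), gamma)
    print(f"gamma={gamma:4d}  lower bound ~ {lower:7.2f}   "
          f"upper factor ~ {upper_factor:10.1f}")
```

Both printed quantities shrink as $\gamma$ grows, i.e., as $\bar{F}_\gamma$ approaches $F_{\mathrm{KuLSIF}}$.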
Finally, we review briefly the idea of smoothed analysis (Spielman and Teng,
2004) and discuss its relation to the above analysis. Let us consider the expected
computation cost EP [c(X)], where c(X) is the cost of an algorithm for the input
X, and EP [ · ] denotes the expectation with respect to a probability P over the
input space. Let P be a set of probabilities on the input space. In a smoothed
analysis, the performance of an algorithm is measured by maxP ∈P EP [c(X)],
where the set of Gaussian distributions is a popular choice as P. On the other
hand, in our theoretical analysis, we considered the probabilistic order of condition
numbers as a measure of the computation cost. Thus, roughly speaking, the loss
function achieving $\min_g \max_{p_{\mathrm{nu}}^*, p_{\mathrm{de}}^*} \mathcal{O}_p(\kappa(H_{g\text{-div}}))$ will be the optimal choice in
our analysis, where the sample distributions $p_{\mathrm{nu}}^*$ and $p_{\mathrm{de}}^*$ vary in an appropriate set
of distributions. This means that our concern is not only to compute the worst-case
computation cost but also to find the optimal loss function or tuning parameters in
the algorithm.
Remark. We summarize the theoretical results on condition numbers. Let $H_{g\text{-div}}$
be the Hessian matrix (16.17) of the M-estimator. Then the following inequalities
hold:

Recall that $K_{\mathrm{de,de}}$ is the Hessian matrix of the original fixed-design KMM method
(which estimates the density-ratio values only at denominator samples), and $H_{\mathrm{KMM}}$
is its inductive variant that estimates the entire density-ratio function using the loss
function of KMM.
Based on a probabilistic evaluation, the inequality

also holds with high probability, although the probabilistic order of $T_{n_{\mathrm{de}}}$ in
Eq. (16.19) is left unexplored.
Overall, R-KuLSIF was shown to be advantageous in numerical computations.
median distance as the kernel width is a popular heuristic (Schölkopf and Smola, 2002).
The sample size is increased under $n_{\mathrm{nu}} = n_{\mathrm{de}}$ in the first set of experiments,
whereas $n_{\mathrm{nu}}$ is fixed to 50 and $n_{\mathrm{de}}$ is varied from 20 to 500 in the second set of
experiments. The regularization parameter $\lambda$ is set to $\lambda_{n_{\mathrm{nu}},n_{\mathrm{de}}} = \min(n_{\mathrm{nu}}, n_{\mathrm{de}})^{-0.9}$,
which meets the assumption in Theorem 14.14.
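In code, these two choices (the median-distance kernel width and the regularization schedule $\lambda = \min(n_{\mathrm{nu}}, n_{\mathrm{de}})^{-0.9}$) might be sketched as follows; the data and the specific kernel parametrization are illustrative assumptions, not the authors' R implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n_nu, n_de, d = 50, 200, 10
x_de = rng.normal(size=(n_de, d))   # denominator samples (synthetic placeholder)

# Median heuristic: kernel width sigma = median of pairwise distances
sigma = np.median(pdist(x_de))

# Regularization schedule used in the experiments of this chapter
lam = min(n_nu, n_de) ** -0.9

# Gaussian Gram matrix over the denominator samples
# (here with the exp(-||x - x'||^2 / (2 sigma^2)) convention, an assumption)
sq_dists = np.sum((x_de[:, None, :] - x_de[None, :, :]) ** 2, axis=2)
K_de_de = np.exp(-sq_dists / (2 * sigma ** 2))

print(f"sigma = {sigma:.3f}, lambda = {lam:.4f}")
```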
Table 16.1 shows the average condition numbers over 1000 runs. In each setup,
samples $x_1^{\mathrm{de}}, \ldots, x_{n_{\mathrm{de}}}^{\mathrm{de}}$ are randomly generated and the condition number is computed. The table shows that the condition number of R-KuLSIF is much smaller
than the condition numbers of the other methods in all cases. Thus the convergence of R-KuLSIF in optimization is expected to be faster than that of the other
methods, and R-KuLSIF is expected to be robust against numerical degeneracy. It is also noteworthy that $\kappa(H_{\text{R-KuLSIF}})$ is smaller than $\kappa(K_{\mathrm{de,de}})$; this is because the identity
matrix in $H_{\text{R-KuLSIF}}$ prevents the smallest eigenvalue from becoming extremely
small.
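The following sketch illustrates this kind of comparison. The Hessian forms used here, $H_{\text{KuLSIF}} = n_{\mathrm{de}}^{-1} K_{\mathrm{de,de}}^2 + \lambda K_{\mathrm{de,de}}$ and $H_{\text{R-KuLSIF}} = n_{\mathrm{de}}^{-1} K_{\mathrm{de,de}} + \lambda I$, are assumptions made for illustration (the precise expressions follow Eq. (16.17) and its variants in the chapter), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n_de, d = 200, 10
x_de = rng.normal(size=(n_de, d))

# Gaussian Gram matrix with a median-distance width (one common convention)
sq = np.sum((x_de[:, None, :] - x_de[None, :, :]) ** 2, axis=2)
sigma = np.sqrt(np.median(sq[sq > 0]))
K = np.exp(-sq / (2 * sigma ** 2))

lam = n_de ** -0.9
I = np.eye(n_de)

# Assumed illustrative Hessian forms (not the book's exact Eq. (16.17))
H_kulsif = K @ K / n_de + lam * K
H_rkulsif = K / n_de + lam * I

for name, M in [("K_de,de", K), ("H_KuLSIF", H_kulsif), ("H_R-KuLSIF", H_rkulsif)]:
    print(f"condition number of {name}: {np.linalg.cond(M):.3e}")
```

The identity term in the last matrix bounds its smallest eigenvalue from below by $\lambda$, which is why its condition number stays small.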
Next we investigate the number of iterations and computation times required
for obtaining the solutions of R-KuLSIF, KuLSIF, the inductive KMM (simply
referred to as KMM from here on), and the M-estimator with the Kullback–
Leibler divergence (KL). We also include in the comparison the computation
time required for solving the linear equation of R-KuLSIF, which is denoted as R-
KuLSIF(direct). The probability densities $p_{\mathrm{nu}}^*$ and $p_{\mathrm{de}}^*$ are essentially the same as in
the previous experiment, but the mean vector of $p_{\mathrm{nu}}^*$ is set to $0.5 \times \mathbf{1}_{10}$. The number
of samples from each probability distribution is set to $n_{\mathrm{nu}} = n_{\mathrm{de}} = 100, \ldots, 6000$,
and the regularization parameter is defined by $\lambda = \min(n_{\mathrm{nu}}, n_{\mathrm{de}})^{-0.9}$. The kernel parameter $\sigma$ is set to the median of $\|x_j^{\mathrm{de}} - x_{j'}^{\mathrm{de}}\|$. To solve the optimization
problems in the M-estimators and KMM, we use two optimization methods:
the BFGS quasi-Newton method implemented in the optim function in R
(R Development Core Team, 2009) and the steepest descent method. Furthermore,
for R-KuLSIF(direct), we use the solve function in R.
Tables 16.2 and 16.3 show the average number of iterations and the average
computation times for solving the optimization problems over 50 runs. For the
steepest descent method, the maximum number of iterations was limited to 4000,
and the KL method reached the limit. The numerical results indicate that the number
of iterations in optimization is highly correlated with the condition numbers of the
Hessian matrices in Table 16.1.
Although the practical computational time would depend on various issues such
as stopping rules, our theoretical results in Section 16.4 were shown to be in good
agreement with the empirical results obtained from artificial datasets. We also
observed that numerical optimization methods such as the quasi-Newton method
are competitive with numerical algorithms for solving linear equations using the
LU decomposition or the Cholesky decomposition, especially when the sample
size nde is large (note that the number of parameters is nde in the kernel-based
methods). This implies that our theoretical result will be useful in large sample
cases, which are common situations in practical applications.
[Table 16.1 reports the average condition numbers of $K_{\mathrm{de,de}}$, $H_{\text{R-KuLSIF}}$, $H_{\text{KuLSIF}}$, $H_{\text{KMM}}$, and $H_{\text{KL}}$ (for $\mu = 0.2$ and $\mu = 0.5$); the numerical entries are omitted here.]

Table 16.2. The average computation times and the average numbers of iterations in the BFGS method over 50 runs ($n_{\mathrm{de}} = n_{\mathrm{nu}} = 100$ and $n_{\mathrm{de}} = n_{\mathrm{nu}} = 300$); the numerical entries are omitted here.
use the IDA binary classification datasets (Rätsch et al., 2001), which consist of
positive/negative and training/test samples. We allocate all positive training sam-
ples for the model dataset and assign all positive test samples and 5% of negative
test samples to the evaluation dataset. Thus, we regard the positive samples as
inliers and the negative samples as outliers. The density ratio $r^*(x)$ is defined as
the probability density of the model dataset over that of the evaluation dataset.
Then the true density ratio is approximately equal to one in inlier regions and
takes small values around outliers.
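A minimal sketch of how such a model/evaluation split can be constructed from a labeled dataset is given below; the label convention (+1 for positive, -1 for negative) and the function name are hypothetical.

```python
import numpy as np

def make_inlier_outlier_split(X_tr, y_tr, X_te, y_te, neg_frac=0.05, seed=0):
    """Build model/evaluation datasets as described for the IDA experiments.

    Model dataset: all positive training samples.
    Evaluation dataset: all positive test samples plus a small fraction of
    negative test samples, which play the role of outliers.
    """
    rng = np.random.default_rng(seed)
    model = X_tr[y_tr == +1]
    pos_te = X_te[y_te == +1]
    neg_te = X_te[y_te == -1]
    n_out = max(1, int(neg_frac * len(neg_te)))
    outliers = neg_te[rng.choice(len(neg_te), size=n_out, replace=False)]
    evaluation = np.vstack([pos_te, outliers])
    return model, evaluation
```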
Table 16.4 shows the average computation times and the average numbers of
iterations over 20 runs for the image and splice datasets and over 50 runs for the
Table 16.4. The average computation times and the average numbers of
iterations for the IDA benchmark datasets. The BFGS quasi-Newton
method in the optim function of the R environment is used to obtain
numerical solutions. For each dataset, the numbers in the upper row
denote the computation time (sec.), and the numbers in the lower row
denote the numbers of iterations of the quasi-Newton update.
[Columns: Data, # samples, R-KuLSIF (direct), R-KuLSIF, KuLSIF, KMM, KL; the numerical entries are omitted here.]
other datasets. In the same way as the simulations in Section 16.5.1, we compare
R-KuLSIF, KuLSIF, the inductive variant of KMM (KMM), and the M-estimator
with the Kullback–Leibler divergence (KL). In addition, the computation time
for solving the linear equation of R-KuLSIF is also shown as “direct.” For the
optimization, we use the BFGS method implemented in the optim function
in R (R Development Core Team, 2009), and for R-KuLSIF(direct) we use the
solve function in R. The kernel parameter $\sigma$ is determined based on the median
of $\|x_j^{\mathrm{de}} - x_{j'}^{\mathrm{de}}\|$, which is computed by the function sigest in the kernlab library
(Karatzoglou et al., 2004). The number of samples is shown in the second column,
and the regularization parameter is defined by $\lambda = \min(n_{\mathrm{nu}}, n_{\mathrm{de}})^{-0.9}$.
The numerical results show that the number of iterations agrees well with the
theoretical analysis when the sample size is balanced, that is, nnu and nde are
comparable. On the other hand, for the titanic, waveform, banana, ringnorm, and
twonorm datasets, the number of iterations of each method is almost the same
except KMM. In these datasets, $n_{\mathrm{nu}}$ is much smaller than $n_{\mathrm{de}}$, and thus the second
term, $\lambda K_{\mathrm{de,de}}$, in the Hessian matrix (16.22) for the M-estimator will govern the
convergence property, because the order of $\lambda$ is larger than $O(n_{\mathrm{de}}^{-1})$. This tendency
can be explained by Eq. (16.20); that is, a large $\lambda$ provides a small upper bound
of $\kappa(H)$.
Next, we more systematically investigate the number of iterations when nnu
and nde are comparable. We use the titanic, waveform, banana, ringnorm, and
twonorm datasets. In the first set of experiments, the evaluation dataset consists
of all positive test samples, and the model dataset is defined by all negative test
samples. Therefore, the true density ratio may be far from the constant function
$r^*(x) = 1$. The upper half of Table 16.5 summarizes the results, showing that R-
KuLSIF keeps the computation costs low for all cases. This again agrees well with
our theoretical analysis. In the second set of experiments, both model samples
and evaluation samples are taken randomly from all test samples. Thus, the target
density ratio is the constant function $r^*(x) = 1$. The lower half of Table 16.5
summarizes the results, showing that the number of iterations for the KL method
is significantly smaller than that shown in the upper half of Table 16.5. This is
because the condition number of the Hessian matrix (16.22) is likely to be small
when the true density ratio $r^*$ is close to the constant function. Nevertheless,
R-KuLSIF is still a preferable approach even when the target density ratio is
constant. Furthermore, it is noteworthy that the computation time of R-KuLSIF
is comparable to a direct method such as the Cholesky decomposition when the
sample size is more than about 3000.
16.6 Remarks
In this chapter we investigated the numerical stability and computational efficiency of kernel-based density-ratio estimation via condition number analysis.
In Section 16.3 we showed that, although KuLSIF and KMM share the same
Table 16.5. The average computation times and the average numbers of
iterations for balanced samples, i.e., nnu and nde are comparable. We use
the titanic, waveform, banana, ringnorm, and twonorm datasets in the IDA
benchmark repository. In the upper table, the evaluation dataset consists of
all positive test samples and the model dataset is defined by all negative
test samples; i.e., the two datasets follow highly different distributions and
thus the true density-ratio function is far from constant. In the lower table,
the evaluation dataset and the model dataset are both randomly generated
from all test samples; i.e., the two datasets follow the same distribution and
thus the true density-ratio function is constant. The BFGS quasi-Newton
method in the optim function of the R environment is used to obtain
numerical solutions. For each dataset, the numbers in the upper row denote
the computation time (sec.), and the numbers in the lower row denote the
number of iterations of the quasi-Newton update.
[Columns (upper and lower sub-tables): Data, # samples, R-KuLSIF (direct), R-KuLSIF, KuLSIF, KMM, KL; the numerical entries are omitted here.]
Part V
Conclusions

17
Conclusions and Future Directions
For example, considering more elaborate quantities such as the relative density
ratio (Yamada et al., 2011b),
$$r_\alpha^*(x) := \frac{p_{\mathrm{nu}}^*(x)}{\alpha\, p_{\mathrm{nu}}^*(x) + (1 - \alpha)\, p_{\mathrm{de}}^*(x)} \quad \text{for } 0 \le \alpha \le 1,$$
Symbols and Abbreviations

$x$    Input variable
$\mathcal{X}$    Domain of input $x$
$d$    Dimensionality of input $x$
$\mathbb{E}$    Expectation
$p_{\mathrm{nu}}^*(x)$    Probability density function in the numerator of the ratio
$p_{\mathrm{de}}^*(x)$    Probability density function in the denominator of the ratio
$p_{\mathrm{nu}}(x)$    Model of $p_{\mathrm{nu}}^*(x)$
$p_{\mathrm{de}}(x)$    Model of $p_{\mathrm{de}}^*(x)$
$\widehat{p}_{\mathrm{nu}}(x)$    Estimator of $p_{\mathrm{nu}}^*(x)$
$\widehat{p}_{\mathrm{de}}(x)$    Estimator of $p_{\mathrm{de}}^*(x)$
$r^*(x)$    Density ratio $p_{\mathrm{nu}}^*(x)/p_{\mathrm{de}}^*(x)$
$\widehat{r}(x)$    Estimator of $r^*(x)$
$r(x)$    Model of $r^*(x)$
i.i.d.    Independent and identically distributed
$\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$    Set of $n_{\mathrm{nu}}$ i.i.d. samples following $p_{\mathrm{nu}}^*(x)$
$\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$    Set of $n_{\mathrm{de}}$ i.i.d. samples following $p_{\mathrm{de}}^*(x)$
$\phi(x)$, $\psi(x)$    Basis functions
$\Phi_{\mathrm{nu}}$, $\Psi_{\mathrm{nu}}$    Design matrices for numerator samples
$\Phi_{\mathrm{de}}$, $\Psi_{\mathrm{de}}$    Design matrices for denominator samples
$K(x, x')$    Kernel function
$\lambda$    Regularization parameter
$^\top$    Transpose
$\mathbf{0}_n$    $n$-dimensional vector with all zeros
$\mathbf{1}_n$    $n$-dimensional vector with all ones
$\mathbf{0}_{n \times n}$    $n \times n$ matrix with all zeros
$\mathbf{1}_{n \times n}$    $n \times n$ matrix with all ones
$I_n$    $n$-dimensional identity matrix
References
Agakov, F., and Barber, D. 2006. Kernelized Infomax Clustering. Pages 17–24 of: Weiss, Y.,
Schölkopf, B., and Platt, J. (eds), Advances in Neural Information Processing Systems 18.
Cambridge, MA: MIT Press.
Aggarwal, C. C., and Yu, P. S. (eds). 2008. Privacy-Preserving Data Mining: Models and
Algorithms. New York: Springer.
Akaike, H. 1970. Statistical Predictor Identification. Annals of the Institute of Statistical
Mathematics, 22, 203–217.
Akaike, H. 1974. A New Look at the Statistical Model Identification. IEEE Transactions on
Automatic Control, AC-19(6), 716–723.
Akaike, H. 1980. Likelihood and the Bayes Procedure. Pages 141–166 of: Bernardo, J. M.,
DeGroot, M. H., Lindley, D. V., and Smith, A. F. M. (eds), Bayesian Statistics. Valencia,
Spain: Valencia University Press.
Akiyama, T., Hachiya, H., and Sugiyama, M. 2010. Efficient Exploration through Active Learning
for Value Function Approximation in Reinforcement Learning. Neural Networks, 23(5), 639–
648.
Ali, S. M., and Silvey, S. D. 1966. A General Class of Coefficients of Divergence of One
Distribution from Another. Journal of the Royal Statistical Society, Series B, 28(1), 131–142.
Amari, S. 1967. Theory of Adaptive Pattern Classifiers. IEEE Transactions on Electronic
Computers, EC-16(3), 299–307.
Amari, S. 1998. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2),
251–276.
Amari, S. 2000. Estimating Functions of Independent Component Analysis for Temporally
Correlated Signals. Neural Computation, 12(9), 2083–2107.
Amari, S., and Nagaoka, H. 2000. Methods of Information Geometry. Providence, RI: Oxford
University Press.
Amari, S., Fujita, N., and Shinomoto, S. 1992. Four Types of Learning Curves. Neural
Computation, 4(4), 605–618.
Amari, S., Cichocki, A., and Yang, H. H. 1996. A New Learning Algorithm for Blind Signal
Separation. Pages 757–763 of: Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. (eds),
Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
Anderson, N., Hall, P., and Titterington, D. 1994. Two-Sample Test Statistics for Measuring
Discrepancies between Two Multivariate Probability Density Functions Using Kernel-based
Density Estimates. Journal of Multivariate Analysis, 50, 41–54.
Ando, R. K., and Zhang, T. 2005. A Framework for Learning Predictive Structures from Multiple
Tasks and Unlabeled Data. Journal of Machine Learning Research, 6, 1817–1853.
Antoniak, C. 1974. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric
Problems. The Annals of Statistics, 2(6), 1152–1174.
Blei, D. M., and Jordan, M. I. 2006. Variational Inference for Dirichlet Process Mixtures. Bayesian
Analysis, 1(1), 121–144.
Bolton, R. J., and Hand, D. J. 2002. Statistical Fraud Detection: A Review. Statistical Science,
17(3), 235–255.
Bonilla, E., Chai, K. M., and Williams, C. 2008. Multi-Task Gaussian Process Prediction. Pages
153–160 of: Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds), Advances in Neural
Information Processing Systems 20. Cambridge, MA: MIT Press.
Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola,
A. J. 2006. Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy.
Bioinformatics, 22(14), e49–e57.
Bousquet, O. 2002. A Bennett Concentration Inequality and its Application to Suprema of
Empirical Process. Note aux Compte Rendus de l’Académie des Sciences de Paris, 334,
495–500.
Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge, UK: Cambridge
University Press.
Bradley, A. P. 1997. The Use of the Area under the ROC Curve in the Evaluation of Machine
Learning Algorithms. Pattern Recognition, 30(7), 1145–1159.
Bregman, L. M. 1967. The Relaxation Method of Finding the Common Point of Convex Sets and
Its Application to the Solution of Problems in Convex Programming. USSR Computational
Mathematics and Mathematical Physics, 7, 200–217.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. 2000. LOF: Identifying Density-Based
Local Outliers. Pages 93–104 of: Chen, W., Naughton, J. F., and Bernstein, P. A. (eds),
Proceedings of the ACM SIGMOD International Conference on Management of Data.
Brodsky, B., and Darkhovsky, B. 1993. Nonparametric Methods in Change-Point Problems.
Dordrecht, the Netherlands: Kluwer Academic Publishers.
Broniatowski, M., and Keziou, A. 2009. Parametric Estimation and Tests through Divergences
and the Duality Technique. Journal of Multivariate Analysis, 100, 16–26.
Buhmann, J. M. 1995. Data Clustering and Learning. Pages 278–281 of: Arbib, M. A. (ed), The
Handbook of Brain Theory and Neural Networks. Cambridge, MA: MIT Press.
Bura, E., and Cook, R. D. 2001. Extending Sliced Inverse Regression. Journal of the American
Statistical Association, 96(455), 996–1003.
Caponnetto, A., and de Vito, E. 2007. Optimal Rates for Regularized Least-Squares Algorithm.
Foundations of Computational Mathematics, 7(3), 331–368.
Cardoso, J.-F. 1999. High-Order Contrasts for Independent Component Analysis. Neural
Computation, 11(1), 157–192.
Cardoso, J.-F., and Souloumiac, A. 1993. Blind Beamforming for Non-Gaussian Signals. Radar
and Signal Processing, IEE Proceedings-F, 140(6), 362–370.
Caruana, R., Pratt, L., and Thrun, S. 1997. Multitask Learning. Machine Learning, 28,
41–75.
Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge, UK:
Cambridge University Press.
Chan, J., Bailey, J., and Leckie, C. 2008. Discovering Correlated Spatio-Temporal Changes in
Evolving Graphs. Knowledge and Information Systems, 16(1), 53–96.
Chang, C. C., and Lin, C. J. 2001. LIBSVM: A Library for Support Vector
Machines. Tech. rept. Department of Computer Science, National Taiwan University.
http://www.csie.ntu.edu.tw/˜cjlin/libsvm/.
Chapelle, O., Schölkopf, B., and Zien, A. (eds). 2006. Semi-Supervised Learning. Cambridge,
MA: MIT Press.
Chawla, N. V., Japkowicz, N., and Kotcz, A. 2004. Editorial: Special Issue on Learning from
Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter, 6(1), 1–6.
Chen, S.-M., Hsu, Y.-S., and Liaw, J.-T. 2009. On Kernel Estimators of Density Ratio. Statistics,
43(5), 463–479.
Chen, S. S., Donoho, D. L., and Saunders, M. A. 1998. Atomic Decomposition by Basis Pursuit.
SIAM Journal on Scientific Computing, 20(1), 33–61.
Cheng, K. F., and Chu, C. K. 2004. Semiparametric Density Estimation under a Two-sample
Density Ratio Model. Bernoulli, 10(4), 583–604.
Chiaromonte, F., and Cook, R. D. 2002. Sufficient Dimension Reduction and Graphics in
Regression. Annals of the Institute of Statistical Mathematics, 54(4), 768–795.
Cichocki, A., and Amari, S. 2003. Adaptive Blind Signal and Image Processing: Learning
Algorithms and Applications. New York: Wiley.
Cohn, D. A., Ghahramani, Z., and Jordan, M. I. 1996. Active Learning with Statistical Models.
Journal of Artificial Intelligence Research, 4, 129–145.
Collobert, R., and Bengio., S. 2001. SVMTorch: Support Vector Machines for Large-Scale
Regression Problems. Journal of Machine Learning Research, 1, 143–160.
Comon, P. 1994. Independent Component Analysis, A New Concept? Signal Processing, 36(3),
287–314.
Cook, R. D. 1998a. Principal Hessian Directions Revisited. Journal of the American Statistical
Association, 93(441), 84–100.
Cook, R. D. 1998b. Regression Graphics: Ideas for Studying Regressions through Graphics.
New York: Wiley.
Cook, R. D. 2000. SAVE: A Method for Dimension Reduction and Graphics in Regression.
Communications in Statistics – Theory and Methods, 29(9), 2109–2121.
Cook, R. D., and Forzani, L. 2009. Likelihood-Based Sufficient Dimension Reduction. Journal
of the American Statistical Association, 104(485), 197–208.
Cook, R. D., and Ni, L. 2005. Sufficient Dimension Reduction via Inverse Regression. Journal
of the American Statistical Association, 100(470), 410–428.
Cortes, C., and Vapnik, V. 1995. Support-Vector Networks. Machine Learning, 20, 273–297.
Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. 2nd edn. Hoboken, NJ:
Wiley.
Cramér, H. 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
Craven, P., and Wahba, G. 1979. Smoothing Noisy Data with Spline Functions: Estimating the
Correct Degree of Smoothing by the Method of Generalized Cross-Validation. Numerische
Mathematik, 31, 377–403.
Csiszár, I. 1967. Information-Type Measures of Difference of Probability Distributions and
Indirect Observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
Ćwik, J., and Mielniczuk, J. 1989. Estimating Density Ratio with Application to Discriminant
Analysis. Communications in Statistics: Theory and Methods, 18(8), 3057–3069.
Darbellay, G. A., and Vajda, I. 1999. Estimation of the Information by an Adaptive Partitioning
of the Observation Space. IEEE Transactions on Information Theory, 45(4), 1315–1321.
Davis, J., Kulis, B., Jain, P., Sra, S., and Dhillon, I. 2007. Information-Theoretic Metric Learn-
ing. Pages 209–216 of: Ghahramani, Z. (ed), Proceedings of the 24th Annual International
Conference on Machine Learning (ICML2007).
Demmel, J. W. 1997. Applied Numerical Linear Algebra. Philadelphia, PA: Society for Industrial
and Applied Mathematics.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum Likelihood from Incomplete
Data via the EM Algorithm. Journal of the Royal Statistical Society, series B, 39(1), 1–38.
Dhillon, I. S., Guan, Y., and Kulis, B. 2004. Kernel K-Means, Spectral Clustering and Normalized
Cuts. Pages 551–556 of: Proceedings of the Tenth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. New York: ACM Press.
Donoho, D. L., and Grimes, C. E. 2003. Hessian Eigenmaps: Locally Linear Embedding
Techniques for High-Dimensional Data. Pages 5591–5596 of: Proceedings of the National
Academy of Arts and Sciences.
Duda, R. O., Hart, P. E., and Stork, D. G. 2001. Pattern Classification. 2nd edn. New York:
Wiley.
Duffy, N., and Collins, M. 2002. Convolution Kernels for Natural Language. Pages 625–632
of: Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds), Advances in Neural Information
Processing Systems 14. Cambridge, MA: MIT Press.
Durand, J., and Sabatier, R. 1997. Additive Splines for Partial Least Squares Regression. Journal
of the American Statistical Association, 92(440), 1546–1554.
Edelman, A. 1988. Eigenvalues and Condition Numbers of Random Matrices. SIAM Journal on
Matrix Analysis and Applications, 9(4), 543–560.
Edelman, A., and Sutton, B. D. 2005. Tails of Condition Number Distributions. SIAM Journal
on Matrix Analysis and Applications, 27(2), 547–560.
Edelman, A., Arias, T. A., and Smith, S. T. 1998. The Geometry of Algorithms with Orthogonality
Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), 303–353.
Efron, B. 1975. The Efficiency of Logistic Regression Compared to Normal Discriminant
Analysis. Journal of the American Statistical Association, 70(352), 892–898.
Efron, B., and Tibshirani, R. J. 1993. An Introduction to the Bootstrap. New York: Chapman &
Hall/CRC.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. 2004. Least Angle Regression. The Annals
of Statistics, 32(2), 407–499.
Elkan, C. 2011. Privacy-Preserving Data Mining via Importance Weighting. In C. Dimitrakakis,
A. Gkoulalas-Divanis, A. Mitrokotsa, V. S. Verykios, and Y. Saygin (Eds.): Privacy and
Security Issues in Data Mining and Machine Learning, 15–21, Berlin: Springer.
Evgeniou, T., and Pontil, M. 2004. Regularized Multi-Task Learning. Pages 109–117 of: Pro-
ceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD2004).
Faivishevsky, L., and Goldberger, J. 2009. ICA based on a Smooth Estimation of the Differential
Entropy. Pages 433–440 of: Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds),
Advances in Neural Information Processing Systems 21. Cambridge, MA: MIT Press.
Faivishevsky, L., and Goldberger, J. 2010 (Jun. 21–25). A Nonparametric Information Theoretic
Clustering Algorithm. Pages 351–358 of: Joachims, A. T., and Fürnkranz, J. (eds), Proceedings
of 27th International Conference on Machine Learning (ICML2010).
Fan, H., Zaïane, O. R., Foss, A., and Wu, J. 2009. Resolution-Based Outlier Factor: Detecting the
Top-n Most Outlying Data Points in Engineering Data. Knowledge and Information Systems,
19(1), 31–51.
Fan, J., Yao, Q., and Tong, H. 1996. Estimation of Conditional Densities and Sensitivity Measures
in Nonlinear Dynamical Systems. Biometrika, 83(1), 189–206.
Fan, R.-E., Chen, P.-H., and Lin, C.-J. 2005. Working Set Selection Using Second Order
Information for Training SVM. Journal of Machine Learning Research, 6, 1889–1918.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. 2008. LIBLINEAR: A Library
for Large Linear Classification. Journal of Machine Learning Research, 9, 1871–1874.
Fedorov, V. V. 1972. Theory of Optimal Experiments. New York: Academic Press.
Fernandez, E. A. 2005. The dprep Package. Tech. rept. University of Puerto Rico.
Feuerverger, A. 1993. A Consistent Test for Bivariate Dependence. International Statistical
Review, 61(3), 419–433.
Fisher, R. A. 1936. The Use of Multiple Measurements in Taxonomic Problems. Annals of
Eugenics, 7(2), 179–188.
Fishman, G. S. 1996. Monte Carlo: Concepts, Algorithms, and Applications. Berlin, Germany:
Springer-Verlag.
Fokianos, K., Kedem, B., Qin, J., and Short, D. A. 2001. A Semiparametric Approach to the
One-Way Layout. Technometrics, 43, 56–64.
Franc, V., and Sonnenburg, S. 2009. Optimized Cutting Plane Algorithm for Large-Scale Risk
Minimization. Journal of Machine Learning Research, 10, 2157–2192.
Fraser, A. M., and Swinney, H. L. 1986. Independent Coordinates for Strange Attractors from
Mutual Information. Physical Review A, 33(2), 1134–1140.
Friedman, J., and Rafsky, L. 1979. Multivariate Generalizations of the Wald-Wolfowitz and
Smirnov Two-Sample Tests. The Annals of Statistics, 7(4), 697–717.
Friedman, J. H. 1987. Exploratory Projection Pursuit. Journal of the American Statistical
Association, 82(397), 249–266.
Friedman, J. H., and Tukey, J. W. 1974. A Projection Pursuit Algorithm for Exploratory Data
Analysis. IEEE Transactions on Computers, C-23(9), 881–890.
Fujimaki, R., Yairi, T., and Machida, K. 2005. An Approach to Spacecraft Anomaly Detec-
tion Problem Using Kernel Feature Space. Pages 401–410 of: Proceedings of the 11th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2005).
Fujisawa, H., and Eguchi, S. 2008. Robust Parameter Estimation with a Small Bias against Heavy
Contamination. Journal of Multivariate Analysis, 99(9), 2053–2081.
Fukumizu, K. 2000. Statistical Active Learning in Multilayer Perceptrons. IEEE Transactions
on Neural Networks, 11(1), 17–26.
Fukumizu, K., Bach, F. R., and Jordan, M. I. 2004. Dimensionality Reduction for Supervised
Learning with Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research,
5(Jan), 73–99.
Fukumizu, K., Bach, F. R., and Jordan, M. I. 2009. Kernel Dimension Reduction in Regression.
The Annals of Statistics, 37(4), 1871–1905.
Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition. 2nd edn. Boston, MA:
Academic Press, Inc.
Fung, G. M., and Mangasarian, O. L. 2005. Multicategory Proximal Support Vector Machine
Classifiers. Machine Learning, 59(1–2), 77–97.
Gao, J., Cheng, H., and Tan, P.-N. 2006a. A Novel Framework for Incorporating Labeled Exam-
ples into Anomaly Detection. Pages 593–597 of: Proceedings of the 2006 SIAM International
Conference on Data Mining.
Gao, J., Cheng, H., and Tan, P.-N. 2006b. Semi-Supervised Outlier Detection. Pages 635–636
of: Proceedings of the 2006 ACM symposium on Applied Computing.
Gärtner, T. 2003. A Survey of Kernels for Structured Data. SIGKDD Explorations, 5(1), S268–
S275.
Gärtner, T., Flach, P., and Wrobel, S. 2003. On Graph Kernels: Hardness Results and Efficient
Alternatives. Pages 129–143 of: Schölkopf, B., and Warmuth, M. (eds), Proceedings of the
Sixteenth Annual Conference on Computational Learning Theory.
Ghosal, S., and van der Vaart, A. W. 2001. Entropies and Rates of Convergence for Maximum
Likelihood and Bayes Estimation for Mixtures of Normal Densities. Annals of Statistics, 29,
1233–1263.
Globerson, A., and Roweis, S. 2006. Metric Learning by Collapsing Classes. Pages 451–458
of: Weiss, Y., Schölkopf, B., and Platt, J. (eds), Advances in Neural Information Processing
Systems 18. Cambridge, MA: MIT Press.
Godambe, V. P. 1960. An Optimum Property of Regular Maximum Likelihood Estimation. Annals
of Mathematical Statistics, 31, 1208–1211.
Goldberger, J., Roweis, S., Hinton, G., and Salakhutdinov, R. 2005. Neighbourhood Components
Analysis. Pages 513–520 of: Saul, L. K., Weiss, Y., and Bottou, L. (eds), Advances in Neural
Information Processing Systems 17. Cambridge, MA: MIT Press.
Golub, G. H., and Loan, C. F. Van. 1996. Matrix Computations. Baltimore, MD: Johns Hopkins
University Press.
Gomes, R., Krause, A., and Perona, P. 2010. Discriminative Clustering by Regularized Informa-
tion Maximization. Pages 766–774 of: Lafferty, J., Williams, C. K. I., Zemel, R., Shawe-Taylor,
J., and Culotta, A. (eds), Advances in Neural Information Processing Systems 23. Cambridge,
MA: MIT Press.
Goutis, C., and Fearn, T. 1996. Partial Least Squares Regression on Smooth Factors. Journal of
the American Statistical Association, 91(434), 627–632.
Graham, D. B., and Allinson, N. M. 1998. Characterizing Virtual Eigensignatures for General
Purpose Face Recognition. Pages 446–456 of: Computer and Systems Sciences. NATO ASI
Series F, vol. 163. Berlin, Germany: Springer.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. 2005. Measuring Statistical Depen-
dence with Hilbert-Schmidt Norms. Pages 63–77 of: Jain, S., Simon, H. U., and Tomita, E.
(eds), Algorithmic Learning Theory. Lecture Notes in Artificial Intelligence. Berlin, Germany:
Springer-Verlag.
Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. 2007. A Kernel Method
for the Two-Sample-Problem. Pages 513–520 of: Schölkopf, B., Platt, J., and Hoffman, T.
(eds), Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press.
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. 2008. A Kernel
Statistical Test of Independence. Pages 585–592 of: Platt, J. C., Koller, D., Singer, Y., and
Roweis, S. (eds), Advances in Neural Information Processing Systems 20. Cambridge, MA:
MIT Press.
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. 2009.
Covariate Shift by Kernel Mean Matching. Chap. 8, pages 131–160 of: Quiñonero-Candela, J.,
Sugiyama, M., Schwaighofer, A., and Lawrence, N. (eds), Dataset Shift in Machine Learning.
Cambridge, MA: MIT Press.
Guralnik, V., and Srivastava, J. 1999. Event Detection from Time Series Data. Pages 33–42 of:
Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD1999).
Gustafsson, F. 2000. Adaptive Filtering and Change Detection. Chichester, UK: Wiley.
Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable Feature Selection. Journal of
Machine Learning Research, 3, 1157–1182.
Hachiya, H., Akiyama, T., Sugiyama, M., and Peters, J. 2009. Adaptive Importance Sampling
for Value Function Approximation in Off-policy Reinforcement Learning. Neural Networks,
22(10), 1399–1410.
Hachiya, H., Sugiyama, M., and Ueda, N. 2011a. Importance-Weighted Least-Squares Probabilis-
tic Classifier for Covariate Shift Adaptation with Application to Human Activity Recognition.
Neurocomputing. To appear.
Hachiya, H., Peters, J., and Sugiyama, M. 2011b. Reward Weighted Regression with Sample
Reuse. Neural Computation, 23(11), 2798–2832.
Hall, P., and Tajvidi, N. 2002. Permutation Tests for Equality of Distributions in Highdimensional
Settings. Biometrika, 89(2), 359–374.
Härdle, W., Müller, M., Sperlich, S., and Werwatz, A. 2004. Nonparametric and Semiparametric
Models. Berlin, Germany: Springer.
Hartigan, J. A. 1975. Clustering Algorithms. New York: Wiley.
Hastie, T., and Tibshirani, R. 1996a. Discriminant Adaptive Nearest Neighbor Classification.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607–615.
Hastie, T., and Tibshirani, R. 1996b. Discriminant Analysis by Gaussian mixtures. Journal of the
Royal Statistical Society, Series B, 58(1), 155–176.
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York: Springer.
Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. 2004. The Entire Regularization Path for the
Support Vector Machine. Journal of Machine Learning Research, 5, 1391–1415.
He, X., and Niyogi, P. 2004. Locality Preserving Projections. Pages 153–160 of: Thrun, S.,
Saul, L., and Schölkopf, B. (eds), Advances in Neural Information Processing Systems 16.
Cambridge, MA: MIT Press.
Heckman, J. J. 1979. Sample Selection Bias as a Specification Error. Econometrica, 47(1),
153–161.
Henkel, R. E. 1976. Tests of Significance. Beverly Hills, CA: Sage.
Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. 2011. Statistical Outlier
Detection Using Direct Density Ratio Estimation. Knowledge and Information Systems, 26(2),
309–336.
Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the Dimensionality of Data with Neural
Networks. Science, 313(5786), 504–507.
Hodge, V., and Austin, J. 2004. A Survey of Outlier Detection Methodologies. Artificial
Intelligence Review, 22(2), 85–126.
Hoerl, A. E., and Kennard, R. W. 1970. Ridge Regression: Biased Estimation for Nonorthogonal
Problems. Technometrics, 12(3), 55–67.
Horn, R., and Johnson, C. 1985. Matrix Analysis. Cambridge, UK: Cambridge University Press.
Hotelling, H. 1936. Relations between Two Sets of Variates. Biometrika, 28(3–4), 321–377.
Hotelling, H. 1951. A Generalized T Test and Measure of Multivariate Dispersion. Pages 23–41
of: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability.
Berkeley: University of California Press.
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. 2009. Nonlinear Causal Dis-
covery with Additive Noise Models. Pages 689–696 of: Koller, D., Schuurmans, D., Bengio,
Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems 21. Cambridge,
MA: MIT Press.
Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., and Schölkopf, B. 2007. Correcting Sample
Selection Bias by Unlabeled Data. Pages 601–608 of: Schölkopf, B., Platt, J., and Hoffman, T.
(eds), Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press.
Huber, P. J. 1985. Projection Pursuit. The Annals of Statistics, 13(2), 435–475.
Hulle, M. M. Van. 2005. Edgeworth Approximation of Multivariate Differential Entropy. Neural
Computation, 17(9), 1903–1910.
Hulle, M. M. Van. 2008. Sequential Fixed-Point ICA Based on Mutual Information Minimization.
Neural Computation, 20(5), 1344–1365.
Hyvärinen, A. 1999. Fast and Robust Fixed-Point Algorithms for Independent Component
Analysis. IEEE Transactions on Neural Networks, 10(3), 626.
Hyvärinen, A., Karhunen, J., and Oja, E. 2001. Independent Component Analysis. New York:
Wiley.
Ide, T., and Kashima, H. 2004. Eigenspace-Based Anomaly Detection in Computer Systems.
Pages 440–449 of: Proceedings of the 10th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD2004).
Ishiguro, M., Sakamoto, Y., and Kitagawa, G. 1997. Bootstrapping Log Likelihood and EIC, an
Extension of AIC. Annals of the Institute of Statistical Mathematics, 49, 411–434.
Jacoba, P., and Oliveirab, P. E. 1997. Kernel Estimators of General Radon-Nikodym Derivatives.
Statistics, 30, 25–46.
Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Englewood Cliffs, NJ:
Prentice Hall.
Jaynes, E. T. 1957. Information Theory and Statistical Mechanics. Physical Review, 106(4),
620–630.
Jebara, T. 2004. Kernelized Sorting, Permutation and Alignment for Minimum Volume PCA.
Pages 609–623 of: 17th Annual Conference on Learning Theory (COLT2004).
Jiang, X., and Zhu, X. 2009. vEye: Behavioral Footprinting for Self-Propagating Worm Detection
and Profiling. Knowledge and Information Systems, 18(2), 231–262.
Joachims, T. 1999. Making Large-Scale SVM Learning Practical. Pages 169–184 of: Schölkopf,
B., Burges, C. J. C., and Smola, A. J. (eds), Advances in Kernel Methods—Support Vector
Learning. Cambridge, MA: MIT Press.
Joachims, T. 2006. Training Linear SVMs in Linear Time. Pages 217–226 of: ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD2006).
Jolliffe, I. T. 1986. Principal Component Analysis. New York: Springer-Verlag.
Jones, M. C., Hjort, N. L., Harris, I. R., and Basu, A. 2001. A Comparison of Related Density-based
Minimum Divergence Estimators. Biometrika, 88, 865–873.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. 1999. An Introduction to Variational
Methods for Graphical Models. Machine Learning, 37(2), 183.
Jutten, C., and Herault, J. 1991. Blind Separation of Sources, Part I: An Adaptive algorithm Based
on Neuromimetic Architecture. Signal Processing, 24(1), 1–10.
Kanamori, T. 2007. Pool-Based Active Learning with Optimal Sampling Distribution and its
Information Geometrical Interpretation. Neurocomputing, 71(1–3), 353–362.
Kanamori, T., and Shimodaira, H. 2003. Active Learning Algorithm Using the Maximum
Weighted Log-Likelihood Estimator. Journal of Statistical Planning and Inference, 116(1),
149–162.
Kanamori, T., Hido, S., and Sugiyama, M. 2009. A Least-squares Approach to Direct Importance
Estimation. Journal of Machine Learning Research, 10(Jul.), 1391–1445.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2010. Theoretical Analysis of Density Ratio Estima-
tion. IEICE Transactions on Fundamentals of Electronics, Communications and Computer
Sciences, E93-A(4), 787–798.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2011a. f -Divergence Estimation and Two-Sample
Homogeneity Test under Semiparametric Density-Ratio Models. IEEE Transactions on
Information Theory. To appear.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2011b. Statistical Analysis of Kernel-Based Least-
Squares Density-Ratio Estimation. Machine Learning. To appear.
Kanamori, T., Suzuki, T., and Sugiyama, M. 2011c. Kernel-Based Least-Squares Density-Ratio
Estimation II. Condition Number Analysis. Machine Learning. submitted.
Kankainen, A. 1995. Consistent Testing of Total Independence Based on the Empirical
Characteristic Function. Ph.D. thesis, University of Jyväskylä, Jyväskylä, Finland.
Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. 2004. kernlab—An S4 Package for Kernel
Methods in R. Journal of Statistical Software, 11(9), 1–20.
Kashima, H., and Koyanagi, T. 2002. Kernels for Semi-Structured Data. Pages 291–298 of:
Proceedings of the Nineteenth International Conference on Machine Learning.
Kashima, H., Tsuda, K., and Inokuchi, A. 2003. Marginalized Kernels between Labeled Graphs.
Pages 321–328 of: Proceedings of the Twentieth International Conference on Machine
Learning.
Kato, T., Kashima, H., Sugiyama, M., and Asai, K. 2010. Conic Programming for Multi-Task
Learning. IEEE Transactions on Knowledge and Data Engineering, 22(7), 957–968.
Kawahara, Y., and Sugiyama, M. 2011. Sequential Change-Point Detection Based on Direct
Density-Ratio Estimation. Statistical Analysis and Data Mining. To appear.
Kawanabe, M., Sugiyama, M., Blanchard, G., and Müller, K.-R. 2007. A New Algorithm of
Non-Gaussian Component Analysis with Radial Kernel Functions. Annals of the Institute of
Statistical Mathematics, 59(1), 57–75.
Ke, Y., Sukthankar, R., and Hebert, M. 2007. Event Detection in Crowded Videos. Pages 1–8 of:
Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV2007).
Keziou, A. 2003a. Dual Representation of φ-Divergences and Applications. Comptes Rendus
Mathématique, 336(10), 857–862.
Keziou, A. 2003b. Utilisation Des Divergences Entre Mesures en Statistique Inferentielle. Ph.D.
thesis, UPMC University. in French.
Keziou, A., and Leoni-Aubin, S. 2005. Test of Homogeneity in Semiparametric Two-sample
Density Ratio Models. Comptes Rendus Mathématique, 340(12), 905–910.
Keziou, A., and Leoni-Aubin, S. 2008. On Empirical Likelihood for Semiparametric Two-Sample
Density Ratio Models. Journal of Statistical Planning and Inference, 138(4), 915–928.
Khan, S., Bandyopadhyay, S., Ganguly, A., and Saigal, S. 2007. Relative Performance of Mutual
Information Estimation Methods for Quantifying the Dependence among Short and Noisy
Data. Physical Review E, 76, 026209.
Kifer, D., Ben-David, S, and Gehrke, J. 2004. Detecting Change in Data Streams. Pages 180–191
of: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB2004).
Kimeldorf, G. S., and Wahba, G. 1971. Some Results on Tchebycheffian Spline Functions.
Journal of Mathematical Analysis and Applications, 33(1), 82–95.
Kimura, M., and Sugiyama, M. 2011. Dependence-Maximization Clustering with Least-
Squares Mutual Information. Journal of Advanced Computational Intelligence and Intelligent
Informatics, 15(7), 800–805.
Koh, K., Kim, S.-J., and Boyd, S. P. 2007. An Interior-point Method for Large-
Scale l1 -Regularized Logistic Regression. Journal of Machine Learning Research, 8,
1519–1555.
Kohonen, T. 1988. Learning Vector Quantization. Neural Networks, 1(Supplementary 1),
303.
Kohonen, T. 1995. Self-Organizing Maps. Berlin, Germany: Springer.
Koltchinskii, V. 2006. Local Rademacher Complexities and Oracle Inequalities in Risk
Minimization. The Annals of Statistics, 34, 2593–2656.
Kondor, R. I., and Lafferty, J. 2002. Diffusion Kernels on Graphs and Other Discrete Input Spaces.
Pages 315–322 of: Proceedings of the Nineteenth International Conference on Machine
Learning.
Konishi, S., and Kitagawa, G. 1996. Generalized Information Criteria in Model Selection.
Biometrika, 83(4), 875–890.
Korostelëv, A. P., and Tsybakov, A. B. 1993. Minimax Theory of Image Reconstruction. New
York: Springer.
Kraskov, A., Stögbauer, H., and Grassberger, P. 2004. Estimating Mutual Information. Physical
Review E, 69(6), 066138.
Kullback, S. 1959. Information Theory and Statistics. New York: Wiley.
Kullback, S., and Leibler, R. A. 1951. On Information and Sufficiency. Annals of Mathematical
Statistics, 22, 79–86.
Kurihara, N., Sugiyama, M., Ogawa, H., Kitagawa, K., and Suzuki, K. 2010. Iteratively-
Reweighted Local Model Fitting Method for Adaptive and Accurate Single-Shot Surface
Profiling. Applied Optics, 49(22), 4270–4277.
Lafferty, J., McCallum, A., and Pereira, F. 2001. Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data. Pages 282–289 of: Proceedings of the 18th
International Conference on Machine Learning.
Lagoudakis, M. G., and Parr, R. 2003. Least-Squares Policy Iteration. Journal of Machine
Learning Research, 4, 1107–1149.
Lapedriza, À., Masip, D., and Vitrià, J. 2007. A Hierarchical Approach for Multi-task Logistic
Regression. Pages 258–265 of: Mart, J., Bened, J. M., Mendonga, A. M., and Serrat, J. (eds),
Proceedings of the 3rd Iberian Conference on Pattern Recognition and Image Analysis, Part
II. Lecture Notes in Computer Science, vol. 4478. Berlin, Germany: Springer-Verlag.
Larsen, J., and Hansen, L. K. 1996. Linear Unlearning for Cross-Validation. Advances in
Computational Mathematics, 5, 269–280.
Latecki, L. J., Lazarevic, A., and Pokrajac, D. 2007. Outlier Detection with Kernel Density
Functions. Pages 61–75 of: Proceedings of the 5th International Conference on Machine
Learning and Data Mining in Pattern Recognition.
Lee, T.-W., Girolami, M., and Sejnowski, T. J. 1999. Independent Component Analysis Using
an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources. Neural
Computation, 11(2), 417–441.
Lehmann, E. L. 1986. Testing Statistical Hypotheses. 2nd edn. New York: Wiley.
Lehmann, E. L., and Casella, G. 1998. Theory of Point Estimation. 2nd edn. New York: Springer.
Li, K. 1991. Sliced Inverse Regression for Dimension Reduction. Journal of the American
Statistical Association, 86(414), 316–342.
Li, K. 1992. On Principal Hessian Directions for Data Visualization and Dimension Reduc-
tion: Another Application of Stein’s Lemma. Journal of the American Statistical Association,
87(420), 1025–1039.
Li, K. C., Lue, H. H., and Chen, C. H. 2000. Interactive Tree-structured Regression via Principal
Hessian Directions. Journal of the American Statistical Association, 95(450), 547–560.
Li, L., and Lu, W. 2008. Sufficient Dimension Reduction with Missing Predictors. Journal of the
American Statistical Association, 103(482), 822–831.
Li, Q. 1996. Nonparametric Testing of Closeness between Two Unknown Distribution Functions.
Econometric Reviews, 15(3), 261–274.
Li, Y., Liu, Y., and Zhu, J. 2007. Quantile Regression in Reproducing Kernel Hilbert Spaces.
Journal of the American Statistical Association, 102(477), 255–268.
Li, Y., Kambara, H., Koike, Y., and Sugiyama, M. 2010. Application of Covariate Shift Adaptation
Techniques in Brain Computer Interfaces. IEEE Transactions on Biomedical Engineering,
57(6), 1318–1324.
Lin, Y. 2002. Support Vector Machines and the Bayes Rule in Classification. Data Mining and
Knowledge Discovery, 6(3), 259–275.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., and Watkins, C. 2002. Text
Classification Using String Kernels. Journal of Machine Learning Research, 2, 419–444.
Luenberger, D., and Ye, Y. 2008. Linear and Nonlinear Programming. Reading, MA: Springer.
Luntz, A., and Brailovsky, V. 1969. On Estimation of Characters Obtained in Statistical Procedure
of Recognition. Technicheskaya Kibernetica, 3. in Russian.
MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge,
UK: Cambridge University Press.
MacQueen, J. B. 1967. Some Methods for Classification and Analysis of Multivariate Obser-
vations. Pages 281–297 of: Proceedings of the 5th Berkeley Symposium on Mathematical
Statistics and Probability, vol. 1. Berkeley: University of California Press.
Mallows, C. L. 1973. Some Comments on CP . Technometrics, 15(4), 661–675.
Manevitz, L. M., and Yousef, M. 2002. One-Class SVMs for Document Classification. Journal
of Machine Learning Research, 2, 139–154.
Meila, M., and Heckerman, D. 2001. An Experimental Comparison of Model-Based Clustering
Methods. Machine Learning, 42(1/2), 9.
Mendelson, S. 2002. Improving the Sample Complexity Using Global Data. IEEE Transactions
on Information Theory, 48(7), 1977–1991.
Mercer, J. 1909. Functions of Positive and Negative Type and Their Connection with the Theory
of Integral Equations. Philosophical Transactions of the Royal Society of London, A-209,
415–446.
Micchelli, C. A., and Pontil, M. 2005. Kernels for Multi-Task Learning. Pages 921–928 of: Saul,
L. K., Weiss, Y., and Bottou, L. (eds), Advances in Neural Information Processing Systems
17. Cambridge, MA: MIT Press.
Minka, T. P. 2007. A Comparison of Numerical Optimizers for Logistic Regression. Tech. rept.
Microsoft Research.
Moré, J. J., and Sorensen, D. C. 1984. Newton’s Method. In: Golub, G. H. (ed), Studies in
Numerical Analysis. Washington, DC: Mathematical Association of America.
Mori, S., Sugiyama, M., Ogawa, H., Kitagawa, K., and Irie, K. 2011. Automatic Parameter
Optimization of the Local Model Fitting Method for Single-shot Surface Profiling. Applied
Optics, 50(21), 3773–3780.
Müller, A. 1997. Integral Probability Metrics and Their Generating Classes of Functions.
Advances in Applied Probability, 29, 429–443.
Murad, U., and Pinkas, G. 1999. Unsupervised Profiling for Identifying Superimposed Fraud.
Pages 251–261 of: Proceedings of the 5th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD1999).
Murata, N., Yoshizawa, S., and Amari, S. 1994. Network Information Criterion — Determining
the Number of Hidden Units for an Artificial Neural Network Model. IEEE Transactions on
Neural Networks, 5(6), 865–872.
Ng, A. Y., Jordan, M. I., and Weiss, Y. 2002. On Spectral Clustering: Analysis and An Algorithm.
Pages 849–856 of: Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds), Advances in Neural
Information Processing Systems 14. Cambridge, MA: MIT Press.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. 2010. Estimating Divergence Functionals
and the Likelihood Ratio by Convex Risk Minimization. IEEE Transactions on Information
Theory, 56(11), 5847–5861.
Nishimori, Y., and Akaho, S. 2005. Learning Algorithms Utilizing Quasi-geodesic Flows on the
Stiefel Manifold. Neurocomputing, 67, 106–135.
Oja, E. 1982. A Simplified Neuron Model as a Principal Component Analyzer. Journal of
Mathematical Biology, 15(3), 267–273.
Oja, E. 1989. Neural Networks, Principal Components and Subspaces. International Journal of
Neural Systems, 1, 61–68.
Patriksson, M. 1999. Nonlinear Programming and Variational Inequality Problems. Dordrecht,
the Netherlands: Kluwer Academic.
Pearl, J. 2000. Causality: Models, Reasoning and Inference. New York: Cambridge University
Press.
Pearson, K. 1900. On the Criterion That a Given System of Deviations from the Proba-
ble in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably
Supposed to Have Arisen from Random Sampling. Philosophical Magazine Series 5, 50(302),
157–175.
Pérez-Cruz, F. 2008. Kullback-Leibler Divergence Estimation of Continuous Distributions. Pages
1666–1670 of: Proceedings of IEEE International Symposium on Information Theory.
Platt, J. 1999. Fast Training of Support Vector Machines Using Sequential Minimal Optimization.
Pages 169–184 of: Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds), Advances in Kernel
Methods—Support Vector Learning. Cambridge, MA: MIT Press.
Platt, J. 2000. Probabilities for SV Machines. In: Smola, A. J., Bartlett, P. L., Schölkopf, B., and
Schuurmans, D. (eds), Advances in Large Margin Classifiers. Cambridge, MA: MIT Press.
Plumbley, M. D. 2005. Geometrical Methods for Non-Negative ICA: Manifolds, Lie Groups and
Toral Subalgebras. Neurocomputing, 67(Aug.), 161–197.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1992. Numerical Recipes in
C. 2nd edn. Cambridge, UK: Cambridge University Press.
Pukelsheim, F. 1993. Optimal Design of Experiments. New York: Wiley.
Qin, J. 1998. Inferences for Case-control and Semiparametric Two-sample Density Ratio Models.
Biometrika, 85(3), 619–630.
Qing, W., Kulkarni, S. R., and Verdu, S. 2006. A Nearest-Neighbor Approach to Estimating
Divergence between Continuous Random Vectors. Pages 242–246 of: Proceedings of IEEE
International Symposium on Information Theory.
Quadrianto, N., Smola, A. J., Song, L., and Tuytelaars, T. 2010. Kernelized Sorting. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 32, 1809–1821.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. (eds). 2009. Dataset
Shift in Machine Learning. Cambridge, MA: MIT Press.
R Development Core Team. 2009. R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. http://www.r-project.org.
Rao, C. 1945. Information and the Accuracy Attainable in the Estimation of Statistical Parameters.
Bulletin of the Calcutta Mathematics Society, 37, 81–89.
Rasmussen, C. E., and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning.
Cambridge, MA: MIT Press.
Rätsch, G., Onoda, T., and Müller, K.-R. 2001. Soft Margins for AdaBoost. Machine Learning,
42(3), 287–320.
Reiss, P. T., and Ogden, R. T. 2007. Functional Principal Component Regression and Functional
Partial Least Squares. Journal of the American Statistical Association, 102(479), 984–996.
Rifkin, R., Yeo, G., and Poggio, T. 2003. Regularized Least-Squares Classification. Pages 131–
154 of: Suykens, J. A. K., Horvath, G., Basu, S., Micchelli, C., and Vandewalle, J. (eds),
Advances in Learning Theory: Methods, Models and Applications. NATO Science Series III:
Computer & Systems Sciences, vol. 190. Amsterdam, the Netherlands: IOS Press.
Rissanen, J. 1978. Modeling by Shortest Data Description. Automatica, 14(5), 465–471.
Rissanen, J. 1987. Stochastic Complexity. Journal of the Royal Statistical Society, Series B,
49(3), 223–239.
Rockafellar, R. T. 1970. Convex Analysis. Princeton, NJ: Princeton University Press.
Rosenblatt, M. 1956. Remarks on Some Nonparametric Estimates of a Density Function. Annals
of Mathematical Statistics, 27, 832–837.
Roweis, S., and Saul, L. 2000. Nonlinear Dimensionality Reduction by Locally Linear
Embedding. Science, 290(5500), 2323–2326.
Sankar, A., Spielman, D. A., and Teng, S.-H. 2006. Smoothed Analysis of the Condition Numbers
and Growth Factors of Matrices. SIAM Journal on Matrix Analysis and Applications, 28(2),
446–476.
Saul, L. K., and Roweis, S. T. 2003. Think Globally, Fit Locally: Unsupervised Learning of Low
Dimensional Manifolds. Journal of Machine Learning Research, 4(Jun), 119–155.
Schapire, R., Freund, Y., Bartlett, P., and Lee, W. Sun. 1998. Boosting the Margin: A New
Explanation for the Effectiveness of Voting Methods. Annals of Statistics, 26, 1651–1686.
Scheinberg, K. 2006. An Efficient Implementation of an Active Set Method for SVMs. Journal
of Machine Learning Research, 7, 2237–2257.
Sugiyama, M., Yamada, M., Kimura, M., and Hachiya, H. 2011d. On Information-Maximization
Clustering: Tuning Parameter Selection and Analytic Solution. In: Proceedings of 28th
International Conference on Machine Learning (ICML2011), 65–72.
Sutton, R. S., and Barto, G. A. 1998. Reinforcement Learning: An Introduction. Cambridge, MA:
MIT Press.
Suykens, J. A. K., Gestel, T. Van, Brabanter, J. De, Moor, B. De, and Vandewalle, J. 2002. Least
Squares Support Vector Machines. Singapore: World Scientific Pub. Co.
Suzuki, T., and Sugiyama, M. 2010. Sufficient Dimension Reduction via Squared-loss Mutual
Information Estimation. Pages 804–811 of: Teh, Y. W., and Tiggerington, M. (eds), Pro-
ceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
(AISTATS2010). JMLR Workshop and Conference Proceedings, vol. 9.
Suzuki, T., and Sugiyama, M. 2011. Least-Squares Independent Component Analysis. Neural
Computation, 23(1), 284–301.
Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. 2008. Approximating Mutual Information
by Maximum Likelihood Density Ratio Estimation. Pages 5–20 of: Saeys, Y., Liu, H., Inza,
I., Wehenkel, L., and de Peer, Y. Van (eds), Proceedings of ECML-PKDD2008 Workshop
on New Challenges for Feature Selection in Data Mining and Knowledge Discovery 2008
(FSDM2008). JMLR Workshop and Conference Proceedings, vol. 4.
Suzuki, T., Sugiyama, M., and Tanaka, T. 2009a. Mutual Information Approximation via Maxi-
mum Likelihood Estimation of Density Ratio. Pages 463–467 of: Proceedings of 2009 IEEE
International Symposium on Information Theory (ISIT2009).
Suzuki, T., Sugiyama, M., Kanamori, T., and Sese, J. 2009b. Mutual Information Estimation
Reveals Global Associations between Stimuli and Biological Processes. BMC Bioinformatics,
10(1), S52.
Suzuki, T., Sugiyama, M., and Tanaka, T. 2011. Mutual Information Approximation via
Maximum Likelihood Estimation of Density Ratio. in preparation.
Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. 2006. Nonparametric Quantile Estimation.
Journal of Machine Learning Research, 7, 1231–1264.
Takeuchi, I., Nomura, K., and Kanamori, T. 2009. Nonparametric Conditional Density Estimation
Using Piecewise-linear Solution Path of Kernel Quantile Regression. Neural Computation,
21(2), 533–559.
Takeuchi, K. 1976. Distribution of Information Statistics and Validity Criteria of Models.
Mathematical Science, 153, 12–18. in Japanese.
Takimoto, M., Matsugu, M., and Sugiyama, M. 2009. Visual Inspection of Precision Instruments
by Least-Squares Outlier Detection. Pages 22–26 of: Proceedings of The Fourth International
Workshop on Data-Mining and Statistical Science (DMSS2009).
Talagrand, M. 1996a. New Concentration Inequalities in Product Spaces. Inventiones Mathemat-
icae, 126, 505–563.
Talagrand, M. 1996b. A New Look at Independence. The Annals of Statistics, 24, 1–34.
Tang, Y., and Zhang, H. H. 2006. Multiclass Proximal Support Vector Machines. Journal of
Computational and Graphical Statistics, 15(2), 339–355.
Tao, T., and Vu, V. H. 2007. The Condition Number of a Randomly Perturbed Matrix. Pages 248–
255 of: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing.
New York: ACM.
Tax, D. M. J., and Duin, R. P. W. 2004. Support Vector Data Description. Machine Learning,
54(1), 45–66.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. 2000. A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science, 290(5500), 2319–2323.
Teo, C. H., Le, Q., Smola, A., and Vishwanathan, S. V. N. 2007. A Scalable Modular Convex
Solver for Regularized Risk Minimization. Pages 727–736 of: ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD2007).
Tibshirani, R. 1996. Regression Shrinkage and Subset Selection with the Lasso. Journal of the
Royal Statistical Society, Series B, 58(1), 267–288.
Index
Jacobian, 90
k-means clustering, 61
Karush–Kuhn–Tucker conditions, 70
natural gradient, 104, 179, 187
nearest neighbor, 164
nearest neighbor density estimation, 35
Newton's method, 49, 203, 283