
Bayesian Analysis (2020) 15, Number 1, pp. 29–56

The Bayesian Update: Variational Formulations and Gradient Flows

Nicolas Garcia Trillos∗ and Daniel Sanz-Alonso†
∗ Department of Statistics, University of Wisconsin Madison, garciatrillo@wisc.edu
† Department of Statistics, University of Chicago, sanzalonso@uchicago.edu

© 2020 International Society for Bayesian Analysis   https://doi.org/10.1214/18-BA1137

Abstract. The Bayesian update can be viewed as a variational problem by characterizing the posterior as the minimizer of a functional. The variational viewpoint
is far from new and is at the heart of popular methods for posterior approxima-
tion. However, some of its consequences seem largely unexplored. We focus on
the following one: defining the posterior as the minimizer of a functional gives a
natural path towards the posterior by moving in the direction of steepest descent
of the functional. This idea is made precise through the theory of gradient flows,
allowing us to bring new tools to the study of Bayesian models and algorithms. Since
the posterior may be characterized as the minimizer of different functionals, sev-
eral variational formulations may be considered. We study three of them and their
three associated gradient flows. We show that, in all cases, the rate of convergence
of the flows to the posterior can be bounded by the geodesic convexity of the func-
tional to be minimized. Each gradient flow naturally suggests a nonlinear diffusion
with the posterior as invariant distribution. These diffusions may be discretized
to build proposals for Markov chain Monte Carlo (MCMC) algorithms. By con-
struction, the diffusions are guaranteed to satisfy a certain optimality condition,
and rates of convergence are given by the convexity of the functionals. We use this
observation to propose a criterion for the choice of metric in Riemannian MCMC
methods.
Keywords: gradient flows, Wasserstein space, convexity, Riemannian MCMC.
MSC 2010 subject classifications: 62C10, 62F15, 49N99.

1 Introduction
In this paper we revisit the old idea of viewing the posterior as the minimizer of an energy
functional. The use of variational formulations of Bayes rule seems to have been largely
focused on one of its methodological benefits: restricting the minimization to a subclass
of measures is the backbone of variational Bayes methods for posterior approximation.
Our aim is to bring attention to two other theoretical and methodological benefits, and
to study in some detail one of these: namely, that each variational formulation suggests a
natural path, defined by a gradient flow, towards the posterior. We use this observation
to propose a criterion for the choice of metric in Riemannian MCMC methods.
Let us recall informally a variational formulation of Bayes rule. Given a prior p(u) on an unknown parameter u and a likelihood function L(y|u), the posterior p(u|y) ∝ L(y|u)p(u) can be characterized Zellner (1988) as the distribution q∗(u) that minimizes
\[
J_{\mathrm{KL}}\big(q(u)\big) = D_{\mathrm{KL}}\big(q(u)\,\|\,p(u)\big) - \int \log L(y|u)\, q(u)\, du. \tag{1}
\]
Indeed, minimizing JKL(q(u)) is equivalent to minimizing the Kullback-Leibler divergence DKL(q(u) ∥ p(u|y)), and so clearly the minimizer is q∗(u) = p(u|y). Other variational formulations may be considered by minimizing, for instance, divergences other than Kullback-Leibler. In this paper we consider three variational formulations of the Bayesian update. The first two characterize the posterior measure as the minimizer of functionals JKL or Jχ2 constructed by penalizing deviations from the prior measure in Kullback-Leibler or χ2 divergence—definitions of these functionals and divergences are given in equations (10), (13), (4), (5). The third one characterizes the posterior density as the minimizer of the Dirichlet energy Dμ—see (6).
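As a quick numerical sanity check of the variational characterization in (1), the following sketch (a minimal illustration in Python; the one-dimensional Gaussian prior and likelihood, the grid, and all names are hypothetical choices, not part of the paper) evaluates a discretized JKL over randomly perturbed distributions and confirms that none improves on the normalized posterior.

```python
import numpy as np

# Hypothetical one-dimensional example: Gaussian prior, Gaussian likelihood,
# everything discretized on a grid so that integrals become sums.
u = np.linspace(-6, 6, 2001)
du = u[1] - u[0]
prior = np.exp(-0.5 * u**2); prior /= prior.sum() * du            # p(u)
log_lik = -0.5 * (1.5 - u)**2 / 0.5**2                             # log L(y|u) with y = 1.5
posterior = prior * np.exp(log_lik); posterior /= posterior.sum() * du

def J_KL(q):
    """Discretized version of (1): KL(q || prior) - E_q[log L(y|u)]."""
    kl = np.sum(q * np.log(q / prior)) * du
    return kl - np.sum(q * log_lik) * du

rng = np.random.default_rng(0)
vals = []
for _ in range(100):
    q = posterior * np.exp(0.3 * rng.standard_normal(u.size))      # random positive perturbation
    q /= q.sum() * du
    vals.append(J_KL(q))
print(J_KL(posterior) <= min(vals))    # expected: True, the posterior minimizes J_KL
```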
Why is it useful to view the posterior p(u|y) as the minimizer of an energy? We list
below three advantages of this viewpoint, the third of which will be the focus of our
paper.

1. The variational formulation provides a natural way to approximate the posterior by restricting the minimization problem to distributions q(u) satisfying some
computationally desirable property. For instance, variational Bayes methods often
restrict the minimization to q(u) with product structure Attias (1999), Wainwright
and Jordan (2008), Fox and Roberts (2012). A similar idea is studied in Pinski
et al. (2015), where q(u) is restricted to a class of Gaussian distributions. An itera-
tive variational procedure that progressively improves the posterior approximation
by enriching the family of distributions was introduced in Guo et al. (2016).

2. If the prior pεn(u) or the likelihood Lεn(y|u) depend on a parameter εn, then the variational formulation allows one to show large n convergence of posteriors pεn(u|y)
by establishing the Γ-convergence of the associated energies. This method of proof
has been employed by the authors in Garcia Trillos and Sanz-Alonso (2018a),
Garcia Trillos et al. (2017b) to analyze the large-data consistency of graph-based
Bayesian semi-supervised learning.

3. Each variational formulation gives a natural path, defined by a gradient flow, towards the posterior. These flows can be thought of as time-parameterized curves
in the space of probability measures, converging in the large-time limit towards
the posterior.

In this paper we study three gradient flows associated with the variational formu-
lations defined by minimization of the functionals JKL , Jχ2 , and Dμ . For intuition, we
recall that a gradient flow in Euclidean space defines a curve whose tangent always
points in the direction of steepest descent of a given function –see equation (7). In the
same fashion, a gradient flow in a more general metric space can be thought of as a
curve on said space that always points in the direction of steepest descent of a given
functional Ambrosio et al. (2008). In Euclidean space the direction of steepest descent
is naturally defined as that in which a Euclidean infinitesimal increment leads to the largest decrease in the value of the function. In a general metric space the direction of steepest descent is the one in which an infinitesimal increment defined in terms of the distance leads to the largest decrease in the value of the functional. In this paper we
study:

(i) The gradient flows defined by JKL and Jχ2 in the space of probability measures
with finite second moments endowed with the Wasserstein distance—definitions
are given in (3). By construction, these flows give curves of probability measures
that evolve following the direction of steepest descent of JKL and Jχ2 in Wasserstein
distance, converging to the posterior measure in the large-time limit.
(ii) The gradient flow defined by the Dirichlet energy Dμ in the space of square in-
tegrable densities endowed with the L2 distance. By construction, this flow gives
a curve of densities in L2 that evolves following the direction of steepest descent
of Dμ in L2 distance, converging to the posterior density in the large-time limit.
Interestingly, the curve of measures associated with these densities coincides with the curve defined by the JKL flow on Wasserstein space Jordan et al.
(1998).

A question arises: what is the rate of convergence of these flows to the posterior? The
answer is, to a large extent, provided by the theory of optimal transport and gradient
flows Ambrosio et al. (2008), Villani (2003), Villani (2008), Santambrogio (2015). We
will review and provide a unified account of these results in the main body of the paper,
section 3. In the remainder of this introduction we discuss how rates of convergence
may be studied in terms of convexity of functionals, and how these rates may be used
as a guide for the choice of proposals for MCMC methods.
Rates of convergence of the flows hinge on the convexity level of each of the func-
tionals JKL , Jχ2 , and Dμ . Recalling the Euclidean case may be helpful: gradient descent
on a highly convex function will lead to fast convergence to the minimizer. What is,
however, a sensible notion of convexity for functionals defined over measures or densi-
ties? Our presentation highlights that the notion of geodesic (or displacement) convexity
McCann (1997) nicely unifies the theory: it guarantees the existence and uniqueness of
the three gradient flows and it also provides a bound on their rate of convergence to
the posterior. In the L2 setting one can show that positive geodesic convexity is equiv-
alent to the posterior satisfying a Poincaré inequality, and also to the existence of a
spectral gap—see subsection 3.2. On the other hand, the geodesic convexity of JKL and
Jχ2 in Wasserstein space is determined by the Ricci curvature of the manifold, as well
as by the likelihood function and prior density—see (10), (11), (13), (14). Typically the
three functionals JKL , Jχ2 , and Dμ will have different levels of geodesic convexity, and
establishing a sharp bound on each of them may not be equally tractable.
The theory of gradient flows and optimal transport gives, for each of the flows, an
associated Fokker-Planck partial differential equation (PDE) that governs the evolution
of densities Ohta and Takatsu (2011), Santambrogio (2015). Such PDEs are typically
costly to discretize if the parameter space is of moderate or high dimension, but they
may be used in small-dimensional problems as a way to define tempering schemes. Here
we do not explore this idea any further. Instead, we focus on the (nonlinear) diffusion
processes associated with the PDEs. These diffusions are Langevin-type stochastic dif-
ferential equations, whose evolving densities satisfy the Fokker-Planck equations. By
construction, the invariant distribution of each of these diffusions is the sought poste-
rior, and a bound on the rate of convergence of the diffusions to the posterior is given by
the geodesic convexity of the corresponding functional. The gradient flow perspective
automatically gives a sense in which the diffusions are optimal: the associated densities
move locally (in Wasserstein or L2 sense) in the direction of steepest descent of the
functional. From this it immediately follows, for instance, that the law of a standard
Langevin diffusion in Euclidean space evolves locally in Wasserstein space in the di-
rection that minimizes Kullback-Leibler, and that it also evolves locally in L2 in the
direction that minimizes the Dirichlet energy.
The MCMC methodology allows one to use a proposal based on a discretization of
a diffusion—combined with an accept-reject mechanism to remove the discretization
bias—to produce, in the large-time asymptotic, correlated posterior samples. Heuris-
tically, the rate of convergence of the un-discretized diffusion may guide the choice of
proposal. Proposals based on Langevin diffusions were first suggested in Besag (1994),
and the exponential ergodicity of the resulting algorithms was analyzed in Roberts and
Tweedie (1996). The paper Girolami and Calderhead (2011) considered changing the
metric on the parameter space in order to accelerate MCMC algorithms by taking into
account the geometric structure that the posterior defines in the parameter space. This
led to a new family of Riemannian MCMC algorithms. Our paper is concerned with the
study of un-discretized diffusions; the effect of the accept-reject mechanism on rates and
ergodicity of MCMC methods will be studied elsewhere. We suggest that a way to guide
the choice of metric of Riemannian MCMC methods is to choose the one that leads to a
faster rate of convergence of the diffusion under certain constraints. We emphasize that
despite working with un-discretized diffusions, our guidance for choice of proposals ac-
counts for the fact that discretization will eventually be needed. Our criterion weeds out
choices of metric that lead to diffusions that achieve a fast rate of convergence by merely speeding up the drift. This is crucial, since a larger drift typically leads to a larger
discretization error, and therefore to more rejections in the MCMC accept-reject mech-
anism in order to remove the bias in the discrete chain. This important constraint on
the size of the drift seems to have been overlooked in existing continuous-time analysis
of MCMC methods.
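To make the preceding discussion concrete, here is a minimal sketch (in Python, with hypothetical names; the standard Gaussian target and step size are illustrative assumptions) of a Metropolis-adjusted Langevin algorithm: the proposal is an Euler discretization of a Langevin diffusion, and the accept-reject step removes the discretization bias.

```python
import numpy as np

def mala(grad_U, U, x0, step, n_iter, rng):
    """Langevin proposal x' = x - step*grad U(x) + sqrt(2*step)*xi, Metropolis-corrected.
    U is the negative log-posterior (up to an additive constant)."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_iter):
        noise = rng.standard_normal(x.shape)
        prop = x - step * grad_U(x) + np.sqrt(2 * step) * noise
        fwd = -np.sum((prop - x + step * grad_U(x))**2) / (4 * step)   # log q(prop | x)
        bwd = -np.sum((x - prop + step * grad_U(prop))**2) / (4 * step) # log q(x | prop)
        log_alpha = U(x) - U(prop) + bwd - fwd
        if np.log(rng.uniform()) < log_alpha:
            x = prop
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: standard Gaussian in R^2, U(x) = |x|^2 / 2.
rng = np.random.default_rng(1)
chain = mala(lambda x: x, lambda x: 0.5 * np.sum(x**2),
             x0=np.zeros(2), step=0.5, n_iter=5000, rng=rng)
print(chain.mean(axis=0), chain.var(axis=0))   # roughly (0, 0) and (1, 1)
```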
In summary, the following points highlight the key elements and common structure
of the variational formulations of the Bayesian update and of the study of the associated
gradient flows:

• The posterior can be characterized as the minimizer of different functionals on probability measures or densities.

• One can then study the gradient flows of these functionals with respect to a metric
on the space of probability measures or densities; the resulting curve is a curve of
maximal slope and its endpoint is the posterior.
• The gradient flows are characterized by a Fokker-Planck PDE that governs the
evolution of the density of an associated diffusion process.

• By studying the convexity of the functionals (with respect to a given metric) one can obtain rates of convergence of the gradient flows towards the posterior.
In particular, the level of convexity determines the speed of convergence of the
densities of the associated diffusion process towards the posterior, and hence can
be used as a criterion to guide the choice of proposals for MCMC methods; here
we emphasize that care must be taken when comparing different diffusions if a
higher speed of convergence is at the cost of a more expensive discretization.

The ideas in this paper immediately extend beyond the Bayesian interpretation
stressed here to any application (e.g. the study of conditioned diffusions) where a mea-
sure of interest is defined in terms of a reference measure and a change of measure.
Also, we consider only Kullback-Leibler and χ2 prior penalizations to define the func-
tionals JKL and Jχ2 , but it would be possible to extend the analysis to the family of
m-divergences introduced in Ohta and Takatsu (2011). Kullback-Leibler and χ2 prior
penalization correspond to m → 1 and m = 2 within this family. In what follows we
point out some of the features of the different functionals and gradient flows that we
consider in this paper.

1.1 Comparison of Functionals and Flows


We now provide a comparison of the three choices of functionals that we consider.

1. The two gradient flows in Wasserstein space (arising from the functionals JKL and
Jχ2 ) are fundamentally connected with the variational formulation: these varia-
tional formulations can be used to define posterior-type measures via a penal-
ization of deviations from the prior and deviations from the data in situations
where establishing the existence of conditional distributions by disintegration of
measures is technically demanding. On the other hand, the variational formula-
tion for the Dirichlet energy is less natural and requires previous knowledge of the
posterior.

2. The precise level of geodesic convexity of the functionals JKL (and Jχ2 ) can be
computed from point evaluation of the Ricci tensor (of the parameter space) and
derivatives of the densities. In particular, knowledge of the underlying metric
suffices to compute these quantities. In contrast, establishing a sharp Poincaré
inequality—the level of geodesic convexity of the Dirichlet energy in L2 (M, μ)—
is in practice infeasible, as it effectively requires solving an infinite-dimensional optimization problem. It is for this reason—and because of the explicit dependence of the convexity in Wasserstein space on the geometry induced by the manifold
metric tensor—that our analysis of the choice of metric in Riemannian MCMC
methods is based on the JKL functional (see section 4, and in particular Theorem
10).
3. On the flip side of point 2, a Poincaré inequality for the posterior with a not
necessarily optimal constant can be established using only tail information. In
particular, even when the functional JKL is not geodesically convex in Wasserstein
space, one may still be able to obtain a Poincaré inequality (see subsection 5.2 for
an example).

4. In contrast to the diffusions arising from the JKL or Dirichlet flows, the stochastic
processes arising from the Jχ2 formulation are inhomogeneous, and hence simu-
lation seems more challenging unless further structure is assumed on the prior
measure and likelihood function. Also, the evolution of densities of the gradient
flow of Jχ2 in Wasserstein space is given by a porous medium PDE.

1.2 Outline
The rest of the paper is organized as follows. Section 2 contains some background ma-
terial on the Wasserstein space, geodesic convexity of functionals, and gradient flows
in metric spaces. The core of the paper is section 3, where we study the geodesic con-
vexity, PDEs, and diffusions associated with each of the three functionals JKL , Jχ2 ,
and Dμ . In section 4 we consider an application of the theory to the choice of metric
in Riemannian MCMC methods Girolami and Calderhead (2011), and in section 5 we
illustrate the main concepts and ideas through examples arising in Bayesian formula-
tions of semi-supervised learning Garcia Trillos and Sanz-Alonso (2018a), Garcia Trillos
et al. (2017b), Bertozzi et al. (2017). We close in section 6 by summarizing our main
contributions and pointing to open directions.

1.3 Set-up and Notation


(M, g) will denote a smooth connected m-dimensional Riemannian manifold with metric
tensor g representing the parameter space. We will denote by d the associated Rieman-
nian distance, and assume that (M, d) is a complete metric space. By the Hopf-Rinow
theorem it follows that M is a geodesic space—we refer to subsection 2.1 for a dis-
cussion on geodesic spaces and their relevance here. We denote by volg the associated
volume form. To emphasize the dependence of differential operators on the metric with
which M is endowed, we write ∇g , divg , Hessg and Δg for the gradient, divergence, Hes-
sian, and Laplace Beltrami operators on (M, g). The reader not versed in Riemannian
geometry may focus on the case M = Rm with the usual metric tensor, in which case d
is the Euclidean distance and dvolg = dx is the Lebesgue measure. However, in section
4 where we discuss applications to Riemannian MCMC, we endow Rm with a general
metric tensor g and hence familiarity with some notions from differential geometry is
desirable.
We denote by P(M) the space of probability measures on M (endowed with the
Borel σ-algebra). We will be concerned with the update of a prior probability measure
π ∈ P(M)—that represents various degrees of belief on the value of a quantity or pa-
rameter of interest—into a posterior probability measure μ ∈ P(M), based on observed
data y. We will assume that the prior is defined as a change of measure from volg, and that the posterior is defined as a change of measure from π as follows:
\[
\pi = e^{-\Psi}\,\mathrm{vol}_g, \qquad \mu \propto e^{-\phi}\,\pi. \tag{2}
\]
The data is incorporated in the Bayesian update through the negative log-likelihood
function φ(·) = φ(·; y).

2 Preliminaries
In this section we provide some background material. The Wasserstein space, and the
notion of λ-geodesic convexity of functionals are reviewed in subsection 2.1. Gradient
flows in metric spaces are reviewed in subsection 2.2.

2.1 Geodesic Spaces and Geodesic Convexity of Functionals


A geodesic space (X, dX ) is a metric space with a notion of length of curves that is
compatible with the metric, and where every two points in the space can be connected
by a curve whose length achieves the distance between the points (see Burago et al.
(2001) for more details). Geodesic spaces constitute a large family of metric spaces with
a rich theory of gradient flows. Here we consider three geodesic spaces. First, the base
space (M, d), i.e. the manifold M equipped with its Riemannian distance. Second, the
space P2 (M) of square integrable Borel probability measures defined on M, endowed
with the Wasserstein distance W2. Third, the space of functions f ∈ L2(M, μ) with ∫_M f dμ = 1, equipped with the L2(M, μ) norm.
We spell out the definitions of P2 (M) and W2 :
  
\[
P_2(M) := \Big\{ \nu \in P(M) : \int_M d^2(x, x_0)\, d\nu(x) < \infty \ \text{for some } x_0 \in M \Big\},
\]
\[
W_2^2(\nu_1, \nu_2) := \inf_{\alpha} \int_{M \times M} d(x, y)^2\, d\alpha(x, y), \qquad \nu_1, \nu_2 \in P_2(M). \tag{3}
\]

The infimum in the previous display is taken over all transportation plans between ν1 and
ν2 , i.e. over α ∈ P(M×M) with marginals ν1 and ν2 on the first and second factors. The
space (P2 (M), W2 ) is indeed a geodesic space: geodesics in (P2 (M), W2 ) are induced
by those in (M, d). All it takes to construct a geodesic connecting ν0 ∈ P2 (M) and
ν1 ∈ P2 (M) is to find an optimal transport plan between ν0 and ν1 to determine source
locations and target locations, and then transport the mass along geodesics in M (see
Villani (2003) and Santambrogio (2015)).

The space of functions f ∈ L2(M, μ) with ∫_M f dμ = 1, equipped with the L2(M, μ)
norm is also a geodesic space, where a constant speed geodesic connecting f0 and f1 is
given by linear interpolation: t ∈ [0, 1] → (1 − t)f0 + tf1 .
We will consider several functionals E : X → R ∪ {∞} throughout the paper. They
will all be defined in one of our three geodesic spaces—that is, X = M, X = P2 (M) or
X = L2 (M, μ). Important examples will be, respectively:

1. Functions Ψ : M → R ∪ {∞}.
2. The Kullback-Leibler and χ2 divergences DKL(· ∥ π), Dχ2(· ∥ π) : P(M) → [0, ∞], where π is a given (prior) measure and, for ν1, ν2 ∈ P(M),
\[
D_{\mathrm{KL}}(\nu_1 \,\|\, \nu_2) :=
\begin{cases}
\displaystyle\int_M \frac{d\nu_1}{d\nu_2}(u)\, \log\frac{d\nu_1}{d\nu_2}(u)\, d\nu_2(u), & \nu_1 \ll \nu_2, \\
\infty, & \text{otherwise},
\end{cases} \tag{4}
\]
\[
D_{\chi^2}(\nu_1 \,\|\, \nu_2) :=
\begin{cases}
\displaystyle\int_M \Big( \frac{d\nu_1}{d\nu_2}(u) - 1 \Big)^2 d\nu_2(u), & \nu_1 \ll \nu_2, \\
\infty, & \text{otherwise};
\end{cases} \tag{5}
\]
and the potential-type functional J : P(M) → R ∪ {∞} given by
\[
J(\nu) := \int_M h(u)\, d\nu(u),
\]
where h is a given potential function.


3. The Dirichlet energy Dμ : L2(M, μ) → [0, ∞] defined by
\[
D^{\mu}(f) :=
\begin{cases}
\displaystyle\int_M \|\nabla_g f(u)\|^2\, d\mu(u), & f \in L^2(M, \mu) \cap H^1(M), \\
+\infty, & \text{otherwise}.
\end{cases} \tag{6}
\]
Recall that here and throughout, ∇g denotes the gradient in (M, g) and ∥·∥ is the norm on each tangent space TxM.

A crucial unifying concept will be that of λ-geodesic convexity of functionals. We recall it here:
Definition 1. Let (X, dX ) be a geodesic space and let λ ∈ R. A functional E : X →
R ∪ {∞} is called λ-geodesically convex provided that for any x0 , x1 ∈ X there exists a
constant speed geodesic t ∈ [0, 1] → γ(t) ∈ X such that γ(0) = x0 , γ(1) = x1 , and
\[
E\big(\gamma(t)\big) \ \le\ (1-t)\,E(x_0) + t\,E(x_1) - \lambda\, \frac{t(1-t)}{2}\, d_X^2(x_0, x_1), \qquad \forall t \in [0, 1].
\]

The following remark characterizes the λ-convexity of functionals when X = M.


Remark 2. Let Ψ ∈ C 2 (M) so that we can define its Hessian at all points in M (see the
proof of Theorem 10 in the Supplementary Material (Garcia Trillos and Sanz-Alonso,
2018b) for the definition). Then the following conditions are equivalent:

(i) Ψ is λ-geodesically convex.


(ii) Hessg Ψx (v, v) ≥ λ for all x ∈ M and all unit vectors v ∈ Tx M.

If (M, d) is the Euclidean space, (i) and (ii) are also equivalent to:

(iii) Ψ − (λ/2)|·|² is a convex function.

This latter condition is known in the optimization literature as strong convexity.


2.2 Gradient Flows in Metric Spaces


In this subsection we review the basic concepts needed to define gradient flows in a
metric space (X, dX ). We follow Chapter 8 of Santambrogio (2015); a standard technical
reference is Ambrosio et al. (2008).
To guide the reader, we first recall the formulation of gradient flows in Euclidean
space, where X = Rd and dX is the Euclidean metric. Let E : Rd → R be a differentiable
function, and consider the equation
  
\[
\dot{x}(t) = -\nabla E\big(x(t)\big), \quad t \ge 0, \qquad x(0) = x_0. \tag{7}
\]

Then, the solution x to (7) is the gradient flow of E in Euclidean space with initial
condition x0 ; it is a curve whose tangent vector at every point in time is the negative
of the gradient of the function E at that time. In order to generalize the notion of a
gradient flow to functionals defined on more general metric spaces, and in particular
when the metric space has no differential structure, we reformulate (7) in integral form by using that
\[
\frac{d}{dt} E\big(x(t)\big) = \big\langle \nabla E\big(x(t)\big),\, \dot{x}(t) \big\rangle = -\tfrac{1}{2}|\dot{x}(t)|^2 - \tfrac{1}{2}\big|\nabla E\big(x(t)\big)\big|^2
\]
as follows:
\[
E(x_0) = E\big(x(t)\big) + \frac{1}{2}\int_0^t |\dot{x}(r)|^2\, dr + \frac{1}{2}\int_0^t \big|\nabla E\big(x(r)\big)\big|^2\, dr, \qquad t > 0. \tag{8}
\]
This identity, known as energy dissipation equality, is equivalent to (7)—see Chapter
8 of Santambrogio (2015) for further details and other possible formulations. Crucially
(8) involves notions that can be defined in an arbitrary metric space (X, dX ): the metric
derivative of a curve t → x(t) ∈ X is given by
 
\[
|\dot{x}(t)| := \lim_{s \to t} \frac{d_X\big(x(t),\, x(s)\big)}{|s - t|},
\]
and the slope of a functional E : X → R ∪ {∞} is defined as the map |∇E| : {x ∈ X :
E(x) < ∞} → R ∪ {∞} given by
\[
|\nabla E|(x) := \limsup_{y \to x} \frac{\big( E(x) - E(y) \big)^{+}}{d_X(x, y)}.
\]
The identity (8) is the standard way to introduce gradient flows in arbitrary metric
spaces. In this paper we consider gradient flows in L2 and Wasserstein spaces, where
the notion of tangent vector is available. L2 has Hilbert space structure, whereas the
Wasserstein space can be seen as an infinite dimensional manifold (see Ambrosio et al.
(2008), Santambrogio (2015)).

3 Variational Characterizations of the Posterior and Gradient Flows
In this section we lay out the main elements of the theory of variational formulations and
gradient flows in regards to the Bayesian update. Subsection 3.1 details three variational
formulations defined in terms of the functionals JKL , Jχ2 and the Dirichlet energy Dμ .
Subsection 3.2 studies the geodesic convexity of JKL and Jχ2 in Wasserstein space and
of Dμ in L2 . Finally, subsection 3.3 collects the PDEs that characterize the gradient
flows, as well as the corresponding diffusion processes.

3.1 Variational Formulation of the Bayesian Update


The variational formulations of the posterior as the minimizer of JKL and Jχ2 share the
same structure and will be outlined first. The variational formulation in terms of the
Dirichlet energy will be given below.

The Functionals JKL and Jχ2

In mathematical analysis Jordan and Kinderlehrer (1996) and probability theory Dupuis
and Ellis (2011) it is often useful to note that a probability measure μ defined by
\[
\mu(du) = \frac{1}{Z} \exp\big(-\phi(u)\big)\, \pi(du) \tag{9}
\]
is the minimizer of the functional
\[
J_{\mathrm{KL}}(\nu) := D_{\mathrm{KL}}(\nu \,\|\, \pi) + F_{\mathrm{KL}}(\nu; \phi), \qquad \nu \in P(M), \tag{10}
\]
where
\[
F_{\mathrm{KL}}(\nu; \phi) := \int_M \phi(u)\, d\nu(u), \tag{11}
\]
and the integral is interpreted as +∞ if φ is not integrable with respect to ν. In physical
terms, the Kullback-Leibler divergence represents an internal energy, FKL represents a
potential energy, and the constant Z is known as the partition function. Here we are
concerned with a statistical interpretation of equation (9), and view it as defining a
posterior measure as a change of measure from a prior measure. In this context, the
Kullback-Leibler term DKL(· ∥ π) in (10) represents a penalization of deviations from
prior beliefs, the term FKL (ν; φ) penalizes deviations from the data, and the normalizing
constant Z represents the marginal likelihood. For brevity, we will henceforth suppress
the data y from the negative log-likelihood function φ, writing φ(u) instead of φ(u; y).
We remark that the fact that μ minimizes JKL follows immediately from the identity
\[
D_{\mathrm{KL}}(\cdot \,\|\, \mu) = J_{\mathrm{KL}}(\cdot) + \log Z. \tag{12}
\]
Minimizing JKL(·) or DKL(· ∥ μ) is thus equivalent, but the functional JKL makes apparent the roles of the prior and the likelihood.
The posterior μ also minimizes the functional
\[
J_{\chi^2}(\nu) := D_{\chi^2}(\nu \,\|\, \pi) + F_{\chi^2}(\nu; \phi), \qquad \nu \in P(M), \tag{13}
\]
where
\[
F_{\chi^2}(\nu; \phi) := \int_M \tilde{\phi}(u)\, d\nu(u), \qquad \tilde{\phi} = g\big(\exp(\phi(u))\big), \quad g(t) := t - 1, \ t > 0. \tag{14}
\]
We refer to Ohta and Takatsu (2011) for details. Note that both JKL and Jχ2 are
defined in terms of the two starting points of the Bayesian update: the prior π and
the negative log-likelihood φ. The associated variational formulations suggest a way
to define posterior-type measures based on these two ingredients in scenarios where
establishing the existence of conditional distributions via disintegration of measures is
technically demanding. This appealing feature of the two variational formulations above
is not shared by the one described in the next subsection.

The Dirichlet Energy Dμ

Let now the posterior μ be given, and consider the space L2 (M, μ) of functions defined
on M which are square integrable with respect to μ. Recall the Dirichlet energy

\[
D^{\mu}(f) = \int_M \|\nabla_g f(u)\|^2\, d\mu(u),
\]

introduced in equation (6). Now, since the measure μ can be characterized as the prob-
ability measure with density ρμ ≡ 1 a.s. with respect to μ, it follows that the posterior
density ρμ ≡ 1 is the minimizer of the Dirichlet energy Dμ over probability densities ρ ∈ L2(M, μ) with ∫_M ρ dμ = 1.

3.2 Geodesic Convexity and Functional Inequalities


In this section we study the geodesic convexity of the functionals JKL , Jχ2 , and Dμ . The
geodesic convexity of JKL and Jχ2 in Wasserstein space is considered first, and will be
followed by the geodesic convexity of Dμ in L2 . We will show the equivalence of the
latter to the posterior satisfying a Poincaré inequality.

Geodesic Convexity of JKL and Jχ2

The next proposition can be found in von Renesse and Sturm (2005) and Sturm (2006).
It shows that the convexity of JKL can be determined by the so-called curvature-
dimension condition—a condition that involves the curvature of the manifold and the
Hessian of the combined change of measure Ψ + φ. We recall the notation π = e−Ψ volg
and μ ∝ e−φ π.
Proposition 3. Suppose that Ψ, φ ∈ C2(M). Then JKL (or DKL(· ∥ μ)) is λ-geodesically convex if, and only if,
\[
\mathrm{Ric}_g(v, v) + \mathrm{Hess}_g \Psi(v, v) + \mathrm{Hess}_g \phi(v, v) \ \ge\ \lambda, \qquad \forall x \in M,\ \forall v \in T_xM \text{ with } g(v, v) = 1,
\]
where Ricg denotes the Ricci curvature tensor.

We recall that the Ricci curvature provides a way to quantify the disagreement
between the geometry of a Riemannian manifold and that of ordinary Euclidean space.
The Ricci tensor is defined as the trace of a map involving the Riemannian curvature
(see do Carmo Valero (1992)).
The following example illustrates the geodesic convexity of DKL(· ∥ μ) for Gaussian μ.

Example 1. Let μ = N(θ, Σ) be a Gaussian measure in Rm (endowed with the Euclidean metric), with Σ positive definite. Then DKL(· ∥ μ) is 1/Λmax(Σ)-geodesically convex, where Λmax(Σ) is the largest eigenvalue of Σ. This follows immediately from the above, since here Ψ(x) = (1/2)⟨x − θ, Σ−1(x − θ)⟩, and the Euclidean space is flat (its Ricci
curvature is identically equal to zero). Note that the level of convexity of the functional
depends only on the largest eigenvalue of the covariance, but not on the dimension m
of the underlying space.
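A quick numerical check of Example 1 (a Python sketch; the random covariances are purely illustrative): by Remark 2, the convexity constant is the smallest eigenvalue of Hess Ψ = Σ−1, which equals 1/Λmax(Σ) regardless of the dimension m.

```python
import numpy as np

rng = np.random.default_rng(0)
for m in (2, 20, 200):
    A = rng.standard_normal((m, m))
    Sigma = A @ A.T + np.eye(m)                 # illustrative positive definite covariance
    hess_Psi = np.linalg.inv(Sigma)             # Hessian of Psi(x) = 0.5 <x - theta, Sigma^{-1}(x - theta)>
    lam = np.linalg.eigvalsh(hess_Psi).min()    # geodesic convexity constant, via Remark 2
    print(m, np.isclose(lam, 1 / np.linalg.eigvalsh(Sigma).max()))   # True for every m
```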

The λ-convexity of JKL guarantees the existence of the gradient flow of JKL in Wasser-
stein space. Moreover, it determines the rate of convergence towards the posterior μ.
Precisely, if μ0 is absolutely continuous with respect to μ, and if λ > 0, then the gradient
flow t ∈ [0, ∞) → μt of JKL with respect to the Wasserstein metric starting at μ0 is well
defined and we have:
\[
\begin{aligned}
D_{\mathrm{KL}}(\mu_t \,\|\, \mu) &\le e^{-\lambda t}\, D_{\mathrm{KL}}(\mu_0 \,\|\, \mu), && t \ge 0, \\
W_2^2(\mu_t, \mu) &\le \frac{2}{\lambda}\, e^{-\lambda t}\, D_{\mathrm{KL}}(\mu_0 \,\|\, \mu), && t \ge 0.
\end{aligned} \tag{15}
\]
The second inequality, known as Talagrand inequality Villani (2003), establishes a com-
parison between Wasserstein geometry and information geometry. It can be established
directly combining the λ-geodesic convexity of JKL (for positive λ) with the first in-
equality. From (15) we see that a higher level of convexity of JKL allows to guarantee a
faster rate of convergence towards the posterior distribution μ.
We now turn to the geodesic convexity properties of Jχ2 . We recall that m denotes
the dimension of the manifold M. The following proposition can be found in (Ohta and
Takatsu, 2011, Theorem 4.1).
Proposition 4. Jχ2 is λ-geodesically convex if and only if both of the following two
properties are satisfied:

1. \(\mathrm{Ric}_g(v, v) + \mathrm{Hess}_g \Psi(v, v) + \dfrac{1}{m+1}\,\langle \nabla_g \Psi, v\rangle^2 \ \ge\ 0,\) ∀x ∈ M, ∀v ∈ TxM.

2. φ is λ-geodesically convex as a real valued function defined on M.

There are two main conclusions we can extract from the previous proposition. First,
that condition 1) is only related to the prior distribution π whereas condition 2) is only
related to the likelihood; in particular, the convexity properties of Jχ2 can indeed be
studied by studying separately the prior and the likelihood (notice that the proposition
gives an equivalence). Secondly, notice that condition 1) is a qualitative property and if it
is not met there is no hope that the functional Jχ2 has any level of global convexity even
when the likelihood function is a highly convex function. In addition, if 1) is satisfied,
the convexity of φ determines completely the level of convexity of Jχ2 . These features
are markedly different from the ones observed in the Kullback-Leibler case.
As for the functional JKL , one can establish the following functional inequalities,
under the assumption of λ-geodesic convexity of Jχ2 for λ > 0:
 
\[
\begin{aligned}
J_{\chi^2}(\mu_t) - J_{\chi^2}(\mu) &\le e^{-\lambda t}\big( J_{\chi^2}(\mu_0) - J_{\chi^2}(\mu) \big), && t \ge 0, \\
W_2^2(\mu_t, \mu) &\le \frac{2}{\lambda}\, e^{-\lambda t}\big( J_{\chi^2}(\mu_0) - J_{\chi^2}(\mu) \big), && t \ge 0.
\end{aligned} \tag{16}
\]
The above inequalities show that a higher level of convexity of Jχ2 guarantees a faster convergence towards the posterior distribution μ.

Geodesic Convexity of Dirichlet Energy

We now study the geodesic convexity of the Dirichlet energy functional defined in equa-
tion (6). In what follows we denote by ∥·∥μ the L2 norm with respect to μ. Let us start by recalling the Poincaré inequality.
Definition 5. We say that a Borel probability measure μ on M has a Poincaré inequality with constant λ if for every f ∈ L2(M, μ) satisfying ∫_M f dμ = 0 we have
\[
\|f\|_{\mu}^2 \ \le\ \frac{1}{\lambda}\, D^{\mu}(f).
\]

We now show that Poincaré inequalities are directly related to the geodesic convexity
of the functional Dμ in the L2 (M, μ) space.
Proposition 6. Let λ be a positive real number and let μ be a Borel probability measure
on M. Then, the measure μ has a Poincaré inequality with constant λ if and only if
the functional Dμ is 2λ-geodesically convex in the space of functions f ∈ L2(M, μ) satisfying ∫_M f dμ = 1.

Proof. First of all we claim that


Dμ (tf0 + (1 − t)f1 ) + t(1 − t)Dμ (f0 − f1 ) = tDμ (f0 ) + (1 − t)Dμ (f1 ), (17)
for all f0 , f1 ∈ L2 (M, μ) and every t ∈ [0, 1]. To see this, it is enough to assume that
both Dμ (f0 ) and Dμ (f1 ) are finite and then notice that equality (17) follows from the
easily verifiable fact that for an arbitrary Hilbert space V with induced norm | · | one
has
|tv0 + (1 − t)v1 |2 + t(1 − t)|v0 − v1 |2 = t|v0 |2 + (1 − t)|v1 |2 , ∀v0 , v1 ∈ V, ∀t ∈ [0, 1].

Now, suppose that μ has a Poincaré inequality with constant λ and consider two functions f0, f1 ∈ L2(M, μ) satisfying ∫_M f0 dμ = ∫_M f1 dμ = 1. Then, (17) combined with the Poincaré inequality (taking f := f0 − f1) gives:
\[
D^{\mu}\big(t f_0 + (1-t) f_1\big) + \lambda\, t(1-t)\, \|f_0 - f_1\|_{\mu}^2 \ \le\ t\, D^{\mu}(f_0) + (1-t)\, D^{\mu}(f_1), \tag{18}
\]
which is precisely the 2λ-geodesic convexity condition for Dμ.

Conversely, suppose that Dμ is 2λ-geodesically convex in the space of L2(M, μ) functions that integrate to one. Let f ∈ L2(M, μ) be such that ∫_M f dμ = 0 and without loss of generality assume that Dμ(f) < ∞ and that ∥f∥μ ≠ 0. Under these conditions, the positive and negative parts of f, f⁺ and f⁻, satisfy Dμ(f⁺), Dμ(f⁻) < ∞ and ∫_M f⁺ dμ = r = ∫_M f⁻ dμ, where r > 0. The inequality
\[
\|f\|_{\mu}^2 \ \le\ \frac{1}{\lambda}\, D^{\mu}(f)
\]
is obtained directly from (17) and (18) applied to


\[
f_0 := \frac{1}{r}\, f^{-}, \qquad f_1 := \frac{1}{r}\, f^{+}, \qquad t = 1/2.
\]
Remark 7. It is well known that the best Poincaré constant for a measure μ is equal to the smallest non-trivial eigenvalue of the operator −Δμg defined formally as
\[
-\Delta_g^{\mu} f := -\frac{1}{Z}\, \mathrm{div}_g\big( e^{-\phi - \Psi}\, \nabla_g f \big),
\]
where divg and ∇g are the divergence and gradient operators in (M, g). This eigenvalue can be written variationally as
\[
\lambda_2 := \min_{f \in L^2(M, \mu)} \frac{D^{\mu}(f)}{\|f - f_{\mu}\|_{\mu}^2},
\]
where
\[
f_{\mu} := \int_M f\, d\mu.
\]
Remark 8. Spectral gaps are used in the theory of MCMC as a means to bound the
asymptotic variance of empirical expectations Kipnis and Varadhan (1986).

Let us now consider t ∈ (0, ∞) → μt, the flow of Dμ in L2(M, μ) with some initial condition dμ0/dμ = ρ0. It is well known that this flow coincides with that of the functional JKL in Wasserstein space. However, taking the Dirichlet-L2 point of view, one can use a Poincaré inequality (i.e. the geodesic convexity of Dμ) to deduce the exponential convergence of μt towards μ in the χ2-sense. Indeed, let
\[
\rho_t := \frac{d\mu_t}{d\mu}, \qquad t \in (0, \infty).
\]
A standard computation then shows that
\[
\frac{1}{2}\frac{d}{dt}\|\rho_t - 1\|_{\mu}^2 = \int_M \frac{\partial \rho}{\partial t}\,(\rho_t - 1)\, d\mu = -\int_M |\nabla_g(\rho_t(u) - 1)|^2\, d\mu(u) \ \le\ -\lambda\, \|\rho_t - 1\|_{\mu}^2.
\]
In the second equality we have used that ∂ρ/∂t = Δμg ρ, as discussed in subsection 3.3 below. Hence by Gronwall's inequality, see e.g. Teschl (2012),
\[
\|\rho_t - 1\|_{\mu} \ \le\ \exp(-\lambda t)\, \|\rho_0 - 1\|_{\mu}, \qquad t > 0.
\]

3.3 PDEs and Diffusions


Here we describe the PDEs that govern the evolution of densities of the three gradient
flows, and the stochastic processes associated with these PDEs. We consider first the
flows defined with the functionals JKL and Dμ and then the flow defined by the functional
Jχ2 .
JKL-Wasserstein and Dμ-L2(M, μ)

It was shown in Jordan et al. (1998)—in the Euclidean setting and in the unweighted case π = dx—that the gradient flow of the Kullback-Leibler functional DKL(· ∥ π) in Wasserstein space produces a solution to the Fokker-Planck equation. More generally, under the convexity conditions guaranteeing the existence of the gradient flow t ∈ (0, ∞) → μt of DKL(· ∥ μ) (equivalently of JKL) starting from μ0 ∈ P(M), the densities
\[
\rho_t := \frac{d\mu_t}{d\mu}, \qquad \theta_t := \frac{d\mu_t}{d\mathrm{vol}_g}, \qquad t \in (0, \infty)
\]
satisfy (formally) the following Fokker-Planck equations
\[
\frac{\partial \rho}{\partial t} = \Delta_g^{\mu} \rho. \tag{19}
\]
\[
\frac{\partial \theta}{\partial t} = \Delta_g \theta + \mathrm{div}_g\big( \theta\, (\nabla_g \phi + \nabla_g \Psi) \big). \tag{20}
\]
Equation (20) can be identified as the evolution of the densities (w.r.t. dvolg) of the diffusion
\[
dX_t = -\nabla_g\big( \Psi(X_t) + \phi(X_t) \big)\, dt + \sqrt{2}\, dB_t^{g}, \tag{21}
\]

where Bg denotes a Brownian motion defined on (M, g) and ∇g is the gradient on (M, g). Naturally, the Dμ flow in L2 has the same associated Fokker-Planck equation (19) and diffusion process (21).

Jχ2 -Wasserstein

The PDE satisfied (formally) by the densities
\[
\tilde{\rho}_t := \frac{d\mu_t}{d\pi},
\]
of the Jχ2-Wasserstein flow t ∈ (0, ∞) → μt is the (weighted) porous medium equation:
\[
\frac{\partial \tilde{\rho}}{\partial t} = \Delta_g^{\pi} \tilde{\rho}^2 + \mathrm{div}_g^{\pi}\big( \tilde{\rho}\, \nabla_g \phi \big), \tag{22}
\]
where the weighted Laplacian and divergence are defined formally as

\[
\Delta_g^{\pi} f := \Delta_g f - \langle \nabla_g f,\, \nabla_g \Psi\rangle, \qquad
\mathrm{div}_g^{\pi} F := \mathrm{div}_g F - \langle F,\, \nabla_g \Psi\rangle. \tag{23}
\]

Consider now the stochastic process {Xt}t≥0 formally defined as the solution to the nonlinear diffusion
\[
dX_t = -\big( \tilde{\rho}(t, X_t)\, \nabla_g \Psi(X_t) + \nabla_g \phi(X_t) \big)\, dt + \sqrt{2\,\tilde{\rho}(t, X_t)}\, dB_t^{g}, \qquad X_0 \sim \rho_0, \tag{24}
\]
where ρ̃ is the solution to (22). Let θt be the evolution of the densities (with respect
to dvolg ) of the above diffusion. Then a formal computation shows that θ satisfies the
Fokker-Planck equation:

\[
\frac{\partial \theta}{\partial t} = -\mathrm{div}_g\Big( \theta\, \big( -\tilde{\rho}\, \nabla_g \Psi - \nabla_g \phi \big) \Big) + \Delta_g\big( \tilde{\rho}\, \theta \big).
\]
If we let β = (1/Z) exp(−Ψ)θ we see, using (23), that
\[
\Delta_g^{\pi} \beta^2 = e^{\Psi}\Big( \Delta_g\big( \beta^2 e^{-\Psi} \big) + \mathrm{div}_g\big( (\beta^2 e^{-\Psi})\, \nabla_g \Psi \big) \Big),
\]
\[
\mathrm{div}_g^{\pi}\big( \beta\, \nabla_g \phi \big) = e^{\Psi}\, \mathrm{div}_g\big( (\beta e^{-\Psi})\, \nabla_g \phi \big),
\]
implying that the distributions of the stochastic process (24) are those generated by the gradient flow of Jχ2 in Wasserstein space.
Remark 9. In contrast with the Langevin diffusion (21), the process (24) is defined
in terms of the solution of the equation satisfied by its densities. In particular, if one
wanted to simulate (24) one would need to know the solution of (22) beforehand.

4 Application: Sampling and Riemannian MCMC


So far we have treated the Riemannian manifold (M, g) as fixed. In this section we take
a different perspective and treat the metric g as a free parameter. Precisely, we will now
consider a family of gradient flows of the functional JKL with respect to Wasserstein
distances induced by different metrics g on the parameter space. We do this motivated
by the so called Riemannian MCMC methods for sampling, where a change of metric
in the base space is introduced in order to produce Langevin-type proposals that are
adapted to the geometric features of the target, thereby exploring regions of interest and
accelerating the convergence of the chain to the posterior. There are different heuristics
regarding the choice of metric (see Girolami and Calderhead (2011)), but no principled
way to compare different metrics and rank their performance for sampling purposes.
With the developments presented in this paper we propose one such principled criterion
as we describe below. We restrict our attention to the case M = Rm .
Let g be a Riemannian metric tensor on Rm defined via

\[
g_x(u, v) := \langle G(x)\, u,\, v\rangle, \qquad u, v \in T_xM,
\]

where for every x ∈ Rm, G(x) is an m × m positive definite matrix. In what follows we identify g with G and refer to both as 'the metric' and we use terms such as g-
geodesic, g-Wasserstein distance, etc. to emphasize that the notions considered are being
constructed using the metric g. Let dg be the distance induced by the metric tensor g and
let volg be the associated volume form. Notice that in terms of the Lebesgue measure
and the metric G, we can write
  
\[
d\mathrm{vol}_g(x) = \sqrt{\det G(x)}\, dx.
\]
We use the canonical basis for Rm as global chart for Rm and consider the canonical

vector fields ∂/∂x1, . . . , ∂/∂xm. The Christoffel symbols associated to the Levi-Civita connection of the Riemannian manifold (Rm, g) can be written in terms of derivatives of the metric as
\[
\Gamma^{l}_{ij} = \frac{1}{2}\Big( \frac{\partial}{\partial x_j} G_{ki} + \frac{\partial}{\partial x_i} G_{kj} - \frac{\partial}{\partial x_k} G_{ij} \Big)\, G^{-1}_{lk}, \tag{25}
\]
where in the right hand-side—and in what follows—we use Einstein’s summation con-
vention. The proof of the following result is in the Supplementary Material.
Theorem 10. Let F ∈ C2(Rm) and
\[
\mu(du) \propto \exp\big( -F(u) \big)\, du.
\]
The sharp constant λ for which JKL (or DKL(· ∥ μ)) is λ-geodesically convex in the g-Wasserstein distance is equal to
\[
\lambda_G := \inf_{x \in \mathbb{R}^m} \Lambda_{\min}\Big( G^{-1/2}\big( B + \mathrm{Hess}\, F - C \big)\, G^{-1/2} \Big),
\]
where Hess F is the usual (Euclidean) Hessian matrix of F, B is the matrix with coordinates
\[
B_{ij} := \frac{\partial \Gamma^{l}_{ij}}{\partial x_l} - \Gamma^{k}_{il}\, \Gamma^{l}_{jk}, \tag{26}
\]
and C is the matrix with coordinates
\[
C_{ij} := \Gamma^{l}_{ij}\, \frac{\partial F}{\partial x_l}. \tag{27}
\]
Moreover, for any a > 0,
\[
\lambda_{aG} = \frac{1}{a}\, \lambda_G. \tag{28}
\]
Note that λG is a key quantity in evaluating the quality of a metric G in building
geometry-informed Langevin diffusions for sampling purposes, as it gives the exponential
rate at which the evolution of probabilities built using the metric G converges towards
the posterior: larger λG corresponds to faster convergence. However, in order to establish
a fair performance comparison, the metrics need to be scaled appropriately. Indeed a
faster rate can be obtained by scaling down the metric (which can be thought of as
time-rescaling), as is clearly seen from the scaling property (28) of the functional λG.
It is important to note that scaling down the metric leads to a faster diffusion, but
also makes its discretization more expensive. Indeed the error of Euler discretizations
is largely influenced by the Lipschitz constant of the drift. This motivates that a fair
criterion for choosing the metric could be to maximize λG with the constraint
Lip(∇g F ) = Lip(G−1 ∇F ) ≤ 1, (29)
since ∇g F = G−1 ∇F (where ∇ denotes the standard Euclidean gradient) is the drift
of the diffusion (21). Note that the constraint (29) ensures that the metric cannot be
scaled down arbitrarily while also guaranteeing that the discretizations do not become
increasingly expensive. We remark that other constraints involving higher regularity
requirements may be useful if higher order discretizations are desired.
Remark 11. The functional λG can be used to determine the optimal metric among a
certain subclass of metrics of interest satisfying the condition (29). For instance, it may
be of interest to find the optimal constant metric G (see Proposition 12 below), or to
find the best metric within a finite family of metrics. On the other hand the constraint
(29) forces feasible metrics to induce diffusions that are not expensive to discretize.

To illustrate the previous remark we show that for a Gaussian target measure the op-
timal preconditioner is, unsurprisingly, given by the Fisher information. More precisely
we have the following proposition:
Proposition 12. Let μ = N(0, Σ). Then
\[
G^{*} := \Sigma^{-1} \tag{30}
\]
maximizes λG over the class of constant metrics G satisfying ∥G−1Σ−1∥ ≤ 1, as in (29). Moreover, the maximum value is
\[
\lambda_{G^{*}} = 1.
\]

Proof. Suppose for the sake of contradiction that there exists a constant metric G that satisfies condition (29), which in this case reads ∥G−1Σ−1∥ ≤ 1, and is such that λG > λG∗.

Let u be a unit norm eigenvector of G with eigenvalue λ > 0. Notice that by definition of λG we must have
\[
\langle G^{-1/2}\Sigma^{-1}G^{-1/2} u,\, u\rangle \ \ge\ \lambda_G \ > \ \lambda_{G^{*}} = 1. \tag{31}
\]
The left hand side of the above display can be rewritten as
\[
\langle G^{-1/2}\Sigma^{-1}G^{-1/2} u,\, u\rangle = \langle G^{-1}\Sigma^{-1}G^{-1/2} u,\, G^{1/2} u\rangle,
\]
and by the Cauchy-Schwarz inequality we see that
\[
\langle G^{-1}\Sigma^{-1}G^{-1/2} u,\, G^{1/2} u\rangle \ \le\ \big| G^{-1}\Sigma^{-1}G^{-1/2} u \big| \cdot \big| G^{1/2} u \big| \ \le\ \big| G^{-1/2} u \big| \cdot \big| G^{1/2} u \big|.
\]
Since u is an eigenvector of G with eigenvalue λ, it follows that u is also an eigenvector of G1/2 with eigenvalue √λ and of G−1/2 with eigenvalue 1/√λ. Therefore the right hand side of the above display is equal to one. This however contradicts (31). From this we deduce the optimality of G∗ among feasible metrics.

Example 2. Suppose that F(u) = (1/2)⟨Σ−1u, u⟩, with
\[
\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & \epsilon \end{pmatrix}.
\]
Consider the optimal metric
\[
G^{*} = \begin{pmatrix} 1 & 0 \\ 0 & \epsilon^{-1} \end{pmatrix}
\]
given by the previous proposition and the rescaled Euclidean metric Ge
\[
G_e = \begin{pmatrix} \epsilon^{-1} & 0 \\ 0 & \epsilon^{-1} \end{pmatrix},
\]
where the scalings have been chosen so that
\[
\mathrm{Lip}(G^{*-1}\nabla F) = \mathrm{Lip}(G_e^{-1}\nabla F) = 1.
\]
A calculation then shows that λG∗ = 1 while λGe = ε. Note that if the Euclidean metric is not rescaled by ε−1—violating the constraint (29)—then the same unit rate of convergence as with the metric G∗ is achieved. However, the drift of the associated diffusion
\[
dX_t = -\Sigma^{-1} X_t\, dt + \sqrt{2}\, dB_t
\]
is of order ε−1, making the discretization increasingly expensive in the small ε limit. On the other hand, since both G∗−1 and Ge−1 are of order ε, the drifts for both associated diffusions are order 1. This motivates our choice of constraint in equation (29).
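The two rates in Example 2 are easy to verify numerically. The sketch below (Python; ε = 0.01 is an arbitrary illustrative value) computes λG = Λmin(G−1/2 Σ−1 G−1/2) for both metrics—valid here because constant metrics have vanishing Christoffel symbols, so B = C = 0 in Theorem 10—together with the Lipschitz constant of the drift G−1∇F.

```python
import numpy as np

eps = 0.01                                          # illustrative value of epsilon
Sigma = np.diag([1.0, eps])
hess_F = np.linalg.inv(Sigma)                       # Hess F = Sigma^{-1}, constant in x

def lambda_G(G):
    # For a constant metric, B = C = 0 and lambda_G = Lambda_min(G^{-1/2} Hess F G^{-1/2}).
    G_inv_sqrt = np.diag(1 / np.sqrt(np.diag(G)))   # the metrics considered here are diagonal
    return np.linalg.eigvalsh(G_inv_sqrt @ hess_F @ G_inv_sqrt).min()

def drift_lip(G):
    # Lipschitz constant of the drift G^{-1} grad F(u) = G^{-1} Sigma^{-1} u (spectral norm).
    return np.linalg.norm(np.linalg.inv(G) @ hess_F, 2)

G_star = np.diag([1.0, 1.0 / eps])                  # optimal metric of Proposition 12
G_e = (1.0 / eps) * np.eye(2)                       # rescaled Euclidean metric
print(lambda_G(G_star), drift_lip(G_star))          # 1.0, 1.0
print(lambda_G(G_e), drift_lip(G_e))                # eps, 1.0
```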

5 Example: Semi-Supervised Learning


In this section we study the geodesic convexity of functionals arising in the Bayesian
formulation of semi-supervised classification. Our purpose is to illustrate the concepts
in a tangible setting, and to show that establishing sharp levels of geodesic convexity
may be more tractable for some functionals than others.
In semi-supervised classification one is interested in the following task: given a data
cloud X = {x1 , . . . , xn } together with (noisy) labels yi ∈ {−1, 1} for some of the data
points xi , i ∈ Z ⊂ {1, . . . , n}, classify the unlabeled data points by assigning labels
to them. Here we assume we have access to a weight matrix W quantifying the level
of similarity between the points in X. Thus, we focus on the graph-based approach
to semi-supervised classification, which boils down to propagating the known labels to
the whole cloud, using the geometry of the weighted graph (X, W ). We will investi-
gate the existence and convergence of gradient flows for several Bayesian graph-based
classification models proposed in Bertozzi et al. (2017). In the Bayesian approach, the
geometric structure that the weighted graph imposes on the data cloud is used to build
a prior on a latent space, and the noisy given labels are used to build the likelihood. The
Bayesian solution to the classification problem is a measure on the latent space, that is
then pushed forward to a measure on the label space {−1, 1}n. This latter measure
contains information on the most likely labels, and also provides a principled way to
quantify the remaining uncertainty on the classification process.
Let (X, W ) then be a weighted graph, where X = {x1 , . . . , xn } is the set of nodes
of the graph and W is the weight matrix between the points in X. All the entries of W
are non-negative real numbers and we assume that W is symmetric. Let L be the graph
Laplacian matrix defined by
L := D − W,
where D is the degree matrix of the weighted graph, i.e., the diagonal matrix with diagonal entries \(D_{ii} = \sum_{j=1}^{n} W_{ij}\). The above corresponds to the unnormalized graph
Laplacian, but different normalizations are possible Von Luxburg (2007). The graph-
Laplacian will be used in all the models below to favor prior draws of the latent variables
that are consistent with the geometry of the data cloud.
Remark 13. A special case of a weighted graph (X, W ) frequently found in the literature
is that in which the points in X are i.i.d. points sampled from some distribution on a
manifold M embedded in Rd , and the similarity matrix W is obtained as
 
\[
W_{ij} = K\Big( \frac{|x_i - x_j|}{r} \Big).
\]

In the above, K is a compactly supported kernel function, |xi − xj| is the Euclidean distance between the points xi and xj, and r > 0 is a parameter controlling data density.
It can be shown (see Burago et al. (2013) and Garcia Trillos et al. (2017a)) that the
smallest non-trivial eigenvalue of a rescaled version of the resulting graph Laplacian is
close to the smallest non-trivial eigenvalue of a weighted Laplacian on the manifold,
provided that r is scaled with n appropriately.
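For concreteness, the following sketch (Python; the two-arc point cloud, the truncated Gaussian kernel, and all parameter values are illustrative assumptions, not part of the paper) builds the similarity matrix of Remark 13 and the unnormalized graph Laplacian L = D − W used throughout this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 300, 0.6
# Illustrative point cloud: two noisy circular arcs in R^2.
theta = rng.uniform(0, np.pi, n)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X[n // 2:] = np.column_stack([1 - np.cos(theta[n // 2:]), 0.5 - np.sin(theta[n // 2:])])
X += 0.05 * rng.standard_normal(X.shape)

# Compactly supported kernel K and similarity matrix W_ij = K(|x_i - x_j| / r).
def K(s):
    return np.where(s < 1.0, np.exp(-s**2), 0.0)

dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
W = K(dists / r)
np.fill_diagonal(W, 0.0)                 # no self-loops

D = np.diag(W.sum(axis=1))               # degree matrix
L = D - W                                # unnormalized graph Laplacian
evals = np.linalg.eigvalsh(L)
print(evals[:3])                         # smallest eigenvalue ~ 0; the next ones give the spectral gap
```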

In the next two subsections we study the probit and logistic models.

5.1 Probit and Logistic Models


Traditionally, the probit approach to semi-supervised learning is to classify the unlabeled
data points by first optimizing the functional G : Rn → R given by
\[
\mathcal{G}(u) := \frac{1}{2}\langle L^{\alpha} u, u\rangle - \sum_{j \in Z} \log\big( H(y_j u_j; \gamma) \big), \qquad H(w; \gamma) := \int_{-\infty}^{w} \exp\big( -t^2/2\gamma^2 \big)\, dt \tag{32}
\]
over all u ∈ Rn satisfying \(\sum_{i=1}^{n} u_i = 0\), and then thresholding the optimizer with the
sign function; the parameter α > 0 is used to regularize the functions u. The minimizer
of the functional G can be interpreted as the MAP (maximum a posteriori estimator) in
the Bayesian formulation of probit semi-supervised learning (see Bertozzi et al. (2017))
that we now recall:
Prior: Consider the subspace U := {u ∈ Rn : \(\sum_{i=1}^{n} u_i = 0\)} and let π be the Gaussian measure on U defined by
\[
\frac{d\pi}{du}(u) \propto \exp\Big( -\frac{1}{2}\langle L^{\alpha} u, u\rangle \Big) =: \exp\big( -\Psi(u) \big). \tag{33}
\]

The measure π is interpreted as a prior over functions on the point cloud X with average
zero. Larger values of α > 0 force more regularization of the functions u.
Likelihood function: For a fixed u ∈ U and for j ∈ Z define

yj = S(uj + ηj ),
where the ηj are i.i.d. N (0, γ 2 ), and S is the sign function. This specifies the distribution
of observed labels given the underlying latent variable u. We then define, for given data
y, the negative log-density function
  
\[
\phi(u; y) := -\sum_{j \in Z} \log\big( H(y_j u_j; \gamma) \big), \qquad u \in U, \tag{34}
\]

where H is given by (32).


Posterior distribution: As shown in Bertozzi et al. (2017), a simple application
of Bayes’ rule gives the posterior distribution of u given y (denoted by μy ):
\[
\frac{d\mu^{y}}{d\pi}(v) \propto \exp\big( -\phi(v; y) \big); \qquad \frac{d\mu^{y}}{dv}(v) \propto \exp\big( -\Psi(v) - \phi(v; y) \big),
\]
where Ψ is given by (33), and φ is given by (34).

From what has been discussed in the previous sections, the posterior μy can be
characterized as the unique minimizer of the energy

\[
J_{\mathrm{KL}}(\nu) := D_{\mathrm{KL}}(\nu \,\|\, \pi) + \int_{\mathbb{R}^n} \phi(u; y)\, d\nu(u), \qquad \nu \in P(\mathbb{R}^n). \tag{35}
\]

Let us first consider the gradient flow of JKL with respect to the usual Wasserstein space
(i.e. the one induced by the Euclidean distance).
We can study the geodesic convexity of this functional by studying independently
the convexity properties of DKL (νπ) and of φ(·; y). Precisely:

i) Since π is a Gaussian measure with covariance L−α, Example 1 shows that DKL(ν ∥ π) is (Λmin(L))α-geodesically convex in Wasserstein space, where Λmin(L) is the smallest non-trivial eigenvalue of L.

ii) The function φ(·; y) is convex—see the appendix of Bertozzi et al. (2017). Hence, the functional FKL(ν) = ∫_{Rn} φ(u; y) dν(u) is 0-geodesically convex in Wasserstein space.

It then follows from Proposition 3 that JKL is (Λmin (L))α -geodesically convex in Wasser-
stein space. As a consequence, if we consider t ∈ [0, ∞) → μt , the gradient flow of JKL
with respect to the Wasserstein distance starting at μ0 (an absolutely continuous mea-
sure with respect to μ), geometric inequalities can be immediately obtained from (15);
such inequalities will not deteriorate with n—see Remark 13.
However, the diffusion associated to this flow is given by

\[
dX_t = \big( -L^{\alpha} X_t - \nabla\phi(X_t; y) \big)\, dt + \sqrt{2}\, dB_t, \tag{36}
\]

and in particular its drift (more precisely the term Lα Xt ) deteriorates as n gets larger.
Notice that if we wanted to control the cost of discretization by rescaling the Euclidean
metric (as exhibited in Example 2), the geodesic convexity of the resulting flow would
vanish as n gets larger.
The previous discussion shows that the flow of JKL in the usual Wasserstein sense does not have good convergence properties while at the same time being cheap to discretize (robustly in n). This motivates considering the gradient flow of JKL
with respect to the Wasserstein distance induced by a certain constant metric g. Indeed,
inspired by Proposition 12, let us consider the constant metric tensor

G := Lα .

Since the metric tensor is constant, in particular its induced volume form volg is pro-
portional to the Lebesgue measure and hence we can write
 
\[
d\mu^{y}(u) \propto \exp\Big( -\phi(u; y) - \frac{1}{2}\langle L^{\alpha} u, u\rangle \Big)\, d\mathrm{vol}_g(u), \qquad u \in U.
\]

On the other hand, from the discussion in Section 3.3 we know that the densities of the
stochastic process
\[
dX_t = -\nabla_g\Big( \phi(X_t; y) + \frac{1}{2}\langle L^{\alpha} X_t, X_t\rangle \Big)\, dt + \sqrt{2}\, dB_t^{g}
\]
correspond to the gradient flow of the energy JKL with respect to the Wasserstein
distance induced by the metric g, where B g is a Brownian motion on (Rm , g). This
diffusion can be rewritten in terms of the standard Euclidean gradient ∇ and Brownian
motion B as
\[
dX_t = -\big( X_t + L^{-\alpha}\nabla\phi(X_t; y) \big)\, dt + \sqrt{2 L^{-\alpha}}\, dB_t, \tag{37}
\]
after noticing that
\[
\nabla_g = G^{-1}\nabla, \qquad B^{g} = \sqrt{G^{-1}}\, B,
\]
where for the second identity we have used the fact that G is constant. How convex is
the energy JKL with respect to the Wasserstein distance induced by g? Since the metric
tensor G is constant it follows that
 
\[
\lambda_G := \inf_{x \in \mathbb{R}^m} \Lambda_{\min}\big( G^{-1/2}\, \mathrm{Hess}\, F\, G^{-1/2} \big),
\]
where F(u) := φ(u; y) + (1/2)⟨Lαu, u⟩. Finally, due to the convexity of φ(u; y) we deduce that
\[
\lambda_G \ \ge\ \Lambda_{\min}\big( G^{-1/2} L^{\alpha} G^{-1/2} \big) = 1.
\]

We notice that in (37) L appears as L−α . This is a fundamental difference from (36)
(where L appears as Lα ) with computational advantages, given that the eigenvalues of
L grow towards infinity.
Remark 14. A carefully designed discretization of (37) induces the so called Langevin
pCN proposal for MCMC computing (see Cotter et al. (2013)).
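As a rough illustration of how (37) might be discretized in practice, here is an Euler-Maruyama sketch in Python (hypothetical names; it is not the carefully designed discretization of Remark 14, and the accept-reject correction is omitted). The point is simply that the preconditioner L−α enters both the drift and the noise.

```python
import numpy as np

def preconditioned_langevin_step(u, grad_phi, L_alpha_inv, step, rng):
    """One Euler-Maruyama step of (37):
    du = -(u + L^{-alpha} grad phi(u)) dt + sqrt(2 L^{-alpha}) dB."""
    drift = -(u + L_alpha_inv @ grad_phi(u))
    # Noise with covariance 2*step*L^{-alpha}: use a matrix square root of L^{-alpha}.
    noise = np.linalg.cholesky(L_alpha_inv) @ rng.standard_normal(u.shape)
    return u + step * drift + np.sqrt(2 * step) * noise

# Hypothetical usage, assuming a positive definite L_alpha_inv (L^{-alpha} restricted to U)
# and the probit negative log-likelihood gradient grad_phi have been assembled beforehand:
# rng = np.random.default_rng(0)
# u = np.zeros(n)
# for _ in range(1000):
#     u = preconditioned_langevin_step(u, grad_phi, L_alpha_inv, step=0.01, rng=rng)
```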
Remark 15. In the above we have considered a probit model for the likelihood function.
The ideas generalize straightforwardly to other settings, notably the logistic model
  
\[
\phi(u; y) := -\sum_{j \in Z} \log\big( \sigma(y_j u_j; \gamma) \big), \qquad u \in U, \tag{38}
\]
where
\[
\sigma(t; \gamma) := \frac{1}{1 + e^{-t/\gamma}}.
\]
The convexity of φ for the logistic model (38) can be established by direct computation
of the second derivative of σ.
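For instance, a short symbolic check (a sketch using sympy, offered here only as an illustration and not as part of the original argument) confirms that −log σ(·; γ) has a nonnegative second derivative, and hence that φ in (38) is a sum of convex functions:

```python
import sympy as sp

t, gamma = sp.symbols('t gamma', positive=True)
sigma = 1 / (1 + sp.exp(-t / gamma))
second_derivative = sp.simplify(sp.diff(-sp.log(sigma), t, 2))
print(second_derivative)
# The result is equivalent to exp(-t/gamma)/(gamma**2*(1 + exp(-t/gamma))**2) >= 0,
# so -log(sigma(.; gamma)) is convex and phi in (38) is a sum of convex functions.
```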

5.2 Ginzburg-Landau Model


We now present the Ginzburg-Landau model for semi-supervised learning. This model
will provide us with an example of a functional JKL whose geodesic convexity with
respect to Wasserstein distance is not positive (and hence one can not deduce geometric
inequalities describing the rate of convergence towards the posterior), but for which one
can obtain a positive spectral gap giving the rate of convergence of the flow of Dirichlet
energy in the L2 sense.
Let
\[
W(t) := \frac{1}{4}\big( t^2 - 1 \big)^2, \qquad t \in \mathbb{R}.
\]
We consider the following Bayesian model.
Prior:
\[
\frac{d\pi}{du}(u) \propto \exp\Big( -\frac{1}{2}\langle u, L^{\alpha} u\rangle - \sum_{j \in Z} W(u_j) \Big) =: \exp\big( -\Psi(u) \big), \qquad u \in U. \tag{39}
\]

Likelihood function: For j ∈ Z,

yj = uj + ηj , ηj ∼ N (0, γ 2 ).

This leads to the following negative log-density function:


\[
\phi(u; y) = \frac{1}{2\gamma^2} \sum_{j \in Z} |y_j - u_j|^2. \tag{40}
\]

Posterior distribution: Combining the prior and the likelihood via Bayes’ formula
gives the posterior distribution

\[
\frac{d\mu^{y}}{d\pi}(u) \propto \exp\big( -\phi(u; y) \big); \qquad \frac{d\mu^{y}}{du}(u) \propto \exp\big( -\Psi(u) - \phi(u; y) \big), \qquad u \in U,
\]
where Ψ is given by (39), and φ is given by (40).
For this model, the negative prior log-density Ψ is not convex, and Wasserstein λ-geodesic convexity of the functional DKL(·‖π) only holds for negative λ. In particular, it is not possible to deduce exponential decay taking the Wasserstein flow point of view. However, in the L2/Dirichlet energy setting we can still show exponential convergence towards the posterior μ^y. Indeed, because the negative log-density of μ^y satisfies

|∇Ψ(u) + ∇φ(u; y)|²/2 − ΔΨ(u) − Δφ(u; y) → ∞,   as |u| → ∞,

there exists some λ > 0 for which μ^y satisfies a Poincaré inequality with constant λ (see Chapter 4.5 in Pavliotis (2014)). In this example we can say more, and in particular we are able to find a Poincaré constant that depends explicitly on ε, the smallest non-trivial eigenvalue of L, and k := |Z|.

Let ψ(u) := Σ_{j∈Z} W_ε(u_j) and let ψ_c be its convex envelope, i.e. let ψ_c be the largest convex function that is below ψ. It is straightforward to show (see the computation below) that ψ_c(0) = 0 and that

inf_{u∈R^n} {exp(ψ_c(u) − ψ(u))} = exp(−k/(4ε)).
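Indeed, the convex envelope factorizes over coordinates, and for the one-dimensional double well it can be computed explicitly:

(W_ε)_c(t) = 0 for |t| ≤ 1,   (W_ε)_c(t) = W_ε(t) for |t| > 1,

which is convex and C¹ since W_ε′(±1) = 0 and W_ε is convex on {|t| ≥ 1}. Hence ψ_c(u) = Σ_{j∈Z} (W_ε)_c(u_j), the difference ψ − ψ_c is maximized when u_j = 0 for all j ∈ Z, and sup(ψ − ψ_c) = k W_ε(0) = k/(4ε), which yields the infimum above.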
Consider now the probability measure μc with Lebesgue density
 
dμ_c(u) = (1/Z_c) exp(−(1/2)⟨L^α u, u⟩ − φ(u; y) − ψ_c(u)) du,
and define λ_2 and λ_{2,c} as in Remark 7 using μ^y and μ_c instead of μ. For any given f ∈ L2(μ) we then have

(∫|∇f|² dμ) / (∫|f − f_μ|² dμ) ≥ (∫|∇f|² dμ) / (∫|f − f_{μ_c}|² dμ) ≥ exp(−k/ε) (∫|∇f|² dμ_c) / (∫|f − f_{μ_c}|² dμ_c),

where the first inequality follows from the fact that f_μ = argmin_{a∈R} ∫|f − a|² dμ and the second inequality follows directly from the fact that 0 ≥ ψ_c − ψ ≥ −k/ε. It follows
that

λ_2 ≥ exp(−k/ε) λ_{2,c} ≥ exp(−k/ε) (Λ_min(L))^α,

where the last inequality follows from the fact that the negative log-density of μ_c satisfies the Bakry-Emery condition with constant (Λ_min(L))^α (see Chapter 4.5 in Pavliotis (2014)). Clearly, the Poincaré constant above is very large for small ε or for large k (number of labeled data points). We also notice that the cost of discretization of the diffusion associated to this flow increases with n (as in Section 5.1).
Remark 16. A similar analysis can be carried out now using the constant metric

G := L^α.

More precisely, consider the flow of the Dirichlet energy



D^g_μ(f) := ∫ |∇_g f|²_g dμ(u)

with respect to L2(μ). How convex is this functional? For every f ∈ L2(μ) we have

(∫|∇_g f|²_g dμ) / (∫|f − f_μ|² dμ) ≥ (∫|∇_g f|²_g dμ) / (∫|f − f_{μ_c}|² dμ) ≥ exp(−k/ε) (∫|∇_g f|²_g dμ_c) / (∫|f − f_{μ_c}|² dμ_c),

from where it follows that

λ^g_2 ≥ exp(−k/ε).
A similar remark to the one at the end of Section 5.1 regarding the dependence on L of the resulting diffusion applies here as well.

6 Conclusions and Future Work


The main contribution of this paper is to explore three variational formulations of the
Bayesian update and their associated gradient flows. We have shown that, for each of the
three variational formulations, the geodesic convexity of the objective functionals gives
a bound on the rate of convergence of the flows to the posterior. As an application of the
theory, we have suggested a criterion for the optimal choice of metric in Riemannian
MCMC schemes. We summarize below some additional outcomes and directions for
further work.

• We bring attention to different variational formulations of the Bayesian update.


These formulations have the potential to extend the theory of Bayesian inverse problems in function spaces, in particular to cases with infinite-dimensional, non-additive, and non-Gaussian observation noise. Moreover, they suggest numerical
approximations to the posterior by restricting the space of allowed measures in the
minimization, by discretization of the associated gradient flows, or by sampling
via simulation of the associated diffusion.

• The variational framework considered in this paper provides a natural setting for
the study of robustness of Bayesian models, and for the analysis of convergence
of discrete to continuum Bayesian models. Indeed, Garcia Trillos and Sanz-Alonso (2018a) and Garcia Trillos et al. (2017b) have recently established the consistency of Bayesian semi-supervised learning in the regime with a fixed number of labeled data points and a growing number of unlabeled data points. The analysis relies
on the variational formulation based on Kullback-Leibler prior penalization in
equation (35).

• Our results give new understanding of the ubiquity of Kullback-Leibler penalizations in sampling methodology. In practice Kullback-Leibler is often used for computational and analytical tractability. The results in Section 3.3 show that
Kullback-Leibler prior penalization leads to a heat-type flow and, therefore, to an
easily discretized diffusion process. On the other hand, χ2 prior penalization leads
to a nonlinear diffusion process.

Supplementary Material
Supplementary Material to “The Bayesian Update: Variational Formulations and Gra-
dient Flows” (DOI: 10.1214/18-BA1137SUPP; .pdf).

References
Ambrosio, L., Gigli, N., and Savaré, G. (2008). Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media. MR2401600.
Attias, H. (1999). “A Variational Bayesian Framework for Graphical Models.” In NIPS, volume 12.
Bertozzi, A. L., Luo, X., Stuart, A. M., and Zygalakis, K. C. (2017). “Uncertainty quantification in the classification of high dimensional data.” MR3794350. doi: https://doi.org/10.1137/17M1134214.
Besag, J. E. (1994). “Comments on “Representations of knowledge in complex systems” by U. Grenander and M. I. Miller.” Journal of the Royal Statistical Society: Series B, 56: 591–592. MR1293234.
Burago, D., Burago, Y., and Ivanov, S. (2001). A course in metric geometry, volume 33 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI. MR1835418. doi: https://doi.org/10.1090/gsm/033.
Burago, D., Ivanov, S., and Kurylev, Y. (2013). “A graph discretization of the Laplace-Beltrami operator.” arXiv preprint arXiv:1301.2222. MR3299811. doi: https://doi.org/10.4171/JST/83.
Cotter, S. L., Roberts, G. O., Stuart, A. M., and White, D. (2013). “MCMC methods for functions: modifying old algorithms to make them faster.” Statistical Science, 28(3): 424–446. MR3135540. doi: https://doi.org/10.1214/13-STS421.
do Carmo Valero, M. P. (1992). Riemannian Geometry. MR1138207. doi: https://doi.org/10.1007/978-1-4757-2201-7.
Dupuis, P. and Ellis, R. S. (2011). A weak convergence approach to the theory of large deviations, volume 902. John Wiley & Sons. MR1431744. doi: https://doi.org/10.1002/9781118165904.
Fox, C. W. and Roberts, S. J. (2012). “A tutorial on variational Bayesian inference.” Artificial Intelligence Review, 38(2): 85–95.
Garcia Trillos, N., Gerlach, M., Hein, M., and Slepcev, D. (2017a). “Spectral convergence of empirical graph Laplacians.” In preparation.
Garcia Trillos, N., Kaplan, Z., Samakhoana, T., and Sanz-Alonso, D. (2017b). “On the consistency of graph-based Bayesian learning and the scalability of sampling algorithms.” arXiv preprint arXiv:1710.07702.
Garcia Trillos, N. and Sanz-Alonso, D. (2018a). “Continuum limits of posteriors in graph Bayesian inverse problems.” SIAM Journal on Mathematical Analysis, 50(4): 4020–4040. MR3829512. doi: https://doi.org/10.1137/17M1138005.
Garcia Trillos, N. and Sanz-Alonso, D. (2018b). Supplementary Material to “The Bayesian Update: Variational Formulations and Gradient Flows”. doi: https://doi.org/10.1214/18-BA1137.
Girolami, M. and Calderhead, B. (2011). “Riemann manifold Langevin and Hamiltonian Monte Carlo methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123–214. MR2814492. doi: https://doi.org/10.1111/j.1467-9868.2010.00765.x.
Guo, F., Wang, X., Fan, K., Broderick, T., and Dunson, D. B. (2016). “Boosting variational inference.” arXiv preprint arXiv:1611.05559.
Jordan, R. and Kinderlehrer, D. (1996). “An extended variational principle.” Partial differential equations and applications: collected papers in honor of Carlo Pucci, 187. MR1371591. doi: https://doi.org/10.5006/1.3292113.
Jordan, R., Kinderlehrer, D., and Otto, F. (1998). “The variational formulation of the Fokker–Planck equation.” SIAM Journal on Mathematical Analysis, 29(1): 1–17. MR1617171. doi: https://doi.org/10.1137/S0036141096303359.
Kipnis, C. and Varadhan, S. R. S. (1986). “Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions.” Communications in Mathematical Physics, 104(1): 1–19. MR0834478.
McCann, R. J. (1997). “A convexity principle for interacting gases.” Advances in Mathematics, 128(1): 153–179. MR1451422. doi: https://doi.org/10.1006/aima.1997.1634.
Ohta, S. and Takatsu, A. (2011). “Displacement convexity of generalized relative entropies.” Advances in Mathematics, 228(3): 1742–1787. MR2824568. doi: https://doi.org/10.1016/j.aim.2011.06.029.
Pavliotis, G. A. (2014). Stochastic Processes and Applications. Texts in Applied Mathematics. Springer, Berlin. MR3288096. doi: https://doi.org/10.1007/978-1-4939-1323-7.
Pinski, F. J., Simpson, G., Stuart, A. M., and Weber, H. (2015). “Kullback–Leibler approximation for probability measures on infinite dimensional spaces.” SIAM Journal on Mathematical Analysis, 47(6): 4091–4122. MR3419882. doi: https://doi.org/10.1137/140962802.
Roberts, G. O. and Tweedie, R. L. (1996). “Exponential convergence of Langevin distributions and their discrete approximations.” Bernoulli, 341–363. MR1440273. doi: https://doi.org/10.2307/3318418.
Santambrogio, F. (2015). Optimal transport for applied mathematicians. Birkhäuser, NY. MR3409718. doi: https://doi.org/10.1007/978-3-319-20828-2.
Sturm, K. T. (2006). “On the geometry of metric measure spaces.” Acta Mathematica, 196(1): 65–131. MR2237206. doi: https://doi.org/10.1007/s11511-006-0002-8.
Teschl, G. (2012). Ordinary differential equations and dynamical systems, volume 140. American Mathematical Soc. MR2961944. doi: https://doi.org/10.1090/gsm/140.
Villani, C. (2003). Topics in optimal transportation, volume 58. American Mathematical Soc. MR1964483. doi: https://doi.org/10.1007/b12016.
Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media. MR2459454. doi: https://doi.org/10.1007/978-3-540-71050-9.
Von Luxburg, U. (2007). “A tutorial on spectral clustering.” Statistics and Computing, 17(4): 395–416. MR2409803. doi: https://doi.org/10.1007/s11222-007-9033-z.
von Renesse, M. K. and Sturm, K. T. (2005). “Transport inequalities, gradient estimates, entropy and Ricci curvature.” Communications on Pure and Applied Mathematics, 58(7): 923–940. MR2142879. doi: https://doi.org/10.1002/cpa.20060.
Wainwright, M. J. and Jordan, M. I. (2008). “Graphical models, exponential families, and variational inference.” Foundations and Trends in Machine Learning, 1(1–2): 1–305.
Zellner, A. (1988). “Optimal information processing and Bayes’s theorem.” The American Statistician, 42(4): 278–280. MR0971095. doi: https://doi.org/10.2307/2685143.

Acknowledgments
We are thankful to Matías Delgadino for pointing us to the reference Ohta and Takatsu (2011) and to Sayan Mukherjee for the reference Zellner (1988). We thank the editor and referees for their help in improving the readability of our manuscript.
