The Bayesian Update: Variational Formulations and Gradient Flows
1 Introduction
In this paper we revisit the old idea of viewing the posterior as the minimizer of an energy
functional. The use of variational formulations of Bayes rule seems to have been largely
focused on one of its methodological benefits: restricting the minimization to a subclass
of measures is the backbone of variational Bayes methods for posterior approximation.
Our aim is to bring attention to two other theoretical and methodological benefits, and
to study in some detail one of these: namely, that each variational formulation suggests a
natural path, defined by a gradient flow, towards the posterior. We use this observation
to propose a criterion for the choice of metric in Riemannian MCMC methods.
Let us recall informally a variational formulation of Bayes rule. Given a prior p(u)
on an unknown parameter u and a likelihood function L(y|u), the posterior p(u|y) ∝
L(y|u)p(u) can be characterized (Zellner, 1988) as the distribution q∗(u) that minimizes

JKL(q) = DKL(q(u) ‖ p(u)) − ∫ log L(y|u) q(u) du. (1)

Indeed, minimizing JKL(q) is equivalent to minimizing the Kullback-Leibler divergence DKL(q(u) ‖ p(u|y)), and so clearly the minimizer is q∗(u) = p(u|y).
tional formulations may be considered by minimizing, for instance, other divergences
rather than Kullback-Leibler. In this paper we consider three variational formulations of
the Bayesian update. The first two characterize the posterior measure as the minimizer
of functionals JKL or Jχ2 constructed by penalizing deviations from the prior measure in
Kullback-Leibler or χ2 divergence—definitions of these functionals and divergences are
given in equations (10), (13), (4), (5). The third one characterizes the posterior density
as the minimizer of the Dirichlet energy Dμ —see (6).
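The following minimal numerical sketch makes the characterization (1) concrete on a discrete grid; the grid, prior, and likelihood are illustrative assumptions rather than objects from the paper.

```python
# Hedged numerical sketch of the characterization (1) on a discrete grid:
# the probability vector minimizing J_KL is the posterior.  The grid,
# prior, and likelihood below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
u = np.linspace(-3, 3, 200)                 # parameter grid
prior = np.exp(-u**2 / 2); prior /= prior.sum()
lik = np.exp(-(1.0 - u)**2)                 # L(y|u) for one fixed datum y

def J_KL(q):
    """J_KL(q) = D_KL(q || prior) - E_q[log L]."""
    return np.sum(q * np.log(q / prior)) - np.sum(q * np.log(lik))

posterior = prior * lik
posterior /= posterior.sum()

# Any perturbation of the posterior has larger energy; at the posterior,
# J_KL equals -log Z, with Z the (grid) marginal likelihood.
for _ in range(5):
    q = posterior * np.exp(0.1 * rng.standard_normal(u.size))
    q /= q.sum()
    assert J_KL(posterior) <= J_KL(q)
print("J_KL at the posterior:", J_KL(posterior))
```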
Why is it useful to view the posterior p(u|y) as the minimizer of an energy? We list below three advantages of this viewpoint, the third of which will be the focus of our paper.
1. Restricting the minimization to a subclass of tractable measures yields an approximation to the posterior; this is the backbone of variational Bayes methods Attias (1999), Wainwright and Jordan (2008), Fox and Roberts (2012).
2. If the prior pεn(u) or the likelihood Lεn(y|u) depend on a parameter εn, then the variational formulation makes it possible to establish large-n convergence of the posteriors pεn(u|y) by establishing the Γ-convergence of the associated energies. This method of proof has been employed by the authors in Garcia Trillos and Sanz-Alonso (2018a), Garcia Trillos et al. (2017b) to analyze the large-data consistency of graph-based Bayesian semi-supervised learning.
3. Each variational formulation suggests a natural path towards the posterior, defined by a gradient flow. This third advantage is the focus of our paper: we use it to propose a criterion for the choice of metric in Riemannian MCMC methods.
In this paper we study three gradient flows associated with the variational formu-
lations defined by minimization of the functionals JKL , Jχ2 , and Dμ . For intuition, we
recall that a gradient flow in Euclidean space defines a curve whose tangent always
points in the direction of steepest descent of a given function—see equation (7). In the
same fashion, a gradient flow in a more general metric space can be thought of as a
curve on said space that always points in the direction of steepest descent of a given
functional (Ambrosio et al., 2008). In Euclidean space the direction of steepest descent is given by the negative of the gradient. In this paper we study:
(i) The gradient flows defined by JKL and Jχ2 in the space of probability measures
with finite second moments endowed with the Wasserstein distance—definitions
are given in (3). By construction, these flows give curves of probability measures
that evolve following the direction of steepest descent of JKL and Jχ2 in Wasserstein
distance, converging to the posterior measure in the large-time limit.
(ii) The gradient flow defined by the Dirichlet energy Dμ in the space of square in-
tegrable densities endowed with the L2 distance. By construction, this flow gives
a curve of densities in L2 that evolves following the direction of steepest descent
of Dμ in L2 distance, converging to the posterior density in the large-time limit.
Interestingly, the curve of measures associated with these densities is exactly the same as the curve defined by the JKL flow on Wasserstein space (Jordan et al., 1998).
A question arises: what is the rate of convergence of these flows to the posterior? The
answer is, to a large extent, provided by the theory of optimal transport and gradient
flows Ambrosio et al. (2008), Villani (2003), Villani (2008), Santambrogio (2015). We
will review and provide a unified account of these results in the main body of the paper,
section 3. In the remainder of this introduction we discuss how rates of convergence
may be studied in terms of convexity of functionals, and how these rates may be used
as a guide for the choice of proposals for MCMC methods.
Rates of convergence of the flows hinge on the convexity level of each of the func-
tionals JKL , Jχ2 , and Dμ . Recalling the Euclidean case may be helpful: gradient descent
on a highly convex function will lead to fast convergence to the minimizer. What is,
however, a sensible notion of convexity for functionals defined over measures or densi-
ties? Our presentation highlights that the notion of geodesic (or displacement) convexity
McCann (1997) nicely unifies the theory: it guarantees the existence and uniqueness of
the three gradient flows and it also provides a bound on their rate of convergence to
the posterior. In the L2 setting one can show that positive geodesic convexity is equiv-
alent to the posterior satisfying a Poincaré inequality, and also to the existence of a
spectral gap—see subsection 3.2. On the other hand, the geodesic convexity of JKL and
Jχ2 in Wasserstein space is determined by the Ricci curvature of the manifold, as well
as by the likelihood function and prior density—see (10), (11), (13), (14). Typically the
three functionals JKL , Jχ2 , and Dμ will have different levels of geodesic convexity, and
establishing a sharp bound on each of them may not be equally tractable.
The theory of gradient flows and optimal transport gives, for each of the flows, an
associated Fokker-Planck partial differential equation (PDE) that governs the evolution
of densities Ohta and Takatsu (2011), Santambrogio (2015). Such PDEs are typically
costly to discretize if the parameter space is of moderate or high dimension, but they have associated diffusion processes that can be simulated. The overarching programme is as follows:
• The posterior measure (or its density) can be characterized as the minimizer of an energy functional defined in terms of the prior and the likelihood.
• One can then study the gradient flows of these functionals with respect to a metric
on the space of probability measures or densities; the resulting curve is a curve of
maximal slope and its endpoint is the posterior.
• The gradient flows are characterized by a Fokker-Planck PDE that governs the
evolution of the density of an associated diffusion process.
The ideas in this paper immediately extend beyond the Bayesian interpretation
stressed here to any application (e.g. the study of conditioned diffusions) where a mea-
sure of interest is defined in terms of a reference measure and a change of measure.
Also, we consider only Kullback-Leibler and χ2 prior penalizations to define the func-
tionals JKL and Jχ2 , but it would be possible to extend the analysis to the family of
m-divergences introduced in Ohta and Takatsu (2011). Kullback-Leibler and χ2 prior
penalization correspond to m → 1 and m = 2 within this family. In what follows we
point out some of the features of the different functionals and gradient flows that we
consider in this paper.
1. The two gradient flows in Wasserstein space (arising from the functionals JKL and
Jχ2 ) are fundamentally connected with the variational formulation: these varia-
tional formulations can be used to define posterior-type measures via a penal-
ization of deviations from the prior and deviations from the data in situations
where establishing the existence of conditional distributions by disintegration of
measures is technically demanding. On the other hand, the variational formula-
tion for the Dirichlet energy is less natural and requires previous knowledge of the
posterior.
2. The precise level of geodesic convexity of the functionals JKL (and Jχ2 ) can be
computed from point evaluation of the Ricci tensor (of the parameter space) and
derivatives of the densities. In particular, knowledge of the underlying metric
suffices to compute these quantities. In contrast, establishing a sharp Poincaré
inequality—the level of geodesic convexity of the Dirichlet energy in L2 (M, μ)—
is in practice infeasible, as it effectively requires solving an infinite-dimensional optimization problem. It is for this reason—and because of the explicit dependence of the convexity in Wasserstein space on the geometry induced by the manifold
metric tensor—that our analysis of the choice of metric in Riemannian MCMC
methods is based on the JKL functional (see section 4, and in particular Theorem
10).
3. On the flip side of point 2, a Poincaré inequality for the posterior with a not
necessarily optimal constant can be established using only tail information. In
particular, even when the functional JKL is not geodesically convex in Wasserstein
space, one may still be able to obtain a Poincaré inequality (see subsection 5.2 for
an example).
4. In contrast to the diffusions arising from the JKL or Dirichlet flows, the stochastic
processes arising from the Jχ2 formulation are inhomogeneous, and hence simu-
lation seems more challenging unless further structure is assumed on the prior
measure and likelihood function. Also, the evolution of densities of the gradient
flow of Jχ2 in Wasserstein space is given by a porous medium PDE.
1.2 Outline
The rest of the paper is organized as follows. Section 2 contains some background ma-
terial on the Wasserstein space, geodesic convexity of functionals, and gradient flows
in metric spaces. The core of the paper is section 3, where we study the geodesic con-
vexity, PDEs, and diffusions associated with each of the three functionals JKL , Jχ2 ,
and Dμ . In section 4 we consider an application of the theory to the choice of metric
in Riemannian MCMC methods Girolami and Calderhead (2011), and in section 5 we
illustrate the main concepts and ideas through examples arising in Bayesian formula-
tions of semi-supervised learning Garcia Trillos and Sanz-Alonso (2018a), Garcia Trillos
et al. (2017b), Bertozzi et al. (2017). We close in section 6 by summarizing our main
contributions and pointing to open directions.
Throughout, (M, g) denotes a Riemannian manifold with volume form volg, and we condition on observed data y. We will assume that the prior π is defined as a change of measure from volg, and that the posterior μ is defined as a change of measure from π as follows:

π = e−Ψ volg, μ ∝ e−φ π. (2)

The data is incorporated into the Bayesian update through the negative log-likelihood function φ(·) = φ(·; y).
2 Preliminaries
In this section we provide some background material. The Wasserstein space, and the
notion of λ-geodesic convexity of functionals are reviewed in subsection 2.1. Gradient
flows in metric spaces are reviewed in subsection 2.2.
2.1 The Wasserstein Space and Geodesic Convexity
Two geodesic spaces will play a central role: the Wasserstein space (P2(M), W2), and the space of functions f ∈ L2(M, μ) with ∫M f dμ = 1, equipped with the L2(M, μ) norm.
We spell out the definitions of P2 (M) and W2 :
P2(M) := { ν ∈ P(M) : ∫M d²(x, x0) dν(x) < ∞ for some x0 ∈ M },
(3)
W2²(ν1, ν2) := inf_α ∫_{M×M} d(x, y)² dα(x, y), ν1, ν2 ∈ P2(M).
The infimum in the previous display is taken over all transportation plans between ν1 and
ν2 , i.e. over α ∈ P(M×M) with marginals ν1 and ν2 on the first and second factors. The
space (P2 (M), W2 ) is indeed a geodesic space: geodesics in (P2 (M), W2 ) are induced
by those in (M, d). All it takes to construct a geodesic connecting ν0 ∈ P2 (M) and
ν1 ∈ P2 (M) is to find an optimal transport plan between ν0 and ν1 to determine source
locations and target locations, and then transport the mass along geodesics in M (see
Villani (2003) and Santambrogio (2015)).
The space of functions f ∈ L2(M, μ) with ∫M f dμ = 1, equipped with the L2(M, μ)
norm is also a geodesic space, where a constant speed geodesic connecting f0 and f1 is
given by linear interpolation: t ∈ [0, 1] → (1 − t)f0 + tf1 .
We will consider several functionals E : X → R ∪ {∞} throughout the paper. They
will all be defined in one of our three geodesic spaces—that is, X = M, X = P2 (M) or
X = L2 (M, μ). Important examples will be, respectively:
1. Functions Ψ : M → R ∪ {∞}.
2. The Kullback-Leibler and χ2 divergences DKL(· ‖ π), Dχ2(· ‖ π) : P(M) → [0, ∞],
where π is a given (prior) measure and, for ν1 , ν2 ∈ P(M),
dν1
M dν2
(u) log dνdν2 (u) dν2 (u), ν1
1
ν2 ,
DKL (ν1 ν2 ) := (4)
∞, otherwise,
⎧
⎨ 2
dν1
(u) − 1 dν2 (u), ν1 ν2 ,
Dχ2 (ν1 ν2 ) := M dν2 (5)
⎩∞, otherwise;
and the potential-type functional J : P(M) → R ∪ {∞} given by
J(ν) := ∫M h(u) dν(u), for a given function h : M → R.
3. The Dirichlet energy Dμ introduced in (6).
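For discrete measures the divergences (4)-(5) reduce to finite sums involving the ratio dν1/dν2; a minimal sketch, with illustrative probability vectors:

```python
# Hedged sketch: the divergences (4)-(5) for discrete measures nu1 << nu2
# with strictly positive probability vectors (illustrative values below);
# dnu1/dnu2 is then simply the entrywise ratio.
import numpy as np

def D_KL(nu1, nu2):
    """Kullback-Leibler divergence (4): integral of (r log r) d nu2, r = dnu1/dnu2."""
    return np.sum(nu1 * np.log(nu1 / nu2))

def D_chi2(nu1, nu2):
    """Chi-squared divergence (5): integral of (r - 1)^2 d nu2."""
    return np.sum(nu2 * (nu1 / nu2 - 1.0) ** 2)

nu2 = np.array([0.25, 0.25, 0.25, 0.25])  # reference measure (e.g. a prior)
nu1 = np.array([0.40, 0.30, 0.20, 0.10])
print(D_KL(nu1, nu2), D_chi2(nu1, nu2))
```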
2.2 Gradient Flows in Metric Spaces
Let E : Rm → R be a smooth function, let x0 ∈ Rm, and consider the ODE

ẋ(t) = −∇E(x(t)), x(0) = x0. (7)
Then, the solution x to (7) is the gradient flow of E in Euclidean space with initial
condition x0 ; it is a curve whose tangent vector at every point in time is the negative
of the gradient of the function E at that time. In order to generalize the notion of a
gradient flow to functionals defined on more general metric spaces, and in particular
when the metric space has no differential structure, we reformulate (7) in integral form by using that, along solutions, (d/dt) E(x(t)) = ⟨∇E(x(t)), ẋ(t)⟩ = −(1/2)|ẋ(t)|² − (1/2)|∇E(x(t))|², as follows:

E(x0) = E(x(t)) + (1/2) ∫0t |ẋ(r)|² dr + (1/2) ∫0t |∇E(x(r))|² dr, t > 0. (8)
This identity, known as energy dissipation equality, is equivalent to (7)—see Chapter
8 of Santambrogio (2015) for further details and other possible formulations. Crucially
(8) involves notions that can be defined in an arbitrary metric space (X, dX ): the metric
derivative of a curve t → x(t) ∈ X is given by
dX x(t), x(s)
|ẋ(t)| := lim ,
s→t |s − t|
and the slope of a functional E : X → R ∪ {∞} is defined as the map |∇E| : {x ∈ X :
E(x) < ∞} → R ∪ {∞} given by
|∇E|(x) := lim sup_{y→x} (E(x) − E(y))⁺ / dX(x, y).
The identity (8) is the standard way to introduce gradient flows in arbitrary metric
spaces. In this paper we consider gradient flows in L2 and Wasserstein spaces, where
the notion of tangent vector is available. L2 has Hilbert space structure, whereas the
Wasserstein space can be seen as an infinite dimensional manifold (see Ambrosio et al.
(2008), Santambrogio (2015)).
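The energy dissipation equality (8) can be checked numerically along an explicit Euler discretization of (7); the following sketch uses an illustrative quadratic energy.

```python
# Sketch: checking the energy dissipation equality (8) along an explicit
# Euler discretization of the gradient flow (7); E is an illustrative
# quadratic energy, so the exact flow is known to dissipate it.
import numpy as np

E = lambda x: 0.5 * np.dot(x, x)
gradE = lambda x: x

dt, T = 1e-4, 2.0
x = np.array([1.0, -2.0])                 # initial condition x0
E0, dissipation = E(x), 0.0
for _ in range(int(T / dt)):
    v = -gradE(x)                         # tangent: direction of steepest descent
    # accumulate (1/2)|x'(r)|^2 + (1/2)|grad E(x(r))|^2 along the curve
    dissipation += dt * (0.5 * np.dot(v, v) + 0.5 * np.dot(gradE(x), gradE(x)))
    x = x + dt * v
print(E0, E(x) + dissipation)             # the two sides of (8) nearly coincide
```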
3 Variational Formulations and Gradient Flows
Subsection 3.1 reviews the variational formulations defined in terms of the functionals JKL, Jχ2 and the Dirichlet energy Dμ.
Subsection 3.2 studies the geodesic convexity of JKL and Jχ2 in Wasserstein space and
of Dμ in L2 . Finally, subsection 3.3 collects the PDEs that characterize the gradient
flows, as well as the corresponding diffusion processes.
3.1 Variational Formulations
In mathematical analysis Jordan and Kinderlehrer (1996) and probability theory Dupuis
and Ellis (2011) it is often useful to note that a probability measure μ defined by
μ(du) = (1/Z) exp(−φ(u)) π(du) (9)
is the minimizer of the functional
JKL(ν) := DKL(ν ‖ π) + FKL(ν; φ), ν ∈ P(M), (10)
where
FKL(ν; φ) := ∫M φ(u) dν(u), (11)
and the integral is interpreted as +∞ if φ is not integrable with respect to ν. In physical
terms, the Kullback-Leibler divergence represents an internal energy, FKL represents a
potential energy, and the constant Z is known as the partition function. Here we are
concerned with a statistical interpretation of equation (9), and view it as defining a
posterior measure as a change of measure from a prior measure. In this context, the
Kullback-Leibler term DKL(· ‖ π) in (10) represents a penalization of deviations from
prior beliefs, the term FKL (ν; φ) penalizes deviations from the data, and the normalizing
constant Z represents the marginal likelihood. For brevity, we will henceforth suppress
the data y from the negative log-likelihood function φ, writing φ(u) instead of φ(u; y).
We remark that the fact that μ minimizes JKL follows immediately from the identity
DKL(· ‖ μ) = JKL(·) + log Z. (12)
Minimizing JKL(·) or DKL(· ‖ μ) is thus equivalent, but the functional JKL makes apparent
the roles of the prior and the likelihood.
The posterior μ also minimizes the functional
Jχ2(ν) := Dχ2(ν ‖ π) + Fχ2(ν; φ), ν ∈ P(M), (13)
where
Fχ2(ν; φ) := ∫M φ̃(u) dν(u), φ̃(u) := g(exp(φ(u))), g(t) := t − 1, t > 0. (14)
We refer to Ohta and Takatsu (2011) for details. Note that both JKL and Jχ2 are
defined in terms of the two starting points of the Bayesian update: the prior π and
the negative log-likelihood φ. The associated variational formulations suggest a way
to define posterior-type measures based on these two ingredients in scenarios where
establishing the existence of conditional distributions via disintegration of measures is
technically demanding. This appealing feature of the two variational formulations above
is not shared by the one described in the next subsection.
Let now the posterior μ be given, and consider the space L2 (M, μ) of functions defined
on M which are square integrable with respect to μ. Recall the Dirichlet energy
Dμ(f) = ∫M |∇g f(u)|² dμ(u),

introduced in equation (6). Now, since the measure μ can be characterized as the probability measure with density ρμ ≡ 1 a.s. with respect to μ, it follows that the posterior density ρμ ≡ 1 is the minimizer of the Dirichlet energy Dμ over probability densities ρ ∈ L2(M, μ) with ∫M ρ dμ = 1.
3.2 Geodesic Convexity
The next proposition can be found in von Renesse and Sturm (2005) and Sturm (2006).
It shows that the convexity of JKL can be determined by the so-called curvature-
dimension condition—a condition that involves the curvature of the manifold and the
Hessian of the combined change of measure Ψ + φ. We recall the notation π = e−Ψ volg
and μ ∝ e−φ π.
Proposition 3. Suppose that Ψ, φ ∈ C²(M). Then JKL (or DKL(· ‖ μ)) is λ-geodesically
convex if, and only if,
Ricg (v, v) + Hessg Ψ(v, v) + Hessg φ(v, v) ≥ λ, ∀x ∈ M, ∀v ∈ Tx M with g(v, v) = 1,
where Ricg denotes the Ricci curvature tensor.
We recall that the Ricci curvature provides a way to quantify the disagreement
between the geometry of a Riemannian manifold and that of ordinary Euclidean space.
The Ricci tensor is defined as the trace of a map involving the Riemannian curvature
(see do Carmo Valero (1992)).
The following example illustrates the geodesic convexity of DKL(· ‖ μ) for Gaussian μ.
Example 1. Let M = Rm with the Euclidean metric and let μ = N(0, Σ). Then Ricg ≡ 0 and the Hessian of the combined change of measure is Σ−1, so Proposition 3 shows that DKL(· ‖ μ) is λ-geodesically convex with λ = Λmin(Σ−1), the smallest eigenvalue of Σ−1.
The λ-convexity of JKL guarantees the existence of the gradient flow of JKL in Wasser-
stein space. Moreover, it determines the rate of convergence towards the posterior μ.
Precisely, if μ0 is absolutely continuous with respect to μ, and if λ > 0, then the gradient
flow t ∈ [0, ∞) → μt of JKL with respect to the Wasserstein metric starting at μ0 is well
defined and we have:
DKL(μt ‖ μ) ≤ e−λt DKL(μ0 ‖ μ), t ≥ 0,
(15)
W2(μt, μ) ≤ √( (2/λ) e−λt DKL(μ0 ‖ μ) ), t ≥ 0.
The second inequality, known as Talagrand inequality Villani (2003), establishes a com-
parison between Wasserstein geometry and information geometry. It can be established directly by combining the λ-geodesic convexity of JKL (for positive λ) with the first inequality. From (15) we see that a higher level of convexity of JKL guarantees a faster rate of convergence towards the posterior distribution μ.
We now turn to the geodesic convexity properties of Jχ2 . We recall that m denotes
the dimension of the manifold M. The following proposition can be found in (Ohta and
Takatsu, 2011, Theorem 4.1).
Proposition 4. Jχ2 is λ-geodesically convex if and only if both of the following two properties are satisfied:
1. Ricg(v, v) + Hessg Ψ(v, v) + (1/(m + 1)) ⟨∇g Ψ, v⟩² ≥ 0, ∀x ∈ M, ∀v ∈ Tx M.
2. φ is λ-geodesically convex as a real-valued function defined on M.
There are two main conclusions to extract from the previous proposition. First, condition 1) involves only the prior distribution π, whereas condition 2) involves only the likelihood; in particular, the convexity properties of Jχ2 can indeed be studied by analyzing the prior and the likelihood separately (notice that the proposition gives an equivalence). Second, condition 1) is a qualitative property: if it is not met, there is no hope that the functional Jχ2 has any level of global convexity, even when the likelihood function is highly convex. In addition, if 1) is satisfied, the convexity of φ completely determines the level of convexity of Jχ2. These features are markedly different from the ones observed in the Kullback-Leibler case.
As for the functional JKL , one can establish the following functional inequalities,
under the assumption of λ-geodesic convexity of Jχ2 for λ > 0:
Jχ2(μt) − Jχ2(μ) ≤ e−λt ( Jχ2(μ0) − Jχ2(μ) ), t ≥ 0,
(16)
W2(μt, μ)² ≤ (2/λ) e−λt ( Jχ2(μ0) − Jχ2(μ) ), t ≥ 0.
The above inequalities show that a higher level of convexity of Jχ2 guarantees faster convergence towards the posterior distribution μ.
We now study the geodesic convexity of the Dirichlet energy functional defined in equa-
tion (6). In what follows we denote by ‖ · ‖μ the L2 norm with respect to μ. Let us start by recalling the Poincaré inequality.
Definition 5. We say that a Borel probability measure μ on M has a Poincaré inequality with constant λ if for every f ∈ L2(M, μ) satisfying ∫M f dμ = 0 we have

‖f‖²μ ≤ (1/λ) Dμ(f).
We now show that Poincaré inequalities are directly related to the geodesic convexity
of the functional Dμ in the L2 (M, μ) space.
Proposition 6. Let λ be a positive real number and let μ be a Borel probability measure
on M. Then, the measure μ has a Poincaré inequality with constant λ if and only if
the functional Dμ is 2λ-geodesically convex in the space of functions f ∈ L2(M, μ) satisfying ∫M f dμ = 1.
Now, suppose that μ has a Poincaré inequality with constant λ and consider two functions f0, f1 ∈ L2(M, μ) satisfying ∫M f0 dμ = ∫M f1 dμ = 1. Since Dμ is a quadratic form, we have the identity

Dμ(tf0 + (1 − t)f1) + t(1 − t) Dμ(f0 − f1) = t Dμ(f0) + (1 − t) Dμ(f1), (17)

which combined with the Poincaré inequality (taking f := f0 − f1) gives:

Dμ(tf0 + (1 − t)f1) + λ t(1 − t) ‖f0 − f1‖²μ ≤ t Dμ(f0) + (1 − t) Dμ(f1), (18)

which is precisely the 2λ-geodesic convexity condition for Dμ.
Conversely, suppose that Dμ is 2λ-geodesically convex in the space of L2(M, μ) functions that integrate to one. Let f ∈ L2(M, μ) be such that ∫M f dμ = 0. Applying the convexity condition (18) to f0 := 1 + f and f1 := 1 − f, and using Dμ(1 ± f) = Dμ(f), yields

‖f‖²μ ≤ (1/λ) Dμ(f),

which is the desired Poincaré inequality.
Remark 7. When μ has a Poincaré inequality, the best constant is the spectral gap λ2, the smallest non-trivial eigenvalue of the weighted Laplacian

−Δμg f := −(1/Z) divg( e−φ−Ψ ∇g f ),
where divg and ∇g are the divergence and gradient operators in (M, g). This eigenvalue
can be written variationally as
Dμ (f )
λ2 := min ,
f ∈L 2 (M,μ) f − fμ 2μ
where
fμ := f dμ.
M
Remark 8. Spectral gaps are used in the theory of MCMC as a means to bound the
asymptotic variance of empirical expectations Kipnis and Varadhan (1986).
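The variational formula for λ2 can be approximated numerically. The following sketch, a one-dimensional finite-difference discretization with an illustrative Gaussian potential (for which the gap is known to equal one), is an assumption-laden caricature rather than a construction from the paper.

```python
# Hedged sketch: estimating the spectral gap lambda_2 of the weighted
# Laplacian by finite differences on an interval, for mu ~ exp(-V) on R.
# The potential V below is an illustrative choice (standard Gaussian,
# whose gap is 1); it is not a construction from the paper.
import numpy as np
from scipy.linalg import eigh

V = lambda x: x**2 / 2
L_dom, n = 6.0, 400
x = np.linspace(-L_dom, L_dom, n)
h = x[1] - x[0]
w = np.exp(-V((x[:-1] + x[1:]) / 2))      # edge weights e^{-V} at midpoints

# Stiffness matrix A encodes the Dirichlet energy and mass matrix M the
# L2(mu) norm, so A f = lambda M f discretizes the Rayleigh quotient
# D_mu(f) / ||f - mean||^2 of the variational formula above.
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i] += w[i] / h; A[i + 1, i + 1] += w[i] / h
    A[i, i + 1] -= w[i] / h; A[i + 1, i] -= w[i] / h
M = np.diag(np.exp(-V(x)) * h)

vals = eigh(A, M, eigvals_only=True)
print("lambda_2 ~", vals[1])              # vals[0] ~ 0 (constant functions)
```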
Let us now consider the flow t ∈ (0, ∞) → μt of Dμ in L2(M, μ) with initial condition dμ0/dμ = ρ0. It is well known that this flow coincides with that of the functional
JKL in Wasserstein space. However, taking the Dirichlet-L2 point of view, one can use
a Poincaré inequality (i.e. the geodesic convexity of Dμ ) to deduce the exponential
convergence of μt towards μ in the χ2 -sense. Indeed, let
ρt := dμt/dμ, t ∈ (0, ∞).

A standard computation then shows that

(1/2) (d/dt) ‖ρt − 1‖²μ = ∫M (∂ρt/∂t)(ρt − 1) dμ = −∫M |∇g(ρt(u) − 1)|² dμ(u) ≤ −λ ‖ρt − 1‖²μ,

so that, by Grönwall's inequality, Dχ2(μt ‖ μ) = ‖ρt − 1‖²μ ≤ e−2λt ‖ρ0 − 1‖²μ.
3.3 Fokker-Planck Equations and Associated Diffusions
It was shown in Jordan et al. (1998)—in the Euclidean setting and in the unweighted
case π = dx—that the gradient flow of the Kullback-Leibler functional DKL(· ‖ π) in
Wasserstein space produces a solution to the Fokker-Planck equation. More generally,
under the convexity conditions guaranteeing the existence of the gradient flow t ∈
(0, ∞) → μt of DKL(· ‖ μ) (equivalently of JKL) starting from μ0 ∈ P(M), the densities

ρt := dμt/dμ, θt := dμt/dvolg, t ∈ (0, ∞),

satisfy, respectively, the Fokker-Planck equations

∂ρ/∂t = Δμg ρ, (19)

∂θ/∂t = Δg θ + divg( θ (∇g φ + ∇g Ψ) ). (20)
Equation (20) can be identified as the evolution of the densities (w.r.t. dvolg) of the diffusion

dXt = −( ∇g Ψ(Xt) + ∇g φ(Xt) ) dt + √2 dBtg, (21)

where Bg denotes a Brownian motion on (M, g).
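For intuition, the following hedged sketch simulates (21) by Euler-Maruyama in the Euclidean setting, with illustrative choices of Ψ and φ for which the posterior is Gaussian.

```python
# Hedged sketch: Euler-Maruyama simulation of the Langevin diffusion (21)
# in the Euclidean setting.  Psi and phi are illustrative (Gaussian prior,
# one Gaussian observation), so the posterior here is N(1/2, 1/2).
import numpy as np

grad_Psi = lambda x: x                    # Psi(x) = x^2/2, prior N(0, 1)
grad_phi = lambda x: x - 1.0              # phi(x) = (x - 1)^2/2, datum y = 1

rng = np.random.default_rng(1)
dt, n_steps = 1e-2, 50_000
x, samples = 0.0, []
for k in range(n_steps):
    drift = -(grad_Psi(x) + grad_phi(x))
    x += dt * drift + np.sqrt(2 * dt) * rng.standard_normal()
    if k > 1_000:                         # discard burn-in
        samples.append(x)
print(np.mean(samples), np.var(samples))  # should approach 0.5 and 0.5
```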
Jχ2-Wasserstein. The evolution of the densities

ρ̃t := dμt/dπ

of the Jχ2-Wasserstein flow t ∈ (0, ∞) → μt is given by the (weighted) porous medium equation:
∂ρ̃/∂t = Δπg ρ̃² + divπg( ρ̃ ∇g φ ), (22)
where the weighted Laplacian and divergence are defined formally as

Δπg h := eΨ divg( e−Ψ ∇g h ), divπg(V) := eΨ divg( e−Ψ V ). (23)
Consider now the stochastic process {Xt}t≥0 formally defined as the solution to the nonlinear diffusion

dXt = −( ρ̃(t, Xt) ∇g Ψ(Xt) + ∇g φ(Xt) ) dt + √( 2 ρ̃(t, Xt) ) dBtg, X0 ∼ ρ0, (24)
where ρ̃ is the solution to (22). Let θt be the evolution of the densities (with respect
to dvolg ) of the above diffusion. Then a formal computation shows that θ satisfies the
Fokker-Planck equation:
∂θ/∂t = −divg( θ ( −ρ̃ ∇g Ψ − ∇g φ ) ) + Δg( ρ̃ θ ).
If we let β := eΨ θ we see, using (23), that

Δπg β² = eΨ ( Δg(β² e−Ψ) + divg( (β² e−Ψ) ∇g Ψ ) ),

divπg( β ∇g φ ) = eΨ divg( (β e−Ψ) ∇g φ ),
implying that the distributions of the stochastic process (24) are those generated by the
gradient flow of Jχ2 in Wasserstein space.
Remark 9. In contrast with the Langevin diffusion (21), the process (24) is defined
in terms of the solution of the equation satisfied by its densities. In particular, if one wanted to simulate (24), one would need to know the solution of (22) beforehand.
4 Choice of Metric in Riemannian MCMC
We use the canonical basis of Rm as a global chart and consider the canonical vector fields ∂/∂x1, . . . , ∂/∂xm. The Christoffel symbols associated to the Levi-Civita connection of the Riemannian manifold (Rm, g) can be written in terms of derivatives of the metric as

Γlij = (1/2) ( ∂Gki/∂xj + ∂Gkj/∂xi − ∂Gij/∂xk ) (G−1)lk, (25)
where on the right-hand side—and in what follows—we use Einstein's summation convention. The proof of the following result is given in the Supplementary Material.
Theorem 10. Let F ∈ C²(Rm) and

μ(du) ∝ exp(−F(u)) du.

The sharp constant λ for which JKL (or DKL(· ‖ μ)) is λ-geodesically convex in the g-Wasserstein distance is equal to

λG := inf_{x∈Rm} Λmin( G−1/2 (B + Hess F − C) G−1/2 ),

where Hess F is the usual (Euclidean) Hessian matrix of F, B is the matrix with coordinates

Bij := ∂Γlij/∂xl − Γkil Γljk, (26)

and C is the matrix with coordinates

Cij := Γlij ∂F/∂xl. (27)

Moreover, for any a > 0,

λaG = (1/a) λG. (28)
Note that λG is a key quantity in evaluating the quality of a metric G in building
geometry-informed Langevin diffusions for sampling purposes, as it gives the exponential
rate at which the evolution of probabilities built using the metric G converges towards
the posterior: larger λG corresponds to faster convergence. However, in order to establish
a fair performance comparison, the metrics need to be scaled appropriately. Indeed a
faster rate can be obtained by scaling down the metric (which can be thought of as
time-rescaling), as is clearly seen from the scaling property (28) of the functional λG.
It is important to note that scaling down the metric leads to a faster diffusion, but
also makes its discretization more expensive. Indeed the error of Euler discretizations
is largely influenced by the Lipschitz constant of the drift. This suggests, as a fair criterion for choosing the metric, to maximize λG subject to the constraint

Lip(∇g F) = Lip(G−1 ∇F) ≤ 1, (29)
since ∇g F = G−1 ∇F (where ∇ denotes the standard Euclidean gradient) is the drift
of the diffusion (21). Note that the constraint (29) ensures that the metric cannot be
scaled down arbitrarily while also guaranteeing that the discretizations do not become
increasingly expensive. We remark that other constraints involving higher regularity
requirements may be useful if higher order discretizations are desired.
Remark 11. The functional λG can be used to determine the optimal metric among a
certain subclass of metrics of interest satisfying the condition (29). For instance, it may
be of interest to find the optimal constant metric G (see Proposition 12 below), or to
find the best metric within a finite family of metrics. On the other hand the constraint
(29) forces feasible metrics to induce diffusions that are not expensive to discretize.
To illustrate the previous remark we show that for a Gaussian target measure the op-
timal preconditioner is, unsurprisingly, given by the Fisher information. More precisely
we have the following proposition:
Proposition 12. Let μ = N(0, Σ). Then the constant metric

G∗ := Σ−1 (30)

satisfies the constraint (29) and is optimal among constant metrics satisfying (29); moreover, λG∗ = 1.
Proof. Suppose for the sake of contradiction that there exists a constant metric G that satisfies condition (29)—which in this case reads ‖G−1 Σ−1‖ ≤ 1—and is such that λG > λG∗ = 1. Let u be a unit-norm eigenvector of G with eigenvalue λ > 0. By definition of λG we must have

⟨G−1 Σ−1 G−1/2 u, G1/2 u⟩ = ⟨G−1/2 Σ−1 G−1/2 u, u⟩ ≥ λG > 1. (31)

On the other hand,

⟨G−1 Σ−1 G−1/2 u, G1/2 u⟩ ≤ ‖G−1 Σ−1 G−1/2 u‖ · ‖G1/2 u⟩‖ ≤ ‖G−1/2 u‖ · ‖G1/2 u‖.

Since u is an eigenvector of G with eigenvalue λ, it is also an eigenvector of G1/2 with eigenvalue √λ and of G−1/2 with eigenvalue 1/√λ. Therefore the right-hand side of the above display is equal to one. This contradicts (31), and we deduce the optimality of G∗ among feasible metrics.
Example 2. To illustrate the previous remark further, let μ = N(0, Σ) with Σ := diag(1, ε) for small ε > 0, and compare the optimal metric G∗ = Σ−1 with the rescaled Euclidean metric Gε := ε−1 I. Both satisfy the constraint (29):

Lip(G∗−1 ∇F) = Lip(Gε−1 ∇F) = 1.

A calculation then shows that λG∗ = 1 while λGε = ε. Note that if the Euclidean metric is not rescaled by ε−1—violating the constraint (29)—then the same unit rate of convergence as with the metric G∗ is achieved. However, the drift of the associated diffusion

dXt = −Σ−1 Xt dt + √2 dBt

is of order ε−1, making the discretization increasingly expensive in the small-ε limit. On the other hand, the drifts of the diffusions associated to both G∗ and Gε are of order one. This motivates our choice of constraint in equation (29).
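A hedged simulation sketch of Proposition 12: with the optimal metric G∗ = Σ−1, the preconditioned diffusion reads dXt = −Xt dt + √2 Σ1/2 dBt in Euclidean coordinates, and its drift is well behaved uniformly in the anisotropy. The covariance Σ below is illustrative.

```python
# Hedged sketch of Proposition 12 in practice: for a Gaussian target
# N(0, Sigma), the diffusion with optimal constant metric G* = Sigma^{-1}
# reads dX = -X dt + sqrt(2) Sigma^{1/2} dB in Euclidean coordinates.
# Sigma below is an illustrative anisotropic covariance.
import numpy as np

rng = np.random.default_rng(2)
eps = 0.01
Sigma = np.diag([1.0, eps])               # anisotropic target
Sig_half = np.sqrt(Sigma)                 # diagonal: entrywise sqrt

dt, n_steps = 1e-2, 5000
x = np.array([3.0, 3.0])
for _ in range(n_steps):
    # drift -x is O(1) in both coordinates, uniformly in eps
    x = x - dt * x + np.sqrt(2 * dt) * (Sig_half @ rng.standard_normal(2))
print(x)   # after time n_steps * dt, approximately a draw from N(0, Sigma)
```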
5 Examples: Graph-Based Bayesian Semi-Supervised Learning
Let X = {x1, . . . , xn} be a point cloud with symmetric similarity matrix W, and define the graph Laplacian

L := D − W,

where D is the degree matrix of the weighted graph, i.e., the diagonal matrix with diagonal entries Dii = ∑nj=1 Wij. The above corresponds to the unnormalized graph
Laplacian, but different normalizations are possible Von Luxburg (2007). The graph-
Laplacian will be used in all the models below to favor prior draws of the latent variables
that are consistent with the geometry of the data cloud.
Remark 13. A special case of a weighted graph (X, W ) frequently found in the literature
is that in which the points in X are i.i.d. points sampled from some distribution on a
manifold M embedded in Rd , and the similarity matrix W is obtained as
Wij = K( |xi − xj| / r ),

for a kernel K and bandwidth r > 0.
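A minimal sketch of this construction and of the Laplacian L = D − W, assuming a Gaussian kernel K(t) = exp(−t²) (an illustrative choice) and a synthetic point cloud:

```python
# Hedged sketch of the construction in Remark 13 and of L = D - W, with a
# Gaussian kernel K(t) = exp(-t^2) as an illustrative choice of K, and a
# synthetic point cloud.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2))          # point cloud in R^2
r = 0.5                                    # kernel bandwidth

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # |x_i - x_j|^2
W = np.exp(-d2 / r**2)                     # W_ij = K(|x_i - x_j| / r)
np.fill_diagonal(W, 0.0)                   # no self-loops
D = np.diag(W.sum(axis=1))                 # degree matrix
L = D - W                                  # unnormalized graph Laplacian

print(np.linalg.eigvalsh(L)[:3])           # smallest eigenvalue ~ 0
```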
In the next two subsections we study the probit and logistic models. Here Z ⊆ {1, . . . , n} denotes the subset of indices at which labels are observed.
Prior distribution: Let U denote the set of functions on the point cloud X with average zero, and define the prior π := N(0, L−α) on U. The measure π is thus interpreted as a prior over functions on the point cloud X with average zero; larger values of α > 0 enforce stronger regularization of the functions u.
Likelihood function: For a fixed u ∈ U and for j ∈ Z define
yj = S(uj + ηj ),
where the ηj are i.i.d. N (0, γ 2 ), and S is the sign function. This specifies the distribution
of observed labels given the underlying latent variable u. We then define, for given data
y, the negative log-density function
φ(u; y) := −∑_{j∈Z} log H(yj uj; γ), u ∈ U, (34)
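A minimal sketch of (34), under the assumption (standard for probit, though the definition of H is not reproduced above) that H(t; γ) is the CDF of N(0, γ²); the values of u, y, and Z are illustrative.

```python
# Hedged sketch of the probit negative log-density (34).  We assume that
# H(t; gamma) is the CDF of N(0, gamma^2), i.e. H(t; gamma) = Phi(t / gamma),
# consistent with y_j = S(u_j + eta_j); this H is our assumption, as are
# the illustrative values of u, y, Z below.
import numpy as np
from scipy.stats import norm

def phi_probit(u, y, Z_idx, gamma):
    """phi(u; y) = -sum_{j in Z} log H(y_j u_j; gamma); logcdf avoids underflow."""
    return -np.sum(norm.logcdf(y[Z_idx] * u[Z_idx] / gamma))

u = np.array([0.3, -1.2, 0.8, 0.1])       # latent variable on four nodes
y = np.array([1.0, -1.0, 1.0, -1.0])      # observed labels
print(phi_probit(u, y, Z_idx=np.arange(2), gamma=0.5))  # labels at j = 0, 1
```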
From what has been discussed in the previous sections, the posterior μy can be
characterized as the unique minimizer of the energy
JKL(ν) := DKL(ν ‖ π) + ∫Rn φ(u; y) dν(u), ν ∈ P(Rn). (35)
Let us first consider the gradient flow of JKL with respect to the usual Wasserstein space
(i.e. the one induced by the Euclidean distance).
We can study the geodesic convexity of this functional by studying independently
the convexity properties of DKL(ν ‖ π) and of φ(·; y). Precisely:
i) Since π is a Gaussian measure with covariance L−α, Example 1 shows that DKL(ν ‖ π)
is (Λmin (L))α -geodesically convex in Wasserstein space, where Λmin (L) is the small-
est non-trivial eigenvalue of L.
ii) The function φ(·; y) is convex—see the appendix of Bertozzi et al. (2017). Hence, the functional FKL(ν) = ∫Rn φ(u; y) dν(u) is 0-geodesically convex in Wasserstein space.
It then follows from Proposition 3 that JKL is (Λmin (L))α -geodesically convex in Wasser-
stein space. As a consequence, if we consider t ∈ [0, ∞) → μt , the gradient flow of JKL
with respect to the Wasserstein distance starting at μ0 (an absolutely continuous mea-
sure with respect to μ), geometric inequalities can be immediately obtained from (15);
such inequalities will not deteriorate with n—see Remark 13.
However, the diffusion associated to this flow is given by

dXt = ( −Lα Xt − ∇φ(Xt; y) ) dt + √2 dBt, (36)

and in particular its drift (more precisely the term Lα Xt) deteriorates as n gets larger.
Notice that if we wanted to control the cost of discretization by rescaling the Euclidean
Notice that if we wanted to control the cost of discretization by rescaling the Euclidean
metric (as exhibited in Example 2), the geodesic convexity of the resulting flow would
vanish as n gets larger.
The previous discussion shows that the flow of JKL in the usual Wasserstein sense
does not produce a flow with good convergence properties that at the same time is
cheap to discretize (robustly in n). This motivates considering the gradient flow of JKL
with respect to the Wasserstein distance induced by a certain constant metric g. Indeed,
inspired by Proposition 12, let us consider the constant metric tensor
G := Lα .
Since the metric tensor is constant, in particular its induced volume form volg is pro-
portional to the Lebesgue measure and hence we can write
dμy(u) ∝ exp( −φ(u; y) − (1/2) ⟨Lα u, u⟩ ) dvolg(u), u ∈ U.
On the other hand, from the discussion in Section 3.3 we know that the densities of the
stochastic process
dXt = −∇g( φ(Xt; y) + (1/2) ⟨Lα Xt, Xt⟩ ) dt + √2 dBtg
correspond to the gradient flow of the energy JKL with respect to the Wasserstein
distance induced by the metric g, where Bg is a Brownian motion on (Rm, g). This
diffusion can be rewritten in terms of the standard Euclidean gradient ∇ and Brownian
motion B as

dXt = −( Xt + L−α ∇φ(Xt; y) ) dt + √2 L−α/2 dBt, (37)

after noticing that

∇g = G−1 ∇, Bg = G−1/2 B,
where for the second identity we have used the fact that G is constant. How convex is
the energy JKL with respect to the Wasserstein distance induced by g? Since the metric
tensor G is constant, it follows that

λG := inf_{x∈Rm} Λmin( G−1/2 Hess F G−1/2 ),

where F(u) := φ(u; y) + (1/2) ⟨Lα u, u⟩. Finally, due to the convexity of φ(u; y) we deduce that

λG ≥ Λmin( G−1/2 Lα G−1/2 ) = 1.
We notice that in (37) L appears as L−α . This is a fundamental difference from (36)
(where L appears as Lα ) with computational advantages, given that the eigenvalues of
L grow towards infinity.
Remark 14. A carefully designed discretization of (37) induces the so-called Langevin pCN proposal for MCMC computing (see Cotter et al. (2013)).
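The following is a plain Euler-Maruyama sketch of (37), not the careful discretization behind Langevin pCN; L, α = 1, and ∇φ are illustrative stand-ins.

```python
# Hedged sketch: a plain Euler-Maruyama discretization of (37).  The
# Langevin pCN proposal of Remark 14 uses a more careful (semi-implicit)
# discretization that we do not reproduce; L, alpha = 1, and grad_phi
# below are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(4)
n = 50
L = np.diag(np.arange(1.0, n + 1))        # stand-in spectrum of a graph Laplacian
L_inv = np.linalg.inv(L)                  # L^{-alpha} with alpha = 1
L_inv_half = np.sqrt(L_inv)               # diagonal matrix: entrywise sqrt

grad_phi = lambda x: np.tanh(x)           # placeholder for grad of phi(.; y)

dt, x = 1e-2, np.zeros(n)
for _ in range(10_000):
    drift = -(x + L_inv @ grad_phi(x))                        # cf. (37)
    x += dt * drift + np.sqrt(2 * dt) * (L_inv_half @ rng.standard_normal(n))
# Note how L enters only through L^{-alpha}: high graph frequencies are
# damped, so the drift stays O(1) even as the eigenvalues of L grow.
```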
Remark 15. In the above we have considered a probit model for the likelihood function.
The ideas generalize straightforwardly to other settings, notably the logistic model
φ(u; y) := −∑_{j∈Z} log σ(yj uj; γ), u ∈ U, (38)

where

σ(t; γ) := 1 / (1 + e−t/γ).
The convexity of φ for the logistic model (38) can be established by direct computation of the second derivative of − log σ.
Likelihood function: For j ∈ Z, the observed labels are given by

yj = uj + ηj, ηj ∼ N(0, γ²).
Posterior distribution: Combining the prior and the likelihood via Bayes’ formula
gives the posterior distribution
dμy/dπ(u) ∝ exp( −φ(u; y) ); dμy/du(u) ∝ exp( −Ψ(u) − φ(u; y) ), u ∈ U,
where Ψ is given by (39), and φ is given by (40).
For this model, the negative prior log-density Ψ is not convex, and Wasserstein λ-
geodesic convexity of the functional DKL(· ‖ π) only holds for negative λ. In particular, it
is not possible to deduce exponential decay taking the Wasserstein flow point of view.
However, in the L2 /Dirichlet energy setting we can still show exponential convergence
towards the posterior μy. Indeed, because the negative log-density of μy satisfies

|∇Ψ(u) + ∇φ(u; y)|² / 2 − ΔΨ(u) − Δφ(u; y) → ∞, as |u| → ∞,
there exists some λ > 0 for which μy has a Poincaré inequality with constant λ (see
Chapter 4.5 in Pavliotis (2014)). In this example we can say more, and in particular we
are able to find a Poincaré constant that depends explicitly on ε, the smallest non-trivial
eigenvalue of L, and k := |Z|.
Let ψ(u) := ∑_{j∈Z} Wε(uj) and let ψc be its convex envelope, i.e. let ψc be the largest convex function that is below ψ. It is straightforward to show that ψc(0) = 0 and that

inf_{u∈Rn} exp( ψc(u) − ψ(u) ) = exp( −k/(4ε) ).
Consider now the probability measure μc with Lebesgue density

dμc(u) = (1/Zc) exp( −(1/2) ⟨Lα u, u⟩ − φ(u; y) − ψc(u) ) du,
and define λ2 and λ2,c as in Remark 7, using μy and μc instead of μ. For any given f ∈ L2(μ) we then have

∫|∇f|² dμ / ∫|f − fμ|² dμ ≥ ∫|∇f|² dμ / ∫|f − fμc|² dμ ≥ exp(−k/(4ε)) · ∫|∇f|² dμc / ∫|f − fμc|² dμc,

where the first inequality follows from the fact that fμ = argmin_{a∈R} ∫|f − a|² dμ, and the second follows directly from the fact that 0 ≥ ψc − ψ ≥ −k/(4ε). It follows that

λ2 ≥ exp(−k/(4ε)) λ2,c ≥ exp(−k/(4ε)) (Λmin(L))α,
where the last inequality follows from the fact that the negative log-density of μc satisfies the Bakry-Emery condition with constant (Λmin(L))α (see Chapter 4.5 in Pavliotis (2014)). Clearly, this lower bound on the Poincaré constant degrades quickly for small ε or for large k (the number of labeled data points). We also notice that the cost of discretization of the diffusion associated to this flow increases with n (as in Section 5.1).
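The following one-dimensional caricature illustrates this numerically, using the standard unitary equivalence between the generator of the Dirichlet flow and a Schrödinger operator (see Pavliotis (2014)); the double-well potential is an illustrative stand-in for the non-convex prior term.

```python
# One-dimensional caricature of subsection 5.2, illustrative only: for
# mu ~ exp(-V) on R, the generator of the Dirichlet flow is unitarily
# equivalent to the Schrodinger operator -d^2/dx^2 + (V'^2/4 - V''/2),
# whose gap above the ground state equals lambda_2 (see Pavliotis (2014)).
# The double-well V plays the role of the non-convex prior term.
import numpy as np

for a in [1.0, 1.5, 2.0]:                 # separation of the two wells
    dV = lambda x: x * (x**2 - a**2)      # V'(x) for V(x) = (x^2 - a^2)^2 / 4
    ddV = lambda x: 3 * x**2 - a**2       # V''(x)
    xs = np.linspace(-a - 4, a + 4, 800)
    h = xs[1] - xs[0]
    W = dV(xs)**2 / 4 - ddV(xs) / 2       # Schrodinger potential
    H = (2 * np.eye(xs.size) - np.eye(xs.size, k=1)
         - np.eye(xs.size, k=-1)) / h**2 + np.diag(W)
    ev = np.linalg.eigvalsh(H)
    print(a, ev[1] - ev[0])               # positive gap that shrinks rapidly
```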
Remark 16. A similar analysis can be carried out using the constant metric G := Lα and the Dirichlet energy defined with respect to L2(μ). How convex is this functional? For every f ∈ U we have

∫|∇g f|²g dμ / ∫|f − fμ|² dμ ≥ ∫|∇g f|²g dμ / ∫|f − fμc|² dμ ≥ exp(−k/(4ε)) · ∫|∇g f|²g dμc / ∫|f − fμc|² dμc,
• The variational framework considered in this paper provides a natural setting for the study of robustness of Bayesian models, and for the analysis of convergence of discrete to continuum Bayesian models. Indeed, the authors have recently established in Garcia Trillos and Sanz-Alonso (2018a), Garcia Trillos et al. (2017b) the consistency of Bayesian semi-supervised learning in the regime of a fixed number of labeled data points and a growing number of unlabeled data points. The analysis relies on the variational formulation based on Kullback-Leibler prior penalization in equation (35).
Supplementary Material
Supplementary Material to “The Bayesian Update: Variational Formulations and Gra-
dient Flows” (DOI: 10.1214/18-BA1137SUPP; .pdf).
References
Ambrosio, L., Gigli, N., and Savaré, G. (2008). Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media. MR2401600.
Attias, H. (1999). "A Variational Bayesian Framework for Graphical Models." In NIPS, volume 12.
Bertozzi, A. L., Luo, X., Stuart, A. M., and Zygalakis, K. C. (2017). "Uncertainty quantification in the classification of high dimensional data." MR3794350. doi: https://doi.org/10.1137/17M1134214.
Besag, J. E. (1994). "Comments on 'Representations of knowledge in complex systems' by U. Grenander and M. I. Miller." Journal of the Royal Statistical Society: Series B, 56: 591–592. MR1293234.
Burago, D., Burago, Y., and Ivanov, S. (2001). A Course in Metric Geometry, volume 33 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI. MR1835418. doi: https://doi.org/10.1090/gsm/033.
Burago, D., Ivanov, S., and Kurylev, Y. (2013). "A graph discretization of the Laplace-Beltrami operator." arXiv preprint arXiv:1301.2222. MR3299811. doi: https://doi.org/10.4171/JST/83.
Cotter, S. L., Roberts, G. O., Stuart, A. M., and White, D. (2013). "MCMC methods for functions: modifying old algorithms to make them faster." Statistical Science, 28(3): 424–446. MR3135540. doi: https://doi.org/10.1214/13-STS421.
do Carmo Valero, M. P. (1992). Riemannian Geometry. MR1138207. doi: https://doi.org/10.1007/978-1-4757-2201-7.
Dupuis, P. and Ellis, R. S. (2011). A Weak Convergence Approach to the Theory of Large Deviations, volume 902. John Wiley & Sons. MR1431744. doi: https://doi.org/10.1002/9781118165904.
Fox, C. W. and Roberts, S. J. (2012). "A tutorial on variational Bayesian inference." Artificial Intelligence Review, 38(2): 85–95.
Garcia Trillos, N., Gerlach, M., Hein, M., and Slepcev, D. (2017a). "Spectral convergence of empirical graph Laplacians." In preparation.
Garcia Trillos, N., Kaplan, Z., Samakhoana, T., and Sanz-Alonso, D. (2017b). "On the consistency of graph-based Bayesian learning and the scalability of sampling algorithms." arXiv preprint arXiv:1710.07702.
Sturm, K. T. (2006). "On the geometry of metric measure spaces." Acta Mathematica, 196(1): 65–131. MR2237206. doi: https://doi.org/10.1007/s11511-006-0002-8.
Teschl, G. (2012). Ordinary Differential Equations and Dynamical Systems, volume 140. American Mathematical Society. MR2961944. doi: https://doi.org/10.1090/gsm/140.
Villani, C. (2003). Topics in Optimal Transportation, volume 58. American Mathematical Society. MR1964483. doi: https://doi.org/10.1007/b12016.
Villani, C. (2008). Optimal Transport: Old and New, volume 338. Springer Science & Business Media. MR2459454. doi: https://doi.org/10.1007/978-3-540-71050-9.
Von Luxburg, U. (2007). "A tutorial on spectral clustering." Statistics and Computing, 17(4): 395–416. MR2409803. doi: https://doi.org/10.1007/s11222-007-9033-z.
von Renesse, M. K. and Sturm, K. T. (2005). "Transport inequalities, gradient estimates, entropy and Ricci curvature." Communications on Pure and Applied Mathematics, 58(7): 923–940. MR2142879. doi: https://doi.org/10.1002/cpa.20060.
Wainwright, M. J. and Jordan, M. I. (2008). "Graphical models, exponential families, and variational inference." Foundations and Trends in Machine Learning, 1(1–2): 1–305.
Zellner, A. (1988). "Optimal information processing and Bayes's theorem." The American Statistician, 42(4): 278–280. MR0971095. doi: https://doi.org/10.2307/2685143.
Acknowledgments
We are thankful to Matías Delgadino for pointing us to the reference Ohta and Takatsu (2011) and to Sayan Mukherjee for the reference Zellner (1988). We thank the editor and referees for their help in improving the readability of our manuscript.