
Article

Divergences Induced by the Cumulant and Partition Functions of Exponential Families and Their Deformations Induced by Comparative Convexity

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2024, 26(3), 193; https://doi.org/10.3390/e26030193
Submission received: 20 December 2023 / Revised: 21 February 2024 / Accepted: 21 February 2024 / Published: 23 February 2024

Abstract

Exponential families are statistical models which are the workhorses in statistics, information theory, and machine learning, among others. An exponential family can either be normalized subtractively by its cumulant or free energy function, or equivalently normalized divisively by its partition function. Both the cumulant and partition functions are strictly convex and smooth functions inducing corresponding pairs of Bregman and Jensen divergences. It is well known that skewed Bhattacharyya distances between the probability densities of an exponential family amount to skewed Jensen divergences induced by the cumulant function between their corresponding natural parameters, and that in limit cases the sided Kullback–Leibler divergences amount to reverse-sided Bregman divergences. In this work, we first show that the α -divergences between non-normalized densities of an exponential family amount to scaled α -skewed Jensen divergences induced by the partition function. We then show how comparative convexity with respect to a pair of quasi-arithmetical means allows both convex functions and their arguments to be deformed, thereby defining dually flat spaces with corresponding divergences when ordinary convexity is preserved.

1. Introduction

In information geometry [1], any strictly convex and smooth function induces a dually flat space (DFS) with a canonical divergence which can be expressed in charts either as dual Bregman divergences [2] or equivalently as dual Fenchel–Young divergences [3]. For example, the cumulant function of an exponential family [4] (also called the free energy) generates a DFS, that is, an exponential family manifold [5] with the canonical divergence yielding the reverse Kullback–Leibler divergence. Another typical example of a strictly convex and smooth function generating a DFS is the negative entropy of a mixture family, that is, a mixture family manifold with the canonical divergence yielding the (forward) Kullback–Leibler divergence [3]. In addition, any strictly convex and smooth function induces a family of scaled skewed Jensen divergences [6,7], which in limit cases includes the sided forward and reverse Bregman divergences.
In Section 2, we present two equivalent approaches to normalizing an exponential family: first by its cumulant function, and second by its partition function. Because both the cumulant and partition functions are strictly convex and smooth, they induce corresponding families of scaled skewed Jensen divergences and Bregman divergences, with corresponding dually flat spaces and related statistical divergences.
In Section 3, we recall the well-known result that the statistical α-skewed Bhattacharyya distances between the probability densities of an exponential family amount to scaled α-skewed Jensen divergences between their natural parameters. In Section 4, we prove that the α-divergences [8] between the unnormalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function between their natural parameters (Proposition 5). More generally, we explain in Section 5 how to deform a convex function using comparative convexity [9]: when the ordinary convexity of the deformed function is preserved, we obtain new skewed Jensen divergences and Bregman divergences with corresponding dually flat spaces. Finally, Section 6 concludes this work with a discussion.

2. Dual Subtractive and Divisive Normalizations of Exponential Families

2.1. Natural Exponential Families

Let ( X , A , μ ) be a measure space [10], where X denotes the sample set (e.g., finite alphabet, N , R d , space of positive-definite matrices Sym + + ( d ) , etc.), A a σ -algebra on X (e.g., power set 2 X , Borel σ -algebra B ( X ) , etc.), and μ a positive measure (e.g., counting measure or Lebesgue measure) on the measurable space ( X , A ) .
A natural exponential family [4,11] (commonly abbreviated as NEF [12]) is a set of probability distributions P = { P θ : θ Θ } all dominated by μ such that their Radon–Nikodym densities p θ ( x ) = d P θ d μ ( x ) can be expressed canonically as
$$p_\theta(x) \propto \tilde{p}_\theta(x) = \exp\left(\sum_{i=1}^m \theta_i x_i\right),$$
where θ is called the natural parameter and x = ( x 1 , , x m ) denotes the linear sufficient statistic vector [11]. The order of the NEF [13] is m. When the parameter θ ranges in the full natural parameter space
$$\Theta = \left\{ \theta : \int_{\mathcal{X}} \tilde{p}_\theta(x)\, \mathrm{d}\mu(x) < \infty \right\} \subseteq \mathbb{R}^m,$$
the family is called full. The NEF is said to be regular when Θ is topologically open.
The unnormalized positive density $\tilde{p}_\theta(x)$ is indicated with a tilde notation, and the corresponding normalized probability density is obtained as $p_\theta(x) = \frac{1}{Z(\theta)}\, \tilde{p}_\theta(x)$, where $Z(\theta) = \int \tilde{p}_\theta(x)\, \mathrm{d}\mu(x)$ is the Laplace transform of $\mu$ (the density normalizer). For example, the family of exponential distributions $\mathcal{E} = \{\lambda e^{-\lambda x} : \lambda > 0\}$ is an NEF with densities defined on the support $\mathcal{X} = \mathbb{R}_{\geq 0} = \{x \in \mathbb{R} : x \geq 0\}$, natural parameter $\theta = -\lambda$ in $\Theta = \mathbb{R}_{<0} = \{\theta \in \mathbb{R} : \theta < 0\}$, linear sufficient statistic $x$, and normalizer $Z(\theta) = -\frac{1}{\theta}$.
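As a quick numerical sanity check (a minimal Python sketch, assuming NumPy and an arbitrary truncation of the half-line; the function name is ours, not from the paper), the normalizer of this NEF can be compared against its closed form $Z(\theta) = -1/\theta$:

```python
import numpy as np

def partition_exponential_nef(theta, x_max=200.0, n=200_000):
    # Numerically integrate the unnormalized density exp(theta * x) over [0, infinity),
    # truncated at x_max, for a natural parameter theta < 0.
    x = np.linspace(0.0, x_max, n)
    return np.trapz(np.exp(theta * x), x)

theta = -1.5
print(partition_exponential_nef(theta))  # ~0.6667 (numerical quadrature)
print(-1.0 / theta)                      # 0.6667 (closed form -1/theta)
```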

2.2. Exponential Families

More generally, exponential families include many well known distributions after reparameterization [4] of their ordinary parameter λ by θ ( λ ) . The general canonical form of the densities of an exponential family is
$$p_\lambda(x) \propto \tilde{p}_\lambda(x) = \exp\left(\langle \theta(\lambda), t(x)\rangle\right) h(x),$$
where $t(x) = (t_1(x), \ldots, t_m(x))$ is the sufficient statistic vector (such that $1, t_1(x), \ldots, t_m(x)$ are linearly independent), $h(x)$ is an auxiliary term used to define the base measure with respect to $\mu$, and $\langle \cdot, \cdot\rangle$ is an inner product (e.g., the scalar product of $\mathbb{R}^m$, the trace product of symmetric matrices, etc.). By defining a new measure $\nu$ such that $\frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x) = h(x)$, we may consider without loss of generality the densities $\bar{p}_\lambda(x) = \frac{\mathrm{d}P_\lambda}{\mathrm{d}\nu}(x)$ with $h(x) = 1$.
For example, the Bernoulli distributions, Gaussian or normal distributions, Gamma and Beta distributions, Poisson distributions, Rayleigh distributions, and Weibull distributions with prescribed shape parameter are just a few examples of exponential families with the inner product on R m defined as the scalar product. The categorical distributions (i.e., discrete distributions on a finite alphabet sample space) form an exponential family as well [1]. Zero-centered Gaussian distributions and Wishart distributions are examples of exponential families parameterized by positive-definite matrices with inner products defined by the matrix trace product, which is A , B = tr ( A B ) .
Exponential families abound in statistics and machine learning. Any two probability measures $Q$ and $R$ with densities $q$ and $r$ with respect to a dominating measure, say, $\mu = \frac{Q+R}{2}$, define an exponential family
$$\mathcal{P}_{Q,R} = \left\{ p_\lambda(x) \propto q^\lambda(x)\, r^{1-\lambda}(x) : \lambda \in (0,1) \right\},$$
which is called the likelihood ratio exponential family [14], as the sufficient statistic is t ( x ) = log q ( x ) r ( x ) (with auxiliary carrier term h ( x ) = r ( x ) ), or the Bhattacharyya arc, as the cumulant function of P Q , R is expressed as the negative of the skewed Bhattacharyya distances [7,15].
In machine learning, undirected graphical models [16] and energy-based models [17], including Markov random fields [18] and conditional random fields, are exponential families [19]. Exponential families are universal approximators of smooth densities [20].
From a theoretical standpoint, it is often enough to consider (without loss of generality) natural exponential families with densities expressed as in Equation (1). However, here we consider generic exponential families with the densities expressed in Equation (2) in order to report common examples encountered in practice, such as the multivariate Gaussian family [21].
When the natural parameter is not free to range over the full space $\Theta$ but is instead parameterized as $\lambda = c(u)$ for $u \in U$ with $\dim(U) < m$ and a smooth function $c(u)$, the exponential family is called a curved exponential family [1]. For example, the family of normal distributions $\{p_{\mu,\sigma^2} : \sigma^2 = \mu^2,\ \mu \in \mathbb{R}\}$ is a curved exponential family with $u = \mu$ and $c(u) = (u, u^2)$ [1].

2.3. Normalizations of Exponential Families

Recall that p ˜ θ ( x ) = exp θ , t ( x ) h ( x ) denotes the unnormalized density expressed using the natural parameter θ = θ ( λ ) . We can normalize p ˜ θ ( x ) using either the partition function Z ( θ ) or equivalently using the cumulant function F ( θ ) , as follows:
$$p_\theta(x) = \frac{\exp(\langle \theta, t(x)\rangle)}{Z(\theta)}\, h(x),$$
$$\phantom{p_\theta(x)} = \exp\left(\langle \theta, t(x)\rangle - F(\theta) + k(x)\right),$$
where $h(x) = \exp(k(x))$, $Z(\theta) = \int \tilde{p}_\theta(x)\, \mathrm{d}\mu(x)$, and $F(\theta) = \log Z(\theta) = \log \int \tilde{p}_\theta(x)\, \mathrm{d}\mu(x)$. Thus, the logarithm and exponential functions allow conversion to and from the dual normalizers $Z$ and $F$:
$$Z(\theta) = \exp(F(\theta)) \iff F(\theta) = \log Z(\theta).$$
We may view Equation (3) as an exponential tilting [13] of density h ( x ) d μ ( x ) .
In the context of λ-deformed exponential families [22], which generalize exponential families, the function $Z(\theta)$ is called the divisive normalization factor (Equation (3)) and the function $F(\theta)$ is called the subtractive normalization factor (Equation (4)). Notice that $F(\theta)$ is called the cumulant function because when $X \sim p_\theta(x)$ is a random variable following a probability distribution of an exponential family, the function $F(\theta)$ appears in the cumulant generating function of $X$: $K_X(t) = \log E_X[e^{\langle t, X\rangle}] = F(\theta + t) - F(\theta)$. In statistical physics, the cumulant function is called the log-normalizer or log-partition function. Because $Z > 0$ and $F = \log Z$, we can deduce that $F \leq Z$, as $\log x \leq x$ for $x > 0$.
It is well known that the cumulant function F ( θ ) is a strictly convex function and that the partition function Z ( θ ) is strictly log-convex [11].
Proposition 1
([11]). The natural parameter space Θ of an exponential family is convex.
Proposition 2
([11]). The cumulant function F ( θ ) is strictly convex and the partition function Z ( θ ) is positive and strictly log-convex.
It can be shown that the cumulant and partition functions are smooth $C^\infty$ real-analytic functions [4]. A remarkable property is that strictly log-convex functions are also strictly convex.
Proposition 3
([23], Section 3.5). A strictly log-convex function $Z : \Theta \subseteq \mathbb{R}^m \to \mathbb{R}$ is strictly convex.
The converse of Proposition 3 is not necessarily true, however; certain convex functions are not log-convex, and as such the class of strictly log-convex functions is a proper subclass of strictly convex functions. For example, $\theta^2$ is strictly convex but not log-convex, as $(\log \theta^2)'' = -\frac{2}{\theta^2} < 0$ (Figure 1).
Remark 1.
Because Z = exp ( F ) is strictly convex (Proposition 3), F is exponentially convex.
Definition 1.
The cumulant function F and partition function Z of a regular exponential family are both strictly convex and smooth functions inducing a pair of dually flat spaces with corresponding Bregman divergences [2] B F (i.e., B log Z ) and B Z (i.e., B exp F ):
$$B_Z(\theta_1 : \theta_2) = Z(\theta_1) - Z(\theta_2) - \langle \theta_1 - \theta_2, \nabla Z(\theta_2)\rangle \geq 0,$$
$$B_{\log Z}(\theta_1 : \theta_2) = \log \frac{Z(\theta_1)}{Z(\theta_2)} - \left\langle \theta_1 - \theta_2, \frac{\nabla Z(\theta_2)}{Z(\theta_2)}\right\rangle \geq 0,$$
along with a pair of families of skewed Jensen divergences J F , α and J Z , α :
$$J_{Z,\alpha}(\theta_1 : \theta_2) = \alpha Z(\theta_1) + (1-\alpha) Z(\theta_2) - Z(\alpha\theta_1 + (1-\alpha)\theta_2) \geq 0,$$
$$J_{\log Z,\alpha}(\theta_1 : \theta_2) = \log \frac{Z(\theta_1)^\alpha\, Z(\theta_2)^{1-\alpha}}{Z(\alpha\theta_1 + (1-\alpha)\theta_2)} \geq 0.$$
For a strictly convex function F ( θ ) , we define the symmetric Jensen divergence as follows:
$$J_F(\theta_1, \theta_2) = J_{F,\frac{1}{2}}(\theta_1 : \theta_2) = \frac{F(\theta_1) + F(\theta_2)}{2} - F\left(\frac{\theta_1 + \theta_2}{2}\right).$$
Let $\mathcal{B}_\Theta$ denote the set of real-valued strictly convex and differentiable functions defined on an open set $\Theta$, called Bregman generators. We may equivalently consider the set of strictly concave and differentiable functions $G(\theta)$ and let $F(\theta) = -G(\theta)$; see [24] (Equation (1)).
Remark 2.
The non-negativeness of the Bregman divergences induced by the cumulant and partition functions defines the criteria for checking the strict convexity or strict log-convexity of a $C^1$ function:
$$F(\theta)\ \text{is strictly convex} \iff \forall \theta_1 \neq \theta_2,\ B_F(\theta_1 : \theta_2) > 0 \iff \forall \theta_1 \neq \theta_2,\ F(\theta_1) > F(\theta_2) + \langle \theta_1 - \theta_2, \nabla F(\theta_2)\rangle,$$
and
$$Z(\theta)\ \text{is strictly log-convex} \iff \forall \theta_1 \neq \theta_2,\ B_{\log Z}(\theta_1 : \theta_2) > 0 \iff \forall \theta_1 \neq \theta_2,\ \log Z(\theta_1) > \log Z(\theta_2) + \left\langle \theta_1 - \theta_2, \frac{\nabla Z(\theta_2)}{Z(\theta_2)}\right\rangle.$$
The forward Bregman divergence B F ( θ 1 : θ 2 ) and reverse Bregman divergence B F ( θ 2 : θ 1 ) can be unified with the α -skewed Jensen divergences by rescaling J F , α and allowing α to range in R [6,7]:
$$J^s_{F,\alpha}(\theta_1 : \theta_2) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\, J_{F,\alpha}(\theta_1 : \theta_2), & \alpha \in \mathbb{R}\setminus\{0,1\},\\ B_F(\theta_1 : \theta_2), & \alpha = 0,\\ 4\, J_F(\theta_1, \theta_2), & \alpha = \frac{1}{2},\\ B_F^*(\theta_1 : \theta_2) = B_F(\theta_2 : \theta_1), & \alpha = 1, \end{cases}$$
where B F * denotes the reverse Bregman divergence obtained by swapping the parameter order (reference duality [6]): B F * ( θ 1 : θ 2 ) = B F ( θ 2 : θ 1 ) .
Remark 3.
Alternatively, we may rescale $J_{F,\alpha}$ by the factor $\kappa(\alpha) = \frac{1}{4\alpha(1-\alpha)}$, i.e., $\bar{J}^s_{F,\alpha}(\theta_1 : \theta_2) = \kappa(\alpha)\, J_{F,\alpha}(\theta_1 : \theta_2)$, such that $\kappa(\frac{1}{2}) = 1$ and $\bar{J}^s_{F,\frac{1}{2}}(\theta_1 : \theta_2) = J_F(\theta_1, \theta_2)$.
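The unified family $J^s_{F,\alpha}$ is straightforward to implement numerically. The following minimal Python sketch (our own illustration; the toy generator $F(\theta) = e^\theta$ and the parameter values are arbitrary choices) returns the sided Bregman divergences at $\alpha \in \{0, 1\}$ and the scaled skewed Jensen divergence otherwise:

```python
import numpy as np

def bregman(F, gradF, t1, t2):
    # B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, gradF(t2)>
    return F(t1) - F(t2) - np.dot(t1 - t2, gradF(t2))

def scaled_skewed_jensen(F, gradF, t1, t2, alpha):
    # J^s_{F,alpha}: the sided Bregman divergences are recovered at alpha = 0 and alpha = 1.
    if alpha == 0.0:
        return bregman(F, gradF, t1, t2)   # forward Bregman divergence
    if alpha == 1.0:
        return bregman(F, gradF, t2, t1)   # reverse Bregman divergence
    J = alpha * F(t1) + (1 - alpha) * F(t2) - F(alpha * t1 + (1 - alpha) * t2)
    return J / (alpha * (1 - alpha))

F, gradF = np.exp, np.exp   # toy strictly convex generator F(theta) = exp(theta)
t1, t2 = 0.3, 1.1
for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(a, scaled_skewed_jensen(F, gradF, t1, t2, a))
```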
Next, in Section 3 we first recall the connections between these Jensen and Bregman divergences, which are divergences between parameters, and the statistical divergence counterparts between probability densities. Then, in Section 4 we introduce the novel connections between these parameter divergences and α -divergences between unnormalized densities.

3. Divergences Related to the Cumulant Function

Consider the scaled α -skewed Bhattacharyya distances [7,15] between two probability densities p ( x ) and q ( x ) :
$$D^s_{B,\alpha}(p : q) = -\frac{1}{\alpha(1-\alpha)} \log \int p^\alpha q^{1-\alpha}\, \mathrm{d}\mu, \quad \alpha \in \mathbb{R}\setminus\{0,1\}.$$
The scaled α-skewed Bhattacharyya distances can additionally be interpreted as Rényi divergences [25] scaled by $\frac{1}{\alpha}$: $D^s_{B,\alpha}(p:q) = \frac{1}{\alpha} D_{R,\alpha}(p:q)$, where the Rényi α-divergences are defined by
$$D_{R,\alpha}(p : q) = \frac{1}{\alpha - 1} \log \int p^\alpha q^{1-\alpha}\, \mathrm{d}\mu.$$
The Bhattacharyya distance $D_B(p, q) = -\log \int \sqrt{p\, q}\, \mathrm{d}\mu$ corresponds to one-fourth of $D^s_{B,\frac{1}{2}}(p : q)$: $D_B(p, q) = \frac{1}{4} D^s_{B,\frac{1}{2}}(p : q)$. Because $D^s_{B,\alpha}$ tends to the Kullback–Leibler divergence $D_{\mathrm{KL}}$ when $\alpha \to 1$ and to the reverse Kullback–Leibler divergence $D_{\mathrm{KL}}^*$ when $\alpha \to 0$, we have
$$D^s_{B,\alpha}(p : q) = \begin{cases} -\frac{1}{\alpha(1-\alpha)} \log \int p^\alpha q^{1-\alpha}\, \mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ D_{\mathrm{KL}}(p : q), & \alpha = 1,\\ 4\, D_B(p, q), & \alpha = \frac{1}{2},\\ D_{\mathrm{KL}}^*(p : q) = D_{\mathrm{KL}}(q : p), & \alpha = 0. \end{cases}$$
When both probability densities belong to the same exponential family E = { p θ ( x ) : θ Θ } with cumulant F ( θ ) , we have the following proposition.
Proposition 4
([7]). The scaled α-skewed Bhattacharyya distances between two probability densities p θ 1 and p θ 2 of an exponential family amount to the scaled α-skewed Jensen divergence between their natural parameters:
$$D^s_{B,\alpha}(p_{\theta_1} : p_{\theta_2}) = J^s_{F,\alpha}(\theta_1 : \theta_2).$$
Proof. 
The proof follows by first considering the α-skewed Bhattacharyya similarity coefficient $\rho_\alpha(p, q) = \int p^\alpha q^{1-\alpha}\, \mathrm{d}\mu$:
$$\rho_\alpha(p_{\theta_1} : p_{\theta_2}) = \int \exp\left(\langle\theta_1, x\rangle - F(\theta_1)\right)^\alpha \exp\left(\langle\theta_2, x\rangle - F(\theta_2)\right)^{1-\alpha} \mathrm{d}\mu = \int \exp\left(\langle\alpha\theta_1 + (1-\alpha)\theta_2, x\rangle\right) \exp\left(-(\alpha F(\theta_1) + (1-\alpha) F(\theta_2))\right) \mathrm{d}\mu.$$
Multiplying the last equation by $\frac{\exp(F(\alpha\theta_1 + (1-\alpha)\theta_2))}{\exp(F(\alpha\theta_1 + (1-\alpha)\theta_2))} = \exp(0) = 1$ with $\bar\theta = \alpha\theta_1 + (1-\alpha)\theta_2$, we obtain
$$\rho_\alpha(p_{\theta_1} : p_{\theta_2}) = \exp\left(-\left(\alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\bar\theta)\right)\right) \int \exp\left(\langle\bar\theta, x\rangle - F(\bar\theta)\right) \mathrm{d}\mu.$$
Because $\bar\theta \in \Theta$, we have $\int \exp(\langle\bar\theta, x\rangle - F(\bar\theta))\, \mathrm{d}\mu = 1$; therefore, we obtain
$$\rho_\alpha(p_{\theta_1} : p_{\theta_2}) = \exp\left(-J_{F,\alpha}(\theta_1 : \theta_2)\right).$$
Taking $-\frac{1}{\alpha(1-\alpha)} \log$ on both sides yields $D^s_{B,\alpha}(p_{\theta_1} : p_{\theta_2}) = J^s_{F,\alpha}(\theta_1 : \theta_2)$. □
For practitioners in machine learning, it is well known that the Kullback–Leibler divergence between two probability densities p θ 1 and p θ 2 of an exponential family amounts to a Bregman divergence for the cumulant generator on a swapped parameter order (e.g., [26,27]):
D KL ( p θ 1 : p θ 2 ) = B F ( θ 2 : θ 1 ) .
This is a particular instance of Equation (10) obtained for α = 1 :
D B , 1 s ( p θ 1 : p θ 2 ) = J F , 1 s ( θ 1 , θ 2 ) .
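Proposition 4 is easy to verify numerically; a minimal Python sketch (our own check, assuming SciPy, and using the Poisson family with cumulant $F(\theta) = e^\theta$ and natural parameter $\theta = \log\lambda$ as a concrete example) compares the scaled α-skewed Bhattacharyya distance computed by direct summation against the scaled α-skewed Jensen divergence on the natural parameters:

```python
import numpy as np
from scipy.stats import poisson

def bhattacharyya_scaled(lam1, lam2, alpha, x_max=500):
    # Scaled alpha-skewed Bhattacharyya distance between Poisson(lam1) and Poisson(lam2),
    # computed by truncating the sum over the counting measure at x_max.
    xs = np.arange(x_max)
    rho = np.sum(poisson.pmf(xs, lam1) ** alpha * poisson.pmf(xs, lam2) ** (1 - alpha))
    return -np.log(rho) / (alpha * (1 - alpha))

def jensen_scaled_cumulant(theta1, theta2, alpha):
    # Scaled alpha-skewed Jensen divergence for the Poisson cumulant F(theta) = exp(theta).
    F = np.exp
    J = alpha * F(theta1) + (1 - alpha) * F(theta2) - F(alpha * theta1 + (1 - alpha) * theta2)
    return J / (alpha * (1 - alpha))

lam1, lam2, alpha = 3.0, 7.0, 0.3
print(bhattacharyya_scaled(lam1, lam2, alpha))                    # statistical divergence
print(jensen_scaled_cumulant(np.log(lam1), np.log(lam2), alpha))  # parameter divergence (equal)
```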
The formula $D_{\mathrm{KL}}(p_{\theta_1} : p_{\theta_2}) = B_F(\theta_2 : \theta_1)$ has been further generalized in [28] by considering truncations of exponential family densities. Let $\mathcal{X}_1 \subseteq \mathcal{X}_2 \subseteq \mathcal{X}$ and let $\mathcal{E}_1 = \{1_{\mathcal{X}_1}(x)\, p_\theta(x)\}$ and $\mathcal{E}_2 = \{1_{\mathcal{X}_2}(x)\, q_\theta(x)\}$ be two truncated families of $\mathcal{X}$ with corresponding cumulant functions
$$F_1(\theta) = \log \int_{\mathcal{X}_1} \exp(\langle t(x), \theta\rangle)\, \mathrm{d}\mu$$
and
$$F_2(\theta) = \log \int_{\mathcal{X}_2} \exp(\langle t(x), \theta\rangle)\, \mathrm{d}\mu \geq F_1(\theta).$$
Then, we have
$$D_{\mathrm{KL}}(p_{\theta_1} : q_{\theta_2}) = B_{F_2,F_1}(\theta_2 : \theta_1) = F_2(\theta_2) - F_1(\theta_1) - \langle \theta_2 - \theta_1, \nabla F_1(\theta_1)\rangle.$$
Truncated exponential families are normalized exponential families which may not be regular [29], i.e., the parameter space Θ may not be open.

4. Divergences Related to the Partition Function

Certain exponential families have intractable cumulant/partition functions (e.g., exponential families with sufficient statistics $t(x) = (x, x^2, \ldots, x^m)$ for high degrees $m$ [20]) or cumulant/partition functions which require exponential time to compute [30] (e.g., graphical models [16], high-dimensional grid sample spaces, energy-based models [17] in deep learning, etc.). In such cases, the maximum likelihood estimator (MLE) cannot be used to infer the natural parameter of exponential densities. Many alternative methods have been proposed to handle such exponential families with intractable partition functions, e.g., score matching [31] or divergence-based inference [32,33]. Thus, it is important to consider dissimilarities between non-normalized statistical models.
The squared Hellinger distance [1] between two positive potentially unnormalized densities p ˜ and q ˜ is defined by
$$D_H^2(\tilde p, \tilde q) = \frac{1}{2}\int \left(\sqrt{\tilde p} - \sqrt{\tilde q}\right)^2 \mathrm{d}\mu = \int \frac{\tilde p + \tilde q}{2}\, \mathrm{d}\mu - \int \sqrt{\tilde p\, \tilde q}\, \mathrm{d}\mu.$$
Notice that the Hellinger divergence can be interpreted as the integral of the difference between the arithmetical mean $A(\tilde p, \tilde q) = \frac{\tilde p + \tilde q}{2}$ and the geometrical mean $G(\tilde p, \tilde q) = \sqrt{\tilde p\, \tilde q}$ of the densities: $D_H^2(\tilde p, \tilde q) = \int \left(A(\tilde p, \tilde q) - G(\tilde p, \tilde q)\right) \mathrm{d}\mu$. This further proves that $D_H(\tilde p, \tilde q) \geq 0$, as $A \geq G$. The Hellinger distance $D_H$ satisfies the metric axioms of distances.
When considering unnormalized densities $\tilde p_{\theta_1} = \exp(\langle t(x), \theta_1\rangle)$ and $\tilde p_{\theta_2} = \exp(\langle t(x), \theta_2\rangle)$ of an exponential family $\mathcal{E}$ with partition function $Z(\theta) = \int \tilde p_\theta\, \mathrm{d}\mu$, we obtain
$$D_H^2(\tilde p_{\theta_1}, \tilde p_{\theta_2}) = \frac{Z(\theta_1) + Z(\theta_2)}{2} - Z\left(\frac{\theta_1 + \theta_2}{2}\right) = J_Z(\theta_1, \theta_2),$$
as $\sqrt{\tilde p_{\theta_1}\, \tilde p_{\theta_2}} = \tilde p_{\frac{\theta_1 + \theta_2}{2}}$.
The Kullback–Leibler divergence [1] as extended to two positive densities p ˜ and q ˜ is defined by
$$D_{\mathrm{KL}}(\tilde p : \tilde q) = \int \left( \tilde p \log \frac{\tilde p}{\tilde q} + \tilde q - \tilde p \right) \mathrm{d}\mu.$$
When considering unnormalized densities p ˜ θ 1 and p ˜ θ 2 of E , we obtain
$$D_{\mathrm{KL}}(\tilde p_{\theta_1} : \tilde p_{\theta_2}) = \int \left( \tilde p_{\theta_1}(x) \log \frac{\tilde p_{\theta_1}(x)}{\tilde p_{\theta_2}(x)} + \tilde p_{\theta_2}(x) - \tilde p_{\theta_1}(x) \right) \mathrm{d}\mu(x)$$
$$= \int \left( e^{\langle t(x), \theta_1\rangle} \langle \theta_1 - \theta_2, t(x)\rangle + e^{\langle t(x), \theta_2\rangle} - e^{\langle t(x), \theta_1\rangle} \right) \mathrm{d}\mu(x)$$
$$= \left\langle \int t(x)\, e^{\langle t(x), \theta_1\rangle}\, \mathrm{d}\mu(x),\ \theta_1 - \theta_2 \right\rangle + Z(\theta_2) - Z(\theta_1)$$
$$= \langle \theta_1 - \theta_2, \nabla Z(\theta_1)\rangle + Z(\theta_2) - Z(\theta_1) = B_Z(\theta_2 : \theta_1),$$
as $\nabla Z(\theta) = \int t(x)\, \tilde p_\theta(x)\, \mathrm{d}\mu(x)$. Let $D_{\mathrm{KL}}^*(\tilde p : \tilde q) = D_{\mathrm{KL}}(\tilde q : \tilde p)$ denote the reverse KLD.
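This identity can be checked numerically; here is a minimal Python sketch (our own check, assuming SciPy, and reusing the exponential-distribution NEF of Section 2.1 with $\tilde p_\theta(x) = e^{\theta x}$ on $[0,\infty)$, $\theta < 0$, $Z(\theta) = -1/\theta$, and $Z'(\theta) = 1/\theta^2$):

```python
import numpy as np
from scipy.integrate import quad

def kld_unnormalized(theta1, theta2):
    # Extended KLD between ptilde_theta1(x) = exp(theta1*x) and ptilde_theta2(x) = exp(theta2*x)
    # over [0, infinity), computed by numerical integration (theta1, theta2 < 0).
    f = lambda x: (np.exp(theta1 * x) * (theta1 - theta2) * x
                   + np.exp(theta2 * x) - np.exp(theta1 * x))
    return quad(f, 0, np.inf)[0]

def bregman_partition(theta1, theta2):
    # B_Z(theta2 : theta1) with Z(theta) = -1/theta and Z'(theta) = 1/theta**2.
    Z = lambda t: -1.0 / t
    dZ = lambda t: 1.0 / t ** 2
    return Z(theta2) - Z(theta1) - (theta2 - theta1) * dZ(theta1)

t1, t2 = -1.5, -0.4
print(kld_unnormalized(t1, t2), bregman_partition(t1, t2))  # both values agree
```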
More generally, the family of α-divergences [1] between the unnormalized densities $\tilde p$ and $\tilde q$ is defined for $\alpha \in \mathbb{R}$ by
$$D_\alpha(\tilde p : \tilde q) = \begin{cases} \frac{1}{\alpha(1-\alpha)} \int \left( \alpha \tilde p + (1-\alpha) \tilde q - \tilde p^\alpha \tilde q^{1-\alpha} \right) \mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ D_{\mathrm{KL}}^*(\tilde p : \tilde q) = D_{\mathrm{KL}}(\tilde q : \tilde p), & \alpha = 0,\\ 4\, D_H^2(\tilde p, \tilde q), & \alpha = \frac{1}{2},\\ D_{\mathrm{KL}}(\tilde p : \tilde q), & \alpha = 1. \end{cases}$$
We have $D_\alpha^*(\tilde p : \tilde q) = D_\alpha(\tilde q : \tilde p) = D_{1-\alpha}(\tilde p : \tilde q)$, and the α-divergences are homogeneous divergences of degree 1: for all $\lambda > 0$, we have $D_\alpha(\lambda \tilde p : \lambda \tilde q) = \lambda\, D_\alpha(\tilde p : \tilde q)$. Moreover, because $\alpha \tilde p + (1-\alpha) \tilde q - \tilde p^\alpha \tilde q^{1-\alpha}$ can be expressed as the difference of the weighted arithmetical mean $A(\tilde p, \tilde q; \alpha, 1-\alpha)$ minus the weighted geometrical mean $G(\tilde p, \tilde q; \alpha, 1-\alpha)$, it follows from the arithmetical–geometrical mean inequality that $D_\alpha(\tilde p : \tilde q) \geq 0$.
When considering unnormalized densities p ˜ θ 1 and p ˜ θ 2 of E , we obtain
$$D_\alpha(\tilde p_{\theta_1} : \tilde p_{\theta_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\, J_{Z,\alpha}(\theta_1 : \theta_2), & \alpha \in \mathbb{R}\setminus\{0,1\},\\ B_Z(\theta_1 : \theta_2), & \alpha = 0,\\ 4\, J_Z(\theta_1, \theta_2), & \alpha = \frac{1}{2},\\ B_Z^*(\theta_1 : \theta_2) = B_Z(\theta_2 : \theta_1), & \alpha = 1. \end{cases}$$
Proposition 5.
The α-divergences between the unnormalized densities of an exponential family amount to scaled α-skewed Jensen divergences between their natural parameters for the partition function:
$$D_\alpha(\tilde p_{\theta_1} : \tilde p_{\theta_2}) = J^s_{Z,\alpha}(\theta_1 : \theta_2).$$
When $\alpha \in \{0, 1\}$, the oriented Kullback–Leibler divergences between unnormalized exponential family densities amount to reverse Bregman divergences between their natural parameters for the partition function:
$$D_{\mathrm{KL}}(\tilde p_{\theta_1} : \tilde p_{\theta_2}) = B_Z(\theta_2 : \theta_1).$$
Proof. 
For $\alpha \notin \{0, 1\}$, consider
$$D_\alpha(\tilde p_{\theta_1} : \tilde p_{\theta_2}) = \frac{1}{\alpha(1-\alpha)} \int \left( \alpha \tilde p_{\theta_1} + (1-\alpha) \tilde p_{\theta_2} - \tilde p_{\theta_1}^\alpha \tilde p_{\theta_2}^{1-\alpha} \right) \mathrm{d}\mu.$$
Here, we have $\int \alpha \tilde p_{\theta_1}\, \mathrm{d}\mu = \alpha Z(\theta_1)$, $\int (1-\alpha) \tilde p_{\theta_2}\, \mathrm{d}\mu = (1-\alpha) Z(\theta_2)$, and $\int \tilde p_{\theta_1}^\alpha \tilde p_{\theta_2}^{1-\alpha}\, \mathrm{d}\mu = \int \tilde p_{\alpha\theta_1 + (1-\alpha)\theta_2}\, \mathrm{d}\mu = Z(\alpha\theta_1 + (1-\alpha)\theta_2)$. It follows that
$$D_\alpha(\tilde p_{\theta_1} : \tilde p_{\theta_2}) = \frac{1}{\alpha(1-\alpha)}\, J_{Z,\alpha}(\theta_1 : \theta_2) = J^s_{Z,\alpha}(\theta_1 : \theta_2).$$
The limit cases $\alpha \in \{0, 1\}$ recover the sided extended Kullback–Leibler divergences, which were shown above to amount to the reverse Bregman divergences $B_Z(\theta_2 : \theta_1)$ and $B_Z(\theta_1 : \theta_2)$. □
Notice that the KLD as extended to unnormalized densities can be written as a generalized relative entropy, i.e., it can be obtained as the difference of the extended cross-entropy minus the extended entropy (self cross-entropy):
$$D_{\mathrm{KL}}(\tilde p : \tilde q) = H^\times(\tilde p : \tilde q) - H(\tilde p) = \int \left( \tilde p \log \frac{\tilde p}{\tilde q} + \tilde q - \tilde p \right) \mathrm{d}\mu,$$
with
$$H^\times(\tilde p : \tilde q) = \int \left( \tilde p(x) \log \frac{1}{\tilde q(x)} + \tilde q(x) \right) \mathrm{d}\mu(x) - 1$$
and
$$H(\tilde p) = H^\times(\tilde p : \tilde p) = \int \left( \tilde p(x) \log \frac{1}{\tilde p(x)} + \tilde p(x) \right) \mathrm{d}\mu(x) - 1.$$
Remark 4.
In general, we can consider two unnormalized positive densities $\tilde p(x)$ and $\tilde q(x)$. Let $p(x) = \frac{\tilde p(x)}{Z_p}$ and $q(x) = \frac{\tilde q(x)}{Z_q}$ denote their corresponding normalized densities (with normalizing factors $Z_p = \int \tilde p\, \mathrm{d}\mu$ and $Z_q = \int \tilde q\, \mathrm{d}\mu$); then, the KLD between $\tilde p$ and $\tilde q$ can be expressed using the KLD between their normalized densities and their normalizing factors, as follows:
$$D_{\mathrm{KL}}(\tilde p : \tilde q) = Z_p\, D_{\mathrm{KL}}(p : q) + Z_p \log \frac{Z_p}{Z_q} + Z_q - Z_p.$$
Similarly, we have
$$H^\times(\tilde p : \tilde q) = Z_p\, H^\times(p : q) - Z_p \log Z_q + Z_q - 1,$$
$$H(\tilde p) = Z_p\, H(p) - Z_p \log Z_p + Z_p - 1,$$
and $D_{\mathrm{KL}}(\tilde p : \tilde q) = H^\times(\tilde p : \tilde q) - H(\tilde p)$.
Notice that Equation (17) allows us to derive the following identity between B Z and B F :
$$B_Z(\theta_2 : \theta_1) = Z(\theta_1)\, B_F(\theta_2 : \theta_1) + Z(\theta_1) \log \frac{Z(\theta_1)}{Z(\theta_2)} + Z(\theta_2) - Z(\theta_1)$$
$$= \exp(F(\theta_1))\, B_F(\theta_2 : \theta_1) + \exp(F(\theta_1))\, (F(\theta_1) - F(\theta_2)) + \exp(F(\theta_2)) - \exp(F(\theta_1)).$$
Let $D_{\mathrm{skl}}(a : b) = a \log \frac{a}{b} + b - a$ be the scalar KLD for $a > 0$ and $b > 0$. Then, we can rewrite Equation (17) as
$$D_{\mathrm{KL}}(\tilde p : \tilde q) = Z_p\, D_{\mathrm{KL}}(p : q) + D_{\mathrm{skl}}(Z_p : Z_q),$$
and we have
$$B_Z(\theta_2 : \theta_1) = Z(\theta_1)\, B_F(\theta_2 : \theta_1) + D_{\mathrm{skl}}(Z(\theta_1) : Z(\theta_2)).$$
In addition, the KLD between the unnormalized densities p ˜ and q ˜ with support X can be written as a definite integral of a scalar Bregman divergence:
$$D_{\mathrm{KL}}(\tilde p : \tilde q) = \int_{\mathcal{X}} D_{\mathrm{skl}}(\tilde p(x) : \tilde q(x))\, \mathrm{d}\mu(x) = \int_{\mathcal{X}} B_{f_{\mathrm{skl}}}(\tilde p(x) : \tilde q(x))\, \mathrm{d}\mu(x),$$
where $f_{\mathrm{skl}}(x) = x \log x - x$. Because $B_{f_{\mathrm{skl}}}(a : b) \geq 0$ for all $a > 0$, $b > 0$, we can deduce that $D_{\mathrm{KL}}(\tilde p : \tilde q) \geq 0$, with equality iff $\tilde p(x) = \tilde q(x)$ $\mu$-almost everywhere.
Notice that $B_Z(\theta_2 : \theta_1) = Z(\theta_1)\, B_F(\theta_2 : \theta_1) + D_{\mathrm{skl}}(Z(\theta_1) : Z(\theta_2))$ can be interpreted as the sum of two divergences, namely a conformal Bregman divergence plus a scalar Bregman divergence.
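Since the identity above holds for any smooth strictly convex generator $F$ with $Z = \exp(F)$, it is easy to check numerically. The following Python sketch (our own illustration; the toy generator $F(\theta) = \theta^2$ is an arbitrary choice, not a specific cumulant function) verifies it in the one-dimensional case:

```python
import numpy as np

# Check B_Z(theta2:theta1) = Z(theta1) * B_F(theta2:theta1) + D_skl(Z(theta1):Z(theta2))
# for a smooth strictly convex generator F with Z = exp(F).
F = lambda t: t ** 2                 # toy strictly convex generator
dF = lambda t: 2 * t
Z = lambda t: np.exp(F(t))
dZ = lambda t: np.exp(F(t)) * dF(t)  # chain rule: Z' = exp(F) * F'

def bregman(G, dG, a, b):
    return G(a) - G(b) - (a - b) * dG(b)

def d_skl(a, b):
    return a * np.log(a / b) + b - a

t1, t2 = 0.2, 1.3
lhs = bregman(Z, dZ, t2, t1)
rhs = Z(t1) * bregman(F, dF, t2, t1) + d_skl(Z(t1), Z(t2))
print(lhs, rhs)  # equal up to floating-point error
```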
Remark 5.
Consider the KLD between the normalized p θ 1 and unnormalized p ˜ θ 2 densities of the same exponential family. In this case, we have
$$D_{\mathrm{KL}}(p_{\theta_1} : \tilde p_{\theta_2}) = B_F(\theta_2 : \theta_1) - \log Z(\theta_2) + Z(\theta_2) - 1 = Z(\theta_2) - 1 - F(\theta_1) - \langle \theta_2 - \theta_1, \nabla F(\theta_1)\rangle$$
$$= B_{Z-1,F}(\theta_2 : \theta_1).$$
The divergence $B_{Z-1,F}$ is a dual Bregman pseudo-divergence [28]:
$$B_{F_1,F_2}(\theta_1 : \theta_2) = F_1(\theta_1) - F_2(\theta_2) - \langle \theta_1 - \theta_2, \nabla F_2(\theta_2)\rangle,$$
for $F_1$ and $F_2$ two strictly convex and smooth functions such that $F_1 \geq F_2$. Indeed, we can check that the generators $F_1(\theta) = Z(\theta) - 1$ and $F_2(\theta) = F(\theta)$ are both Bregman generators; we have $F_1(\theta) \geq F_2(\theta)$, as $e^x \geq x + 1$ for all $x$ (with equality when $x = 0$), i.e., $Z(\theta) - 1 \geq F(\theta)$.
More generally, the α-divergences between the normalized density $p_{\theta_1}$ and the unnormalized density $\tilde p_{\theta_2}$ can be written as
$$D_\alpha(p_{\theta_1} : \tilde p_{\theta_2}) = \frac{1}{\alpha(1-\alpha)} \left( \alpha + (1-\alpha)\, Z(\theta_2) - \frac{Z(\alpha\theta_1 + (1-\alpha)\theta_2)}{Z(\theta_1)^\alpha} \right),$$
with the (signed) α-skewed Bhattacharyya distances provided by
$$D_{B,\alpha}(p_{\theta_1} : \tilde p_{\theta_2}) = \alpha \log Z(\theta_1) - \log Z(\alpha\theta_1 + (1-\alpha)\theta_2).$$
Let us illustrate Proposition 5 with some examples.
Example 1.
Consider the family of exponential distributions $\mathcal{E} = \{p_\lambda(x) = 1_{x \geq 0}\, \lambda \exp(-\lambda x)\}$: $\mathcal{E}$ is an exponential family with natural parameter $\theta = \lambda$, parameter space $\Theta = \mathbb{R}_{>0}$, and sufficient statistic $t(x) = -x$. The partition function is $Z(\theta) = \frac{1}{\theta}$, with $Z'(\theta) = -\frac{1}{\theta^2}$ and $Z''(\theta) = \frac{2}{\theta^3} > 0$, while the cumulant function is $F(\theta) = \log Z(\theta) = -\log\theta$ with moment parameter $\eta = E_{p_\lambda}[t(x)] = F'(\theta) = -\frac{1}{\theta}$. The α-divergences between two unnormalized exponential distributions are
$$D_\alpha(\tilde p_{\lambda_1} : \tilde p_{\lambda_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\, J_{Z,\alpha}(\theta_1 : \theta_2) = \frac{(\lambda_1 - \lambda_2)^2}{\alpha \lambda_1^2 \lambda_2 + (1-\alpha)\lambda_1 \lambda_2^2}, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ D_{\mathrm{KL}}(\tilde p_{\lambda_2} : \tilde p_{\lambda_1}) = B_Z(\theta_1 : \theta_2) = \frac{(\lambda_1 - \lambda_2)^2}{\lambda_1 \lambda_2^2}, & \alpha = 0,\\ 4\, J_Z(\theta_1, \theta_2) = \frac{2(\lambda_1 - \lambda_2)^2}{\lambda_1 \lambda_2^2 + \lambda_1^2 \lambda_2}, & \alpha = \frac{1}{2},\\ D_{\mathrm{KL}}(\tilde p_{\lambda_1} : \tilde p_{\lambda_2}) = B_Z(\theta_2 : \theta_1) = \frac{(\lambda_1 - \lambda_2)^2}{\lambda_2 \lambda_1^2}, & \alpha = 1. \end{cases}$$
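Example 1 is convenient for a numerical cross-check; the Python sketch below (our own, assuming SciPy; the rates and skew value are arbitrary) compares the closed-form α-divergence with a direct numerical integration of the unnormalized densities $e^{-\lambda_1 x}$ and $e^{-\lambda_2 x}$:

```python
import numpy as np
from scipy.integrate import quad

def alpha_div_numeric(l1, l2, alpha):
    # D_alpha between the unnormalized densities exp(-l1*x) and exp(-l2*x) on [0, infinity).
    f = lambda x: (alpha * np.exp(-l1 * x) + (1 - alpha) * np.exp(-l2 * x)
                   - np.exp(-(alpha * l1 + (1 - alpha) * l2) * x))
    return quad(f, 0, np.inf)[0] / (alpha * (1 - alpha))

def alpha_div_closed(l1, l2, alpha):
    # Closed form of Example 1: (l1 - l2)^2 / (alpha*l1^2*l2 + (1 - alpha)*l1*l2^2).
    return (l1 - l2) ** 2 / (alpha * l1 ** 2 * l2 + (1 - alpha) * l1 * l2 ** 2)

print(alpha_div_numeric(2.0, 5.0, 0.3), alpha_div_closed(2.0, 5.0, 0.3))  # both ~0.2195
```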
Example 2.
Consider the family of univariate centered normal distributions with $\tilde p_{\sigma^2}(x) \propto \exp\left(-\frac{x^2}{2\sigma^2}\right)$ and partition function $Z(\sigma^2) = \sqrt{2\pi\sigma^2}$, so that $p_{\sigma^2}(x) = \frac{1}{Z(\sigma^2)}\, \tilde p_{\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right)$. Here, we have the natural parameter $\theta = \frac{1}{\sigma^2} \in \Theta = \mathbb{R}_{>0}$ and sufficient statistic $t(x) = -\frac{x^2}{2}$. The partition function expressed with the natural parameter is $Z(\theta) = \sqrt{\frac{2\pi}{\theta}}$, with $Z'(\theta) = -\sqrt{\frac{\pi}{2}}\, \theta^{-\frac{3}{2}}$ and $Z''(\theta) = \frac{3}{2}\sqrt{\frac{\pi}{2}}\, \theta^{-\frac{5}{2}} > 0$ (strictly convex on Θ). The unnormalized KLD between $\tilde p_{\sigma_1^2}$ and $\tilde p_{\sigma_2^2}$ is
$$D_{\mathrm{KL}}(\tilde p_{\sigma_1^2} : \tilde p_{\sigma_2^2}) = B_Z(\theta_2 : \theta_1) = \sqrt{\frac{\pi}{2}}\left(2\sigma_2 - 3\sigma_1 + \frac{\sigma_1^3}{\sigma_2^2}\right).$$
We can check that we have D KL ( p ˜ σ 2 : p ˜ σ 2 ) = 0 .
For the Hellinger divergence, we have
$$D_H^2(\tilde p_{\sigma_1^2} : \tilde p_{\sigma_2^2}) = J_Z(\theta_1, \theta_2) = \sqrt{\frac{\pi}{2}}\, (\sigma_1 + \sigma_2) - \frac{2\sqrt{\pi}\, \sigma_1 \sigma_2}{\sqrt{\sigma_1^2 + \sigma_2^2}},$$
and we can check that D H ( p ˜ σ 2 : p ˜ σ 2 ) = 0 .
Consider the family of $d$-variate centered normal distributions with unnormalized density
$$\tilde p_\Sigma(x) \propto \exp\left(-\tfrac{1}{2}\, x^\top \Sigma^{-1} x\right) = \exp\left(-\tfrac{1}{2}\, \mathrm{tr}(x^\top \Sigma^{-1} x)\right) = \exp\left(-\tfrac{1}{2}\, \mathrm{tr}(x x^\top \Sigma^{-1})\right),$$
obtained using the cyclic property of the matrix trace, where Σ is the covariance matrix. Here, we have $\theta = \Sigma^{-1}$ (the precision matrix) and $\Theta = \mathrm{Sym}^{++}(d)$ for $t(x) = -\tfrac{1}{2}\, x x^\top$, with the matrix inner product $\langle A, B\rangle = \mathrm{tr}(AB)$. The partition function $Z(\Sigma) = (2\pi)^{\frac{d}{2}} \sqrt{\det(\Sigma)}$ expressed with the natural parameter is $Z(\theta) = (2\pi)^{\frac{d}{2}}\, \frac{1}{\sqrt{\det(\theta)}}$. This is a convex function with
$$\nabla Z(\theta) = -\frac{1}{2}\, (2\pi)^{\frac{d}{2}}\, \frac{\nabla_\theta \det(\theta)}{\det(\theta)^{\frac{3}{2}}} = -\frac{1}{2}\, (2\pi)^{\frac{d}{2}}\, \theta^{-1}\, \det(\theta)^{-\frac{1}{2}},$$
as $\nabla_\theta \det(\theta) = \det(\theta)\, \theta^{-1}$ for symmetric invertible $\theta$, using matrix calculus.
Now, consider the family of univariate normal distributions
$$\mathcal{E} = \left\{ p_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right) \right\}.$$
Let $\theta = (\theta_1, \theta_2) = \left(\frac{1}{\sigma^2}, \frac{\mu}{\sigma^2}\right)$ and
$$Z(\theta_1, \theta_2) = \sqrt{\frac{2\pi}{\theta_1}}\exp\left(\frac{1}{2}\frac{\theta_2^2}{\theta_1}\right).$$
The unnormalized densities are $\tilde p_\theta(x) = \exp\left(-\theta_1 \frac{x^2}{2} + \theta_2 x\right)$, and we have
$$\nabla Z(\theta) = \left( -\sqrt{\frac{\pi}{2}}\, (\theta_1 + \theta_2^2)\, \frac{\exp\left(\frac{\theta_2^2}{2\theta_1}\right)}{\theta_1^{5/2}},\;\; \sqrt{2\pi}\, \theta_2\, \frac{\exp\left(\frac{\theta_2^2}{2\theta_1}\right)}{\theta_1^{3/2}} \right).$$
It follows that $D_{\mathrm{KL}}(\tilde p_\theta : \tilde p_{\theta'}) = B_Z(\theta' : \theta)$.
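As with the previous examples, the two-parameter case can be verified numerically; in the Python sketch below (our own check, assuming SciPy; to keep it independent of the gradient formula above, $\nabla Z$ is estimated by finite differences), the extended KLD between two unnormalized univariate normal densities matches the Bregman divergence $B_Z$ on the swapped natural parameters:

```python
import numpy as np
from scipy.integrate import quad

def Z(t1, t2):
    # Partition function of the univariate normal family in the natural parameterization.
    return np.sqrt(2 * np.pi / t1) * np.exp(t2 ** 2 / (2 * t1))

def ptilde(x, t1, t2):
    return np.exp(-t1 * x ** 2 / 2 + t2 * x)

def kld_unnormalized(a, b):
    f = lambda x: (ptilde(x, *a) * np.log(ptilde(x, *a) / ptilde(x, *b))
                   + ptilde(x, *b) - ptilde(x, *a))
    return quad(f, -np.inf, np.inf)[0]

def bregman_Z(a, b, eps=1e-6):
    # B_Z(a : b), with the gradient of Z at b estimated by central finite differences.
    g1 = (Z(b[0] + eps, b[1]) - Z(b[0] - eps, b[1])) / (2 * eps)
    g2 = (Z(b[0], b[1] + eps) - Z(b[0], b[1] - eps)) / (2 * eps)
    return Z(*a) - Z(*b) - (a[0] - b[0]) * g1 - (a[1] - b[1]) * g2

theta, theta_prime = (1.0, 0.5), (2.0, -0.3)
print(kld_unnormalized(theta, theta_prime), bregman_Z(theta_prime, theta))  # ~ equal
```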

5. Deforming Convex Functions and Their Induced Dually Flat Spaces

5.1. Comparative Convexity

The log-convexity can be interpreted as a special case of comparative convexity with respect to a pair ( M , N ) of comparable weighted means [9], as follows.
A function $Z$ is $(M, N)$-convex if and only if for $\alpha \in [0, 1]$ we have
$$Z(M(x, y; \alpha, 1-\alpha)) \leq N(Z(x), Z(y); \alpha, 1-\alpha),$$
and is strictly $(M, N)$-convex iff the inequality is strict for $\alpha \in (0, 1)$ and $x \neq y$. Furthermore, a function $Z$ is (strictly) $(M, N)$-concave if $-Z$ is (strictly) $(M, N)$-convex.
Log-convexity corresponds to ( A , G ) -convexity, i.e., convexity with respect to the weighted arithmetical and geometrical means defined respectively by A ( x , y ; α , 1 α ) = α x + ( 1 α ) y and G ( x , y ; α , 1 α ) = x α y 1 α . Ordinary convexity is ( A , A ) -convexity.
A weighted quasi-arithmetical mean [34] (also called a Kolmogorov–Nagumo mean [35]) is defined for a continuous and strictly increasing function h by
$$M_h(x, y; \alpha, 1-\alpha) = h^{-1}\left(\alpha\, h(x) + (1-\alpha)\, h(y)\right).$$
We let M h ( x , y ) = M h x , y ; 1 2 , 1 2 . Quasi-arithmetical means include the arithmetical mean obtained for h ( u ) = id ( u ) = u and the geometrical mean for h ( u ) = log ( u ) , and more generally power means
$$M_p(x, y; \alpha, 1-\alpha) = \left(\alpha\, x^p + (1-\alpha)\, y^p\right)^{\frac{1}{p}} = M_{h_p}(x, y; \alpha, 1-\alpha), \quad p \neq 0,$$
which are quasi-arithmetical means obtained for the family of generators $h_p(u) = \frac{u^p - 1}{p}$ with inverse $h_p^{-1}(u) = (1 + p\, u)^{\frac{1}{p}}$. In the limit $p \to 0$, we have $M_0(x, y) = G(x, y)$ for the generator $\lim_{p\to 0} h_p(u) = h_0(u) = \log u$.
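The following short Python sketch (our own illustration; the generator $h_p$ and the sample values are the only assumptions here) makes the quasi-arithmetical mean and its power-mean special cases concrete:

```python
import numpy as np

def quasi_arithmetic_mean(h, h_inv, x, y, alpha=0.5):
    # Weighted quasi-arithmetical (Kolmogorov-Nagumo) mean M_h(x, y; alpha, 1 - alpha).
    return h_inv(alpha * h(x) + (1 - alpha) * h(y))

def power_mean(x, y, p, alpha=0.5):
    # Weighted power mean; the limit p -> 0 recovers the weighted geometric mean.
    if p == 0:
        return x ** alpha * y ** (1 - alpha)
    return (alpha * x ** p + (1 - alpha) * y ** p) ** (1 / p)

x, y, alpha, p = 2.0, 8.0, 0.25, 3.0
h_p = lambda u: (u ** p - 1) / p          # generator of the power mean M_p
h_p_inv = lambda v: (1 + p * v) ** (1 / p)
print(power_mean(x, y, 1.0, alpha))       # weighted arithmetic mean: 6.5
print(power_mean(x, y, 0.0, alpha))       # weighted geometric mean: ~5.657
print(quasi_arithmetic_mean(h_p, h_p_inv, x, y, alpha), power_mean(x, y, p, alpha))  # equal
```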
Proposition 6
([36,37]). A function $Z(\theta)$ is strictly $(M_\rho, M_\tau)$-convex with respect to two strictly increasing smooth functions $\rho$ and $\tau$ if and only if the function $F = \tau \circ Z \circ \rho^{-1}$ is strictly convex.
Notice that the set of strictly increasing smooth functions forms a non-abelian group, with function composition as the group operation, the identity function as the neutral element, and the functional inverse as the inverse element.
Because log-convexity is $(A = M_{\mathrm{id}}, G = M_{\log})$-convexity, a function $Z$ is strictly log-convex iff $\log \circ\, Z \circ \mathrm{id}^{-1} = \log Z$ is strictly convex. We have
$$Z = \tau^{-1} \circ F \circ \rho \iff F = \tau \circ Z \circ \rho^{-1}.$$
Starting from a given convex function F ( θ ) , we can deform the function F ( θ ) to obtain a function Z ( θ ) using two strictly monotone functions τ and ρ : Z ( θ ) = τ 1 ( F ( ρ ( θ ) ) ) .
For a ( M ρ , M τ ) -convex function Z ( θ ) which is also strictly convex, we can define a pair of Bregman divergences B Z and B F with F ( θ ) = τ ( Z ( ρ 1 ( θ ) ) ) and a corresponding pair of skewed Jensen divergences.
Thus, we have the following generic deformation scheme.
$$F = \tau \circ Z \circ \rho^{-1}\ \text{is}\ (M_{\rho^{-1}}, M_{\tau^{-1}})\text{-convex when } Z \text{ is convex} \quad \underset{(\rho^{-1},\,\tau^{-1})\text{-deformation}}{\overset{(\rho,\,\tau)\text{-deformation}}{\rightleftarrows}} \quad Z = \tau^{-1} \circ F \circ \rho\ \text{is}\ (M_{\rho}, M_{\tau})\text{-convex when } F \text{ is convex}$$
In particular, when the function $Z$ is deformed by the strictly increasing power functions $h_{p_1}$ and $h_{p_2}$ for $p_1, p_2 \in \mathbb{R}$ as
$$Z_{p_1,p_2} = h_{p_2} \circ Z \circ h_{p_1}^{-1},$$
then $Z_{p_1,p_2}$ is strictly convex precisely when $Z$ is strictly $(M_{p_1}, M_{p_2})$-convex, and as such it induces corresponding Bregman and Jensen divergences.
Example 3.
Consider the partition function $Z(\theta) = \frac{1}{\theta}$ of the exponential distribution family ($\theta > 0$ with $\Theta = \mathbb{R}_{>0}$). Let $Z_p(\theta) = (h_p \circ Z)(\theta) = \frac{\theta^{-p} - 1}{p}$; then, we have $Z_p''(\theta) = (1 + p)\, \theta^{-(p+2)} > 0$ when $p > -1$. Thus, we can deform $Z$ smoothly by $Z_p$ while preserving convexity by ranging $p$ from $-1$ to $+\infty$. In this way, we obtain a corresponding family of Bregman and Jensen divergences.
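A minimal Python sketch of this deformation (our own illustration; the parameter values are arbitrary) evaluates the Bregman divergence induced by the deformed generator $Z_p$ for several convexity-preserving values of $p$:

```python
import numpy as np

def Z_p(theta, p):
    # Deformed generator Z_p = h_p o Z of Z(theta) = 1/theta, i.e., Z_p(theta) = (theta**(-p) - 1)/p.
    return (theta ** (-p) - 1) / p

def bregman_Zp(t1, t2, p):
    # Bregman divergence induced by Z_p, using Z_p'(theta) = -theta**(-p - 1).
    dZp = lambda t: -t ** (-p - 1)
    return Z_p(t1, p) - Z_p(t2, p) - (t1 - t2) * dZp(t2)

# Z_p''(theta) = (1 + p) * theta**(-p - 2) > 0 for p > -1: the deformation preserves strict convexity.
for p in (-0.5, 0.5, 2.0):
    print(p, bregman_Zp(1.0, 3.0, p))  # non-negative for each convexity-preserving p
```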
The proposed convex deformation using quasi-arithmetical mean generators differs from the interpolation of convex functions using the technique of proximal averaging [38].
Note that in [37] the comparative convexity with respect to a pair of quasi-arithmetical means ( M ρ , M τ ) is used to define a ( M ρ , M τ ) -Bregman divergence, which turns out to be equivalent to a conformal Bregman divergence on the ρ -embedding of the parameters.

5.2. Dually Flat Spaces

We start with a refinement of the class of convex functions used to generate dually flat spaces.
Definition 2
(Legendre type function [39]). $(\Theta, F)$ is of Legendre type if the function $F : \Theta \to \mathbb{R}$ is strictly convex and differentiable with $\Theta \neq \emptyset$ open, and
$$\lim_{\lambda \to 0} \frac{\mathrm{d}}{\mathrm{d}\lambda}\, F(\lambda\theta + (1-\lambda)\bar\theta) = -\infty, \quad \forall \theta \in \Theta,\ \forall \bar\theta \in \partial\Theta.$$
Legendre-type functions $(\Theta, F)$ admit a convex conjugate $F^*(\eta)$ via the Legendre transform $F^*(\eta) = \sup_{\theta \in \Theta} \{\langle \theta, \eta\rangle - F(\theta)\}$:
$$F^*(\eta) = \langle (\nabla F)^{-1}(\eta), \eta\rangle - F\left((\nabla F)^{-1}(\eta)\right).$$
A smooth and strictly convex function ( Θ , F ( θ ) ) of Legendre type induces a dually flat space [1] M , i.e., a smooth Hessian manifold [40] with a single global chart ( Θ , θ ( · ) ) [1]. A canonical divergence D ( p : q ) between two points p and q of M is viewed as a single-parameter contrast function [41] D ( r p q ) on the product manifold M × M . The canonical divergence and its dual canonical divergence D * ( r q p ) = D ( r p q ) can be expressed equivalently as either dual Bregman divergences or dual Fenchel–Young divergences (Figure 2):
$$D(r_{pq}) = B_F(\theta(p) : \theta(q)) = Y_{F,F^*}(\theta(p) : \eta(q)) = D^*(r_{qp}) = B_{F^*}(\eta(q) : \eta(p)) = Y_{F^*,F}(\eta(q) : \theta(p)),$$
where Y F , F * is the Fenchel–Young divergence:
$$Y_{F,F^*}(\theta(p) : \eta(q)) = F(\theta(p)) + F^*(\eta(q)) - \langle \theta(p), \eta(q)\rangle.$$
We have the dual global coordinate system η = F ( θ ) and the domain H = { F ( θ ) : θ Θ } which defines the dual Legendre-type potential function ( H , F * ( η ) ) . The Legendre-type function ensures that F * * = F (a sufficient condition is to have F be convex and lower semi-continuous [42]).
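To make the duality concrete, here is a small Python sketch (our own check, assuming SciPy; the Poisson cumulant $F(\theta) = e^\theta$ is used as the Legendre-type potential, for which the conjugate $F^*(\eta) = \eta\log\eta - \eta$ is known in closed form). It computes $F^*$ by numerical maximization and checks that the Fenchel–Young divergence in mixed coordinates equals the Bregman divergence in primal coordinates:

```python
import numpy as np
from scipy.optimize import minimize_scalar

F = np.exp        # Legendre-type potential (Poisson cumulant), used here as an example
gradF = np.exp

def F_conjugate(eta):
    # Legendre-Fenchel transform F*(eta) = sup_theta {theta*eta - F(theta)}, computed numerically.
    res = minimize_scalar(lambda t: -(t * eta - F(t)))
    return -res.fun

def bregman_F(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def fenchel_young(t1, eta2):
    return F(t1) + F_conjugate(eta2) - t1 * eta2

t1, t2 = 0.4, 1.7
eta2 = gradF(t2)  # dual (expectation) coordinate of the second point
print(bregman_F(t1, t2), fenchel_young(t1, eta2))     # equal up to numerical error
print(F_conjugate(eta2), eta2 * np.log(eta2) - eta2)  # matches the closed-form conjugate
```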
A manifold $\mathcal{M}$ is called dually flat because the torsion-free affine connections $\nabla$ and $\nabla^*$ induced by the potential functions $F(\theta)$ and $F^*(\eta)$ linked by the Legendre–Fenchel transformation are flat [1]; that is, their Christoffel symbols vanish in the respective coordinate systems: $\Gamma(\theta) = 0$ and $\Gamma^*(\eta) = 0$.
The Legendre-type function $(\Theta, F(\theta))$ is not defined uniquely; the function $\bar F(\bar\theta) = F(A\theta + b) + \langle c, \theta\rangle + d$ with $\bar\theta = A\theta + b$, for an invertible matrix $A$, vectors $b$ and $c$, and a scalar $d$, defines the same dually flat space with the same canonical divergence $D(p : q)$:
D ( p : q ) = B F ( θ ( p ) : θ ( q ) ) = B F ¯ ( θ ¯ ( p ) : θ ¯ ( q ) ) .
Thus, a log-convex Legendre-type function $Z(\theta)$ induces two dually flat spaces by considering the DFSs induced by $Z(\theta)$ and $F(\theta) = \log Z(\theta)$. Let the corresponding gradient maps be $\eta = \nabla Z(\theta)$ and $\tilde\eta = \nabla F(\theta) = \frac{\eta}{Z(\theta)}$.
When F ( θ ) is chosen as the cumulant function of an exponential family, the Bregman divergence B F ( θ 1 : θ 2 ) can be interpreted as a statistical divergence between corresponding probability densities, meaning that the Bregman divergence amounts to the reverse Kullback–Leibler divergence: B F ( θ 1 : θ 2 ) = D KL * ( p θ 1 : p θ 2 ) , where D KL * is the reverse KLD.
Notice that deforming a convex function F ( θ ) into F ( ρ ( θ ) ) such that F ρ remains strictly convex has been considered by Yoshizawa and Tanabe [43] to build a two-parameter deformation ρ α , β of the dually flat space induced by the cumulant function F ( θ ) of the multivariate normal family. Additionally, see the method of Hougaard [44] for obtaining other exponential families from a given exponential family.
Thus, in general, there are many more dually flat spaces with corresponding divergences and statistical divergences than the usually considered exponential family manifold [5] induced by the cumulant function. It is interesting to consider their use in information sciences.

6. Conclusions and Discussion

For machine learning practitioners, it is well known that the Kullback–Leibler divergence (KLD) between two probability densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family with cumulant function $F$ (free energy in thermodynamics) amounts to a reverse Bregman divergence [26] induced by $F$, or equivalently to a reverse Fenchel–Young divergence [27]:
D KL ( p θ 1 : p θ 2 ) = B F ( θ 2 : θ 1 ) = Y F , F * ( θ 2 : η 1 ) ,
where η = F ( θ ) is the dual moment or expectation parameter.
In this paper, we have shown that the KLD as extended to positive unnormalized densities p ˜ θ 1 and p ˜ θ 2 of an exponential family with a convex partition function Z ( θ ) (Laplace transform) amounts to a reverse Bregman divergence induced by Z, or equivalently to a reverse Fenchel–Young divergence
D KL ( p ˜ θ 1 : p ˜ θ 2 ) = B Z ( θ 2 : θ 1 ) = Y Z , Z * ( θ 2 : η ˜ 1 ) ,
where η ˜ = Z ( θ ) .
More generally, we have shown that the scaled α -skewed Jensen divergences induced by the cumulant and partition functions between natural parameters coincide with the scaled α -skewed Bhattacharyya distances between probability densities and the α -divergences between unnormalized densities, respectively:
D B , α s ( p θ 1 : p θ 2 ) = J F , α s ( θ 1 : θ 2 ) , D α ( p ˜ θ 1 : p ˜ θ 2 ) = J Z , α s ( θ 1 : θ 2 ) .
We have noted that the partition functions Z of exponential families are both convex and log-convex, and that the corresponding cumulant functions are both convex and exponentially convex.
Figure 3 summarizes the relationships between statistical divergences and between the normalized and unnormalized densities of an exponential family, as well as the corresponding divergences between their natural parameters. Notice that Brekelmans and Nielsen [45] considered deformed uni-order likelihood ratio exponential families (LREFs) for annealing paths and obtained an identity for the α -divergences between unnormalized densities and Bregman divergences induced by multiplicatively scaled partition functions.
Because the log-convex partition function is also convex, we have generalized the principle of building pairs of convex generators using the comparative convexity with respect to a pair of quasi-arithmetical means, and have further discussed the induced dually flat spaces and divergences. In particular, by considering the convexity-preserving deformations obtained by power mean generators, we have shown how to obtain a family of convex generators and dually flat spaces. Notice that some parametric families of Bregman divergences, such as the α -divergences [46], β -divergences [47], and V-geometry [48] of symmetric positive-definite matrices, yield families of dually flat spaces.
Banerjee et al. [49] proved a duality between regular exponential families and a subclass of Bregman divergences, which they accordingly termed regular Bregman divergences. In particular, this duality allows the Maximum Likelihood Estimator (MLE) of an exponential family with a cumulant function F to be viewed as a right-sided Bregman centroid with respect to the Legendre–Fenchel dual F * . In [50], the scope of this duality was further extended for arbitrary Bregman divergences by introducing a class of generalized exponential families.
Concave deformations have been recently studied in [51], where the authors introduced the log ϕ -concavity induced by a positive continuous function ϕ generating a deformed logarithm log ϕ as the ( A , log ϕ ) -comparative concavity (Definition 1.2 in [51]), as well as the weaker notion of F-concavity which corresponds to the ( A , F ) -concavity (Definition 2.1 in [51], requiring strictly increasing functions F). Our deformation framework Z = τ 1 F ρ is more general, as it is double-sided. We jointly deform the function F by F τ = τ 1 F and its argument θ by θ ρ = ρ ( θ ) .
Exponentially concave functions have been considered as generators of L-divergences in [24]; α-exponentially concave functions $G$ such that $\exp(\alpha G)$ is concave for $\alpha > 0$ generalize the L-divergences to $L_\alpha$-divergences, which can be expressed equivalently using a generalization of the Fenchel–Young divergence based on c-transforms [24]. When $\alpha < 0$, exponentially convex functions are considered instead of exponentially concave functions. The information geometry induced by $L_\alpha$-divergences is dually projectively flat with constant curvature, and reciprocally, a dually projectively flat structure with constant curvature induces (locally) a canonical $L_\alpha$-divergence. Wong and Zhang [52] investigated a one-parameter deformation of convex duality, called λ-duality, by considering functions $f$ such that $\frac{1}{\lambda}(e^{\lambda f} - 1)$ is convex for $\lambda \neq 0$. They defined the λ-conjugate transform as a particular case of the c-transform [24] and studied the information geometry of the induced λ-logarithmic divergences. The λ-duality yields a generalization of exponential and mixture families to λ-exponential and λ-mixture families related to the Rényi divergence.
Finally, certain statistical divergences, called projective divergences, are invariant under rescaling and as such can define dissimilarities between non-normalized densities. For example, the γ-divergences [32] $D_\gamma$ satisfy $D_\gamma(p : q) = D_\gamma(\tilde p : \tilde q)$ (and tend to the KLD when $\gamma \to 0$); the Cauchy–Schwarz divergence [53] is another example.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The author heartily thanks the three reviewers for their helpful comments which led to this improved paper.

Conflicts of Interest

Author Frank Nielsen is employed by Sony Computer Science Laboratories Inc. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016. [Google Scholar]
  2. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  3. Nielsen, F.; Hadjeres, G. Monte Carlo information-geometric structures. In Geometric Structures of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 69–103. [Google Scholar]
  4. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. In Lecture Notes-Monograph Series; Cornell University: Ithaca, NY, USA, 1986; Volume 9. [Google Scholar]
  5. Scarfone, A.M.; Wada, T. Legendre structure of κ-thermostatistics revisited in the framework of information geometry. J. Phys. Math. Theor. 2014, 47, 275002. [Google Scholar] [CrossRef]
  6. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef] [PubMed]
  7. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
  8. Cichocki, A.; Amari, S.I. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
  9. Niculescu, C.; Persson, L.E. Convex Functions and Their Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 23, first edition published in 2006. [Google Scholar]
  10. Billingsley, P. Probability and Measure; John Wiley & Sons: Hoboken, NJ, USA, 2017. [Google Scholar]
  11. Barndorff-Nielsen, O. Information and Exponential Families; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  12. Morris, C.N. Natural exponential families with quadratic variance functions. Ann. Stat. 1982, 10, 65–80. [Google Scholar] [CrossRef]
  13. Efron, B. Exponential Families in Theory and Practice; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
  14. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
  15. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
  16. Wainwright, M.J.; Jordan, M.I. Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 2008, 1, 1–305. [Google Scholar]
  17. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A tutorial on energy-based learning. In Predicting Structured Data; University of Toronto: Toronto, ON, USA, 2006; Volume 1. [Google Scholar]
  18. Kindermann, R.; Snell, J.L. Markov Random Fields and Their Applications; American Mathematical Society: Providence, RI, USA, 1980; Volume 1. [Google Scholar]
  19. Dai, B.; Liu, Z.; Dai, H.; He, N.; Gretton, A.; Song, L.; Schuurmans, D. Exponential family estimation via adversarial dynamics embedding. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; Volume 32. [Google Scholar]
  20. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983, 78, 124–130. [Google Scholar] [CrossRef]
  21. Garcia, V.; Nielsen, F. Simplification and hierarchical representations of mixtures of exponential families. Signal Process. 2010, 90, 3197–3212. [Google Scholar] [CrossRef]
  22. Zhang, J.; Wong, T.K.L. λ-Deformed probability families with subtractive and divisive normalizations. In Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 2021; Volume 45, pp. 187–215. [Google Scholar]
  23. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  24. Wong, T.K.L. Logarithmic divergences from optimal transport and Rényi geometry. Inf. Geom. 2018, 1, 39–78. [Google Scholar] [CrossRef]
  25. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
  26. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246. [Google Scholar] [CrossRef]
  27. Amari, S.I. Differential-Geometrical Methods in Statistics, 1st ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 28. [Google Scholar]
  28. Nielsen, F. Statistical divergences between densities of truncated exponential families with nested supports: Duo Bregman and duo Jensen divergences. Entropy 2022, 24, 421. [Google Scholar] [CrossRef] [PubMed]
  29. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66. [Google Scholar] [CrossRef]
  30. Wainwright, M.J.; Jaakkola, T.S.; Willsky, A.S. A new class of upper bounds on the log partition function. IEEE Trans. Inf. Theory 2005, 51, 2313–2335. [Google Scholar] [CrossRef]
  31. Hyvärinen, A.; Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
  32. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  33. Eguchi, S.; Komori, O. Minimum Divergence Methods in Statistical Machine Learning; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  34. Kolmogorov, A. Sur la Notion de la Moyenne; Cold Spring Harbor Laboratory: Cold Spring Harbor, NY, USA, 1930. [Google Scholar]
  35. Komori, O.; Eguchi, S. A unified formulation of k-Means, fuzzy c-Means and Gaussian mixture model by the Kolmogorov–Nagumo average. Entropy 2021, 23, 518. [Google Scholar] [CrossRef]
  36. Aczél, J. A generalization of the notion of convex functions. Det K. Nor. Vidensk. Selsk. Forh. Trondheim 1947, 19, 87–90. [Google Scholar]
  37. Nielsen, F.; Nock, R. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127. [Google Scholar] [CrossRef]
  38. Bauschke, H.H.; Goebel, R.; Lucet, Y.; Wang, X. The proximal average: Basic theory. SIAM J. Optim. 2008, 19, 766–785. [Google Scholar] [CrossRef]
  39. Rockafellar, R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967, 19, 200–205. [Google Scholar] [CrossRef]
  40. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007. [Google Scholar]
  41. Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985, 15, 341–391. [Google Scholar] [CrossRef]
  42. Rockafellar, R. Convex Analysis; Princeton Landmarks in Mathematics and Physics; Princeton University Press: Princeton, NJ, USA, 1997. [Google Scholar]
  43. Yoshizawa, S.; Tanabe, K. Dual differential geometry associated with the Kullbaek-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 1999, 35, 113–137. [Google Scholar] [CrossRef]
  44. Hougaard, P. Convex Functions in Exponential Families; Department of Mathematical Sciences, University of Copenhagen: Copenhagen, Denmark, 1983. [Google Scholar]
  45. Brekelmans, R.; Nielsen, F. Variational representations of annealing paths: Bregman information under monotonic embeddings. Inf. Geom. 2024. [Google Scholar] [CrossRef]
  46. Amari, S.I. α-Divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
  47. Hennequin, R.; David, B.; Badeau, R. Beta-divergence as a subclass of Bregman divergence. IEEE Signal Process. Lett. 2010, 18, 83–86. [Google Scholar] [CrossRef]
  48. Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by Beta-divergence. Entropy 2013, 15, 4732–4747. [Google Scholar] [CrossRef]
  49. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  50. Frongillo, R.; Reid, M.D. Convex foundations for generalized MaxEnt models. AIP Conf. Proc. 2014, 1636, 11–16. [Google Scholar]
  51. Ishige, K.; Salani, P.; Takatsu, A. Hierarchy of deformations in concavity. Inf. Geom. 2022, 7, 251–269. [Google Scholar] [CrossRef]
  52. Zhang, J.; Wong, T.K.L. λ-Deformation: A canonical framework for statistical manifolds of constant curvature. Entropy 2022, 24, 193. [Google Scholar] [CrossRef] [PubMed]
  53. Jenssen, R.; Principe, J.C.; Erdogmus, D.; Eltoft, T. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. 2006, 343, 614–629. [Google Scholar] [CrossRef]
Figure 1. Strictly log-convex functions form a proper subset of strictly convex functions.
Figure 2. The canonical divergence D and dual canonical divergence D * on a dually flat space M equipped with potential functions F and F * can be viewed as single-parameter contrast functions on the product manifold M × M : the divergence D can be expressed using either the θ × θ -coordinate system as a Bregman divergence or the mixed θ × η -coordinate system as a Fenchel–Young divergence. Similarly, the dual divergence D * can be expressed using either the η × η -coordinate system as a dual Bregman divergence or the mixed η × θ -coordinate system as a dual Fenchel–Young divergence.
Figure 3. Statistical divergences between normalized p θ and unnormalized p ˜ θ densities of an exponential family E with corresponding divergences between their natural parameters. Without loss of generality, we consider a natural exponential family (i.e., t ( x ) = x and k ( x ) = 0 ) with cumulant function F and partition function Z, with J F and B F respectively denoting the Jensen and Bregman divergences induced by the generator F. The statistical divergences D R , α and D B , α denote the Rényi α -divergences and skewed α -Bhattacharyya distances, respectively. The superscript “s” indicates rescaling by the multiplicative factor $\frac{1}{\alpha(1-\alpha)}$, while the superscript “*” denotes the reverse divergence obtained by swapping the parameter order.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

