

Information Geometry (2023) 7 (Suppl 1):S109–S130

https://doi.org/10.1007/s41884-022-00078-6

SURVEY PAPER

Affine statistical bundle modeled on a Gaussian Orlicz–Sobolev space

Giovanni Pistone1

Received: 14 October 2022 / Revised: 10 November 2022 / Accepted: 12 November 2022 / Published online: 24 November 2022
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2022

Abstract
The dually flat structure of statistical manifolds can be derived in a non-parametric
way from a particular case of affine space defined on a qualified set of probability
measures. The statistically natural displacement mapping of the affine space depends
on the notion of Fisher’s score. The model space must be carefully defined if the state
space is not finite. Among various options, we discuss how to use Orlicz–Sobolev
spaces with Gaussian weight. Such a fully non-parametric set-up provides tools to
discuss intrinsically infinite-dimensional evolution problems.

Keywords Information geometry · Gaussian Orlicz–Sobolev space · Statistical bundle · Exponential manifold · Dually flat affine manifold

1 Non-parametric statistical bundle

Professor S.-I. Amari clearly stated in a 1987 conference paper [3] the notion of a
non-parametric fiber bundle in Information Geometry (IG). He says
A fibre bundle is constructed on a finite-dimensional parametric statistical model
with a Hilbert space as the fibre space. The Hilbert space represents the tangent
directions of the set of probability distributions in the function space. A pair of
dual linear connections is introduced in the Hilbert bundle.
A related journal paper [6] shows the applications to the statistics of semi-parametric
statistical models. Kass and Vos, in their 1997 monograph [26, 10.3], further explain
the construction by describing the tangent fiber T_pM of a statistical model M as a
vector space of random variables:

Communicated by Nihat Ay.

B Giovanni Pistone
giovanni.pistone@carloalberto.org

1 de Castro Statistics, Collegio Carlo Alberto, Piazza Vincenzo Arbarello 8, 10122 Turin, Italy


The tangent space T_pM is an inner product space of random variables having zero expectation and finite variance that describe M at p. Finite-dimensional vector bundles enlarge T_pM by adding other random variables, each of which has mean zero and finite variance. The Hilbert bundle carries this process to its conclusion by enlarging T_pM to the space of all random variables having zero mean and finite variance.

I started to work on a non-parametric version of professor Amari's ideas in the early nineties [48]. In doing that, the self-imposed prescription to avoid parameters led to the use of differential geometry as presented by Bourbaki [12] and Lang [27]. Specifically, the approach involves the expression of the tangent spaces as vector spaces of random variables. I give below an updated summary of such a construction.
If μ is any probability measure, μ ∈ M₁, we define as fiber at μ the Hilbert space H_μ of μ-square-integrable random variables with zero μ-mean, H_μ = L²₀(μ). The set of all couples (μ, u) with μ ∈ M₁ and u ∈ H_μ is the (maximal) (statistical) Hilbert bundle HM₁. The restriction to μ-centered random variables is intended to reflect the affine constraint μ(X) = 1 satisfied by M₁ as a subset of the vector space S of finite signed measures.
In which sense does each fiber H_μ contain the tangents to statistical models? Clearly, M₁ is a convex set whose tangent space is the set S₀ of signed finite measures with total charge 0. But there is another sense to consider. Assume the 1-dimensional statistical model θ ↦ μ(θ) is set-wise continuously differentiable, that is, for each measurable set A ∈ X the real map θ ↦ μ(A; θ) is continuously differentiable with derivative θ ↦ μ̇(A; θ). Assume that θ ↦ μ̇(θ) is a curve in the Banach space S₀ of signed finite measures with 0 total charge and total variation norm. Clearly, μ̇(A; θ) = 0 whenever μ(A; θ) = 0, hence μ̇(θ) is absolutely continuous with respect to μ(θ). See this argument in [9] and, in full detail, in H.V. Lê's contribution [28].
The Radon–Nikodym derivative ℓ(θ) = dμ̇(θ)/dμ(θ) is such that ∫ ℓ(θ) dμ(θ) = ∫ dμ̇(θ) = 0. If moreover ∫ ℓ(θ)² dμ(θ) = ∫ ℓ(θ) dμ̇(θ) is finite, then ℓ(θ) ∈ H_θ = H_{μ(θ)}. Under the classical statistical assumption of a regular likelihood, μ(θ) = p_θ · μ, p_θ > 0, it holds

ℓ(θ) = ṗ_θ / p_θ = (d/dθ) log p_θ.

That is, ℓ(θ) is Fisher's score of the statistical model, a statistically natural expression of the rate of variation [23, 4.2]. In this sense, it is an expression of the tangent vector.

Conversely, let be given an element u ∈ H_μ such that ∫ e^{θu} dμ < ∞ for all θ in a neighborhood of 0. The statistical model μ(θ) ∝ e^{θu} · μ is such that the Fisher score at 0 is u. This, again, is quite a classical argument. It depends only on the classical theory of exponential families, see [14] or [23, 5.5].
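The following minimal numerical sketch is not part of the paper; it assumes a finite sample space and uses Python with NumPy. It illustrates the two directions of the argument above: the Fisher score at θ = 0 of the tilted model μ(θ) ∝ e^{θu} · μ is the centered random variable u itself.

import numpy as np

rng = np.random.default_rng(0)
mu = rng.random(6); mu /= mu.sum()            # reference probability on 6 points
u = rng.normal(size=6); u -= (u * mu).sum()   # centered random variable, u in H_mu

def tilt(theta):
    # probability masses of mu(theta), proportional to exp(theta*u)*mu
    w = np.exp(theta * u) * mu
    return w / w.sum()

h = 1e-6
score_at_0 = (np.log(tilt(h)) - np.log(tilt(-h))) / (2 * h)   # d/dtheta log p_theta at 0
print(np.allclose(score_at_0, u, atol=1e-5))                  # True: the score at 0 is u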
Let us fix a reference measure μ and restrict our attention to the set P · μ, where P is the set of strictly positive probability densities. If the sample space is finite, all the conditions outlined above are met. The Hilbert bundle is an expression of the tangent bundle where Fisher's score expresses the velocity of one-dimensional curves.

If the sample space is not finite, one must introduce restrictions on both the set of probabilities and the random variables in the fiber. One option is to restrict to a convenient subset of positive densities E ⊂ P in such a way that there is a vector space


B(E) ⊂ ∩_{μ∈E} L²(μ) that contains the scores of all one-dimensional statistical models. Then, each fiber is defined to be

S_pE = { u ∈ B(E) | ∫ u p dμ = 0 }.

For example, assume the sample space is a compact set K and μ is a diffuse measure. Let E be the set of all continuous, strictly positive probability densities. E is an open convex set of C(K). If θ ↦ p_θ is a differentiable curve in E, then the score (d/dθ) log p_θ is a curve in B(E) = C(K) such that for each θ it holds ∫ ((d/dθ) log p_θ) p_θ dμ = 0. And conversely, for each u ∈ B(E) with ∫ u p_θ dμ = 0, the curve t ↦ p_t ∝ e^{tu} p_θ has values in E and its score at t = 0 equals u. See [10] for an approach to non-parametric IG based on smooth densities.
A similar but different approach will be used below. Both depend on the idea of
finding a convenient class of random variables u such that the exponential family
proportional to eθu is conveniently defined.
For my purpose, the most important of professor Amari’s contributions has been
the definition of a couple of affine connections in a fully nonparametric way, [3, Th. 1
and 2]. In the notations used here, he considers two types of transport between fibers,

ᵉU_ν^μ : S_μE ∋ u ↦ u − ∫ u dν ∈ S_νE,   (1)

ᵐU_ν^μ : S_μE ∋ u ↦ (dμ/dν) u ∈ S_νE,   (2)

and proves the duality result

⟨ᵉU_ν^μ u, v⟩_ν = ⟨u, ᵐU_μ^ν v⟩_μ.

Also, there is a transport of the inner product from one fiber to the other,

⟨u, v⟩_μ = ⟨ᵉU_ν^μ u, ᵐU_ν^μ v⟩_ν = ⟨ᵐU_ν^μ u, ᵉU_ν^μ v⟩_ν.

The equations above clearly define a geometry of probability measures that is related
but different from the previously studied Riemannian geometry based on the notion
of Fisher-Rao information matrix taken as an expression of an inner product between
tangent vectors. This new geometry originated, at least in the statistical community,
with the idea of defining the geometry of curved exponential models as embedded in
a larger exponential family [2, 18, 21, 22].
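As a sanity check of the transports and of the duality above, the following sketch (an added illustration, not from the paper; finite sample space, with the mixture transport acting as multiplication by the density ratio as in Eq. (2)) verifies the duality numerically.

import numpy as np

rng = np.random.default_rng(1)
mu = rng.random(5); mu /= mu.sum()
nu = rng.random(5); nu /= nu.sum()
u = rng.normal(size=5); u -= (u * mu).sum()   # u in S_mu E
v = rng.normal(size=5); v -= (v * nu).sum()   # v in S_nu E

eU = u - (u * nu).sum()      # exponential transport of u from S_mu E to S_nu E
mV = (nu / mu) * v           # mixture transport of v from S_nu E to S_mu E
lhs = (eU * v * nu).sum()    # <eU u, v>_nu
rhs = (u * mV * mu).sum()    # <u, mU v>_mu
print(np.isclose(lhs, rhs))  # True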
Such a theory has been known for a long time in statistical mechanics. The main
difference is that R. Fisher and other statisticians of the same period used to think about
parsimoniously parameterized models. In contrast, physicists such as Boltzmann and
Gibbs used to think in terms of simple relations between statistical observables. The


exponential family appears as a model with peculiar invariance properties in statistical mechanics.¹

¹ The difference in terminology between Mathematical Statistics and Physics is sometimes confusing. The tutorial [51] and the textbook [8] should be helpful.
Consider, for example, the Hamiltonian H(y, x) = y²/(2m) + V(x), y, x ∈ R. Let the associated flow be (t, y, x) ↦ T_t(y, x) ∈ R², that is,

(d/dt) T_t(y, x) = ωH(T_t(y, x)),   ωH = (−∂H/∂x, ∂H/∂y),
T_0(y, x) = (y, x).

The evolution of a probability measure f · m, m(dy, dx) = dy dx, f > 0, under the action of the flow is

μ(t) = (T_t)_# f · m = (f ∘ S_t) det JS_t · m = f_t · m,   S_t = T_t^{−1},

provided the flow is a global diffeomorphism. The curve t → f t is controlled by the


continuity equation,

(∂/∂t) f_t + ωH · ∇f_t = 0.

The score is dμ̇_t/dμ_t = ḟ_t/f_t = (∂/∂t) log f_t, and the continuity equation can be written

(∂/∂t) log f_t + ωH · ∇ log f_t = 0.
In particular, f_t = f if f is a function of H. Among all invariant probability densities, the curve θ ↦ p_θ = e^{θH}/Z(θ) represents an evolution in the class of invariant probabilities. The score of the model is

(d/dθ) log( e^{θH}/Z(θ) ) = H − (d/dθ) log Z(θ) = H − ∫ H p_θ dm.

In statistical mechanics, the score is interpreted as the fluctuation of the Hamiltonian.


The differential geometry of the dual connections of Nagaoka [7, 8.4] naturally
follows from the abovementioned elements. In particular, the differential notion of
connection can be derived from the notion of parallel transport in an affine setting. In
the non-parametric setting, it is convenient to base the affine structure on a variation
of the original notion of affine space of [55]. Below is a summary of the presentation
in [19].
The word “affine” above refers to the geometrical construction of vectors associated
with displacement according to classical H. Weyl’s axioms of an affine space. Let be
given a set M and a real finite-dimensional vector space V . A displacement mapping
is a mapping

M × M ∋ (P, Q) ↦ −→PQ ∈ V,
such that

1. for each fixed P the partial mapping s_P : Q ↦ −→PQ is 1-to-1 and onto, and
2. the parallelogram law, −→PQ + −→QR = −→PR, holds true.

The structure (M, V, −→) is, by definition, the affine space. The corresponding affine manifold is derived from the atlas of charts s_P : M → V, P ∈ M. Notice that the change of chart is the choice of a new origin. Such a structure supports a full geometrical development; see [37].
Weyl’s axioms suggest the following definition.
Let M be a set and let B_μ, μ ∈ M, be a family of real topological vector spaces. Let (U_ν^μ), ν, μ ∈ M, be a family of isomorphisms U_ν^μ : B_μ → B_ν satisfying the cocycle condition,

AF0  U_μ^μ = I and U_ρ^ν U_ν^μ = U_ρ^μ, where U_ν^μ is the transport from B_μ onto B_ν.
Consider a displacement mapping

S : (ν, μ) ↦ s_ν(μ) ∈ B_ν

defined on a subset of the product space dom(S) ⊂ M × M. Assume

AF1  For each fixed ν the partial mapping M_ν ∋ μ ↦ s_ν(μ) = S(ν, μ) is injective.
AF2  S(μ₁, μ₂) + U_{μ₁}^{μ₂} S(μ₂, μ₃) = S(μ₁, μ₃).

The structure (M, (B_μ)_{μ∈M}, (U_ν^μ)_{μ,ν∈M}, S) is an affine bundle.
The affine bundle provides a family of candidate charts s_ν : M_ν → B_ν, ν ∈ M, for an atlas. Let (M, (B_μ)_{μ∈M}, (U_ν^μ)_{μ,ν∈M}, S) be an affine bundle and assume, moreover, the following smoothness condition.

AF3  For each ν, the image set s_ν(M_ν) is a neighborhood of 0 in B_ν. That is, its interior s_ν(M_ν)° is an open set containing s_ν(ν) = 0.

Define the coordinate domains as U_ν = s_ν^{−1}(s_ν(M_ν)°), so that (s_ν, U_ν, B_ν) is a chart on M. Such a chart is said to have origin ν. Such charts are compatible, and the resulting manifold

M = ( M, (B_μ)_{μ∈M}, (U_ν^μ)_{μ,ν∈M}, (s_μ)_{μ∈M} )

is, by definition, the affine manifold associated with the given affine bundle.
Here is our main instance. Consider the exponential transport of Eq. (1) and define s_p(q) = log(q/p) − ∫ log(q/p) p dm. The parallelogram identity is

( log(q/p) − ∫ log(q/p) p dm )
+ ( log(r/q) − ∫ log(r/q) q dm − ∫ ( log(r/q) − ∫ log(r/q) q dm ) p dm )
= log(r/p) − ∫ log(r/p) p dm.


The inverse of the chart is easily seen to be

s_p^{−1}(u) = exp( u − K_p(u) ) · p,   K_p(u) = log ∫ e^u p dm,   u ∈ B_p.

The dual instance is associated with the mixture transport of Eq. (2) and s_p(q) = q/p − 1. The parallelogram identity is

( q/p − 1 ) + (q/p)( r/q − 1 ) = r/p − 1.
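Both parallelogram identities are easy to verify numerically. The sketch below is an added illustration, not from the paper; it assumes a finite sample space with counting reference measure m and checks the exponential and the mixture cases for three random positive densities.

import numpy as np

rng = np.random.default_rng(2)
p, q, r = (rng.random(7) for _ in range(3))
p, q, r = p / p.sum(), q / q.sum(), r / r.sum()   # three positive probability densities

def s_exp(a, b):                                  # exponential chart centered at a
    w = np.log(b / a)
    return w - (w * a).sum()

def eU(u, target):                                # exponential transport into the fiber at target
    return u - (u * target).sum()

print(np.allclose(s_exp(p, q) + eU(s_exp(q, r), p), s_exp(p, r)))   # True
print(np.allclose((q / p - 1) + (q / p) * (r / q - 1), r / p - 1))  # True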

Given an affine manifold M, the affine bundle is again an affine manifold. In

SM = { (μ, v) | μ ∈ M, v ∈ B_μ }

the equation

SM × SM ∋ ((ν, u), (μ, v)) ↦ ( s_ν(μ), U_ν^μ v ) ∈ B_ν × B_ν

defines a displacement on the bundle. For each ν define the chart

s_ν : SM ∋ (μ, v) ↦ ( s_ν(μ), U_ν^μ v ) ∈ B_ν × B_ν

to define the affine bundle SM as a manifold. Equivalently, we can say that SM is the bundle with trivialization

s_ν : (μ, v) ↦ ( s_ν(μ), U_ν^μ v ).

The affine bundle is a convenient expression of the tangent bundle of the affine manifold if we define the velocity as follows. The velocity of the smooth curve t ↦ γ(t) of the affine manifold M is the curve t ↦ (γ(t), ⋆γ(t)) of the affine bundle whose second component is

⋆γ(t) = lim_{h→0} h^{−1} s_{γ(t)}( γ(t + h) ) = (d/dh) s_{γ(t)}( γ(t + h) ) |_{h=0}.

By assumption AF2 applied to the points γ(t) and γ(t + h), the expression in the chart centered at ν of ⋆γ(t) is U_ν^{γ(t)} ⋆γ(t) = (d/dt) s_ν(γ(t)).
For example, in the exponential manifold, it holds
 
⋆γ(t) = ᵉU_{γ(t)}^p ( (d/dt) ( log(γ(t)/p) − E_p[ log(γ(t)/p) ] ) ) = (d/dt) log γ(t),

so that the (affine) velocity in the exponential manifold equals Fisher's score.


Let F be a section of the affine bundle, that is, μ ↦ (μ, F(μ)) ∈ SM. An integral curve of the section F is a curve t ↦ γ(t) such that ⋆γ(t) = F(γ(t)). A flow of the section F is a mapping

M × I ∋ (ν, t) ↦ γ_t(ν)

such that for each ν the curve t ↦ γ_t(ν) is an integral curve and γ_0(ν) = ν.
The following proposition gives a characterization of affine geodesics. The following statements are equivalent:

1. The curve I ∋ t ↦ γ(t) is auto-parallel, that is, ⋆γ(t) = U_{γ(t)}^{γ(s)} ⋆γ(s), s, t ∈ I.
2. The expression of the curve in each chart is affine.
3. For all s, t it holds γ(t) = s_{γ(s)}^{−1}( (t − s) ⋆γ(s) ).
The acceleration is defined as a velocity in the affine bundle. Consider the curve t ↦ μ(t) with velocity t ↦ ⋆μ(t). The acceleration t ↦ ⋆⋆μ(t) is the velocity of t ↦ (μ(t), ⋆μ(t)),

( ⋆μ(t), ⋆⋆μ(t) ) = lim_{h→0} h^{−1} s_{(μ(t), ⋆μ(t))}( (μ(t + h), ⋆μ(t + h)) ).

Especially, for all μ ∈ M,

⋆⋆μ(t) = U_{μ(t)}^μ (d/dt) U_μ^{μ(t)} ⋆μ(t).
This equation shows that a curve with 0 acceleration is auto-parallel.
In the exponential example, the acceleration is computed as follows:

⋆⋆p(t) = ᵉU_{p(t)}^p (d/dt) ᵉU_p^{p(t)} ⋆p(t) = ᵉU_{p(t)}^p (d/dt) ( ṗ(t)/p(t) − ∫ (ṗ(t)/p(t)) p dm )
= p̈(t)/p(t) − ( ṗ(t)/p(t) )² − ∫ ( p̈(t)/p(t) − ( ṗ(t)/p(t) )² ) p(t) dm
= p̈(t)/p(t) − ( ṗ(t)/p(t) )² + ∫ ( ṗ(t)/p(t) )² p(t) dm.
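The computation above can be checked numerically. In the following sketch (an added illustration, not from the paper; finite sample space, finite differences in t) the one-parameter exponential family p(t) = e^{tu − K(t)} · p has vanishing exponential acceleration, consistently with the characterization of geodesics given above.

import numpy as np

rng = np.random.default_rng(3)
p = rng.random(8); p /= p.sum()
u = rng.normal(size=8); u -= (u * p).sum()

def pt(t):
    w = np.exp(t * u) * p
    return w / w.sum()                       # p(t) = exp(t*u - K(t)) * p

t, h = 0.3, 1e-4
pdot = (pt(t + h) - pt(t - h)) / (2 * h)
pddot = (pt(t + h) - 2 * pt(t) + pt(t - h)) / h**2
acc = pddot / pt(t) - (pdot / pt(t))**2 + ((pdot / pt(t))**2 * pt(t)).sum()
print(np.allclose(acc, 0.0, atol=1e-4))      # True: zero acceleration along the e-family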

2 Gaussian Orlicz–Sobolev model space

Above, we have discussed in general terms how to define an affine Banach manifold.
We now proceed to instantiate the general formalism into a specific case of Gaussian
space. In doing that, the usual toolbox of IG should be extended with other analytical
notions. A general reference is [11]. We now restrict our attention to a particular
instance of model Banach space. Precisely, we will use the generalization of Lebesgue
spaces called Orlicz spaces. General references are the monographs [36, Ch. II] and


[1, Ch. VII]. The basic technical tools are the notion of conjugation between convex
functions and the analysis of the Gaussian space. I will use my conference paper [46].

2.1 Orlicz spaces

Assume φ ∈ C[0, +∞[ is null at 0, φ(0) = 0, strictly increasing, and lim_{u→+∞} φ(u) = +∞. Let Φ be its primitive function with Φ(0) = 0. We call such a function a Young function. The inverse function ψ = φ^{−1} has the same properties as φ, so that its primitive Ψ with Ψ(0) = 0 is again a Young function. The couple (Φ, Ψ) is a couple of conjugate Young functions. The relation is symmetric, and we write both Ψ = Φ∗ and Φ = Ψ∗. The Young inequality holds,

Φ(x) + Ψ(y) ≥ xy,   x, y ≥ 0,

and the Legendre equality holds,

Φ(x) + Ψ(φ(x)) = xφ(x),   x ≥ 0.

Here are my notations for the specific cases we are going to use:

Φ(x) = x^p/p,   Ψ(y) = y^q/q,   p, q > 1,   1/p + 1/q = 1;   (3)
exp₂(x) = eˣ − 1 − x,   (exp₂)∗(y) = (1 + y) log(1 + y) − y;   (4)
cosh₂(x) = cosh x − 1,   (cosh₂)∗(y) = ∫₀^y sinh⁻¹(v) dv;   (5)
gauss₂(x) = exp(x²/2) − 1.   (6)

Given a Young function Φ and a probability measure μ, the Orlicz space L^Φ(μ) is the Banach space whose closed unit ball is { f ∈ L⁰(μ) | ∫ Φ(|f|) dμ ≤ 1 }. The corresponding Minkowski norm is usually called the Luxemburg norm,

‖f‖_{L^Φ(μ)} = inf { α > 0 | ∫ Φ(α⁻¹|f|) dμ ≤ 1 }.

The Young inequality provides a separating duality ⟨u, v⟩_μ = ∫ uv dμ of L^Φ(μ) and L^{Φ∗}(μ) such that ⟨u, v⟩_μ ≤ 2 ‖u‖_{L^Φ(μ)} ‖v‖_{L^{Φ∗}(μ)}. The dual norm is called the Orlicz norm and is equivalent to the Luxemburg norm.
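The Luxemburg norm is easy to approximate numerically because α ↦ ∫ Φ(α⁻¹|f|) dμ is decreasing in α. The sketch below is an added illustration, not from the paper; it assumes μ = γ, the standard Gaussian on R, takes Φ = cosh₂, uses Gauss–Hermite quadrature, and assumes the bracket [0.1, 100] contains the norm.

import numpy as np

nodes, weights = np.polynomial.hermite_e.hermegauss(80)
weights = weights / np.sqrt(2 * np.pi)              # E[g(Z)] ~ sum_i weights_i * g(nodes_i)

def gauss_mean(g):
    return np.sum(weights * g(nodes))

def luxemburg_norm(f, phi, lo=0.1, hi=100.0, tol=1e-10):
    # bisection on alpha: smallest alpha with E[phi(|f|/alpha)] <= 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gauss_mean(lambda x: phi(np.abs(f(x)) / mid)) > 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return hi

cosh2 = lambda t: np.cosh(t) - 1.0
print(luxemburg_norm(lambda x: x, cosh2))   # about 0.849 = 1/sqrt(2 log 2)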
The domination relation between Young functions implies continuous injection properties for the corresponding Orlicz spaces. We say that Φ₂ eventually dominates Φ₁, written Φ₁ ≺ Φ₂, if there is a constant κ such that Φ₁(x) ≤ Φ₂(κx) for all x larger than some x̄. As, in our case, μ is a probability measure, the continuous embedding L^{Φ₂}(μ) → L^{Φ₁}(μ) holds if, and only if, Φ₁ ≺ Φ₂. See the proof in [1, Th. 8.2]. If Φ₁ ≺ Φ₂, then (Φ₂)∗ ≺ (Φ₁)∗. Looking at the examples above, exp₂ (4) and cosh₂ (5) are equivalent; they both are eventually dominated by gauss₂ (6) and eventually dominate all the powers (3).
A special case occurs when there exists a function C such that Φ(ax) ≤ C(a)Φ(x) for all a ≥ 0. This is true, for example, for a power function and in the case of the functions (exp₂)∗ and (cosh₂)∗. In such a case, the conjugate space is the dual space and bounded functions form a dense set.
The spaces corresponding to case (3) are ordinary Lebesgue spaces. The cases (4)
and (5) provide isomorphic Banach spaces, which are of special interest to us as they
provide the model spaces for our non-parametric version of IG. In fact, a random
variable u belongs to L exp2 (μ) if, and only if, the exponential family pθ ∝ eθu
is defined in a neighborhood of θ = 0. In the conjugate space, a strictly positive
probability density f has finite entropy if, and only if, the random variable v = f − 1
belongs to L (exp2 )∗ (μ).

2.2 Sub-exponential and sub-Gaussian random variables

Another important feature of the class L cosh2 (μ) is the following. Such a class coin-
cides with the class of sub-exponential random variables, that is, those for which there
exist constants C1 , C2 > 0 such that the large deviations admit an exponential bound

Pμ (| f | ≥ t) ≤ C1 exp (−C2 t) , t ≥ 0 .

Sub-exponential random variables are of special interest in statistical applications


because they admit explicit exponential bounds in the Law of Large Numbers. There
is a large amount of literature on this subject; see, for example, [15, 52, 54].
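A quick Monte Carlo illustration (added here, not from the paper): the square of a standard Gaussian variable is sub-exponential, and its empirical tail is indeed dominated by an exponential bound of the form C₁ exp(−C₂ t) with C₂ = 1/2.

import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=2_000_000)
for t in (5.0, 10.0, 15.0):
    emp = np.mean(z**2 >= t)
    print(f"t={t:4.1f}  P(Z^2 >= t) = {emp:.2e}   exp(-t/2) = {np.exp(-t / 2):.2e}")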
Random variables whose square is sub-exponential are called sub-Gaussian. For each Young function Φ, the function Φ₂(x) = Φ(x²) is again a Young function, and ‖f‖_{L^{Φ₂}(μ)} ≤ λ if, and only if, ‖ |f|² ‖_{L^Φ(μ)} ≤ λ². We denote the resulting space by L²_Φ(μ). For example, gauss₂ and (cosh₂)₂ are ≺-equivalent, hence the isomorphism L^{gauss₂}(μ) ↔ L²_{cosh₂}(μ).
As an application, consider that for each increasing convex Φ it holds Φ(fg) ≤ Φ((f² + g²)/2) ≤ (Φ(f²) + Φ(g²))/2. It follows that when the L²_Φ(μ)-norm of f and of g is bounded by one, the L^Φ(μ)-norms of f, g, and fg are all bounded by one. The need to control the product of two random variables in L^{cosh₂}(μ) appears, for example, in the study of the covariant derivatives of the statistical bundle, see [20, 24, 31, 45].

2.3 Calculus of the Gaussian space

From now on, the base probability space is the Gaussian probability space (Rⁿ, γ), γ(z) = (2π)^{−n/2} exp(−|z|²/2). Let us recall a few simple facts about the analysis of the Gaussian space; see [33, Ch. V].

Let us denote by C^k_poly(Rⁿ), k = 0, 1, ..., the vector space of functions that are differentiable up to order k and which are bounded, together with all derivatives, by a polynomial. This class of functions is dense in L²(γ). For each couple f, g ∈ C¹_poly(Rⁿ), we have

∫ f(x) ∂_i g(x) γ(x) dx = ∫ δ_i f(x) g(x) γ(x) dx,

where the divergence operator δ_i is defined by δ_i f(x) = x_i f(x) − ∂_i f(x). The vector case is

∫ ∇f(x) · ∇g(x) γ(x) dx = ∫ f(x) δ·∇g(x) γ(x) dx,   f, g ∈ C²_poly(Rⁿ),

with δ·∇g(x) = x·∇g(x) − Δg(x).
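Both integration-by-parts identities above are easy to check numerically; the following sketch (an added illustration, not from the paper: one-dimensional case, polynomial test functions, Gauss–Hermite quadrature with NumPy) does so.

import numpy as np

x, w = np.polynomial.hermite_e.hermegauss(60)
w = w / np.sqrt(2 * np.pi)                  # weights for expectations under N(0, 1)
E = lambda vals: np.sum(w * vals)

f, df = lambda t: t**3, lambda t: 3 * t**2
g, dg, d2g = lambda t: t**2 + t, lambda t: 2 * t + 1, lambda t: 2.0 * np.ones_like(t)

# E[f g'] = E[(x f - f') g]  and  E[f' g'] = E[f (x g' - g'')]
print(np.isclose(E(f(x) * dg(x)), E((x * f(x) - df(x)) * g(x))))     # True
print(np.isclose(E(df(x) * dg(x)), E(f(x) * (x * dg(x) - d2g(x)))))  # True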


Hermite polynomials H_α = δ^α 1 provide an orthogonal basis of L²(γ) such that ∂_i H_α = α_i H_{α−e_i}, where e_i is the i-th element of the standard basis of Rⁿ. In turn, this provides a way to prove that there is a closure of both operators ∂_i and δ_i on a domain which is a Hilbert subspace of L²(γ). This space is denoted by D² in [33]. Moreover, the closure of ∂_i is the infinitesimal generator of the translation operator [11, 34]. The space D² is a Sobolev space with Gaussian weight based on the L² norm [1].
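As a complement (an added numerical check, not from the paper), the following one-dimensional sketch verifies the orthogonality relation E[He_m He_n] = n! δ_{mn} and the lowering relation He_n' = n He_{n−1}, which is the one-dimensional case of ∂_i H_α = α_i H_{α−e_i}.

import numpy as np
from numpy.polynomial.hermite_e import HermiteE
from math import factorial

x, w = np.polynomial.hermite_e.hermegauss(40)
w = w / np.sqrt(2 * np.pi)
He = [HermiteE.basis(n) for n in range(6)]

gram = np.array([[np.sum(w * He[m](x) * He[n](x)) for n in range(6)] for m in range(6)])
print(np.allclose(gram, np.diag([factorial(n) for n in range(6)])))               # True
print(all(np.allclose(He[n].deriv()(x), n * He[n - 1](x)) for n in range(1, 6)))  # True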

2.4 Exponential statistical bundle

We refer to Sect. 1 and to [43, 44] for the definition of the maximal exponential manifold E(γ) and of the statistical bundle SE(γ). Below we report the results that are necessary in the context of the present paper.

A key result is the following proposition; see [16, 17, 48] and [49, Th. 4.7].

For all p, q ∈ E(γ) it holds q = e^{u − K_p(u)} · p, where u ∈ L^{cosh₂}(γ), E_p[u] = 0, and u belongs to the interior of the proper domain of the convex function K_p. This property is equivalent to any of the following:

1. p and q are connected by an open exponential arc;
2. L^{cosh₂}(p) = L^{cosh₂}(q) and the norms are equivalent;
3. p/q ∈ ∪_{a>1} L^a(q) and q/p ∈ ∪_{a>1} L^a(p).
Item 2 ensures that all the fibers of the statistical bundle, namely S p E (γ ), p ∈
E (γ ), are isomorphic. Item 3 gives an explicit description of the exponential manifold.
For example, let p be a positive probability density with respect to γ, and take q = 1 and a = 2. Then a sufficient condition for p ∈ E(γ) is

∫ p(x)² γ(x) dx < ∞   and   ∫ (1/p(x)) γ(x) dx < ∞.
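For instance (a worked example added here, not taken from the paper), if P = N(m, 1), then p = dP/dγ = e^{mx − m²/2}, and both ∫ p(x)² γ(x) dx = e^{m²} and ∫ (1/p(x)) γ(x) dx = e^{m²} are finite, so every such Gaussian translate satisfies the sufficient condition and belongs to E(γ).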

2.5 Orlicz spaces with derivatives

By replacing the L²-norm with a cosh₂-Orlicz norm, one obtains a set-up for IG [30, 44]. Precisely, we have exponential families with weakly differentiable densities and a Gaussian reference probability measure. The main outcome is the possibility of discussing topics related to the analytical picture of IG.
Let us start with a class of inequalities related to the classical Gauss–Poincaré
inequality,
∫ ( f(x) − ∫ f(y) γ(y) dy )² γ(x) dx ≤ ∫ |∇f(x)|² γ(x) dx,

where f ∈ C¹_poly(Rⁿ). See a proof in [38, 1.4]. In terms of norms, the inequality above is equivalent to ‖ f − f̄ ‖_{L²(γ)} ≤ ‖ |∇f| ‖_{L²(γ)}, where f̄ = ∫ f(y) γ(y) dy.
For example, if p ∈ C²_poly is a probability density with respect to γ, then the χ²-divergence of P = p·γ from γ is bounded by

D_{χ²}(P|γ) = ∫ ( p(x) − 1 )² γ(x) dx ≤ ∫ ( δ·∇p(x) )² γ(x) dx.
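As a concrete check of the Gauss–Poincaré inequality, the following sketch (an added illustration, not from the paper; one-dimensional, Gauss–Hermite quadrature) evaluates both sides for the test function f(x) = x³ − x.

import numpy as np

x, w = np.polynomial.hermite_e.hermegauss(60)
w = w / np.sqrt(2 * np.pi)
E = lambda v: np.sum(w * v)

f = x**3 - x
var_f = E(f**2) - E(f)**2              # left-hand side, Var_gamma(f)
grad_sq = E((3 * x**2 - 1)**2)         # right-hand side, E_gamma[|f'|^2]
print(var_f, grad_sq, var_f <= grad_sq)   # approximately 10.0, 22.0, True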

Generalizations of the Gauss–Poincaré inequality follow from the properties of the Ornstein–Uhlenbeck (OU) semigroup

P_t f(x) = ∫ f( e^{−t}x + √(1 − e^{−2t}) y ) γ(y) dy,   t ≥ 0,  f ∈ C^k_poly(Rⁿ),

see [33, V-1.5] and [38, 1.3]. Notice that the OU semigroup interpolates between P₀f = f and P_∞f = f̄. If X, Y are independent standard Gaussian random variables in Rⁿ, then

X_t = e^{−t}X + √(1 − e^{−2t}) Y,   Y_t = √(1 − e^{−2t}) X − e^{−t}Y

are independent standard Gaussian random variables for all t ≥ 0. By the change of variable (X, Y) → (X_t, Y_t) and Jensen's inequality, it follows for each convex Φ that

∫ Φ( P_t f(x) ) γ(x) dx ≤ ∫ Φ( f(x) ) γ(x) dx.

That is, for all t ≥ 0, the mapping f ↦ P_t f is non-expansive for the norm of each Orlicz space L^Φ(γ).
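A rough numerical illustration of the OU semigroup and of its non-expansiveness (added here, not from the paper; one-dimensional, with Φ = cosh −1 and the test function f(x) = x + sin x, chosen so that all the expectations are finite):

import numpy as np

x, w = np.polynomial.hermite_e.hermegauss(60)
w = w / np.sqrt(2 * np.pi)
E = lambda v: np.sum(w * v)

def Pt(t, f, s):
    # P_t f evaluated at the points s; the y-integral is done by quadrature
    a, b = np.exp(-t), np.sqrt(1 - np.exp(-2 * t))
    return np.array([E(f(a * si + b * x)) for si in s])

f = lambda s: s + np.sin(s)
Phi = lambda s: np.cosh(s) - 1.0
for t in (0.0, 0.5, 2.0):
    print(t, E(Phi(Pt(t, f, x))))   # non-increasing in t, all bounded by E[Phi(f(Z))]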
For all Φ : R → R convex and all f ∈ C¹_poly(Rⁿ), it holds

∫ Φ( f(x) − ∫ f(y) γ(y) dy ) γ(x) dx
≤ ∫∫ Φ( (π/2) ∇f(x) · y ) γ(x) γ(y) dx dy
= (1/√(2π)) ∫∫ Φ( (π/2) |∇f(x)| z ) e^{−z²/2} γ(x) dz dx
= ∫ Φ̄( |∇f(x)| ) γ(x) dx,   (7)

where Φ̄ is the convex function

Φ̄(a) = ∫ Φ( (π/2) a z ) γ(z) dz.

The first example of convex function is Φ(s) = eˢ, with Φ̄(a) = exp(π²a²/8), so that the inequality applied to (2κ/π) f becomes

∫ exp( (2κ/π)( f(x) − f̄ ) ) γ(x) dx ≤ ∫ exp( (κ²/2) |∇f(x)|² ) γ(x) dx.   (8)

The RHS is finite if the function f is Lipschitz with norm κ⁻¹.


The first case of bound for Orlicz norms we is the (s) = s 2 p , p > 1/2. In such a
case,
π 2p
(a) = m(2 p) a 2 p ,
2
where m(2 p) is the 2 p-moment of the standard Gaussian distribution. It follows that
  
  π
f − f (y) γ (y) dy  |∇ f | L 2 p (γ ) .
 2 p ≤ 2 (m(2 p))
1/2 p

L (γ )

The cases Φ(a) = a^{2p} are special in that we can use in the proof the multiplicative property Φ(ab) = Φ(a)Φ(b). The argument generalizes to the case where the convex function is a Young function Φ whose increase is controlled through a function C, Φ(uv) ≤ C(u)Φ(v), and such that there exists a κ > 0 for which

∫ C( (π/2) κ u ) γ(u) du ≤ 1,

so that the function Φ̄ of Eq. (7) satisfies

Φ̄(κa) = ∫ Φ( (π/2) κ a z ) γ(z) dz ≤ ( ∫ C( (π/2) κ z ) γ(z) dz ) Φ(a) ≤ Φ(a).

By using this bound in Eq. (7), we get

∫ Φ( κ( f(x) − ∫ f(y) γ(y) dy ) ) γ(x) dx ≤ ∫ Φ( |∇f(x)| ) γ(x) dx.

Assume now that ‖ |∇f| ‖_{L^Φ(γ)} ≤ 1, so that the right-hand side, and hence the left-hand side, does not exceed 1. Then ‖ κ( f − f̄ ) ‖_{L^Φ(γ)} ≤ 1, which, in turn, implies the inequality

‖ f − f̄ ‖_{L^Φ(γ)} ≤ κ⁻¹ ‖ |∇f| ‖_{L^Φ(γ)}.


Of particular interest is the case of the Young function Φ = cosh −1, for which there is no such bound. Instead, we use Eq. (8) with κ and −κ to get

∫ (cosh −1)( (2κ/π)( f(x) − f̄ ) ) γ(x) dx ≤ ∫ gauss₂( κ |∇f(x)| ) γ(x) dx.   (9)

Now, if κ = ‖ |∇f| ‖_{L^{gauss₂}(γ)}^{−1}, then the RHS, and hence the LHS, is smaller than or equal to 1, so that ‖ (2κ/π)( f − f̄ ) ‖_{L^{cosh −1}(γ)} ≤ 1. It follows that

‖ f − f̄ ‖_{L^{cosh −1}(γ)} ≤ (π/2) ‖ |∇f| ‖_{L^{gauss₂}(γ)}.

In the following proposition, we summarize the inequalities proved so far.

There exist constants C₁, C₂(p), C₃ such that for all f ∈ C¹_poly(Rⁿ) the following inequalities hold:

‖ f − ∫ f(y) γ(y) dy ‖_{L^{(exp₂)∗}(γ)} ≤ C₁ ‖ |∇f| ‖_{L^{(exp₂)∗}(γ)},   (10)
‖ f − ∫ f(y) γ(y) dy ‖_{L^{2p}(γ)} ≤ C₂(p) ‖ |∇f| ‖_{L^{2p}(γ)},   p > 1/2,   (11)
‖ f − ∫ f(y) γ(y) dy ‖_{L^{cosh₂}(γ)} ≤ C₃ ‖ |∇f| ‖_{L^{gauss₂}(γ)}.   (12)

Other equivalent norms could be used in the inequalities above. For example, L^{(exp₂)∗}(γ) ↔ L^{(cosh −1)∗}(γ) and L^{gauss₂}(γ) ↔ L²_{cosh −1}(γ).
We now consider a further set of inequalities based on the use of the infinitesimal generator −δ·∇ of the OU semigroup [38, 1.3.7]. We have, for all f ∈ C²_poly(Rⁿ), that

f(x) − f̄ = − ∫₀^∞ (d/dt) P_t f(x) dt = ∫₀^∞ δ·∇P_t f(x) dt.   (13)

Note that

∇P_t f(x) = ∇ ∫ f( e^{−t}x + √(1 − e^{−2t}) y ) γ(y) dy = e^{−t} ∫ ∇f( e^{−t}x + √(1 − e^{−2t}) y ) γ(y) dy = e^{−t} P_t ∇f(x),

so that

P_t δ·∇f(x) = δ·∇P_t f(x) = e^{−t} δ·P_t ∇f(x).


Now, Eq. (13) becomes

f(x) − f̄ = ∫₀^∞ e^{−t} δ·P_t ∇f(x) dt.   (14)

As ∫ δ·∇f(x) γ(x) dx = 0, the covariance of f, g ∈ C⁰_poly(Rⁿ) is

Cov_γ(f, g) = ∫ ( f(x) − f̄ ) g(x) γ(x) dx = ∫ ( f(x) − f̄ )( g(x) − ḡ ) γ(x) dx.

It follows that for all f, g ∈ C²_poly(Rⁿ) we derive from Eq. (14)

Cov_γ(f, g) = ∫₀^∞ e^{−t} ( ∫ P_t ∇f(x) · ∇g(x) γ(x) dx ) dt.   (15)

We use here a result of [44, Prop. 5]. Let |·|₁ and |·|₂ be two norms on Rⁿ such that |x·y| ≤ |x|₁|y|₂. For a Young function Φ, consider the norm of L^Φ(γ) and the conjugate space endowed with the dual norm,

‖f‖_{L^{Φ,∗}(γ)} = sup { ∫ f g dγ | ∫ Φ(g) dγ ≤ 1 }.

The following proposition includes the standard Poincaré case provided Φ(u) = u²/2.

Given a couple of conjugate Young functions Φ, Ψ and norms |·|₁, |·|₂ on Rⁿ such that x·y ≤ |x|₁|y|₂, x, y ∈ Rⁿ, for all f, g ∈ C¹_poly(Rⁿ) it holds

Cov_γ(f, g) ≤ ‖ |∇f|₁ ‖_{L^Φ(γ)} ‖ |∇g|₂ ‖_{L^{Φ,∗}(γ)}.

The case of our interest here is Φ = cosh −1, Ψ = (cosh −1)∗. As (cosh −1)∗ ≺ cosh −1, it follows, in particular, that Cov_γ(f, f) is bounded by a constant times ‖ |∇f| ‖²_{L^{cosh −1}(γ)}.

2.6 Orlicz–Sobolev space with Gaussian weight

A reasonable option for our model space is to assume densities f = e^{u−K_γ(u)} · γ in the Gaussian maximal exponential family, f ∈ E(γ), and, moreover, to assume differentiability in the form u ∈ L²_{cosh₂}(γ) = L^{gauss₂}(γ), that is, u² ∈ L^{cosh₂}(γ); see Sect. 2.2.


Precisely, the exponential and the mixture Orlicz–Sobolev–Gauss (OSG) spaces of interest are, respectively,

W^{1,2}_{cosh₂}(γ) = { v ∈ L^{cosh₂}(γ) | ∂_j v ∈ L^{gauss₂}(γ) },   (16)
W^{1,2}_{cosh₂∗}(γ) = { η ∈ L^{cosh₂∗}(γ) | ∂_j η ∈ L^{gauss₂∗}(γ) },   (17)

where ∂_j, j = 1, ..., n, is the partial derivative in the sense of distributions.

As φ ∈ C₀^∞(Rⁿ) implies φγ ∈ C₀^∞(Rⁿ), for each f ∈ W^{1,2}_{cosh₂∗}(γ) it holds

⟨∂_j f, φ⟩_γ = ⟨∂_j f, φγ⟩ = −⟨f, ∂_j(φγ)⟩ = ⟨f, γ(x_j − ∂_j)φ⟩ = ⟨f, δ_j φ⟩_γ,

with δ_j φ = (x_j − ∂_j)φ. Here, the Stein operator δ_j acts on C₀^∞(Rⁿ) [13].
The meaning of both operators ∂_j and δ_j = (x_j − ∂_j) when acting on square-integrable random variables of the Gaussian space is well known. Still, here we are specifically interested in the action on OSG spaces. Let us denote by C_p^∞(Rⁿ) the space of infinitely differentiable functions with polynomial growth of all derivatives. Polynomial growth implies the existence of γ-moments of all derivatives, hence C_p^∞(Rⁿ) ⊂ W^{1,2}_{cosh₂∗}(γ). If f ∈ C_p^∞(Rⁿ), then the distributional derivative and the ordinary derivative are equal and, moreover, δ_j f ∈ C_p^∞(Rⁿ). For each φ ∈ C₀^∞(Rⁿ) we have ⟨φ, δ_j f⟩_γ = ⟨∂_j φ, f⟩_γ.
The OSG spaces W^{1,2}_{cosh₂}(γ) and W^{1,2}_{cosh₂∗}(γ) are both Banach spaces [36, Sec. 10]. The norm is the graph norm,

‖v‖_{W^{1,2}_{cosh₂}(γ)} = ‖v‖_{L^{cosh₂}(γ)} + Σ_{j=1}^n ‖∂_j v‖_{L^{gauss₂}(γ)},
‖η‖_{W^{1,2}_{cosh₂∗}(γ)} = ‖η‖_{L^{cosh₂∗}(γ)} + Σ_{j=1}^n ‖∂_j η‖_{L^{gauss₂∗}(γ)}.

In the case of null integral, Eq. (12) shows that the second term alone provides an equivalent norm for W^{1,2}_{cosh₂}(γ).
We review some relations between OSG spaces and Sobolev spaces without weight [1] in the following proposition. For each ball radius R > 0,

(2π)^{−n/2} ≥ γ(x) ≥ γ(x)(|x| < R) ≥ (2π)^{−n/2} e^{−R²/2} (|x| < R),   x ∈ Rⁿ.

Let Ω_R denote the open sphere of radius R > 0 and consider the restriction u ↦ u_R of u to Ω_R.

1. We have the continuous mappings

W^{1,(cosh −1)}(Rⁿ) ⊂ W^{1,(cosh −1)}(γ) → W^{1,p}(Ω_R),   p ≥ 1.

2. We have the continuous mappings

W^{1,p}(Rⁿ) ⊂ W^{1,(cosh −1)∗}(Rⁿ) ⊂ W^{1,(cosh −1)∗}(γ) → W^{1,1}(Ω_R),   p > 1.

3. Each u ∈ W^{1,(cosh −1)}(γ) is a.s. Hölder of all orders on each Ω_R and hence a.s. continuous. The restriction W^{1,(cosh −1)}(γ) → C(Ω_R) is compact.

For example, if q = e^{v−K_γ(v)} · γ and p = e^{u−K_γ(u)} · γ with q, p ∈ E(γ) and v, u ∈ W^{1,2}_{cosh₂}(γ), then the Hyvärinen divergence is

DH(p|q) = (1/2) ∫ |∇(u − v)(x)|² p(x) γ(x) dx < +∞

because ∇(u − v) ∈ L²_{cosh₂}(γ) = L²_{cosh₂}(p) ⊂ L²(p·γ).

3 Conclusion

In this final section, I suggest a few applications of my infinite-dimensional setup of professor Amari's ideas, which I have considered recently starting from [30, 44]. I will conclude by mentioning a few other topics for which it is probably possible to extend rigorous results from the finite state space to the Gaussian space.

3.1 Sub-exponential random variables

Let f ∈ C²_poly(Rⁿ) and assume f is globally Lipschitz, that is, |∇f(x)| ≤ ‖f‖_{Lip(Rⁿ)}, where ‖f‖_{Lip(Rⁿ)} is the Lipschitz semi-norm. It follows from Eq. (8) that f ∈ L^{cosh₂}(γ) and the norm admits a computable bound. If p is any probability density of the maximal exponential model of γ, that is, if it is connected to 1 by an open exponential arc, then the proposition in Sect. 2.4 implies that f ∈ L^{cosh₂}(p), that is, f is sub-exponential under the distribution P = p·γ. If the sequence (X_n)_{n=1}^∞ is an independent sample of p·γ, then the sequence of sample means converges,

lim_{n→∞} (1/n) Σ_{j=1}^n f(X_j) = ∫ f(x) p(x) γ(x) dx,

with an exponential bound on the tail probability. See [52, 2.8] and [50].
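The following Monte Carlo sketch is an added illustration, not from the paper; the choices P = N(0.5, 1), which belongs to the maximal exponential model of γ, and the Lipschitz function f(x) = |x| are arbitrary. It shows the rapid decay in n of the probability that the sample mean deviates from its limit.

import numpy as np

rng = np.random.default_rng(6)
f = np.abs
ref = np.mean(f(rng.normal(0.5, 1.0, size=2_000_000)))     # reference value of E_P[f]
eps, reps = 0.05, 5_000
for n in (50, 100, 200, 400):
    means = f(rng.normal(0.5, 1.0, size=(reps, n))).mean(axis=1)
    print(n, np.mean(np.abs(means - ref) >= eps))           # decreases quickly with n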

3.2 Hyvärinen divergence

I adapt [25, 40] and [5, 13.6.2] to my Gaussian case. Consider the Hyvärinen divergence of Sect. 2.6 in the Gaussian case, that is, P = p·γ and Q = q·γ. As a function of q, the divergence is

q ↦ DH(p·γ | q·γ) = (1/2) ∫ |∇ log p(x)|² p(x) γ(x) dx
+ (1/2) ∫ |∇ log q(x)|² p(x) γ(x) dx − ∫ ∇ log p(x) · ∇ log q(x) p(x) γ(x) dx,

where the first term does not depend on q and the second term is a p·γ-expectation. As ∇ log p = p⁻¹ ∇p, the third term equals

− ∫ δ·∇ log q(x) p(x) γ(x) dx,

which is again a p-expectation. To minimize the Hyvärinen divergence we must minimize the p-expected value of the local score

S(q, x) = (1/2) |∇ log q(x)|² − δ·∇ log q(x).
If p and q belong to the maximal exponential model of γ, then q = e^{u−K(u)} with u ∈ L^{cosh₂}(γ) and ∫ u(x) γ(x) dx = 0. The local score becomes (1/2)|∇u|² − δ·∇u. To compute the p-expected value of the score with an independent sample of p·γ, it is of interest to assume that the score is in L^{cosh₂}(γ), because this assumption implies the good convergence of the empirical means for all p. Assume, for example, ∇u ∈ L²_{cosh₂}(γ) = L^{gauss₂}(γ). This implies directly |∇u|² ∈ L^{cosh −1}(γ). Moreover, we must assume that the L^{cosh₂}(γ)-norm of δ·∇u is finite. Under such assumptions, one hopes that the minimization of a suitable model of the sample expectation of the Hyvärinen score is consistent.
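As a hedged illustration of the consistency one hopes for, consider the simple parametric sub-model u(x) = a x + b x² (an added example, not from the paper), for which P = p·γ is the Gaussian N(m, s²) with 1/s² = 1 − 2b and m = a s². The local score (1/2)|u'|² − (x u' − u'') is then quadratic in (a, b), and the stationarity conditions of its empirical mean form a 2×2 linear system whose solution recovers the true parameters.

import numpy as np

a_true, b_true = 0.8, 0.2
s2 = 1.0 / (1.0 - 2.0 * b_true)
m = a_true * s2
rng = np.random.default_rng(5)
x = rng.normal(m, np.sqrt(s2), size=200_000)     # sample from P = p * gamma

# stationarity of the empirical Hyvarinen score in (a, b):
#   a + 2 b E[X]        = E[X]
#   a E[X] + 2 b E[X^2] = E[X^2] - 1
M = np.array([[1.0, 2 * x.mean()], [x.mean(), 2 * (x**2).mean()]])
rhs = np.array([x.mean(), (x**2).mean() - 1.0])
a_hat, b_hat = np.linalg.solve(M, rhs)
print(a_hat, b_hat)                              # close to 0.8 and 0.2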

3.3 Otto’s metric

This metric was originally defined in [39]. Let p ∈ E(γ) be given in the maximal exponential model of γ, and let f and g be given in the p-fiber of the statistical bundle, that is, f, g ∈ W^{1,2}_{cosh₂}(p) = W^{1,2}_{cosh₂}(γ) and ∫ f(x) p(x) γ(x) dx = ∫ g(x) p(x) γ(x) dx = 0. Otto's inner product is

(f, g) ↦ ⟨f, g⟩_p = ∫ ∇f(x) · ∇g(x) p(x) γ(x) dx = ∫ f(x) δ·( p(x)∇g(x) ) γ(x) dx = ⟨f, p⁻¹ δ·(p∇g)⟩_p.

The LHS is well defined and continuous if ∇f, ∇g ∈ L²_{cosh₂}(γ), because, in such a case, |∇f|², |∇g|² ∈ L^{cosh₂}(γ) = L^{cosh₂}(p). The RHS, if defined, is an inner product in L^{cosh₂}(γ). Note that the mapping g ↦ δ·(p∇g) is 1-to-1 because of the constraint ∫ g(x) p(x) γ(x) dx = 0. The inverse of this mapping provides the natural gradient of Otto's inner product in the sense of [4, 29].


3.4 Boltzmann equation

The space-homogeneous Boltzmann operator with angular collision kernel B(z, x) = |x^⊤z| is discussed, for example, in [53]. I briefly show below how to use Gaussian Orlicz spaces in this context; see [30, 42].

The mixture bundle *SE(γ) is the set of all couples (f, η) with f ∈ E(γ), η ∈ L^{cosh₂∗}(γ), and ∫ η(x) f(x) γ(x) dx = 0. The ∗-notation recalls that the dual of L^{cosh₂∗}(γ) is L^{cosh₂}(γ). The Boltzmann operator is the mapping E(γ) ∋ f ↦ Q(f) with

Q(f)(v) = ∫_{R³} ∫_{S²} ( f(v − x x^⊤(v − w)) f(w + x x^⊤(v − w)) − f(v) f(w) ) |x^⊤(v − w)| σ(dx) dw,

where x^⊤ is the transpose of the column vector x, S² is the unit sphere of R³, and σ is the uniform probability on S². One can prove that f ↦ Q(f)/f is a section of the mixture bundle. The Boltzmann equation can be seen as the equation ⋆f = Q(f)/f.
The smoothness of the Boltzmann section follows from a superposition of operators:

1. Product: E(f₀) ∋ f ↦ f ⊗ f ∈ E(f₀ ⊗ f₀);
2. Interaction: E(f₀ ⊗ f₀) ∋ f ⊗ f ↦ g = B f ⊗ f ∈ E(f₀ ⊗ f₀);
3. Conditioning: E(f₀ ⊗ f₀) ∋ g ↦ ∫_{S²} g ∘ A_x σ(dx) ∈ E(f₀ ⊗ f₀);
4. Marginalization.
There is a weak form of the Boltzmann section. Let v, w be a couple of velocities before the collision and let us denote by (v_x, w_x) the velocities after the collision, see [53]. For f ∈ E(γ) and g ∈ L^{cosh₂}(γ), define the operator A with

Ag(v, w) = ∫_{S²} (1/2)( g(v_x) + g(w_x) ) σ(dx) − (1/2)( g(v) + g(w) ).

Then Ag ∈ L^{cosh₂}(γ^{⊗2}) and ⟨g, Q(f)/f⟩_f = E_{f⊗f}[Ag].

3.5 Other possible applications

The two cases below are open suggestions.

1. The transport problem [41] is discussed from a point of view compatible with IG in [32, 35]. But it does not fit the assumption of strictly positive probability densities. However, suppose the given margins belong to the exponential manifold. In that case, it is possible to consider the exponential sub-bundle with the given margins and discuss the gradient flow of the given optimization problem. This is done in a particular finite case in [47].
2. The dual couple W^{1,2}_{cosh₂}(γ) and W^{1,2}_{cosh₂∗}(γ) defines a couple of dual bundles on the set E(γ) where velocity, acceleration, moment, gradient, and Hessian are all well defined from the affine structure. A real function on the exponential bundle is a Lagrangian function, and a real function on the mixture bundle is a Hamiltonian function. The conjugation relation holds, and the equations of mechanics provide a dynamic picture of the statistical bundle. For example, one can consider the Lagrangian function where the entropy plays the role of the potential energy and Fisher's metric that of the kinetic energy. This was done in the finite case in [20, 45].
Acknowledgements It is a pleasure to acknowledge the contribution of many people to my work in non-parametric IG. In particular, I would like to mention professor Shun-ichi Amari's constant encouragement and the critical assessment by Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer in [9, 3.3]. I would also like to mention friends and coworkers in order of appearance in this paper: Carlo Sempi, Paolo Gibilisco, Alberto Cena, Maria Piera Rogantin, Barbara Trivellato, Paola Siri, Marina Santacroce, Luigi Malagò, Luigi Montrucchio, Goffredo Chirco, Bertrand Lods. The author was partially supported by de Castro Statistics, Collegio Carlo Alberto, and is a member of GNAMPA, Istituto Nazionale di Alta Matematica, Rome.

Author Contributions Not applicable.

Funding The author is supported by de Castro Statistics, Collegio Carlo Alberto, and INdAM-Gnafa.

Availability of data and materials Not applicable.

Code availability Not applicable.

Declarations
Conflict of interest The author is on the Editorial Board of Information Geometry. The author states that
there are no other conflicts of interest.

Ethics approval Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

References
1. Adams, R.A., Fournier, J.J.F.: Sobolev Spaces. Pure and Applied Mathematics (Amsterdam), vol. 140,
2nd edn., p. 305. Elsevier/Academic Press, Amsterdam (2003)
2. Amari, S.-I.: Differential geometry of curved exponential families-curvatures and information loss.
Ann. Stat. 10(2), 357–385 (1982). https://doi.org/10.1214/aos/1176345779
3. Amari, S.: Dual connections on the Hilbert bundles of statistical models. In: Dodson, C.T.J. (ed.)
Geometrization of Statistical Theory (Lancaster, 1987), pp. 123–151. ULDM Publ, Lancaster (1987)
4. Amari, S.-I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998).
https://doi.org/10.1162/089976698300017746
5. Amari, S.-I.: Information Geometry and Its Applications. Applied Mathematical Sciences, vol. 194,
p. 374. Springer, Tokyo (2016). https://doi.org/10.1007/978-4-431-55978-8
6. Amari, S.-I., Kumon, M.: Estimation in the presence of infinitely many nuisance parameters—geometry
of estimating functions. Ann. Stat. 16(3), 1044–1068 (1988). https://doi.org/10.1214/aos/1176350947
7. Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Mono-
graphs, vol. 191, p. 206. American Mathematical Society, Providence, Oxford University Press, Oxford
(2000). https://doi.org/10.1090/mmono/191 (Translated from the 1993 Japanese original by Daishi
Harada)
8. Arnold, V.I.: Mathematical Methods of Classical Mechanics. Graduate Texts in Mathematics, vol. 60,
p. 516. Springer, New York (1989). (Translated from the 1974 Russian original by K. Vogtmann and
A. Weinstein, Corrected reprint of the second (1989) edition)


9. Ay, N., Jost, J., Lê, H.V., Schwachhöfer, L.: Information Geometry. Ergebnisse der Mathematik und
ihrer Grenzgebiete. 3. Folge, vol. 64, p. 407. Springer, Cham (2017). https://doi.org/10.1007/978-3-
319-56478-4
10. Bauer, M., Bruveris, M., Michor, P.W.: Uniqueness of the Fisher–Rao metric on the space of smooth
densities. Bull. Lond. Math. Soc. 48(3), 499–506 (2016). https://doi.org/10.1112/blms/bdw020
11. Bogachev, V.I.: Differentiable Measures and the Malliavin Calculus. Mathematical Surveys and Mono-
graphs, vol. 164, p. 488. American Mathematical Society, Providence (2010). https://doi.org/10.1090/
surv/164
12. Bourbaki, N.: Variétés Differentielles et Analytiques. Fascicule de Résultats / Paragraphes 1 à 7.
Éléments de mathématiques, vol. XXXIII. Hermann, Paris (1971)
13. Brezis, H.: Functional Analysis, Sobolev Spaces and Partial Differential Equations. Universitext, p.
599. Springer, New York (2011). https://doi.org/10.1007/978-0-387-70914-7
14. Brown, L.D.: Fundamentals of Statistical Exponential Families with Applications in Statistical Deci-
sion Theory. IMS Lecture Notes. Monograph Series, vol. 9, p. 283. Institute of Mathematical Statistics,
Hayward (1986)
15. Buldygin, V.V., Kozachenko, Y.V.: Metric Characterization of Random Variables and Random Pro-
cesses. Translations of Mathematical Monographs, vol. 188, p. 257. American Mathematical Society,
Providence (2000). (Translated from the 1998 Russian original by V. Zaiats)
16. Cena, A.: Geometric structures on the non-parametric statistical manifold. Ph.D. thesis, Università
degli Studi di Milano (2002)
17. Cena, A., Pistone, G.: Exponential statistical manifold. Ann. Inst. Stat. Math. 59(1), 27–56 (2007).
https://doi.org/10.1007/s10463-006-0096-y
18. Čencov, N.N.: Statistical Decision Rules and Optimal Inference. Translations of Mathematical Mono-
graphs, vol. 53, p. 499. American Mathematical Society, Providence (1982). https://doi.org/10.1090/
mmono/053. (Translation from the Russian edited by Lev J. Leifman)
19. Chirco, G., Pistone, G.: Dually affine Information Geometry modeled on a Banach space. (2022).
arXiv:2204.00917
20. Chirco, G., Malagò, L., Pistone, G.: Lagrangian and Hamiltonian dynamics for probabilities on the sta-
tistical bundle. Int. J. Geom. Methods Mod. Phys. (2022). https://doi.org/10.1142/s0219887822502140
21. Efron, B.: Defining the curvature of a statistical problem (with applications to second order efficiency).
Ann. Stat. 3(6), 1189–1242 (1975). https://doi.org/10.1214/aos/1176343282. (With a discussion by
C. R. Rao, Don A. Pierce, D. R. Cox, D. V. Lindley, Lucien LeCam, J. K. Ghosh, J. Pfanzagl, Niels
Keiding, A. P. Dawid, Jim Reeds and with a reply by the author)
22. Efron, B.: The geometry of exponential families. Ann. Stat. 6(2), 362–376 (1978). https://doi.org/10.
1214/aos/1176344130
23. Efron, B., Hastie, T.: Computer Age Statistical Inference. Institute of Mathematical Statistics (IMS)
Monographs. Algorithms, Evidence, and Data Science, vol. 5, p. 475. Cambridge University Press,
New York (2016). https://doi.org/10.1017/CBO9781316576533
24. Gibilisco, P., Pistone, G.: Connections on non-parametric statistical manifolds by Orlicz space geom-
etry. IDAQP 1(2), 325–347 (1998). https://doi.org/10.1142/S021902579800017X
25. Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn.
Res. 6, 695–709 (2005)
26. Kass, R.E., Vos, P.W.: Geometrical Foundations of Asymptotic Inference. Wiley Series in Prob-
ability and Statistics: Probability and Statistics. Wiley, New York (1997). https://doi.org/10.1002/
9781118165980
27. Lang, S.: Differential and Riemannian Manifolds. Graduate Texts in Mathematics, 3rd edn., p. 364.
Springer, New York (1995). https://doi.org/10.1007/978-1-4612-4182-9
28. Lê, H.V.: Natural differentiable structures on statistical models and the Fisher metric (2022).
arXiv:2208.06539
29. Li, W., Montúfar, G.: Natural gradient via optimal transport. Inf. Geom. 1(2), 181–214 (2018). https://
doi.org/10.1007/s41884-018-0015-3
30. Lods, B., Pistone, G.: Information geometry formalism for the spatially homogeneous Boltzmann
equation. Entropy 17(6), 4323–4363 (2015). https://doi.org/10.3390/e17064323
31. Lott, J.: Some geometric calculations on Wasserstein space. Commun. Math. Phys. 277(2), 423–437
(2008). https://doi.org/10.1007/s00220-007-0367-3
32. Malagò, L., Montrucchio, L., Pistone, G.: Wasserstein Riemannian geometry of Gaussian densities. Inf.
Geom. 1(2), 137–179 (2018). https://doi.org/10.1007/s41884-018-0014-4


33. Malliavin, P.: Integration and Probability. Graduate Texts in Mathematics, vol. 157, p. 322. Springer,
New York (1995). https://doi.org/10.1007/978-1-4612-4202-4. (With the collaboration of Héléne
Airault, Leslie Kay and Gérard Letac, Edited and translated from the French by Kay, With a fore-
word by Mark Pinsky)
34. Malliavin, P.: Stochastic Analysis. Grundlehren der Mathematischen Wissenschaften [Fundamental
Principles of Mathematical Sciences], vol. 313, p. 343. Springer, Berlin (1997). https://doi.org/10.
1007/978-3-642-15074-6
35. Montrucchio, L., Pistone, G.: Kantorovich distance on finite metric spaces: Arens-eells norm and cut
norms. Inf. Geom. (2021). https://doi.org/10.1007/s41884-021-00050-w
36. Musielak, J.: Orlicz Spaces and Modular Spaces. Lecture Notes in Mathematics, vol. 1034. Springer,
Berlin (1983)
37. Nomizu, K., Sasaki, T.: Affine Differential Geometry: Geometry of Affine Immersions. Cambridge Tracts in Mathematics, vol. 111. Cambridge University Press, Cambridge (1994)
38. Nourdin, I., Peccati, G.: Normal Approximations with Malliavin Calculus. From Stein’s Method to Uni-
versality. Cambridge Tracts in Mathematics, vol. 192, p. 239. Cambridge University Press, Cambridge
(2012). https://doi.org/10.1017/CBO9781139084659
39. Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Commun.
Partial Differ. Equ. 26(1–2), 101–174 (2001). https://doi.org/10.1081/PDE-100002243
40. Parry, M., Dawid, A.P., Lauritzen, S.: Proper local scoring rules. Ann. Stat. 40(1), 561–592 (2012).
https://doi.org/10.1214/12-AOS971
41. Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607
(2019). https://doi.org/10.1561/2200000073. arXiv:1803.00567
42. Pistone, G.: Examples of the application of nonparametric information geometry to statistical physics.
Entropy 15(10), 4042–4065 (2013). https://doi.org/10.3390/e15104042
43. Pistone, G.: Nonparametric information geometry. In: Nielsen, F., Barbaresco, F. (eds.) Geometric
Science of Information. Lecture Notes in Comput. Sci., vol. 8085, pp. 5–36. Springer, Heidel-
berg (2013). https://doi.org/10.1007/978-3-642-40020-9_3. First International Conference, GSI 2013
Paris, France, August 28-30, 2013 Proceedings
44. Pistone, G.: Information geometry of the Gaussian space. In: Information Geometry and Its Applica-
tions. Springer Proc. Math. Stat., vol. 252, pp. 119–155. Springer, Cham (2018). https://doi.org/10.
1007/978-3-319-97798-0_5
45. Pistone, G.: Lagrangian function on the finite state space statistical bundle. Entropy 20(2), 139 (2018).
https://doi.org/10.3390/e20020139
46. Pistone, G.: Information geometry of smooth densities on the Gaussian space: Poincaré inequalities.
In: Nielsen, F. (ed.) Progress in Information Geometry. Signals and Communication Technology, pp.
1–17. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65459-7_1
47. Pistone, G.: Statistical bundle of the transport model. In: Geometric Science of Information. Lecture
Notes in Comput. Sci., vol. 12829, pp. 752–759. Springer, Cham (2021). https://doi.org/10.1007/978-
3-030-80209-7_81
48. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the probability
measures equivalent to a given one. Ann. Stat. 23(5), 1543–1561 (1995)
49. Santacroce, M., Siri, P., Trivellato, B.: New results on mixture and exponential models by Orlicz spaces.
Bernoulli 22(3), 1431–1447 (2016). https://doi.org/10.3150/15-BEJ698
50. Siri, P., Trivellato, B.: Robust concentration inequalities in maximal exponential models. Stat. Prob.
Lett. 170, 109001 (2021). https://doi.org/10.1016/j.spl.2020.109001
51. Susskind, L., Hrabovsky, G.: The Theoretical Minimum: What You Need to Know to Start Doing
Physics. Basic Books, New York (2013)
52. Vershynin, R.: High-dimensional Probability: an Introduction with Applications in Data Science. Cam-
bridge Series in Statistical and Probabilistic Mathematics, vol. 47, p. 284. Cambridge University Press,
Cambridge (2018). https://doi.org/10.1017/9781108231596. (With a foreword by Sara van de Geer)
53. Villani, C.: A review of mathematical topics in collisional kinetic theory. In: Handbook of Mathemat-
ical Fluid Dynamics, vol. I, pp. 71–305. North-Holland, Amsterdam (2002). https://doi.org/10.1016/
S1874-5792(02)80004-0
54. Wainwright, M.J.: High-dimensional Statistics: A Non-asymptotic Viewpoint. Cambridge Series in
Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2019). https://doi.
org/10.1017/9781108627771


55. Weyl, H.: Space Time Matter. Dover, New York (1952). (Translation of the 1921 RAUM ZEIT
MATERIE)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.
