2304.00388 Multilevel CNNs For Parametric PDEs
2304.00388 Multilevel CNNs For Parametric PDEs
2304.00388 Multilevel CNNs For Parametric PDEs
Berlin, Germany
Martin Eigel martin.eigel@wias-berlin.de
Weierstraß Institute
Berlin, Germany
Abstract
We combine concepts from multilevel solvers for partial differential equations (PDEs)
with neural network based deep learning and propose a new methodology for the efficient
numerical solution of high-dimensional parametric PDEs. An in-depth theoretical analysis
shows that the proposed architecture is able to approximate multigrid V-cycles to arbitrary
precision with the number of weights only depending logarithmically on the resolution of
the finest mesh. As a consequence, approximation bounds for the solution of parametric
PDEs by neural networks that are independent on the (stochastic) parameter dimension
can be derived.
The performance of the proposed method is illustrated on high-dimensional parametric
linear elliptic PDEs that are common benchmark problems in uncertainty quantification.
We find substantial improvements over state-of-the-art deep learning-based solvers. As
particularly challenging examples, random conductivity with high-dimensional non-affine
Gaussian fields in 100 parameter dimensions and a random cookie problem are examined.
Due to the multilevel structure of our method, the amount of training samples can be
reduced on finer levels, hence significantly lowering the generation time for training data
and the training time of our method.
Keywords: deep learning, partial differential equations, parametric PDE, multilevel,
expressivity, CNN, uncertainty quantification
1 Introduction
The application of deep learning (DL) to many problems of the natural sciences has become
one of the most promising emerging research areas, sometimes coined scientific machine
learning (Baldi, 2018; Sadowski and Baldi, 2018; Keith et al., 2021; Noé et al., 2020; Lu
et al., 2021). In this work, we focus on the challenging problem of parametric partial
differential equations (PDEs) that are used to describe complex physical systems e.g. in
engineering applications and the natural sciences. The parameters determine the data or
domain and are used to introduce uncertainties into the model, e.g. based on assumed
or measured statistical properties. The resulting high-dimensional problems are analyzed
and solved numerically in the research area of uncertainty quantification (UQ) (Lord et al.,
1
Heiß, Gühring and Eigel
2014). Such parametric models are of importance in automated design, weather forecasting,
risk simulations, and the material sciences, to name but a few.
To make the problem setting concrete, for a possibly countably infinite parameter space
Γ ⊆ RN and a spatial domain D ⊆ Rd , d ∈ {1, 2, 3}, we are concerned with learning the
solution map u : Γ × D → R of the parametric stationary diffusion PDE that satisfies the
following equation for all parameters y ∈ Γ
2
Multilevel CNNs for Parametric PDEs
in contrast to many classical machine learning problems, a dataset can be generated on the
fly and in principle with arbitrary precision.
Complementing numerical results, significant progress has been made to understand
the approximation power, also called expressivity, of NNs for representing solutions of
PDEs (Kutyniok et al., 2022; Schwab and Zech, 2019; Berner et al., 2020b; Mishra and
Molinaro, 2022; Grohs et al., 2018; Gühring et al., 2020; Gühring and Raslan, 2021). These
works show that NNs are at least as expressive as “classical” best-in-class methods like
the ones mentioned before. However, numerical results do often not reflect this theoretical
advantage (Adcock and Dexter, 2021). Instead, the convergence of the error typically stag-
nates early during training and the accuracy of classical approaches is not reached (Geist
et al., 2021; Kharazmi et al., 2019; Lu et al., 2021; Grossmann et al., 2023).
We propose a DL-based NN approach that overcomes the performance shortcomings
of other NN architectures for parametric PDEs by borrowing concepts from MLSC and
multigrid algorithms. Coarse approximation and successive corrections on finer levels are
learned in a supervised way by sub-networks and combined for the final prediction. Analog-
ously to multilevel Monte Carlo (MLMC) and MLSC methods, we show that the fineness
of sampling the stochastic parameter space (in terms of the amount of training data per
level) can be (reciprocally) equilibrated with the fineness of the spatial discretization. The
developed architecture consists of a composition of UNets (Ronneberger et al., 2015), which
inherently are closely related to multiresolution approaches.
In the following, we summarize our main contributions.
• Numerical Results: We conduct an in-depth numerical study of our method and, more
generally, UNet-based approaches on common benchmark problems for parametric
PDEs in UQ. Strikingly, the tested methods show significant improvements over the
state-of-the-art NN architectures for solving parametric PDEs. To reduce computing
complexity during training data generation and training the NN, we shift the bulk
of the training data to coarser levels, which allows our method to be scaled to finer
grids, ultimatively resulting in higher accuracy.
3
Heiß, Gühring and Eigel
2017). Kutyniok et al. (2022) translated a reduced basis method into an NN architecture.
None of these results show bounds that are independent on the (stochastic) parameter di-
mension. In contrast, translated to our setting (and simplified), Kutyniok et al. (2022)
show an asymptotic upper bound of h−2 log(h−1 )p + p log(h−1 )3p weights. Numerical ex-
periments are either not provided or do not support the positive theoretical results (Geist
et al., 2021). For general overview of expressivity results for NNs, we refer the interested
reader to (Gühring et al., 2022).
Related to our method and instructive for its derivation is the MLSC method presented
in (Teckentrup et al., 2015) and also (Harbrecht et al., 2016; Ballani et al., 2016). Another
related work is (Lye et al., 2021), where a training process with multilevel structure is
presented. Under certain assumptions on the achieved accuracy of the optimization, an
error analysis is carried out that could in part also be transferred to our architecture.
The connection between differential operators and convolutional layers is well known (see
Chen et al., 2022) and CNNs have already been applied to approximate parametric PDEs for
problems like predicting the airflow around a given geometry in the works (Bhatnagar et al.,
2019; Guo et al., 2016). In (Zhu and Zabaras, 2018) Bayesian CNNs were used to assess
uncertainty in processes governed by stochastic PDEs. (Ummenhofer et al., 2020) extend the
concept to continuous convolutions in mesh-free Lagrangian approaches for solving problems
in fluid dynamics. The concept of a multilevel decomposition has also been applied to CNNs
to approximate solutions of PDEs or in (Gao et al., 2018) for general generative modeling,
or based on hierarchical matrix representations in (Fan et al., 2019a,b).
Such hierarchical matrix representations have even been extended to graph CNNs as
multipole kernels to allow for a mesh-free approximation of the solutions to parametric
PDEs (Li et al., 2020). This approach is similar to our method. However, the mathematical
connection to classical multigrid algorithms is not explored. The kernels are based on fast
multipole modeling without a direct link to the FE method.
He and Xu (2019) present a multilevel approach called MgNet, where various parts of
a classical multigrid solver are replaced by NNs resulting in a UNet-like NN and use it for
image classification. Meta-MgNet (Chen et al., 2022) is a meta-learning approach using
MgNet to predict solutions for parametric PDEs and is most closely related to our method.
The predicted solution is successively refined and stochastic parameters are incorporated by
computing encodings of the parametric discretized operator on each level using additional
smaller NNs. The output of these meta-networks then influences the smoothing operation
in the MgNet architecture. However, in contrast to our method, Meta-MgNet still depends
on the assembled operator matrix. Furthermore, while using an iterative approach yields
more accurate approximations, the speedup promised by a single application of a CNN is
mitigated and the authors report runtimes on the same order of magnitude as classical
solvers. Lastly, the authors test their method on problems with uniform diffusivity using
two or three degrees of freedom instead of a continuous diffusivity field. Our approach
differs from the Meta-MgNet in the following ways: (i) We show that a general UNet is
already able to approximate parametric differential operators without the need for a meta-
network approach; (ii) we improve on the training procedure by incorporating a multilevel
optimization scheme; (iii) we provide a theoretical complexity analysis; (iv) we provide
experiments showing high accuracies with a single application of our method for a variety
of random parameter fields.
4
Multilevel CNNs for Parametric PDEs
1.2 Outline
In the first section we introduce the parametric Darcy model problem and its FE discret-
ization. In Section 3, we give a brief introduction to multigrid methods. In Section 4,
we describe our DL-based solution approach. Section 5 is concerned with the theoretical
analysis of the expressivity of the presented architecture. Section 6 contains our numerical
study on several parametric PDE problems. We discuss and summarize our findings in
Section 7. Implementation details and our proofs can be found in the appendix.
2 Problem setting
A standard model problem in UQ is the stationary linear diffusion equation (1) where
the diffusion coefficient κ : Γ → L∞ (D) is a random field with finite variance determined
by a possibly high-dimensional random parameter y ∈ Γ ⊆ RN , hence also parametrizing
the solution. The parameters are images of random vectors distributed according to some
product measure π = ⊗k≥1 π1 , where π1 usually either is the uniform (denoted by U ) or
the standard normal measure (denoted by N ) on Γ1 ⊆ R. For each parameter realization
y ∈ Γ, the variational formulation of the problem is given by: For a bounded Lipschitz
domain D ⊆ Rd and source term f ∈ V ∗ = H −1 (D), find u ∈ V := H01 (D) such that for all
test functions w ∈ V ,
Z Z
κ(y, x)∇u(x) · ∇w(x) dx = f (x)w(x) dx. (2)
D D
We assume that the above equation always exhibits a unique solution. In fact, this holds
true with high probability even for the case of unbounded (lognormal) coefficients, where
uniform ellipticity of the bilinear form is not given. This yields the solution operator v,
mapping a parameter realization y ∈ Γ to its corresponding solution v(y):
v : Γ → V, y 7→ v(y).
With the assumed boundedness (with high probability) of the coefficient κ, this is equivalent
to optimizing with respect to the canonical parameter-dependent norm.
We recall the truncated Karhunen-Loève expansion (KLE) (Knio and Le Maitre, 2006;
Lord et al., 2014) with stochastic parameter dimension p, which is a common parametric
representation of random fields. It is given by
p
κ(y, x) = a0 (x) + yk ak (x). (3)
X
k=1
5
Heiß, Gühring and Eigel
completely characterized by their first two moments and the ak are the eigenfunctions of
the covariance integral operator while the parameter vector y is the image of a standard
normal random vector. To become computationally feasible, a sufficient decay of the norms
of these functions is required for the truncation to be reasonable. Note that in the numerical
experiments, we use a well-known “artificial” KLE.
2.1 Discretization
To solve the infinite dimensional variational formulation (2) numerically, a finite dimensional
subspace has to be defined in which an approximation of the solution is computed. As is
common with PDEs, we use the FE method, which is based on a disjoint decomposition
of the computational domain D into simplices (also called elements or cells). Setting the
domain D = [0, 1]2 , we assume a subdivision into congruent triangles such that D is exactly
represented by a uniform square mesh with identical connectivity at every node (see also
Remark 11). The discrete space Vh is then spanned by conforming P1 FE hat functions
Vh
ϕj : R2 → R, i.e., Vh = span{ϕj }dim
j=1 ⊆ V (see e.g., Braess, 2007; Ciarlet, 2002). We write
uh , wh ∈ Vh for the discretized versions of u, w ∈ V of Equation (2) with FE coefficient
vectors u, w ∈ Rdim Vh and denote by κh the nodal interpolation of κ such that κh (y, ·) ∈ Vh
with coefficient vector κy ∈ Rdim Vh for every y ∈ Γ. Hence,
XVh
dim XVh
dim XVh
dim
uh = ui ϕ i , wh = wi ϕi , κh (y, ·) = (κy )i ϕi .
i=1 i=1 i=1
where ay,h (uh , vh ) := D κh (y, x)∇uh (x) · ∇wh (x) dx. Using the FE coefficient vectors leads
R
We assume that there always exists a unique solution to this problem and define the
discretized parameter to solution operator by
vh : Γ → Vh , y 7→ vh (y), (5)
where vh (y) solves Problem 2.1 for any y ∈ Γ. Furthermore, vh : Γ → Rdim Vh maps y to
the FE coefficient vector of vh (y).
6
Multilevel CNNs for Parametric PDEs
V1 ⊆ . . . ⊆ VL ⊆ H01 (D).
These spaces are spanned by the the FE basis functions on the respective level, i.e., V` =
(`) V`
span{ϕj }dimj=1 . A frequency decomposition of the target function u ∈ H0 (D) is obtained
1
by approximating low frequency parts on the coarsest mesh T1 and subsequently calculating
corrections for higher frequencies on finer meshes.
For each level ` = 1, . . . , L we define the discretized solution operator v` : Γ → V` as
in (5). The correction v̂` with respect to the next coarser level ` − 1 for ` ∈ {2, . . . , L} is
given by
v̂` : Γ → V` , y 7→ v` (y) − v`−1 (y). (6)
Additionally, we define v` and v̂` as the functions mapping the parameter y to the corres-
ponding coefficient vectors of v` (y) and v̂` (y) in V` . If the solution of the considered PDE
is sufficiently smooth, the corrections decay exponentially in the level, namely kv̂` (y)kH 1 .
2−` kf k∗ (see Braess, 2007; Yserentant, 1993; Hackbusch, 2013).
Central to the multigrid method is a transfer of discrete functions between adjacent
levels of the hierarchy of spaces. This is achieved in terms of prolongation and restriction
operators, the discrete versions of which are defined in the following.
7
Heiß, Gühring and Eigel
For our analysis, we use a damped Richardson smoother (Richardson and Glazebrook,
1911), which is rarely used in practice due to subpar convergence rates, but theoretically
accessible because of its simplicity. For a suitable damping parameter ω > 0, the Richardson
iteration for solving Equation (4) is given by
u0 := 0,
(7)
un+1 := un + ω(f − Ay un ).
We denote by MGm k0 ,k : R
3×dim Vh → Rdim Vh the function mapping an initial guess, the
diffusivity field and the right-hand side to the result of m multigrid V-cycles (Algorithm 3.1)
with k0 smoothing iterations on the coarsest level and k smoothing iterations on the finer
levels.
Based on this iteration, we use a fundamental convergence result for the multigrid al-
gorithm shown in (Braess and Hackbusch, 1983, Thm. 4.2) together with the assumption
of uniform ellipticity to get the uniform contraction property of the V-cycle multigrid al-
gorithm. For x ∈ Rdim Vh and the discretized operator A ∈ Rdim Vh ×dim Vh , we use the
standard notation for the energy norm kxkA = kA1/2 xk`2 .
In the next remark, we elaborate on the number of smoothing steps on the coarsest grid k0 .
Remark 4 In the V-cycle algorithm it is often assumed that the equation system solved
on the coarsest level is solved exactly (by a direct solver). However, the proof of (Braess
and Hackbusch, 1983, Thm. 4.2) shows that this is not necessary. In fact, a contraction of
the residual smaller than C/(C + 2k) on the coarsest level suffices. This shows that k0 is
independent on the fineness of the finest grid. For a rigorous derivation, we refer to (Xu
and Zhang, 2022, Theorem 4.4). In our setting, it is possible to use the damped Richardson
iteration for the correction on the coarsest grid as well, effectively carrying out additional
smoothing steps.
8
Multilevel CNNs for Parametric PDEs
results can be extended to more dimensions in a straightforward way e.g. by using higher-
dimensional convolutions.
The general procedure of the proposed approach can be described as follows. Initially,
a dataset of solutions is generated using classical finite element solvers for randomly drawn
parameters (datapoints), i.e., realizations of coefficient fields leading to respective FE solu-
tions. Then, for each datapoint a multilevel decomposition of the solution is generated as
formalized in (6). A NN is trained to output the coarse grid solution and the subsequent
level corrections from the input parameters, which are given as coefficient vectors of the
coefficient realizations on the respective FE space. The network prediction is obtained by
summing up these contributions. Finally, to test the accuracy of the network, a set of inde-
pendent FE solutions is generated and error metrics of the NN predictions are computed.
The centerpiece of our methodology is called ML-Net (short for multilevel network). For
a prescribed number of levels L ∈ N, it consists of L jointly trained subnetworks ML-Net` .
Each subnetwork is optimized to map its input – the discretized parameter dependent
diffusion coefficient κy and the output of the previous level z(`−1) – to the level ` correction
v̂` (y) as in (6). The ML-Net` subnetwork architecture is inspired by a cascade of multigrid
V-cycles, which is implemented by a sequence of UNets (Ronneberger et al., 2015). Using
UNets in a sequential manner to correct outputs of previous modules successively is a
well-established strategy in computer vision (see e.g., Genzel et al., 2022). We provide a
theoretical foundation for this design choice in the next section. Since we assume the right-
hand side f ∈ H −1 (D) to be independent of y, we do not include it as an additional input.
9
Heiß, Gühring and Eigel
More precisely,
L
ML-Net[θ; L] : Rdim VL → κ 7→ (z(`) )L
O
Rdim V` , `=1 ,
`=1
where z(`) ∈ Rdim V` is the output of the level ` subnetwork and θ are the learnable para-
meters. For the subnetworks, we have
where ↓ (↑) denotes the up-sampling (down-sampling) operator to level ` resolution dim V` .
These operators are motivated by the prolongation and restriction operators of Definition 2.
However, they are comprised of trainable strided (or transpose-strided) convolutions instead
of an explicit construction. The parameters of the level ` subnetwork θ ` consist of the
parameters of the learnable up- and down-sampling and the UNet-sequence (θ `,r )R r=1 .
`
Our complete approach to tackle a parametric PDE with ML-Net consists of the follow-
ing multi-step procedure:
Step 1: Sampling and computing the data. We denote by N ∈ N the training dataset
size and sample parameter realizations y1 , . . . , yN drawn according to π. For each parameter
realization yi , 1 ≤ i ≤ N , the grid corrections v̂` (yi ) on each level 1 ≤ ` ≤ L as in (6) and
the FE coefficient vector κyi ∈ Rdim VL of κL (·, yi ) on the finest level L are computed. Note
that the evaluation of the grid corrections requires the PDE solution for each parameter on
the finest grid.
Step 2: Training the ML-Net. For each level ` ∈ {1, . . . , L}, compute the normaliza-
2
tion constants δ`2 := N1 N
i=1 kv̂` (yi )k2 and the H mass matrix M ` ∈ R
1 dim V` ×dim V` by
P
D E
(`) (`)
(M ` )kj := ϕk , ϕj , for k, j ∈ {1, . . . , dim V` },
H 1 (D)
10
Multilevel CNNs for Parametric PDEs
(`) (`)
where {ϕ1 , . . . , ϕdim V` } constitutes the conforming P1 FE basis of V` . We optimize the
parameters θ by approximately solving
L X
N
1
min where d`i := ML-Net[θ; L](κyi ) ` − v̂` (yi )
X
dT`i M ` d`i ,
θ
`=1 i=1
δ`
with mini-batch stochastic gradient descent. Our loss computes the H 1 -distance between the
subnetwork output as a coefficient of the FE basis of V` and the normalized grid corrections
directly on the coefficients for each level `, i.e.,
XV`
dim
(`) 1
dT`i M ` d`i = [ML-Net[θ ∗ ; L](κyi )]`j ϕj − v̂` (yi ) .
j=1
δ`
H 1 (D)
Normalization via δ` assures that each level is weighted equally in the loss function.
In the following remark, a simpler baseline method is introduced that can be seen as a
special case of ML-Net.
11
Heiß, Gühring and Eigel
Remark 5 As a comparison and to assess the benefits of the described multilevel approach,
we propose a simpler NN, which directly maps the discretized parameter dependent diffusion
coefficient κy to the coefficients on the finest level. For this, we denote by UNetSeq a
sequence of UNets which can be seen as a variation of ML-Net where R` = 0 for levels
` = 1, . . . , L − 1, i.e., only the highest level sub-module is used. This architecture is a
hybrid approach between single-level and multi-level because of the internal multi-resolution
structure of the UNet blocks. For this architecture, we choose the number of UNets such
that the total parameter count roughly matches ML-Net.
5 Theoretical analysis
In this section, we show that UNet-like NN architectures are able to approximate multigrid
V-cycles on Vh up to arbitrary accuracy ε > 0 with the number of parameters independent
on ε and growing only logarithmically in 1/h (Theorem 6). We use this result in Corollary 8
to approximate the discrete solution vh : Γ → Rdim Vh of the parametric PDE Problem 2.1
with arbitrary accuracy ε by a NN with the number of parameters bounded from above by
log(1/h) log(1/ε) independent of the stochastic parameter dimension p. This is achieved by
a NN that emulates a multigrid solver applied to each y individually and, thus, approx-
imately computes the solution vh (y). Eventually, we derive similar approximation bounds
for the ML-Net architecture and point out computational advantages of ML-Net over pure
multigrid approximations of Corollary 8. Note that while only the two-dimensional setting
is considered in this section, the results can be translated to arbitrary spatial dimensions.
The rigorous mathematical analysis of NNs requires an extensive formalism, which can
e.g., be found in (Petersen and Voigtländer, 2018; Gühring et al., 2020; Gühring and Raslan,
2021). We assume that readers interested in the proofs are familiar with common techniques
such as concatenation/parallelisation of NNs and the approximation of polynomials, the
identity, and algebraic operations. To make our presentation more accessible from a “prac-
titioner point of view”, we present a streamlined exposition which still includes sufficient
details to follow the arguments of the proofs. All proofs can be found in Appendix C.
We consider a CNN as any computational graph consisting of (various kinds of) convo-
lutions in combination with standard building blocks from common DL libraries (e.g., skip
connections and pooling layers). Note that this conception also includes the prominent
UNet architecture (Ronneberger et al., 2015) and other UNet-like architectures with the
same recursive structure.
For our theoretical analysis, we consider activation functions % ∈ L∞ loc (R) such that
there exists x0 ∈ R with % three times continuously differentiable in a neighborhood of
some x0 ∈ R and %00 (x0 ) 6= 0. This includes many standard activation functions such as
exponential linear unit, softsign, swish, and sigmoids. For (leaky) ReLUs a similar but more
involved analysis would be possible, which we avoid for the sake of simplicity.
For a CNN Ψ, we denote the number of weights by M (Ψ). Since we consider the
FE spaces Vh in the two-dimensional setting on uniformly refined square meshes, FE
coefficient vectors x ∈ Rdim Vh can be viewed as two-dimensional arrays. Whenever x
is processed by a CNN, we implicitly assume a 2D matrix representation. We define
kxkH 1 :=
Pdim Vh
i=1 xi ϕi H 1 . For this section, we set D = [0, 1]2 and assume uniform
ellipticity for Problem 2.1.
12
Multilevel CNNs for Parametric PDEs
We start with the approximation of the solution of the multigrid V-cycle MGm k0 ,k with
initialization u0 , diffusion coefficient κ, and right-hand side f by UNet-like CNNs.
Theorem 6 Let Vh ⊆ H01 (D) be the P1 FE space on a uniform square mesh. Then there
exists a constant C > 0 such that for any M, ε > 0 and m, k, k0 ∈ N there exists a CNN
Ψ : R3×dim Vh → Rdim Vh with
In the next remark, we provide more details about the architecture of Ψ and its implications.
Remark 7 The proof of Theorem 6 reveals that Ψ resembles the concatenation of multiple
UNet-like subnetworks, each approximating one V-cycle. We draw two conclusions from
that:
• Generally, this supports the heuristic relation between UNets and multigrid methods
with a rigorous mathematical analysis indicating a possible reason for their success
in multiscale-related tasks e.g., found in medical imaging (Liu et al., 2020) or PDE-
related problems (see Cho et al., 2021; Jiang et al., 2021).
• Together with our numerical results in the following section, this underlines the suit-
ability of UNetSeq (see Remark 5) for the solution of Problem 2.1. This is made more
specific in the next corollary.
The following corollary is an easy consequence of Theorem 6 for the parametric PDE
Problem 2.1 by applying the multigrid solver to solve Equation (5) for each y ∈ Γ indi-
vidually. We use that the required number of smoothing iterations in Theorem 3 is grid
independent and choose the number of V-cycles as m ≈ log(1/ε). An application of the
triangular inequality yields the following result.
Corollary 8 Consider the discretized Problem 2.1 with the conforming P1 FE space Vh ⊆
H01 (D). Assume that κy is uniformly bounded over all y ∈ Γ. Then there exists a constant
C > 0 such that for any ε > 0 there exists a CNN Ψ : R2×dim Vh → Rdim Vh with
Setting the NN approximation error equal to the FE discretization error (ε = h), we get
the bound M (Ψ) ≤ C log(1/h)2 . This shows that the number of parameters at most grows
polylogarithmically in 1/h and is independent of the stochastic parameter dimension p.
Finally, we provide complexity estimates for the ML-Net architecture in the next co-
rollary. Here, at each level of ML-Net a multigrid V-cycle (on the respective level) is
approximated by using Theorem 6.
13
Heiß, Gühring and Eigel
Corollary 9 Consider the discretized Problem 2.1. Let the spaces V1 , . . . , VL be defined as
in Section 3 for L ∈ N, f ∈ H −1 (D) and its discretization denoted by f ∈ Rdim VL . Assume
that the discretized diffusion coefficient κy ∈ Rdim VL is uniformly bounded over all y ∈ Γ.
Then, there exists a constant C > 0 such that for every ε > 0 there exists an ML-Net Ψ (as
in Section 4) with
(i) kΨ(0, κy , f ) − vL (y)kH 1 (D) ≤ ε kf k∗ for all y ∈ Γ,
(ii) M (Ψ) ≤ CL log 1
ε + CL2 .
In the following remark, we discuss the relation of the previous two corollaries.
Remark 10 Equilibrating approximation and discretization error, i.e., setting ε = h =
2−L , we make two observations:
• Both, Corollary 8 and 9, yield an upper bound for the number weights of M (Ψ) ≤ CL2 .
We conclude that from our upper bounds no advantage in terms of expressiveness for
using ML-Net over UNetSeq is evident. In Section 6.2 we show that our numerical
experiments support this finding.
• Despite a similar number of parameters in relation to the approximation accuracy,
a forward pass of ML-net is more efficient than a forward pass of UNetSeq. This is
due to the multilevel structure of the ML-Net that shifts the computational load more
evenly across resolutions. Note that the number of parameters of a convolutional
filter is independent on the input resolution, whereas the computational complexity of
convolving the input with the filter is not. In detail, the number of operations for a
forward pass of ML-Net is O(22L ) = O(h−2 ), while for UNetSeq, we have O(L22L ) =
O(log(h−1 )h−2 ) computations. This is one advantage of ML-Net that results in shorter
training times (see Section 6).
6 Numerical results
This section is concerned with the evaluation of ML-Net and the simpler UNetSeq on several
common benchmark problems in terms of the mean relative H 1 -distance of the predicted
solution. Moreover, the performance of ML-Net is tested with a declining number of training
samples for successive grid levels. This is motivated by multilevel theory and leads to a
beneficial training complexity (see Remark 10 for a more detailed evaluation).
We assess the performance of our methods for different choices of varying computational
complexity for the coefficient functions ak , comprising the diffusivity field κ by (3) and
different parameter distributions π. For all test cases, we choose the unit square domain
D = [0, 1]2 , a0 (x) := a0 ∈ R constant and the right-hand side f ≡ 1. The test problems are
defined as follows.
1. Uniform smooth field. Let Γ = [−1, 1]p and π = U [−1, 1]p . Moreover, let a0 := 1
and choose ak as planar Fourier modes as described in (Eigel et al., 2019b) and (Geist
et al., 2021) with decay kak k∞ = 0.1k −2 for 1 ≤ k ≤ p.
2. Log-normal smooth field. Let Γ = Rp and π = N (0, Idp ). Define κ(y, x) :=
exp(κ̃(y, x)), where κ̃ equals κ from the uniform case with a0 := 0.
14
Multilevel CNNs for Parametric PDEs
Figure 2: Example realizations of κ(·, y) for the four test cases. From left to right:
uniform case (p = 100), log-normal case (p = 100), cookie problem (p = 16), cookie
problem with variable radii (p = 32). The colormaps are normalized for visibility.
3. Cookie problem with fixed radii inclusions. Let Γ = [−1, 1]p , with p having
a natural square root and π = U [−1, 1]p . Moreover, define a0 := 0.1 and ak (x) =
XDk (x), where Dk are disks centered in a cubic lattice with a fixed radius r = 0.3p−1/2
for 1 ≤ k ≤ p.
p 0
κ(y, x) := a0 + (x),
X
y2k−1 XDk,r(y
2k )
k=1
where the Dk,r are disks centered in a cubic lattice with parameter dependent ra-
dius r. We choose r(y) for y ∈ [−1, 1] such that the radii are uniformly distributed in
[0.5(p0 )−1/2 , 0.9(p0 )−1/2 ].
Figure 2 depicts a random realization for each of the 4 problem cases. Note that not all of
the above problems satisfy the theoretical assumptions for the results in Section 5. Notably,
the log-normal case is not uniformly elliptic and our theoretical guarantees thus only hold
with (arbitrarily) high probability as indicated above. Moreover, in case of the cookie
problem with variable radii, the diffusion coefficient depends non-linearly on the parameter
vector since also the basis functions ak are parameter dependent. We include these cases to
study our methodology with diverse and particularly challenging setups.
We use the open source package FEniCS (Logg and Wells, 2010) with the GMRES (Saad
and Schultz, 1986) solver for carrying out the FE simulations to generate training data and
PyTorch (Paszke et al., 2019) for the NN models.
If not stated otherwise, we use a training dataset with 104 samples and 1024 samples
for independent validation computed from i.i.d. parameter vectors. For testing, the per-
formance of the methods is evaluated on 1024 independently generated test samples.
For the multilevel decomposition, we consider L = 7 resolution levels starting with a
coarse resolution of 5 × 5 cells. Iterative uniform mesh refinement with factor η = 2 results
in 320 × 320 cells on the finest level. To reduce computational complexity, we only compute
15
Heiß, Gühring and Eigel
solutions on the finest level and derive the coarser solutions as well as the corrections
in (6) using nodal interpolation. Additionally, we compute reference test solutions on a
high-fidelity mesh with 1280 × 1280 cells, resulting in about 1.6 × 106 degrees of freedom.
For UNetSeq, we choose the number of UNets such that the total parameter count
roughly matches the 5.6×106 parameters of ML-Net. For training, we use the Adam (Kingma
and Ba, 2015) optimizer and train for 200 epochs with learning rate decay. After each run,
we evaluate the model with the best validation performance. Training took approximately
33 GPU-hours on an NVIDIA Tesla P100 for ML-Net and 46 GPU-hours for UNetSeq (see
also Remark 10).
with ∗ ∈ {H 1 , L2 }.
cases described above with varying stochastic parameter dimensions. EMRL2 and EMRL ref
2 are
the NN approximation error on the finest grid L = 7 can be several magnitudes lower than
the FE approximation error. Conversely, the discrepancy between EMRL2 and EMRL ref
2 seen
16
Multilevel CNNs for Parametric PDEs
17
Heiß, Gühring and Eigel
parameter EMRH 1
problem
dimension p ML-Net UNetSeq
10 2.24e−4 ± 9.62e−5 9.75e−5 ± 2.10e−5
50 2.21e−4 ± 2.96e−5 2.08e−3 ± 2.79e−3
uniform
100 2.27e−4 ± 2.72e−5 8.89e−5 ± 1.00e−5
200 2.31e−4 ± 3.50e−5 2.05e−3 ± 2.76e−3
10 2.15e−4 ± 6.18e−5 2.83e−3 ± 3.70e−3
50 9.73e−4 ± 1.04e−3 3.71e−3 ± 4.91e−3
log-normal
100 2.64e−4 ± 1.73e−5 2.82e−3 ± 4.31e−3
200 3.00e−4 ± 1.43e−5 1.97e−4 ± 4.22e−5
16 9.41e−4 ± 1.12e−4 1.10e−3 ± 3.76e−4
cookie fixed
64 1.85e−3 ± 1.38e−4 7.20e−4 ± 1.43e−4
32 4.69e−3 ± 1.41e−3 3.69e−3 ± 4.53e−4
cookie variable
128 5.98e−3 ± 1.13e−4 3.08e−3 ± 5.07e−4
Table 1: EMRH 1 for ML-Net and UNetSeq evaluated on all test cases.
ref
EMRH
parameter 1
problem
dimension p ML-Net UNetSeq
10 5.33e−3 ± 4.50e−6 5.33e−3 ± 7.01e−7
50 5.33e−3 ± 1.15e−6 6.24e−3 ± 1.28e−3
uniform
100 5.33e−3 ± 1.63e−6 5.33e−3 ± 8.55e−7
200 5.33e−3 ± 1.46e−6 6.22e−3 ± 1.25e−3
10 5.33e−3 ± 2.93e−6 6.78e−3 ± 2.04e−3
50 5.51e−3 ± 2.49e−4 7.53e−3 ± 3.10e−3
log-normal
100 5.33e−3 ± 5.54e−7 6.90e−3 ± 2.71e−3
200 5.34e−3 ± 3.03e−6 5.33e−3 ± 1.69e−6
16 7.09e−2 ± 1.87e−5 7.09e−2 ± 7.39e−6
cookie fixed
64 9.73e−2 ± 1.16e−5 9.73e−2 ± 5.62e−6
32 7.83e−2 ± 3.22e−4 7.81e−2 ± 2.39e−4
cookie variable
128 1.12e−1 ± 2.58e−4 1.12e−1 ± 2.76e−5
18
Multilevel CNNs for Parametric PDEs
7 Conclusion
In this work, we combine concepts from established multilevel algorithms with NNs for
efficiently solving challenging high-dimensional parametric PDEs. We provide a theoretical
complexity analysis revealing the relation between UNet-like architectures approximate clas-
sical multigrid algorithms arbitrarily well. Moreover, we show that the ML-Net architecture
presents an advantageous approach for learning the parameter-to-solution operator of the
parametric Darcy problems. The performance of our method is illustrated by several numer-
ical benchmark experiments showing a significant improvement over the state-of-the-art of
previous DL-based approaches for parametric PDEs. In fact, the shown approximation qual-
ity matches the best-in-class techniques such as low-rank least squares methods, stochastic
Galerkin approaches, compressed sensing and stochastic collocation. Additionally, we show
that the multilevel architecture allows to use fewer data points for finer corrections during
19
Heiß, Gühring and Eigel
Table 3: All errors for ML-Net and UNetSeq trained with a reduced dataset size on
two problem cases with parameter dimension p = 100. The ML-Net is trained with
the number of training samples halved for each level and UNetSeq is trained using
232 samples in total.
training with negligible impact on practical performance. This enables the extension to
much finer resolutions without a prohibitive increase of computational complexity. Finally,
we note that while we consider a scalar linear elliptic equation in this work, our method-
ology can be used for a variety of possibly nonlinear and vector-valued problems. Future
research directions could include the analysis of such problems as well as the extension of
our method to adaptive grids.
The authors would like to thank Reinhold Schneider for fruitful discussions on the topic. I.
Gühring acknowledges support from the Research Training Group “Differential Equation-
and Data-driven Models in Life Sciences and Fluid Dynamics: An Interdisciplinary Research
Training Group (DAEDALUS)” (GRK 2433) funded by the German Research Founda-
tion (DFG) and a post-doctoral scholarship from Technical University of Berlin for “Deep
Learning for (Parametric) PDEs from a Theoretical and Practical Perspective”. M. Eigel
acknowledges partial support from DFG SPP 1886 “Polymorphic uncertainty modelling
for the numerical design of structures” and DFG SPP 2298 “Theoretical foundations of
Deep Learning”. We acknowledge the computing facilities and thank the IT support of the
Weierstrass Institute.
References
Ben Adcock and Nick Dexter. The gap between theory and practice in function approxim-
ation with deep neural networks. SIAM Journal on Mathematics of Data Science, 3(2):
624–655, 2021.
20
Multilevel CNNs for Parametric PDEs
21
Heiß, Gühring and Eigel
Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko. Re-
thinking coarse-to-fine approach in single image deblurring. In 2021 IEEE/CVF In-
ternational Conference on Computer Vision (ICCV), pages 4621–4630, 2021. doi:
10.1109/ICCV48922.2021.00460.
Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. Society for Industrial
and Applied Mathematics, 2002. doi: 10.1137/1.9780898719208.
Albert Cohen and Giovanni Migliorati. Near-optimal approximation methods for elliptic
pdes with lognormal coefficients. Mathematics of Computation, 2023.
Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for
high-dimensional parabolic partial differential equations and backward stochastic differ-
ential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
ISSN 2194-671X. doi: 10.1007/s40304-017-0117-6.
Martin Eigel, Claude Jeffrey Gittelson, Christoph Schwab, and Elmar Zander. Adaptive
stochastic galerkin FEM. Computer Methods in Applied Mechanics and Engineering,
270:247–269, March 2014. doi: 10.1016/j.cma.2013.11.015. URL https://doi.org/10.
1016/j.cma.2013.11.015.
Martin Eigel, Reinhold Schneider, Philipp Trunschke, and Sebastian Wolf. Variational
monte carlo—bridging concepts of machine learning and high-dimensional partial dif-
ferential equations. Advances in Computational Mathematics, 45(5-6):2503–2532, Oc-
tober 2019a. doi: 10.1007/s10444-019-09723-8. URL https://doi.org/10.1007/
s10444-019-09723-8.
Martin Eigel, Reinhold Schneider, Philipp Trunschke, and Sebastian Wolf. Variational
monte carlo—bridging concepts of machine learning and high-dimensional partial differ-
ential equations. Advances in Computational Mathematics, 45(5):2503–2532, 2019b.
Martin Eigel, Manuel Marschall, Max Pfeffer, and Reinhold Schneider. Adaptive stochastic
galerkin FEM for lognormal coefficients in hierarchical tensor representations. Numerische
Mathematik, 145(3):655–692, 6 2020. doi: 10.1007/s00211-020-01123-1.
Oliver G Ernst, Bjorn Sprungk, and Lorenzo Tamellini. Convergence of sparse collocation
for functions of countably many gaussian random variables (with application to elliptic
pdes). SIAM Journal on Numerical Analysis, 56(2):877–905, 2018.
Yuwei Fan, Jordi Feliu-Faba, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A
multiscale neural network based on hierarchical nested bases. Research in the Math-
ematical Sciences, 6(2):1–28, 2019a.
Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural net-
work based on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213,
2019b.
22
Multilevel CNNs for Parametric PDEs
Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning gener-
ative convnets via multi-grid modeling and sampling. 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 9155–9164, 2018.
Moritz Geist, Philipp Petersen, Mones Raslan, Reinhold Schneider, and Gitta Kutyniok.
Numerical solution of the parametric diffusion equation by deep neural networks. Journal
of Scientific Computing, 88(1):1–37, 2021.
Martin Genzel, Ingo Gühring, Jan MacDonald, and Maximilian März. Near-exact recovery
for tomographic inverse problems via deep learning. In Kamalika Chaudhuri, Stefanie
Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International
Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland,
USA, volume 162 of Proceedings of Machine Learning Research, pages 7368–7381. PMLR,
2022.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, Cam-
bridge, 2016.
Philipp Grohs and Lukas Herrmann. Deep neural network approximation for high-
dimensional elliptic pdes with boundary conditions. IMA Journal of Numerical Analysis,
42(3):2055–2082, 2022.
Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe Von Wurstemberger. A
proof that artificial neural networks overcome the curse of dimensionality in the nu-
merical approximation of black-scholes partial differential equations. arXiv preprint
arXiv:1809.02362, 2018.
Tamara G Grossmann, Urszula Julia Komorowska, Jonas Latz, and Carola-Bibiane Schön-
lieb. Can physics-informed neural networks beat the finite element method? arXiv
preprint arXiv:2302.04107, 2023.
Ingo Gühring and Mones Raslan. Approximation rates for neural networks with encodable
weights in smoothness spaces. Neural Networks, 134:107–130, 2021.
Xiaoxiao Guo, Wei Li, and Francesco Iorio. Convolutional neural networks for steady flow
approximation. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’16, page 481–490, New York, NY, USA,
2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.
2939738.
Ingo Gühring, Gitta Kutyniok, and Philipp Petersen. Error bounds for approximations with
deep ReLU neural networks in W s,p norms. Analysis and Applications (Singap.), 18(05):
803–859, 2020.
Ingo Gühring, Mones Raslan, and Gitta Kutyniok. Expressivity of deep neural networks.
In Philipp Grohs and Gitta Kutyniok, editors, Mathematical Aspects of Deep Learning,
page 149–199. Cambridge University Press, 2022.
Wolfgang Hackbusch. Multi-grid methods and applications, volume 4. Springer Science &
Business Media, 2013.
23
Heiß, Gühring and Eigel
Jiequn Han, Arnulf Jentzen, et al. Deep learning-based numerical methods for high-
dimensional parabolic partial differential equations and backward stochastic differential
equations. Communications in mathematics and statistics, 5(4):349–380, 2017.
Helmut Harbrecht, Michael Peters, and Markus Siebenmorgen. Multilevel accelerated quad-
rature for pdes with log-normally distributed diffusion coefficient. SIAM/ASA Journal
on Uncertainty Quantification, 4(1):520–551, 2016.
Juncai He and Jinchao Xu. Mgnet: A unified framework of multigrid and convolutional
neural network. Science China Mathematics, 62(7):1331–1354, 2019. ISSN 1869-1862.
doi: 10.1007/s11425-019-9547-2.
J.S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear prob-
lems using neural networks. Journal of Computational Physics, 363:55–78, 2018.
ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.02.037. URL https://www.
sciencedirect.com/science/article/pii/S0021999118301190.
Zhihao Jiang, Pejman Tahmasebi, and Zhiqiang Mao. Deep residual u-net convolution
neural networks with autoregressive strategy for fluid flow predictions in large-scale geo-
systems. Advances in Water Resources, 150:103878, 02 2021. doi: 10.1016/j.advwatres.
2021.103878.
John A Keith, Valentin Vassilev-Galindo, Bingqing Cheng, Stefan Chmiela, Michael Gasteg-
ger, Klaus-Robert Müller, and Alexandre Tkatchenko. Combining machine learning and
computational chemistry for predictive insights into chemical systems. Chemical Reviews,
121(16):9816–9872, 2021.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings, 2015.
Omar M Knio and OP Le Maitre. Uncertainty propagation in cfd using polynomial chaos
decomposition. Fluid dynamics research, 38(9):616, 2006.
Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical
analysis of deep neural networks and parametric pdes. Constructive Approximation, 55
(1):73–125, 2022.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Andrew Stuart,
Kaushik Bhattacharya, and Anima Anandkumar. Multipole graph neural operator for
parametric partial differential equations. Advances in Neural Information Processing
Systems, 33:6755–6766, 2020.
Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik
Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for
24
Multilevel CNNs for Parametric PDEs
Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik
Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for
parametric partial differential equations. In International Conference on Learning Rep-
resentations, 2021b.
Liangliang Liu, Jianhong Cheng, Quan Quan, Fang-Xiang Wu, Yu-Ping Wang, and Jianxin
Wang. A survey on u-shaped networks in medical image segmentations. Neurocomputing,
409:244–258, 2020. ISSN 0925-2312.
Anders Logg and Garth N. Wells. Dolfin. ACM Transactions on Mathematical Software,
37(2):1–28, 2010. ISSN 1557-7295. doi: 10.1145/1731022.1731030.
Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. Deepxde: A deep learning
library for solving differential equations. SIAM review, 63(1):208–228, 2021.
Kjetil O Lye, Siddhartha Mishra, and Roberto Molinaro. A multi-level procedure for enhan-
cing accuracy of machine learning algorithms. European Journal of Applied Mathematics,
32(3):436–469, 2021.
Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics-
informed neural networks for approximating PDEs. IMA Journal of Numerical Analysis,
43(1):1–43, 01 2022. ISSN 0272-4979. doi: 10.1093/imanum/drab093.
Fabio Nobile, Raúl Tempone, and Clayton G Webster. A sparse grid stochastic collocation
method for partial differential equations with random input data. SIAM Journal on
Numerical Analysis, 46(5):2309–2345, 2008.
Frank Noé, Alexandre Tkatchenko, Klaus-Robert Müller, and Cecilia Clementi. Machine
learning for molecular simulation. Annual Review of Physical Chemistry, 71(1):361–390,
2020.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas
Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkur-
thy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative
style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzi-
mer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information
Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
Philipp Petersen and Felix Voigtländer. Optimal approximation of piecewise smooth func-
tions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
25
Heiß, Gühring and Eigel
26
Multilevel CNNs for Parametric PDEs
Benjamin Ummenhofer, Lukas Prantl, Nils Thuerey, and Vladlen Koltun. Lagrangian fluid
simulation with continuous convolutions. In International Conference on Learning Rep-
resentations, 2020.
Dmitry Yarotsky. Error bounds for approximations with deep relu networks. Neural Net-
works, 94:103–114, 2017.
Harry Yserentant. Old and new convergence proofs for multigrid methods. Acta numerica,
2:285–326, 1993.
Bing Yu et al. The deep ritz method: a deep learning-based numerical algorithm for solving
variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
Yinhao Zhu and Nicholas Zabaras. Bayesian deep convolutional encoder–decoder networks
for surrogate modeling and uncertainty quantification. Journal of Computational Physics,
366:415–447, 2018. ISSN 0021-9991. doi: 10.1016/j.jcp.2018.04.018.
27
Heiß, Gühring and Eigel
For ML-Net, we choose the number of UNets per level R` , ` = 1, . . . , L, such that
each subnetwork ML-Net` has approximately 8×105 trainable parameters. This number
was empirically found to produce satisfactory results and yields R = [11, 9, 5, 4, 3, 2, 2].
To downsample the diffusivity field κ to the input resolution required by the subnetwork
ML-Net` , L − ` + 1 strided convolutions followed by ReLU activations are used. We use 32
channels for convolutional layers of ML-Net, except for the coarsest level where 64 channels
are used. For all levels ` = 2, . . . , L except the coarsest one, the ML-Net` subnetworks
output the predictions as four channels at half the resolution and use pixel un-shuffle layers
(Shi et al., 2016) to assemble the full predictions. The reason for this is that fine grid
corrections for a dyadic subdivision yield a higher similarity between every second value than
neighbouring values in the array. For UNetSeq, 64 channels are used for all convolutional
layers.
ML-Net and UNetSeq are trained for 200 epochs with an initial learning rate of 10−3
during the first 60 epochs. The learning rate is then linearly decayed to 2×10−5 over the
next 100 epochs, where it was held for the rest of the training. Due to memory constraints,
the batch sizes were chosen to be 20 for ML-Net and 16 for UNetSeq. The Adam optimizer
(Kingma and Ba, 2015) was used in the standard configuration with parameters β1 =
0.99, β2 = 0.999, and without weight decay.
The following tables depict relative L2 errors of the considered NN architectures for all
presented test cases.
28
Multilevel CNNs for Parametric PDEs
parameter EMRL2
problem case
dimension p ML-Net UNetSeq
10 9.28e−5 ± 6.92e−5 3.72e−5 ± 9.18e−6
50 4.88e−5 ± 6.82e−6 1.21e−3 ± 1.66e−3
uniform
100 4.90e−5 ± 9.56e−6 3.21e−5 ± 3.06e−6
200 4.81e−5 ± 7.38e−6 1.17e−3 ± 1.61e−3
10 7.44e−5 ± 1.75e−5 1.66e−3 ± 2.22e−3
50 7.46e−4 ± 9.61e−4 2.30e−3 ± 3.13e−3
log-normal
100 6.54e−5 ± 6.71e−6 1.72e−3 ± 2.80e−3
200 7.35e−5 ± 2.82e−6 7.94e−5 ± 2.14e−5
16 3.29e−4 ± 3.26e−5 2.35e−4 ± 1.37e−5
cookie fixed
64 5.32e−4 ± 1.80e−5 1.40e−4 ± 4.32e−5
32 1.33e−3 ± 1.05e−4 7.68e−4 ± 2.45e−5
cookie variable
128 2.01e−3 ± 8.14e−5 7.14e−4 ± 7.74e−5
Table 4: EMRL2 for ML-Net and UNetSeq evaluated on all test cases.
ref
EMRL
parameter 2
problem case
dimension p ML-Net UNetSeq
10 1.05e−4 ± 6.75e−5 4.77e−5 ± 6.80e−6
50 5.86e−5 ± 7.02e−6 1.22e−3 ± 1.66e−3
uniform
100 5.50e−5 ± 5.07e−6 4.19e−5 ± 3.41e−6
200 5.67e−5 ± 5.78e−6 1.18e−3 ± 1.60e−3
10 7.76e−5 ± 1.29e−5 1.66e−3 ± 2.22e−3
50 7.59e−4 ± 9.68e−4 2.30e−3 ± 3.13e−3
log-normal
100 7.06e−5 ± 4.38e−6 1.72e−3 ± 2.80e−3
200 8.31e−5 ± 3.91e−6 8.69e−5 ± 1.97e−5
16 6.19e−3 ± 6.60e−5 6.05e−3 ± 3.50e−5
cookie fixed
64 9.63e−3 ± 4.33e−5 9.48e−3 ± 2.79e−5
32 8.81e−3 ± 3.41e−5 8.49e−3 ± 9.10e−5
cookie variable
128 1.64e−2 ± 5.36e−5 1.62e−2 ± 6.70e−5
29
Heiß, Gühring and Eigel
Consider Problem 2.1. Under the uniform ellipticity assumption1 , the energy norm is equi-
valent to the H 1 norm (see e.g. Cohen and DeVore (2015)). We denote by cH 1 , CH 1 > 0 the
grid independent constants such that for all u ∈ V and y ∈ Γ
Moreover, for a multilevel decomposition up to level L ∈ N and v̂1 , . . . , v̂L defined as in (6),
denote by Ccorr > 0 the constant such that for all y ∈ Γ and ` = 1, . . . , L
Note that for any y ∈ Γ, the solution vh (y) of Problem 2.1 satisfies
Remark 11 For D = [0, 1]2 , mesh width h > 0 and m := 1/h ∈ N, we exclusively consider
the uniform square grid comprised of m2 identical squares each subdivided into two triangles
as illustrated in Figure 4.
30
Multilevel CNNs for Parametric PDEs
and Hout := 2Hin +1. Moreover, we write M (Φ) ∈ N for the number of trainable parameters
and L(Φ) ∈ N for the number of layers of a (possibly convolutional) NN Φ.
Many previous works have focused on approximating functions by fully connected NNs.
With a simple trick, we transfer these approximation results to CNNs.
Proof The proof follows directly by using the affine linear transformations in each layer
of Φ in the channel dimension as 1 × 1 convolutions.
Corollary 13 Let % ∈ L∞ loc (R) such that there exists x0 ∈ R with % is three times continu-
ously differentiable in a neighborhood of some x0 ∈ R and %00 (x0 ) 6= 0. Let W ∈ N, B > 0,
and ε ∈ (0, 1/2), then there exists a CNN × e with activation function %, a two-channel input
and one-channel output that satisfies the following properties:
1. ×(x,
e y) − x y L∞ ([−B,B]2×W ×W ,dxdy) ≤ ε;
e = 2 and M (×)
2. L(×) e ≤ 9;
Proof The proof follows from Theorem 12 together with (Gühring and Raslan, 2021, Co-
rollary C.3).
31
Heiß, Gühring and Eigel
(ii) For each vertex i ∈ {1, . . . , dim Vh } of the triangulation, we enumerate the six adjacent
(k)
triangles (arbitrarily but in the same order for every i) and denote them by Ti with
k = 1, . . . , 6 (see Figure 4 for an illustration).
and
Υ(κ, T ) := [Υ(κ, T , k)]6k=1 ∈ R6×dim Vh .
In the next lemma, we show that the integrals Υ(κ, T , k, i) can be computed via a
convolution from (a discretized) κ and, furthermore, that integrals over a coarse grid can
be computed from fine-grid-integrals, again via a convolution.
Lemma 15 Let Th , T2h be nested triangulations of fineness h and 2h, respectively, with
corresponding FE spaces V2h ⊆ Vh ⊆ H01 (D). Then, the following holds:
(i) There exists a convolutional kernel K ∈ R1×6×3×3 such that for every κ ∈ Vh , we have
for k ∈ {1, . . . , 6}
(ii) There exists a convolutional kernel K ∈ R6×6×3×3 such that for every κ ∈ H01 (D) and
k = 1, . . . , 6, we have
32
Multilevel CNNs for Parametric PDEs
Ti2
Ti3 Ti1
xi xi xj
Ti4 Ti6
Ti5
(h) (h)
where we used in the last step that κ dx = h2/3 if supp ϕj ∩Tik 6= ∅. Clearly, supp ϕj ∩
R
Tik
Tik = ∅ for all j ∈ {1, . . . , dim Vh } with j ∈
/ Neigh2D(i). Furthermore, the position of the
(h)
relevant neighbors {j : supp ϕj ∩ Ti 6= ∅} relative to index i is invariant to 2D-translations
k
The next theorem shows that using the integral representation of κ from Definition 14,
we can now represent Aκ u by the pointwise multiplication of the κ integrals with the output
of a convolution applied to u.
Theorem 16 (Representation of Aκ u) For k = 1, . . . , 6, there exist convolutional ker-
nels K (k) ∈ R1×1×3×3 , such that for the function
6
F : R7×dim V → Rdim V , (u, κ(1) , . . . , κ(6) ) 7→ (u ∗ K (k) ),
X
κ(k)
k=1
33
Heiß, Gühring and Eigel
(1)
(T2h )i
xj3 (Th )(1)
j3
(2)
(Th )j1
(1) (1)
xi x j1 xi (Th )i (Th )j1
(6)
(Th )i
(6)
(Th )j1 x j1
(1)
x j2 (Th )j2
(6)
(Th )j2
(6)
(T2h )i
Proof We start by introducing some notation: Let i, j ∈ {1, . . . , dim Vh } and k ∈ {1, . . . , 6}
and set Cijk ∈ R as the constant that h∇ϕi , ∇ϕj i attains on Tik . Note that Cijk = 0 if
i ∈/ Neigh2D(j). We now represent each entry of Aκ as a sum of multiplications: For
i, j ∈ {1, . . . , dim Vh }, we have
Z 6 Z 6
(Aκ )ij = κ h∇ϕi , ∇ϕj i dx = κ h∇ϕi , ∇ϕj i dx = Υ(κ, T , k, i)Cijk ,
X X
D k
k=1 Ti k=1
where we use Definition 14 (ii) for the last step. For j ∈ {1, . . . , dim Vh }, we can now rewrite
(Aκ u)j as
XVh
dim 6
(Aκ u)j =
X
ui Υ(κ, T , k, j)Cijk
i=1 k=1
6 XVh
dim
=
X
Υ(κ, T , k, j) ui Cijk
k=1 i=1
6
=
X X
Υ(κ, T , k, j) ui Cijk .
k=1 i∈Neigh2D(j)
Here, we use in the first step the definition of Aκ and in the third step that supp ϕi ∩
supp ϕj 6= ∅ only for neighboring indices. Since Cijk only depends on the relative position
of i to j (in the 2D sense) and is independent on j, the inner sum can be expressed in the
34
Multilevel CNNs for Parametric PDEs
(ii) Corollary 13 yields the existence of a two-layer CNN Ψ̃ with at most 9 weights and
1 × 1 kernels such that
ε
Ψ̃([u ∗ K (k) , κ(k) ]6k=1 ) − [κ(k) (u ∗ K (k) )]6k=1 ≤ .
∞ 6
(iii) The channel-wise addition can trivially be constructed with a one-layer CNN with a
1 × 1 kernel.
A concatenation of the above three CNNs concludes the proof.
Recall that the prolongation and its counterpart, the weighted restriction, are essential
operations in the multigrid cycle. The following remark states that these operations can
naturally be expressed as convolutions.
35
Heiß, Gühring and Eigel
F : Rd1 → Rdn+1 , F := fn ◦ . . . ◦ f1 .
Then, for every M, ε > 0, there exist M1 , . . . , Mn > 0 and ε1 , . . . , εn > 0, such that the
following holds true: If f˜i : Rdi → Rdi+1 are functions such that kfi − f˜i kL∞ ([−Mi ,Mi ]di ) ≤ εi
for i = 1, . . . , n, then
F − f˜n ◦ . . . ◦ f˜1 ≤ ε.
L∞ ([−M,M ]d1 )
Proof We prove the statement by induction over n ∈ N. For n = 1, the statement is clear.
Define the function g : Rd1 → Rdn for n > 1 by
g := fn−1 ◦ . . . ◦ f1 .
g − f˜n−1 ◦ . . . ◦ f˜1 ≤ δ.
L∞ ([−M,M ]d1 )
Setting g̃ := f˜n−1 ◦ . . . ◦ f˜1 , we get kg̃kL∞ ([−M,M ]d1 ) ≤ kgkL∞ ([−M,M ]d1 ) + δ ≤ Mn . The proof
is concluded by deriving that
F − (f˜n ◦ . . . ◦ f˜1 )
L∞ ([−M,M ]d1 )
36
Multilevel CNNs for Parametric PDEs
With the preceding preparations, we are now ready to prove our main results. We start
with the approximation of the multigrid algorithm by a CNN.
Proof of Theorem 6 The overall proof strategy consists of decomposing the function
MGm k0 ,k : R
3×dim VL → Rdim VL , where V = V , into a concatenation of a finite number of
L h
continuous functions that can either be implemented by a CNN or arbitrarily well approx-
imated by a CNN. An application of Lemma 20 then yields that the concatenation of these
CNNs (which is again a CNN) approximates MGm k0 ,k . We make the following definitions:
it then follows from Lemma 15(i) that fin ([u, κ, f ]) = [u, Υ(κh , T L ), f ].
`
fsm ([u, Υ(κh , T ` ), f ]) = [u + ω(f − A`κh u), Υ(κh , T ` ), f ].
u
u f − F (u, κ(1) , . . . , κ(6) )
κ(1) `
.
κ(1)
`
: R8×dim V` → R9×dim V` , . 7→
fresi . .. ,
.
(6)
κ (6)
κ
f
f
`
fresi ([u, Υ(κh , T ` ), f ]) = [u, f − A`κh u, Υ(κh , T ` ), f ].
37
Heiß, Gühring and Eigel
Let now [K1 , . . . , K6 ] ∈ R6×6×3×3 be the convolutional kernel from Lemma 15(ii) and
P`−1 ∈ Rdim V` ×dim V`−1 the prolongation matrix from Definition 2 (where its transpose
acts as the weighted restriction), then we define
u
κ(1)
.
.
.
u
(6)
κ
r
(1)
f
κ
`
: R9×dim V` → R8×dim V` × R8×dim V`−1 ,
frest . 7→
. .
.
(6)
0
κ κ(1) ∗ K
1
..
f
.
(6)
κ ∗ K6
T r
P`−1
(iv) Add coarse grid correction to fine grid. As above, let P`−1 ∈ Rdim V` ×dim V`−1 be
the prolongation matrix from Definition 2. We define the function
u
(1)
κ
. u + P`−1 e
.
. κ(1)
(6)
.
κ
`
: R8×dim V` × R8×dim V`−1 → R8×dim V` , .
fprol 7→
.
.
f
κ(6)
e
! f
..
.
38
Multilevel CNNs for Parametric PDEs
Finally, MGm
k0 ,k is given by:
MGm
k0 ,k = fout ◦ i=1 VCk0 ,k
m L
◦ fin .
It is easy to see that the above defined functions are continuous and, thus, Lemma 20
dictates the accuracy with which to approximate each part of the multigrid algorithm (fsm ` ,
fprol , frest , fresi , fin , fout ) on each level ` = 1, . . . , L to yield a final approximation accuracy
` ` ` L L
The approximation of fin is derived using Lemma 15 and fout is trivially represented using a
L L
single convolution. Concatenating the individual CNNs yields a CNN Ψ with an architecture
resembling multiple concatenated UNets, such that
Ψ − MGm
k0 ,k ≤ δ.
L∞ ([−M,M ]3×n )
The norm equivalence of k·kL∞ ([−M,M ]3×n ) and k·kH 1 in the finite dimensional setting dir-
ectly implies inequality (i) with a suitable choice of δ. The parameter bound (ii) follows
from the construction of Ψ.
Proof of Corollary 8 Fix the damping parameter ω > 0 and k ∈ N such that µ(k) < 1
and let m ∈ N be chosen as small as possible such that
1 2CH
−1 !
2
1
m ≥ log log .
µ(k) ε
Let k0 ∈ N be given such that the contraction factor of k0 damped Richardson iterations
on the coarsest
n grid is smaller o
than µ(k). Let Ψ be the CNN from Theorem 6 setting
2
M = max supy∈Γ kκy k∞ , kf k∞ such that for all κ, f ∈ [−M, M ]n
1
Ψ(0, κ, f ) − MGm
k0 ,k (0, κ, f ) ≤ ε kf k∗ .
H1 2
Using the contraction property of the multigrid algorithm from Theorem 3, we get
MGm
k0 ,k (0, κy , f ) − vh (y) ≤ µ(k)m kvh (y)kAy
Ay
1 −1
≤ εCH 1 kf k∗ .
2
This yields
kΨ(0, κy , f ) − vh (y)kH 1
≤ Ψ(0, κy , f ) − MGm
k0 ,k (0, κy , f ) + MGm
k0 ,k (0, κy , f ) − vh (y)
H1 H1
1
≤ ε kf k∗ + CH 1 MGm
k0 ,k (0, κy , f ) − vh (y)
2 Ay
≤ ε kf k∗ .
39
Heiß, Gühring and Eigel
Assuming ε ≤ 1, there is a constant C > 0 such that m ≤ C log 1
ε . Together with the
parameter count from Theorem 6, this concludes the proof.
g m` : R8×dim V → Rdim V as
(ii) the function MG k0 ,k
m`
m`
MG
g
k0 ,k := fout ◦
`
i=1 VCk0 ,k
`
,
m` m` m`
such that MG
g
k0 ,k ◦ fin =
`
i=1 MGk0 ,k .
We outline the following multilevel procedure for solving Problem 2.1 using multigrid with
y ∈ Γ and κy ∈ Rdim VL given as input:
• κy subsampling. Using κy , compute κL y as outlined in Lemma 15 (i). Subsequently,
compute κy for each level ` = 2, . . . , L − 1 as shown in Lemma 15 (ii).
`
• ` = 2, . . . , L. While v̂` (y) is the true correction with respect to the Galerkin approx-
imation v`−1 (y), we introduce v̌` (y) as the correction with respect to ṽ`−1 (y). To
this end, set
v̌` (y) := v` (y) − P` ṽ`−1 (y),
f̃` := f` − A`y P` ṽ`−1 (y) = A`y v̌` (y).
Solve A`y v̌` (y) = f̃` using m` iterations of the multigrid V-cycle and set
m`
ṽ` (y) := MG
g
k0 ,k (0, κy , f̃` ) + P` ṽ`−1 (y).
`
40
Multilevel CNNs for Parametric PDEs
This procedure yields the following estimates using the contraction from Theorem 3.
• ` = 2, . . . , L − 1. Using again Theorem 3, the definitions for m` , ṽ` , v̌` and v̂` =
v` − P` v`−1 , it holds
= g m` (0, κ` , f̃ )
MG − v̌` (y)
k0 ,k y `
A`y
1
≤ kv̌` (y)kA`y
2
1 1
≤ kv̂` (y)kA`y + kP` ṽ`−1 (y) − P` v`−1 (y)kA`y
2 2
1 1
≤ c2−` kf k∗ + kṽ`−1 (y) − v`−1 (y)kA`−1 .
2 2 y
• ` = L. By the same argument as for the case ` = 2, . . . , L − 1 together with the norm
inequality in (9) and the definition of mL , we arrive at the following estimate for the
H 1 -distance on the finest level.
1 L
≤ 2 ε kv̌L (y)kALy
2cL
1 L 1 1
≤ 2 ε kv̂L (y)kALy + kPL ṽL−1 (y) − PL vL−1 (y)kALy
cL 2 2
1 L 1 −L 1
≤ 2 ε c2 kf k∗ + kṽL−1 (y) − vL−1 (y)kAL−1
cL 2 2 y
41
Heiß, Gühring and Eigel
are realized as in Remark 19. Hence, the procedure poses a concatenation of finitely many
functions each representable by CNNs up to any accuracy. Using Lemma 20 (with M = M
and ε = 12 ε kf k∗ ) implies the existence of a CNN Ψ such that for all y ∈ Γ and all right
hand sides f , we have
The structure of the presented multilevel algorithm together with the parameter bounds
from Theorem 6, Lemma 15 and Remark 19 implies that Ψ is an ML-Net architecture and
that there exists a constant C > 0 with
L−1
2L
M (Ψ) ≤ C ` + C log − L log(2) L
X
`=1
ε
1
≤ CL L + log + log(2L) − L log(2)
ε
1
≤ CL log + CL2 .
ε
42