
Multilevel CNNs for Parametric PDEs

Cosmas Heiß cosmas.heiss@gmail.com


Department of Mathematics
Technical University of Berlin
Berlin, Germany
Ingo Gühring guehring@math.tu-berlin.de
Machine Learning Group
Technical University of Berlin
Berlin, Germany
Martin Eigel martin.eigel@wias-berlin.de
Weierstraß Institute
Berlin, Germany

Abstract
We combine concepts from multilevel solvers for partial differential equations (PDEs)
with neural network based deep learning and propose a new methodology for the efficient
numerical solution of high-dimensional parametric PDEs. An in-depth theoretical analysis
shows that the proposed architecture is able to approximate multigrid V-cycles to arbitrary
precision with the number of weights only depending logarithmically on the resolution of
the finest mesh. As a consequence, approximation bounds for the solution of parametric
PDEs by neural networks that are independent of the (stochastic) parameter dimension
can be derived.
The performance of the proposed method is illustrated on high-dimensional parametric
linear elliptic PDEs that are common benchmark problems in uncertainty quantification.
We find substantial improvements over state-of-the-art deep learning-based solvers. As
particularly challenging examples, random conductivity with high-dimensional non-affine
Gaussian fields in 100 parameter dimensions and a random cookie problem are examined.
Due to the multilevel structure of our method, the amount of training samples can be
reduced on finer levels, hence significantly lowering the generation time for training data
and the training time of our method.
Keywords: deep learning, partial differential equations, parametric PDE, multilevel,
expressivity, CNN, uncertainty quantification

1 Introduction
The application of deep learning (DL) to many problems of the natural sciences has become
one of the most promising emerging research areas, sometimes coined scientific machine
learning (Baldi, 2018; Sadowski and Baldi, 2018; Keith et al., 2021; Noé et al., 2020; Lu
et al., 2021). In this work, we focus on the challenging problem of parametric partial
differential equations (PDEs) that are used to describe complex physical systems e.g. in
engineering applications and the natural sciences. The parameters determine the data or
domain and are used to introduce uncertainties into the model, e.g. based on assumed
or measured statistical properties. The resulting high-dimensional problems are analyzed
and solved numerically in the research area of uncertainty quantification (UQ) (Lord et al.,

2014). Such parametric models are of importance in automated design, weather forecasting,
risk simulations, and the material sciences, to name but a few.
To make the problem setting concrete, for a possibly countably infinite parameter space Γ ⊆ R^N and a spatial domain D ⊆ R^d, d ∈ {1, 2, 3}, we are concerned with learning the solution map u : Γ × D → R of the parametric stationary diffusion PDE that satisfies the following equation for all parameters y ∈ Γ:

    − div_x (κ(y, x) ∇_x u(y, x)) = f(x)   on D,
    u(y, x) = 0                            on ∂D,          (1)
where κ : Γ × D → R is the parameterized diffusion coefficient describing the diffusivity,
i.e. the media properties. Note that differential operators act with respect to the physical
variable x.
As a straightforward but computationally ineffective solution, Equation (1) could be solved numerically for each y ∈ Γ individually, using e.g. the finite element (FE) method and a Monte Carlo approach to estimate statistical quantities of interest. More refined methods such as stochastic Galerkin (Eigel et al., 2014, 2020), stochastic collocation (Babuska et al., 2007; Nobile et al., 2008; Ernst et al., 2018) or least squares methods (Chkifa et al., 2015; Eigel et al., 2019a; Cohen and Migliorati, 2023) exploit smoothness and low-dimensionality of the solution manifold {u(y) : y ∈ Γ} by projection or interpolation of a finite set of discrete solutions {u_h(y_n)}_{n=1}^N. After the numerical solution of Equation (1) at sample points y_n and the evaluation of the projection or interpolation scheme, solutions for arbitrary y ∈ Γ can quickly be obtained approximately by simply evaluating the constructed functional “surrogate”. However, while compression techniques such as low-rank approximations or reduced basis methods foster the exploitation of low-dimensional structures, these approaches still become practically infeasible for large parameter dimensions rather quickly.
Multilevel approaches represent another concept that can be applied beneficially for this
problem class as was e.g. shown with multilevel stochastic collocation (MLSC) (Teckentrup
et al., 2015) and multilevel quadrature (Harbrecht et al., 2016; Ballani et al., 2016). The
central idea is to equilibrate the error components of the physical and the statistical approximations. This can be achieved by performing a frequency decomposition of the solution in
the physical domain based on a hierarchy of nested FE spaces determined by appropriate
spatial discretizations. An initial approximation on the coarsest level is successively refined
by additive corrections on finer levels. Corrections generally contain strongly decreasing
information for finer levels, which can be exploited in terms of a level dependent fineness of
the stochastic discretization.
For the numerical solution of (usually non-parametric) PDEs, a plethora of NN architectures has been devised (see e.g., Beck et al., 2023; Berner et al., 2020a; Raissi et al., 2019;
Sirignano and Spiliopoulos, 2018; E et al., 2017; Ramabathiran and Ramachandran, 2021;
Kharazmi et al., 2019; Yu et al., 2018; Han et al., 2017; Li et al., 2021b). Similar to classical
methods, DL-based solvers for parametric PDEs typically learn the mapping y 7→ uh (y) in
a supervised way from computed (or observed) samples {u_h(y_n)}_{n=1}^N (see e.g., Geist et al., 2021; Hesthaven and Ubbiali, 2018; Anandkumar et al., 2020). This results in a computationally expensive data generation process, consisting of approximately computing the
solutions at yn and subsequently training the NN on this data. After training, only a cheap
forward pass is necessary for the approximate solution of (1) at arbitrary y. We note that

in contrast to many classical machine learning problems, a dataset can be generated on the
fly and in principle with arbitrary precision.
Complementing numerical results, significant progress has been made to understand
the approximation power, also called expressivity, of NNs for representing solutions of
PDEs (Kutyniok et al., 2022; Schwab and Zech, 2019; Berner et al., 2020b; Mishra and
Molinaro, 2022; Grohs et al., 2018; Gühring et al., 2020; Gühring and Raslan, 2021). These
works show that NNs are at least as expressive as “classical” best-in-class methods like
the ones mentioned before. However, numerical results often do not reflect this theoretical advantage (Adcock and Dexter, 2021). Instead, the convergence of the error typically stagnates early during training and the accuracy of classical approaches is not reached (Geist
et al., 2021; Kharazmi et al., 2019; Lu et al., 2021; Grossmann et al., 2023).
We propose a DL-based NN approach that overcomes the performance shortcomings
of other NN architectures for parametric PDEs by borrowing concepts from MLSC and
multigrid algorithms. Coarse approximation and successive corrections on finer levels are
learned in a supervised way by sub-networks and combined for the final prediction. Analogously to multilevel Monte Carlo (MLMC) and MLSC methods, we show that the fineness
of sampling the stochastic parameter space (in terms of the amount of training data per
level) can be (reciprocally) equilibrated with the fineness of the spatial discretization. The
developed architecture consists of a composition of UNets (Ronneberger et al., 2015), which
inherently are closely related to multiresolution approaches.
In the following, we summarize our main contributions.

• Theory: We conduct an extensive analysis of the expressivity of UNet-like convolutional NNs (CNNs) for the solution of parametric PDEs. Our results show that under the uniform ellipticity assumption, the number of required parameters is independent of the parameter dimension with increasing accuracy. In detail, we show that common multilevel CNN architectures are able to approximate a V-cycle multigrid solver iteration with Richardson smoother (Richardson and Glazebrook, 1911) with less than log(1/h) weights up to arbitrary accuracy, where h is the mesh size of the underlying FE space (Theorem 6). Using this result, we approximate the solution map of (1) up to expected H^1-norm error h by an NN with log(1/h)^2 weights.

• Numerical Results: We conduct an in-depth numerical study of our method and, more generally, UNet-based approaches on common benchmark problems for parametric PDEs in UQ. Strikingly, the tested methods show significant improvements over the state-of-the-art NN architectures for solving parametric PDEs. To reduce computing complexity during training data generation and training of the NN, we shift the bulk of the training data to coarser levels, which allows our method to be scaled to finer grids, ultimately resulting in higher accuracy.

1.1 Related Work


Schwab and Zech (2019); Grohs and Herrmann (2022) derived expressivity results for approximating the solution of parametric PDEs based on a sparse generalized polynomial chaos representation of the solution known from UQ (Schwab and Gittelson, 2011; Cohen and DeVore, 2015) and a central result for approximating polynomials by NNs (Yarotsky, 2017). Kutyniok et al. (2022) translated a reduced basis method into an NN architecture. None of these results show bounds that are independent of the (stochastic) parameter dimension. In contrast, translated to our setting (and simplified), Kutyniok et al. (2022) show an asymptotic upper bound of h^{-2} log(h^{-1})^p + p log(h^{-1})^{3p} weights. Numerical experiments are either not provided or do not support the positive theoretical results (Geist et al., 2021). For a general overview of expressivity results for NNs, we refer the interested reader to (Gühring et al., 2022).
Related to our method and instructive for its derivation is the MLSC method presented
in (Teckentrup et al., 2015) and also (Harbrecht et al., 2016; Ballani et al., 2016). Another
related work is (Lye et al., 2021), where a training process with multilevel structure is
presented. Under certain assumptions on the achieved accuracy of the optimization, an
error analysis is carried out that could in part also be transferred to our architecture.
The connection between differential operators and convolutional layers is well known (see
Chen et al., 2022) and CNNs have already been applied to approximate parametric PDEs for
problems like predicting the airflow around a given geometry in the works (Bhatnagar et al.,
2019; Guo et al., 2016). In (Zhu and Zabaras, 2018) Bayesian CNNs were used to assess
uncertainty in processes governed by stochastic PDEs. (Ummenhofer et al., 2020) extend the
concept to continuous convolutions in mesh-free Lagrangian approaches for solving problems
in fluid dynamics. The concept of a multilevel decomposition has also been applied to CNNs, e.g. to approximate solutions of PDEs, for general generative modeling in (Gao et al., 2018), or based on hierarchical matrix representations in (Fan et al., 2019a,b).
Such hierarchical matrix representations have even been extended to graph CNNs as
multipole kernels to allow for a mesh-free approximation of the solutions to parametric
PDEs (Li et al., 2020). This approach is similar to our method. However, the mathematical
connection to classical multigrid algorithms is not explored. The kernels are based on fast
multipole modeling without a direct link to the FE method.
He and Xu (2019) present a multilevel approach called MgNet, where various parts of
a classical multigrid solver are replaced by NNs resulting in a UNet-like NN and use it for
image classification. Meta-MgNet (Chen et al., 2022) is a meta-learning approach using
MgNet to predict solutions for parametric PDEs and is most closely related to our method.
The predicted solution is successively refined and stochastic parameters are incorporated by
computing encodings of the parametric discretized operator on each level using additional
smaller NNs. The output of these meta-networks then influences the smoothing operation
in the MgNet architecture. However, in contrast to our method, Meta-MgNet still depends
on the assembled operator matrix. Furthermore, while using an iterative approach yields
more accurate approximations, the speedup promised by a single application of a CNN is
mitigated and the authors report runtimes on the same order of magnitude as classical
solvers. Lastly, the authors test their method on problems with uniform diffusivity using
two or three degrees of freedom instead of a continuous diffusivity field. Our approach
differs from the Meta-MgNet in the following ways: (i) We show that a general UNet is
already able to approximate parametric differential operators without the need for a meta-
network approach; (ii) we improve on the training procedure by incorporating a multilevel
optimization scheme; (iii) we provide a theoretical complexity analysis; (iv) we provide
experiments showing high accuracies with a single application of our method for a variety
of random parameter fields.

1.2 Outline
In Section 2, we introduce the parametric Darcy model problem and its FE discretization. In Section 3, we give a brief introduction to multigrid methods. In Section 4,
we describe our DL-based solution approach. Section 5 is concerned with the theoretical
analysis of the expressivity of the presented architecture. Section 6 contains our numerical
study on several parametric PDE problems. We discuss and summarize our findings in
Section 7. Implementation details and our proofs can be found in the appendix.

2 Problem setting
A standard model problem in UQ is the stationary linear diffusion equation (1), where the diffusion coefficient κ : Γ → L^∞(D) is a random field with finite variance determined by a possibly high-dimensional random parameter y ∈ Γ ⊆ R^N, hence also parametrizing the solution. The parameters are images of random vectors distributed according to some product measure π = ⊗_{k≥1} π_1, where π_1 usually is either the uniform (denoted by U) or the standard normal measure (denoted by N) on Γ_1 ⊆ R. For each parameter realization y ∈ Γ, the variational formulation of the problem is given by: for a bounded Lipschitz domain D ⊆ R^d and source term f ∈ V^* = H^{-1}(D), find u ∈ V := H^1_0(D) such that for all test functions w ∈ V,

    ∫_D κ(y, x) ∇u(x) · ∇w(x) dx = ∫_D f(x) w(x) dx.    (2)

We assume that the above equation always exhibits a unique solution. In fact, this holds
true with high probability even for the case of unbounded (lognormal) coefficients, where
uniform ellipticity of the bilinear form is not given. This yields the solution operator v,
mapping a parameter realization y ∈ Γ to its corresponding solution v(y):

    v : Γ → V,    y ↦ v(y).

In this work, we strive to develop an efficient and accurate NN approximation ṽ : Γ → V which minimizes the expected H^1-distance

    ∫_Γ ‖∇(v(y) − ṽ(y))‖^2_{L^2(D)} dπ(y).

With the assumed boundedness (with high probability) of the coefficient κ, this is equivalent
to optimizing with respect to the canonical parameter-dependent norm.
We recall the truncated Karhunen-Loève expansion (KLE) (Knio and Le Maitre, 2006;
Lord et al., 2014) with stochastic parameter dimension p, which is a common parametric
representation of random fields. It is given by
    κ(y, x) = a_0(x) + Σ_{k=1}^p y_k a_k(x).    (3)

Here, the p ∈ N basis functions a_k ∈ L^2(D), 1 ≤ k ≤ p, are determined by assumed statistical properties of the field κ. When Gaussian random fields are considered, they are

completely characterized by their first two moments and the ak are the eigenfunctions of
the covariance integral operator while the parameter vector y is the image of a standard
normal random vector. To become computationally feasible, a sufficient decay of the norms
of these functions is required for the truncation to be reasonable. Note that in the numerical
experiments, we use a well-known “artificial” KLE.
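To make the construction concrete, the following sketch evaluates a truncated KLE of the form (3) on a regular grid of D = [0, 1]^2. The planar Fourier modes, their frequency ordering and the decay rate are illustrative assumptions in the spirit of the "artificial" KLE used in Section 6, not the exact expansion from the experiments.

```python
import numpy as np

def kle_field(y, n=64, a0=1.0, decay=2.0, amp=0.1):
    """Evaluate kappa(y, x) = a0 + sum_k y_k a_k(x) on an n x n grid of D = [0, 1]^2.

    The modes a_k are simple planar Fourier modes with sup-norm amp * k**(-decay);
    this mimics, but does not reproduce exactly, the expansion used in the experiments.
    """
    p = len(y)
    xs = np.linspace(0.0, 1.0, n)
    X1, X2 = np.meshgrid(xs, xs, indexing="ij")
    kappa = np.full((n, n), a0)
    for k in range(1, p + 1):
        # map k to a pair of frequencies (illustrative diagonal counting)
        k1, k2 = k // 2 + 1, (k + 1) // 2
        a_k = amp * k ** (-decay) * np.cos(np.pi * k1 * X1) * np.cos(np.pi * k2 * X2)
        kappa += y[k - 1] * a_k
    return kappa

# example: one realization with p = 100 uniformly distributed parameters
rng = np.random.default_rng(0)
kappa = kle_field(rng.uniform(-1.0, 1.0, size=100))
```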

2.1 Discretization
To solve the infinite dimensional variational formulation (2) numerically, a finite dimensional
subspace has to be defined in which an approximation of the solution is computed. As is
common with PDEs, we use the FE method, which is based on a disjoint decomposition
of the computational domain D into simplices (also called elements or cells). Setting the domain D = [0, 1]^2, we assume a subdivision into congruent triangles such that D is exactly
represented by a uniform square mesh with identical connectivity at every node (see also
Remark 11). The discrete space V_h is then spanned by conforming P1 FE hat functions φ_j : R^2 → R, i.e., V_h = span{φ_j}_{j=1}^{dim V_h} ⊆ V (see e.g., Braess, 2007; Ciarlet, 2002). We write u_h, w_h ∈ V_h for the discretized versions of u, w ∈ V of Equation (2) with FE coefficient vectors u, w ∈ R^{dim V_h} and denote by κ_h the nodal interpolation of κ such that κ_h(y, ·) ∈ V_h with coefficient vector κ_y ∈ R^{dim V_h} for every y ∈ Γ. Hence,

    u_h = Σ_{i=1}^{dim V_h} u_i φ_i,    w_h = Σ_{i=1}^{dim V_h} w_i φ_i,    κ_h(y, ·) = Σ_{i=1}^{dim V_h} (κ_y)_i φ_i.

The discretized version of the parametric Darcy problem (2) reads:


Problem 2.1 (Parametric Darcy problem, discretized) For y ∈ Γ, find u_h ∈ V_h such that for all w_h ∈ V_h

    a_{y,h}(u_h, w_h) = f(w_h),

where a_{y,h}(u_h, w_h) := ∫_D κ_h(y, x) ∇u_h(x) · ∇w_h(x) dx. Using the FE coefficient vectors leads to the system of linear equations

    A_y u = f,    (4)

where f := (f(φ_i))_{i=1}^{dim V_h} = (∫_D f(x) φ_i(x) dx)_{i=1}^{dim V_h} and A_y := (a_{y,h}(φ_j, φ_i))_{i,j=1}^{dim V_h}.

We assume that there always exists a unique solution to this problem and define the discretized parameter-to-solution operator by

    v_h : Γ → V_h,    y ↦ v_h(y),    (5)

where v_h(y) solves Problem 2.1 for any y ∈ Γ. Furthermore, v_h : Γ → R^{dim V_h} maps y to the FE coefficient vector of v_h(y).
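As a point of reference, the following minimal sketch assembles and solves one realization of the linear system (4) with legacy FEniCS (the package used for data generation in Section 6). The coefficient expression below is only a placeholder for the nodal interpolation κ_h(y, ·) of a sampled field, and the sketch is not the production code of the experiments.

```python
from dolfin import (UnitSquareMesh, FunctionSpace, TrialFunction, TestFunction,
                    Function, DirichletBC, Constant, Expression, dot, grad, dx, solve)

n = 64                                  # cells per direction
mesh = UnitSquareMesh(n, n)
V = FunctionSpace(mesh, "P", 1)         # conforming P1 elements

kappa = Expression("1.0 + 0.1*sin(pi*x[0])*sin(pi*x[1])", degree=2)  # placeholder field
u, w = TrialFunction(V), TestFunction(V)
a = kappa * dot(grad(u), grad(w)) * dx  # bilinear form a_{y,h}
rhs = Constant(1.0) * w * dx            # right-hand side f = 1
bc = DirichletBC(V, Constant(0.0), "on_boundary")

u_h = Function(V)
solve(a == rhs, u_h, bc)                # the experiments use a GMRES solver for this step
coeffs = u_h.vector().get_local()       # FE coefficient vector u of (4)
```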

Remark 1 We neglect the influence of the approximation of the diffusivity field κ as κ_h in our analysis since it would only introduce unnecessary technicalities. It is well-known
from FE analysis how this has to be treated theoretically (see e.g., Ciarlet, 2002, Chapter
4). Notably, we always assume that the coefficient is discretized on the finest level of the
multigrid algorithm that we analyze and evaluate in the experiments.

3 A primer on multigrid methods


Our proposed method relies fundamentally on the notion of multigrid solvers for algebraic
equations. These iterative solvers exhibit optimal complexity in the sense that their computational cost scales linearly with the number of degrees of freedom. Their mathematical properties are well
understood and they are commonly used for solving PDEs discretized, in particular, with
FEs and a hierarchy of meshes (see Braess, 2007; Yserentant, 1993; Hackbusch, 2013). This
section provides a brief primer on the multigrid setting.
The starting point is a hierarchy of nested spaces indexed by the level ℓ ∈ {1, . . . , L} with finest level L, which determines the accuracy of the approximation. In our setting, this finest level is chosen in advance and is assumed to yield a sufficient FE accuracy. Similar to the previous section, we define V_ℓ as the FE space of conforming P1 elements on the associated dyadic quasi-uniform triangulations T_ℓ with mesh size h_ℓ ∼ 2^{−ℓ} and degrees of freedom dim V_ℓ ∼ 2^{ℓd}. This yields a sequence of nested FE spaces

    V_1 ⊆ · · · ⊆ V_L ⊆ H^1_0(D).

These spaces are spanned by the FE basis functions on the respective level, i.e., V_ℓ = span{φ_j^{(ℓ)}}_{j=1}^{dim V_ℓ}. A frequency decomposition of the target function u ∈ H^1_0(D) is obtained by approximating low-frequency parts on the coarsest mesh T_1 and subsequently calculating corrections for higher frequencies on finer meshes.
For each level ℓ = 1, . . . , L we define the discretized solution operator v_ℓ : Γ → V_ℓ as in (5). The correction v̂_ℓ with respect to the next coarser level ℓ − 1 for ℓ ∈ {2, . . . , L} is given by

    v̂_ℓ : Γ → V_ℓ,    y ↦ v_ℓ(y) − v_{ℓ−1}(y).    (6)

Additionally, we define v_ℓ and v̂_ℓ as the functions mapping the parameter y to the corresponding coefficient vectors of v_ℓ(y) and v̂_ℓ(y) in V_ℓ. If the solution of the considered PDE is sufficiently smooth, the corrections decay exponentially in the level, namely ‖v̂_ℓ(y)‖_{H^1} ≲ 2^{−ℓ} ‖f‖_* (see Braess, 2007; Yserentant, 1993; Hackbusch, 2013).
Central to the multigrid method is a transfer of discrete functions between adjacent
levels of the hierarchy of spaces. This is achieved in terms of prolongation and restriction
operators, the discrete versions of which are defined in the following.

Definition 2 (Prolongation & weighted restriction matrices) Let L ∈ N and for some ℓ ∈ {1, . . . , L − 1} the spaces V_ℓ and V_{ℓ+1} be defined as above. The prolongation matrix P_ℓ ∈ R^{dim V_{ℓ+1} × dim V_ℓ} is the matrix representation of the canonical embedding of V_ℓ into V_{ℓ+1} under the respective FE basis functions. P_ℓ^T ∈ R^{dim V_ℓ × dim V_{ℓ+1}} defines the weighted restriction.
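On the uniform meshes considered here, prolongation and weighted restriction act on the 2D arrays of nodal values as a (transposed) strided convolution with a fixed stencil; this is the link exploited by the learnable up- and down-sampling layers in Section 4. The sketch below uses the tensor-product (bilinear) stencil purely for illustration; the exact P1 stencil depends on the chosen triangulation.

```python
import torch
import torch.nn.functional as F

# Bilinear interpolation stencil (illustrative; the exact P1 stencil depends on the mesh).
P_kernel = torch.tensor([[1., 2., 1.],
                         [2., 4., 2.],
                         [1., 2., 1.]]) / 4.0
P_kernel = P_kernel.view(1, 1, 3, 3)

def prolong(u_coarse):
    """Map nodal values on an n x n coarse grid to the (2n-1) x (2n-1) fine grid."""
    return F.conv_transpose2d(u_coarse, P_kernel, stride=2, padding=1)

def restrict(r_fine):
    """Weighted restriction P^T: the adjoint of prolongation, a strided convolution."""
    return F.conv2d(r_fine, P_kernel, stride=2, padding=1)

u_c = torch.randn(1, 1, 5, 5)                 # batch x channel x coarse nodal values
u_f = prolong(u_c)                            # -> 1 x 1 x 9 x 9
r_c = restrict(torch.randn(1, 1, 9, 9))       # -> 1 x 1 x 5 x 5
```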

The theoretical analysis of our multilevel NN architecture hinges on the multigrid V-cycle depicted in Algorithm 3.1. It exploits the observation that simple iterative solvers reduce high-frequency components of the residual. In addition to these so-called smoothing iterations, low-frequency corrections are recursively computed on coarser grids.
In fact, if the approximation error by the recursive calls for the coarse grid correction is
sufficiently small, a grid independent contraction rate is achieved (see Braess and Hackbusch,
1983; Braess, 2007).

For our analysis, we use a damped Richardson smoother (Richardson and Glazebrook,
1911), which is rarely used in practice due to subpar convergence rates, but theoretically
accessible because of its simplicity. For a suitable damping parameter ω > 0, the Richardson
iteration for solving Equation (4) is given by

    u^0 := 0,
    u^{n+1} := u^n + ω (f − A_y u^n).    (7)

We denote by MG^m_{k_0,k} : R^{3 × dim V_h} → R^{dim V_h} the function mapping an initial guess, the diffusivity field and the right-hand side to the result of m multigrid V-cycles (Algorithm 3.1) with k_0 smoothing iterations on the coarsest level and k smoothing iterations on the finer levels.
Based on this iteration, we use a fundamental convergence result for the multigrid algorithm shown in (Braess and Hackbusch, 1983, Thm. 4.2) together with the assumption of uniform ellipticity to get the uniform contraction property of the V-cycle multigrid algorithm. For x ∈ R^{dim V_h} and the discretized operator A ∈ R^{dim V_h × dim V_h}, we use the standard notation for the energy norm ‖x‖_A = ‖A^{1/2} x‖_{ℓ^2}.

Theorem 3 Let k ∈ N. There exist ω > 0 and mesh-independent k_0 ∈ N, C > 0, such that the V-cycle iteration with k pre- and post-smoothing iterations, k_0 iterations on the coarsest grid, and Richardson damping parameter ω applied to Problem 2.1 yields a grid-independent contraction of the residual e^i_y := MG^i_{k_0,k}(0, κ_y, f) − v_h(y) at the i-th iteration, namely

    ‖e^{i+1}_y‖_{A_y} ≤ µ(k) ‖e^i_y‖_{A_y},    µ(k) ≤ C / (C + 2k) < 1,

for all y ∈ Γ.

In the next remark, we elaborate on the number of smoothing steps on the coarsest grid k0 .

Remark 4 In the V-cycle algorithm it is often assumed that the equation system on the coarsest level is solved exactly (by a direct solver). However, the proof of (Braess and Hackbusch, 1983, Thm. 4.2) shows that this is not necessary. In fact, a contraction of the residual smaller than C/(C + 2k) on the coarsest level suffices. This shows that k_0 is independent of the fineness of the finest grid. For a rigorous derivation, we refer to (Xu and Zhang, 2022, Theorem 4.4). In our setting, it is possible to use the damped Richardson iteration for the correction on the coarsest grid as well, effectively carrying out additional smoothing steps.
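To make the objects MG^m_{k_0,k} concrete, a minimal sketch of the damped Richardson smoother (7) and the V-cycle of Algorithm 3.1 follows, using NumPy (the operators may be SciPy sparse matrices). The operator hierarchy A_ℓ (e.g., Galerkin coarse operators P^T A P) and the prolongation matrices P_ℓ are assumed to be given; this is an illustration of the scheme analyzed here, not the code used in the experiments.

```python
import numpy as np

def richardson(A, u, f, omega, steps):
    """Damped Richardson smoothing (7): u <- u + omega * (f - A u)."""
    for _ in range(steps):
        u = u + omega * (f - A @ u)
    return u

def v_cycle(A, P, u, f, level, omega, k, k0):
    """One V-cycle (Algorithm 3.1). A[l] is the operator on level l (0-based, coarsest = 0),
    P[l] prolongates from level l to level l + 1 (Definition 2)."""
    if level == 0:
        return richardson(A[0], u, f, omega, k0)        # coarsest grid (cf. Remark 4)
    u = richardson(A[level], u, f, omega, k)            # pre-smoothing
    r = P[level - 1].T @ (f - A[level] @ u)             # restricted residual
    e = v_cycle(A, P, np.zeros_like(r), r, level - 1, omega, k, k0)
    u = u + P[level - 1] @ e                            # coarse-grid correction
    return richardson(A[level], u, f, omega, k)         # post-smoothing

def mg(A, P, f, m, omega, k=3, k0=20):
    """MG^m_{k0,k}(0, ., f): m V-cycles starting from the zero initial guess."""
    u = np.zeros_like(f)
    for _ in range(m):
        u = v_cycle(A, P, u, f, len(A) - 1, omega, k, k0)
    return u
```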

4 Methodology of multilevel neural networks


This section gives an overview of our multilevel NN architecture for the efficient numerical
solution of parametric PDEs. Some of the design choices are tailored specifically to the
stationary diffusion equation (1) but can easily be adapted to other settings, in particular
to different linear and presumably also nonlinear PDEs. Furthermore, in our analysis and
experiments we fix the number of spatial dimensions to two. We note again that these

Algorithm 3.1 V-cycle function

 1: Input: u, f, A, ℓ
 2: for k pre-smoothing steps do
 3:     u ← u + ω(f − Au)              // perform smoothing steps
 4: end for
 5: if ℓ = 1 then
 6:     solve Au = f for u             // coarsest grid step
 7: else
 8:     P ← P_{ℓ−1}
 9:     r ← P^T (f − Au)               // compute restricted residual
10:     A ← P^T A P                    // compute restricted operator
11:     e ← 0
12:     e ← V-cycle(e, r, A, ℓ − 1)    // solve for coarse correction e
13:     u ← u + P e                    // add coarse correction
14: end if
15: for k post-smoothing steps do
16:     u ← u + ω(f − Au)              // perform smoothing steps
17: end for
18: return u

results can be extended to more dimensions in a straightforward way e.g. by using higher-
dimensional convolutions.

The general procedure of the proposed approach can be described as follows. Initially,
a dataset of solutions is generated using classical finite element solvers for randomly drawn
parameters (datapoints), i.e., realizations of coefficient fields leading to respective FE solu-
tions. Then, for each datapoint a multilevel decomposition of the solution is generated as
formalized in (6). A NN is trained to output the coarse grid solution and the subsequent
level corrections from the input parameters, which are given as coefficient vectors of the
coefficient realizations on the respective FE space. The network prediction is obtained by
summing up these contributions. Finally, to test the accuracy of the network, a set of independent FE solutions is generated and error metrics of the NN predictions are computed.

The centerpiece of our methodology is called ML-Net (short for multilevel network). For
a prescribed number of levels L ∈ N, it consists of L jointly trained subnetworks ML-Net_ℓ. Each subnetwork is optimized to map its input – the discretized parameter-dependent diffusion coefficient κ_y and the output of the previous level z^{(ℓ−1)} – to the level-ℓ correction v̂_ℓ(y) as in (6). The ML-Net_ℓ subnetwork architecture is inspired by a cascade of multigrid
V-cycles, which is implemented by a sequence of UNets (Ronneberger et al., 2015). Using
UNets in a sequential manner to correct outputs of previous modules successively is a
well-established strategy in computer vision (see e.g., Genzel et al., 2022). We provide a
theoretical foundation for this design choice in the next section. Since we assume the right-hand side f ∈ H^{-1}(D) to be independent of y, we do not include it as an additional input.

More precisely,

    ML-Net[θ; L] : R^{dim V_L} → ⊗_{ℓ=1}^L R^{dim V_ℓ},    κ ↦ (z^{(ℓ)})_{ℓ=1}^L,

where z^{(ℓ)} ∈ R^{dim V_ℓ} is the output of the level-ℓ subnetwork and θ are the learnable parameters. For the subnetworks, we have

    ML-Net_1[θ_1; L] : R^{dim V_L} → R^{dim V_1},
    ML-Net_ℓ[θ_ℓ; L] : R^{dim V_L} × R^{dim V_{ℓ−1}} → R^{dim V_ℓ},    for ℓ = 2, . . . , L.

The outputs are given by

    z^{(1)} = ML-Net_1[θ_1; L](κ),
    z^{(ℓ)} = ML-Net_ℓ[θ_ℓ; L](κ, z^{(ℓ−1)}),    for ℓ = 2, . . . , L.

Again, θ_ℓ are the learnable parameters and θ = (θ_ℓ)_{ℓ=1}^L.
Each level consists of a sequence of UNets of depth ℓ with input dimension equal to dim V_ℓ (arranged on a 2D grid). By coupling the UNet depth with the level, we ensure that the UNets on each level internally subsample the data to the coarse grid resolution dim V_1 (again arranged on a 2D grid). Consequently, the number of parameters per UNet increases concurrently with the level. To distribute the compute effort (roughly) equally across levels, we decrease the number of UNets per level, which we denote by R_ℓ, where R ∈ N^L. The NN architecture on level ℓ is given by

    ML-Net_ℓ[θ_ℓ; L](κ, z) := [ ◯_{r=1}^{R_ℓ} UNet[θ_{ℓ,r}; ℓ]([↓κ, · ]) ] (↑z),

where ↓ (↑) denotes the down-sampling (up-sampling) operator to the level-ℓ resolution dim V_ℓ. These operators are motivated by the prolongation and restriction operators of Definition 2. However, they are comprised of trainable strided (or transposed strided) convolutions instead of an explicit construction. The parameters θ_ℓ of the level-ℓ subnetwork consist of the parameters of the learnable up- and down-sampling and the UNet sequence (θ_{ℓ,r})_{r=1}^{R_ℓ}.
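The following PyTorch sketch mirrors this construction. It is a simplified illustration and not the exact implementation used here: `unet_cls` stands for any off-the-shelf UNet taking `in_channels`, `out_channels` and `depth` arguments, all channel counts and kernel sizes are placeholder choices, and inputs are assumed to be cell-based arrays whose side length halves exactly from level to level.

```python
import torch
import torch.nn as nn

class MLNetLevel(nn.Module):
    """Level-l stage: down-sample kappa from the finest resolution to level l with learnable
    strided convolutions, up-sample the previous level's output with a learnable transposed
    convolution, and apply a composition of R_l UNets of depth l."""

    def __init__(self, level, num_levels, num_unets, unet_cls):
        super().__init__()
        # factor-2 down-sampling from level L to level l (empty, i.e. identity, if l == L)
        self.down = nn.Sequential(*[
            nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels - level)
        ])
        # factor-2 up-sampling of z^(l-1); unused on the coarsest level
        self.up = (nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1)
                   if level > 1 else None)
        self.unets = nn.ModuleList(
            unet_cls(in_channels=2, out_channels=1, depth=level) for _ in range(num_unets)
        )

    def forward(self, kappa, z_prev=None):
        kappa_l = self.down(kappa)                  # down-sampled coefficient
        # on level 1 there is no previous output; a zero tensor is concatenated instead
        z = self.up(z_prev) if self.up is not None else torch.zeros_like(kappa_l)
        for unet in self.unets:                     # composition of UNet corrections
            z = unet(torch.cat([kappa_l, z], dim=1))
        return z                                    # estimate of the level-l correction

class MLNet(nn.Module):
    """Stack of level stages; returns (z^(1), ..., z^(L))."""

    def __init__(self, num_levels, unets_per_level, unet_cls):
        super().__init__()
        self.stages = nn.ModuleList(
            MLNetLevel(l, num_levels, unets_per_level[l - 1], unet_cls)
            for l in range(1, num_levels + 1)
        )

    def forward(self, kappa):
        outputs, z = [], None
        for stage in self.stages:
            z = stage(kappa, z)
            outputs.append(z)
        return outputs
```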

Our complete approach to tackle a parametric PDE with ML-Net consists of the following multi-step procedure:

Step 1: Sampling and computing the data. We denote by N ∈ N the training dataset size and sample parameter realizations y_1, . . . , y_N according to π. For each parameter realization y_i, 1 ≤ i ≤ N, the grid corrections v̂_ℓ(y_i) on each level 1 ≤ ℓ ≤ L as in (6) and the FE coefficient vector κ_{y_i} ∈ R^{dim V_L} of κ_L(y_i, ·) on the finest level L are computed. Note that the evaluation of the grid corrections requires the PDE solution for each parameter on the finest grid.

Step 2: Training the ML-Net. For each level ℓ ∈ {1, . . . , L}, compute the normalization constants δ_ℓ^2 := (1/N) Σ_{i=1}^N ‖v̂_ℓ(y_i)‖^2 and the H^1 mass matrix M_ℓ ∈ R^{dim V_ℓ × dim V_ℓ} by

    (M_ℓ)_{kj} := ⟨φ_k^{(ℓ)}, φ_j^{(ℓ)}⟩_{H^1(D)},    for k, j ∈ {1, . . . , dim V_ℓ},

Figure 1: Illustration of the structure of the proposed ML-Net. The down-sampling cascade is shown in yellow. In this example, the network is constructed for L = 3 with the green, cyan and magenta parts representing the individual stages on each level. The outputs are shown in orange. Solid arrows represent convolutions and dashed arrows skip connections.

where {φ_1^{(ℓ)}, . . . , φ_{dim V_ℓ}^{(ℓ)}} constitutes the conforming P1 FE basis of V_ℓ. We optimize the parameters θ by approximately solving

    min_θ  Σ_{ℓ=1}^L Σ_{i=1}^N d_{ℓi}^T M_ℓ d_{ℓi},    where  d_{ℓi} := [ML-Net[θ; L](κ_{y_i})]_ℓ − (1/δ_ℓ) v̂_ℓ(y_i),

with mini-batch stochastic gradient descent. Our loss computes the H^1-distance between the subnetwork output, interpreted as a coefficient vector with respect to the FE basis of V_ℓ, and the normalized grid corrections directly on the coefficients for each level ℓ, i.e.,

    d_{ℓi}^T M_ℓ d_{ℓi} = ‖ Σ_{j=1}^{dim V_ℓ} [ML-Net[θ; L](κ_{y_i})]_{ℓj} φ_j^{(ℓ)} − (1/δ_ℓ) v̂_ℓ(y_i) ‖_{H^1(D)}^2.

Normalization via δ_ℓ assures that each level is weighted equally in the loss function.
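A sketch of this loss in PyTorch, assuming the per-level network outputs and target corrections have been flattened to coefficient vectors and the sparse H^1 Gram matrices M_ℓ have been precomputed (e.g., as mass plus stiffness matrix of the level-ℓ basis):

```python
import torch

def multilevel_h1_loss(outputs, corrections, gram, deltas):
    """outputs[l], corrections[l]: (batch, dim V_l) tensors with network outputs and target
    corrections; gram[l]: sparse H^1 Gram matrix M_l; deltas[l]: normalization delta_l."""
    loss = torch.zeros(())
    for z, vhat, M, delta in zip(outputs, corrections, gram, deltas):
        d = z - vhat / delta                       # d_{l,i} for every sample in the batch
        Md = torch.sparse.mm(M, d.t()).t()         # (batch, dim V_l)
        loss = loss + (d * Md).sum()               # sum_i d^T M d
    return loss
```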

Step 3: Inference. Denoting the optimized parameters by θ*, the approximate solution (i.e., the NN prediction) for y ∈ Γ is computed by

    v(y) ≈ Σ_{ℓ=1}^L δ_ℓ Σ_{j=1}^{dim V_ℓ} [ML-Net[θ*; L](κ_y)]_{ℓj} φ_j^{(ℓ)}.

In the following remark, a simpler baseline method is introduced that can be seen as a
special case of ML-Net.

Remark 5 As a comparison and to assess the benefits of the described multilevel approach,
we propose a simpler NN, which directly maps the discretized parameter dependent diffusion
coefficient κ_y to the coefficients on the finest level. For this, we denote by UNetSeq a sequence of UNets, which can be seen as a variation of ML-Net where R_ℓ = 0 for levels ℓ = 1, . . . , L − 1, i.e., only the highest-level sub-module is used. This architecture is a
hybrid approach between single-level and multi-level because of the internal multi-resolution
structure of the UNet blocks. For this architecture, we choose the number of UNets such
that the total parameter count roughly matches ML-Net.

5 Theoretical analysis
In this section, we show that UNet-like NN architectures are able to approximate multigrid V-cycles on V_h up to arbitrary accuracy ε > 0 with the number of parameters independent of ε and growing only logarithmically in 1/h (Theorem 6). We use this result in Corollary 8 to approximate the discrete solution v_h : Γ → R^{dim V_h} of the parametric PDE Problem 2.1 with arbitrary accuracy ε by an NN with the number of parameters bounded from above by log(1/h) log(1/ε), independent of the stochastic parameter dimension p. This is achieved by an NN that emulates a multigrid solver applied to each y individually and, thus, approximately computes the solution v_h(y). Eventually, we derive similar approximation bounds for the ML-Net architecture and point out computational advantages of ML-Net over the pure multigrid approximations of Corollary 8. Note that while only the two-dimensional setting
is considered in this section, the results can be translated to arbitrary spatial dimensions.
The rigorous mathematical analysis of NNs requires an extensive formalism, which can
e.g., be found in (Petersen and Voigtländer, 2018; Gühring et al., 2020; Gühring and Raslan,
2021). We assume that readers interested in the proofs are familiar with common techniques
such as concatenation/parallelisation of NNs and the approximation of polynomials, the
identity, and algebraic operations. To make our presentation more accessible from a “practitioner point of view”, we present a streamlined exposition which still includes sufficient
details to follow the arguments of the proofs. All proofs can be found in Appendix C.
We consider a CNN as any computational graph consisting of (various kinds of) convolutions in combination with standard building blocks from common DL libraries (e.g., skip
connections and pooling layers). Note that this conception also includes the prominent
UNet architecture (Ronneberger et al., 2015) and other UNet-like architectures with the
same recursive structure.
For our theoretical analysis, we consider activation functions ϱ ∈ L^∞_loc(R) for which there exists a point x_0 ∈ R such that ϱ is three times continuously differentiable in a neighborhood of x_0 and ϱ''(x_0) ≠ 0. This includes many standard activation functions such as the exponential linear unit, softsign, swish, and sigmoids. For (leaky) ReLUs a similar but more involved analysis would be possible, which we avoid for the sake of simplicity.
For a CNN Ψ, we denote the number of weights by M(Ψ). Since we consider the FE spaces V_h in the two-dimensional setting on uniformly refined square meshes, FE coefficient vectors x ∈ R^{dim V_h} can be viewed as two-dimensional arrays. Whenever x is processed by a CNN, we implicitly assume a 2D matrix representation. We define ‖x‖_{H^1} := ‖Σ_{i=1}^{dim V_h} x_i φ_i‖_{H^1}. For this section, we set D = [0, 1]^2 and assume uniform ellipticity for Problem 2.1.

We start with the approximation of the solution of the multigrid V-cycle MG^m_{k_0,k} with initialization u_0, diffusion coefficient κ, and right-hand side f by UNet-like CNNs.

Theorem 6 Let V_h ⊆ H^1_0(D) be the P1 FE space on a uniform square mesh. Then there exists a constant C > 0 such that for any M, ε > 0 and m, k, k_0 ∈ N there exists a CNN Ψ : R^{3 × dim V_h} → R^{dim V_h} with

(i) ‖Ψ(u_0, κ, f) − MG^m_{k_0,k}(u_0, κ, f)‖_{H^1(D)} ≤ ε for all κ, u_0, f ∈ [−M, M]^{dim V_h},

(ii) number of weights bounded by M(Ψ) ≤ C (k_0 + k log(1/h)) m.

In the next remark, we provide more details about the architecture of Ψ and its implications.

Remark 7 The proof of Theorem 6 reveals that Ψ resembles the concatenation of multiple
UNet-like subnetworks, each approximating one V-cycle. We draw two conclusions from
that:

• Generally, this supports the heuristic relation between UNets and multigrid methods
with a rigorous mathematical analysis indicating a possible reason for their success
in multiscale-related tasks e.g., found in medical imaging (Liu et al., 2020) or PDE-
related problems (see Cho et al., 2021; Jiang et al., 2021).

• Together with our numerical results in the following section, this underlines the suit-
ability of UNetSeq (see Remark 5) for the solution of Problem 2.1. This is made more
specific in the next corollary.

The following corollary is an easy consequence of Theorem 6 for the parametric PDE Problem 2.1, obtained by applying the multigrid solver to compute the discrete solution (5) for each y ∈ Γ individually. We use that the required number of smoothing iterations in Theorem 3 is grid independent and choose the number of V-cycles as m ≈ log(1/ε). An application of the triangle inequality yields the following result.

Corollary 8 Consider the discretized Problem 2.1 with the conforming P1 FE space V_h ⊆ H^1_0(D). Assume that κ_y is uniformly bounded over all y ∈ Γ. Then there exists a constant C > 0 such that for any ε > 0 there exists a CNN Ψ : R^{2 × dim V_h} → R^{dim V_h} with

(i) ‖Ψ(κ_y, f) − v_h(y)‖_{H^1(D)} ≤ ε ‖f‖_* for all y ∈ Γ,

(ii) number of weights bounded by M(Ψ) ≤ C log(1/h) log(1/ε).

Setting the NN approximation error equal to the FE discretization error (ε = h), we get the bound M(Ψ) ≤ C log(1/h)^2. This shows that the number of parameters grows at most polylogarithmically in 1/h and is independent of the stochastic parameter dimension p.
Finally, we provide complexity estimates for the ML-Net architecture in the next corollary. Here, at each level of ML-Net a multigrid V-cycle (on the respective level) is approximated by using Theorem 6.

Corollary 9 Consider the discretized Problem 2.1. Let the spaces V_1, . . . , V_L be defined as in Section 3 for L ∈ N, f ∈ H^{-1}(D) and its discretization denoted by f ∈ R^{dim V_L}. Assume that the discretized diffusion coefficient κ_y ∈ R^{dim V_L} is uniformly bounded over all y ∈ Γ. Then, there exists a constant C > 0 such that for every ε > 0 there exists an ML-Net Ψ (as in Section 4) with

(i) ‖Ψ(0, κ_y, f) − v_L(y)‖_{H^1(D)} ≤ ε ‖f‖_* for all y ∈ Γ,

(ii) M(Ψ) ≤ C L log(1/ε) + C L^2.
In the following remark, we discuss the relation of the previous two corollaries.

Remark 10 Equilibrating approximation and discretization error, i.e., setting ε = h = 2^{−L}, we make two observations:

• Both Corollary 8 and Corollary 9 yield an upper bound for the number of weights of M(Ψ) ≤ C L^2. We conclude that from our upper bounds no advantage in terms of expressiveness for using ML-Net over UNetSeq is evident. In Section 6.2 we show that our numerical experiments support this finding.

• Despite a similar number of parameters in relation to the approximation accuracy, a forward pass of ML-Net is more efficient than a forward pass of UNetSeq. This is due to the multilevel structure of ML-Net, which shifts the computational load more evenly across resolutions. Note that the number of parameters of a convolutional filter is independent of the input resolution, whereas the computational complexity of convolving the input with the filter is not. In detail, the number of operations for a forward pass of ML-Net is O(2^{2L}) = O(h^{−2}), while for UNetSeq, we have O(L 2^{2L}) = O(log(h^{−1}) h^{−2}) computations. This is one advantage of ML-Net that results in shorter training times (see Section 6).

6 Numerical results
This section is concerned with the evaluation of ML-Net and the simpler UNetSeq on several
common benchmark problems in terms of the mean relative H^1-distance of the predicted
solution. Moreover, the performance of ML-Net is tested with a declining number of training
samples for successive grid levels. This is motivated by multilevel theory and leads to a
beneficial training complexity (see Remark 10 for a more detailed evaluation).
We assess the performance of our methods for coefficient functions a_k of varying computational complexity, which compose the diffusivity field κ via (3), and for different parameter distributions π. For all test cases, we choose the unit square domain D = [0, 1]^2, a constant a_0(x) := a_0 ∈ R and the right-hand side f ≡ 1. The test problems are defined as follows.
1. Uniform smooth field. Let Γ = [−1, 1]^p and π = U[−1, 1]^p. Moreover, let a_0 := 1 and choose a_k as planar Fourier modes as described in (Eigel et al., 2019b) and (Geist et al., 2021) with decay ‖a_k‖_∞ = 0.1 k^{−2} for 1 ≤ k ≤ p.
2. Log-normal smooth field. Let Γ = R^p and π = N(0, Id_p). Define κ(y, x) := exp(κ̃(y, x)), where κ̃ equals κ from the uniform case with a_0 := 0.

Figure 2: Example realizations of κ(·, y) for the four test cases. From left to right:
uniform case (p = 100), log-normal case (p = 100), cookie problem (p = 16), cookie
problem with variable radii (p = 32). The colormaps are normalized for visibility.

3. Cookie problem with fixed radii inclusions. Let Γ = [−1, 1]^p, with p having a natural square root, and π = U[−1, 1]^p. Moreover, define a_0 := 0.1 and a_k(x) = χ_{D_k}(x), where χ_{D_k} denotes the indicator function of the disk D_k and the D_k are disks centered in a cubic lattice with a fixed radius r = 0.3 p^{−1/2} for 1 ≤ k ≤ p.

4. Cookie problem with variable radii inclusions. As an extension of the cookie problem, we also allow the disk radii to vary. To this end, let p′ be the (quadratic) number of cookies, p = 2p′, Γ = [−1, 1]^p and π = U[−1, 1]^p. The diffusion coefficient is defined by

    κ(y, x) := a_0 + Σ_{k=1}^{p′} y_{2k−1} χ_{D_{k, r(y_{2k})}}(x),

where the D_{k,r} are disks centered in a cubic lattice with parameter-dependent radius r. We choose r(y) for y ∈ [−1, 1] such that the radii are uniformly distributed in [0.5 (p′)^{−1/2}, 0.9 (p′)^{−1/2}].
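To illustrate the last test case, the sketch below evaluates one realization of the variable-radii cookie coefficient on a regular grid. The placement of the disk centers and the affine map y ↦ r(y) (which yields radii uniformly distributed in [0.5, 0.9]·(p′)^{−1/2}) are illustrative assumptions, not necessarily the exact choices made in the experiments.

```python
import numpy as np

def cookie_variable_field(y, n=64, a0=0.1):
    """One realization of the variable-radii cookie coefficient on an n x n grid.
    y has length p = 2 p' with p' a square number; odd entries (1-based) set the amplitude
    of each inclusion, even entries its radius."""
    p_prime = len(y) // 2
    m = int(round(np.sqrt(p_prime)))             # m x m lattice of disks
    xs = np.linspace(0.0, 1.0, n)
    X1, X2 = np.meshgrid(xs, xs, indexing="ij")
    kappa = np.full((n, n), a0)
    centers = (np.arange(m) + 0.5) / m           # illustrative lattice of disk centers
    for k in range(p_prime):
        c1, c2 = centers[k // m], centers[k % m]
        radius = (0.7 + 0.2 * y[2 * k + 1]) / np.sqrt(p_prime)   # maps [-1,1] to [0.5,0.9]/sqrt(p')
        disk = (X1 - c1) ** 2 + (X2 - c2) ** 2 <= radius ** 2
        kappa += y[2 * k] * disk
    return kappa

kappa = cookie_variable_field(np.random.default_rng(1).uniform(-1, 1, size=32))
```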

Figure 2 depicts a random realization for each of the 4 problem cases. Note that not all of
the above problems satisfy the theoretical assumptions for the results in Section 5. Notably,
the log-normal case is not uniformly elliptic and our theoretical guarantees thus only hold
with (arbitrarily) high probability as indicated above. Moreover, in case of the cookie
problem with variable radii, the diffusion coefficient depends non-linearly on the parameter
vector since also the basis functions ak are parameter dependent. We include these cases to
study our methodology with diverse and particularly challenging setups.
We use the open source package FEniCS (Logg and Wells, 2010) with the GMRES (Saad
and Schultz, 1986) solver for carrying out the FE simulations to generate training data and
PyTorch (Paszke et al., 2019) for the NN models.
If not stated otherwise, we use a training dataset with 10^4 samples and 1024 samples for independent validation computed from i.i.d. parameter vectors. For testing, the performance of the methods is evaluated on 1024 independently generated test samples.
For the multilevel decomposition, we consider L = 7 resolution levels starting with a
coarse resolution of 5 × 5 cells. Iterative uniform mesh refinement with factor η = 2 results
in 320 × 320 cells on the finest level. To reduce computational complexity, we only compute

solutions on the finest level and derive the coarser solutions as well as the corrections
in (6) using nodal interpolation. Additionally, we compute reference test solutions on a
high-fidelity mesh with 1280 × 1280 cells, resulting in about 1.6 × 10^6 degrees of freedom.
For UNetSeq, we choose the number of UNets such that the total parameter count roughly matches the 5.6 × 10^6 parameters of ML-Net. For training, we use the Adam (Kingma
and Ba, 2015) optimizer and train for 200 epochs with learning rate decay. After each run,
we evaluate the model with the best validation performance. Training took approximately
33 GPU-hours on an NVIDIA Tesla P100 for ML-Net and 46 GPU-hours for UNetSeq (see
also Remark 10).

6.1 Error metrics


We examine the mean relative H^1 error (MRH^1) and mean relative L^2 error (MRL^2) with respect to the solutions on the finest resolution level L as well as the high-fidelity reference solutions. For parameters y_1, . . . , y_N ∈ Γ, predictions u_1, . . . , u_N ∈ V_L, and the solution operators v_L : Γ → V_L and v_ref : Γ → H^1(D) mapping to the discrete space on the finest grid and the reference grid, respectively, we define

    E_MR* := sqrt( Σ_{i=1}^N ‖u_i − v_L(y_i)‖_*^2 / Σ_{i=1}^N ‖v_L(y_i)‖_*^2 )    and
    E^ref_MR* := sqrt( Σ_{i=1}^N ‖u_i − v_ref(y_i)‖_*^2 / Σ_{i=1}^N ‖v_ref(y_i)‖_*^2 ),

with * ∈ {H^1, L^2}.
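Given the coefficient vectors of predictions and reference solutions on a common grid (for E^ref_MR* the predictions are first prolongated to the reference grid), these quantities reduce to weighted vector norms. A small sketch, with G the Gram matrix of the FE basis in the chosen norm (assumed dense here for simplicity, e.g. mass plus stiffness matrix for the H^1 norm):

```python
import numpy as np

def mean_relative_error(U_pred, U_ref, G):
    """E_MR* for coefficient vectors: rows of U_pred / U_ref are predictions u_i and
    reference solutions v(y_i); G defines the squared norm ||u||_*^2 = u^T G u."""
    diff = U_pred - U_ref
    num = np.einsum("ij,jk,ik->", diff, G, diff)     # sum_i ||u_i - v(y_i)||_*^2
    den = np.einsum("ij,jk,ik->", U_ref, G, U_ref)   # sum_i ||v(y_i)||_*^2
    return np.sqrt(num / den)
```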

6.2 Results for the test cases


Tables 1 and 2 depict E_MRH^1 and E^ref_MRH^1 for ML-Net and UNetSeq for the numerical test cases described above with varying stochastic parameter dimensions. E_MRL^2 and E^ref_MRL^2 are shown in Tables 4 and 5 in Appendix B. A random selection of solutions and NN predictions is also shown in Figure 3.
is also shown in Figure 3.
In comparison to previously reported performances of DL-based approaches for these
benchmark problems in (Geist et al., 2021; Li et al., 2021a; Lu et al., 2021; Grossmann et al.,
2023), we observe an improvement of one to two orders of magnitude with our methodology.
Geist et al. (2021) observed a strong dependence of the performance of their DL-based
method on the stochastic parameter dimension. In contrast, both ML-Net and UNetSeq
perform very consistently with increasing parameter dimensions although the problems
become more involved. This is in line with the parameter independent bounds for CNNs
derived in Section 5.
E_MRH^1 is generally significantly lower than E^ref_MRH^1. This illustrates that for these cases the NN approximation error on the finest grid L = 7 can be several orders of magnitude lower than the FE approximation error. Conversely, the discrepancy between E_MRL^2 and E^ref_MRL^2 seen in Tables 4 and 5 is far less pronounced.


Comparing ML-Net to UNetSeq, we observe that both models generally exhibit a comparable performance. Lower variances in model accuracy for ML-Net indicate an improved
training stability when using a suitable multilevel decomposition.

Figure 3: Realizations for different benchmark problems (top to bottom: uniform p = 100, log-normal p = 100, cookie fixed). Left-hand side columns show field realizations and respective solutions. Right-hand side pictures the error of ML-Net and UNetSeq predictions w.r.t. exact target solutions.

problem            param. dim. p    E_MRH^1 (ML-Net)        E_MRH^1 (UNetSeq)
uniform                       10    2.24e−4 ± 9.62e−5       9.75e−5 ± 2.10e−5
uniform                       50    2.21e−4 ± 2.96e−5       2.08e−3 ± 2.79e−3
uniform                      100    2.27e−4 ± 2.72e−5       8.89e−5 ± 1.00e−5
uniform                      200    2.31e−4 ± 3.50e−5       2.05e−3 ± 2.76e−3
log-normal                    10    2.15e−4 ± 6.18e−5       2.83e−3 ± 3.70e−3
log-normal                    50    9.73e−4 ± 1.04e−3       3.71e−3 ± 4.91e−3
log-normal                   100    2.64e−4 ± 1.73e−5       2.82e−3 ± 4.31e−3
log-normal                   200    3.00e−4 ± 1.43e−5       1.97e−4 ± 4.22e−5
cookie fixed                  16    9.41e−4 ± 1.12e−4       1.10e−3 ± 3.76e−4
cookie fixed                  64    1.85e−3 ± 1.38e−4       7.20e−4 ± 1.43e−4
cookie variable               32    4.69e−3 ± 1.41e−3       3.69e−3 ± 4.53e−4
cookie variable              128    5.98e−3 ± 1.13e−4       3.08e−3 ± 5.07e−4

Table 1: E_MRH^1 for ML-Net and UNetSeq evaluated on all test cases.

problem            param. dim. p    E^ref_MRH^1 (ML-Net)    E^ref_MRH^1 (UNetSeq)
uniform                       10    5.33e−3 ± 4.50e−6       5.33e−3 ± 7.01e−7
uniform                       50    5.33e−3 ± 1.15e−6       6.24e−3 ± 1.28e−3
uniform                      100    5.33e−3 ± 1.63e−6       5.33e−3 ± 8.55e−7
uniform                      200    5.33e−3 ± 1.46e−6       6.22e−3 ± 1.25e−3
log-normal                    10    5.33e−3 ± 2.93e−6       6.78e−3 ± 2.04e−3
log-normal                    50    5.51e−3 ± 2.49e−4       7.53e−3 ± 3.10e−3
log-normal                   100    5.33e−3 ± 5.54e−7       6.90e−3 ± 2.71e−3
log-normal                   200    5.34e−3 ± 3.03e−6       5.33e−3 ± 1.69e−6
cookie fixed                  16    7.09e−2 ± 1.87e−5       7.09e−2 ± 7.39e−6
cookie fixed                  64    9.73e−2 ± 1.16e−5       9.73e−2 ± 5.62e−6
cookie variable               32    7.83e−2 ± 3.22e−4       7.81e−2 ± 2.39e−4
cookie variable              128    1.12e−1 ± 2.58e−4       1.12e−1 ± 2.76e−5

Table 2: E^ref_MRH^1 for ML-Net and UNetSeq evaluated on all test cases.

6.3 Training with successively fewer samples on finer levels


A striking advantage of stochastic multilevel methods is that the majority of sample points can be computed on coarser levels with low effort, while only a few samples are needed on the finest grid (Teckentrup et al., 2015; Lye et al., 2021; Harbrecht et al., 2016; Ballani et al., 2016). We transfer this concept to ML-Net by exponentially decreasing the number of training samples from level to level. To this end, we train each level of our ML-Net only with a fraction of the dataset. More concretely, we divide the number of samples in each subsequent level by two, i.e., for level ℓ = 1, . . . , L, we use N_ℓ := 2^{1−ℓ} × 10^4 training samples. We observe that alternating between low-level samples and samples for which all corrections are known during training is needed for stable optimization. The computational budget of generating a multiscale dataset with 1000 samples on the coarsest level and an exponentially decaying number of samples on subsequent levels corresponds to 232 full resolution samples.
Table 3 depicts the errors for two test cases (uniform and log-normal with p = 100)
for ML-Net trained on the decayed multiscale dataset and UNetSeq trained on the full
resolution dataset generated with a comparable compute budget (232 samples). We make
two observations: First, training ML-Net with a decaying number of samples per level
hardly decreases its performance when compared to the full dataset of 1000 samples from
Table 4. Second, UNetSeq trained on a full-resolution dataset of comparable compute budget shows significantly reduced performance compared to training UNetSeq on 1000 samples (Table 4) and compared to ML-Net on the decayed dataset (Table 3).
The MRL^2 errors depicted in Table 3 illustrate the low error that can be achieved, which is at least one order of magnitude lower than what is reported in other papers, see also Tables 4 and 5 in Appendix B. Note that the observed MRH^1 error in our experiments is bounded from below by the FE approximation, i.e. the resolution of the finest mesh. For
the multilevel advantage to fully take effect, training on finer grids would be required to
decrease the FE approximation errors to be comparable to the smaller NN approximation
errors. Hence, using fewer high-fidelity training samples is a crucial step to train models which achieve an overall MRH^1 accuracy comparable to traditional FE solvers.
We conclude that ML-Net can effectively be trained with fewer samples for finer levels,
significantly reducing the overall training and data generation costs and, thus, enabling the
application of our model with much finer grids.

7 Conclusion
In this work, we combine concepts from established multilevel algorithms with NNs for
efficiently solving challenging high-dimensional parametric PDEs. We provide a theoretical complexity analysis revealing that UNet-like architectures can approximate classical multigrid algorithms arbitrarily well. Moreover, we show that the ML-Net architecture presents an advantageous approach for learning the parameter-to-solution operator of the parametric Darcy problem. The performance of our method is illustrated by several numerical benchmark experiments showing a significant improvement over the state of the art of previous DL-based approaches for parametric PDEs. In fact, the shown approximation quality matches the best-in-class techniques such as low-rank least squares methods, stochastic
Galerkin approaches, compressed sensing and stochastic collocation. Additionally, we show
that the multilevel architecture allows the use of fewer data points for finer corrections during

method     error          dataset     uniform                 log-normal
ML-Net     E_MRH^1        decaying    2.86e−4 ± 2.24e−5       4.05e−4 ± 2.48e−5
UNetSeq    E_MRH^1        fixed       1.14e−3 ± 6.47e−4       5.92e−3 ± 4.66e−3
ML-Net     E^ref_MRH^1    decaying    5.34e−3 ± 1.22e−6       5.34e−3 ± 1.61e−6
UNetSeq    E^ref_MRH^1    fixed       5.49e−3 ± 1.59e−4       8.53e−3 ± 3.53e−3
ML-Net     E_MRL^2        decaying    5.75e−5 ± 4.43e−6       9.39e−5 ± 5.42e−6
UNetSeq    E_MRL^2        fixed       5.38e−4 ± 3.93e−4       3.76e−3 ± 3.23e−3
ML-Net     E^ref_MRL^2    decaying    6.65e−5 ± 2.16e−6       9.91e−5 ± 4.59e−6
UNetSeq    E^ref_MRL^2    fixed       5.40e−4 ± 3.91e−4       3.76e−3 ± 3.23e−3

Table 3: All errors for ML-Net and UNetSeq trained with a reduced dataset size on two problem cases with parameter dimension p = 100. ML-Net is trained with the number of training samples halved for each level and UNetSeq is trained using 232 samples in total.

training with negligible impact on practical performance. This enables the extension to
much finer resolutions without a prohibitive increase of computational complexity. Finally,
we note that while we consider a scalar linear elliptic equation in this work, our methodology can be used for a variety of possibly nonlinear and vector-valued problems. Future
research directions could include the analysis of such problems as well as the extension of
our method to adaptive grids.

Acknowledgments and Disclosure of Funding

The authors would like to thank Reinhold Schneider for fruitful discussions on the topic. I.
Gühring acknowledges support from the Research Training Group “Differential Equation-
and Data-driven Models in Life Sciences and Fluid Dynamics: An Interdisciplinary Research
Training Group (DAEDALUS)” (GRK 2433) funded by the German Research Foundation (DFG) and a post-doctoral scholarship from Technical University of Berlin for “Deep
Learning for (Parametric) PDEs from a Theoretical and Practical Perspective”. M. Eigel
acknowledges partial support from DFG SPP 1886 “Polymorphic uncertainty modelling
for the numerical design of structures” and DFG SPP 2298 “Theoretical foundations of
Deep Learning”. We acknowledge the computing facilities and thank the IT support of the
Weierstrass Institute.

References
Ben Adcock and Nick Dexter. The gap between theory and practice in function approximation with deep neural networks. SIAM Journal on Mathematics of Data Science, 3(2):624–655, 2021.

Anima Anandkumar, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Nikola Kovachki,


Zongyi Li, Burigede Liu, and Andrew Stuart. Neural operator: Graph kernel network
for partial differential equations. In ICLR 2020 Workshop on Integration of Deep Neural
Models and Differential Equations, 2020.
Ivo Babuska, Fabio Nobile, and Raul Tempone. A stochastic collocation method for elliptic partial differential equations with random input data. SIAM Journal on Numerical Analysis, 45:1005–1034, 2007. doi: 10.1137/050645142.
Pierre Baldi. Deep learning in biomedical data science. Annual Review of Biomedical Data
Science, 1(1):181–205, 2018.
Jonas Ballani, Daniel Kressner, and Michael Peters. Multilevel tensor approximation of
pdes with random data. Stochastics and Partial Differential Equations: Analysis and
Computations, 5:400–427, 2016.
Christian Beck, Martin Hutzenthaler, Arnulf Jentzen, and Benno Kuckuck. An overview on
deep learning-based approximation methods for partial differential equations. Discrete
and Continuous Dynamical Systems - B, 28(6):3697–3746, 2023. ISSN 1531-3492. doi:
10.3934/dcdsb.2022238.
Julius Berner, Markus Dablander, and Philipp Grohs. Numerically solving parametric
families of high-dimensional kolmogorov partial differential equations via deep learning.
Advances in Neural Information Processing Systems, 33:16615–16627, 2020a.
Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020b.
Saakaar Bhatnagar, Yaser Afshar, Shaowu Pan, Karthik Duraisamy, and Shailendra Kaushik. Prediction of aerodynamic flow fields using convolutional neural networks. Computational Mechanics, 64(2):525–545, 2019. ISSN 1432-0924. doi: 10.1007/s00466-019-01740-0.
Dietrich Braess. Finite elements: Theory, fast solvers, and applications in solid mechanics.
Cambridge University Press, 2007.
Dietrich Braess and Wolfgang Hackbusch. A new convergence proof for the multigrid method including the V-cycle. SIAM Journal on Numerical Analysis, 20:967–975, 1983. doi: 10.1137/0720066.
Yuyan Chen, Bin Dong, and Jinchao Xu. Meta-MgNet: Meta multigrid networks for solving parameterized partial differential equations. Journal of Computational Physics, page 110996, 2022.
Abdellah Chkifa, Albert Cohen, Giovanni Migliorati, Fabio Nobile, and Raul Tempone.
Discrete least squares polynomial approximation with random evaluations- application to
parametric and stochastic elliptic pdes. ESAIM: Mathematical Modelling and Numerical
Analysis-Modélisation Mathématique et Analyse Numérique, 49(3):815–837, 2015.

Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko. Rethinking coarse-to-fine approach in single image deblurring. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4621–4630, 2021. doi: 10.1109/ICCV48922.2021.00460.

Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. Society for Industrial
and Applied Mathematics, 2002. doi: 10.1137/1.9780898719208.

Albert Cohen and Ronald DeVore. Approximation of high-dimensional parametric PDEs. Acta Numerica, 24:1–159, 2015.

Albert Cohen and Giovanni Migliorati. Near-optimal approximation methods for elliptic
pdes with lognormal coefficients. Mathematics of Computation, 2023.

Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017. ISSN 2194-671X. doi: 10.1007/s40304-017-0117-6.

Martin Eigel, Claude Jeffrey Gittelson, Christoph Schwab, and Elmar Zander. Adaptive stochastic Galerkin FEM. Computer Methods in Applied Mechanics and Engineering, 270:247–269, March 2014. doi: 10.1016/j.cma.2013.11.015.

Martin Eigel, Reinhold Schneider, Philipp Trunschke, and Sebastian Wolf. Variational Monte Carlo—bridging concepts of machine learning and high-dimensional partial differential equations. Advances in Computational Mathematics, 45(5-6):2503–2532, October 2019a. doi: 10.1007/s10444-019-09723-8.

Martin Eigel, Reinhold Schneider, Philipp Trunschke, and Sebastian Wolf. Variational Monte Carlo—bridging concepts of machine learning and high-dimensional partial differential equations. Advances in Computational Mathematics, 45(5):2503–2532, 2019b.

Martin Eigel, Manuel Marschall, Max Pfeffer, and Reinhold Schneider. Adaptive stochastic Galerkin FEM for lognormal coefficients in hierarchical tensor representations. Numerische Mathematik, 145(3):655–692, 6 2020. doi: 10.1007/s00211-020-01123-1.

Oliver G Ernst, Bjorn Sprungk, and Lorenzo Tamellini. Convergence of sparse collocation for functions of countably many Gaussian random variables (with application to elliptic PDEs). SIAM Journal on Numerical Analysis, 56(2):877–905, 2018.

Yuwei Fan, Jordi Feliu-Faba, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based on hierarchical nested bases. Research in the Mathematical Sciences, 6(2):1–28, 2019a.

Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213, 2019b.


Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9155–9164, 2018.
Moritz Geist, Philipp Petersen, Mones Raslan, Reinhold Schneider, and Gitta Kutyniok.
Numerical solution of the parametric diffusion equation by deep neural networks. Journal
of Scientific Computing, 88(1):1–37, 2021.
Martin Genzel, Ingo Gühring, Jan MacDonald, and Maximilian März. Near-exact recovery
for tomographic inverse problems via deep learning. In Kamalika Chaudhuri, Stefanie
Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International
Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland,
USA, volume 162 of Proceedings of Machine Learning Research, pages 7368–7381. PMLR,
2022.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, Cambridge, 2016.

Philipp Grohs and Lukas Herrmann. Deep neural network approximation for high-dimensional elliptic PDEs with boundary conditions. IMA Journal of Numerical Analysis, 42(3):2055–2082, 2022.

Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe Von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.
Tamara G Grossmann, Urszula Julia Komorowska, Jonas Latz, and Carola-Bibiane Schönlieb. Can physics-informed neural networks beat the finite element method? arXiv preprint arXiv:2302.04107, 2023.
Ingo Gühring and Mones Raslan. Approximation rates for neural networks with encodable
weights in smoothness spaces. Neural Networks, 134:107–130, 2021.
Xiaoxiao Guo, Wei Li, and Francesco Iorio. Convolutional neural networks for steady flow
approximation. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’16, page 481–490, New York, NY, USA,
2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.
2939738.
Ingo Gühring, Gitta Kutyniok, and Philipp Petersen. Error bounds for approximations with deep ReLU neural networks in W^{s,p} norms. Analysis and Applications (Singap.), 18(05):803–859, 2020.
Ingo Gühring, Mones Raslan, and Gitta Kutyniok. Expressivity of deep neural networks.
In Philipp Grohs and Gitta Kutyniok, editors, Mathematical Aspects of Deep Learning,
page 149–199. Cambridge University Press, 2022.
Wolfgang Hackbusch. Multi-grid methods and applications, volume 4. Springer Science &
Business Media, 2013.


Jiequn Han, Arnulf Jentzen, et al. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.

Helmut Harbrecht, Michael Peters, and Markus Siebenmorgen. Multilevel accelerated quadrature for PDEs with log-normally distributed diffusion coefficient. SIAM/ASA Journal on Uncertainty Quantification, 4(1):520–551, 2016.

Juncai He and Jinchao Xu. MgNet: A unified framework of multigrid and convolutional neural network. Science China Mathematics, 62(7):1331–1354, 2019. ISSN 1869-1862. doi: 10.1007/s11425-019-9547-2.

J.S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear problems using neural networks. Journal of Computational Physics, 363:55–78, 2018. ISSN 0021-9991. doi: 10.1016/j.jcp.2018.02.037.

Zhihao Jiang, Pejman Tahmasebi, and Zhiqiang Mao. Deep residual U-net convolution neural networks with autoregressive strategy for fluid flow predictions in large-scale geosystems. Advances in Water Resources, 150:103878, 02 2021. doi: 10.1016/j.advwatres.2021.103878.

John A Keith, Valentin Vassilev-Galindo, Bingqing Cheng, Stefan Chmiela, Michael Gastegger, Klaus-Robert Müller, and Alexandre Tkatchenko. Combining machine learning and computational chemistry for predictive insights into chemical systems. Chemical Reviews, 121(16):9816–9872, 2021.

Ehsan Kharazmi, Zhongqiang Zhang, and George Em Karniadakis. Variational physics-informed neural networks for solving partial differential equations. arXiv preprint arXiv:1912.00873, 2019.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Omar M Knio and OP Le Maitre. Uncertainty propagation in CFD using polynomial chaos decomposition. Fluid Dynamics Research, 38(9):616, 2006.

Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of deep neural networks and parametric PDEs. Constructive Approximation, 55(1):73–125, 2022.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Andrew Stuart,
Kaushik Bhattacharya, and Anima Anandkumar. Multipole graph neural operator for
parametric partial differential equations. Advances in Neural Information Processing
Systems, 33:6755–6766, 2020.

Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021a.

Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021b.

Liangliang Liu, Jianhong Cheng, Quan Quan, Fang-Xiang Wu, Yu-Ping Wang, and Jianxin
Wang. A survey on u-shaped networks in medical image segmentations. Neurocomputing,
409:244–258, 2020. ISSN 0925-2312.

Anders Logg and Garth N. Wells. Dolfin. ACM Transactions on Mathematical Software,
37(2):1–28, 2010. ISSN 1557-7295. doi: 10.1145/1731022.1731030.

Gabriel J Lord, Catherine E Powell, and Tony Shardlow. An introduction to computational stochastic PDEs, volume 50. Cambridge University Press, 2014.

Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. DeepXDE: A deep learning library for solving differential equations. SIAM Review, 63(1):208–228, 2021.

Kjetil O Lye, Siddhartha Mishra, and Roberto Molinaro. A multi-level procedure for enhancing accuracy of machine learning algorithms. European Journal of Applied Mathematics, 32(3):436–469, 2021.

Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics-
informed neural networks for approximating PDEs. IMA Journal of Numerical Analysis,
43(1):1–43, 01 2022. ISSN 0272-4979. doi: 10.1093/imanum/drab093.

Fabio Nobile, Raúl Tempone, and Clayton G Webster. A sparse grid stochastic collocation
method for partial differential equations with random input data. SIAM Journal on
Numerical Analysis, 46(5):2309–2345, 2008.

Frank Noé, Alexandre Tkatchenko, Klaus-Robert Müller, and Cecilia Clementi. Machine
learning for molecular simulation. Annual Review of Physical Chemistry, 71(1):361–390,
2020.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

Philipp Petersen and Felix Voigtländer. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.


Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. ISSN 0021-9991. doi: 10.1016/j.jcp.2018.10.045.

Amuthan A Ramabathiran and Prabhu Ramachandran. SPINN: Sparse, physics-based, and partially interpretable neural networks for PDEs. Journal of Computational Physics, 445:110600, 2021.

Lewis Fry Richardson and Richard Tetley Glazebrook. IX. The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 210(459-470):307–357, 1911. doi: 10.1098/rsta.1911.0009.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for
biomedical image segmentation. In International Conference on Medical image computing
and computer-assisted intervention, pages 234–241. Springer, 2015.
Youcef Saad and Martin H Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific Computing, 7(3):856–869, 1986. ISSN 0196-5204.
Peter Sadowski and Pierre Baldi. Deep learning in the natural sciences: Applications to physics. In Lev Rozonoer, Boris Mirkin, and Ilya Muchnik, editors, Braverman Readings in Machine Learning. Key Ideas from Inception to Current State: International Conference Commemorating the 40th Anniversary of Emmanuil Braverman's Decease, Boston, MA, USA, April 28-30, 2017, Invited Talks, pages 269–297. Springer International Publishing, Cham, 2018.

Christoph Schwab and Claude Jeffrey Gittelson. Sparse tensor discretizations of high-dimensional parametric and stochastic PDEs. Acta Numerica, 20:291–467, 2011.

Christoph Schwab and Jakob Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55, 2019.
W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883, Los Alamitos, CA, USA, June 2016. IEEE Computer Society. doi: 10.1109/CVPR.2016.207.

Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018. ISSN 0021-9991. doi: 10.1016/j.jcp.2018.08.029.
A. L. Teckentrup, P. Jantsch, C. G. Webster, and M. Gunzburger. A multilevel stochastic
collocation method for partial differential equations with random input data. SIAM/ASA
Journal on Uncertainty Quantification, 3(1):1046–1074, 2015. doi: 10.1137/140969002.


Benjamin Ummenhofer, Lukas Prantl, Nils Thuerey, and Vladlen Koltun. Lagrangian fluid simulation with continuous convolutions. In International Conference on Learning Representations, 2020.

Xuefeng Xu and Chen-Song Zhang. Convergence analysis of inexact two-grid methods: A theoretical framework. SIAM Journal on Numerical Analysis, 60(1):133–156, 2022.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

Harry Yserentant. Old and new convergence proofs for multigrid methods. Acta Numerica, 2:285–326, 1993.

Bing Yu et al. The Deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.

Yinhao Zhu and Nicholas Zabaras. Bayesian deep convolutional encoder–decoder networks
for surrogate modeling and uncertainty quantification. Journal of Computational Physics,
366:415–447, 2018. ISSN 0021-9991. doi: 10.1016/j.jcp.2018.04.018.


Appendix A. Exact ML-Net architecture and training details

In our experiments, we construct ML-Net and UNetSeq with L = 7 levels. Convolutional layers use 3×3 kernels with zero padding. Upsampling (↑) and downsampling (↓) layers are implemented by 5×5 transpose-strided and strided convolutions with stride 2, respectively. We do not employ any normalization layers and use the ReLU activation function throughout all architectures. Apart from up- and downsampling layers, UNet subunits of both ML-Net and UNetSeq contain two convolutional layers on each scale. Skip-connections are realized by concatenating the output of preceding layers.

For ML-Net, we choose the number of UNets per level R_ℓ, ℓ = 1, …, L, such that each subnetwork ML-Net_ℓ has approximately 8×10^5 trainable parameters. This number was empirically found to produce satisfactory results and yields R = [11, 9, 5, 4, 3, 2, 2]. To downsample the diffusivity field κ to the input resolution required by the subnetwork ML-Net_ℓ, L − ℓ + 1 strided convolutions followed by ReLU activations are used. We use 32 channels for convolutional layers of ML-Net, except for the coarsest level where 64 channels are used. For all levels ℓ = 2, …, L except the coarsest one, the ML-Net_ℓ subnetworks output the predictions as four channels at half the resolution and use pixel un-shuffle layers (Shi et al., 2016) to assemble the full predictions. The reason for this is that fine grid corrections for a dyadic subdivision yield a higher similarity between every second value than neighbouring values in the array. For UNetSeq, 64 channels are used for all convolutional layers.
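
A minimal PyTorch sketch of such a diffusivity input stem is given below. Only the number of downsampling steps and the ReLU activations are taken from the description above; the 3×3 kernel size of this stem, the padding choice and the module name KappaStem are our own assumptions.

import torch.nn as nn

class KappaStem(nn.Module):
    # Sketch: maps the one-channel fine-grid diffusivity field to the resolution
    # required by subnetwork ML-Net_l via (L - l + 1) strided convolutions + ReLU.
    def __init__(self, num_downsamples, channels=32):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(num_downsamples):
            # stride-2 convolution without padding halves 2^j - 1 interior grid
            # points to 2^(j-1) - 1 (assumed kernel size)
            layers += [nn.Conv2d(in_ch, channels, kernel_size=3, stride=2), nn.ReLU()]
            in_ch = channels
        self.stem = nn.Sequential(*layers)

    def forward(self, kappa):
        return self.stem(kappa)

For level ℓ one would instantiate KappaStem(num_downsamples=L - ℓ + 1) in this sketch.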

ML-Net and UNetSeq are trained for 200 epochs with an initial learning rate of 10^{-3} during the first 60 epochs. The learning rate is then linearly decayed to 2×10^{-5} over the next 100 epochs and held constant for the rest of the training. Due to memory constraints, the batch sizes were chosen to be 20 for ML-Net and 16 for UNetSeq. The Adam optimizer (Kingma and Ba, 2015) was used in the standard configuration with parameters β_1 = 0.99, β_2 = 0.999, and without weight decay.
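
A short sketch of this training configuration in PyTorch follows; the choice of scheduler class and the per-epoch granularity of the decay are assumptions, as is the placeholder model.

import torch

model = torch.nn.Conv2d(1, 32, 3)   # placeholder for ML-Net or UNetSeq
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.99, 0.999), weight_decay=0.0)

def lr_factor(epoch):
    # 1e-3 for the first 60 epochs, linear decay to 2e-5 over the next
    # 100 epochs, then constant until epoch 200
    if epoch < 60:
        return 1.0
    if epoch < 160:
        t = (epoch - 60) / 100.0
        return 1.0 + t * (2e-5 / 1e-3 - 1.0)
    return 2e-5 / 1e-3

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)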

Appendix B. Additional tables

The following tables depict the relative L^2 errors of the considered NN architectures for all presented test cases.

28
Multilevel CNNs for Parametric PDEs

problem case      parameter dimension p   ML-Net               UNetSeq
uniform            10                     9.28e−5 ± 6.92e−5    3.72e−5 ± 9.18e−6
uniform            50                     4.88e−5 ± 6.82e−6    1.21e−3 ± 1.66e−3
uniform           100                     4.90e−5 ± 9.56e−6    3.21e−5 ± 3.06e−6
uniform           200                     4.81e−5 ± 7.38e−6    1.17e−3 ± 1.61e−3
log-normal         10                     7.44e−5 ± 1.75e−5    1.66e−3 ± 2.22e−3
log-normal         50                     7.46e−4 ± 9.61e−4    2.30e−3 ± 3.13e−3
log-normal        100                     6.54e−5 ± 6.71e−6    1.72e−3 ± 2.80e−3
log-normal        200                     7.35e−5 ± 2.82e−6    7.94e−5 ± 2.14e−5
cookie fixed       16                     3.29e−4 ± 3.26e−5    2.35e−4 ± 1.37e−5
cookie fixed       64                     5.32e−4 ± 1.80e−5    1.40e−4 ± 4.32e−5
cookie variable    32                     1.33e−3 ± 1.05e−4    7.68e−4 ± 2.45e−5
cookie variable   128                     2.01e−3 ± 8.14e−5    7.14e−4 ± 7.74e−5

Table 4: E_MRL2 for ML-Net and UNetSeq evaluated on all test cases.

problem case      parameter dimension p   ML-Net               UNetSeq
uniform            10                     1.05e−4 ± 6.75e−5    4.77e−5 ± 6.80e−6
uniform            50                     5.86e−5 ± 7.02e−6    1.22e−3 ± 1.66e−3
uniform           100                     5.50e−5 ± 5.07e−6    4.19e−5 ± 3.41e−6
uniform           200                     5.67e−5 ± 5.78e−6    1.18e−3 ± 1.60e−3
log-normal         10                     7.76e−5 ± 1.29e−5    1.66e−3 ± 2.22e−3
log-normal         50                     7.59e−4 ± 9.68e−4    2.30e−3 ± 3.13e−3
log-normal        100                     7.06e−5 ± 4.38e−6    1.72e−3 ± 2.80e−3
log-normal        200                     8.31e−5 ± 3.91e−6    8.69e−5 ± 1.97e−5
cookie fixed       16                     6.19e−3 ± 6.60e−5    6.05e−3 ± 3.50e−5
cookie fixed       64                     9.63e−3 ± 4.33e−5    9.48e−3 ± 2.79e−5
cookie variable    32                     8.81e−3 ± 3.41e−5    8.49e−3 ± 9.10e−5
cookie variable   128                     1.64e−2 ± 5.36e−5    1.62e−2 ± 6.70e−5

Table 5: E^ref_MRL2 for ML-Net and UNetSeq evaluated on all test cases.

Appendix C. Proofs for the results of Section 5


This section is devoted to the proofs of our theoretical analysis in Section 5. For this, we
start with some mathematical notation.

C.1 Mathematical notation


For some function f : D ⊆ R^d → R^n, we denote ‖f‖_{L^∞(D)} := ess sup_{x∈D} ‖f(x)‖_∞, where ‖·‖_∞ denotes the maximum norm of a vector. For two tensors x, y ∈ R^{W×H} with W, H ∈ N, we denote their pointwise multiplication by x ⊙ y ∈ R^{W×H}. Since we consider the FE spaces V_h in the two-dimensional setting on uniformly refined square meshes, FE coefficient vectors x ∈ R^{dim V_h} can be viewed as two-dimensional arrays.
The bilinear form in (2) induces the problem-related and parameter-dependent energy norm given by

    ‖w‖^2_{A_y} := ∫_D κ(y, x) |∇w(x)|^2 dx    for w ∈ V.    (8)

Consider Problem 2.1. Under the uniform ellipticity assumption (which is always satisfied with high probability in our settings), the energy norm is equivalent to the H^1 norm (see e.g. Cohen and DeVore (2015)). We denote by c_{H^1}, C_{H^1} > 0 the grid-independent constants such that for all u ∈ V and y ∈ Γ

    c_{H^1} ‖u‖_{A_y} ≤ ‖u‖_{H^1} ≤ C_{H^1} ‖u‖_{A_y}.    (9)

Moreover, for a multilevel decomposition up to level L ∈ N and v̂_1, …, v̂_L defined as in (6), denote by C_corr > 0 the constant such that for all y ∈ Γ and ℓ = 1, …, L

    ‖v̂_ℓ(y)‖_{H^1} ≤ C_corr 2^{−ℓ} ‖f‖_*.    (10)

Note that for any y ∈ Γ, the solution v_h(y) of Problem 2.1 satisfies

    ‖v_h(y)‖_{A_y} ≤ C_{H^1} ‖f‖_*.    (11)

To limit excessive mathematical overhead in our proofs, we restrict ourselves to a certain type of uniform square grid specified in the following remark.

Remark 11 For D = [0, 1]^2, mesh width h > 0 and m := 1/h ∈ N, we exclusively consider the uniform square grid comprised of m^2 identical squares each subdivided into two triangles as illustrated in Figure 4.

C.2 CNNs: terminology and basic properties


In this section, we briefly introduce basic convolutional operations together with their corresponding notation. We mostly focus on the two-dimensional case. However, all definitions and results can easily be extended to arbitrary dimensions. For a more in-depth treatment of the underlying concepts, we refer to (Goodfellow et al., 2016, Chapter 9).
Convolutions used in CNNs deviate in some details from the established mathematical definition of a convolution. In the two-dimensional case, the input consists of a tensor x_in ∈ R^{C_in×W_in×H_in} with spatial dimensions W_in, H_in ∈ N and number of channels C_in ∈ N, and the output is given by a tensor x_out ∈ R^{C_out×W_out×H_out}. Here, C_out ∈ N denotes the number of output channels and W_out, H_out ∈ N the potentially transformed spatial dimensions. x_out is the result of convolving x_in with a learnable convolutional kernel K ∈ R^{C_in×C_out×W_K×W_K} with kernel width W_K ∈ N, and the channel-wise addition of a learnable bias B ∈ R^{C_out}. We use three different types of convolutions: (i) vanilla, denoted by x_in ∗ K, here W_out = W_in and H_out = H_in; (ii) two-strided, denoted by x_in ∗_{2s} K, here W_out = ⌊W_in/2⌋ − 1 and H_out = ⌊H_in/2⌋ − 1; (iii) two-transpose-strided, denoted by x_in ∗_{2ts} K, here W_out := 2W_in + 1 and H_out := 2H_in + 1. Moreover, we write M(Φ) ∈ N for the number of trainable parameters and L(Φ) ∈ N for the number of layers of a (possibly convolutional) NN Φ.
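
These three operations can be reproduced with standard convolution routines. The following shape check is a sketch in which the 5×5 kernels and the padding choices are assumptions consistent with the output sizes stated above for an odd number of grid points per dimension.

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 63, 63)                 # W_in = H_in = 63

k_vanilla = torch.randn(1, 1, 3, 3)
y = F.conv2d(x, k_vanilla, padding=1)          # zero padding keeps W_out = W_in
print(y.shape)                                 # (1, 1, 63, 63)

k_strided = torch.randn(1, 1, 5, 5)
y = F.conv2d(x, k_strided, stride=2)           # W_out = floor(W_in/2) - 1 = 30
print(y.shape)                                 # (1, 1, 30, 30)

k_tstrided = torch.randn(1, 1, 5, 5)
y = F.conv_transpose2d(x, k_tstrided, stride=2, padding=1)  # W_out = 2*W_in + 1 = 127
print(y.shape)                                 # (1, 1, 127, 127)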
Many previous works have focused on approximating functions by fully connected NNs.
With a simple trick, we transfer these approximation results to CNNs.

Theorem 12 Let D ⊆ R^d and f : D → R be some function. Furthermore, let ε > 0 and Φ be a fully connected NN with d-dimensional input and one-dimensional output such that ‖f − Φ‖_{L^∞(D)} ≤ ε. Then there exists a CNN Ψ with

(i) d input channels and one output channel;

(ii) the spatial dimension of all convolutional kernels is 1 × 1;

(iii) the same activation function as Φ;

(iv) M(Φ) = M(Ψ) and L(Φ) = L(Ψ);

(v) for all W ∈ N, we have

    ‖Ψ − f̂‖_{L^∞(D^{W×W})} ≤ ε,

where f̂ : D^{W×W} → R^{W×W} is the component-wise application of f.

Proof The proof follows directly by using the affine linear transformations in each layer
of Φ in the channel dimension as 1 × 1 convolutions.

In the next corollary, Theorem 12 is used to approximate the pointwise multiplication function of two input tensors by a CNN.

Corollary 13 Let ϱ ∈ L^∞_loc(R) be such that there exists x_0 ∈ R in a neighborhood of which ϱ is three times continuously differentiable and ϱ″(x_0) ≠ 0. Let W ∈ N, B > 0, and ε ∈ (0, 1/2). Then there exists a CNN ×̃ with activation function ϱ, a two-channel input and one-channel output that satisfies the following properties:

1. ‖×̃(x, y) − x ⊙ y‖_{L^∞([−B,B]^{2×W×W}, dx dy)} ≤ ε;

2. L(×̃) = 2 and M(×̃) ≤ 9;

3. the spatial dimension of all convolutional kernels is 1 × 1.

Proof The proof follows from Theorem 12 together with (Gühring and Raslan, 2021, Corollary C.3).


C.3 Approximating isolated V-cycle building blocks


One of the main intermediate steps to approximate the full multigrid cycle by CNNs is the approximation of the operator A_κ. The theoretical backbone of this section is the observation that A_κ u can be approximated by a CNN acting on κ (in an integral representation defined in Definition 14) and u. We start by defining some basic concepts used for our proofs. For the rest of this section, we set D = [0, 1]^2.

Definition 14 Let T be a uniform triangulation of D with corresponding conforming P1 FE space V_h with basis {ϕ_1, …, ϕ_{dim V_h}}. Furthermore, let κ ∈ H^1_0(D).

(i) For i ∈ {1, …, dim V_h} we set Neigh2D(i) := {j ∈ {1, …, dim V_h} : supp ϕ_i ∩ supp ϕ_j ≠ ∅}.

(ii) For each vertex i ∈ {1, …, dim V_h} of the triangulation, we enumerate the six adjacent triangles (arbitrarily but in the same order for every i) and denote them by T_i^(k) with k = 1, …, 6 (see Figure 4 for an illustration).

(iii) For i = 1, …, dim V_h and k = 1, …, 6, we set

    Υ(κ, T, k, i) := ∫_{T_i^k} κ dx

and use the notation

    Υ(κ, T, k) := [Υ(κ, T, k, i)]_{i=1}^{dim V_h} ∈ R^{dim V_h}

and

    Υ(κ, T) := [Υ(κ, T, k)]_{k=1}^{6} ∈ R^{6×dim V_h}.

In the next lemma, we show that the integrals Υ(κ, T , k, i) can be computed via a
convolution from (a discretized) κ and, furthermore, that integrals over a coarse grid can
be computed from fine-grid-integrals, again via a convolution.

Lemma 15 Let T_h, T_{2h} be nested triangulations of fineness h and 2h, respectively, with corresponding FE spaces V_{2h} ⊆ V_h ⊆ H^1_0(D). Then, the following holds:

(i) There exists a convolutional kernel K ∈ R^{1×6×3×3} such that for every κ ∈ V_h, we have for k ∈ {1, …, 6}

    (κ ∗ K)[k] = Υ(κ, T_h, k).

(ii) There exists a convolutional kernel K ∈ R^{6×6×3×3} such that for every κ ∈ H^1_0(D) and k = 1, …, 6, we have

    ([Υ(κ, T_h, 1), …, Υ(κ, T_h, 6)] ∗ K)[k] = Υ(κ, T_{2h}, k).


Figure 4: Illustration showing a possible enumeration of the six adjacent triangles of a grid point on the uniform square mesh from Definition 14 (ii). Note that this assignment is redundant, as on the right one can see that T_i^(6) = T_j^(4) and T_i^(1) = T_j^(3).

Proof We start by showing (i). For k = 1, …, 6 and i = 1, …, dim V_h,

    Υ(κ, T, k, i) = ∫_{T_i^k} κ dx = ∑_{j=1}^{dim V_h} κ_j ∫_{T_i^k} ϕ_j^(h) dx = (h^2/3) ∑_{j : supp ϕ_j^(h) ∩ T_i^k ≠ ∅} κ_j,

where we used in the last step that ∫_{T_i^k} ϕ_j^(h) dx = h^2/3 if supp ϕ_j^(h) ∩ T_i^k ≠ ∅. Clearly, supp ϕ_j^(h) ∩ T_i^k = ∅ for all j ∈ {1, …, dim V_h} with j ∉ Neigh2D(i). Furthermore, the position of the relevant neighbors {j : supp ϕ_j^(h) ∩ T_i^k ≠ ∅} relative to index i is invariant to 2D-translations of i. Combining these two observations concludes the proof.

(ii) Intuitively it is clear that a strided convolution is able to locally compute the mean of all coefficients from the finer triangulation. However, to prove this rigorously would be a tedious exercise. We hence omit the formal proof and refer to Figure 5 for an illustration of the neighboring sub-triangles in each grid point.
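
As an illustration of Lemma 15 (i), the element integrals on a fixed level can be gathered by a single six-channel 3×3 convolution applied to the nodal coefficients of κ. The PyTorch sketch below uses an assumed enumeration of the adjacent triangles (the vertex offsets are one possible convention compatible with Figure 4); the lemma only asserts that some such kernel exists.

import torch
import torch.nn.functional as F

h = 1.0 / 64                    # mesh width (example value)
w = h**2 / 3                    # weight of each of the three vertices of a triangle

# offsets (row, column) of the three vertices of the k-th adjacent triangle,
# relative to the center node -- an assumed enumeration, cf. Figure 4
triangle_vertices = [
    [(0, 0), (0, 1), (-1, 1)],
    [(0, 0), (-1, 1), (-1, 0)],
    [(0, 0), (-1, 0), (0, -1)],
    [(0, 0), (0, -1), (1, -1)],
    [(0, 0), (1, -1), (1, 0)],
    [(0, 0), (1, 0), (0, 1)],
]
K = torch.zeros(6, 1, 3, 3)
for k, verts in enumerate(triangle_vertices):
    for (di, dj) in verts:
        K[k, 0, 1 + di, 1 + dj] = w

kappa = torch.rand(1, 1, 63, 63)          # nodal values of kappa on the fine grid
upsilon = F.conv2d(kappa, K, padding=1)   # (1, 6, 63, 63); channel k holds Upsilon(kappa, T_h, k)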

The next theorem shows that using the integral representation of κ from Definition 14,
we can now represent Aκ u by the pointwise multiplication of the κ integrals with the output
of a convolution applied to u.
Theorem 16 (Representation of A_κ u) For k = 1, …, 6, there exist convolutional kernels K^(k) ∈ R^{1×1×3×3}, such that for the function

    F : R^{7×dim V} → R^{dim V},   (u, κ^(1), …, κ^(6)) ↦ ∑_{k=1}^{6} κ^(k) ⊙ (u ∗ K^(k)),

it holds that F is continuous and for any κ ∈ H^1_0(D), we have

    F(u, Υ(κ, T, 1), …, Υ(κ, T, 6)) = A_κ u.

33
Heiß, Gühring and Eigel

Figure 5: Illustration of the neighboring triangles and sub-triangles of a finer mesh at some grid point. This figure shows how an adjacent triangle (blue) to the grid point x_i in the coarse grid is subdivided. Note that all contained finer triangles can be assigned to a grid point directly adjacent to x_i in the fine mesh. Hence, summing over the diffusivity values associated with the finer triangles can be represented using a convolutional (3 × 3)-kernel in the fine mesh.

Proof We start by introducing some notation: Let i, j ∈ {1, …, dim V_h} and k ∈ {1, …, 6} and set C_ijk ∈ R as the constant that ⟨∇ϕ_i, ∇ϕ_j⟩ attains on T_i^k. Note that C_ijk = 0 if i ∉ Neigh2D(j). We now represent each entry of A_κ as a sum of multiplications: For i, j ∈ {1, …, dim V_h}, we have

    (A_κ)_ij = ∫_D κ ⟨∇ϕ_i, ∇ϕ_j⟩ dx = ∑_{k=1}^{6} ∫_{T_i^k} κ ⟨∇ϕ_i, ∇ϕ_j⟩ dx = ∑_{k=1}^{6} Υ(κ, T, k, i) C_ijk,

where we use Definition 14 (ii) for the last step. For j ∈ {1, …, dim V_h}, we can now rewrite (A_κ u)_j as

    (A_κ u)_j = ∑_{i=1}^{dim V_h} u_i ∑_{k=1}^{6} Υ(κ, T, k, j) C_ijk
              = ∑_{k=1}^{6} Υ(κ, T, k, j) ∑_{i=1}^{dim V_h} u_i C_ijk
              = ∑_{k=1}^{6} Υ(κ, T, k, j) ∑_{i∈Neigh2D(j)} u_i C_ijk.

Here, we use in the first step the definition of A_κ and in the third step that supp ϕ_i ∩ supp ϕ_j ≠ ∅ only for neighboring indices. Since C_ijk only depends on the relative position of i to j (in the 2D sense) and is independent of j, the inner sum can be expressed in the vectorized form as a zero-padded convolution with a 3 × 3 kernel (the number of neighbors) where the values depend on k, i.e.,

    A_κ u = ∑_{k=1}^{6} Υ(κ, T, k) ⊙ (u ∗ K^(k)) = F(u, Υ(κ, T, 1), …, Υ(κ, T, 6)),

where K^(k) are the respective kernels.

In the following remark, we elaborate on the treatment of the boundary conditions.


Remark 17 In our setting, we only consider basis functions on inner grid points. Alternatively, one could include the boundary basis functions and constrain their coefficients to zero to enforce homogeneous Dirichlet boundary conditions. Representing the FE operator as a zero-padded convolution on the inner grid vertices naturally realizes homogeneous Dirichlet boundary conditions by implicitly including boundary basis coefficients as zeros through padding.
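
A sketch of how A_κ u can be applied in this form with standard convolution routines is given below. The kernel values are placeholders for the constants C_ijk from the proof of Theorem 16 and depend on the triangulation; only the structure (six zero-padded 3×3 convolutions followed by pointwise multiplications and a channel sum) is taken from the theorem.

import torch
import torch.nn.functional as F

def apply_A_kappa(u, upsilon, K):
    """u: (1, 1, n, n) nodal coefficients, upsilon: (1, 6, n, n) element integrals,
    K: (6, 1, 3, 3) fixed kernels (placeholders for the constants C_ijk).
    Returns (1, 1, n, n), i.e. A_kappa u."""
    conv = F.conv2d(u, K, padding=1)            # zero padding realizes the Dirichlet BC
    return (upsilon * conv).sum(dim=1, keepdim=True)

# example shapes
u = torch.randn(1, 1, 63, 63)
upsilon = torch.rand(1, 6, 63, 63)
K = torch.randn(6, 1, 3, 3)                     # placeholder kernel values
Au = apply_A_kappa(u, upsilon, K)               # (1, 1, 63, 63)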
Using the representation of A_κ u from Theorem 16 as a convolution followed by a pointwise multiplication, Corollary 13 can now be applied to obtain an approximation by a CNN.

Theorem 18 (Approximation of A_κ u) Let T be a uniform triangulation of D with corresponding conforming P1 FE space V. Furthermore, let the activation function ϱ ∈ L^∞_loc(R) be such that there exists x_0 ∈ R in a neighborhood of which ϱ is three times continuously differentiable and ϱ″(x_0) ≠ 0. Then, for any ε > 0 and M > 0, there exists a CNN Ψ_{ε,M} : R^{7×dim V_h} → R^{dim V_h} with size independent of ε and M such that

    ‖Ψ_{ε,M} − F‖_{L^∞([−M,M]^{7×dim V_h})} ≤ ε.

Proof A three-step procedure leads to the final result:

(i) Clearly, there exists a one-layer CNN that realizes the mapping

    [u, κ^(1), …, κ^(6)] ↦ [u ∗ K^(k), κ^(k)]_{k=1}^{6}.

(ii) Corollary 13 yields the existence of a two-layer CNN Ψ̃ with at most 9 weights and 1 × 1 kernels such that

    ‖Ψ̃([u ∗ K^(k), κ^(k)]_{k=1}^{6}) − [κ^(k) ⊙ (u ∗ K^(k))]_{k=1}^{6}‖_∞ ≤ ε/6.

(iii) The channel-wise addition can trivially be constructed with a one-layer CNN with a 1 × 1 kernel.

A concatenation of the above three CNNs concludes the proof.

Recall that the prolongation and its counterpart, the weighted restriction, are essential
operations in the multigrid cycle. The following remark states that these operations can
naturally be expressed as convolutions.


Remark 19 (Weighted restriction and prolongation) Consider a two-level decomposition using nested conforming P1 FE spaces V_1 ⊆ V_2 ⊆ H^1_0(D) on uniform square meshes. Let P be the corresponding prolongation matrix (Definition 2). Then, there exist convolutional kernels K_1, K_2 ∈ R^{1×1×3×3} such that for any FE coefficient vector

(i) u ∈ R^{dim V_2}, we have u ∗_{2s} K_1 = P^T u;

(ii) u ∈ R^{dim V_1}, we have u ∗_{2ts} K_2 = P u.
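
A sketch of these two operations with an explicitly chosen 3×3 stencil follows. The stencil entries are an assumption (the usual P1 prolongation weights for the mesh of Remark 11); the remark itself only asserts the existence of suitable kernels.

import torch
import torch.nn.functional as F

P_stencil = torch.tensor([[0.0, 0.5, 0.5],
                          [0.5, 1.0, 0.5],
                          [0.5, 0.5, 0.0]]).reshape(1, 1, 3, 3)

u_coarse = torch.randn(1, 1, 31, 31)     # coefficients on the coarse grid
u_fine = F.conv_transpose2d(u_coarse, P_stencil, stride=2)   # P u, shape (1, 1, 63, 63)

r_fine = torch.randn(1, 1, 63, 63)       # residual on the fine grid
r_coarse = F.conv2d(r_fine, P_stencil, stride=2)             # P^T r, shape (1, 1, 31, 31)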

C.4 Putting it all together


In this section, we combine the approximation of individual computation steps of the multi-
grid algorithm by CNNs to approximate an arbitrary number of multigrid cycles by a CNN.
For the composition of the individual approximations, we first provide an auxiliary lemma.

Lemma 20 Let n ∈ N and d_1, …, d_{n+1} ∈ N. For i = 1, …, n let f_i : R^{d_i} → R^{d_{i+1}} be continuous functions and define F as their concatenation, i.e.,

    F : R^{d_1} → R^{d_{n+1}},   F := f_n ◦ … ◦ f_1.

Then, for every M, ε > 0, there exist M_1, …, M_n > 0 and ε_1, …, ε_n > 0, such that the following holds true: If f̃_i : R^{d_i} → R^{d_{i+1}} are functions such that ‖f_i − f̃_i‖_{L^∞([−M_i,M_i]^{d_i})} ≤ ε_i for i = 1, …, n, then

    ‖F − f̃_n ◦ … ◦ f̃_1‖_{L^∞([−M,M]^{d_1})} ≤ ε.

Proof We prove the statement by induction over n ∈ N. For n = 1, the statement is clear. Define the function g : R^{d_1} → R^{d_n} for n > 1 by

    g := f_{n−1} ◦ … ◦ f_1.

We set ε_n := ε/2 and M_n := ‖g‖_{L^∞([−M,M]^{d_1})} + ε < ∞ (since g is continuous). As f_n is uniformly continuous on compact sets, there exists 0 < δ ≤ ε such that for all z, z̃ ∈ [−M_n, M_n]^{d_n} with ‖z − z̃‖_∞ ≤ δ we have ‖f_n(z) − f_n(z̃)‖_∞ ≤ ε/2.
Let now M_1, …, M_{n−1} and ε_1, …, ε_{n−1} be given by the induction assumption (applied with M = M and ε = δ) and let f̃_i : R^{d_i} → R^{d_{i+1}} be corresponding approximations for i = 1, …, n − 1. Then (again by the induction assumption), it holds that

    ‖g − f̃_{n−1} ◦ … ◦ f̃_1‖_{L^∞([−M,M]^{d_1})} ≤ δ.

Setting g̃ := f̃_{n−1} ◦ … ◦ f̃_1, we get ‖g̃‖_{L^∞([−M,M]^{d_1})} ≤ ‖g‖_{L^∞([−M,M]^{d_1})} + δ ≤ M_n. The proof is concluded by deriving that

    ‖F − (f̃_n ◦ … ◦ f̃_1)‖_{L^∞([−M,M]^{d_1})}
        ≤ ‖(f_n ◦ g) − (f_n ◦ g̃)‖_{L^∞([−M,M]^{d_1})} + ‖(f_n ◦ g̃) − (f̃_n ◦ g̃)‖_{L^∞([−M,M]^{d_1})}
        ≤ ε/2 + ε/2 = ε.


With the preceding preparations, we are now ready to prove our main results. We start
with the approximation of the multigrid algorithm by a CNN.
Proof of Theorem 6 The overall proof strategy consists of decomposing the function MG^m_{k_0,k} : R^{3×dim V_L} → R^{dim V_L}, where V_L = V_h, into a concatenation of a finite number of continuous functions that can either be implemented by a CNN or arbitrarily well approximated by a CNN. An application of Lemma 20 then yields that the concatenation of these CNNs (which is again a CNN) approximates MG^m_{k_0,k}. We make the following definitions:

(i) Integrating the diffusion coefficient. Let K ∈ R^{1×6×3×3} be the convolutional kernel from Lemma 15(i). For the function

    f_in : R^{3×dim V_L} → R^{8×dim V_L},   (u, κ, f) ↦ (u, κ ∗ K, f),

it then follows from Lemma 15(i) that f_in([u, κ, f]) = [u, Υ(κ_h, T^L), f].

(ii) Smoothing iterations. For ℓ = 1, …, L let F_ℓ : R^{7×dim V_ℓ} → R^{dim V_ℓ} be defined as in Theorem 16. Then we define

    f_sm^ℓ : R^{8×dim V_ℓ} → R^{8×dim V_ℓ},   (u, κ^(1), …, κ^(6), f) ↦ (u + ω(f − F_ℓ(u, κ^(1), …, κ^(6))), κ^(1), …, κ^(6), f).

It follows from Theorem 16 that

    f_sm^ℓ([u, Υ(κ_h, T^ℓ), f]) = [u + ω(f − A^ℓ_{κ_h} u), Υ(κ_h, T^ℓ), f].

(iii) Restricted residual. Similarly to the previous step, we define

    f_resi^ℓ : R^{8×dim V_ℓ} → R^{9×dim V_ℓ},   (u, κ^(1), …, κ^(6), f) ↦ (u, f − F_ℓ(u, κ^(1), …, κ^(6)), κ^(1), …, κ^(6), f),

and get from Theorem 16 that

    f_resi^ℓ([u, Υ(κ_h, T^ℓ), f]) = [u, f − A^ℓ_{κ_h} u, Υ(κ_h, T^ℓ), f].


Let now [K_1, …, K_6] ∈ R^{6×6×3×3} be the convolutional kernel from Lemma 15(ii) and P_{ℓ−1} ∈ R^{dim V_ℓ × dim V_{ℓ−1}} the prolongation matrix from Definition 2 (where its transpose acts as the weighted restriction). Then we define

    f_rest^ℓ : R^{9×dim V_ℓ} → R^{8×dim V_ℓ} × R^{8×dim V_{ℓ−1}},
    (u, r, κ^(1), …, κ^(6), f) ↦ ( (u, κ^(1), …, κ^(6), f),  (0, κ^(1) ∗ K_1, …, κ^(6) ∗ K_6, P_{ℓ−1}^T r) ).

Note that Υ(κ_h, T^ℓ, k) ∗ K_k = Υ(κ_h, T^{ℓ−1}, k) for k = 1, …, 6.

(iv) Add coarse grid correction to fine grid. As above, let P_{ℓ−1} ∈ R^{dim V_ℓ × dim V_{ℓ−1}} be the prolongation matrix from Definition 2. We define the function

    f_prol^ℓ : R^{8×dim V_ℓ} × R^{8×dim V_{ℓ−1}} → R^{8×dim V_ℓ},
    ( (u, κ^(1), …, κ^(6), f),  (e, ·, …, ·) ) ↦ (u + P_{ℓ−1} e, κ^(1), …, κ^(6), f).

(v) Return solution. We define

    f_out : R^{8×dim V_ℓ} → R^{dim V_ℓ},   (u, κ^(1), …, κ^(6), f) ↦ u.

We assemble a single V-cycle VC^ℓ_{k_0,k} on level ℓ = 1, …, L as follows:

    VC^ℓ_{k_0,k} := (○_{i=1}^{k} f_sm^ℓ) ◦ f_prol^ℓ ◦ (Id, VC^{ℓ−1}_{k_0,k}) ◦ f_rest^ℓ ◦ f_resi^ℓ ◦ (○_{i=1}^{k} f_sm^ℓ),
    VC^1_{k_0,k} := ○_{i=1}^{k_0} f_sm^1.


Finally, MG^m_{k_0,k} is given by:

    MG^m_{k_0,k} = f_out ◦ (○_{i=1}^{m} VC^L_{k_0,k}) ◦ f_in.

It is easy to see that the above defined functions are continuous and, thus, Lemma 20 dictates the accuracy with which to approximate each part of the multigrid algorithm (f_sm^ℓ, f_prol^ℓ, f_rest^ℓ, f_resi^ℓ, f_in, f_out) on each level ℓ = 1, …, L to yield a final approximation accuracy of δ > 0 for inputs bounded by M > 0.
On each level ℓ = 1, …, L, we derive the approximations fulfilling these requirements for f_sm^ℓ and f_resi^ℓ using Theorem 18, f_prol^ℓ and f_rest^ℓ as in Remark 19, and f_rest^ℓ using Lemma 15. The approximation of f_in is derived using Lemma 15 and f_out is trivially represented using a single convolution. Concatenating the individual CNNs yields a CNN Ψ with an architecture resembling multiple concatenated UNets, such that

    ‖Ψ − MG^m_{k_0,k}‖_{L^∞([−M,M]^{3×n})} ≤ δ.

The norm equivalence of ‖·‖_{L^∞([−M,M]^{3×n})} and ‖·‖_{H^1} in the finite dimensional setting directly implies inequality (i) with a suitable choice of δ. The parameter bound (ii) follows from the construction of Ψ.
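
For illustration, the V-cycle assembled above corresponds to the following recursive procedure. This is a plain sketch of the algorithm that the CNN emulates, not of the network itself; apply_A, restrict, prolong and coarsen_kappa are placeholders for the convolutional realizations from Theorem 16, Lemma 15 and Remark 19.

import torch

def v_cycle(u, upsilon, f, level, omega, k, k0,
            apply_A, restrict, prolong, coarsen_kappa):
    if level == 1:                      # coarsest grid: k0 damped Richardson steps
        for _ in range(k0):
            u = u + omega * (f - apply_A(u, upsilon))
        return u
    for _ in range(k):                  # pre-smoothing (f_sm applied k times)
        u = u + omega * (f - apply_A(u, upsilon))
    r_c = restrict(f - apply_A(u, upsilon))    # restricted residual (f_resi, f_rest)
    upsilon_c = coarsen_kappa(upsilon)         # coarse-grid kappa integrals (Lemma 15 (ii))
    e = v_cycle(torch.zeros_like(r_c), upsilon_c, r_c, level - 1,
                omega, k, k0, apply_A, restrict, prolong, coarsen_kappa)
    u = u + prolong(e)                         # coarse-grid correction (f_prol)
    for _ in range(k):                  # post-smoothing
        u = u + omega * (f - apply_A(u, upsilon))
    return u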

Proof of Corollary 8 Fix the damping parameter ω > 0 and k ∈ N such that µ(k) < 1 and let m ∈ N be chosen as small as possible such that

    m ≥ (log(1/µ(k)))^{−1} log(2 C_{H^1}^2 / ε).

Let k_0 ∈ N be given such that the contraction factor of k_0 damped Richardson iterations on the coarsest grid is smaller than µ(k). Let Ψ be the CNN from Theorem 6 setting M = max{sup_{y∈Γ} ‖κ_y‖_∞, ‖f‖_∞} such that for all κ, f ∈ [−M, M]^{n^2}

    ‖Ψ(0, κ, f) − MG^m_{k_0,k}(0, κ, f)‖_{H^1} ≤ (1/2) ε ‖f‖_*.

Using the contraction property of the multigrid algorithm from Theorem 3, we get

    ‖MG^m_{k_0,k}(0, κ_y, f) − v_h(y)‖_{A_y} ≤ µ(k)^m ‖v_h(y)‖_{A_y} ≤ (1/2) ε C_{H^1}^{−1} ‖f‖_*.

This yields

    ‖Ψ(0, κ_y, f) − v_h(y)‖_{H^1}
        ≤ ‖Ψ(0, κ_y, f) − MG^m_{k_0,k}(0, κ_y, f)‖_{H^1} + ‖MG^m_{k_0,k}(0, κ_y, f) − v_h(y)‖_{H^1}
        ≤ (1/2) ε ‖f‖_* + C_{H^1} ‖MG^m_{k_0,k}(0, κ_y, f) − v_h(y)‖_{A_y}
        ≤ ε ‖f‖_*.

Assuming ε ≤ 1, there is a constant C > 0 such that m ≤ C log(1/ε). Together with the parameter count from Theorem 6, this concludes the proof.

Proof of Corollary 9 This proof is structured as follows. We consider a multilevel structure of V-cycles to approximate the fine grid solution and derive error estimates for this approximation. Similarly to Corollary 8, we show how a CNN of sufficient size is able to represent this algorithm and translate the error estimates to the network approximation.
To this end, we introduce some constants and fix the number of V-cycle iterations done on each grid level: Let c := max{C_{H^1}, C_corr}. Fix k ∈ N such that µ(k) < 1 and set m_1, …, m_L ∈ N such that

    m_L ≥ (log(1/µ(k)))^{−1} (log(2c^2 L / ε) − L log(2)),
    m_ℓ ≥ (log(1/µ(k)))^{−1} log(2),   for ℓ = 1, …, L − 1.

For any y ∈ Γ, we denote by A^ℓ_y ∈ R^{dim V_ℓ × dim V_ℓ} the discretized operator (4) and by f_ℓ the right-hand side from Problem 2.1 on level ℓ = 1, …, L such that A^ℓ_y v_ℓ(y) = f_ℓ. Moreover, for ease of notation, we adapt the functions introduced in the proof of Theorem 6 and set

(i) for each level ℓ ∈ 1, …, L, κ^ℓ_y ∈ R^{6×dim V_ℓ} as the six-channel tensor

    κ^ℓ_y := [Υ(κ_y, T^ℓ, 1), …, Υ(κ_y, T^ℓ, 6)].

(ii) the function MG̃^{m_ℓ}_{k_0,k} : R^{8×dim V_ℓ} → R^{dim V_ℓ} as

    MG̃^{m_ℓ}_{k_0,k} := f_out ◦ (○_{i=1}^{m_ℓ} VC^ℓ_{k_0,k}),

such that MG̃^{m_ℓ}_{k_0,k} ◦ f_in = MG^{m_ℓ}_{k_0,k}, the m_ℓ-fold multigrid iteration on level ℓ.
We outline the following multilevel procedure for solving Problem 2.1 using multigrid with y ∈ Γ and κ_y ∈ R^{dim V_L} given as input:

• κ_y subsampling. Using κ_y, compute κ^L_y as outlined in Lemma 15 (i). Subsequently, compute κ^ℓ_y for each level ℓ = 2, …, L − 1 as shown in Lemma 15 (ii).

• ℓ = 1. Solve for v_1(y) using m_1 iterations of the multigrid cycle. Let

    ṽ_1(y) := MG̃^{m_1}_{k_0,k}(0, κ^1_y, f).

• ℓ = 2, …, L. While v̂_ℓ(y) is the true correction with respect to the Galerkin approximation v_{ℓ−1}(y), we introduce v̌_ℓ(y) as the correction with respect to ṽ_{ℓ−1}(y). To this end, set

    v̌_ℓ(y) := v_ℓ(y) − P_ℓ ṽ_{ℓ−1}(y),
    f̃_ℓ := f_ℓ − A^ℓ_y P_ℓ ṽ_{ℓ−1}(y) = A^ℓ_y v̌_ℓ(y).

Solve A^ℓ_y v̌_ℓ(y) = f̃_ℓ using m_ℓ iterations of the multigrid V-cycle and set

    ṽ_ℓ(y) := MG̃^{m_ℓ}_{k_0,k}(0, κ^ℓ_y, f̃_ℓ) + P_ℓ ṽ_{ℓ−1}(y).
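
A schematic sketch of this procedure is given below, reusing the v_cycle routine sketched after the proof of Theorem 6. All operator arguments as well as the level-wise right-hand sides f[1..L] are placeholders.

import torch

def multilevel_solve(upsilon_fine, f, L, m, omega, k, k0, v_cycle,
                     apply_A, restrict, prolong, coarsen_kappa):
    # kappa subsampling: element integrals on every level (Lemma 15)
    upsilon = {L: upsilon_fine}
    for l in range(L - 1, 0, -1):
        upsilon[l] = coarsen_kappa(upsilon[l + 1])

    # coarsest level: m[1] multigrid iterations started from zero
    v = torch.zeros_like(f[1])
    for _ in range(m[1]):
        v = v_cycle(v, upsilon[1], f[1], 1, omega, k, k0,
                    apply_A, restrict, prolong, coarsen_kappa)

    # finer levels: prolongate and solve the residual equation with m[l] V-cycles
    for l in range(2, L + 1):
        v = prolong(v)
        f_tilde = f[l] - apply_A(v, upsilon[l])
        e = torch.zeros_like(f_tilde)
        for _ in range(m[l]):
            e = v_cycle(e, upsilon[l], f_tilde, l, omega, k, k0,
                        apply_A, restrict, prolong, coarsen_kappa)
        v = v + e
    return v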


This procedure yields the following estimates using the contraction from Theorem 3.

• ℓ = 1. Using the contraction property from Theorem 3 and m_1 ≥ − log(2)/log(µ(k)) yields

    ‖ṽ_1(y) − v_1(y)‖_{A^1_y} ≤ (1/2) c ‖f‖_*.

• ℓ = 2, …, L − 1. Using again Theorem 3, the definitions for m_ℓ, ṽ_ℓ, v̌_ℓ and v̂_ℓ = v_ℓ − P_ℓ v_{ℓ−1}, it holds

    ‖ṽ_ℓ(y) − v_ℓ(y)‖_{A^ℓ_y} = ‖MG̃^{m_ℓ}_{k_0,k}(0, κ^ℓ_y, f̃_ℓ) + P_ℓ ṽ_{ℓ−1}(y) − v̌_ℓ(y) − P_ℓ ṽ_{ℓ−1}(y)‖_{A^ℓ_y}
        = ‖MG̃^{m_ℓ}_{k_0,k}(0, κ^ℓ_y, f̃_ℓ) − v̌_ℓ(y)‖_{A^ℓ_y}
        ≤ (1/2) ‖v̌_ℓ(y)‖_{A^ℓ_y}
        ≤ (1/2) ‖v̂_ℓ(y)‖_{A^ℓ_y} + (1/2) ‖P_ℓ ṽ_{ℓ−1}(y) − P_ℓ v_{ℓ−1}(y)‖_{A^ℓ_y}
        ≤ (1/2) c 2^{−ℓ} ‖f‖_* + (1/2) ‖ṽ_{ℓ−1}(y) − v_{ℓ−1}(y)‖_{A^{ℓ−1}_y}.

Here, the bound on v̂_ℓ(y) follows from Equation (10).

• ℓ = L. By the same argument as for the case ℓ = 2, …, L − 1 together with the norm inequality in (9) and the definition of m_L, we arrive at the following estimate for the H^1-distance on the finest level:

    ‖ṽ_L(y) − v_L(y)‖_{H^1}
        ≤ c ‖ṽ_L(y) − v_L(y)‖_{A^L_y}
        = c ‖MG̃^{m_L}_{k_0,k}(0, κ^L_y, f̃_L) − v̌_L(y)‖_{A^L_y}
        ≤ (2^L ε)/(2cL) ‖v̌_L(y)‖_{A^L_y}
        ≤ (2^L ε)/(cL) ( (1/2) ‖v̂_L(y)‖_{A^L_y} + (1/2) ‖P_L ṽ_{L−1}(y) − P_L v_{L−1}(y)‖_{A^L_y} )
        ≤ (2^L ε)/(cL) ( (1/2) c 2^{−L} ‖f‖_* + (1/2) ‖ṽ_{L−1}(y) − v_{L−1}(y)‖_{A^{L−1}_y} ).

Recursively, this yields

    ‖ṽ_L(y) − v_L(y)‖_{H^1} ≤ (2^L ε)/(cL) ∑_{ℓ=1}^{L} ( ∏_{n=ℓ}^{L} 1/2 ) 2^{−ℓ} c ‖f‖_* = (1/2) ε ‖f‖_*.

In total, the procedure is comprised of downsampling operations, multigrid V-cycles and intermediate upsampling operations, all of which can be approximated arbitrarily well by CNNs: The downsampling of the diffusivity field is realized following Lemma 15, the V-cycles are realized as shown in Theorem 6 and the prolongation operators for upsampling are realized as in Remark 19. Hence, the procedure is a concatenation of finitely many functions, each representable by CNNs up to any accuracy. Using Lemma 20 (with M = M and ε = (1/2) ε ‖f‖_*) implies the existence of a CNN Ψ such that for all y ∈ Γ and all right-hand sides f, we have

    ‖Ψ(κ_y, f) − v_L(y)‖_{H^1} ≤ ‖Ψ(κ_y, f) − ṽ_L(y)‖_{H^1} + ‖ṽ_L(y) − v_L(y)‖_{H^1} ≤ ε ‖f‖_*.

The structure of the presented multilevel algorithm together with the parameter bounds from Theorem 6, Lemma 15 and Remark 19 implies that Ψ is an ML-Net architecture and that there exists a constant C > 0 with

    M(Ψ) ≤ ∑_{ℓ=1}^{L−1} C ℓ + C (log(2L/ε) − L log(2)) L
         ≤ CL (L + log(1/ε) + log(2L) − L log(2))
         ≤ CL log(1/ε) + CL^2.
