In this section, we detail efficient implementations of the multigrid ingredients for locally refined meshes used by Algorithm
1. We start with the handling of constraints. Then, we proceed with the matrix-free evaluation of operator
\(\boldsymbol {A}\), which is needed on the active and the multigrid levels, as well as with smoothers and coarse-grid solvers. The section concludes with an algorithmic framework that efficiently realizes all considered kinds of matrix-free transfer operators.
3.2 Matrix-free Operator Evaluation
Instead of assembling the sparse system matrix
\(\boldsymbol {A}\) and performing matrix-vector multiplications of the form
\(\boldsymbol {A} \boldsymbol {x}\), the matrix-free operator evaluation computes the underlying finite-element integrals to represent
\(\mathcal {A}(\boldsymbol {x})\). This is motivated by the fact that many iterative solvers and smoothers do not need the matrix
\(\boldsymbol {A}\) explicitly but only its action on a vector. On a high level, matrix-free operator evaluation can be derived in two steps: (1) loop switching
\(\boldsymbol {v} = \boldsymbol {A} \boldsymbol {x} = (\sum _{e}(\mathcal {G}_e^{T} \circ \mathcal {C}_e^{T}) \boldsymbol {A}_e (\mathcal {C}_e \circ \mathcal {G}_e)) \boldsymbol {x} = \sum _{e} \mathcal {G}_e^{T} \circ \mathcal {C}_e^{T} (\boldsymbol {A}_e (\mathcal {C}_e \circ \mathcal {G}_e \boldsymbol {x}))= \sum _{e} \mathcal {G}_e^T \circ \mathcal {C}_e^{T} (\boldsymbol {A}_e \boldsymbol {x}_e)\), i.e., replacing the sequence “assembly–mat-vec” by a loop over cells with the three steps “gathering–application of the element stiffness matrix–scattering,” and (2) exploitation of the structure of the element stiffness matrix
\(\boldsymbol {A}_e\). In this formula, the operator
\(\mathcal {C}_e \circ \mathcal {G}_e\) and its transpose represent the extraction of element-wise quantities and their assembly into the global matrix and vectors, respectively [
92], including the constraints
\(\mathcal {C}_e\) as discussed in Section
3.1. The element stiffness matrix
\(\boldsymbol {A}_e\) is constructed from three main ingredients of a finite element method: the shape functions and derivatives of the tentative solution tabulated at quadrature points; geometric factors and coefficients of the underlying differential operator at quadrature points; and the multiplication by the test functions and their derivatives with subsequent summation over quadrature points. The final structure of a matrix-free operator evaluation in the context of continuous finite elements in operator form is as follows:
\(\boldsymbol {v} = \sum _e \mathcal {G}_e^T \circ \mathcal {C}_e^T \circ \tilde{\mathcal {S}}_e^T \circ \mathcal {Q}_e \circ \mathcal {S}_e \circ \mathcal {C}_e \circ \mathcal {G}_e \, \boldsymbol {x}. \qquad (3)\)
This structure is depicted in Figure
3. For each cell
e, cell-relevant values are gathered with operator
\(\mathcal {G}_e\), constraints are applied with
\(\mathcal {C}_e\), and values, gradients, or Hessians associated to the vector
\(\boldsymbol {x}\) are computed at the quadrature points with
\(\mathcal {S}_e\). These quantities are processed by a quadrature-point operation
\(\mathcal {Q}_e\); the result is integrated and summed into the result vector
\(\boldsymbol {v}\) by applying
\(\tilde{\mathcal {S}}_e^T\),
\(\mathcal {C}_e^T\), and
\(\mathcal {G}_e^T\). In this publication, we consider symmetric (self-adjoint) PDE operators with
\(\tilde{\mathcal {S}}_e = \mathcal {S}_e\).
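The cell loop above can be sketched in a few lines of plain C++. The following is a minimal illustration, not the deal.II implementation: it evaluates the 1D Laplacian with linear elements on a uniform mesh of `n` cells with one-point quadrature, where the function name `laplace_apply`, the uniform mesh width `h`, and the omission of constraints (\(\mathcal {C}_e\) equal to the identity) are simplifying assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: matrix-free evaluation v = sum_e G_e^T (S_e^T Q_e S_e) G_e x
// for the 1D Laplacian with linear elements on a uniform mesh of
// x.size()-1 cells (no constraints, i.e., C_e = identity).
std::vector<double> laplace_apply(const std::vector<double> &x, double h)
{
  const std::size_t n_cells = x.size() - 1;
  std::vector<double> v(x.size(), 0.0);
  for (std::size_t e = 0; e < n_cells; ++e)
    {
      // G_e: gather the cell-local values
      const double x0 = x[e], x1 = x[e + 1];
      // S_e: evaluate the gradient at the single quadrature point
      const double grad = (x1 - x0) / h;
      // Q_e: multiply by the quadrature weight times the Jacobian (JxW = h)
      const double q = grad * h;
      // S_e^T and G_e^T: test with shape-function gradients and scatter-add
      v[e]     += q * (-1.0 / h);
      v[e + 1] += q * (+1.0 / h);
    }
  return v;
}
```

Applied to nodal values of a linear function, the result vanishes at interior DoFs, which is a quick consistency check for such kernels.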
In the literature, specialized implementations for GPUs [
4,
57,
64,
68,
69] and CPUs [
4,
62,
63,
80,
82] for operations as expressed in Equation (
3) have been presented. For tensor-product (quadrilateral and hexahedral) elements, a technique known as sum factorization [
78,
86] is often applied, which allows us to replace full interpolations from the local solution values to the quadrature points by a sequence of one-dimensional (1D) steps. In the context of CPUs, it is an option to vectorize across elements [
62,
63], i.e., perform
\((\mathcal {S}^T \circ \mathcal {Q} \circ \mathcal {S})_e\) for multiple cells in different lanes of the same instructions. This requires the data to be laid out in struct-of-arrays fashion. The reshuffling of the data from array-of-structs format to struct-of-arrays format and back can be done, e.g., by
\(\mathcal {G}_e\), while looping through all elements [
62]. Note that our implementation performs
\(\mathcal {C}_e \circ \mathcal {G}_e \boldsymbol {x}\) for a single SIMD batch of cells at a time to keep intermediate results in caches and reduce global memory traffic. For the application of hanging-node constraints, we use the special-purpose algorithm presented in References [
69,
81], which is based on updating the DoF map
\(\mathcal {G}_e\) and applying in-place sum factorization for the interpolation during the application of edge and face constraints. Even though this algorithm is highly optimized and comes with only a small overhead for high-order finite elements (
\(\lt 20\%\) per cell with hanging nodes), the additional steps are not free, particularly for linear elements. This can lead to load imbalances in a parallel setting when some processes own a disproportionately high number of cells with hanging nodes; see the quantitative analysis in Reference [
81].
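Sum factorization itself can be illustrated with a short, self-contained sketch (simplified in plain C++, not the vectorized deal.II kernels): the full interpolation of \((p+1)^2\) nodal values to \(q^2\) quadrature points in 2D is the Kronecker product of a 1D matrix with itself, so it can be applied as two 1D sweeps; the function name `interpolate_2d` and the row-major layout are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of sum factorization in 2D: interpolate n^2 = (p+1)^2
// nodal values u to q^2 quadrature points using only the 1D interpolation
// matrix S1d (q x n, row-major). The cost drops from O(n^2 q^2) for the
// full Kronecker-product matrix to O(n q (n + q)) per cell.
std::vector<double> interpolate_2d(const std::vector<double> &S1d,
                                   std::size_t q, std::size_t n,
                                   const std::vector<double> &u)
{
  // sweep in x: tmp(qx, jy) = sum_jx S1d(qx, jx) u(jx, jy)
  std::vector<double> tmp(q * n, 0.0);
  for (std::size_t jy = 0; jy < n; ++jy)
    for (std::size_t qx = 0; qx < q; ++qx)
      for (std::size_t jx = 0; jx < n; ++jx)
        tmp[qx + q * jy] += S1d[qx * n + jx] * u[jx + n * jy];
  // sweep in y: out(qx, qy) = sum_jy S1d(qy, jy) tmp(qx, jy)
  std::vector<double> out(q * q, 0.0);
  for (std::size_t qy = 0; qy < q; ++qy)
    for (std::size_t qx = 0; qx < q; ++qx)
      for (std::size_t jy = 0; jy < n; ++jy)
        out[qx + q * qy] += S1d[qy * n + jy] * tmp[qx + q * jy];
  return out;
}
```

The same one-dimensional sweeps, with the 1D differentiation matrix in one direction, also yield gradients at quadrature points.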
Despite relying on matrix-free algorithms overall, some of the multigrid ingredients (e.g., smoother and coarse-grid solver) need an explicit representation of the linear operator in the form of a matrix or a part of it (e.g., its diagonal). Based on the matrix-free algorithm (
3) applied to the local unit vector
\(e_i\), the local matrix can be computed column by column:
\(\boldsymbol {A}_e(:,i) = \big(\tilde{\mathcal {S}}_e^T \circ \mathcal {Q}_e \circ \mathcal {S}_e\big)\, \boldsymbol {e}_i. \qquad (4)\)
The resulting element matrix can be assembled as usual, including the resolution of the constraints. Computing the diagonal is slightly more involved when writing directly into a diagonal vector without complete matrix assembly. In our code, we choose to compute the jth entry of the locally relevant diagonal contribution via
\(d_e(j) = \big(\boldsymbol {C}_e^T \boldsymbol {A}_e \boldsymbol {C}_e\big)(j,j) = \sum _i \boldsymbol {C}_e(i,j)\, \big(\boldsymbol {A}_e \boldsymbol {C}_e \boldsymbol {e}_j\big)(i), \qquad (5)\)
i.e., we apply \(\boldsymbol {A}_e\) (column by column via Equation (4)) to the jth column of the local constraint matrix \(\boldsymbol {C}_e\) and contract the result with that same column. This approach needs as many basis vector applications as there are shape functions per cell. The local result can simply be added to the global diagonal via
\(d = \sum _e \mathcal {G}^T_e d_e\). For cells without constrained DoFs, Equation (
5) simplifies to
\(d_e(i)=\boldsymbol {A}_e(i,i)\).
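The diagonal computation can be sketched as follows. This is an illustrative routine, not the deal.II code; it assumes the convention that the constrained element contribution is \(\boldsymbol {C}_e^T \boldsymbol {A}_e \boldsymbol {C}_e\), and the names `element_diagonal` and `apply_Ae` (a stand-in for the matrix-free element kernel) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: compute the locally relevant diagonal
// d_e(j) = (C_e^T A_e C_e)(j,j) with one operator application per local
// basis vector, without ever forming C_e^T A_e C_e explicitly.
// apply_Ae stands in for the matrix-free element kernel.
std::vector<double> element_diagonal(
  std::size_t n,
  const std::function<std::vector<double>(const std::vector<double> &)>
    &apply_Ae,
  const std::vector<double> &Ce) // n x n constraint matrix, row-major
{
  std::vector<double> d(n, 0.0);
  for (std::size_t j = 0; j < n; ++j)
    {
      // jth column of C_e, i.e., C_e e_j
      std::vector<double> col(n);
      for (std::size_t i = 0; i < n; ++i)
        col[i] = Ce[i * n + j];
      // one element-kernel application: A_e C_e e_j
      const std::vector<double> w = apply_Ae(col);
      // contract with the same column: (C_e^T w)(j)
      for (std::size_t i = 0; i < n; ++i)
        d[j] += Ce[i * n + j] * w[i];
    }
  return d;
}
```

For an unconstrained cell, `Ce` is the identity and the routine reduces to extracting the diagonal of the element matrix, as stated above.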
3.5 Transfer Operator
The prolongation operator
\(\mathcal {P}^{(f,c)}\) prolongates the result
\(\boldsymbol {x}\) from a coarse space
c to a fine space
f (between either different geometric grids or different polynomial degrees),
\(\boldsymbol {x}^{(f)} = \mathcal {P}^{(f,c)}\, \boldsymbol {x}^{(c)}.\)
According to the literature [
14,
99], this can be done in three steps:
\(\mathcal {P}^{(f,c)} = \mathcal {W}^{(f)} \circ \tilde{\mathcal {P}}^{(f,c)} \circ \mathcal {C}^{(c)}, \qquad (6)\)
with
\(\mathcal {C}^{(c)}\) setting the values of constrained DoFs on the coarse mesh, particularly resolving the hanging-node constraints,
\(\tilde{\mathcal {P}}^{(f,c)}\) performing the prolongation on the discontinuous space as if no hanging nodes were present, and the weighting operator
\(\mathcal {W}^{(f)}\) zeroing out the DoFs constrained on the fine mesh.
To derive a matrix-free implementation, one can express Equation (
6) for nested meshes as loops over all (coarse) cells (see also Figure
3):
\(\boldsymbol {x}^{(f)} = \sum _e \mathcal {S}^{(f)}_e \circ \mathcal {W}^{(f)}_e \circ \mathcal {P}^{(f,c)}_e \circ \mathcal {C}^{(c)}_e \circ \mathcal {G}^{(c)}_e\, \boldsymbol {x}^{(c)}. \qquad (7)\)
Here,
\(\mathcal {C}^{(c)}_e \circ \mathcal {G}^{(c)}_e\) gathers the cell-relevant coarse DoFs and applies the constraints just as in the case of the matrix-free operator evaluation (
3). The operator
\(\mathcal {P}^{(f,c)}_e\) embeds the coarse-space solution into the fine space, whereas
\(\mathcal {S}^{(f)}_e\) sums the result back to a global vector. Since multiple cells could add to the same global entry of the vector
\(\boldsymbol {x}^{(f)}\) during the cell loop, the values to be added have to be weighted with the inverse of the valence of the corresponding DoF. This is done by
\(\mathcal {W}^{(f)}_e\), which also ignores constrained DoFs (zero valence) to be consistent with Equation (
6). Figure
4 shows, as an example, the values of
\(\mathcal {W}^{(f)}\) for a simple mesh for a scalar Lagrange element of degree
\(1\le p \le 3\).
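The cell-wise prolongation loop with valence weighting can be illustrated by a minimal 1D sketch in plain C++ (linear elements, uniform refinement, so no hanging-node constraints and \(\mathcal {C}_e\) equal to the identity); the function name `prolongate` and the hard-coded element prolongation matrix are assumptions of this sketch, not the library implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1D sketch of the cell-wise prolongation loop
// x_f = sum_e S_e W_e P_e G_e x_c for linear elements: xc.size()-1 coarse
// cells, each refined into two fine cells (2*n_cells+1 fine DoFs).
std::vector<double> prolongate(const std::vector<double> &xc)
{
  const std::size_t n_cells = xc.size() - 1;    // coarse cells
  std::vector<double> xf(2 * n_cells + 1, 0.0); // fine DoFs
  // W: interior coarse vertices are touched by two coarse cells
  // (valence 2), so their additions are weighted by 1/2
  std::vector<double> w(xf.size(), 1.0);
  for (std::size_t v = 1; v < n_cells; ++v)
    w[2 * v] = 0.5;
  for (std::size_t e = 0; e < n_cells; ++e)
    {
      const double c0 = xc[e], c1 = xc[e + 1]; // G_e: gather coarse DoFs
      // P_e = [[1,0],[0.5,0.5],[0,1]]: embed into the two child cells
      const double f[3] = {c0, 0.5 * (c0 + c1), c1};
      for (std::size_t k = 0; k < 3; ++k) // W_e and S_e: weight, scatter
        xf[2 * e + k] += w[2 * e + k] * f[k];
    }
  return xf;
}
```

Since the coarse space is nested in the fine space, prolongating nodal values of a linear function reproduces it exactly on the fine mesh, which serves as a basic correctness check.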
We construct the element prolongation matrices as
\(\phi ^{(c)}_j = \sum _i \mathcal {P}^{(f,c)}_e(i,j)\, \phi ^{(f)}_i,\)
where
\(\phi ^{(c)}\) are the shape functions on the coarse cell and
\(\phi ^{(f)}\) on the fine cell. Instead of treating each “fine cell” on its own, we group direct children of the coarse cells together and define the fine shape functions on the appropriate subregions. As a consequence, the “finite element” on the fine mesh depends both on the actual finite-element type (such as the polynomial degree and continuity) and on the refinement type, as indicated in Figure
5. For local smoothing, there is only one refinement configuration, since all cells on a finer level are children of cells on the coarser level. In the case of polynomial global coarsening, the mesh stays the same and only the polynomial degree is uniformly changed for all cells. In the case of geometric global coarsening, cells on a finer level are either identical to the cells or direct children of cells on the coarser level, but the element and its polynomial degree stay the same. This necessitates two prolongation cases: an identity matrix for non-refined cells and the coarse-to-fine embedding. We define the set of all coarse–fine cell pairs connected via the same element prolongation matrix as category
\(\mathcal {C}\).
Since \(\mathcal {P}^{(f,c)}_e=\mathcal {P}^{(f,c)}_{\mathcal {C}(e)}\), i.e., all cells of the same category \(\mathcal {C}(e)\) have the same element prolongation matrix,
and to be able to apply them for multiple elements in one go in a vectorization-over-elements fashion [
62] as in the case of matrix-free loops (
3), we loop over the cells category by category so that Equation (
7) becomes
\(\boldsymbol {x}^{(f)} = \sum _{\mathcal {C}} \sum _{e \in \mathcal {C}} \mathcal {S}^{(f)}_e \circ \mathcal {W}^{(f)}_e \circ \mathcal {P}^{(f,c)}_{\mathcal {C}} \circ \mathcal {C}^{(c)}_e \circ \mathcal {G}^{(c)}_e\, \boldsymbol {x}^{(c)}. \qquad (8)\)
We choose the restriction operator as the transpose of the prolongation operator,
\(\mathcal {R}^{(c,f)} = \big(\mathcal {P}^{(f,c)}\big)^{T}.\)
We conclude this subsection by discussing the appropriate data structures for transfer operators, beginning with the ones needed for both global geometric and polynomial coarsening. Since global-coarsening algorithms smooth on the complete computational domain, data structures only need to be able to perform two-level transfers (
8) independently between arbitrary fine (f) and coarse (c) grids.
\(\mathcal {C}^{(c)}_e \circ \mathcal {G}^{(c)}_e\) is identical to
\(\mathcal {C}_e \circ \mathcal {G}_e\) in the matrix-free loop (
3) so that specialized algorithms and data structures [
81] can be applied and reused.
\(\mathcal {S}^{(f)}_e\) needs the indices of DoFs for the given element
e to be able to scatter the values, and
\(\mathcal {W}^{(f)}_e\) stores the weights of DoFs (or of geometric entities—see also Figure
4) for the given element
e.
\(\mathcal {P}^{(f,c)}_{\mathcal {C}(e)}\) (and
\(\mathcal {R}^{(c,f)}_{\mathcal {C}(e)}\)) need to be available for each category. We regard them as general operators and choose the most efficient way of evaluation based on the element types: a simple dense-matrix representation versus exploiting substructure with sum factorization based on smaller 1D prolongation matrices. In addition, the category of each cell has to be stored.
Alongside these process-local data structures, one needs access to all constraining DoFs on the coarse level required during
\(\mathcal {C}^{(c)}_e \circ \mathcal {G}^{(c)}_e\) and to the DoFs of all child cells on the fine level during the process-local part of
\(\mathcal {S}^{(f)}_e\), which is concluded by a data exchange. If external vectors do not allow access to the required DoFs, then we copy data to/from internal temporary global vectors with appropriate ghosting [
62]. We choose the coarse-side identification of cells due to its implementation simplicity and structured data access at the price of more ghost transfer.
For setting up the communication pattern, we use consensus-based sparse dynamic algorithms [
6,
47].
For the sake of separation of concerns, one might create three classes to implement a global-coarsening transfer operator as we have done in
deal.II. The relation of these classes is shown in Figure
6: The multigrid transfer class (
MGTransferGlobalCoarsening) delegates the actual transfer tasks to the right two-level implementation (
MGTwoLevelTransfer), which performs the required communication and evaluates (
8) for each category and cell by using category-specific information from the third class (
MGTransferSchemes).
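The described separation of concerns can be mirrored by a deliberately simplified class sketch (these are not the deal.II interfaces; all class and function names here are hypothetical, and the setting is again 1D linear elements under uniform refinement without constraints): a per-category scheme, a two-level transfer that loops over its schemes, and a global transfer that delegates level pairs to the two-level objects.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-category data: here only the list of coarse cells
// using this scheme (a real scheme would also store the element
// prolongation matrix and index data).
struct TransferScheme
{
  std::vector<std::size_t> cells;
};

// Hypothetical two-level transfer: loops over its schemes and cells.
class TwoLevelTransfer
{
public:
  explicit TwoLevelTransfer(std::size_t n_coarse_cells)
    : n_cells(n_coarse_cells)
  {
    TransferScheme scheme; // single category in this simplified setting
    for (std::size_t e = 0; e < n_coarse_cells; ++e)
      scheme.cells.push_back(e);
    schemes.push_back(scheme);
  }

  std::vector<double> prolongate(const std::vector<double> &xc) const
  {
    std::vector<double> xf(2 * n_cells + 1, 0.0);
    std::vector<double> w(xf.size(), 1.0);
    for (std::size_t v = 1; v < n_cells; ++v)
      w[2 * v] = 0.5; // interior coarse vertices have valence 2
    for (const TransferScheme &scheme : schemes)
      for (const std::size_t e : scheme.cells)
        {
          const double c0 = xc[e], c1 = xc[e + 1];
          const double f[3] = {c0, 0.5 * (c0 + c1), c1};
          for (std::size_t k = 0; k < 3; ++k)
            xf[2 * e + k] += w[2 * e + k] * f[k];
        }
    return xf;
  }

private:
  std::vector<TransferScheme> schemes;
  std::size_t n_cells;
};

// Hypothetical multigrid transfer: delegates to two-level objects,
// ordered from the coarsest level pair to the finest.
class GlobalTransfer
{
public:
  void add_level(std::size_t n_coarse_cells)
  {
    transfers.emplace_back(n_coarse_cells);
  }

  std::vector<double> prolongate_to_finest(std::vector<double> x) const
  {
    for (const TwoLevelTransfer &t : transfers)
      x = t.prolongate(x);
    return x;
  }

private:
  std::vector<TwoLevelTransfer> transfers;
};
```

The point of the layout is that the outer class never touches element data: each level pair is an independent two-level object, which in turn only iterates over its category-specific schemes.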
The discussed data structures are also applicable to local smoothing. The fact that only a single category is available in this case allows us to simplify many code paths. However, in addition to the two-level data structures, one needs data structures that match DoF indices between the active level and all multigrid levels that share a part of the mesh with the active level. For further details on implementation aspects of transfer operators for local smoothing, see References [
28,
64] and the documentation of the
MGTransferMatrixFree class in
deal.II.