1 Introduction

In the standard setting, linear systems

$$\begin{aligned} Ax=b \end{aligned}$$
(1.1)

are described by a regular matrix \(A\in \mathbb {R}^{n\times n}\) and vectors \(x,b\in \mathbb {R}^{n}\). Often we are interested in large-scale problems with n about 100,000 to 100,000,000. In that case, vectors and matrices are represented differently. Vectors of this size can be described explicitly in the full format, i.e., each entry is stored. Matrices of the size \(n\times n\) must be represented by fewer than \(n^{2}\) units. In the fortunate case of sparse matrices, the sparse format is an easy remedy. For fully populated matrices one can use, e.g., the technique of hierarchical matrices, resulting in a storage requirement of \(\mathcal {O}(n\log ^{*}n)\) (cf. [14, 17]).

The situation changes if n becomes much larger. A concrete example is the Lyapunov matrix equation. Consider a parabolic problem \(y_{t}=Ly+Bu\) with the elliptic operator L discretised by a matrix \(A\in \mathbb {R}^{n\times n}\) as considered above. The right-hand side contains the control variable u. Linear-quadratic optimal control problems lead to a description of the control involving the solution \(X\in \mathbb {R}^{n\times n}\) of a quadratic matrix Riccati equation (cf. Grasedyck–Hackbusch–Khoromskij [11]). For simplicity we consider its linearisation, the linear Lyapunov equation

$$\begin{aligned} AX+XA=B\qquad (B\in \mathbb {R}^{n\times n}\text { given, }X\in \mathbb {R}^{n\times n}\text { sought}) \end{aligned}$$
(1.2)

(cf. Benner–Breiten [3]). Even if A and B are sparse, the solution X is a fully populated matrix. Equation (1.2) is an unusual notation for a linear system with \(n^{2}\) equations involving a matrix \(\mathbf {A}\in \mathbb {R}^{n^{2}\times n^{2}}\):

$$\begin{aligned} \mathbf {Ax}=\mathbf {b},\quad \mathbf {x},\mathbf {b}\in \mathbb {R}^{n^{2} },\quad \mathbf {A}\in \mathbb {R}^{n^{2}\times n^{2}}. \end{aligned}$$
(1.3)

Now the full representation of the vector \(\mathbf {x}\in \mathbb {R}^{n^{2}}\) is as problematic as the full representation of the fully populated matrix \(X\in \mathbb {R}^{n\times n}.\) This shows that for much larger sizes \(N:=n^{2}\) the treatment of a linear system (1.1) must be changed substantially.

Concerning the spatial dimension, we add the following remark. Let L be a differential operator of a 3D problem, while A is the discrete version of L. Then the matrix \(\mathbf {A}\) in (1.3) discretises a six-dimensional pde. The matrix \(\mathbf {A}\) can be defined by means of the tensor product (here also called Kronecker product):

$$\begin{aligned} \mathbf {A}=I\otimes A+A\otimes I\,. \end{aligned}$$
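For illustration, a minimal Python sketch (assuming a small n so that the \(n^{2}\times n^{2}\) matrix can be formed explicitly, and a symmetric A) of the statement that \(\mathbf {A}=I\otimes A+A\otimes I\) is the matrix of the Lyapunov equation (1.2) once X is stacked into a vector:

```python
# Minimal sketch: for symmetric A, the Kronecker sum  I (x) A + A (x) I
# applied to the stacked matrix X reproduces the Lyapunov left-hand side.
import numpy as np

n = 4
# 1D negative second divided differences (symmetric, positive definite)
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
I = np.eye(n)

bigA = np.kron(I, A) + np.kron(A, I)              # size n^2 x n^2

X = np.random.default_rng(0).standard_normal((n, n))
lhs_matrix_form = A @ X + X @ A                   # Lyapunov left-hand side
lhs_kron_form = (bigA @ X.ravel()).reshape(n, n)  # Kronecker form acting on vec(X)

print(np.allclose(lhs_matrix_form, lhs_kron_form))   # True
```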

2 Tensors

The tensor space \(\mathbf {V}:=\bigotimes _{j=1}^{d}V_{j}\) exists for all vector spaces \(V_{j}\) (details in Hackbusch [15, 16]). In the following, we discuss the tensor space for particular vector spaces \(V_{j}\).

2.1 Grid functions (\(V_{j}=\mathbb {R}^{n_{j}}\))

Consider a regular grid in d spatial dimensions. More precisely, let \(\omega _{j}=\{x_{1}^{(j)},\ldots ,x_{n_{j}}^{(j)}\}\) be a regular grid consisting of \(n_{j}\) nodal points and define the product grid \(\omega =\omega _{1}\times \omega _{2}\times \cdots \times \omega _{d}.\) The nodal points \(\mathbf {x}\in \omega \) are indexed by a d-tuple \(\left( i_{1} ,\ldots ,i_{d}\right) \) with \(1\le i_{j}\le n_{j}\): \(\mathbf {x}[i_{1} ,\ldots ,i_{d}]= \big ( x_{i_{1}}^{(1)},\ldots ,x_{i_{d}}^{(d)} \big ).\) Grid functions u defined on \(\omega \) are characterised by the nodal values \(u[i_{1},\ldots ,i_{d}],\) i.e., u belongs to \(\mathbb {R}^{N}\) with \(N:=\#\omega =\prod _{j=1}^{d}n_{j}.\) The d-dimensional structure is described by the tensor product \(\mathbb {R}^{n_{1}}\otimes \mathbb {R}^{n_{2}} \otimes \cdots \otimes \mathbb {R}^{n_{d}}\) which is isomorphic to \(\mathbb {R} ^{N}\):

$$\begin{aligned} \mathbb {R}^{N}\simeq \mathbb {R}^{n_{1}}\otimes \mathbb {R}^{n_{2}}\otimes \cdots \otimes \mathbb {R}^{n_{d}}. \end{aligned}$$

The tensor product of vectors \(v^{(j)}\in \mathbb {R}^{n_{j}}\) is defined by

$$\begin{aligned} \left( \bigotimes _{j=1}^{d}v^{(j)}\right) \left[ i_{1},i_{2},\ldots ,i_{d}\right] :=\prod _{j=1}^{d}v^{(j)}[i_{j}]. \end{aligned}$$

Usually, the dimension \(N:=n^{d}\) (for \(n_{j}=n\)) exceeds the available storage by far.
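The following small Python sketch illustrates the definition and the storage gap between the factors (dn numbers) and the full array (\(n^{d}\) numbers); the index tuple used below is arbitrary:

```python
# Sketch of an elementary tensor: entry [i1,...,id] is the product of the
# factor entries v^(j)[i_j]; the factors need d*n numbers, the full array n^d.
import numpy as np
from functools import reduce

d, n = 4, 10
rng = np.random.default_rng(1)
factors = [rng.standard_normal(n) for _ in range(d)]

full = reduce(np.multiply.outer, factors)        # full array of shape (n,)*d

i = (3, 1, 4, 1)
print(np.isclose(full[i], np.prod([v[k] for v, k in zip(factors, i)])))  # True
print(sum(v.size for v in factors), full.size)   # 40 versus 10000
```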

2.2 Functions (\(V_{j}=L^{2}(\Omega _{j})\))

Also functions f defined on a Cartesian product \(\Omega =\Omega _{1} \times \Omega _{2}\times \cdots \times \Omega _{d}\) are tensors. For instance, \(L^{2}(\Omega )\) can be considered as the tensor space

$$\begin{aligned} L^{2}(\Omega )=L^{2}(\Omega _{1})\otimes L^{2}(\Omega _{2})\otimes \cdots \otimes L^{2}(\Omega _{d}). \end{aligned}$$

The tensor product of univariate functions is defined by

$$\begin{aligned} \left( \bigotimes _{j=1}^{d}f_{j}\right) \left( x_{1},x_{2},\ldots ,x_{d}\right) :=\prod _{j=1}^{d}f_{j}(x_{j}). \end{aligned}$$
(2.1)

Differently from the finite-dimensional case, the (topological) tensor space depends on the chosen norm.

2.3 Matrices (\(V_{j}=\mathbb {R}^{n_{j}\times m_{j}}\))

A tensor space can also be generated by matrices, since the set of matrices in \(\mathbb {R}^{n_{j}\times m_{j}}\) is a particular vector space. Let \(A_{j}:V_{j}\rightarrow W_{j}\) (\(1\le j\le d\)) be linear mappings and form the tensor spaces \(\mathbf {V}=V_{1}\otimes V_{2}\otimes \cdots \otimes V_{d}\) and \(\mathbf {W}=W_{1}\otimes W_{2}\otimes \cdots \otimes W_{d}\). The tensor product (Kronecker product)

$$\begin{aligned} \mathbf {A}=A_{1}\otimes A_{2}\otimes \cdots \otimes A_{d} \end{aligned}$$

can be interpreted as the linear map \(\mathbf {A}:\mathbf {V}\rightarrow \mathbf {W}\) defined by

$$\begin{aligned} \mathbf {A}:v^{(1)}\otimes v^{(2)}\otimes \cdots \otimes v^{(d)}\mapsto A_{1}v^{(1)}\otimes A_{2}v^{(2)}\otimes \cdots \otimes A_{d}v^{(d)} \end{aligned}$$
(2.2)

for all \(v^{(j)}\in V_{j}.\) Using matrices \(A_{j}\in \mathbb {R}^{n\times n}\) of moderate size, one can construct huge matrices \(\mathbf {A}\in \mathbb {R} ^{n^{d}\times n^{d}}\).
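Property (2.2) can be checked numerically; the following sketch (small sizes, random data) verifies it for d = 2:

```python
# Sketch verifying (2.2): the Kronecker product A1 (x) A2 maps the elementary
# tensor v1 (x) v2 to (A1 v1) (x) (A2 v2).
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 3, 5
A1, A2 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2))
v1, v2 = rng.standard_normal(n1), rng.standard_normal(n2)

lhs = np.kron(A1, A2) @ np.kron(v1, v2)   # action of the huge matrix
rhs = np.kron(A1 @ v1, A2 @ v2)           # factorwise action
print(np.allclose(lhs, rhs))              # True
```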

2.4 r-Term format (Canonical format)

By definition, any algebraic tensor \(\mathbf {v}\in \mathbf {V}=V_{1}\otimes V_{2}\otimes \cdots \otimes V_{d}\) has a representation

$$\begin{aligned} \mathbf {v}=\sum \limits _{\nu =1}^{r}v_{\nu }^{(1)}\otimes v_{\nu }^{(2)} \otimes \cdots \otimes v_{\nu }^{(d)} \end{aligned}$$
(2.3)

with \(v_{\nu }^{(j)}\in V_{j}\) and suitable \(r\in \mathbb {N}\). Fixing some \(r\in \mathbb {N}\), we set

$$\begin{aligned} \mathcal {R}_{r}:=\left\{ \sum \limits _{\nu =1}^{r}v_{\nu }^{(1)}\otimes v_{\nu }^{(2)}\otimes \cdots \otimes v_{\nu }^{(d)}:v_{\nu }^{(j)}\in V_{j}\right\} . \end{aligned}$$

The number r is called the representation rank of \(\mathbf {v}\in \mathcal {R}_{r}.\) Obviously, the storage size for any \(\mathbf {v}\in \mathcal {R}_{r}\) is bounded by rdn, where \(n=\max _{j}\dim V_{j}\). If r is of moderate size, this format is advantageous. Often, a tensor \(\mathbf {v}\) requires a high rank, but there is some approximation \(\mathbf {v}_{\varepsilon }\in \mathcal {R}_{r}\) with moderate \(r=r(\varepsilon ).\) An example will follow in Sect. 3.1.

Note that (2.3) holds for all vector spaces \(V_{j}.\) In the case of matrix spaces, the r-term representation of a matrix reads as

$$\begin{aligned} \mathbf {A}=\sum \limits _{\nu =1}^{r}\bigotimes \limits _{j=1}^{d}A_{\nu } ^{(j)},\qquad A_{\nu }^{(j)}\in \mathbb {R}^{n_{j}\times m_{j}}. \end{aligned}$$
(2.4)
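For the later sketches, the r-term format is represented in Python in the simplest possible way: a tensor is a list of r terms, each term a list of d factor vectors (an illustrative data structure, not taken from the paper). The storage is rdn instead of \(n^{d}\):

```python
# Sketch of the r-term (canonical) format as a list of terms; each term is a
# list of d factor vectors.  Helper names are illustrative only.
import numpy as np
from functools import reduce

def cp_entry(terms, idx):
    """Entry v[i1,...,id] of the r-term tensor `terms`."""
    return sum(np.prod([v[i] for v, i in zip(term, idx)]) for term in terms)

def cp_to_full(terms):
    """Full array; only feasible for tiny examples."""
    return sum(reduce(np.multiply.outer, term) for term in terms)

rng = np.random.default_rng(3)
r, d, n = 2, 3, 4
v = [[rng.standard_normal(n) for _ in range(d)] for _ in range(r)]
print(np.isclose(cp_entry(v, (0, 1, 2)), cp_to_full(v)[0, 1, 2]))   # True
```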

2.5 Numerical tensor calculus

The numerical treatment of tensors is called ‘numerical tensor calculus’. Besides the representation and storage of tensors, the efficient performance of the tensor operations is required. Since the tensor space \(\mathbf {V}\) is a vector space, the addition is fundamental:

$$\begin{aligned} \mathbf {v}+\mathbf {w},\quad \mathbf {A}+\mathbf {B}. \end{aligned}$$
(2.5a)

Using operands of the form (2.3) or (2.4), the result has the same structure with increased representation rank. Note that this implementation of the addition requires no computational work.
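In the list-of-terms representation of the r-term format (Sect. 2.4), this addition is mere concatenation, as the following one-line sketch shows:

```python
# Addition in the r-term format: concatenate the term lists.  The
# representation rank becomes r_v + r_w; no floating-point operation is needed.
def cp_add(v_terms, w_terms):
    return v_terms + w_terms
```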

Let \(\left\langle \cdot ,\cdot \right\rangle _{j}\) be the scalar product in \(V_{j}.\) Then the canonical scalar product \(\left\langle \cdot ,\cdot \right\rangle \) in \(\mathbf {V}=\bigotimes _{j=1}^{d}V_{j}\) is defined via

$$\begin{aligned} \left\langle \left( \bigotimes \limits _{j=1}^{d}v^{(j)}\right) ,\left( \bigotimes \limits _{j=1}^{d}w^{(j)}\right) \right\rangle =\prod \limits _{j=1} ^{d}\left\langle v^{(j)},w^{(j)}\right\rangle _{j}. \end{aligned}$$
(2.5b)

For instance, the Euclidean scalar products \(\left\langle \cdot ,\cdot \right\rangle _{j}\) lead to the Euclidean scalar product \(\left\langle \mathbf {v},\mathbf {w}\right\rangle =\sum _{\mathbf {i}}\mathbf {v[i]w[i]}\) in \(\mathbf {V}\). Using the tensor constructions for matrices and vectors, the matrix-vector multiplication is defined by

$$\begin{aligned} \left( \bigotimes \limits _{j=1}^{d}A^{(j)}\right) \left( \bigotimes \limits _{j=1}^{d}v^{(j)}\right) =\bigotimes \limits _{j=1}^{d}A^{(j)} v^{(j)} \end{aligned}$$
(2.5c)

(cf. (2.2)). The formulae (2.5b, 2.5c) show that the tensor operation applied to elementary tensors can easily be computed by operations related to the simpler vector spaces \(V_{j}\). The same statement holds for further operations such as the Hadamard product (entrywise product) and the convolution \(\mathbf {u}=\mathbf {v}\star \mathbf {w}\) with \(\mathbf {u} _{\mathbf {i}}=\sum _{\mathbf {0}\le \mathbf {k}\le \mathbf {i}}\mathbf {v} _{\mathbf {i}-\mathbf {k}}\mathbf {w}_{\mathbf {k}}\).
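A sketch of the scalar product (2.5b) for two tensors in the r-term format (list-of-terms representation from Sect. 2.4): the computation reduces to \(r_{v}r_{w}d\) scalar products in the spaces \(V_{j}\).

```python
# Sketch of the canonical scalar product of two r-term tensors:
# <sum_nu (x)_j v_nu^(j), sum_mu (x)_j w_mu^(j)> = sum_nu sum_mu prod_j <v_nu^(j), w_mu^(j)>.
import numpy as np

def cp_inner(v_terms, w_terms):
    return sum(np.prod([x @ y for x, y in zip(tv, tw)])
               for tv in v_terms for tw in w_terms)
```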

2.6 Truncation

The representation rank r is the critical quantity in the storage size rdn. Unfortunately, all operations have the property that their result has a (much) larger representation rank than their operands. As an example we consider the matrix-vector multiplication, where the matrix \(\mathbf {A}\) has the representation rank \(r_{A}\), while the vector \(\mathbf {v}\) uses the representation rank \(r_{v}\):

$$\begin{aligned} \mathbf {w}:=\mathbf {Av}=\left( \sum _{\nu =1}^{r_{A}}\bigotimes \limits _{j=1}^{d}A_{\nu }^{(j)}\right) \left( \sum _{\mu =1}^{r_{v}}\bigotimes \limits _{j=1}^{d}v_{\mu }^{(j)}\right) =\sum _{\nu =1}^{r_{A}}\sum _{\mu =1}^{r_{v}}\bigotimes \limits _{j=1}^{d}A_{\nu }^{(j)}v_{\mu }^{(j)}. \end{aligned}$$

Obviously, the product \(\mathbf {w}\) has the enlarged representation rank \(r:=r_{A}r_{v}\). As the operations are repeated, the calculations become more and more costly. A remedy is the rank truncation: the result \(\mathbf {w}\) has to be approximated by some \(\mathbf {w}^{\prime }\) with a much smaller rank \(r^{\prime }\). This truncation should be controlled, i.e., we should be able to estimate the truncation error.
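A sketch of this matrix-vector product in the list-of-terms representation (Sect. 2.4); the nested loop makes the rank growth \(r_{A}r_{v}\) explicit:

```python
# Sketch of the matrix-vector product in r-term format: every pair of terms is
# handled factorwise, so the result has representation rank r_A * r_v.
def cp_matvec(A_terms, v_terms):
    """A_terms[nu][j] = A_nu^(j) (matrices), v_terms[mu][j] = v_mu^(j) (vectors)."""
    return [[Aj @ vj for Aj, vj in zip(tA, tv)]
            for tA in A_terms for tv in v_terms]
```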

Unfortunately, truncation within the format \(\mathcal {R}_{r}\) is a difficult procedure. Moreover, instabilities may occur. The reason is that \(\mathcal {R}_{r}\) is not closed and any approximation of \(\mathbf {v} \in \overline{\mathcal {R}_{r}}\backslash \mathcal {R}_{r}\) by a sequence \(\mathcal {R}_{r}\ni \mathbf {v}_{n}\rightarrow \mathbf {v}\) leads to numerical cancellation problems (cf. [15, Sect. 9.4.3]). Because of these shortcomings one is looking for other tensor representations.

In spite of the negative statement from above, we have to emphasise that the format \(\mathcal {R}_{r}\) is perfect if, by some method, we are able to find a suitable r-term representation. Here ‘suitable’ means stable and with moderate representation rank r. In (3.1) we shall see such an example.

2.7 Other tensor representations

An alternative traditional representation is the subspace representation or Tucker format (cf. Tucker [30]). Instead of a scalar representation rank we have a d-tuple \(\mathbf {r}=\left( r_{1},\ldots ,r_{d}\right) \) of integers. In each vector space \(V_{j}\) we choose \(r_{j}\) vectors (usually, linearly independent ones):

$$\begin{aligned} b_{1}^{(j)},b_{2}^{(j)},\ldots ,b_{r_{j}}^{(j)}\in V_{j} \end{aligned}$$
(2.6)

with the intention that \(r_{j}\ll n_{j}:=\dim (V_{j}).\) Then we try to represent or approximate a tensor by

$$\begin{aligned} \mathbf {v}=\sum \limits _{1\le \ell _{1}\le r_{1}}\ldots \sum \limits _{1\le \ell _{d}\le r_{d}}\mathbf {a}[\ell _{1},\ldots ,\ell _{d}]\,b_{\ell _{1}}^{(1)}\otimes \cdots \otimes b_{\ell _{d}}^{(d)}. \end{aligned}$$
(2.7)

This approach was originally introduced for the case of \(d=3\). For large d, this approach is rather costly since the involved core tensor \(\mathbf {a}\in \mathbb {R}^{r_{1}}\otimes \mathbb {R}^{r_{2}}\otimes \cdots \otimes \mathbb {R}^{r_{d}}\) requires the storage size \(\prod _{j=1}^{d}r_{j}.\) On the other hand, this format has very favourable properties. The basis (2.6) can be chosen orthonormal and, using the so-called matricisation, the core tensor \(\mathbf {a}\) can be isomorphically mapped into a matrix. This allows us to use matrix techniques like the singular value decomposition. The arising ‘higher order SVD’ (HOSVD, cf. [6], [15, Sect. 8.3]) is a perfect tool for the truncation as required in Sect. 2.6.
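For small full arrays, the HOSVD truncation can be sketched with the ordinary SVD applied to the matricisations; the following Python sketch (illustrative function names; the full array is assumed to fit into memory) computes orthonormal bases (2.6) and the core tensor of (2.7):

```python
# Sketch of an HOSVD truncation to the Tucker format for a small full array.
import numpy as np

def mode_mult(t, M, mode):
    """Multiply tensor t along axis `mode` by the matrix M (M.shape[1] == t.shape[mode])."""
    return np.moveaxis(np.tensordot(M, t, axes=([1], [mode])), 0, mode)

def hosvd(a, ranks):
    """Truncate the full array `a` to the Tucker ranks `ranks`."""
    bases = []
    for j in range(a.ndim):
        unfold = np.moveaxis(a, j, 0).reshape(a.shape[j], -1)  # mode-j matricisation
        U, _, _ = np.linalg.svd(unfold, full_matrices=False)
        bases.append(U[:, :ranks[j]])                          # orthonormal basis (2.6)
    core = a
    for j, U in enumerate(bases):
        core = mode_mult(core, U.T, j)                         # project onto the bases
    return core, bases

def tucker_to_full(core, bases):
    full = core
    for j, U in enumerate(bases):
        full = mode_mult(full, U, j)
    return full

a = np.random.default_rng(4).standard_normal((8, 8, 8))
core, bases = hosvd(a, (4, 4, 4))
print(np.linalg.norm(a - tucker_to_full(core, bases)) / np.linalg.norm(a))  # relative error
```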

A more recent format is the hierarchical format (cf. Hackbusch–Kühn [19] and [15, Sect. 11]). It provides an easy truncation, while the storage size \(dr^{3}+dnr\) is linear in d (r is the maximum of a rank tuple). As long as r is bounded, a tensor size of \(n^{d}\) for \(n=d=1000\) is harmless, since only the product dn appears. For later purposes, we mention that the cost of the typical tensor operations amounts to

$$\begin{aligned} \mathcal {O}\left( dr^{4}+dnr^{2}\right) \end{aligned}$$
(2.8)

elementary floating-point operations (r defined as above).

3 Linear systems

3.1 Model problem: Laplace operator

We consider the d-dimensional Laplace operator in a cube,

$$\begin{aligned} -\Delta =-\sum \limits _{j=1}^{d}\frac{\partial ^{2}}{\partial x_{j}^{2}} \quad \text {in}\;\;\Omega =[0,1]^{d}\subset \mathbb {R}^{d}, \end{aligned}$$

with Dirichlet condition \(u=0\) on \(\partial \Omega ,\) and discretise by a finite difference scheme with the grid size \(h=1/\left( n-1\right) \) in a regular product grid. Let \(A_{j}\) be the (positive definite) tridiagonal matrix built by the (one-dimensional) negative second divided differences. Then the system matrix has the form

$$\begin{aligned} \mathbf {A}=A_{1}\otimes I\otimes I\otimes \cdots \otimes I+I\otimes A_{2}\otimes I\otimes \cdots \otimes I+I\otimes I\otimes A_{3}\otimes \cdots \otimes I+\cdots , \end{aligned}$$

i.e., it is a d-term representation of the form (2.4). A similar result can be obtained for finite elements using a regular grid, but then the identity I is replaced with the mass matrix. The following considerations hold also for other positive definite matrices \(A_{j}.\)

One might fear that the huge system \(\mathbf {Ax}=\mathbf {b}\) has a huge condition number. This is not the case; instead, the condition number is determined by the one-dimensional problems:

$$\begin{aligned} \min _{1\le j\le d}{\text {cond}}(A_{j})\le {\text {cond}} (\mathbf {A})\le \max _{1\le j\le d}{\text {cond}}(A_{j}). \end{aligned}$$

Obviously, the spectrum of \(\mathbf {A}\) is contained in \([a,\infty )\) for \(a:=\sum _{j=1}^{d}\lambda _{1}^{(j)}\), where \(\lambda _{1}^{(j)}\) is the smallest eigenvalue of \(A_{j}.\) Using a suitable scaling, we may assume that \(a=1\).

Lemma 3.1

There are (computable) coefficients \(\alpha _{\nu },\beta _{\nu }>0\) such that

$$\begin{aligned} \mathbf {A}^{-1}\approx \mathbf {B}_{r}:=\sum _{\nu =1}^{r}\alpha _{\nu } \bigotimes \limits _{j=1}^{d}\exp (-\beta _{\nu }A_{j})\in \mathcal {R}_{r} \end{aligned}$$

with

$$\begin{aligned} \left\| \mathbf {A^{-1}}-\mathbf {B}_{r}\right\| _{2}\le C_{1}\exp (-C_{2}r^{1/2}). \end{aligned}$$

Even \(\left\| \mathbf {A^{-1}}-\mathbf {B}_{r}\right\| _{2}\le C_{1} \exp (-C_{2}r)\) holds, if \(\sigma (\mathbf {A})\subset [1,R],\) \(R<\infty \) (\(C_{1},C_{2}\) depend on R).

The proof uses the approximation of \(\frac{1}{x}\) in \([a,\infty )\) by the exponential sum \(E_{r}(x)=\sum _{\nu =1}^{r}\alpha _{\nu }\exp (-\beta _{\nu }x).\) Choosing the optimal coefficients \(\alpha _{\nu },\beta _{\nu },\) one obtains the following error bound with respect to the maximum norm:

$$\begin{aligned} \left\| \tfrac{1}{x}-E_{r}(x)\right\| _{\infty ,[1,\infty )} \le \mathcal {O}(\exp (-cr^{1/2})) \end{aligned}$$

(cf. Braess–Hackbusch [4, 5]). For positive definite matrices \(\mathbf {A}\) with \(\sigma (\mathbf {A})\subset [1,\infty )\), \(E_{r}(\mathbf {A})\) approximates \(\mathbf {A}^{-1}\) with the same bound: \(\left\| E_{r}(\mathbf {A})-\mathbf {A}^{-1}\right\| _{2} \le \Vert \frac{1}{x}-E_{r}(x)\Vert _{\infty ,[1,\infty )}\). In the case of \(\mathbf {A}=A_{1}\otimes I\otimes \cdots \otimes I+\cdots +I\otimes \cdots \otimes I\otimes A_{d}\) one has to evaluate \(\exp (-\beta _{\nu }\mathbf {A})\) and obtains

$$\begin{aligned} \mathbf {B}_{r}=E_{r}(\mathbf {A})=\sum _{\nu =1}^{r}\alpha _{\nu }\bigotimes \limits _{j=1}^{d}\exp (-\beta _{\nu }A_{j})\in \mathcal {R}_{r}. \end{aligned}$$
(3.1)
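The coefficients \(\alpha _{\nu },\beta _{\nu }\) in (3.1) can be obtained in several ways. The sketch below does not use the optimal Braess–Hackbusch coefficients but a simple quadrature of \(\frac{1}{x}=\int _{0}^{\infty }e^{-tx}\,\mathrm {d}t\) after the substitution \(t=e^{s}\) (hence more terms than necessary); for small n and d = 3 the result can be compared with the explicit inverse:

```python
# Sketch of the exponential-sum approximation B_r of the inverse Kronecker-sum
# matrix, with quadrature-based (not optimal) coefficients alpha_nu, beta_nu.
import numpy as np
from scipy.linalg import expm

n, d = 6, 3
A1 = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D factor
lam_min = d * np.linalg.eigvalsh(A1)[0]
A1 = A1 / lam_min                                          # scale: spectrum of bigA starts at 1

I = np.eye(n)
terms = [[A1 if j == k else I for j in range(d)] for k in range(d)]
bigA = sum(np.kron(np.kron(t[0], t[1]), t[2]) for t in terms)

# 1/x = int e^s exp(-x e^s) ds, trapezoidal rule with step h on [-M*h, M*h]
h, M = 0.5, 20
alphas = [h * np.exp(k * h) for k in range(-M, M + 1)]
betas = [np.exp(k * h) for k in range(-M, M + 1)]

B = sum(a * np.kron(np.kron(expm(-b * A1), expm(-b * A1)), expm(-b * A1))
        for a, b in zip(alphas, betas))

print(np.linalg.norm(B - np.linalg.inv(bigA), 2))          # small spectral-norm error
```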

3.2 Solution of linear systems

Consider a linear system

$$\begin{aligned} \mathbf {Ax}=\mathbf {b}, \end{aligned}$$

where \(\mathbf {x},\mathbf {b}\in \mathbf {V}= {\textstyle \bigotimes \nolimits _{j=1}^{d}} V_{j}\) and \(\mathbf {A}\in {\textstyle \bigotimes \nolimits _{j=1}^{d}} \mathcal {L}(V_{j},V_{j})\subset \mathcal {L}(\mathbf {V},\mathbf {V})\) are represented in one of the formats. The general form of a linear iteration is

$$\begin{aligned} \mathbf {x}^{m+1}=\mathbf {x}^{m}-\mathbf {B}\left( \mathbf {Ax}^{m} -\mathbf {b}\right) \end{aligned}$$

with any matrix \(\mathbf {B}\) (cf. [12, (3.2.4)]). If this algorithm were applied, e.g., with the starting value \(\mathbf {x}^{0}=0\), the representation ranks of \(\mathbf {x}^{m}\) would blow up. Therefore, according to Sect. 2.6, a truncation T must be applied. This yields the ‘truncated iteration’

$$\begin{aligned} \mathbf {x}^{m+1}=T\left( \mathbf {x}^{m}-\mathbf {B}\left( T\left( \mathbf {Ax}^{m}-\mathbf {b}\right) \right) \right) . \end{aligned}$$

The cost per step is \(nd\) times powers of the involved representation ranks.
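A schematic Python sketch of the truncated iteration; apply_A, apply_B, truncate and sub stand for format-dependent routines (e.g. the r-term or hierarchical-format operations) and are placeholders, not a fixed API:

```python
# Sketch of the truncated iteration x^{m+1} = T(x^m - B T(A x^m - b)).
def truncated_iteration(apply_A, apply_B, truncate, sub, b, x0, steps):
    x = x0
    for _ in range(steps):
        defect = truncate(sub(apply_A(x), b))     # T(A x^m - b)
        x = truncate(sub(x, apply_B(defect)))     # x^{m+1} = T(x^m - B T(A x^m - b))
    return x
```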

Remark 3.2

Let \(\mathbf {A}_{\Delta }\) be the discretisation of \(\Delta \) from above and \(\mathbf {B}_{r}\in \mathcal {R}_{r}\) the approximation of \(\mathbf {A}_{\Delta }^{-1}\) by (3.1). If the matrix \(\mathbf {A}\) corresponds to an elliptic pde of order 2, \(\mathbf {A}\) and \(\mathbf {A}_{\Delta }\) are spectrally equivalent (cf. [12, Lemma 8.4.1]). Therefore \(\mathbf {B}:=\mathbf {B}_{r}\in \mathcal {R}_{r}\) is an appropriate preconditioner.

The approach from above is, e.g., considered by Khoromskij [22].

3.3 cg-Like methods

The exact cg recursion yields a sequence of conjugate (i.e., A-orthogonal) vectors. It is well known that because of the floating point errors, the orthogonality is lost after several iterations. This effect is even worse for the conjugate gradient algorithm combined with tensor rank truncations, since the relative truncation errors are usually larger than the machine precision. There are two remedies:

  1. 1.

    The cg method is restricted to a few iterations so that the loss of orthogonality is still weak enough.

  2. 2.

    As soon as necessary, a re-orthogonalisation is performed.

The first approach requires fast convergence, i.e., a very good preconditioning. The second approach has a drawback which will be discussed next. First we mention the following articles about conjugate gradient methods in the context of tensor methods: Tobler [29], Kressner–Tobler [23–25], and Savas–Eldén [28].

Assume that \(\mathbf {u}_{k}\) (\(1\le k\le m\)) are tensors belonging to \(\mathcal {R}_{r}\) for a fixed r. The orthonormalisation changes \(\mathbf {u}_{2}\) into a linear combination \(\tilde{\mathbf {u}}_{2} =\alpha \mathbf {u}_{1}+\beta \mathbf {u}_{2}\). Obviously, the result \(\tilde{\mathbf {u}}_{2}\) belongs to \(\mathcal {R}_{2r}\). In general, the orthonormalisation process produces \(\tilde{\mathbf {u}}_{k}\in \mathcal {R} _{kr}\) with increasing representation rank kr. A reduction of the rank by truncation is not possible without again losing orthonormality.

To avoid this increase of the rank, Ballani–Grasedyck [2] propose a variant of the GMRES method with tensor truncation and without orthonormalisation. Nevertheless, the iterates are built in such a way that the residuals are decreasing.

4 Multigrid approach

First we transfer the setting of the geometric multigrid method to the tensor case. In Sect. 4.7 we discuss the pros and cons of the method.

4.1 Grid hierarchy

Since the tensors directly correspond to grid functions associated to product grids (cf. Sect. 2.1), the traditional geometric multigrid structure is appropriate.

Let \(V_{j}^{0},V_{j}^{1},\ldots ,V_{j}^{\ell },\ldots \) be a one-dimensional grid hierarchy. In the case of a difference scheme, \(V_{j}^{\ell }=\mathbb {R} ^{n_{\ell }^{(j)}}\) corresponds to a grid with \(n_{\ell }^{(j)}\) points, where \(n_{0}^{(j)}<n_{1}^{(j)}<\cdots \;\). In the case of a finite element discretisation, \(V_{j}^{\ell }\) may be the space of piecewise linear functions defined on a grid of \(n_{\ell }^{(j)}\) points.

The d-dimensional grid hierarchy is given by the tensor spaces

$$\begin{aligned} \mathbf {V}^{0},\mathbf {V}^{1},\ldots ,\mathbf {V}^{\ell },\ldots \quad \text {with}\;\;\mathbf {V}^{\ell }:=\bigotimes \limits _{j=1}^{d}V_{j}^{\ell }\,. \end{aligned}$$

In the case of the difference scheme, the elements of \(\mathbf {V}^{\ell }\) are grid functions on a d-dimensional grid with \(N:=\prod _{j=1}^{d}n_{\ell }^{(j)}\) nodal points (cf. Sect. 2.1), while, in the finite element case, \(\mathbf {V}^{\ell }\) consists of piecewise d-linear functions on the cubical grid; this space has the dimension \(N:=\prod _{j=1}^{d}n_{\ell }^{(j)}\) (cf. Sect. 2.2). Note that the corresponding finite element basis functions are elementary tensors of the form (2.1).

4.2 Prolongation, restriction

Let \(p_{j}:V_{j}^{\ell }\rightarrow V_{j}^{\ell +1}\) be the one-dimensional linear prolongation (cf. [13, Sect. 3.4.2]). The d-linear prolongation \(\mathbf {p}:\mathbf {V}^{\ell }\rightarrow \mathbf {V}^{\ell +1}\) is the elementary (Kronecker) tensor product of the one-dimensional linear prolongations:

$$\begin{aligned} \mathbf {p}=p_{1}\otimes p_{2}\otimes \cdots \otimes p_{d}\,. \end{aligned}$$

Similarly, the restriction satisfies

$$\begin{aligned} \mathbf {r}=r_{1}\otimes r_{2}\otimes \cdots \otimes r_{d}\,. \end{aligned}$$

Note that usually \(\mathbf {r}\) is the adjoint of \(\mathbf {p}\) (cf. [13, Sect. 3.5]).

An important observation is the fact that the application of \(\mathbf {p}\) and \(\mathbf {r}\) does not increase the representation ranks. Let \(\delta \mathbf {x}^{\ell -1}\) be the correction obtained from the coarse grid. Then the multigrid iteration determines the new iterate by \(\mathbf {x}^{\ell } \leftarrow \mathbf {x}^{\ell }-\mathbf {p}\ \delta \mathbf {x}^{\ell -1}.\) To restrict the representation rank, we have to replace this assignment by the truncated version

$$\begin{aligned} \mathbf {x}^{\ell }\leftarrow T(\mathbf {x}^{\ell }-\mathbf {p}\ \delta \mathbf {x}^{\ell -1})\,. \end{aligned}$$
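A sketch of the prolongation in the list-of-terms representation of Sect. 2.4; the 1D prolongation below is the usual piecewise linear interpolation for Dirichlet grids (one common choice, stated here as an assumption):

```python
# Sketch: the 1D prolongations p_j act factorwise on each term of an r-term
# tensor, so the representation rank is not increased.
import numpy as np

def prolongation_1d(n_coarse):
    """Piecewise linear interpolation from n_coarse to 2*n_coarse+1 interior points."""
    p = np.zeros((2 * n_coarse + 1, n_coarse))
    for i in range(n_coarse):
        p[2 * i + 1, i] = 1.0      # fine point coinciding with coarse point i
        p[2 * i, i] = 0.5          # left fine neighbour
        p[2 * i + 2, i] = 0.5      # right fine neighbour
    return p

def cp_prolongate(ps, v_terms):    # ps[j] = p_j
    return [[pj @ vj for pj, vj in zip(ps, term)] for term in v_terms]
```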

4.3 Coarse-grid matrices

Let \(\mathbf {A}^{L}\) be the fine-grid matrix (L is the level number of the finest grid). The auxiliary matrices \(\mathbf {A}^{\ell }\) (\(\ell <L\)) can be defined by the Galerkin approach (cf. [13, Sect. 3.7]):

$$\begin{aligned} \mathbf {A}^{\ell -1}:=\mathbf {rA}^{\ell }\mathbf {p}\quad \text {for}\;\; \ell =L,L-1,\ldots ,1. \end{aligned}$$

Assume that the fine-grid matrix \(\mathbf {A}^{L}\) has the representation

$$\begin{aligned} \mathbf {A}^{L}=\sum \limits _{\nu =1}^{r_{A}}\bigotimes \limits _{j=1}^{d}A_{\nu }^{L,j}. \end{aligned}$$

The Galerkin approach leads to

$$\begin{aligned} \mathbf {A}^{\ell }=\sum \limits _{\nu =1}^{r_{A}}\bigotimes \limits _{j=1}^{d} A_{\nu }^{\ell ,j} \end{aligned}$$

with the same representation rank \(r_{A}\) involving

$$\begin{aligned} A_{\nu }^{\ell -1,j}:=r_{j}\,A_{\nu }^{\ell ,j}\,p_{j}. \end{aligned}$$
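The factorwise Galerkin products can be sketched as follows (list-of-terms representation from Sect. 2.4; rs and ps are the lists of 1D restrictions and prolongations):

```python
# Sketch of the Galerkin coarse-grid factors A_nu^{l-1,j} = r_j A_nu^{l,j} p_j,
# computed term by term; the representation rank r_A is preserved.
def cp_galerkin(A_terms, rs, ps):
    return [[rj @ Aj @ pj for rj, Aj, pj in zip(rs, term, ps)]
            for term in A_terms]
```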

The defect \(\mathbf {d}^{(\ell -1)}:=\mathbf {A}^{\ell -1}\mathbf {x}^{\ell -1}-\mathbf {b}^{\ell -1}\) computed in the coarse grid has an increased rank because of the matrix-vector multiplication (see the example in Sect. 2.6). Hence, this step has to be replaced by the truncated version

$$\begin{aligned} \mathbf {d}^{(\ell -1)}:= T\left( \mathbf {A}^{\ell -1}\mathbf {x}^{\ell -1}-\mathbf {b} ^{\ell -1}\right) . \end{aligned}$$

4.4 Smoothing

4.4.1 Richardson iteration

The exact Richardson iteration is

$$\begin{aligned} \mathbf {x}_{\mathsf {new}}^{\ell }:=\mathbf {x}_{\mathsf {old}}^{\ell } -\omega _{\ell }\left( \mathbf {A}^{\ell }\mathbf {x}_{\mathsf {old}}^{\ell }-\mathbf {b}^{\ell }\right) . \end{aligned}$$

As discussed in Sect. 3.2, this approach is not advisable in the tensor case because of the increasing representation ranks. Instead, the truncated Richardson iteration

$$\begin{aligned} \mathbf {x}_{\mathsf {new}}^{\ell }:=T\left( \mathbf {x}_{\mathsf {old}}^{\ell }-\omega _{\ell }\left( \mathbf {A}^{\ell }\mathbf {x}_{\mathsf {old}}^{\ell }-\mathbf {b}^{\ell }\right) \right) \end{aligned}$$

is the method of choice.

It can be regarded as a semilinear variant of the Richardson iteration (cf. [13, Sect. 3.3.5]).

4.4.2 Jacobi iteration

The damped Jacobi iteration is an often applied smoothing iteration. However, in the tensor case, even performing this iteration is already too complicated, as pointed out below.

The diagonal of the (Kronecker) matrix \(\mathbf {A}^{\ell }\!=\!\sum \nolimits _{\nu =1}^{r_{A}}\bigotimes \nolimits _{j=1}^{d}A_{\nu ,j}^{\ell }\) is given by

$$\begin{aligned} \mathbf {D}^{\ell }:=\mathrm{diag}\{\mathbf {A}^{\ell }\}=\sum \limits _{\nu =1}^{r_{A}}\bigotimes \limits _{j=1}^{d}D_{\nu ,j}^{\ell }\quad \text {with }\;\;D_{\nu ,j}^{\ell }:=\mathrm{diag}\{A_{\nu ,j}^{\ell }\}. \end{aligned}$$

In the tensor case, diagonal matrices may be called sparse because of the many zero entries, but in fact they are not sparse since the number of nonzero elements is still huge. The number of nonzero entries of \(\mathbf {D}^{\ell } \in \mathbb {R}^{N\times N}\) is \(N=\prod _{j=1}^{d}n_{\ell }^{(j)}\), which is far too large for practical use. Instead one may try to use an approximation

$$\begin{aligned} (\mathbf {D}^{\ell })^{-1}\approx \mathbf {B}_{r}:=E_{r}(\mathbf {D}^{\ell } )=\sum \limits _{\mu =1}^{r}\alpha _{\mu }\exp (-\beta _{\mu }\mathbf {D}^{\ell }) \end{aligned}$$

as in (3.1). However, Kronecker exponentials of the form \(\exp \! \big ( \bigotimes _{j=1}^{d}D_{j} \big ) \) are not easy to evaluate or to apply to a tensor. The only exception arises when at most one \(D_{j}\) is different from the identity I.

4.5 Treatment of the coarsest grid

We recall that \(\dim (\mathbf {V}^{\ell })=\prod _{j=1}^{d}n_{j}^{(\ell )}\) with \(n_{j}^{(\ell )}:=\dim (V_{j}^{\ell })\). Since the dimension of the coarsest grid is \(\prod _{j=1}^{d}n_{j}^{(0)}\), a direct solution of the system at the lowest level requires

  • either \(n_{j}^{(0)}=1,\)

  • or \(n_{j}^{(0)}\) and d so small that also \(\prod _{j=1}^{d}n_{j}^{(0)}\) is sufficiently small.

The choice \(n_{j}^{(0)}=1\) is possible for positive definite problems (cf. [13, Sect. 7.2]). Only if \(n_{j}^{(0)}=1\) does the coarse-grid problem have the size \(\prod _{j=1}^{d}n_{j}^{(0)}=1\), which does not depend on d. Otherwise, some other method from Sects. 3.2–3.3 must be applied.

4.6 Convergence

Choosing the multigrid parameters as above, we can apply the standard convergence results (e.g., [13, Theorems 7.2.2 and 7.2.3]), provided that no tensor truncation is applied. The additional truncation effect is similar to the truncated iteration analysed by Hackbusch–Khoromskij–Tyrtyshnikov [18] (see also [14, Sect. 14.3.2] or [17, Sect. 15.3.2]).

Numerical examples for the use of a tensor multigrid iteration can be seen in Ballani–Grasedyck [2, Example 7.5]. These examples with \(n_{\mathsf {fine}}:=n^{(L)}=1023\) and dimensions up to \(d=32\) demonstrate that the convergence behaviour does not depend on d.

In (1.2) we mention the Lyapunov equation. A multigrid solution of the more general Sylvester matrix equation \(AX-XB=C\) is described in Grasedyck–Hackbusch [10]. A nonlinear multigrid approach to the quadratic Riccati equation is given in [9].

4.7 Cost

A disadvantage is the following effect. In traditional 3D multigrid applications, computations at level \(\ell -1\) with the step size \(h_{\ell -1}=2h_{\ell }\) cost only \(\frac{1}{8}\) of the computations at level \(\ell \). This explains why the cost of one V- or W-cycle is dominated by the operations spent on the finest grid.

In tensor applications, the cost of the operations is described in (2.8): \(\mathcal {O}(dr^{4}+dnr^{2})\). In the multigrid case, n has to be replaced with \(n^{(\ell )}:=\max \{n_{j}^{(\ell )}:1\le j\le d\}.\) For the finest grid, we use the notation \(n_{\mathsf {fine}}:=n^{(L)}.\) Assuming \(n_{j}^{(\ell )}\approx 2n_{j}^{(\ell -1)},\) we obtain the following work per V- or W-cycle:

$$\begin{aligned} \text {V-cycle:}&\quad \mathcal {O}(dr^{4}L+dn_{\mathsf {fine}}r^{2}),\\ \text {W-cycle:}&\quad \mathcal {O}(dr^{4}2^{L}+dn_{\mathsf {fine}}r^{2}L)=\mathcal {O}(dr^{4}n_{\mathsf {fine}}+dn_{\mathsf {fine}}r^{2}L). \end{aligned}$$

This result can be interpreted in different ways. On the negative side, the portion of the work spent on the auxiliary grids \(\ell <L\) is larger than in the traditional case, but on the positive side, even for the W-cycle, the cost is less than \(\mathcal {O}(n_{\mathsf {fine}}^{1+\varepsilon })\) for all \(\varepsilon >0\) with respect to \(n_{\mathsf {fine}}\). Nevertheless, the factor \(r^{4}\) may become more problematic than \(n_{\mathsf {fine}}\).

A possible reduction of the cost may be obtained by choosing different representation ranks r for the finest grid and for the auxiliary problems at the levels \(\ell <L\). For \(\ell =L\) a possibly larger rank \(r_{\mathsf {fine}}\) is needed to approximate the solution of the discretised pde. The size of \(r_{\mathsf {fine}}\) depends on the nature of the solution. However, for \(\ell <L\) only corrections are to be represented and the rank required by the corrections is not related to \(r_{\mathsf {fine}}.\)

4.8 Parabolic problems

Since the dimension, in the sense of the number of coordinates, is no limitation, simultaneous space-time discretisations with the additional time variable are not disadvantageous. In this respect, the results of Andreev–Tobler [1] are of interest. In the latter paper, a BPX preconditioner is used.

5 Variational approach

Finally we mention a quite different approach to solving \(\mathbf {Ax} =\mathbf {b}\) approximately. If \(\mathbf {A}\) is positive definite, we may minimise the quadratic cost function

$$\begin{aligned} \Phi (\mathbf {x}):=\left\langle \mathbf {Ax},\mathbf {x}\right\rangle -2\left\langle \mathbf {b},\mathbf {x}\right\rangle . \end{aligned}$$

In the general case of a regular \(\mathbf {A}\), define

$$\begin{aligned} \Phi (\mathbf {x}):=\left\| \mathbf {Ax}-\mathbf {b}\right\| ^{2} \quad \text {or}\quad \Phi (\mathbf {x}):=\left\| \mathbf {B}\left( \mathbf {Ax}-\mathbf {b}\right) \right\| ^{2} \end{aligned}$$

with a suitable preconditioner \(\mathbf {B}\) and try to minimise \(\Phi (\mathbf {x})\). The minimisation over all tensors \(\mathbf {x}\) is not feasible because of the huge dimension. Instead one fixes a certain format for the representation of \(\mathbf {x}\) and minimises over all representation parameters of \(\mathbf {x}\). In practice, this works quite well, although the theoretical understanding of the convergence properties is still incomplete. A further difficulty is that the minimisation over the representation parameters is nonconvex, so that \(\Phi \) may possess many local minima.
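For illustration, a sketch of the simplest instance of this idea: a rank-1 ansatz \(\mathbf {x}=v_{1}\otimes \cdots \otimes v_{d}\) and alternating minimisation (ALS) of \(\Phi (\mathbf {x})=\left\langle \mathbf {Ax},\mathbf {x}\right\rangle -2\left\langle \mathbf {b},\mathbf {x}\right\rangle \). Here \(\mathbf {A}\) and \(\mathbf {b}\) are assumed to be given in r-term form with symmetric positive definite factors \(A_{\nu }^{(j)}\) (an assumption made only for this sketch); fixing all factors but \(v_{k}\), the functional is quadratic in \(v_{k}\) and is minimised by one small \(n\times n\) solve:

```python
# Sketch of alternating least squares (ALS) for a rank-1 ansatz; all names are
# illustrative.  A_terms[nu][j] = A_nu^(j) (symmetric positive definite),
# b_terms[mu][j] = b_mu^(j), x_factors[j] = v_j (nonzero start vectors).
import numpy as np

def als_rank1(A_terms, b_terms, x_factors, sweeps=10):
    d = len(x_factors)
    for _ in range(sweeps):
        for k in range(d):
            n_k = x_factors[k].size
            M = np.zeros((n_k, n_k))
            rhs = np.zeros(n_k)
            for tA in A_terms:      # reduced matrix: sum_nu (prod_{j!=k} v_j^T A_nu^j v_j) A_nu^k
                c = np.prod([x_factors[j] @ tA[j] @ x_factors[j]
                             for j in range(d) if j != k])
                M += c * tA[k]
            for tb in b_terms:      # reduced right-hand side: sum_mu (prod_{j!=k} <b_mu^j, v_j>) b_mu^k
                c = np.prod([tb[j] @ x_factors[j] for j in range(d) if j != k])
                rhs += c * tb[k]
            x_factors[k] = np.linalg.solve(M, rhs)
    return x_factors
```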

Concerning the variational approach we refer to the following papers: Espig–Hackbusch–Rohwedder–Schneider [7], Falcó–Nouy [8], Holtz–Rohwedder–Schneider [21], Mohlenkamp [26], Oseledets [27] and others cited in these papers.