Data-parallel computations, such as linear algebra routines and stencil computations, constitute one of the most relevant classes in parallel computing, e.g., due to their importance for deep learning. Efficiently de-composing such computations for the memory and core hierarchies of modern architectures and re-composing the computed intermediate results back to the final result—we say (de/re)-composition for short—is key to achieving high performance for these computations on, e.g., GPUs and CPUs. Current high-level approaches to generating data-parallel code are often restricted to a particular subclass of data-parallel computations and architectures (e.g., only linear algebra routines on only GPU or only stencil computations), and/or the approaches rely on a user-guided optimization process for a well-performing (de/re)-composition of computations, which is complex and error prone for the user.
We formally introduce a systematic (de/re)-composition approach, based on the algebraic formalism of Multi-Dimensional Homomorphisms (MDHs). Our approach is designed to be general enough to be applicable to a wide range of data-parallel computations and to various kinds of target parallel architectures. To efficiently target the deep and complex memory and core hierarchies of contemporary architectures, we exploit our (de/re)-composition approach for a correct-by-construction, parametrized cache blocking and parallelization strategy. We show that our approach is powerful enough to express, in the same formalism, the (de/re)-composition strategies of different classes of state-of-the-art approaches (scheduling-based, polyhedral, etc.), and we demonstrate that the parameters of our strategies enable systematically generating code that can be fully automatically optimized (auto-tuned) for the particular target architecture and the characteristics of the input and output data (e.g., their sizes and memory layouts). In particular, our experiments confirm that via auto-tuning, we achieve higher performance than state-of-the-art approaches, including hand-optimized solutions provided by vendors (such as NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN), on real-world datasets and for a variety of data-parallel computations, including linear algebra routines, stencil and quantum chemistry computations, data mining algorithms, and computations that recently gained high attention due to their relevance for deep learning.
1 Introduction
Data-parallel computations constitute one of the most relevant classes in parallel computing. Important examples of such computations include linear algebra routines [Whaley and Dongarra, 1998], various kinds of stencil computations (e.g., Jacobi method and convolutions) [Hagedorn et al., 2018], quantum chemistry computations [Kim et al., 2019], and data mining algorithms [Rasch et al., 2019b]. The success of many application areas critically depends on achieving high performance for their data-parallel building blocks, on a variety of parallel architectures. For example, highly optimized implementations of linear algebra routines combined with the computational power of modern GPUs currently enable deep learning to significantly outperform other existing machine learning approaches (e.g., for speech recognition and image classification).
Data-parallel computations are characterized by applying the same function (a.k.a. scalar function) to each point in a multi-dimensional grid of data (a.k.a. array) and combining the obtained intermediate results in the grid’s different dimensions using so-called combine operators.
Figures 1 and 2 illustrate data parallelism using as examples two popular computations: (1) linear algebra routine Matrix-Vector multiplication (MatVec) and (2) stencil computation Jacobi (Jacobi1D). In the case of MatVec, the grid is two-dimensional and consists of pairs, each pointing to one element of the input matrix \(M_{i,k}\) and the vector \(v_{k}\). To each pair, scalar function \(f(M_{i,k},v_{k}):=M_{i,k}*v_{k}\) (multiplication) is applied, and results in the \(i\)-dimension are combined using combine operator \(\circledast_{1}((x_{1},\dotsc,x_{n}),(y_{1},\dotsc,y_{m})):=(x_{1},\dotsc,x_{n},y_{1},\dotsc,y_{m})\) (concatenation) and in the \(k\)-dimension using operator \(\circledast_{2}((x_{1},\dotsc,x_{n}),(y_{1},\dotsc,y_{n})):=(x_{1}+y_{1},\dotsc,x_{n}+y_{n})\) (point-wise addition). Similarly, the scalar function of Jacobi1D is \(f(v_{i+0},v_{i+1},v_{i+2}):=c*(v_{i+0}+v_{i+1}+v_{i+2})\), which computes the Jacobi-specific function for an arbitrary but fixed constant \(c\); Jacobi1D’s combine operator \(\circledast_{1}\) is concatenation. We formally define scalar functions and combine operators later in this article.
Fig. 1.
Fig. 2.
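To make the roles of scalar functions and combine operators concrete, the following minimal Python sketch spells out MatVec’s building blocks from Figure 1 as plain functions over lists (the list-based data representation and function names are illustrative assumptions, not part of the formalism):

```python
# Minimal sketch (illustrative names, list-based data): MatVec's scalar function
# and its two combine operators from Figure 1, written as plain Python functions.

def f(M_ik, v_k):
    # scalar function: multiply one matrix element with one vector element
    return M_ik * v_k

def combine_1(xs, ys):
    # combine operator in the i-dimension: concatenation
    return xs + ys

def combine_2(xs, ys):
    # combine operator in the k-dimension: point-wise addition
    return [x + y for x, y in zip(xs, ys)]

# Combining partial results computed for two halves of the k-dimension (same rows):
print(combine_2([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
# Combining results computed for two disjoint blocks of rows (i-dimension):
print(combine_1([11, 22, 33], [44, 55]))   # [11, 22, 33, 44, 55]
```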
Achieving high performance for data-parallel computations is considered important in both academia and industry but has proven to be challenging. In particular, achieving high performance that is portable (i.e., the same program code achieves a consistently high level of performance across different architectures and characteristics of the input/output data, e.g., their size and memory layout) and in a user-productive way is identified as an ongoing, major research challenge. This is because for high performance, an efficient (de/re)-composition of computations (illustrated in Figure 3 and discussed thoroughly in this article) is required to efficiently break down a computation for the deep and complex memory and core hierarchies of state-of-the-art architectures, via efficient cache blocking and parallelization strategies. Moreover, to achieve performance that is portable across architectures, the programmer has to consider that architectures often differ significantly in their characteristics [Sun et al., 2019]—depth of memory and core hierarchies, automatically managed caches (as in CPUs) vs. manually managed caches (as in GPUs), and so on—which poses further challenges for identifying an efficient (de/re)-composition of computations. Productivity is often also hampered: state-of-the-art programming models (such as OpenMP [OpenMP, 2022] for CPU, CUDA [NVIDIA, 2022g] for GPU, and OpenCL [Khronos, 2022b] for multiple kinds of architectures) operate on a low abstraction level; thereby, the models require the programmer to explicitly implement a well-performing (de/re)-composition, which involves complex and error-prone index computations, explicitly managing memory and threads on multiple layers, and so on.
Fig. 3.
Current high-level approaches to generating data-parallel code usually struggle with addressing in one combined approach all three challenges: performance, portability, and productivity. For example, approaches such as Halide [Ragan-Kelley et al., 2013], Apache TVM [Chen et al., 2018a], Fireiron [Hagedorn et al., 2020a], and LoopStack [Wasti et al., 2022] achieve high performance but incorporate the user into the optimization process—by requiring the user to explicitly express optimizations in a so-called scheduling language—which is error prone and requires expert knowledge about low-level code optimizations, thus hindering the user’s productivity. In contrast, polyhedral approaches, such as Pluto [Bondhugula et al., 2008b], PPCG [Verdoolaege et al., 2013], and Facebook’s TC [Vasilache et al., 2019], are often fully automatic and thus productive but usually specifically designed toward a particular architecture (e.g., only GPU as TC and PPCG, or only CPU as Pluto) and thus not portable. Functional approaches, e.g., Lift [Steuwer et al., 2015], are productive for functional programmers (e.g., with experience in Haskell [Haskell.org, 2022] programming, which relies on small, functional building blocks for expressing computations), but the approaches often have difficulties in automatically achieving the full performance potential of architectures [Rasch et al., 2019a]. Furthermore, many of the existing approaches are specifically designed toward a particular subclass of data-parallel computations only, e.g., only tensor operations (as LoopStack and TC) or only matrix multiplication (as Fireiron), or they require significant extensions for new subclasses (as Lift for matrix multiplication [Remmelg et al., 2016] and stencil computations [Hagedorn et al., 2018]), which further hinders the productivity of the user.
In this article, we formally introduce a systematic (de/re)-composition approach for data-parallel computations targeting state-of-the-art parallel architectures. We express computations via high-level functional expressions (specifying what to compute), in the form of easy-to-use higher-order functions, based on the algebraic formalism of Multi-Dimensional Homomorphisms (MDHs)1 [Rasch and Gorlatch, 2016].2 Our higher-order functions are capable of expressing various kinds of data-parallel computations (linear algebra, stencils, etc.), in the same formalism and on a high level of abstraction, independently of hardware and optimization details, thereby contributing to the user’s productivity.3 As the target for our high-level expressions, we introduce functional low-level expressions (specifying how to compute) to formally reason about (de/re)-compositions of data-parallel computations; our low-level expressions are designed such that they can be straightforwardly transformed to executable program code (e.g., in OpenMP, CUDA, and OpenCL). To systematically lower our high-level expressions to low-level expressions, we introduce a formally sound, parameterized lowering process. The parameters of our lowering process enable automatically computing low-level expressions that are optimized (auto-tuned [Balaprakash et al., 2018]) for the particular target architecture and characteristics of the input/output data, thereby fully automatically achieving high, portable performance. For example, we formally introduce parameters for flexibly choosing the target memory regions for de-composed and re-composed computations and also parameters for flexibly setting an optimized data access pattern.
We show that our high-level representation is capable of expressing various kinds of data-parallel computations, including computations that recently gained high attention due to their relevance for deep learning [Barham and Isard, 2019]. For our low-level representation, we show that it can express the cache blocking and parallelization strategies of state-of-the-art parallel implementations—as generated by scheduling approach TVM and polyhedral compilers PPCG and Pluto—in one uniform formalism. Moreover, we present experimental results to confirm that based on our parameterized lowering process in combination with auto-tuning, we are able to achieve higher performance than the state of the art, including hand-optimized implementations provided by vendors (e.g., NVIDIA cuBLAS and Intel oneMKL for linear algebra routines, and NVIDIA cuDNN and Intel oneDNN for deep learning computations).
Summarized, we make the following three major contributions (illustrated in Figure 4):
(1)
We introduce a functional High-Level Representation (HL REP), based on the algebraic formalism of MDHs, that enables uniformly expressing data-parallel computations on a high level of abstraction.
(2)
We introduce a functional Low-Level representation (LL REP) that enables formally expressing and reasoning about (de/re)-compositions of data-parallel computations; our low-level representation is designed such that it can be straightforwardly transformed to executable program code in state-of-practice parallel programming models, including OpenMP, CUDA, and OpenCL.
(3)
We introduce a systematic lowering process to fully automatically lower an expression in our high-level representation to a device- and data-optimized expression in our low-level representation, in a formally sound manner, based on auto-tuning.
Fig. 4.
Our three contributions aim to answer the following questions:
(1)
How can data parallelism be formally defined, and how can data-parallel computations be uniformly expressed via higher-order functions that are agnostic to hardware and optimization details while still capturing all information relevant for generating high-performing, executable program code? (Contribution 1)
(2)
How can optimizations for the memory and core hierarchies of state-of-the-art parallel architectures be formally expressed and generalized such that they apply to arbitrary data-parallel computations? (Contribution 2)
(3)
How can optimizations for data-parallel computations be expressed and structured so that they can be automatically identified (auto-tuned) for a particular target architecture and characteristics of the input and output data? (Contribution 3)
The rest of the article is structured as follows. We introduce our functional HL REP (Contribution 1) in Section 2, and we show how this representation is used for expressing various kinds of popular data-parallel computations. In Section 3, we discuss our functional LL REP (Contribution 2) which is powerful enough to express the optimization decisions of state-of-practice approaches (e.g., scheduling approach TVM and polyhedral compilers PPCG and Pluto) and beyond. Section 4 shows how we systematically lower a computation expressed in our high-level representation to an expression in our low-level representation, in a formally sound and auto-tunable manner (Contribution 3). We present experimental results in Section 5, discuss related work in Section 6, conclude in Section 7, and we present our ideas for future work in Section 8.
We provide a full version of this paper [Rasch, 2024] that contains details for the interested reader that are not required for understanding the basic concepts introduced in this article. In particular, our full version contains formal details—for all the following definitions, examples, and theorems in Sections 2–4—whereas the formalism in this article is simplified for better illustration and easier understanding of our basic ideas and concepts.
2 High-Level Representation for Data-Parallel Computations
We introduce functional building blocks, in the form of higher-order functions, that express data-parallel computations on a high abstraction level. The goal of our high-level abstraction is to express computations agnostic to hardware and optimization details, and thus in a user-productive manner, while still capturing all information relevant for generating high-performance program code. The building blocks of our abstraction are based on the algebraic MDH formalism which is an approach toward formalizing data parallelism (we compare in detail to the existing work on MDHs in Section 6.6).
Figure 5 shows a basic overview of our high-level representation. We express data-parallel computations using only three higher-order functions (a.k.a. patterns or skeletons [Gorlatch and Cole, 2011] in programming terminology): (1) inp_view transforms the domain-specific input data (e.g., a matrix and a vector in the case of matrix-vector multiplication) to a Multi-Dimensional Array (MDA) which is our internal data representation and defined later in this section; (2) md_hom expresses the data-parallel computation; (3) out_view transforms the computed MDA back to the domain-specific data representation.
Fig. 5.
In the following, after informally discussing an introductory example in Section 2.1, we formally define and discuss each higher-order function in detail in Section 2.2 (function md_hom) and Section 2.3 (functions inp_view and out_view). Sections 2.2 and 2.3 introduce and present the internals and formal details of our approach, which are not relevant for the end user of our system—the user only needs to operate on the abstraction level discussed in Section 2.1.
2.1 Introductory Example
Figure 6 shows how our high-level representation is used for expressing the example of matrix-vector multiplication MatVec4 (Figure 1). Computation MatVec takes as input a matrix \(M\in T^{I\times K}\) and vector \(v\in T^{K}\) of arbitrary scalar type5\(T\) and sizes \(I\times K\) (matrix) and \(K\) (vector), for arbitrary but fixed positive natural numbers \(I,K\in\mathbb{N}\).6 In the figure, based on index functions \((i,k)\to(i,k)\) and \((i,k)\to(k)\), high-level function inp_view computes a function that takes \(M\) and \(v\) as input and maps them to a two-dimensional array of size \(I\times K\) (referred to as input MDA in the following and defined formally in the next subsection). The MDA contains at each point \((i,k)\) the pair \((M_{i,k},v_{k})\in T\times T\) comprising element \(M_{i,k}\) within matrix \(M\) (first component) and element \(v_{k}\) within vector \(v\) (second component). The input MDA is then mapped via function md_hom to an output MDA of size \(I\times 1\), by applying multiplication \(*\) to each pair \((M_{i,k},v_{k})\) within the input MDA, and combining the obtained intermediate results within the MDA’s first dimension via ++ (concatenation—also defined formally in the next subsection) and in the second dimension via \(+\) (point-wise addition). Finally, function out_view computes a function that straightforwardly maps the output MDA, of size \(I\times 1\), to MatVec’s result vector \(w\in T^{I}\), which has scalar type \(T\) and is of size \(I\). For the example of MatVec, the output view is trivial, but it can be used in other computations (such as matrix multiplication) to conveniently express more advanced variants of computations (e.g., computing the result matrix of matrix multiplication as transposed, as demonstrated later).7
Fig. 6.
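As a concrete illustration of the pipeline in Figure 6, the following self-contained Python sketch mimics the three stages for MatVec on nested lists; it is only an executable paraphrase of the figure under these representation assumptions, not the API of the MDH compiler discussed later:

```python
# Self-contained sketch of the MatVec pipeline from Figure 6 in plain Python
# (illustrative only; the actual MDH compiler exposes a different, richer interface).

def inp_view_matvec(M, v):
    """Input view: (M, v) -> input MDA of size I x K holding pairs (M[i][k], v[k])."""
    I, K = len(M), len(v)
    return [[(M[i][k], v[k]) for k in range(K)] for i in range(I)]

def md_hom_matvec(mda):
    """md_hom( *, (++, +) ): multiply each pair, concatenate over i, add point-wise over k."""
    return [[sum(m * x for (m, x) in row)] for row in mda]   # output MDA of size I x 1

def out_view_matvec(out_mda):
    """Output view: output MDA of size I x 1 -> result vector w of size I."""
    return [row[0] for row in out_mda]

M = [[1, 2], [3, 4]]
v = [10, 20]
w = out_view_matvec(md_hom_matvec(inp_view_matvec(M, v)))
print(w)  # [50, 110]
```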
2.2 Function md_hom
Higher-order function md_hom is introduced by Rasch and Gorlatch [2016] to express MDH functions—a formal representation of data-parallel computations—in a convenient and structured way. In the following, we recapitulate the definition of MDHs and function md_hom, but in a more general and formally more precise setting than done in the original MDH work.
To define MDH functions, we first need to introduce two central building blocks used in the definition of MDHs: (1) MDAs—the data type on which MDHs operate and which uniformly represent domain-specific input and output data (scalar, vectors, matrices, \(\dotsc\)), and (2) combine operators which we use to combine elements within a particular dimension of an MDA.
MDAs
Figure 7 shows six MDAs for illustration. For example, the left part of the figure shows MDA \(\mathfrak{a}\) which is of type \(\mathfrak{a}:I_{1} \times I_{2} \to T\), for \(I_{1}=\{0,1\}\), \(I_{2}=\{0,1,2,3\}\), and \(T=\mathbb{Z}\) (integer numbers). Note that MDAs named \(\mathfrak{a}^{(1,1)},\mathfrak{a}^{(1,2)},\mathfrak{a}^{(2,1)},\mathfrak{a}^{(2,2)},\mathfrak{a}^{(2,3)}\) in Figure 7 can be considered as parts (a.k.a. tiles in programming) of MDA \(\mathfrak{a}\): the MDA named \(\mathfrak{a}^{(1,1)}\) represents the first row of \(\mathfrak{a}\), MDA \(\mathfrak{a}^{(2,2)}\) the third column of \(\mathfrak{a}\), etc. We formally define and use partitionings of MDAs in Section 3.
Fig. 7.
Combine Operators
A central building block in our definition of MDHs is a combine operator. Intuitively, we use a combine operator to combine all elements within a particular dimension of an MDA. For example, in Figure 1 (matrix-vector multiplication), we combine elements of the two-dimensional MDA via combine operator concatenation in MDA’s first dimension and via operator point-wise addition in the second dimension. Technically, combine operators are functions that take as input two MDAs and yield a single MDA as their output.
We now define combine operators formally, and we illustrate this formal definition afterward using the example operators concatenation and point-wise combination.
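As a preview of the two example operators discussed throughout this article, the following Python sketch shows concatenation and point-wise combination on two-dimensional MDAs represented as nested lists (this data representation is an illustrative assumption, not the formal definition):

```python
# Illustrative Python sketch of the two example combine operators, shown on
# two-dimensional MDAs represented as nested lists.

def concat_dim1(a, b):
    # ++ in dimension 1: stack the two MDAs' rows
    return a + b

def concat_dim2(a, b):
    # ++ in dimension 2: append the rows element-wise (concatenate columns)
    return [ra + rb for ra, rb in zip(a, b)]

def pointwise(op):
    # point-wise combination according to a binary operator op (e.g., addition)
    def combine(a, b):
        return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return combine

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(concat_dim1(a, b))                    # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(concat_dim2(a, b))                    # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(pointwise(lambda x, y: x + y)(a, b))  # [[6, 8], [10, 12]]
```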
MDHs
Now that we have defined MDAs (Definition 1) and combine operators (Definition 2), we can define MDH functions. Intuitively, a function \(h\) operating on MDAs is an MDH iff we can apply the function independently to parts of its input MDA and combine the obtained intermediate results to the final result using combine operators; this can be imagined as a typical divide-and-conquer pattern. Compared to classical approaches, e.g., list homomorphisms [Bird, 1989; Cole, 1995; Gorlatch, 1999], a major characteristic of MDH functions is that they allow (de/re)-composing computations in multiple dimensions (e.g., in Figure 1, in both the concatenation dimension as well as in the point-wise addition dimensions), rather than being limited to a particular dimension only (e.g., only the concatenation dimension or only the point-wise addition dimension, respectively). We will see later in this article that a multi-dimensional (de/re)-composition approach is essential to efficiently exploit the hardware of modern architectures, which require fine-grained cache blocking and parallelization strategies to achieve their full performance potential.
Figure 8 illustrates the MDH property informally on a simple, two-dimensional input MDA. In the left part of the figure, we split the input MDA in dimension 1 (i.e., horizontally) into two parts \(\mathfrak{a}_{1}\) and \(\mathfrak{a}_{2}\), apply the MDH function \(h\) independently to each part, and combine the obtained intermediate results to the final result using the MDH function \(h\)’s combine operator \(\circledast_{1}\). Similarly, in the right part of Figure 8, we split the input MDA in dimension 2 (i.e., vertically) into parts and combine the results via MDH function \(h\)’s second combine operator \(\circledast_{2}\).
Fig. 8.
Figure 9 shows an artificial example in which we apply the MDH property (illustrated in Figure 8) recursively. We refer in Figure 9 to the part above the horizontal dashed lines as the de-composition phase and to the part below the dashed lines as the re-composition phase.
Fig. 9.
MDHs are defined such that applying them to a concatenated MDA in dimension \(d\) can be computed by applying the MDH \(h\) independently to the MDA’s parts \(\mathfrak{a}_{1}\) and \(\mathfrak{a}_{2}\) and combining the intermediate results afterward by using its combine operator \(\circledast_{d}\), as also informally discussed above.
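Stated as an equation (in simplified form; the precise definition is given in the full version [Rasch, 2024]), the MDH property for dimension \(d\) reads: \(h(\mathfrak{a}_{1}\ \mbox{++}_{d}\ \mathfrak{a}_{2}) = h(\mathfrak{a}_{1}) \circledast_{d} h(\mathfrak{a}_{2})\), where \(\mbox{++}_{d}\) denotes concatenation of MDAs in dimension \(d\). For MatVec and \(d=2\), this states that applying \(h\) to the full input MDA yields the same result as applying \(h\) to its left and right column halves separately and adding the two partial result vectors point-wise.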
We show how Examples 3 and 4 (and particularly also more advanced examples) are expressed in our high-level representation in Section 2.4, based on higher-order functions md_hom, inp_view, and out_view (Figure 5) which we introduce in the following.
Higher-Order Function md_hom
We define higher-order function md_hom which conveniently expresses MDH functions in a uniform and structured manner. For this, we exploit that any MDH function is uniquely determined by its combine operators and its behavior on singleton MDAs, as informally illustrated in the following figure:
Here, \(f\) is the function on scalar values that behaves the same as \(h\) when restricted to singleton MDAs: \(f(\mathfrak{a}[i_{1},\dotsc,i_{D}]):=h(\mathfrak{a})\), for any MDA \(\mathfrak{a}\in T[\{i_{1}\},\dotsc,\{i_{D}\}]\) consisting of only one element that is accessed by (arbitrary) indices \(i_{1},\dotsc,i_{D}\in\mathbb{N}_{0}\). For singleton MDAs, we usually use \(f\) instead of \(h\), because \(f\) can be defined more conveniently by the user than \(h\) (which needs to handle MDAs of arbitrary sizes, and not only singleton MDAs as \(f\)). Also, since \(f\) takes as input a scalar value (rather than a singleton MDA, as \(h\)), the type of \(f\) also becomes simpler, which further eases its definition.
We now formally introduce function md_hom which uniformly expresses any MDH function, by using only the MDH’s behavior \(f\) on scalar values and the MDH’s combine operators.
Using Definition 4, we express any MDH function uniformly via higher-order function md_hom using only the MDH’s behavior \(f\) on scalar values and its combine operators \(\circledast_{1},\dotsc,\circledast_{D}\). The other direction also holds: each function expressed via md_hom is an MDH function, because we require the homomorphic property for md_hom.
Note that function md_hom is defined as a partial function, because the homomorphic property is not met for all potential combinations of combine operators, e.g., \(\circledast_{1}=+\) (point-wise addition) and \(\circledast_{2}=*\) (point-wise multiplication). However, in many real-world examples, an MDH’s combine operators are a mix of concatenations and point-wise combinations according to the same binary function. The following lemma proves that any instance of the md_hom higher-order function for such a mix of combine operators is a well-defined MDH function.
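The following Python sketch captures md_hom’s divide-and-conquer semantics for two-dimensional MDAs (a simplified reference semantics over nested lists, assumed for illustration only—it is not the formal Definition 4 and not the code our approach generates):

```python
# Simplified reference semantics of md_hom for two-dimensional MDAs (nested lists).

def md_hom_2d(f, comb1, comb2):
    def h(a):
        if len(a) == 1 and len(a[0]) == 1:
            return [[f(a[0][0])]]                  # singleton MDA: apply scalar function f
        if len(a) > 1:                             # split in dimension 1, recombine via comb1
            mid = len(a) // 2
            return comb1(h(a[:mid]), h(a[mid:]))
        mid = len(a[0]) // 2                       # split in dimension 2, recombine via comb2
        left = [row[:mid] for row in a]
        right = [row[mid:] for row in a]
        return comb2(h(left), h(right))
    return h

# MatVec: input MDA of pairs (M[i][k], v[k]); ++ in dimension 1, point-wise + in dimension 2.
concat = lambda x, y: x + y
pw_add = lambda x, y: [[xe + ye for xe, ye in zip(rx, ry)] for rx, ry in zip(x, y)]
matvec = md_hom_2d(lambda p: p[0] * p[1], concat, pw_add)
print(matvec([[(1, 10), (2, 20)], [(3, 10), (4, 20)]]))  # [[50], [110]]
```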
MDH functions are defined (Definition 3) such that they uniformly operate on MDAs (Figure 5). We introduce higher-order function inp_view to transform domain-specific inputs (e.g., a matrix and a vector in the case of matrix-vector multiplication) to an MDA, and we use function out_view to transform the output MDA back to the domain-specific data requirements (like storing it as a transposed matrix in the case of matrix multiplication, or splitting it into multiple outputs as we will see later with examples). We introduce both higher-order functions in the following.
2.3 View Functions
In the following, after introducing Buffers (BUF) which represent domain-specific input and output data in our approach (scalars, vectors, matrices, etc.), we define in Sections 2.3.1 and 2.3.2 the concepts of input views and output views—both are central building blocks in our approach. We define input views as arbitrary functions that map a collection of user-defined BUFs to our internal MDA data representation (Figure 5); higher-order function inp_view is then introduced to conveniently compute an important class of input view functions that are relevant for expressing real-world computations. Correspondingly, Section 2.3.2 defines output views as functions that transform an MDA to a collection of BUFs, and higher-order function out_view is introduced to conveniently compute important output views. Finally, we discuss in Section 2.3.3 the relationship between higher-order functions inp_view and out_view: we prove that both functions are inversely related to each other, allowing us to arbitrarily switch between our internal MDA representation and our domain-specific BUF representation (as required for our code generation process discussed later).
In contrast to MDAs, a BUF always operates on a contiguous range of natural numbers starting from \(0\), and a BUF may contain undefined values. These two differences allow straightforwardly transforming BUFs to data structures provided by low-level programming languages (e.g., C arrays, as used in OpenMP, CUDA, and OpenCL).
Note that in our generated program code (discussed later in Section 3), we implement MDAs on top of BUFs, as straightforward aliases that access BUFs, so that we do not need to transform MDAs to low-level data structures and/or store them otherwise physically in memory.
2.3.1 Input Views.
We define input views as arbitrary functions that compute an MDA from a collection of (user-defined) BUFs. For example, in the case of MatVec, its input view takes as input two BUFs—a matrix and a vector—and it yields a two-dimensional MDA containing pairs of matrix and vector elements (illustrated in Figure 1). In contrast, the input view of Jacobi1D takes as input a single BUF (representing a vector) only, and it computes an MDA containing triples of BUF elements (Figure 2).
In the following, we introduce higher-order function inp_view which conveniently computes important input views from user-defined index functions \(\mathfrak{idx}_{b,a}:\{0,1,\dotsc\}\to\{0,1,\dotsc\}\), \(b\in[1,B]_{\mathbb{N}}\), \(a\in[1,A_{b}]_{\mathbb{N}}\), in a uniform, structured manner. Here, \(B\in\mathbb{N}\) represents the number of BUFs that the computed input view will take as input, and \(A_{b}\) represents the number of accesses to the \(b\)-th BUF required for computing an individual MDA element.
In the case of MatVec (Figure 1), we use \(B:=2\) because MatVec has two input BUFs: a matrix \(M\) (the first input of MatVec and thus identified by \(b=1\)) and a vector \(v\) (identified by \(b=2\)). For the number of accesses, we use for the matrix \(A_{1}:=1\), as one element is accessed within matrix \(M\) to compute an individual MDA element—matrix element \(M[i,k]\) for computing MDA element at position \((i,k)\). For the vector, we use \(A_{2}:=1\), as the single element \(v[k]\) is accessed within the vector. The index functions of MatVec are \(\mathfrak{idx}_{1,1}(i,k):=(i,k)\) (used to access the matrix) and \(\mathfrak{idx}_{2,1}(i,k):=(k)\) (used for the vector).
In contrast, for Jacobi1D (Figure 2), we use \(B:=1\) because Jacobi1D has vector \(v\) as its only input, and we use \(A_{1}:=3\) because the vector is accessed three times to compute an individual MDA element at arbitrary position \(i\): first access \(v[i+0]\), second access \(v[i+1]\), and third access \(v[i+2]\). The index functions of Jacobi1D are \(\mathfrak{idx}_{1,1}(i):=(i+0)\), \(\mathfrak{idx}_{1,2}(i):=(i+1)\), and \(\mathfrak{idx}_{1,3}(i):=(i+2)\).
Figures 10 and 11 use the examples MatVec and Jacobi1D to informally illustrate how function inp_view uses index functions to compute input views. In the two figures, we use domain-specific identifiers for better clarity: in the case of MatVec, we use for its two input BUFs the identifiers \(M\) and \(v\) instead of \(\mathfrak{b}_{1}\) and \(\mathfrak{b}_{2}\), as well as identifiers \(i\) and \(k\) instead of \(i_{1}\) and \(i_{2}\) for index variables; for Jacobi1D, we use identifier \(v\) instead of \(\mathfrak{b}_{1}\) and \(i\) instead of \(i_{1}\).
Fig. 10.
Fig. 11.
Higher-order function inp_view takes as input a collection of index functions of types IDX-FCT, and it computes an input view of type IV (Definition 6) based on the index functions, as illustrated in Figures 10 and 11.
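The following Python sketch illustrates, under nested-list and closure assumptions of our own, how inp_view-style index functions determine an input MDA for MatVec and Jacobi1D; it mirrors Figures 10 and 11 and is not the formal definition of inp_view. The returned MDA is represented as an element-access function, in the spirit of the alias-based implementation of MDAs mentioned above:

```python
# Illustrative sketch: build an input MDA (as an element-access function) from
# per-BUF lists of index functions idx_{b,1}, ..., idx_{b,A_b}.

def inp_view(*buf_index_fcts):
    def view(*bufs):
        def mda_element(*i):
            element = []
            for buf, idx_fcts in zip(bufs, buf_index_fcts):
                for idx in idx_fcts:
                    e = buf
                    for j in idx(*i):   # apply the computed indices one dimension at a time
                        e = e[j]
                    element.append(e)
            return tuple(element)
        return mda_element
    return view

# MatVec: B = 2 BUFs, one access each, index functions (i,k)->(i,k) and (i,k)->(k)
matvec_mda = inp_view([lambda i, k: (i, k)], [lambda i, k: (k,)])(
    [[1, 2], [3, 4]], [10, 20])
print(matvec_mda(0, 1))   # (2, 20) == (M[0][1], v[1])

# Jacobi1D: B = 1 BUF, three accesses: i+0, i+1, i+2
jacobi_mda = inp_view([lambda i: (i,), lambda i: (i + 1,), lambda i: (i + 2,)])(
    [5, 6, 7, 8])
print(jacobi_mda(1))      # (6, 7, 8) == (v[1], v[2], v[3])
```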
Note that function inp_view is not capable of computing every kind of input view function (Definition 6). For example, inp_view cannot be used for computing MDAs that are required for expressing computations on sparse data formats [Hall, 2020], because such MDAs require dynamic accesses to BUFs. This limitation of inp_view can be relaxed by generalizing our index functions toward taking additional, dynamic input arguments, which we consider as future work (as outlined in Section 8).
2.3.2 Output Views.
An output view is the counterpart of an input view: in contrast to an input view which maps BUFs to an MDA, an output view maps an MDA to a collection of BUFs. In the following, we define output views, and we introduce higher-order function out_view which computes output views in a structured manner (analogously to function inp_view for input views).
Figures 12 and 13 illustrate output views informally using the examples transposed Matrix Multiplication and Double Reduction.
Fig. 12.
Fig. 13.
In the case of transposed matrix multiplication (Figure 12), the computed output MDA (the computation of matrix multiplication is presented later and not relevant for our following considerations) is stored via an output view as a matrix in a transposed fashion, using index function \((i,j,0)\mapsto(j,i)\). Here, the MDA’s third dimension (accessed via index \(0\)) represents the so-called reduction dimension of matrix multiplication, and it contains only one element after the computation, as all elements in this dimension are combined via addition.
For double reduction (Figure 13), we combine the elements within the vector twice—once using operator \(\oplus\) (e.g., \(\oplus=+\) addition) and once using operator \(\otimes\) (e.g., \(\otimes=*\) multiplication). The final outcome of double reduction is a singleton MDA containing a pair of two elements that represent the combined vector elements (e.g., the elements’ sum and product). We store this MDA via an output view as two individual scalar values, using index functions \((0)\mapsto()\)11 for both pair elements.
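For illustration, a minimal Python sketch of the double-reduction output view (the dict-based BUF representation and names O1/O2 are assumptions made for this sketch):

```python
# Sketch of the double-reduction output view from Figure 13: the final singleton MDA
# holds a pair (sum, product), which the output view stores as two scalar output BUFs.

def out_view_double_reduction(out_mda):
    s, p = out_mda[0]              # singleton MDA at index 0 contains the pair
    return {"O1": s, "O2": p}      # two individual scalar output BUFs

print(out_view_double_reduction([(42, 720)]))   # {'O1': 42, 'O2': 720}
```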
We define higher-order function out_view formally as follows.
2.3.3 Relation between View Functions.
We use view functions to transform data from their domain-specific representation (represented in our formalism as BUFs, Definition 5) to our internal, MDA-based representation (via input views) and back (via output views), as also illustrated in Figure 5. In our implementation presented later, we aim to access data uniformly in the form of MDAs, thereby being independent of domain-specific data representations. However, we aim to store the data physically in the domain-specific format, as this format is usually the more efficient representation. For example, we aim to store the input data of MatVec in the domain-specific matrix and vector format, rather than as an MDA, because the input MDA of MatVec contains many redundancies: each vector element appears once per row of the input matrix (as illustrated in Figure 10).
The following lemma proves that functions inp_view and out_view are invertible and that they are each other’s inverses. Consequently, the lemma shows how we can arbitrarily switch between the domain-specific and our MDA-based representation, and consequently also that we can implicitly identify MDAs with the domain-specific data representation. For example, for computing MatVec, we will specify the computations via pattern md_hom which operates on MDAs (see Figure 5), but we use the view functions in our implementation to implicitly forward the MDA accesses to the physically stored BUF representation.
The following figure illustrates the lemma using as example the inverse of MatVec’s input view (shown in Figure 10):
2.4 Examples
Figure 14 shows how our high-level representation is used for expressing different kinds of popular data-parallel computations. For brevity, we state only the index functions, scalar function, and combine operators of the higher-order functions; an expression as in Figure 6 is then obtained by straightforwardly inserting these building blocks into the higher-order functions.
Fig. 14.
Subfigure 1.
We show how our high-level representation is used for expressing linear algebra routines: (1) Dot (Dot Product); (2) MatVec (Matrix-Vector Multiplication); (3) MatMul (Matrix Multiplication); (4) \(\texttt{MatMul}^{\texttt{T}}\) (Transposed Matrix Multiplication) which computes matrix multiplication on transposed input and output matrices; (5) bMatMul (batched Matrix Multiplication) where multiple matrix multiplications are computed using matrices of the same sizes.
We can observe from the subfigure that our high-level expressions for the routines naturally evolve from each other. For example, the md_hom instance for MatVec differs from the md_hom instance for Dot by only containing a further concatenation dimension ++ for its \(i\) dimension. We consider this close relation between the high-level expressions of MatVec and Dot in our approach as natural and favorable, as MatVec can be considered as computing Dot multiple times—one computation of Dot for each value of MatVec’s \(i\) dimension. Similarly, the md_hom instance for MatMul is very similar to the expression of MatVec, containing a further concatenation dimension for MatMul’s \(j\) dimension. The same applies to bMatMul: its md_hom instance is the expression of MatMul augmented with one further concatenation dimension.
Regarding \(\texttt{MatMul}^{\texttt{T}}\), the basic computation parts of \(\texttt{MatMul}^{\texttt{T}}\) and MatMul are the same, which is exactly reflected in our formalism: both \(\texttt{MatMul}^{\texttt{T}}\) and MatMul are expressed using exactly the same md_hom instances. The difference between \(\texttt{MatMul}^{\texttt{T}}\) and MatMul lies only in the data accesses—transposed accesses in the case of \(\texttt{MatMul}^{\texttt{T}}\) and non-transposed accesses in the case of MatMul. Data accesses are expressed in our formalism, in a structured way, via view functions (as discussed in Section 2.3): for example, for \(\texttt{MatMul}^{\texttt{T}}\), we use for its first input matrix \(A\) the index function \((i,j,k)\mapsto(k,i)\) for transposed access, instead of using index function \((i,j,k)\mapsto(i,k)\) as for MatMul’s non-transposed accesses.
Note that all md_hom instances in the subfigure are well defined according to Lemma 1.
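The following schematic Python listing summarizes how these md_hom instances evolve from each other (the tuple encoding and names CC/PW_ADD are illustrative only, not our formal notation):

```python
# Schematic summary: the md_hom instances of Subfigure 1, given as
# (scalar function, combine operators); CC = concatenation ++, PW_ADD = point-wise +.

mul = lambda a, b: a * b   # scalar function shared by all routines below

dot       = (mul, ("PW_ADD",))                    # dimension: k
mat_vec   = (mul, ("CC", "PW_ADD"))               # dimensions: i, k
mat_mul   = (mul, ("CC", "CC", "PW_ADD"))         # dimensions: i, j, k
b_mat_mul = (mul, ("CC", "CC", "CC", "PW_ADD"))   # dimensions: b, i, j, k

# MatMul and MatMul^T share the same md_hom instance; they differ only in the index
# functions of their view functions, e.g., for the first input matrix A:
mat_mul_A_idx   = lambda i, j, k: (i, k)   # non-transposed access
mat_mul_T_A_idx = lambda i, j, k: (k, i)   # transposed access
```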
Subfigure 2.
We show how convolution-style stencil computations are expressed in our high-level representation: (1) Conv2D expresses a standard convolution that uses a two-dimensional sliding window [Podlozhnyuk, 2007]; (2) MCC expresses a so-called Multi-Channel Convolution [Dumoulin and Visin, 2018]—a generalization of Conv2D that is heavily used in the area of deep learning; (3) MCC_Capsule is a recent generalization of MCC [Hinton et al., 2018] which attracted high attention due to its relevance for advanced deep learning neural networks [Barham and Isard, 2019].
While our md_hom instances for convolutions are quite similar to those of linear algebra routines (they all use multiplication \(*\) as scalar function and a mix of concatenations ++ and point-wise additions \(+\) as combine operators), the index functions used for the view functions of convolutions are notably different from those used for linear algebra routines: the index functions of convolutions contain arithmetic expressions (e.g., p+r and q+s) and thus access neighboring elements in their input—a typical access pattern in stencil computations that requires special optimizations [Hagedorn et al., 2018]. Moreover, convolution-style computations are often high-dimensional (e.g., \(10\) dimensions in the case of MCC_Capsule), whereas linear algebra routines usually rely on fewer dimensions. Our experiments in Section 5 confirm that respecting the data access patterns and the high dimensionality of convolutions in the optimization process (as in our approach, which we discuss later) often achieves significantly higher performance than using optimizations tailored toward linear algebra routines, as in vendor libraries provided by NVIDIA and Intel for convolutions [Li et al., 2016].
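For illustration, index functions of a Conv2D-style computation might look as follows (a sketch assuming the standard sliding-window access pattern; the exact MDA dimensions used in Subfigure 2 may differ):

```python
# Illustrative Conv2D-style index functions containing arithmetic expressions (p+r, q+s):
conv2d_image_idx  = lambda p, q, r, s: (p + r, q + s)  # neighboring accesses into the image
conv2d_filter_idx = lambda p, q, r, s: (r, s)          # accesses into the filter weights
```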
Subfigure 3.
We show how quantum chemistry computation Coupled Cluster (CCSD(T)) [Kim et al., 2019] is expressed in our high-level representation. The computation of CCSD(T) notably differs from those of linear algebra routines and convolution-style stencils, by accessing its high-dimensional input data in sophisticated transposed fashions: for example, the view function of CCSD(T)’s instance one (denoted as I1 in the subfigure) uses indices a and b to access the last two dimensions of its \(A\) input tensor (rather than the first two dimensions of the tensor, as would be the case for non-transposed accesses). For brevity, the subfigure presents only two CCSD(T) instances—in our experiments in Section 5, we present experimental results for nine different real-world CCSD(T) instances.
Subfigures 4–6.
The subfigures present computations whose scalar functions and combine operators are different from those used in Subfigures 1–3 (which are straightforward multiplications \(*\), concatenations ++, and point-wise additions \(+\) only). For example, Jacobi stencils (Subfigure 4) use as scalar function the Jacobi-specific computation \(\texttt{J}_{\texttt{nD}}\) [Cecilia et al., 2012], and Probabilistic Record Linkage (PRL) [Christen, 2012], which is heavily used in data mining to identify duplicate entries in a database, uses both a PRL-specific scalar function wght and a PRL-specific combine operator \(\texttt{max}_{\texttt{PRL}}\) (point-wise combination via the PRL-specific binary operator \(\texttt{max}_{\texttt{PRL}}\)) [Rasch et al., 2019b]. Histograms, in their generalized version [Henriksen et al., 2020] (denoted as GenHisto in Subfigure 6), use an arbitrary, user-defined scalar function \(f\) and a user-defined associative and commutative combine operator \(\oplus\); the standard histogram variant Histo is then a particular instance of GenHisto, for \(\oplus=+\) (point-wise addition) and \(f=f_{\texttt{Histo}}\), where \(f_{\texttt{Histo}}(e,b)=1\) iff \(e=b\) and \(f_{\texttt{Histo}}(e,b)=0\) otherwise.
Subfigure 7.
We show how typical map and reduce patterns [González-Vélez and Leyton, 2010] are expressed in our high-level representation. Examples map(f) and reduce(\(\oplus\)) (discussed in Examples 3 and 4) are simple and thus straightforwardly expressed in our representation. In contrast, example reduce(\(\oplus,\otimes\)) is more complex and shows how reduce(\(\oplus\)) is extended to combine the input vector simultaneously twice—once combining vector elements via operator \(\oplus\) and once using operator \(\otimes\). The outcome of reduce(\(\oplus,\otimes\)) consists of two scalars—one representing the result of combination via \(\oplus\) and the other of combination via \(\otimes\)—which we map via the output view to output elements \(\texttt{O}_{\texttt{1}}\) (result of \(\oplus\)) and \(\texttt{O}_{\texttt{2}}\) (result of \(\otimes\)), correspondingly; this is also illustrated in Figure 13.
Subfigure 8.
We present prefix-sum computations [Blelloch, 1990] which differ from the computations in Subfigures 1–7 in terms of their combine operators: the operator used for expressing computations in Subfigure 8 is different from concatenation (Example 1) and point-wise combinations (Example 2). Computation scan(\(\oplus\)) uses as combine operator \(\mbox{++}_{\texttt{prefix-sum}}(\oplus)\) which computes prefix-sum [Gorlatch and Lengauer, 1997] (formally defined by Rasch [2024], Section B.9) according to binary operator \(\oplus\), and Maximum Bottom Box Sum (MBBS) [Farzan and Nicolet, 2019] uses a particular instance of prefix-sum for \(\oplus=+\) (addition).
3 Low-Level Representation for Data-Parallel Computations
We introduce our low-level representation for expressing data-parallel computations. In contrast to our high-level representation, our low-level representation explicitly expresses the de-composition and re-composition of computations (informally illustrated in Figure 3). Moreover, our low-level representation is designed such that it can be straightforwardly transformed to executable program code, because it explicitly captures and expresses the optimizations for the memory and core hierarchy of the target architecture.
In the following, after briefly discussing an introductory example in Section 3.1, we introduce in Section 3.2 our formal representation of computer systems, which we refer to as Abstract System Model (ASM). Based on this model, we define low-level MDAs, low-level BUFs, and low-level combine operators in Section 3.3, which are basic building blocks of our low-level representation.
Note that all details and concepts discussed in this section are not exposed to the end users of our system and are therefore transparent to them: expressions in our low-level representation are generated fully automatically for the user, from expressions in our high-level representation (Figure 4), according to the methodologies presented later in Section 4 and auto-tuning [Rasch et al., 2021].
3.1 Introductory Example
Figure 15 illustrates our low-level representation by showing how MatVec (Matrix-Vector Multiplication) is expressed in our representation. In our example, we use an input matrix \(M\in T^{512\times 4096}\) of size \(512\times 4096\) (size chosen arbitrarily) that has an arbitrary but fixed scalar type \(T\in\texttt{TYPE}\); the input vector \(v\in T^{4096}\) is of size \(4096\), correspondingly.
Fig. 15.
For better illustration, we consider for this introductory example a straightforward, artificial target architecture that has only two memory layers—Host Memory (HM) and Cache Memory (L1)—and one Core Layer (COR) only; our examples presented and discussed later in this section target real-world architectures (e.g., CUDA-capable NVIDIA GPUs). The particular values of tuning parameters (discussed in detail later in this section), such as the number of threads and the order of combine operators, are chosen by hand for this example and as straightforward for simplicity.
Our low-level representation works in three phases: (1) the de-composition phase (steps 1–7, in the right part of Figure 15), (2) the scalar phase (step 8, bottom part of the figure), and (3) the re-composition phase (steps 9–15, left part). Steps are arranged from right to left, inspired by the application order of function composition.
(1) De-Composition Phase.
The de-composition phase (steps 1–7 in Figure 15) partitions input MDA \({{}^{\downarrow}}{\mathfrak{a}}\) (in the top right of Figure 15) into the structure \({{}^{\downarrow}}\mathfrak{a}_{f}^{ \lt \dotsc \gt }\) (bottom right) which we refer to as low-level MDA and define formally in the next subsection. The low-level MDA represents a partitioning of MDA \({{}^{\downarrow}}{\mathfrak{a}}\) (a.k.a. hierarchical, multi-dimensional tiling in programming), where each particular choice of indices \(p^{1}_{1}\in[0,2)_{\mathbb{N}_{0}}\), \(p^{1}_{2}\in[0,4)_{\mathbb{N}_{0}}\), \(p^{2}_{1}\in[0,8)_{\mathbb{N}_{0}}\), \(p^{2}_{2}\in[0,16)_{\mathbb{N}_{0}}\), \(p^{3}_{1}\in[0,32)_{\mathbb{N}_{0}}\), \(p^{3}_{2}\in[0,64)_{\mathbb{N}_{0}}\) refers to an MDA that represents an individual part of MDA \({{}^{\downarrow}}{\mathfrak{a}}\) (a.k.a. tile in programming—informally illustrated in Figure 7). The partitions are arranged on multiple layers (indicated by the \(p\)’s superscripts) and in multiple dimensions (indicated by subscripts)—as illustrated in Figure 16—according to the memory and core layers of the target architecture and the dimensions of the MDH computation: we partition for each of the target architecture’s three layers (HM, L1, COR) and in each of the two dimensions of the MDH (dimensions 1 and 2, as we use example MatVec in Figure 15, which represents a two-dimensional MDH computation). Consequently, our partitioning approach allows efficiently exploiting each particular layer of the target architecture (both memory and core layers) and optimizing for both dimensions of the target computation (in the case of MatVec, the \(i\)-dimension and the \(k\)-dimension—see Figure 1), allowing fine-grained optimizations.
Fig. 16.
We compute the partitionings of MDAs by applying the concatenation operator (Example 1) inversely (indicated by using \(=:\) instead of \(:=\) in the top right part of Figure 15). For example, we partition in Figure 15 MDA \({{}^{\downarrow}}{\mathfrak{a}}\) first via the inverse of \(\mbox{++}^{\texttt{(HM,x)}}_{1}\) in dimension 1 (indicated by the subscript 1 of \(\mbox{++}^{\texttt{(HM,x)}}_{1}\); the superscript (HM,x) is explained later) into two parts, as \(p^{1}_{1}\) iterates over interval \([0,2)_{\mathbb{N}_{0}}=\{0,1\}\) which consists of two elements (\(0\) and 1)—the interval is chosen arbitrarily for this example. Afterward, each of the obtained parts is further partitioned, in the second dimension, via \(\mbox{++}^{\texttt{(HM,y)}}_{2}\) into four parts (\(p^{1}_{2}\) iterates over \([0,4)_{\mathbb{N}_{0}}=\{0,1,2,3\}\) which consists of four elements). The \({(2*4)}\)-many HM parts are then each further partitioned in both dimensions for the COR layer into \((8*16)\) parts, and each individual COR part is again partitioned for the L1 layer into \((32*64)\) parts, resulting in \((2*8*32)*(4*16*64)=512*4096\) parts in total.
We always use a full partitioning in our low-level expressions,12 i.e., each particular choice of indices \(p^{1}_{1}\), \(p^{1}_{2}\), \(p^{2}_{1}\), \(p^{2}_{2}\), \(p^{3}_{1}\), \(p^{3}_{2}\) points to an MDA that contains a single element only (in Figure 16, the individual elements are denoted via symbol \(\times\), in the bottom part of the figure). By relying on a full partitioning, we can apply scalar function \(f\) to the fully partitioned MDAs later in the scalar phase (described in the next paragraph). This is because function \(f\) is defined on scalar values (Definition 4) to make defining scalar functions more convenient for the user (as discussed in Section 2.2).
The superscript of combine operators, e.g., (COR,x) of operator \(\mbox{++}^{\texttt{(COR,x)}}_{1}\), is a so-called operator tag (formal definition given in the next section). Such a tag indicates to our code generator whether its combine operator is assigned to a memory layer (and thus computed sequentially in our generated code) or to a core layer (and thus computed in parallel). For example, tag (COR,x) indicates that parts processed by operator \(\mbox{++}^{\texttt{(COR,x)}}_{1}\) should be computed by the cores of layer COR, and thus in parallel; the dimension tag x indicates that the COR layer’s x dimension should be used for computing the operator (we use dimension x for our example architecture as an analogous concept to CUDA’s thread/block dimensions x,y,z for GPU architectures [NVIDIA, 2022g]), as we also discuss in the next section. In contrast, tag (HM,x) refers to a memory layer (HM) and thus, operator \(\mbox{++}^{\texttt{(HM,x)}}_{1}\) is computed sequentially. Since the current state-of-practice programming approaches (OpenMP, CUDA, OpenCL, \(\dotsc\)) have no explicit notion of memory tiles (e.g., by offering the potential variables tileIdx.x/tileIdx.y/tileIdx.z, as analogous concepts to CUDA variables threadIdx.x/threadIdx.y/threadIdx.z), the dimension tag x in (HM,x) is currently ignored by our code generator, because HM refers to a memory layer.
Note that the number of parts (e.g., 2 parts on layer 1 in dimension 1, and \(4\) parts on layer 1 in dimension 2 \(\dotsc\)), the combine operators’ tags, and our partition order (e.g., first partitioning in MDA’s dimension 1 and afterwards in dimension 2) are chosen arbitrarily for this example. These choices are critical for performance and should be optimized13 for a particular target architecture and characteristics of the input and output data (size, memory layouts, etc.), as we discuss in detail later in this section.
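The following Python sketch makes the full partitioning of Figure 15 concrete: it maps a per-layer tile index tuple to the single MDA element it selects and checks that the partition counts cover the \(512\times 4096\) input MDA (illustrative index arithmetic for a block-wise partition order, not generated code):

```python
# Full, hierarchical partitioning from Figure 15: dimension 1 is split into
# 2 (HM) x 8 (COR) x 32 (L1) parts, dimension 2 into 4 x 16 x 64 parts, so that each
# index tuple (p1_1, p1_2 | p2_1, p2_2 | p3_1, p3_2) selects exactly one MDA element.

PARTS_DIM1 = (2, 8, 32)    # parts per layer (HM, COR, L1) in dimension 1
PARTS_DIM2 = (4, 16, 64)   # parts per layer (HM, COR, L1) in dimension 2

def element_index(parts, tile_indices):
    """Map per-layer tile indices to the flat index of the single selected element."""
    i = 0
    for num_parts, p in zip(parts, tile_indices):
        i = i * num_parts + p
    return i

# The partitioning is "full": 2*8*32 = 512 and 4*16*64 = 4096 cover the 512 x 4096 input MDA.
assert element_index(PARTS_DIM1, (1, 7, 31)) == 511
assert element_index(PARTS_DIM2, (3, 15, 63)) == 4095
```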
(2) Scalar Phase.
In the scalar phase (step 8 in Figure 15), we apply the MDH’s scalar function \(f\) to the individual MDA elements.
In the figure, \(\vec{f}\) is the slight adaptation of function \(f\) that operates on a singleton MDA, rather than a scalar (see Footnote 9).
Annotation \(\rightarrow\)\(\lt\) (1,2), \(\dotsc\)\(\gt\) indicates the order in which the scalar function is applied (in this example, first iterating over \(p^{1}_{1}\), then over \(p^{1}_{2}\), etc.), and we use annotation \(\rightarrow\)\(\lt\) (HM,x), \(\dotsc\)\(\gt\) to indicate how the scalar computation is assigned to the target architecture (this is described in detail later in this section). Annotations \(\rightarrow\) M: HM, v: L1 and \(\rightarrow\) w: L1 (in the bottom part of Figure 15) indicate the memory regions to be used for reading the input scalars and writing the output scalar of function \(f\) (also described later in detail).
(3) Re-Composition Phase.
Finally, the re-composition phase (steps 9–15 in Figure 15) combines the computed parts \({{}^{\uparrow}}\mathfrak{a}_{f}^{ \lt p^{1}_{1},p^{1}_{2}\ |\ p^{2}_{1},p^{2}_{2}\ |\ p^{3}_{1},p^{3}_{2} \gt }\) (bottom left in the figure) to the final result \({{}^{\uparrow}}{\mathfrak{a}}\) (top left) via the MDH’s combine operators, which are in the case of matrix-vector multiplication \(\circledast_{1}:=\mbox{++}\) (concatenation) and \(\circledast_{2}:=+\) (point-wise addition). In this example, we first combine the L1 parts in dimension 2 and then in dimension 1; afterward, we combine the COR parts in both dimensions, and finally the HM parts. Analogously to before, this order of combine operators and their tags are chosen arbitrarily for this example and should be auto-tuned for high performance.
In the de- and re-composition phases, the arrow notation below combine operators allows efficiently exploiting the architecture’s memory hierarchy, by indicating the memory region to read from (de-composition phase) or to write to (re-composition phase); the annotations also indicate the memory layouts to use. We exploit this memory and layout information in both (1) our code generation process to assign combine operators’ input and output data to memory regions and to choose memory layouts for the data (row major, column major, etc.); (2) our formalism to specify constraints of programming models, e.g., that in CUDA, results of GPU cores can only be combined in designated memory regions [NVIDIA, 2022f]. For example, annotation \(\rightarrow\) M: HM[1,2], v: L1[1] below an operator in the de-composition phase indicates to our code generator that the parts (a.k.a. tiles) of matrix \(M\) used for this computation step should be read from the HM memory region and that parts of vector \(v\) should be copied to and accessed from fast L1 memory. The annotation also indicates that M should be stored using a row-major memory layout (as we use [1,2] and not [2,1]). The memory regions and layouts are chosen arbitrarily for this example and should be chosen as optimized (auto-tuned) for the particular target architecture and characteristics of the input and output data. Formally, the arrow notation of combine operators is a concise notation to hide MDAs and BUFs for intermediate results (discussed by Rasch [2024], Section C.3, for the interested reader).
Our low-level expressions can be straightforwardly transformed to executable program code in imperative-style programming languages (such as OpenMP, CUDA, and OpenCL). As code generation is not the focus of this work, we outline our code generation process briefly using the example of Figure 15. Details about our code generation are provided by Rasch [2024], Section E and will be presented and illustrated in detail in our future work.
We implement combine operators as sequential or parallel loops. For example, the operator \(\mbox{++}^{\texttt{(HM,x)}}_{1}\) is assigned to memory layer HM and thus implemented as a sequential loop (loop range indicated by \([0,2)_{\mathbb{N}_{0}}\)), and operator \(\mbox{++}^{\texttt{(COR,x)}}_{1}\) is assigned to COR and thus implemented as a parallel loop (e.g., a loop annotated with #pragma omp parallel for in OpenMP [OpenMP, 2022], or variable threadIdx.x in CUDA [NVIDIA, 2022g]). Correspondingly, our three phases (de-composition, scalar, and re-composition) each correspond to an individual loop nest; we generate the nests as fused when the tags of combine operators have the same order across phases, as in Figure 15. Note that our currently targeted programming models (OpenMP, CUDA, and OpenCL) have no explicit notion of tiles, e.g., by offering the potential variable tileIdx.x for managing tiles automatically in the programming model (similarly to how variable threadIdx.x automatically manages threads in CUDA). Consequently, when the operator tag refers to a memory layer, the dimension information within tags is currently ignored by our code generator (such as dimension x in tag (HM,x) which refers to memory layer HM).
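For illustration, the fused loop nest corresponding to Figure 15 might look as follows, written here as plain sequential Python (the comments note which loops belong to a core layer and would therefore be parallel in generated OpenMP/CUDA code; in real generated code, partial results of parallel partitions in the \(k\)-dimension are combined in the re-composition phase rather than by direct accumulation):

```python
# Illustrative fused loop nest for the low-level MatVec expression of Figure 15.

import numpy as np

M = np.arange(512 * 4096, dtype=np.int64).reshape(512, 4096)
v = np.ones(4096, dtype=np.int64)
w = np.zeros(512, dtype=np.int64)

for p1_1 in range(2):                       # layer HM,  dimension 1 -> sequential loop
    for p1_2 in range(4):                   # layer HM,  dimension 2 -> sequential loop
        for p2_1 in range(8):               # layer COR, dimension 1 -> parallel in generated code
            for p2_2 in range(16):          # layer COR, dimension 2 -> parallel in generated code
                for p3_1 in range(32):      # layer L1,  dimension 1 -> sequential loop
                    for p3_2 in range(64):  # layer L1,  dimension 2 -> sequential loop
                        i = (p1_1 * 8 + p2_1) * 32 + p3_1   # element index in dimension 1
                        k = (p1_2 * 16 + p2_2) * 64 + p3_2  # element index in dimension 2
                        w[i] += M[i, k] * v[k]              # scalar function f on one element

assert np.array_equal(w, M.sum(axis=1))     # w == M @ v for v of all ones
```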
Operators’ memory regions correspond to straightforward allocations (e.g., in CUDA’s device, shared, or register memory [NVIDIA, 2022g], according to the arrow annotations in our low-level expression). Memory layouts are implemented as aliases, e.g., preprocessor directives such as #define M(i,k) M[k][i] for storing MatVec’s input matrix \(M\) as transposed.
We implement MDAs also as aliases (according to Definition 7), e.g., #define inp_mda(i,k) M[i][k],v[k] for MatVec’s input MDA.
Code optimizations that are applied on a lower abstraction level than proposed by our representation in Example 15 are beyond the scope of this work and outlined by Rasch [2024], Section F, e.g., loop fusion and loop unrolling which are applied on the loop-based abstraction level.
We provide an open source MDH compiler for code generation [MDH Project, 2024]. Our compiler takes as input a high-level MDH expression (as in Figure 6), in the form of a Python program, and it fully automatically generates auto-tuned program code from it.
In the following, we introduce in Section 3.2 our formal representation of a computer system (which can be a single device, but also a multi-device or a multi-node system, as we discuss soon), and we illustrate our formal system representation using the example architectures targeted by programming models OpenMP, CUDA, and OpenCL. Afterward, in Section 3.3, we formally define the basic building blocks of our low-level representation—low-level MDAs, low-level BUFs, and low-level combine operators—based on our formal system representation.
3.2 ASM
Our ASM representation is capable of modeling architectures with arbitrarily deep memory and core hierarchies15: NUM_MEM_LYRs denotes the target architecture’s number of memory layers and NUM_COR_LYRs the architecture’s number of core layers, correspondingly. For example, the artificial architecture we use in Figure 15 is represented as an ASM instance as follows (bar symbols denote set cardinality):
The instance is a pair consisting of the numbers 2 and 1 which represent the artificial architecture’s two memory layers (HM and L1) and its single COR.
OpenMP is often used to target \((3+1)\)-layered architectures which rely on three memory regions (main memory MM and caches L2 and L1) and one core layer (COR). OpenMP-compatible architectures sometimes also contain the L3 memory region, and they may allow exploiting Single-Instruction-Multiple-Data parallelization (a.k.a. vectorization [Klemm et al., 2012]); both are expressed in our ASM representation as an additional memory or core layer, respectively.
CUDA's target architectures are \((3+2)\)-layered: they consist of Device Memory (DM), Shared Memory (SM), and Register Memory (RM), and they offer as cores so-called Streaming Multiprocessors (SMX) which themselves consist of CUDA Cores (CC). CUDA also has an implicit notion of so-called Warps (WRP) which are not explicitly represented in the CUDA programming model [NVIDIA, 2022g], but are often exploited by programmers—via special intrinsics (e.g., shuffle and tensor core intrinsics [NVIDIA, 2017, 2018])—to achieve highest performance.
OpenCL-compatible architectures are designed analogously to those targeted by CUDA; consequently, both OpenCL- and CUDA-compatible architectures are represented by the same ASM instance in our formalism. Apart from straightforward syntactical differences between OpenCL and CUDA [StreamHPC, 2016], we see as the main differences between the two programming models (from our ASM-based abstraction level) that OpenCL has no notion of warps, and it uses a different terminology—Global/Local/Private Memory (GM/LM/PM) instead of device/shared/register memory, and Compute Unit (CU) and Processing Element (PE), rather than SMX and CC.
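For illustration only, such ASM instances can be thought of as plain pairs of layer counts; the following hypothetical C encoding (not part of our formalism or implementation) summarizes the instances discussed above:

```c
#include <stdio.h>

/* an ASM instance as a pair of layer counts (illustrative encoding only) */
typedef struct {
    int num_mem_lyrs;   /* NUM_MEM_LYRs: number of memory layers */
    int num_cor_lyrs;   /* NUM_COR_LYRs: number of core layers   */
} ASM;

/* (3+1)-layered OpenMP target: MM, L2, L1 and one core layer                */
static const ASM ASM_OpenMP = { 3, 1 };
/* (3+2)-layered CUDA/OpenCL targets: DM/SM/RM (GM/LM/PM) and SMX/CC (CU/PE) */
static const ASM ASM_CUDA   = { 3, 2 };
static const ASM ASM_OpenCL = { 3, 2 };
/* CUDA with the implicit warp level modeled as an additional core layer     */
static const ASM ASM_CUDA_WRP = { 3, 3 };

int main(void) {
    printf("CUDA: %d memory layers, %d core layers\n",
           ASM_CUDA.num_mem_lyrs, ASM_CUDA.num_cor_lyrs);
    return 0;
}
```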
In the following, we consider memory regions and cores of ASM-represented architectures as arrangeable in an arbitrary number of dimensions. Programming models for such architectures often have native support for such arrangements. For example, in the CUDA model, memory is accessed via arrays which can be arbitrary-dimensional (a.k.a. multi-dimensional C arrays), and cores are programmed in CUDA via threads which are arranged in CUDA's so-called dimensions x, y, z; further thread dimensions can be explicitly programmed in CUDA, e.g., by embedding them in the last dimension z.
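To make the embedding of additional thread dimensions concrete, the following plain-C sketch (not CUDA code; the sizes SIZE_A and SIZE_B are hypothetical) shows how two logical dimensions can be packed into and recovered from a single dimension, as one would do with CUDA's last dimension z:

```c
#include <stdio.h>

/* hypothetical sizes of two logical dimensions packed into one */
enum { SIZE_A = 5, SIZE_B = 7 };

int main(void) {
    /* z plays the role of CUDA's last thread dimension */
    for (int z = 0; z < SIZE_A * SIZE_B; ++z) {
        int a = z / SIZE_B;   /* recover the first embedded dimension  */
        int b = z % SIZE_B;   /* recover the second embedded dimension */
        printf("z=%2d -> (a=%d, b=%d)\n", z, a, b);
    }
    return 0;
}
```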
We express constraints of programming models—for example, that in CUDA, SMX can combine their results in DM only [NVIDIA, 2022f]—via so-called tuning parameter constraints, which we discuss later in this section.
Note that we call our abstraction Abstract System Model (rather than Abstract Architecture Model, or the like), because it can also represent systems consisting of multiple devices and/or nodes, and so on. For example, our ASM representation of a multi-GPU system is
\begin{align*}(\,\texttt{NUM\_MEM\_LYRs}\,,\,\texttt{NUM\_COR\_LYRs}\,) = (3+1\,,\,2+1) = (4,3).\end{align*}
It extends our ASM-based representation of CUDA devices (Example 11) by HM which represents the memory region of the system containing the GPUs (and in which the intermediate results of different GPUs are combined), and it introduces the further core layer \(\texttt{COR}_{\texttt{GPU}}\) representing the system's GPUs. Analogously, our ASM representation of a multi-node, multi-GPU system is
\begin{align*}(\,\texttt{NUM\_MEM\_LYRs}\,,\,\texttt{NUM\_COR\_LYRs}\,) = (4+1\,,\,3+1) = (5,4).\end{align*}
It adds to \(\texttt{ASM}_{\texttt{Multi-GPU}}\) the memory layer Node Memory (NM) which represents the memory region of the host node, and it adds the core layer \(\texttt{COR}_{\texttt{Node}}\) (NOD) which represents the compute nodes. Our approach is currently designed for homogeneous systems, i.e., all devices/nodes/\(\dots\) are assumed to be identical. We aim to extend our approach to heterogeneous systems (which may consist of different devices/nodes/\(\dotsc\)) as future work, inspired by dynamic load balancing approaches [Chen et al., 2010].
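Continuing the illustrative encoding sketched above, the multi-device and multi-node systems each add one memory and one core layer to the CUDA instance; again, this is a hypothetical sketch and not our implementation:

```c
#include <stdio.h>

/* same illustrative encoding as in the sketch above */
typedef struct { int num_mem_lyrs; int num_cor_lyrs; } ASM;

/* multi-GPU system: CUDA's (3,2) plus host memory HM and the GPU core layer        */
static const ASM ASM_Multi_GPU = { 3 + 1, 2 + 1 };
/* multi-node, multi-GPU system: additionally node memory NM and the node layer NOD */
static const ASM ASM_Multi_Node_Multi_GPU = { 4 + 1, 3 + 1 };

int main(void) {
    printf("multi-GPU: (%d,%d), multi-node: (%d,%d)\n",
           ASM_Multi_GPU.num_mem_lyrs, ASM_Multi_GPU.num_cor_lyrs,
           ASM_Multi_Node_Multi_GPU.num_mem_lyrs, ASM_Multi_Node_Multi_GPU.num_cor_lyrs);
    return 0;
}
```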
3.3 Basic Building Blocks
We introduce the three main basic building blocks of our low-level representation: (1) low-level MDAs, which we use to partition MDAs and which represent multi-layered, multi-dimensionally arranged collections of ordinary MDAs (Definition 1)—one ordinary MDA per memory/core layer of their target ASM and for each dimension of the MDH computation (as illustrated in Figure 16); (2) low-level BUFs, which are collections of ordinary BUFs (Definition 5) that are augmented with a memory region and a memory layout; (3) low-level combine operators, which are combine operators (Definition 2) tagged with the layer and dimension of their target ASM that is used to compute the operator in our generated code (e.g., a core layer to compute the operator in parallel).
Next, we introduce low-level BUFs which behave similarly to BUFs (Definition 5), but are tagged with a memory region and a memory layout. While these tags have no effect on the operators' semantics, they later indicate to our code generator in which memory region the BUF should be stored and accessed, and which memory layout to choose for storing the BUF. Moreover, we use these tags to formally define constraints of programming models, e.g., that according to the CUDA specification [NVIDIA, 2022f], SMX cores can combine their results in memory region DM only.
Finally, we introduce low-level combine operators. We define such operators to behave the same as ordinary combine operators (Definition 2), but we additionally tag them with a layer of their target ASM. As for low-level BUFs, the tag has no effect on semantics, but it is used in our code generation process to assign the computation to the hardware (e.g., indicating that the operator is computed by either an SMX, WRP, or CC when targeting CUDA—see Example 11). Also, we use the tags to define model-specific constraints in our formalism (as also discussed for low-level BUFs). We additionally tag the combine operator with a dimension of the ASM layer, which later enables our optimization process to express advanced data access patterns (a.k.a. swizzles [Phothilimthana et al., 2019]). For example, when targeting CUDA, flexibly mapping the ASM dimensions of the CC layer (in CUDA terminology, the dimensions are called threadIdx.x, threadIdx.y, and threadIdx.z) to array dimensions enables well-performing coalesced global memory accesses [NVIDIA, 2022f] for both transposed and non-transposed data layouts, by only using different dimension tags.
Note that in Figure 15, for better readability, we use domain-specific identifiers for ASM layers: HM:=1 as an alias for the ASM layer that has id 1, L1:=2 for the layer with id 2, and COR:=3 for the layer with id \(3\). For dimensions, we use aliases \(x:=1\) for ASM dimension 1 and \(y:=2\) for ASM dimension 2, correspondingly.
4 Lowering: From High Level to Low Level
We have designed our formalism such that an expression in our high-level representation (as in Figure 6) can be systematically lowered to an expression in our low-level representation (as in Figure 15). For this, we parameterize our high-level representation, step by step, in tuning parameters; thereby, we obtain for concrete tuning parameter values a particular expression in our low-level representation—this is formally discussed and demonstrated by Rasch [2024], Section 4, for the interested reader. We choose optimized values of tuning parameters fully automatically via auto-tuning [Rasch et al., 2021]; Section 8 outlines alternative approaches for parameter selection.
Table 1 lists the tuning parameters of our lowering process—different values of tuning parameters lead to semantically equal expressions in our low-level representation (which is proven formally by Rasch [2024], Section 4), but the expressions will be translated to differently optimized code variants.
Table 1. Tuning Parameters of Our Low-Level Expressions
In the following, we explain the \(15\) tuning parameters in Table 1. We give our explanations in a general, formal setting that is independent of a particular computation and programming model. Dotted lines in Table 1 separate parameters for different phases: parameters D1–D4 customize the de-composition phase, parameters S1–S6 the scalar phase, and parameters R1–R4 the re-composition phase, correspondingly; the parameter 0 impacts all three phases (separated by a straight line in the table).
Our tuning parameters in Table 1 have constraints: (1) algorithmic constraints which have to be satisfied for all target programming models, and (2) model constraints which are specific to particular programming models only (CUDA-specific constraints, OpenCL-specific constraints, etc.), e.g., that the results of CUDA's thread blocks can be combined in designated memory regions only [NVIDIA, 2022f]. We discuss algorithmic constraints in the following, together with our tuning parameters; model constraints are discussed by Rasch [2024], Section C.1, for the interested reader.
Note that our parameters do not aim to introduce novel optimization techniques, but to unify, generalize, and combine well-proven optimizations, based on a formal foundation, toward an efficient, overall optimization process that applies to various combinations of data-parallel computations, architectures, and characteristics of input and output data (e.g., their size and memory layout).
In Table 1, we point to combine operators in Figure 15 using pairs \((l,d)\) to which we refer as MDH Levels. We use the pairs as an enumeration of operators in the de-composition and re-composition phases. For example, we use the pairs to say that the MDH computation is partitioned on level \((1,1)\) (i.e., layer \(l=1\), dimension \(d=1\)) into two parts, as in Figure 15.
Parameter 0.
Parameter \(\#\texttt{PRT}\) is a function that maps pairs in MDH-LVL to natural numbers; the parameter determines how much data are grouped together into parts in our low-level expression (and consequently also in our generated code later), by setting the particular number of parts (a.k.a. tiles) used in our expression. For example, in Figure 15, we use \(\#\texttt{PRT}(1,1):=2\) which causes combine operators \(\mbox{++}^{\texttt{(HM,x)}}_{1}\) and \(\circledast^{\texttt{(HM,x)}}_{1}\) to iterate over interval \([0,2)_{\mathbb{N}_{0}}\) (thus partitioning the MDH computation on level \((1,1)\) into two parts), and we use \(\#\texttt{PRT}(1,2):=4\) to let operators \(\mbox{++}^{\texttt{(HM,y)}}_{2}\) and \(\circledast^{\texttt{(HM,y)}}_{2}\) iterate over interval \([0,4)_{\mathbb{N}_{0}}\) (partitioning on level \((1,2)\) into four parts), and so on.
To ensure a full partitioning (so that we obtain singleton MDAs to which scalar function \(f\) can be applied in the scalar phase, as discussed above), we require the following algorithmic constraint for the parameter (\(N_{d}\) denotes the input size in dimension \(d\)):
\begin{align*}\prod_{l\in[1,L]_{\mathbb{N}}}\#\texttt{PRT}(l,d) = N_{d}\text{, for all } d\in[1,D]_{\mathbb{N}}.\end{align*}
In our generated code, the number of parts directly translates to the number of tiles which are computed either sequentially (a.k.a. cache blocking [Lam et al., 1991]) or in parallel, depending on the combine operators' tags (which are chosen via Parameters D2,S2,R2, as discussed soon). In our example from Figure 15, we process parts belonging to combine operators tagged with HM and L1 sequentially, via for-loops, because HM and L1 correspond to ASM's memory layers (note that Parameter 0 only chooses the number of tiles; the parameter has no effect on explicitly copying data into fast memory resources, which is the purpose of Parameters D3,R3,S1,S2). The COR parts are computed in parallel in our generated code, because COR corresponds to ASM's core layer, and thus, the number of COR parts determines the number of threads used in our code.
An optimized number of tiles is essential for achieving high performance [Bacon et al., 1994], e.g., due to its impact on locality-aware data accesses (number of sequentially computed tiles) and on efficiently exploiting parallelism (number of tiles computed in parallel, which corresponds to the number of threads in our generated code).
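As a worked example of the above constraint, the following C sketch checks the partitioning numbers suggested by the discussion of Figure 15 in this section (2, 8, and 32 parts in dimension 1 and 4, 16, and 64 parts in dimension 2, which yields the input sizes 512 and 4096); the array encoding is purely illustrative:

```c
#include <assert.h>
#include <stdio.h>

enum { NUM_LAYERS = 3, NUM_DIMS = 2 };

/* #PRT values as suggested by the discussion of Figure 15: prt[layer][dimension] */
static const int prt[NUM_LAYERS][NUM_DIMS] = { { 2, 4 }, { 8, 16 }, { 32, 64 } };
static const int N[NUM_DIMS] = { 512, 4096 };   /* input sizes N_1, N_2 */

int main(void) {
    for (int d = 0; d < NUM_DIMS; ++d) {
        int prod = 1;
        for (int l = 0; l < NUM_LAYERS; ++l)
            prod *= prt[l][d];
        assert(prod == N[d]);   /* prod_l #PRT(l,d) == N_d */
        printf("dimension %d: fully partitioned into %d parts\n", d + 1, prod);
    }
    return 0;
}
```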
Parameters D1,S1,R1.
These three parameters are permutations on MDH-LVL (indicated by symbol \(\hookrightarrow\mathrel{\mspace{-13.0mu }}\rightarrow\) in Table 1), determining when data are accessed and combined. The parameters specify the order of combine operators in the de-composition and re-composition phases (parameters D1 and R1), and the order of applying scalar function \(f\) to parts (parameter S1). Thereby, the parameters specify when parts are processed during the computation.
In our generated code, combine operators are implemented as sequential/parallel loops such that the parameters enable optimizing loop orders (a.k.a. loop permutation [McKinley et al., 1996]). For combine operators assigned to ASM's core layer (via parameter R2 discussed in the next paragraph) and thus computed in parallel, parameter R1 particularly determines when the computed results of threads are combined: if we used in the re-composition phase of Figure 15 the combine operators tagged with (COR,x) and (COR,y) immediately after applying scalar function \(f\) (i.e., in steps ⑩ and ⑪, rather than steps ⑫ and ⑬), we would combine the computed intermediate results of threads multiple times, repeatedly after each individual computation step of threads; using the two operators at the end of the re-composition phase (in steps ⑭ and ⑮) would combine the results of threads only once, at the end of the re-composition phase. Combining the results of threads early in the computation usually has the advantage of a reduced memory footprint, because memory needs to be allocated for one thread only, but at the cost of more computations, because the results of threads need to be combined multiple times. In contrast, combining the results of threads late in the computation reduces the amount of computations, but at the cost of a higher memory footprint. Our parameters make this tradeoff decision generic in our approach such that the decision can be left to an auto-tuning system, for example.
Note that each phase corresponds to an individual loop nest which we fuse together when parameters D1,S1,R1 (as well as parameters D2,S2,R2) coincide (as also outlined by Rasch [2024], Section F).
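To illustrate what the order parameters control, the following C sketch contrasts two loop orders for a sequential MatVec-like computation; the sizes and data are hypothetical, and both orders yield the same result while accessing and combining data at different times:

```c
#include <stdio.h>

enum { I = 4, K = 3 };
static float M[I][K] = { {1,2,3}, {4,5,6}, {7,8,9}, {10,11,12} };
static float v[K]    = { 1, 1, 1 };

int main(void) {
    float res1[I] = { 0 }, res2[I] = { 0 };

    /* order (i,k): the reduction dimension k is innermost ("late" combination) */
    for (int i = 0; i < I; ++i)
        for (int k = 0; k < K; ++k)
            res1[i] += M[i][k] * v[k];

    /* order (k,i): the reduction dimension k is outermost ("early" combination); */
    /* the result is the same, but data are accessed and combined at other times  */
    for (int k = 0; k < K; ++k)
        for (int i = 0; i < I; ++i)
            res2[i] += M[i][k] * v[k];

    printf("%f %f\n", res1[0], res2[0]);
    return 0;
}
```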
Parameters D2,S2,R2.
These parameters (symbol \(\hookrightarrow\mathrel{\mspace{-13.0mu }}\rightarrow\) in the table denotes bijection) assign MDH levels to ASM levels, by setting the tags of low-level combine operators (Definition 14). Thereby, the parameters determine by whom data are processed (e.g., threads or for-loops), similar to the concept of bind in scheduling languages [Apache TVM Documentation, 2022a]. Consequently, the parameters determine which parts should be computed sequentially in our generated code and which parts in parallel. For example, in Figure 15, we use \(\leftrightarrow_{\downarrow}{}_{\!\texttt{-ass}}(2,1):=(\texttt{COR}, \texttt{x})\) and \(\leftrightarrow_{\downarrow}{}_{\!\texttt{-ass}}(2,2):=(\texttt{COR}, \texttt{y})\), thereby assigning the computation of MDA parts on layer 2 in both dimensions to ASM’s COR layer in the de-composition phase, which causes processing the parts in parallel in our generated code. For multi-layered core architectures, the parameters particularly determine the thread layer to be used for the parallel computation (e.g., block or thread in CUDA).
Using these parameters, we are able to flexibly set data access patterns in our generated code. In Figure 15, we assign parts on layer 2 to COR layers, which results in a so-called block access pattern of cores: we start \(8*16\) threads, according to the \(8*16\) core parts, and each thread processes a part of the input MDA representing a block of \(32\times 64\) MDA elements within the input data. If we had assigned in the figure the first computation layer to ASM’s COR layer (in the figure, this layer is assigned to ASM’s HM layer), we would start \(2*4\) threads and each thread would process MDA parts of size \((8*32)\times(16*64)\); assigning the last MDH layer to CORs would result in \((2*8*32)\times(4*16*64)\) threads, each processing a singleton MDA (a.k.a. strided access).
The parameters also enable expressing so-called swizzle access patterns [Phothilimthana et al., 2019]. For example, in CUDA, processing consecutive data elements in data dimension 1 by threads that are consecutive in thread dimension 2 (a.k.a. the threadIdx.y dimension in CUDA) can achieve higher performance due to the hardware design of fast memory resources in NVIDIA GPUs. Such swizzle patterns can be easily expressed and auto-tuned in our approach, for example, by interchanging tags (COR,x) and (COR,y) in Figure 15. For memory layers (such as HM and L1), the dimension tags x and y currently have no effect on our generated code, as the programming models we currently target (OpenMP, CUDA, and OpenCL) have no explicit notion of tiles. However, this might change in the future when targeting new kinds of programming models, e.g., for upcoming architectures.
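The difference between block and strided access patterns can be sketched in C/OpenMP as follows; this is a simplified, hypothetical illustration over a one-dimensional iteration space of size N with T threads, whereas our generated code additionally handles tiling and multiple dimensions:

```c
#include <stdio.h>
#include <omp.h>

enum { N = 16, T = 4 };

int main(void) {
    /* block access: thread t processes one contiguous block of N/T elements */
    #pragma omp parallel num_threads(T)
    {
        int t = omp_get_thread_num();
        for (int i = t * (N / T); i < (t + 1) * (N / T); ++i)
            printf("block:   thread %d -> element %d\n", t, i);
    }

    /* strided access: thread t processes every T-th element */
    #pragma omp parallel num_threads(T)
    {
        int t = omp_get_thread_num();
        for (int i = t; i < N; i += T)
            printf("strided: thread %d -> element %d\n", t, i);
    }
    return 0;
}
```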
Parameters D3,R3 and S3,S5.
Parameters D3 and R3 set for each BUF the memory region to be used, thereby determining where data are read from or written to, respectively. In the table, we use \(\texttt{ib}\in\mathbb{N}\) to refer to a particular input BUF (e.g., ib=1 to refer to the input matrix of matrix-vector multiplication, and ib=2 to refer to the input vector), and \(\texttt{ob}\in\mathbb{N}\) refers to an output BUF, correspondingly. Parameter D3 specifies the memory region to read from, and parameter R3 the region to write to. The set \(\texttt{MR}:=[1,\texttt{NUM_MEM_LYRs}]_{\mathbb{N}}\) denotes the ASM’s memory regions.
Similarly to parameters D3 and R3, parameters S3 and S5 set the memory regions for the input and output of scalar function \(f\).
Exploiting fast memory resources of architectures is a fundamental optimization [Bondhugula, 2020; Hristea et al., 1997; Mei et al., 2014; Salvador Rohwedder et al., 2023], particularly due to the performance gap between processors' cores and their memory systems [Oliveira et al., 2021; Wilkes, 2001].
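A simplified C sketch of what choosing a faster memory region for an input BUF amounts to in generated code is shown below; the explicitly allocated local buffer stands in for a faster memory region (e.g., CUDA's shared or register memory), and all names and sizes are hypothetical:

```c
#include <stdio.h>

enum { N = 8, TILE = 4 };
static const float input[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };

int main(void) {
    float acc = 0.0f;
    for (int tile = 0; tile < N / TILE; ++tile) {
        /* copy the current tile into a small local buffer, standing in for a */
        /* faster memory region (e.g., shared or register memory on GPU)      */
        float local[TILE];
        for (int j = 0; j < TILE; ++j)
            local[j] = input[tile * TILE + j];

        /* compute on the cached tile */
        for (int j = 0; j < TILE; ++j)
            acc += local[j];
    }
    printf("sum = %f\n", acc);
    return 0;
}
```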
Parameters D4,R4 and S4,S6.
These parameters set the memory layouts of BUFs, thereby determining how data are accessed in memory; for brevity in Table 1, we denote the set of all BUF permutations \([1,D]_{\mathbb{N}}\hookrightarrow\mathrel{\mspace{-13.0mu }}\rightarrow[1,D]_{\mathbb{N}}\) (Definition 12) as \([1,\dotsc,D]_{\mathcal{S}}\) (symbol \(\mathcal{S}\) is taken from the notation of symmetric groups [Sagan, 2001]). In the case of our matrix-vector multiplication example in Figure 15, we use a standard memory layout for all matrices, which we express via the parameters by setting them to the identity function, e.g., \(\sigma_{\downarrow\texttt{-mem}}^{\texttt{<M>}}(1,1):=id\) (Parameter D4) for the matrix read by operator \(\mbox{++}^{\texttt{(HM,x)}}_{1}\).
An optimized memory layout is important to access data in a locality-aware and thus efficient manner.
5 Experimental Results
We experimentally evaluate our approach by comparing it to popular representatives of four important classes:
(1)
Scheduling Approach: TVM [Chen et al., 2018a] which generates GPU and CPU code from programs expressed in TVM’s own high-level program representation;
(2)
Polyhedral Compilers: PPCG [Verdoolaege et al., 2013] for GPUs and Pluto [Bondhugula et al., 2008b] for CPUs, which automatically generate executable program code in CUDA (PPCG) or OpenMP (Pluto) from straightforward, unoptimized C programs;
(3)
Functional Approach: Lift [Steuwer et al., 2015] which generates OpenCL code from a Lift-specific, functional program representation;
(4)
Domain-Specific Libraries: NVIDIA cuBLAS [NVIDIA, 2022b] and NVIDIA cuDNN [NVIDIA, 2022e], as well as Intel oneMKL [Intel, 2022c] and Intel oneDNN [Intel, 2022b], which offer the user easy-to-use, domain-specific building blocks for programming. The libraries internally rely on pre-implemented assembly code that is optimized by experts for their target application domains: linear algebra (cuBLAS and oneMKL) or convolutions (cuDNN and oneDNN), respectively. To make the comparison against the libraries challenging for us, we compare to all routines provided by the libraries. For example, the cuBLAS library offers three semantically equal but differently optimized routines for computing MatMul: cublasSgemm (the default MatMul implementation in cuBLAS), cublasGemmEx which is part of the cuBLASEx extension of cuBLAS [NVIDIA, 2022c], and the most recent cublasLtMatmul which is part of the cuBLASLt extension [NVIDIA, 2022d]; each of these three routines may perform differently on different problem sizes (NVIDIA usually recommends naively testing which routine performs best for the particular target problem). To make the comparison further challenging for us, we exhaustively test for each routine all of its so-called cublasGemmAlgo_t variants and report the routine's runtime for the best-performing variant. In the case of oneMKL, we compare also to its JIT engine [Intel, 2019] which is specifically designed and optimized for small problem sizes. We also compare to the library EKR [Hentschel et al., 2008] which computes data mining example PRL (Figure 14) on CPUs—the library is implemented in the Java programming language and parallelized via Java Threads, and it is used in practice by the Epidemiological Cancer Registry in North Rhine-Westphalia (Germany), which is currently the largest cancer registry in Europe.
We compare to the approaches experimentally in terms of
(1)
Performance: via a runtime comparison of our generated code against code that is generated by the related approaches;
(2)
Portability: based on the Pennycook Metric [Pennycook et al., 2019] which mathematically defines portability as
\begin{align*}\Phi(a,p,H) = \begin{cases}\dfrac{|H|}{\sum_{i\in H}\dfrac{1}{e_{i}(a,p)}} & \text{if application } a \text{ is supported on all platforms } i\in H,\\ 0 & \text{otherwise.}\end{cases}\end{align*}
(A minimal computation sketch of this metric is given after this list.)
In words: “for a given set of platforms \(H\), the performance portability (PP)\(\Phi\) of an application \(a\) solving problem \(p\) is defined as \(\Phi(a,p,H)\), where \(e_{i}(a,p)\) is the performance efficiency (i.e., a ratio of observed performance relative to some proven, achievable level of performance) of application \(a\) solving problem \(p\) on platform \(i\); value \(\Phi(a,p,H)\) is \(0\), if any platform in \(H\) is unsupported by \(a\) running \(p\)” [Pennycook et al., 2019]. Consequently, Pennycook defines portability as a real value in the interval \([0,1]_{\mathbb{R}}\) such that a value close to 1 indicates high portability and a value close to \(0\) indicates low portability. Here, platforms \(H\) represents a set of devices (CPUs, GPUs, \(\dotsc\)), an application \(a\) is in our context a framework (such as TVM, a polyhedral compiler, or our approach), problems \(p\) are our case studies, and \(e_{i}(a,p)\) is computed as the runtime \(a^{\text{best}}_{p,i}\) of the application that achieves the best observed runtime for problem \(p\) on platform \(i\), divided by the runtime of application \(a\) for problem \(p\) running on platform \(i\).
(3)
Productivity: by intuitively arguing that our approach achieves the same, lower, or higher productivity compared to the related approaches, using the representative example computation Matrix-Vector Multiplication (MatVec) (Figure 6). Classical code metrics, such as Lines of Code, the Constructive Cost Model [Boehm et al., 1995], McCabe's Cyclomatic Complexity [McCabe, 1976], and Halstead Development Effort [Halstead, 1977], are not meaningful for comparing the short and concise programs in high-level languages as proposed by the related work and by our approach.
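As referenced in item (2) above, the following C sketch computes the Pennycook metric for a hypothetical set of efficiency values; it returns the harmonic mean of the efficiencies over all platforms, and 0 if any platform is unsupported:

```c
#include <stdio.h>

/* performance portability: harmonic mean of the efficiencies e_i(a,p) over all */
/* platforms in H; 0 if the application is unsupported on some platform (e_i=0) */
static double pennycook_pp(const double e[], int num_platforms) {
    double sum = 0.0;
    for (int i = 0; i < num_platforms; ++i) {
        if (e[i] <= 0.0) return 0.0;   /* unsupported platform */
        sum += 1.0 / e[i];
    }
    return num_platforms / sum;
}

int main(void) {
    /* hypothetical efficiencies on four platforms (1.0 = best observed runtime) */
    const double e[] = { 1.0, 0.8, 0.9, 0.75 };
    printf("PP = %.3f\n", pennycook_pp(e, 4));
    return 0;
}
```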
In the following, after discussing our application case studies, experimental setup, auto-tuning system, and code generator, we compare our approach to each of the four abovementioned classes of approaches (1)–(4) in Sections 5.1–5.4.
Application Case Studies
We use for experiments in this section popular example computations from Figure 14 that belong to different classes of computations:
—
Linear Algebra Routines: Matrix Multiplication (MatMul) and Matrix-Vector Multiplication (MatVec);
—
Stencil Computations: Jacobi Computation (Jacobi3D) and Gaussian Convolution (Conv2D) which differ from linear algebra routines by accessing neighboring elements in their input data;
—
Quantum Chemistry: Coupled Cluster (CCSD(T)) computations which differ from linear algebra routines and stencil computations by accessing their high-dimensional input data in complex, transposed fashions;
—
Data Mining: PRL which differs from the previous computations by relying on a PRL-specific combine operator and scalar function (instead of straightforward additions or multiplications as the previous computations);
—
Deep Learning: the most time-intensive computations within the popular neural networks ResNet-50 [He et al., 2015], VGG-16 [Simonyan and Zisserman, 2014], and MobileNet [Howard et al., 2017], according to their TensorFlow implementations [TensorFlow, 2022a,b,c]. Deep learning computations rely on advanced variants of linear algebra routines and stencil computations, e.g., MCC and MCC_Capsule for computing convolution-like stencils, instead of the classical Conv2D variant of convolution (Figure 14)—the deep learning variants are considered as significantly more challenging to optimize than their classical variants [Barham and Isard, 2019].
We use for experiments this subset of computations from Figure 14 to make experimenting challenging for us: the computations differ in major characteristics (as discussed in Section 2.4), e.g., accessing neighboring elements in their input data (as stencil computations) or not (as linear algebra routines), thus usually requiring fundamentally different kinds of optimizations. Consequently, we consider it challenging for our approach to achieve high performance for our studies, because our approach relies on a generalized optimization process (discussed in Section 4) that uniformly applies to any kind of data-parallel computation and also parallel architecture. In contrast, the optimization processes of the related approaches are often specially designed and tied to a particular application class and often also architecture. For example, NVIDIA cuBLAS and Intel oneMKL are highly optimized specifically for linear algebra routines on either GPU or CPU, respectively, and TVM is specifically designed and optimized for deep learning computations.
To make experimenting further challenging for us, we consider data sizes and characteristics either taken from real-world computations (e.g., from the TCCG benchmark suite [Springer and Bientinesi, 2016] for quantum chemistry computations) or sizes that are preferable for our competitors, e.g., powers of two, for which many competitors (e.g., vendor libraries) are highly optimized. For the deep learning case studies, we use data characteristics (sizes, strides, padding strategy, image/filter formats, etc.) taken from the particular implementations of the neural networks when computing the popular ImageNet [Krizhevsky et al., 2012] dataset (the particular characteristics are listed by Rasch [2024], Section D.1, for the interested reader). For all experiments, we use single-precision floating point numbers (a.k.a. float or fp32), as such precision is the default in TensorFlow and many other frameworks.
Experimental Setup
We run our experiments on a cluster containing two different kinds of GPUs and CPUs:
—
NVIDIA Ampere GPU A100-PCIE-40GB
—
NVIDIA Volta GPU V100-SXM2-16GB
—
Intel Xeon Broadwell CPU E5-2683 v4 @ 2.10GHz
—
Intel Xeon Skylake CPU Gold-6140 @ 2.30GHz
We represent the two CUDA GPUs in our formalism using model \(\texttt{ASM}_{\texttt{CUDA+WRP}}\) (Example 11). We rely on model \(\texttt{ASM}_{\texttt{CUDA+WRP}}\), rather than CUDA's standard model \(\texttt{ASM}_{\texttt{CUDA}}\) (also in Example 11), to exploit CUDA's (implicit) warp level for a fair comparison to the related approaches: warp-level optimizations are exploited by the related approaches (such as TVM), e.g., for shuffle operations [NVIDIA, 2018] which combine the results of threads within a warp with high performance. To fairly compare our approach to TVM and PPCG, we avoid exploiting warps' tensor core intrinsics [NVIDIA, 2017] in all experiments, which compute the multiplication of small matrices with high performance [Feng et al., 2023], because these intrinsics are not used in the TVM- and PPCG-generated CUDA code. For the two CPUs, we rely on model \(\texttt{ASM}_{\texttt{OpenCL}}\) (Example 11) for generating OpenCL code. Like our approach, TVM also generates OpenCL code for CPUs; Pluto relies on the OpenMP approach to target CPUs.
For all experiments, we use the newest currently available versions of frameworks, libraries, and compilers, as follows. We compile our generated GPU code using library CUDA NVRTC [NVIDIA, 2022h] from CUDA Toolkit 11.4, and we use Intel's OpenCL runtime version 18.1.0.0920 for compiling CPU code. For both compilers, we do not set any flags, so that they run in their default modes. For the related approaches, we use the following versions of frameworks, libraries, and compilers:
—
TVM [Apache, 2022] version 0.8.0 which also uses our system’s CUDA Toolkit version 11.4 for GPU computations and Intel’s runtime version 18.1.0.0920 for computations on CPU;
—
PPCG [Michael Kruse, 2022] version 0.08.04 using flag --target=cuda for generating CUDA code, rather than OpenCL, as CUDA is usually better performing than OpenCL on NVIDIA GPUs, and we use flag --sizes followed by auto-tuned tile sizes—we rely on the Auto-Tuning Framework (ATF) [Rasch et al., 2021] for choosing optimized tile size values (as we discuss in the next subsection);
—
Pluto [Uday Bondhugula, 2022] commit 12e075a using flag --parallel for generating OpenMP-parallelized C code (rather than sequential C), as well as flag --tile to use ATF-tuned tile sizes for Pluto; the Pluto-generated OpenMP code is compiled via Intel’s icx compiler version 2022.0.0 using the Pluto-recommended optimization flags -O3 -qopenmp;
—
NVIDIA cuBLAS [NVIDIA, 2022b] from CUDA Toolkit 11.4, using the NVIDIA-recommended compiler flags -fast -O3 -DNDEBUG;
—
NVIDIA cuDNN [NVIDIA, 2022e] from CUDA Toolkit 11.4, using the NVIDIA-recommended compiler flags -fast -O3 -DNDEBUG;
—
Intel oneMKL [Intel, 2022c] compiled with Intel's icpx compiler version 2022.0.0, using flags -DMKL_ILP64 -qmkl=parallel -L${MKLROOT}/lib/intel64 -liomp5 -lpthread -lm -ldl, as recommended for oneMKL by Intel's Link Line Advisor tool [Intel, 2022a], as well as standard flags -O3 -NDEBUG;
—
Intel oneDNN [Intel, 2022b] also compiled with Intel's icpx compiler version 2022.0.0, using flags -I${DNNLROOT}/include -L${DNNLROOT}/lib -ldnnl, according to oneDNN's documentation, as well as standard flags -O3 -NDEBUG;
—
EKR [Hentschel et al., 2008] executed via Java SE 1.8.0 Update 281.
We profile runtimes of CUDA and OpenCL programs using the corresponding event-based profiling APIs provided by CUDA and OpenCL. For Pluto, which generates OpenMP-annotated C code, we measure runtimes via system call clock_gettime [GNU/Linux, 2022]. In the case of the C++ libraries Intel oneMKL and Intel oneDNN, we use the C++ chrono library [C++ reference, 2022] for profiling. Libraries NVIDIA cuBLAS and NVIDIA cuDNN are also based on the CUDA programming model; thus, we profile them also via CUDA events. To measure the runtimes of the EKR Java library, we use Java function System.currentTimeMillis().
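For illustration, a minimal clock_gettime-based measurement in C might look as follows; this is not our actual profiling code, and the measured function is a hypothetical stand-in:

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <stdio.h>
#include <time.h>

/* hypothetical stand-in for the computation whose runtime is measured */
static void computation_under_test(void) {
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x += i * 0.5;
}

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    computation_under_test();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ms = (end.tv_sec - start.tv_sec) * 1e3
              + (end.tv_nsec - start.tv_nsec) * 1e-6;
    printf("runtime: %.3f ms\n", ms);
    return 0;
}
```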
All measurements of CUDA and OpenCL programs contain the pure program runtime only (a.k.a. kernel runtime). The runtime of host code is not included in the reported runtimes, as the performance of host code is not relevant for this work and is the same for all approaches.
In all experiments, we collect measurements until the 99% confidence interval is within 5% of our reported means, according to the guidelines for scientific benchmarking of parallel computing systems by Hoefler and Belli [2015].
Auto-Tuning
The auto-tuning process of our approach relies on the generic ATF [Rasch et al., 2021]. The ATF framework has proven to be efficient for exploring large search spaces of constrained tuning parameters (such as our space introduced in Section 4). We use ATF, out of the box, exactly as described by Rasch et al. [2021]: (1) we straightforwardly represent our search space (Table 1) in ATF via tuning parameters which express the parameters in the table and their constraints; (2) we use ATF's pre-implemented cost functions for CUDA and OpenCL to measure the cost of our generated OpenCL and CUDA codes (in this article, we consider a program's runtime as cost, rather than its energy consumption or similar); (3) we start the tuning process using ATF's default search technique (AUC bandit [Ansel et al., 2014]). ATF then fully automatically determines a well-performing tuning parameter configuration for the particular combination of case study, architecture, and input/output characteristics (size, memory layout, etc.).
For the scheduling approach TVM, we use its Ansor [Zheng et al., 2020a] optimization engine which is specifically designed for generating optimized TVM schedules. The polyhedral compilers PPCG and Pluto do not provide their own auto-tuning systems; thus, we also use ATF for auto-tuning them, the same as for our approach. For both compilers, we additionally report their runtimes when relying on their internal heuristics, rather than on auto-tuning, to fairly compare to them.
To achieve the best possible performance results for TVM, PPCG, and Pluto, we auto-tune each of these frameworks individually, for each particular combination of case study, architecture, and input/output characteristics, the same as for our approach. For example, we start for TVM one tuning run when auto-tuning case study MatMul for the NVIDIA Ampere GPU on one input size, and another, new tuning run for a new input size.
Hand-optimized libraries NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN rely on heuristics provided by experts, rather than auto-tuning. By relying on heuristics, the libraries avoid the time-intensive process of auto-tuning. However, auto-tuning is well amortized in many application areas (e.g., deep learning), because the auto-tuned implementations are re-used in many program runs. Moreover, auto-tuning avoids the complex and costly process of hand optimization by experts, and it often achieves higher performance than hand-optimized code (as we confirm later in our experiments), because well-performing optimizations are often not intuitive.
For a fair comparison, we use for each tuning run uniformly the same tuning time of \(12\) h. Even though, for many computations, well-performing tuning results could often be found in less than \(12\) h (for our approach as well as for the other frameworks), we use such a generous tuning time for all frameworks to avoid auto-tuning issues in our reported results—analyzing, improving, and accelerating the auto-tuning process is beyond the scope of this work and intended for our future work (as also outlined in Section 8). In particular, TVM's Ansor optimizer was often able to find well-performing optimizations in \(6\) h of tuning time or less. This is because Ansor explores a small search space that is designed and optimized for deep learning computations—Ansor's space is a proper subset of our space, as our space aims to capture general optimizations that apply to arbitrary data-parallel computations. However, the focus on deep learning causes Ansor to have difficulties with optimizing computations not taken from the deep learning area, as we confirm in our experiments.
To improve the auto-tuning efficiency for our implementations, we rely on a straightforward cost model that shrinks our search space in Table 1 before starting our ATF-based auto-tuning process: (1) we always use the same values for Parameters D1, S1, R1 as well as for Parameters D2, S2, R2, thereby generating the same loop structure for all three phases (de-composition, scalar, and re-composition) such that the structures can be generated as a fused loop nest; (2) we restrict Parameters D2, S2, R2 to two values—one value that lets threads process outer parts (a.k.a. blocked access or outer parallelism, respectively) and one that lets threads process inner parts (strided access or inner parallelism); all other permutations are currently ignored for simplicity or because they have no effect on the generated code (e.g., permutations of Parameters D2, S2, R2 that only differ in dimension tags belonging to memory layers, as discussed in the previous section); (3) we restrict Parameters D3, S3, S5, R3 such that each parameter is invariant under different values of \(d\) of its input pairs \((l,d)\in\texttt{MDH-LVL}\), i.e., we always copy full tiles into memory regions (and not a full tile of one input buffer and a half tile of another input buffer, which sometimes might achieve higher performance when memory is a limited resource).
Our cost model is straightforward and might filter out configurations from our search space that achieve potentially higher performance than we report for our approach in Sections 5.1–5.4. We aim to substantially improve our naive cost model in future work, based on operational semantics for our low-level representation, to improve the auto-tuning quality and to reduce (or even avoid) tuning time.
Code Generation
We provide an open source MDH compiler [MDH Project, 2024] for generating executable program code from expressions in our high-level representation (as illustrated in Figure 4). Our compiler takes as input the high-level representation of the target computation (Figure 14), in the form of a Python program, and it fully automatically generates auto-tuned program code, based on the concepts and methodologies introduced and discussed in this article and the ATF [Rasch et al., 2021].
In our future work, we aim to integrate our code generation approach into the MLIR compiler framework [Lattner et al., 2021], building on work-in-progress results [Google SIG MLIR Open Design Meeting, 2020], thereby making our work better accessible to the community. We consider approaches such as AnyDSL [Leißa et al., 2018] and BuildIt [Brahmakshatriya and Amarasinghe, 2021] as further, interesting frameworks in which our compiler could be implemented.
5.1 Scheduling Approaches
Performance.
Figures 17–22 report the performance of the TVM-generated code, which is in CUDA for GPUs and in OpenCL for CPUs. We observe that we usually achieve with our approach the high performance of TVM and often perform even better. For example, in Figure 21, we achieve a speedup \({\gt}2\times\) over TVM on NVIDIA Ampere GPU for matrix multiplications as used in the inference phase of the ResNet-50 neural network—actually a favorable example for TVM, which is designed and optimized toward deep learning computations executed on modern NVIDIA GPUs. Our performance advantage over TVM is because we parallelize and optimize reduction-like computations more efficiently—in the case of MatMul (Figure 14), its \(3\)rd dimension (a.k.a. \(k\)-dimension). The difficulties of TVM with reduction computations become particularly obvious when computing dot products (Dot) on GPUs (Figure 17): Dot's main computation part is a reduction computation (via point-wise addition, see Figure 14), thus requiring reduction-focused optimization, in particular when targeting the highly parallel architecture of GPUs: in the case of Dot (Figure 17), our generated CUDA code exploits parallelization over CUDA blocks, whereas the Ansor-generated TVM code exploits parallelization over threads within a single block only, because TVM currently cannot use blocks for parallelizing reduction computations [Apache TVM Community, 2022a]. Furthermore, while TVM's Ansor rigidly parallelizes outer dimensions [Zheng et al., 2020a], our ATF-based tuning process has auto-tuned our tuning parameters D2, S2, R2 in Table 1 to exploit parallelism for inner dimensions, which achieves higher performance for this particular MatMul example used in ResNet-50. Also, for MatMul-like computations, Ansor always caches parts of the input in the GPU's shared memory, and it computes these cached parts always in register memory. In contrast, our caching strategy is auto-tunable (via parameters D3, S3, S5, R3 in Table 1), and ATF has determined to not cache the input matrices into fast memory resources for the MatMul example in ResNet-50. Surprisingly, Ansor does not exploit fast memory resources for Jacobi stencils (Figure 18), as required to achieve high performance for them: our generated and auto-tuned CUDA kernel for Jacobi uses register memory for both inputs (image buffer and filter) when targeting NVIDIA Ampere GPU (small input size), thereby achieving a speedup over TVM+Ansor of \(1.93\times\) for Jacobi. Most likely, Ansor fails to foresee the potential of exploiting fast memory resources for Jacobi stencils, because Jacobi's index functions used for memory accesses (Figure 14) are injective. For the MatMul example of ResNet-50's training phase (Figure 21), we achieve a speedup over TVM on NVIDIA Ampere GPU of \(1.26\times\), because auto-tuning determined to store parts of input matrix \(A\) as transposed into fast memory (via parameter D4 in Table 1). Storing parts of the input/output data as transposed is not considered by Ansor as an optimization, perhaps because such an optimization must be expressed in TVM's high-level language, rather than in its scheduling language [Apache TVM Community, 2022c].
For MatVec on NVIDIA Ampere GPU (Figure 17), we achieve a speedup over TVM of \(1.22\times\) for the small input size, by exploiting a so-called swizzle pattern [Phothilimthana et al., 2019]: our ATF tuner has determined to assign threads that are consecutive in CUDA's x-dimension to the second MDA dimension (via parameters D2, S2, R2 in Table 1), thereby accessing the input matrix in a GPU-efficient manner (a.k.a. coalesced global memory accesses [NVIDIA, 2022f]). In contrast, for MatVec computations, Ansor assigns threads with consecutive x-ids always to the first data dimension, in a non-tunable manner, causing lower performance.
Figs. 17–22.
Our positive speedups over TVM on CPU are for the same reasons as discussed above for GPU. For example, we achieve a speedup of \({\gt}3\times\) over TVM on Intel Skylake CPU for MCC (Figure 22) as used in the training phase of the MobileNet neural network, because we exploit fast memory resources more efficiently than TVM: our auto-tuning process has determined to use register memory for MCC's second input (the filter buffer F, see Figure 14) and no fast memory for the first input (image buffer I), whereas Ansor rigidly uses shared memory for both inputs of MCC. Moreover, our auto-tuning process has determined to parallelize the inner dimensions of MCC, while Ansor always parallelizes outer dimensions. We achieve the best speedup over TVM for MCC on an input size taken from TVM's own tutorials [Apache TVM Documentation, 2022b] (Figure 18), rather than from neural networks (as in Figures 21 and 22). This is because TVM's MCC size involves large reduction computations, which are not efficiently optimized by TVM (as discussed above).
The TVM compiler achieves higher performance than our approach for some examples in Figures 17–22. However, in most cases, this is for a technical reason only: TVM uses the NVCC compiler for compiling CUDA code, whereas our proof-of-concept code generator currently relies on NVIDIA’s NVRTC library which surprisingly generates less-efficient CUDA assembly than NVCC. In three cases, the higher performance of TVM over our approach is because our ATF was not able to find a better performing tuning configuration than TVM’s Ansor optimization engine during our \(12\)h tuning time; the three cases are: (1) MCC from VGG-16’s inference phase on NVIDIA Ampere GPU (Figure 21), (2) MCC (capsule variant) from VGG-16’s training phase on NVIDIA Ampere GPU (Figure 21), and (3) MCC (capsule variant) from ResNet-50’s training phase on Intel Skylake CPU (Figure 22). However, when we manually set the Ansor-found tuning configurations also for our approach, instead of using the ATF-found configurations, we achieve for these three cases exactly the same high performance as TVM+Ansor, i.e., the well-performing configurations are contained in our search space (Table 1). Most likely, Ansor was able to find this well-performing configuration within the \(12\) h tuning time, because it explores a significantly smaller search space that is particularly designed for deep learning computations. To avoid such tuning issues in our approach, we aim to substantially improve our auto-tuning process in future work: we plan to introduce an analytical cost model that assists (or even replaces) our auto-tuner, as we also outline in Section 8.
Note that the TVM compiler crashes for our data mining example PRL, because TVM has difficulties with computations relying on user-defined combine operators [Apache TVM Community, 2022d].
Portability.
Figure 23 reports the portability of the TVM compiler. Our portability measurements are based on the Pennycook metric where a value close to 1 indicates high portability and a value close to \(0\) indicates low portability, correspondingly. We observe that except for the example of transposed matrix multiplication \(\texttt{GEMM}^{\texttt{T}}\), we always achieve higher portability than TVM. The higher portability of TVM for \(\texttt{GEMM}^{\texttt{T}}\) is because TVM achieves for this example higher performance than our approach on NVIDIA Volta GPU. However, the higher performance of TVM is only due to the fact that TVM uses NVIDIA’s NVCC compiler for compiling CUDA code, while we currently rely on NVIDIA’s NVRTC library which surprisingly generates less-efficient CUDA assembly, as discussed above.
Fig. 23.
Productivity.
Listing 1 shows how matrix-vector multiplication (MatVec) is implemented in TVM’s high-level program representation which is embedded into the Python programming language. In line 1, the input size \((I,K)\in\mathbb{N}\times\mathbb{N}\) of matrix \(M\in T^{I\times K}\) (line 2) and vector \(v\in T^{K}\) (line 3) are declared, in the form of function parameters; the matrix and vector are named M and v and both are assumed to contain elements of scalar type \(T=\texttt{float32}\) (floating point numbers). Line 5 defines a so-called reduction axis in TVM, in which all values are combined in line 8 via te.sum (addition). The basic computation part of MatVec—multiplying matrix element M[i,k] with vector element v[k]—is also specified in line 8.
While we consider the MatVec implementations of TVM (Listing 1) and our approach (Figure 6) as basically on the same level of abstraction, we consider our approach as more expressive in general. This is because our approach supports multiple reduction dimensions that may rely on different combine operators, e.g., as required for expressing the MBBS example in Figure 14. In contrast, TVM struggles with different combine operators—adding support for multiple, different reduction dimensions is considered in the TVM community as a non-trivial extension of TVM [Apache TVM Community, 2020, 2022b]. Also, we consider our approach as slightly less error-prone: we automatically compute the expected sizes of matrix \(M\) (as \(I\times K\)) and vector \(v\) (as \(K\)), based on the user-defined input size \((I,K)\) in line 1 and the index functions \((i,k)\mapsto(i,k)\) for the matrix and \((i,k)\mapsto(k)\) for the vector in line 8 (the formula for computing the sizes is described by Rasch [2024], Definition 8, for the interested reader). In contrast, TVM redundantly requests these matrix and vector sizes from the user: once in lines 2 and 3 of Listing 1, and again in lines 5 and 7. TVM uses these sizes for generating the function specification of its generated MatVec code, which lets TVM generate incorrect low-level code—without issuing an error message—when the user sets non-matching sizes in lines 2/3 and lines 5/7.
5.2 Polyhedral Compilers
Performance.
Figures 17–22 report the performance achieved by the PPCG-generated CUDA code for GPUs and by the OpenMP-annotated C code generated by polyhedral compiler Pluto for CPUs. For a fair comparison, we report for both polyhedral compilers their performance achieved for ATF-tuned tile sizes (denoted as PPCG+ATF and Pluto+ATF in the figures), as well as the performance of the two compilers when relying on their internal heuristics instead of auto-tuning (denoted as PPCG and Pluto). In some cases, PPCG's heuristic crashed with the error "too many resources requested for launch," because the heuristic seems not to take into account device-specific constraints, e.g., the limited availability of GPUs' fast memory resources.
We observe from Figures 17–22 that in all cases, our approach achieves better performance than PPCG and Pluto—sometimes by multiple orders of magnitude, in particular for deep learning computations (Figures 21 and 22). This is caused by the rigid optimization goals of PPCG and Pluto, e.g., always parallelizing outer dimensions, which causes severe performance losses. For example, we achieve a speedup over PPCG of \({{\gt}13\times}\) on NVIDIA Ampere GPU and of \({{\gt}60\times}\) over Pluto on Intel Skylake CPU for MCC as used in the inference phase of the real-world ResNet-50 neural network. Compared to PPCG, our better performance for this MCC example is because PPCG has difficulties with efficiently parallelizing computations relying on more than three dimensions. Most likely, this is because CUDA offers by default three dimensions for parallelization (called x, y, z dimensions in CUDA). However, MCC relies on seven parallelizable dimensions (as shown in Figure 14), and exploiting the parallelization opportunities of the four further dimensions (as done in our generated CUDA code) is essential to achieve high performance for this MCC example from ResNet-50. Our performance advantage over Pluto for the MCC example is because Pluto parallelizes only the outer dimensions of MCC (whereas our approach has the potential to parallelize all dimensions); however, the outer dimension has a size of only 1 for this real-world example, resulting in starting only 1 thread in the Pluto-generated OpenMP code.
For dot products Dot (Figure 17), we observe that PPCG fails to generate parallel CUDA code, because PPCG cannot parallelize and optimize computations which rely solely on combine operators different from concatenation, as we also discuss in Section 6.2. There, we particularly discuss that we do not consider the performance issues of PPCG and Pluto as weaknesses of the polyhedral approach in general, but of the particular polyhedral transformations chosen for PPCG and Pluto.
Note that Pluto crashes for our data mining example (Figure 20), with "Error extracting polyhedra from source file," because the scalar function of this example is too complex for Pluto (it contains if-statements). Moreover, Intel's icx compiler struggles with compiling the Pluto-generated OpenMP code for quantum chemistry computations (Figure 19): we aborted icx's compilation process after \(24\) h of compilation time. The issue of icx with the Pluto-generated code is most likely caused by Pluto's overly aggressive loop unrolling—the Pluto-generated OpenMP code often has a size of \({\gt}50\) MB for our real-world quantum chemistry examples.
Portability.
Since PPCG and Pluto are each designed for particular architectures only, they achieve the lowest portability of \(0\) for all our studies in Figure 23, according to the Pennycook metric. To simplify the portability comparison with our approach for PPCG and Pluto, we compute the Pennycook metric additionally also for two restricted sets of devices: only GPUs to make the comparison against our approach easier for PPCG, and only CPUs to make the comparison easier for Pluto.
Figures 24–28 report the portability of PPCG when considering only GPUs, as well as the portability of Pluto for only CPUs. We observe that we achieve higher portability for all our studies, as we constantly achieve higher performance than the two polyhedral compilers for the studies.
Figs. 24–28.
Note that even when restricting our set of devices to only GPUs for PPCG or only CPUs for Pluto, the two polyhedral compilers still achieve a portability of \(0\) for some examples, because they fail to generate code for them (as discussed above).
Productivity.
Listing 2 shows the input program of polyhedral compilers PPCG and Pluto for MatVec. Both take as input easy-to-implement, straightforward, sequential C code. We consider these two polyhedral compilers as more productive than our approach (as well as scheduling and functional approaches, and also polyhedral compilers that take DSL programs as input, such as TC [Vasilache et al., 2019]), because both compilers fully automatically generate optimized parallel code from unoptimized, sequential program code.
Rasch et al. [2020b,c] show that our approach can achieve the same high user productivity as polyhedral compilers, by using a polyhedral frontend for our approach: we can alternatively take as input the same sequential program code as PPCG and Pluto, instead of programs implemented in our high-level program representation (as in Figure 6). The sequential input program is then transformed via polyhedral tool pet [Verdoolaege and Grosser, 2012] to its polyhedral representation which is then automatically transformed to our high-level program representation, according to the methodology presented by Rasch et al. [2020b,c].
5.3 Functional Approaches
Our previous work [Rasch et al., 2019a] already shows that while functional approaches provide a solid formal foundation for computations, they typically suffer from performance and portability issues. For this, our previous work compares our approach (in its original, proof-of-concept implementation [Rasch et al., 2019a]) to the state-of-the-art Lift [Steuwer et al., 2015] framework which, to the best of our knowledge, has so far not been improved toward higher performance and/or better portability. Consequently, we refrain from a further performance and portability evaluation of Lift and focus in the following on analyzing and discussing the productivity potentials of functional approaches, using again the state-of-the-art Lift approach as running example. In Section 6.3, we discuss the performance and portability issues of functional approaches from a general perspective.
Performance/Portability.
Already experimentally evaluated in previous work [Rasch et al., 2019a] and discussed in general terms in Section 6.3.
Productivity.
Listing 3 shows how MatVec is implemented in Lift. In line 1, type parameters n and m are declared, via the Lift building block nFun. Line 2 declares a function fun that takes as input a matrix of size \(\texttt{m}\times\texttt{n}\) and a vector of size n, both consisting of floating point numbers (float). The computation of MatVec is specified in lines 3 and 4. In line 3, Lift’s map pattern iterates over all rows of the matrix, and the zip pattern in line 4 combines each row pair-wise with the input vector. Afterward, multiplication * is applied to each pair, using Lift’s map pattern again, and the obtained products are finally combined via addition + using Lift’s reduce pattern.
Already for expressing MatVec, we can observe that Lift relies on a vast set of small, functional building blocks (five building blocks for MatVec: nFun, fun, map, zip, and reduce), and the blocks have to be composed and nested in complex ways for expressing computations. Consequently, we consider programming in Lift and Lift-like approaches as complex and their productivity for the user as limited. Moreover, the approaches often need fundamental extension for targeting new kinds of computations, e.g., so-called macro-rules which had to be added to Lift to efficiently target matrix multiplications [Remmelg et al., 2016] and primitives slide and pad together with optimization overlapped tiling for expressing stencil computations [Hagedorn et al., 2018]. This need for extensions limits the expressivity of the Lift language and thus further hinders productivity.
In contrast to Lift, our approach relies on exactly three higher-order functions (Figure 5) to express various kinds of data-parallel computations (Figure 14): (1) inp_view (Definition 7) which prepares the input data; our inp_view function is designed as general enough to subsume, in a structured way, the subset of all Lift patterns intended to change the view on input data, including patterns zip, pad, and slide; (2) md_hom (Definition 3) expresses the actual computation part, and it subsumes the Lift patterns performing actual computations (fun, map, reduce, \(\dotsc\)); (3) out_view (Definition 9) expresses the view on output data and is designed to work similarly as function inp_view (Lemma 2). Our three functions are always composed straightforwardly, in the same, fixed order (Figure 5), and they do not rely on complex function nesting for expressing computations.
Note that even though our language is designed as minimalistic, it should cover the expressivity of the Lift language22 and beyond: for example, we are currently not aware of any Lift program being able to express the prefix-sum examples in Figure 14. For the above reasons, we consider programming in our high-level language as more productive for the user than programming in Lift-like, functional-style languages. Furthermore, as discussed in Section 5.2, our approach can take as input also straightforward, sequential program code, which further contributes to the productivity of our approach.
5.4 Domain-Specific Approaches
Performance.
Figures 17–22 also report, for completeness, the performance results achieved by domain-specific approaches. Since domain-specific approaches are specifically designed and optimized for particular application domains and often also architectures (e.g., only linear algebra routines on only GPU), we consider comparing to them as most challenging for us: our approach is designed and optimized for data-parallel computations in general, from arbitrary application domains (the same as also polyhedral compilers and many functional approaches), and our approach is also designed as generic in the target parallel architecture.
We observe in Figures 17–22 that the domain-specific libraries NVIDIA cuBLAS/cuDNN (for linear algebra routines and convolutions on GPUs) and Intel oneMKL/oneDNN (for linear algebra routines and convolutions on CPUs) sometimes perform better and sometimes worse than our approach.
The better performance of libraries over our approach is most likely23 because the libraries internally rely on assembly-level optimizations, while we currently focus on the higher CUDA/OpenCL level of abstraction which offers fewer optimization opportunities [Goto and Geijn, 2008; Lai and Seznec, 2013]. The cuBLASEx extension of cuBLAS achieves in one case—MatMul on the NVIDIA Volta GPU for square \(1024\times 1024\) input matrices—significantly higher performance than our approach. This high performance is achieved by cuBLASEx when using its CUBLAS_GEMM_ALGO1_TENSOR_OP algorithm variant, which implicitly casts the float-typed inputs to the half-precision type (a.k.a. half or fp16), allowing cuBLASEx to exploit the GPU’s tensor core extension [NVIDIA, 2017]. Thereby, cuBLASEx achieves significantly higher performance than our approach, because tensor cores compute small matrix multiplications directly in hardware; however, at the cost of a significant precision loss: the half scalar type provides only about half the number of significant decimal digits of scalar type float. When using cuBLASEx’s default algorithm CUBLAS_GEMM_DEFAULT (rather than algorithm CUBLAS_GEMM_ALGO1_TENSOR_OP), which retains the float type and thus meets the accuracy expected from the computation, we achieve a speedup of \(1.11\times\) over cuBLASEx.24
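Concretely, the precision gap follows from the IEEE 754 significand widths of the two scalar types:
\[
\texttt{half}\colon\ 11\text{-bit significand}\ \Rightarrow\ \log_{10}\!\big(2^{11}\big) \approx 3.3\ \text{decimal digits},
\qquad
\texttt{float}\colon\ 24\text{-bit significand}\ \Rightarrow\ \log_{10}\!\big(2^{24}\big) \approx 7.2\ \text{decimal digits}.
\]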
The reason for the better performance of our approach over the NVIDIA and Intel libraries is most likely that our approach allows generating code that is also optimized (auto-tuned) for data characteristics, which is important for high performance [Tillet and Cox, 2017]. In contrast, the vendor libraries usually rely on pre-implemented code that is optimized toward only average high performance for a range of data characteristics (size, memory layout, etc.). By relying on this fixed, pre-implemented code, the libraries avoid the auto-tuning overhead. However, auto-tuning is often amortized, particularly for deep learning computations—the main target of libraries NVIDIA cuDNN and Intel oneDNN—because the auto-tuned implementations are re-used in many program runs. Moreover, we achieve better performance for convolutions (Figure 18), because the libraries re-use optimizations for these computations originally intended for linear algebra routines [Li et al., 2016], whereas our optimization space (Table 1) is designed for data-parallel computations in general and not as specifically oriented toward linear algebra.
Compared to the EKR library (Figure 20), we achieve higher performance, because EKR’s Java implementation handles memory inefficiently: the library is implemented using Java’s ArrayList data structure, which is convenient for the Java programmer but inefficient in terms of performance, because the structure internally performs costly memory re-allocations.
Portability.
Similar to the polyhedral compilers PPCG and Pluto, the domain-specific approaches work for certain architectures only and thus achieve the lowest possible portability score of \(0\) in Figure 23 for our studies. The domain-specific approaches are also restricted to a narrow set of studies: for example, NVIDIA cuBLAS and Intel oneMKL support only linear algebra routines, and EKR supports only the data mining example PRL. Consequently, the approaches also achieve a portability of \(0\) for these unsupported studies in Figures 24–28, in which our portability evaluation is limited to GPUs or CPUs only, respectively, to make the comparison against our approach easier for the vendor libraries.
For their target studies, domain-specific approaches achieve high portability. This is because the approaches are specifically designed and optimized toward these studies, e.g., via assembly-level optimizations which are currently beyond the scope of our work and considered as future work for our approach (see Section 8).
Productivity.
Listing 4 shows the implementation of MatVec in the domain-specific approach NVIDIA cuBLAS; the implementation of MatVec in other domain-specific approaches, e.g., Intel oneMKL, is analogous to the implementation in Listing 4.
We consider domain-specific approaches as most productive for their target domain: in the case of MatVec, the user simply calls the high-level function cublasSgemv and passes to it the input matrix and vector (omitted via ellipsis in the listing) together with some meta information (memory layout, etc.); cuBLAS then automatically starts the GPU computation for MatVec.
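For illustration, such a call has roughly the shape sketched below (in the spirit of Listing 4; the buffer names, sizes, and the column-major layout are illustrative assumptions, and error handling is omitted):

#include <cublas_v2.h>

// Sketch of a cuBLAS MatVec call; d_A (m x n, column-major), d_x, and d_y are
// assumed to be device buffers that already hold the input data.
void matvec_cublas(cublasHandle_t handle, const float *d_A, const float *d_x,
                   float *d_y, int m, int n) {
  const float alpha = 1.0f, beta = 0.0f;
  // y = alpha * A * x + beta * y; CUBLAS_OP_N and the leading dimension m
  // encode the meta information (no transposition, column-major layout).
  cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, /*lda=*/m,
              d_x, /*incx=*/1, &beta, d_y, /*incy=*/1);
}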
Besides the fact that domain-specific approaches typically target only particular architectures, a further fundamental productivity issue of domain-specific approaches is that they can be used for a narrow class of computations only, e.g., only linear algebra routines as in NVIDIA cuBLAS and Intel oneMKL. Moreover, in the case of domain-specific libraries from NVIDIA and Intel, it is often up to the user to manually choose among different, semantically equal but differently performing implementations for high performance. For example, the cuBLAS library offers three different routines for computing matrix multiplications: (1) cublasSgemm (part of standard cuBLAS), (2) cublasGemmEx (part of the cuBLASEx extension of cuBLAS), and (3) routine cublasLtMatmul (part of the cuBLASLt extension). These routines often also offer different, so-called algorithms (e.g., \(42\) algorithm variants in the case of cuBLASEx) which impact the internal optimization process. When striving for the highest performance potential of the libraries, the user is in charge of naively testing each possible combination of routine and algorithm variant (as we have done in Figures 17–22 to make experimenting challenging for us). In addition, the user must be aware that different combinations of routines and algorithms can produce results of reduced accuracy (as discussed above), which can be critical for accuracy-sensitive use cases.
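As an illustration of this manual search, a naive sweep over the non-tensor-op algorithm variants of cublasGemmEx looks roughly as follows (a sketch; the buffer names and sizes are assumptions, the compute-type constant assumes cuBLAS 11 or newer, and timing and error handling are omitted):

#include <cublas_v2.h>

// Naively trying cuBLASEx algorithm variants for a float MatMul, as described above.
void try_gemm_algorithms(cublasHandle_t handle, const float *d_A, const float *d_B,
                         float *d_C, int m, int n, int k) {
  const float alpha = 1.0f, beta = 0.0f;
  for (int algo = CUBLAS_GEMM_DEFAULT; algo <= CUBLAS_GEMM_ALGO23; ++algo) {
    cublasStatus_t status = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
        &alpha, d_A, CUDA_R_32F, /*lda=*/m,
                d_B, CUDA_R_32F, /*ldb=*/k,
        &beta,  d_C, CUDA_R_32F, /*ldc=*/m,
        CUBLAS_COMPUTE_32F,                 // keep float accuracy (no implicit fp16 cast)
        static_cast<cublasGemmAlgo_t>(algo));
    // time each successful call and keep the fastest variant; the *_TENSOR_OP
    // variants form a second, non-contiguous range and trade accuracy for speed
    (void)status;
  }
}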
6 Related Work
Three major classes of approaches currently focus on code generation and optimization for data-parallel computations: (1) scheduling, (2) polyhedral, and (3) functional. In the following, we compare in Sections 6.1–6.3 our approach to each of these three classes—in terms of performance, portability, and productivity. In contrast to Section 5, which has compared our approach against these classes experimentally, this section is focused on discussions in a more general, non-experimental context. Afterward, we outline domain-specific approaches in Section 6.4, which are specifically designed and optimized toward their target application domains. In Section 6.5, we outline approaches focusing on optimizations that operate at the algorithmic level of abstraction (and thus at a higher abstraction level than our approach); we consider these higher-level approaches as greatly combinable with our work. Finally, we discuss in Section 6.6 the differences between our approach introduced in this article and the already existing work on MDHs.
6.1 Scheduling Approaches
Popular examples of scheduling approaches include UTF [Kelly and Pugh, 1998], URUK [Girbal et al., 2006], CHiLL [Chen et al., 2008, Khan et al., 2013], Halide [Ragan-Kelley et al., 2013], Clay [Bagnères et al., 2016], TVM [Chen et al., 2018a], TeML [Susungi et al., 2020], Tiramisu [Baghdadi et al., 2019], DaCe [Ben-Nun et al., 2019], Fireiron [Hagedorn et al., 2020a], Elevate [Hagedorn et al., 2020b], DISTAL [Yadav et al., 2022], and LoopStack [Wasti et al., 2022]. While scheduling approaches usually achieve high performance, they often have difficulties with achieving portability and productivity, as we discuss in the following.25
Performance.
Scheduling approaches usually achieve high performance. For this, the approaches incorporate human expert knowledge into their optimization process, which is based on two major steps: (1) a human expert implements an optimization program (a.k.a. schedule) in a so-called scheduling language—the program specifies the basic optimizations to perform, such as tiling and parallelization; (2) an auto-tuning system (or a human hardware expert) chooses values of the performance-critical parameters of the optimizations implemented in the schedule, e.g., particular values of tile sizes and concrete numbers of threads.
Our experiments in Section 5 show that compared to the state-of-the-art scheduling approach TVM (using its recent Ansor optimizer [Zheng et al., 2020a] for schedule generation), our approach achieves competitive and sometimes even better performance, e.g., speedups up to \(2.22\times\) on GPU and \(3.55\times\) on CPU over TVM+Ansor for computations taken from TVM’s favorable application domain (deep learning). Section 5 discusses that our better performance is due to the design and structure of our general optimization space (Table 1) which can be efficiently explored—fully automatically—using state-of-the-art auto-tuning techniques [Rasch et al., 2021]. We focus on TVM in our experiments (rather than, e.g., Halide) to make experimenting challenging for us: TVM+Ansor has proved to achieve higher performance on GPUs and CPUs than popular state-of-practice approaches [Zheng et al., 2020a], including Halide, PyTorch [Paszke et al., 2019], and the recent FlexTensor optimizer [Zheng et al., 2020b].
Recent approach TensorIR [Feng et al., 2023] is a compiler for deep learning computations that achieves higher performance than TVM on NVIDIA GPUs. However, this performance gain over TVM is mainly achieved by exploiting the domain-specific tensor core [NVIDIA, 2017] extensions of NVIDIA GPUs, which compute in hardware the multiplications of small, low-precision \(4\times 4\) matrices. For this, TensorIR introduces the concept of blocks which represent sub-computations, e.g., multiplying \(4\times 4\) matrices. These blocks are then mapped by TensorIR to domain-specific hardware extensions, which often leads to high performance.
While domain-specific hardware extensions are not targeted in this article, we can naturally exploit them in our approach, similar to TensorIR, as we plan for our future work: the sub-computations targeted by the current hardware extensions, such as matrix multiplication on \(4\times 4\) matrices, can be straightforwardly expressed in our approach (Figure 14). Thus, we can match these sub-expressions in our low-level representation and map them to hardware extensions in our generated code. For this, instead of relying on a full partitioning in our low-level representation (as in Figure 15) such that we can apply scalar function \(f\) to the fully de-composed data (consisting of a single scalar value only in the case of a full partitioning), we plan to rely on a coarser-grained partitioning schema, e.g., down to only \(4\times 4\) matrices (rather than \(1\times 1\) matrices, as in the case of a full partitioning). This allows us to replace scalar function \(f\) (which in the case of matrix multiplication is a simple scalar multiplication \(*\)) with the operation supported by the hardware extension, such as matrix multiplication on \(4\times 4\) matrices. We expect for our future work to achieve the same advantages over TensorIR as over TVM, because apart from supporting domain-specific hardware extensions, TensorIR is very similar to TVM.
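For illustration, on NVIDIA GPUs such a mapping would ultimately target CUDA’s warp-level WMMA interface to the tensor cores; the following minimal sketch is not part of our current code generator and assumes 16x16x16 half-precision tiles, which WMMA exposes as warp-level fragments built from the small hardware matrix units discussed above:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Warp-level multiply-accumulate on tensor cores (launch with at least one full warp).
__global__ void wmma_tile_mma(const half *A, const half *B, float *C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);               // initialize the accumulator tile
  wmma::load_matrix_sync(a_frag, A, 16);           // load one 16x16 tile of A
  wmma::load_matrix_sync(b_frag, B, 16);           // load one 16x16 tile of B
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on the tensor cores
  wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}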
Portability.
While scheduling approaches achieve high performance, they tend to struggle with achieving portability. This is because even though the approaches often offer different, pre-implemented backends (e.g., a CUDA backend to target NVIDIA GPUs and an OpenCL backend for CPUs), they do not propose any structured methodology for how new backends can be added, e.g., for potentially upcoming architectures with potentially deeper memory and core hierarchies than GPUs and CPUs. This might be particularly critical (or require significant development effort) for the application area of deep learning, which is the main target of many scheduling approaches, e.g., TVM and TensorIR, and for which new architectures are arising continuously [Hennessy and Patterson, 2019].
In contrast, we introduce in this article a formally precise recipe for correct-by-construction code generation in different backends (including OpenMP, CUDA, and OpenCL), generically in the target architecture: we introduce an architecture-agnostic low-level representation (Section 3) as target for our high-level programs (Section 2), and we describe formally how our high-level programs are automatically lowered to our low-level representation (Section 4), based on the architecture-agnostic optimization space in Table 1. Rasch [2024] (Section E) outlines how executable, imperative-style program code is straightforwardly generated from low-level expressions, which we plan to discuss and illustrate in detail in our future work.
Productivity.
Scheduling approaches rely on a two-step optimization process, as discussed above: implementing a schedule (first step) and choosing optimized values of performance-critical parameters within that schedule (second step). While the second step often can be easily automated, e.g., via auto-tuning [Chen et al., 2018b], the first step—implementing a schedule—usually has to be conducted manually by the user to achieve high performance, which requires expert knowledge and thus hinders productivity. The lack of a formal foundation in many scheduling approaches further complicates implementing schedules for the user, as implementations become error prone and their effects hard to predict. For example, Fireiron’s schedules can achieve high performance, close to GPUs’ peak, but schedules in Fireiron can easily generate incorrect low-level code: Fireiron cannot guarantee that optimizations expressed in its scheduling language are semantics preserving, e.g., based on a formal foundation as done in this work, making programming Fireiron’s schedules error prone and complex for the user. Similarly, TVM is sometimes unable to detect user errors in both its high-level language (as discussed in Section 5.1) and its scheduling language [Apache TVM Community, 2022e]. Safety in parallel programming is an ongoing major demand, in particular from industry [Khronos, 2022a].
Auto schedulers, such as Halide’s optimization engine [Mullapudi et al., 2016] and TVM’s recent Ansor [Zheng et al., 2020a], aim to automatically generate well-performing, correct schedules for the user. However, a major flaw of the current auto schedulers is that even though they work well for some computations (e.g., deep learning computations, in the case of TVM’s Ansor), they may perform poorly for others. For example, our approach achieves a speedup over TVM+Ansor of \({\gt}100\times\) already for straightforward dot products (Figure 17). This is because Ansor does not exploit multiple thread blocks and uses only a small number of threads for reduction computations. While such optimization decisions are often beneficial for reductions as used in deep learning (e.g., within the computations of convolutions and matrix multiplications on deep learning workloads, because parallelization can be better exploited for outer loops of these computations), these rigid optimization decisions of Ansor may perform poorly in other contexts (e.g., for computing dot products).
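To illustrate the missed optimization, the following is a generic sketch of a dot product that uses many thread blocks, a grid-stride loop, and a block-level tree reduction; it is illustrative only and not our generated code, and the kernel name and launch configuration are assumptions:

// Multi-thread-block dot product: per-thread partial sums, shared-memory tree
// reduction within each block, and an atomic combine of the per-block results.
__global__ void dot_multi_block(const float *x, const float *y, float *res, int n) {
  extern __shared__ float part[];                    // one partial result per thread
  int tid = threadIdx.x;
  float acc = 0.0f;
  for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
    acc += x[i] * y[i];                              // grid-stride loop over the input
  part[tid] = acc;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // tree reduction (blockDim.x is a power of two)
    if (tid < s) part[tid] += part[tid + s];
    __syncthreads();
  }
  if (tid == 0) atomicAdd(res, part[0]);             // combine the block results
}
// Launch example (assuming *res is zero-initialized on the device):
//   dot_multi_block<<<256, 256, 256 * sizeof(float)>>>(d_x, d_y, d_res, n);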
To avoid the productivity issues of scheduling approaches, we have designed our optimization process as fully auto-tunable, thereby freeing the user from the burden and complexity of making complex optimization decisions. Our optimization space (Table 1) is designed as generic in the target application area and hardware architecture, thereby achieving high performance for various combinations of data-parallel computations and architectures (Section 5). Correctness of optimizations is ensured in our approach by introducing a formal foundation that enables mathematical reasoning about correctness. Particularly, our optimization process is designed as correct-by-construction, meaning that any valid optimization decision (i.e., a particular choice of tuning parameters in Table 1 that satisfies the constraints) leads to a correct expression in our low-level representation (as in Figure 15). In contrast, approaches such as the one introduced by Clément and Cohen [2022] formally validate optimization decisions of scheduling approaches in already generated low-level code. Thereby, such approaches work potentially for arbitrary scheduling approaches (Halide, TVM, \(\dotsc\)), but they cannot save the user at the high abstraction level from implementing incorrect optimizations (e.g., via easy-to-understand, high-level error messages indicating that an invalid optimization decision has been made), nor can they restrict the optimization space to valid decisions only (e.g., for an efficient auto-tuning process), because they check the already generated program code.
Scheduling approaches often also suffer from expressivity issues. For example, Fireiron is limited to computing only matrix multiplications on only NVIDIA GPUs, and TVM does not support computations that rely on multiple combine operators different from concatenation [Apache TVM Community, 2020, 2022b], e.g., as required for expressing the MBBS example in Figure 14. Also, TVM has difficulties with user-defined combine operators [Apache TVM Community, 2022d] and thus crashes for example PRL in Figure 14. In contrast to TVM, we introduce a formal methodology about how to manage different kinds of arbitrary, user-defined combine operators (Section 3), which is considered challenging [Apache TVM Community, 2020].
6.2 Polyhedral Approaches
Polyhedral approaches, as introduced by Feautrier [1992], as well as Pluto [Bondhugula et al., 2008b], Polly [Grosser et al., 2012], PPCG [Verdoolaege et al., 2013], Polyhedral Tensor Schedulers [Meister et al., 2019], TC [Vasilache et al., 2019], and AKG [Bastoul et al., 2022] rely on a formal, geometrically inspired representation, called polyhedral model. Polyhedral approaches often achieve high user productivity, e.g., by automatically parallelizing and optimizing straightforward sequential code. However, the approaches tend to have difficulties with achieving high performance and portability when used for generating low-level program code, as we outline in the following. In Section 6.5, we revisit the polyhedral approach as a potential frontend for our approach, as polyhedral transformations have proven to be efficient when used for high-level code optimizations (e.g., loop skewing [Wolf and Lam, 1991]), rather than low-level code generation.
Performance.
Polyhedral compilers tend to struggle with achieving their full performance potential. We argue that this performance issue of polyhedral compilers is mainly caused by the following two major reasons.
While we consider the set of polyhedral transformations (so-called affine transformations) as broad, expressive, and powerful, each polyhedral compiler implements a subset of expert-chosen transformations. This subset of transformations, as well as the application order of transformations, are usually fixed in a particular polyhedral compiler and chosen toward specific optimization goals only, e.g., coarse-grained parallelization and locality-aware data accesses (a.k.a. the Pluto algorithm [Bondhugula et al., 2008a]), causing the search spaces of polyhedral compilers to be a proper subset of our space in Table 1. Consequently, computations that require for high performance other subsets of polyhedral transformations and/or application orders of transformations (e.g., transformations toward fine-grained parallelization) might not achieve their full performance potential when compiled with a particular polyhedral compiler [Consolaro et al., 2024].
In contrast to the currently existing polyhedral compilers, we have designed our optimization process as generic in its goals: for example, our space is designed such that the degree of parallelization (coarse, fine, \(\dotsc\)) is fully auto-tunable for the particular combination of target architecture and computation to optimize. We consider it as interesting future work to investigate the strengths and weaknesses of the polyhedral model for expressing our generic optimization space.
We see the second reason for potential performance issues of polyhedral compilers in their difficulties with reduction-like computations. This is mainly caused by the fact that the polyhedral model captures less semantic information than the high-level program representation introduced in Section 2 of this article: combine operators, which are used to combine the intermediate results of computations (e.g., operator \(+\) from Example 2 for combining the intermediate results of the dot products within matrix multiplication), are not explicitly represented in the polyhedral model; the polyhedral model is rather focused on modeling memory accesses and their relative order only. Most likely, this semantic information is missing in the polyhedral model because polyhedral approaches were originally intended to fully automatically optimize loop-based, sequential code (such as Pluto and PPCG)—extracting combine operators automatically from sequential code is challenging and often even impossible (Rice’s theorem).
In contrast, our proposed high-level representation explicitly captures combine operators (Figure 14), by requesting these operators explicitly from the user. This is important, because the operators are often required for generating code that fully utilizes the highly parallel hardware of state-of-the-art architectures (GPUs, etc.), as discussed in Section 5. Similarly to our approach, the polyhedral compiler TC also requests combine operators explicitly from the user. However, TC is restricted to the operators + (addition), * (multiplication), min (minimum), and max (maximum) only, and thereby TC is not able to express important examples in Figure 14, e.g., PRL which is popular in data mining. Moreover, TC outsources the computation of its combine operators to the NVIDIA CUB library [NVIDIA, 2022a]; most likely as a workaround, because TC relies on the polyhedral model which is not designed to capture and exploit semantic information about combine operators for optimization. As a result, TC depends on external approaches for computing combine operators, which might not always be available (e.g., for upcoming architectures).
Workarounds have been proposed by the polyhedral community to target reduction-like computations [Doerfert et al., 2015; Reddy et al., 2016]. However, these approaches are limited to a subset of computations, e.g., by not supporting user-defined scalar types [Doerfert et al., 2015] (as required for our PRL example in Figure 14), or by being limited to GPUs only [Reddy et al., 2016]. Comparing the semantic information captured in the polyhedral model vs. our MDH-based representation has been the focus of discussions between polyhedral experts and MDH developers [Google SIG MLIR Open Design Meeting, 2020].
Portability.
The polyhedral approach, in its general form, is a framework offering transformation rules (affine transformations), and each individual polyhedral compiler implements a set of such transformations which are then instantiated (e.g., with particular tile sizes) and applied when compiling a particular application. However, individual polyhedral compilers (e.g., PPCG and Pluto) apply a fixed set of affine transformations, thereby rigidly optimizing for a particular target architecture only, e.g., only GPU (as PPCG) or only CPU (as Pluto), and it remains open which affine transformations have to be used and how for other architectures, e.g., upcoming accelerators for deep learning computations [Hennessy and Patterson, 2019] with potentially more complex memory and core hierarchies than GPUs and CPUs. Moreover, while we introduce an explicit low-level representation (Section 3), the polyhedral approach does not introduce representations on different abstraction levels: the model relies on one representation that is transformed via affine transformations. Apart from the ability of our low-level representation to handle combine operators (which we consider as complex and important), we see the advantages of our explicit low-level representation in, for example, explicitly representing memory regions, which allows formally defining important correctness constraints, e.g., that GPU architectures allow combining the results of threads in designated memory regions only. Furthermore, our low-level representation also allows straightforwardly generating executable code from it (shown by Rasch [2024], Section E, and planned to be discussed thoroughly in future work). In contrast, code generation from the polyhedral model has proven challenging [Bastoul et al., 2022; Vasilache et al., 2022; Grosser et al., 2015].
Productivity.
Most polyhedral compilers achieve high user productivity, by fully automatically parallelizing and optimizing straightforward sequential code (as Pluto and PPCG do). Our approach currently relies on a Domain-Specific Language (DSL) for expressing computations, as discussed in Section 2; thus, our approach can be considered as less productive than many polyhedral compilers. However, Rasch et al. [2020b, c] show that DSL programs in our approach can be automatically generated from sequential code (optionally annotated with simple, OpenMP-like directives for expressing combine operators, enabling advanced optimizations), by using the polyhedral tool pet [Verdoolaege and Grosser, 2012] as a frontend for our approach. Thereby, we are able to achieve the same high user productivity as polyhedral compilers. We consider this direction—combining the polyhedral model with our approach—as promising, as it enables benefiting from the advantages of both directions: optimizing sequential programs and making them parallelizable using polyhedral techniques (like loop skewing, as also outlined in Section 6.5), and mapping the optimized and parallelizable code eventually to parallel architectures based on the concepts and methodologies introduced in this article.
6.3 Functional Approaches
Functional approaches map data-parallel computations that are expressed via small, formally defined building blocks (a.k.a. patterns [Gorlatch and Cole, 2011], such as map and reduce) to the memory and core hierarchies of parallel architectures, based on a strong formal foundation. Notable functional approaches include Accelerate [Chakravarty et al., 2011], Obsidian [Svensson et al., 2011], so-called skeleton libraries [Steuwer et al., 2011, Aldinucci et al., 2017, Enmyren and Kessler, 2010, Ernstsson et al., 2018], and the modern Lift approach [Steuwer et al., 2015] (recently also known as RISE [Steuwer et al., 2022]).
In the following, as functional approaches usually follow the same basic concepts and methodologies, we focus on comparing to Lift, because Lift is more recent than, e.g., Accelerate and Obsidian.
Performance.
Functional approaches tend to struggle with achieving their full performance potential, often caused by the design of their optimization spaces. For example, analogously to our approach, the functional approach Lift relies on an internal low-level representation [Steuwer et al., 2017] that is used as the target for Lift’s high-level programs. However, Lift’s transformation process, from high level to low level, has turned out to be challenging: Lift’s lowering process relies on an infinitely large optimization space—identifying a well-performing configuration within that space is in general too complex to be done automatically, due to the space’s large and complex structure. As a workaround, Lift currently uses the Elevate approach [Hagedorn et al., 2020b] to incorporate user knowledge into the optimization process; however, at the cost of productivity, as manually expressing optimizations is challenging, particularly for non-expert users.
In contrast, our optimization process is designed as auto-tunable (Table 1), thereby achieving high performance fully automatically, as confirmed in our experiments (Section 5), without involving the user in optimization decisions. In particular, our previous work already showed that our approach—even in its original, proof-of-concept implementation [Rasch et al., 2019a]—can significantly outperform Lift on GPU and CPU [Rasch et al., 2019a]. Our performance advantage over Lift is mainly caused by the design of our optimization process: relying on formally defined tuning parameters (Table 1), rather than on formal transformation rules that span an overly large and complex search space (as in Lift), thereby contributing to a simpler, fully auto-tunable optimization process.
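Schematically, and simplifying both sides, the contrast can be stated as follows: our search space is the finite, constraint-restricted product of the parameter domains in Table 1, whereas a rewrite-rule-based space is the unbounded closure of rule applications to a program expression \(e\):
\[
\mathcal{S}_{\text{ours}} \;=\; \Big\{\, v \in \textstyle\prod_{p\,\in\,\text{Table 1}} \mathrm{dom}(p) \;\Big|\; v \text{ satisfies the constraints} \,\Big\},
\qquad
\mathcal{S}_{\text{rewriting}} \;=\; \big\{\, (r_n \circ \dots \circ r_1)(e) \;\big|\; n \in \mathbb{N}_0,\ r_j \text{ rewrite rules} \,\big\}.
\]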
Portability.
The current functional approaches are usually designed and optimized toward code generation in a particular programming model only. For example, Lift inherently relies on the OpenCL programming model, because OpenCL works for multiple kinds of architectures: NVIDIA GPU, Intel CPU, and so on. However, we see two major disadvantages in addressing the portability issue via OpenCL only: (1) GPU-specific optimizations (such as shuffle operations [NVIDIA, 2018]) are available only in the CUDA programming model, but not in OpenCL; (2) the set of OpenCL-compatible devices is broad but still limited; in particular, in the new golden age for computer architectures [Hennessy and Patterson, 2019], new architectures are arising continuously and may not support the OpenCL standard. We consider targeting new programming models as challenging for Lift, as its formal low-level representation is inherently designed for OpenCL [Steuwer et al., 2017]; targeting further programming models with Lift would require the design and implementation of new low-level representations, which we do not consider as straightforward.
To allow easily targeting new programming models with our approach, we have designed our formalism as generic in the target model: our low-level representation (Figure 15) and optimization space (Table 1) are designed and optimized toward an abstract system model (Definition 10) which is capable of representing the device models of important programming approaches, including OpenMP, CUDA, and OpenCL (Example 11). Furthermore, we have designed our high- and low-level representations as minimalistic (Figures 6 and 15), e.g., by relying on three higher-order functions only for expressing programs at the high abstraction level, which simplifies and reduces the development effort for implementing code generators for programming models.
In addition, we believe that compared to our approach, the following basic design decisions of Lift (and similar functional approaches) complicate the process of code generation for them and increase the development effort for implementing code generators: (1) relying on a vast set of small patterns for expressing computations, rather than aiming at a minimalistic design as we do (as also discussed in Section 5.3); (2) relying on complex function nestings and compositions for expressing computations, rather than avoiding nesting and relying on a fixed composition structure of functions, as in our approach (Figure 5); (3) requiring new patterns for targeting new classes of data-parallel computations (such as patterns slide and pad for stencils [Hagedorn et al., 2018]), which have to be non-trivially integrated into Lift’s type and optimization system (often via extensions of the systems [Hagedorn et al., 2018, Remmelg et al., 2016]), instead of relying on a fixed set of expressive patterns (Figure 6) and generalized optimizations (Table 1) that work for various kinds of data-parallel computations (Figure 14); (4) expressing high-level and low-level concepts in the same language, instead of separating high-level and low-level concepts for a more structured and thus simpler code generation process (Figure 4). We consider these four design decisions as disadvantageous for code generation, because they require a code generator to handle various kinds of patterns (decision 1), and the patterns need to be translated to significantly different code variants, depending on their nesting level and composition order (decision 2). Moreover, each extension of patterns (decision 3) might affect code generation also for the already supported patterns, because the existing patterns need to be combined with the new ones via composition and nesting (decision 2). We consider mixing up high-level and low-level concepts in the same language (decision 4) as further complicating the code generation process, because code generators cannot be implemented in clear, distinct stages: high-level language \(\rightarrow\) low-level language \(\rightarrow\) executable program code.
Productivity.
Functional approaches are expressive frameworks—to the best of our knowledge, the majority of these approaches should also be able to express (possibly after some extension) many of the high-level programs that can also be expressed via our high-level representation (e.g., those presented in Figure 14).
A main difference we see between the high-level representations of existing functional approaches and the representation introduced by our approach is that the existing approaches rely on a vast set of higher-order functions for expressing computations; these functions have to be functionally composed and nested in complex ways for expressing computations. For example, expressing matrix multiplication in Lift also requires involving Lift’s pattern transpose (even when operating on non-transposed input matrices) [Remmelg et al., 2016], as per design in Lift, multi-dimensional data is considered as an array of arrays (rather than an MDA, as in our approach as well as polyhedral approaches). In contrast, we aim to keep our high-level language minimalistic, by expressing data-parallel computations using exactly three higher-order functions which are always used in the same, fixed order (shown in Figure 5). Rasch et al. [2020b, c] confirm that due to the minimalistic and structured design of our high-level representation, programs in our representation can even be systematically generated from straightforward, sequential program code.
Functional approaches also tend to require extension when targeting new application areas, which hinders the expressivity of the frameworks and thus also their productivity. For example, functional approach Lift [Steuwer et al., 2015] required notable extension for targeting, e.g., matrix multiplications (so-called macro-rules had to be added to Lift [Remmelg et al., 2016]) and stencil computations (primitives slide and pad were added, and Lift’s tiling optimization had to be extended toward overlapped tiling [Hagedorn et al., 2018]). In contrast, we have formally defined our class of targeted computations (as MDH functions, Definition 3), and the generality of our approach allows expressing matrix multiplications and stencils out of the box, without relying on domain-specific building blocks.
6.4 Domain-Specific Approaches
Many approaches focus on code generation and optimization for particular domains. A popular domain-specific approach is ATLAS [Whaley and Dongarra, 1998] for linear algebra routines on CPUs.26 Similar to ATLAS, the FFTW approach [Frigo and Johnson, 1998] targets the Fast Fourier Transform, and SPIRAL [Puschel et al., 2005] targets Digital Signal Processing.
Nowadays, the best performing, state-of-practice domain-specific approaches are often provided by vendors and specifically designed and optimized toward their target application domain and also architecture. For example, the popular vendor library NVIDIA cuBLAS [NVIDIA, 2022b] is optimized by hand, on the assembly level, toward computing linear algebra routines on NVIDIA GPUs—cuBLAS is considered in the community as the gold standard for computing linear algebra routines on GPUs. Similarly, Intel’s oneMKL library [Intel, 2022c] computes linear algebra routines with high performance on Intel CPUs, and the libraries NVIDIA cuDNN [NVIDIA, 2022e] and Intel oneDNN [Intel, 2022b] work well for convolution computations on either NVIDIA GPU (cuDNN) or Intel CPU (oneDNN), respectively.
In the following, we discuss domain-specific approaches in terms of performance, portability, and productivity.
Performance.
Domain-specific approaches, such as cuBLAS and cuDNN, usually achieve high performance. This is because the approaches are hand-optimized by performance experts—on the assembly level—to exploit the full performance potential of their target architecture. In our experiments (Section 5), we show that our approach often achieves competitive and sometimes even better performance than domain-specific approaches provided by NVIDIA and Intel, which is mainly caused by their portability issues across different data characteristics, as we discuss in the next paragraph.
Portability.
Domain-specific approaches usually struggle with achieving portability across different architectures. This is because the approaches are often implemented in architecture-specific assembly code to achieve high performance, thereby also being limited to their target architecture. The domain-specific approaches often also struggle with achieving portability of performance across different data characteristics (e.g., their sizes): the approaches usually rely on a set of pre-implemented code variants that are each designed and optimized toward average high performance across a range of data characteristics. In contrast, our approach (as well as many scheduling and polyhedral approaches) allows automatically optimizing (auto-tuning) computations for particular data characteristics, which is important for achieving high performance [Tillet and Cox, 2017]. Thereby, our approach often outperforms domain-specific approaches (as confirmed in Section 5), particularly for challenging data characteristics (small, uneven, irregularly shaped, \(\dots\)), e.g., as used in deep learning. The cost of auto-tuning is well amortized in many application areas, because the auto-tuned implementations are re-used in many program runs. Furthermore, auto-tuning avoids the time-intensive and costly process of hand-optimization by human experts.
Productivity.
Domain-specific approaches usually achieve the highest productivity for their target domain (e.g., linear algebra), by providing easy-to-use high-level abstractions. However, the approaches suffer from significant expressivity issues, because—per design—they are inherently restricted to their target application domain only. Also, the approaches are often inherently bound to only particular architectures, e.g., only GPU (as NVIDIA cuBLAS and cuDNN) or only CPU (as Intel oneMKL and oneDNN). Domain-specific vendor libraries, such as NVIDIA cuBLAS and Intel oneMKL, also tend to offer the user differently performing variants of computations; the variants have to be naively tested by the user when striving for the full performance potential of the approaches (as discussed in Section 5.4), which is cumbersome.
6.5 Higher-Level Approaches
There is a broad range of existing work that is focused on higher-level optimizations than proposed by this work. We consider such higher-level approaches as greatly combinable with our approach. For example, the polyhedral approach is capable of expressing algorithmic-level optimizations, like loop skewing [Wolf and Lam, 1991], to make programs parallelizable; such optimizations are beyond the scope of this work, but they can be combined with our approach as demonstrated by Rasch et al. [2020b,c]. Similarly, we consider the approaches introduced by Farzan and Nicolet [2019], Frigo et al. [1999], Gunnels et al. [2001], Yang et al. [2021], which also focus on algorithmic-level optimizations, as greatly combinable with our approach: algorithmically optimizing user code according to the approaches’ techniques, and using our methodologies to eventually map the optimized code to executable program code for parallel architectures.
Futhark [Henriksen et al., 2017], Dex [Paszke et al., 2021], and ATL [Liu et al., 2022] are further approaches focused on high-level program transformations, like advanced flattening mechanisms [Henriksen et al., 2019], thereby optimizing programs at the algorithmic level of abstraction. We consider using our work as backend for these approaches as promising: the three approaches often struggle with mapping their algorithmically optimized program variants eventually to the multi-layered memory and core hierarchies of state-of-the-art parallel architectures, which is exactly the focus of this work.
6.6 Existing Work on MDH
Our work is inspired by the algebraic formalism of MDHs which is introduced in the work-in-progress paper [Rasch and Gorlatch, 2016]. The MDH approach, as presented in the previous work, relies on a semi-formal foundation and focuses on code generation for the OpenCL programming model only [Rasch et al., 2019a]. This work makes major contributions over the existing work on MDHs and its OpenCL code generation approach.
We introduce a full formalization of MDH’s high-level program representation. In our new formalism, we rely on expressive typing: for example, we encode MDHs’ data sizes into our type system, e.g., by introducing index sets for MDAs (Definition 1), and we respect and maintain these sets thoroughly during MDH computations. Our expressive typing significantly contributes to correct and simplified code generation, as all relevant type and data size information is contained in our formal, low-level program representation (Figure 15) from which we eventually generate executable program code (Section 3). In contrast, the existing MDH work considers MDAs of arbitrary sizes and dimensionalities to be all of the same, straightforward type, which has greatly simplified the design of the proof-of-concept MDH formalism introduced by Rasch and Gorlatch [2016] (in particular, the definition and usage of combine operators), but at the cost of significantly harder and error-prone code generation: all the missing, type-relevant information needs to be elaborated by the implementer of the code generator in the existing MDH work, e.g., allocation sizes of fast memory resources used for caching input data or for storing computed intermediate results. Furthermore, while the original MDH work [Rasch and Gorlatch, 2016] is focused on introducing higher-order function md_hom only, this work particularly also introduces higher-order functions inp_view and out_view (Section 2.3) which express input and output views in a formally structured and concise manner, and which are central building blocks in our new approach for expressing computations (Figure 14). Also, by introducing and exploiting the index set concept for MDAs, we have improved the definition of the concatenation operator ++ (Example 1) toward commutativity, which is required for important optimizations, e.g., loop permutations (expressed via Parameters D1, S1, R1 in Table 1).
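The idea behind this commutativity can be sketched as follows (a simplified, one-dimensional illustration; the precise definition is given in Example 1): since every MDA element keeps its global index, the result of a concatenation is determined by the operands’ index sets rather than by the operand order:
\[
(a \,\texttt{++}\, b)[\,i\,] \;:=\;
\begin{cases}
a[\,i\,], & i \in I_a\\
b[\,i\,], & i \in I_b
\end{cases}
\quad (I_a \cap I_b = \emptyset)
\qquad\Longrightarrow\qquad
a \,\texttt{++}\, b \;=\; b \,\texttt{++}\, a.
\]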
A further substantial improvement is the introduction of our low-level representation (Section 3). It relies on a novel combination of tuning parameters (Table 1) that enhance, generalize, and extend the existing, proof-of-concept MDH parameters which capture a subset of OpenCL-oriented features only [Rasch et al., 2019a]. Moreover, while the existing MDH work introduces formally only parameters for flexibly choosing numbers of threads [Rasch and Gorlatch, 2016] (which corresponds to a very limited variant of our tuning parameter 0 in Table 1, because our parameter 0 also chooses numbers of memory tiles and is not restricted to OpenCL), the other OpenCL parameters are introduced and discussed by Rasch et al. [2019a] only informally, from a technical perspective. With our novel parameter set, we are able to target various kinds of programming models (e.g., also CUDA, as in Section 5) and also to express important optimizations that are beyond the existing work on MDH, e.g., optimizing the memory access pattern of computations: for example, we achieve speedups \({\gt}2\times\) over existing MDH for the deep learning computations discussed in Section 5. Our new tuning parameters are expressive enough to represent state-of-the-art, data-parallel implementations, e.g., as generated by scheduling and polyhedral approaches (shown by Rasch [2024], Figures 20–23), and our experiments in Section 5 confirm that auto-tuning our parameters enables performance beyond the state of the art, including hand-optimized solutions provided by vendors, which is not possible when using the existing MDH approach. The expressivity of our parameters particularly also enables comparing significantly differently optimized implementations, based on the values of formally specified tuning parameters, which we consider as promising for structured performance analysis in future work. Moreover, our new low-level representation targets architectures that may have arbitrarily deep memory and core hierarchies, as our representation is optimized toward an abstract system model (Definition 10); in contrast, the existing MDH work is focused on OpenCL-compatible architectures only.
Our experimental evaluation extends the previous MDH experiments by comparing also to the popular state-of-practice approach TVM which is attracting increasing attention from both academia [Apache Software Foundation, 2021] and industry [OctoML, 2022]. Also, we compare to the popular polyhedral compilers PPCG and Pluto, as well as the latest versions of hand-optimized, high-performance libraries provided by vendors. Furthermore, we have included a real-world case study in our experiments, considering the most time-intensive computations within the three popular deep learning neural networks ResNet-50, VGG-16, and MobileNet; the study also includes Capsule-style convolution computations, which are considered challenging to optimize [Barham and Isard, 2019]. Moreover, Figure 14 demonstrates MDH’s expressivity using new examples: it shows that MDH—based on the new contributions of this work (e.g., view functions)—is capable of expressing computations bMatMuL, MCC_Capsule, Histo, scan, and MBBS, which have not been expressed via MDH in previous work. Our experiments confirm that we achieve high performance for bMatMuL and MCC_Capsule on GPUs and CPUs, and our future work aims to thoroughly analyze our approach for computations Histo, scan, and MBBS in terms of performance, portability, and productivity.
7 Conclusion
We introduce a formal (de/re)-composition approach for data-parallel computations targeting state-of-the-art parallel architectures. Our approach aims to combine three major advantages over related approaches—performance, portability, and productivity—by introducing formal program representations on both (1) a high level, for conveniently expressing—in one uniform formalism—various kinds of data-parallel computations (including linear algebra routines, stencil computations, data mining algorithms, and quantum chemistry computations), agnostic of hardware and optimization details, while still capturing all information relevant for generating high-performance program code; and (2) a low level, which allows uniformly reasoning—in the same formalism—about optimized (de/re)-compositions of data-parallel computations targeting different kinds of parallel architectures (GPUs, CPUs, etc.). We lower our high-level representation to our low-level representation, in a formally sound manner, by introducing a generic search space that is based on performance-critical parameters. The parameters of our lowering process enable fully automatically optimizing (auto-tuning) our low-level representations for a particular target architecture and characteristics of the input and output data, and our low-level representation is designed such that it can be straightforwardly transformed to executable program code in imperative-style programming languages (including OpenMP, CUDA, and OpenCL). Our experiments confirm that due to the design and structure of our generic search space in combination with auto-tuning, our approach achieves higher performance on GPUs and CPUs than popular state-of-practice approaches, including hand-optimized libraries provided by vendors.
8 Future Work
We consider this work as a promising starting point for future directions. A major future goal is to extend our approach toward expressing and optimizing simultaneously multiple data-parallel computations (e.g., matrix multiplication followed by convolution), rather than optimizing computations individually and thus independently from each other (e.g., only matrix multiplication or only convolution). Such an extension enables optimizations, such as kernel fusion, which is important for the overall application performance and considered challenging [Fukuhara and Takimoto, 2022; Li et al., 2022; Wahib and Maruyama, 2014]. We see this work as a promising foundation for our future goal, because it enables expressing and reasoning about different computations in the same formal framework. Targeting computations on sparse input/output data formats, inspired by Ben-Nun et al. [2017], Hall [2020], Kjolstad et al. [2017], Pizzuti et al. [2020], is a further major goal, which requires extending our approach toward irregularly-shaped input and output data, similarly as done by Pizzuti et al. [2020]. Regarding our optimization process, we aim to introduce an analytical cost model for computations expressed in our formalism—based on operational semantics—thereby reducing (or even avoiding) the auto-tuning overhead, similarly as done by Li et al. [2021], Muller and Hoffmann [2021]. Moreover, we aim to incorporate methods from machine learning into our optimization process [Leather et al., 2014, Merouani et al., 2024], instead of relying on empirical auto-tuning methods only. To make our work better accessible for the community, we aim to implement our approach into MLIR [Lattner et al., 2021] which offers a reusable compiler infrastructure. The contributions of this work give a precise, formal recipe of how to implement our introduced methods into approaches such as MLIR. Moreover, relying on the MLIR framework will contribute to a structured code generation process in assembly-level programming models, such as LLVM [Lattner and Adve, 2004] and NVIDIA PTX [NVIDIA, 2022i]. We consider targeting assembly languages as important for our future work: assembly code offers further, low-level optimization opportunities [Goto and Geijn, 2008, Lai and Seznec, 2013], thereby enabling our approach to potentially achieve higher performance than reported in Section 5 for our generated CUDA and OpenCL code. Also, we aim to extend our approach toward distributed multi-device systems that are heterogeneous, inspired by dynamic load balancing approaches [Chen et al., 2010] and advanced data distribution techniques [Yadav et al., 2022]. Targeting domain-specific hardware extensions, such as NVIDIA Tensor Cores [NVIDIA, 2017], is also an important goal for our future work, as such extensions allow significantly accelerating the computations they target (e.g., deep learning [Markidis et al., 2018]). Finally, we aim to support more target backends (in addition to OpenMP, CUDA, and OpenCL), e.g., AMD’s HIP [AMD, 2024] which is efficient for programming AMD GPUs. Similarly, we consider Triton [Tillet et al., 2019], AMOS [Zheng et al., 2022], and Graphene [Hagedorn et al., 2023] as further, promising backends for our approach.
Acknowledgments
The author would like to thank Richard Schulze for conducting the extensive set of experiments and the reviewers for their thorough reading of the paper and their comments and remarks that helped us to improve this work.
We thoroughly compare to the existing MDH work in Section 6.6.
3
We consider compiler engineers and library designers as the main users of our approach. Rasch et al. [2020b] show that our approach can also take straightforward, sequential code as input, which makes our approach attractive also to end users.
4
The expression in Figure 6 can also be extracted from straightforward, annotated sequential code [Rasch et al., 2020b,c].
5
We consider as scalar types integers \(\mathbb{Z}\) (a.k.a. int in programming), floating point numbers \(\mathbb{Q}\) (a.k.a. float or double), any fixed collection of types (a.k.a. record or struct), and so on. We denote the set of scalar types as TYPE in the following.
6
We denote by \(\mathbb{N}\) the set of positive natural numbers \(\{1,2,\dotsc\}\), and we use \(\mathbb{N}_{0}\) for the set of natural numbers including \(0\).
7
Our technical implementation takes as input a representation that is equivalent to Figure 6, expressed via straightforward program code (see Rasch [2024], Section A.4).
8
We denote by \([L,U)_{\mathbb{N}_{0}} := \{\, n\in\mathbb{N}_{0} \mid L\leq n < U \,\}\) the half-open interval of natural numbers (including \(0\)) between \(L\) (incl.) and \(U\) (excl.).
9
We implicitly interpret the output scalar of function \(f\) as a singleton MDA, as combine operators operate on MDAs and not on scalars (formal details provided by Rasch [2024], Definition 4).
10
We use the case \(D=0\) to represent scalar values (details provided in Rasch [2024], Section B.7).
11
The empty braces denote accessing a scalar value (details provided by Rasch [2024], Section B.7).
12
Our future work (outlined in Section 8) aims to additionally allow coarser-grained partitioning schemas, e.g., to target domain-specific hardware extensions (such as NVIDIA Tensor Cores [NVIDIA, 2017] which compute \(4\times 4\) matrices immediately in hardware, rather than \(1\times 1\) matrices as obtained in the case of a full partitioning).
13
We currently rely on auto-tuning [Rasch et al., 2021] for choosing optimized values of performance-critical parameters, as we discuss in Section 5.
We deliberately do not model into our ASM representation an architecture’s particular number of cores and/or sizes of memory regions, because our optimization process is designed to be generic in these numbers and sizes, for high flexibility.
16
For simplicity, we refrain from annotating identifier ASM-LVL with values \(L\) and \(D\) (e.g., \(\texttt{ASM-LVL}^{\langle L,D\rangle}\)), because both values will usually be clear from the context.
17
Rasch [2024] (Section 3.5) shows that by choosing particular tuning parameter values, we can express in our formalism the (de/re)-compositions of different, existing state-of-the-art approaches, including scheduling-based approach TVM [Chen et al., 2018a], polyhedral compilers PPCG [Verdoolaege et al., 2013], and Pluto [Bondhugula et al., 2008b].
18
As for identifier ASM-LVL (Definition 13), we refrain from annotating identifier MDH-LVL with values \(L\) and \(D\). Note that MDH-LVL and ASM-LVL both refer to the same set of pairs, but we use identifier MDH-LVL when referring to MDH levels and identifier ASM-LVL when referring to ASM levels, correspondingly, for better clarity.
19
We cannot compare to polyhedral compiler TC [Vasilache et al., 2019], which is optimized toward deep learning computations on GPUs, because TC is no longer under active development and thus does not work on newer CUDA architectures [Facebook Research, 2022]. Rasch et al. [2019a] show that our approach—already in its proof-of-concept version—achieves higher performance than TC for popular computations on real-world datasets.
20. Pennycook’s metric is actually called PP. Since PP in particular includes functional portability, we also refer to Pennycook’s PP more generally simply as Portability.
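For the reader’s convenience, we recall the definition from Pennycook et al. [2019]: PP is the harmonic mean of an application’s performance efficiencies over the platforms of interest,
\[
\text{PP}(a,p,H) =
\begin{cases}
\dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a,p)}}, & \text{if application } a \text{ is supported on every platform } i \in H,\\[1.5ex]
0, & \text{otherwise,}
\end{cases}
\]
where \(e_i(a,p)\) denotes the performance efficiency of application \(a\) solving problem \(p\) on platform \(i \in H\).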
21. Host code is required in the CUDA and OpenCL approaches for program execution: it compiles the CUDA and OpenCL programs, performs data transfers between host and device, and so on. In this work, we rely on the high-level library dOCAL [Rasch et al., 2018, 2020a] for host code programming.
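To give an impression of the boilerplate that such host code entails (and that dOCAL abstracts away from), the following minimal C++ sketch uses the plain OpenCL C API to compile a kernel at runtime, allocate a device buffer, launch the kernel, and copy the result back to the host; error handling is omitted, and the trivial iota kernel is purely illustrative (it is not code generated by our approach):
\begin{verbatim}
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Purely illustrative kernel: writes its global id into the output buffer.
static const char *kernel_src =
    "__kernel void iota(__global int *out) {"
    "  out[get_global_id(0)] = (int)get_global_id(0);"
    "}";

int main() {
  const size_t n = 16;

  // Select a platform and device.
  cl_platform_id platform; clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

  // Create context and command queue.
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

  // Compile the OpenCL program at runtime.
  cl_program program = clCreateProgramWithSource(ctx, 1, &kernel_src, nullptr, nullptr);
  clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
  cl_kernel kernel = clCreateKernel(program, "iota", nullptr);

  // Allocate device memory and set kernel arguments.
  cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(int), nullptr, nullptr);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &out_buf);

  // Launch the kernel and transfer the result back to the host.
  clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
  std::vector<int> out(n);
  clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, n * sizeof(int), out.data(), 0, nullptr, nullptr);
  std::printf("out[%zu] = %d\n", n - 1, out[n - 1]);

  // Release resources.
  clReleaseMemObject(out_buf); clReleaseKernel(kernel); clReleaseProgram(program);
  clReleaseCommandQueue(queue); clReleaseContext(ctx);
  return 0;
}
\end{verbatim}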
22. This work is focused on dense computations. Lift supports sparse computations [Pizzuti et al., 2020], which we consider as future work for our approach (as also outlined in Section 8). We consider Lift’s approach, based on its so-called position dependent arrays, a great inspiration for this future goal.
23. Since the Intel and NVIDIA libraries are not open source, we cannot explain their performance behavior with certainty.
24. For the interested reader, Rasch [2024] (Section D.2) reports the runtime of cuBLASEx for all of its algorithm variants, as well as the accuracy achieved by the different variants.
25. Rasch et al. [2023] introduce an (optional) scheduling language for MDH to incorporate expert knowledge into MDH’s optimization process, e.g., to achieve: (1) better optimization, as an auto-tuning system might not always make the same high-quality optimization decisions as a human expert; and/or (2) faster auto-tuning, as some (or even all) optimization decisions might be made by the expert user and thus need not be left to the costly auto-tuner.
26. Previous work [Rasch et al., 2021] shows that MDH (already in its original, proof-of-concept implementation) achieves higher performance than ATLAS.
References
Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, and Massimo Torquati. 2017. Fastflow: High-Level and Efficient Streaming on Multi-Core. In Programming Multi-Core and Many-Core Computing Systems, Parallel and Distributed Computing. John Wiley & Sons, Ltd. 261–280. DOI:
Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. OpenTuner: An Extensible Framework for Program Autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT ’14). ACM, New York, NY, 303–316. DOI:
David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler Transformations for High-Performance Computing. ACM Computing Surveys 26, 4 (Dec. 1994), 345–420. DOI:
Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’19). 193–205. DOI:
Lénaïc Bagnères, Oleksandr Zinenko, Stéphane Huot, and Cédric Bastoul. 2016. Opening Polyhedral Compiler’S Black Box. In Proceedings of the International Symposium on Code Generation and Optimization (CGO ’16). ACM, New York, NY, 128–138. DOI:
Prasanna Balaprakash, Jack Dongarra, Todd Gamblin, Mary Hall, Jeffrey K. Hollingsworth, Boyana Norris, and Richard Vuduc. 2018. Autotuning in High-Performance Computing Applications. Proceedings of the IEEE 106, 11 (2018), 2068–2083. DOI:
Paul Barham and Michael Isard. 2019. Machine Learning Systems Are Stuck in a Rut. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS ’19). ACM, New York, NY, 177–183. DOI:
Cedric Bastoul, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Adilla Susungi, Javier de Juan, Etienne Filhol, Baptiste Jarry, Gianpietro Consolaro, and Renwei Zhang. 2022. Optimizing GPU Deep Learning Operators with Polyhedral Scheduling Constraint Injection. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’22). 313–324. DOI:
Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, and Torsten Hoefler. 2019. Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19).
Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. 2017. Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations. ACM SIGPLAN Notices 52, 8 (Jan. 2017), 235–248. DOI:
Richard S. Bird. 1989. Lectures on Constructive Functional Programming. In Constructive Methods in Computing Science. Manfred Broy (Ed.), Springer, Berlin, 151–217.
Barry Boehm, Bradford Clark, Ellis Horowitz, Chris Westland, Ray Madachy, and Richard Selby. 1995. Cost Models for Future Software Life Cycle Processes: COCOMO 2.0. Annals of Software Engineering 1, 1 (1995), 57–94. DOI:
Uday Bondhugula. 2020. High Performance Code Generation in MLIR: An Early Case Study with GEMM. arXiv:2003.00532. Retrieved from https://doi.org/10.48550/arXiv.2003.00532
Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2008a. Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model. In Compiler Construction. Laurie Hendren (Ed.), Springer, Berlin, 132–146.
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008b. Pluto: A Practical and Fully Automatic Polyhedral Program Optimization System. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI ’08) (June 2008). Citeseer.
Ajay Brahmakshatriya and Saman Amarasinghe. 2021. BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’21). 39–51. DOI:
José María Cecilia, José Manuel García, and Manuel Ujaldón. 2012. CUDA 2D Stencil Computations for the Jacobi Method. In Applied Parallel and Scientific Computing. Kristján Jónasson (Ed.), Springer, Berlin, 173–183.
Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell Array Codes with Multicore GPUs. In Proceedings of the 6th Workshop on Declarative Aspects of Multicore Programming (DAMP ’11). ACM, New York, NY, 3–14. DOI:
Chun Chen, Jacqueline Chame, and Mary Hall. 2008. CHiLL: A Framework for Composing High-Level Loop Transformations. Technical Report 08-897, University of Southern California.
Long Chen, Oreste Villa, Sriram Krishnamoorthy, and Guang R. Gao. 2010. Dynamic Load Balancing on Single- and Multi-GPU Systems. In Proceedings of the 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS ’10). 1–12. DOI:
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018a. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18). USENIX Association, Carlsbad, CA, 578–594. Retrieved from https://www.usenix.org/conference/osdi18/presentation/chen
Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018b. Learning to Optimize Tensor Programs. In Proceedings of the Advances in Neural Information Processing Systems. Samy Bengio, Hanna Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (Eds.), Vol. 31. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2018/file/8b5700012be65c9da25f49408d959ca0-Paper.pdf
Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Publishing Company, Incorporated.
Basile Clément and Albert Cohen. 2022. End-to-End Translation Validation for the Halide Language. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA ’22) (Proceedings of the ACM on Programming Languages (PACMPL), Vol. 6). DOI:
Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, and Cedric Bastoul. 2024. PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’24). 28–40. DOI:
Johannes Doerfert, Kevin Streit, Sebastian Hack, and Zino Benaissa. 2015. Polly’s Polyhedral Scheduling in the Presence of Reductions. arXiv:1505.07716. Retrieved from http://arxiv.org/abs/1505.07716
Johan Enmyren and Christoph W. Kessler. 2010. SkePU: A Multi-Backend Skeleton Programming Library for Multi-GPU Systems. In Proceedings of the 4th International Workshop on High-Level Parallel Programming and Applications (HLPP ’10). ACM, New York, NY, 5–14. DOI:
August Ernstsson, Lu Li, and Christoph Kessler. 2018. SkePU 2: Flexible and Type-Safe Skeleton Programming for Heterogeneous Parallel Systems. International Journal of Parallel Programming 46, 1 (2018), 62–80. DOI:
Azadeh Farzan and Victor Nicolet. 2019. Modular Divide-and-Conquer Parallelization of Nested Loops. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’19). ACM, New York, NY, 610–624. DOI:
Paul Feautrier. 1992. Some Efficient Solutions to the Affine Scheduling Problem. I. One-Dimensional Time. International Journal of Parallel Programming 21, 5 (1992), 313–347. DOI:
Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’23), Vol. 2. ACM, New York, NY, 804–817. DOI:
Matteo Frigo and Steven G. Johnson. 1998. FFTW: An Adaptive Software Architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’98) (Cat. No.98CH36181), Vol. 3. 1381–1384. DOI:
Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. 1999. Cache-Oblivious Algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039). 285–297. DOI:
Junji Fukuhara and Munehiro Takimoto. 2022. Automated Kernel Fusion for GPU Based on Code Motion. In Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES ’22). ACM, New York, NY, 151–161. DOI:
Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. 2006. Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies. International Journal of Parallel Programming 34, 3 (2006), 261–317. DOI:
Horacio González-Vélez and Mario Leyton. 2010. A Survey of Algorithmic Skeleton Frameworks: High-Level Structured Parallel Programming Enablers. Software: Practice and Experience 40, 12 (2010), 1135–1160. DOI:
Sergei Gorlatch. 1999. Extracting and Implementing List Homomorphisms in Parallel Program Development. Science of Computer Programming 33, 1 (1999), 1–27. DOI:
Sergei Gorlatch and Christian Lengauer. 1997. (De) Composition Rules for Parallel Scan and Reduction. In Proceedings of the 3rd Working Conference on Massively Parallel Programming Models (Cat. No.97TB100228). 23–32. DOI:
Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software 34, 3, Article 12 (May 2008), 25 pages. DOI:
Tobias Grosser, Sven Verdoolaege, and Albert Cohen. 2015. Polyhedral AST Generation Is More Than Scanning Polyhedra. ACM Transactions on Programming Languages and Systems 37, 4, Article 12 (Jul. 2015), 50 pages. DOI:
John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. 2001. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software 27, 4 (Dec. 2001), 422–455. DOI:
Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodik, and Vinod Grover. 2020a. Fireiron: A Data-Movement-Aware Scheduling Language for GPUs. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT ’20). ACM, New York, NY, 71–82. DOI:
Bastian Hagedorn, Bin Fan, Hanfeng Chen, Cris Cecka, Michael Garland, and Vinod Grover. 2023. Graphene: An IR for Optimized Tensor Computations on GPUs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’23), Vol. 3. ACM, New York, NY, 302–313. DOI:
Bastian Hagedorn, Johannes Lenfers, Thomas Koehler, Xueying Qin, Sergei Gorlatch, and Michel Steuwer. 2020b. Achieving High-Performance the Functional Way: A Functional Pearl on Expressing High-Performance Optimizations as Rewrite Strategies. Proceedings of the ACM on Programming Languages 4, ICFP, Article 92 (Aug. 2020), 29 pages. DOI:
Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach. 2018. High Performance Stencil Code Generation with Lift. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO ’18). ACM, New York, NY, 100–112. DOI:
Mary Hall. 2020. Research Challenges in Compiler Technology for Sparse Tensors. In Proceedings of the IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3 ’20). viii–viii. DOI:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385. Retrieved from http://arxiv.org/abs/1512.03385
Troels Henriksen, Sune Hellfritzsch, Ponnuswamy Sadayappan, and Cosmin Oancea. 2020. Compiling Generalized Histograms for GPU. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14. DOI:
Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: Purely Functional GPU-Programming with Nested Parallelism and in-Place Array Updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’17). ACM, New York, NY, 556–571. DOI:
Troels Henriksen, Frederik Thorøe, Martin Elsman, and Cosmin Oancea. 2019. Incremental Flattening for Nested Data Parallelism. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP ’19). ACM, New York, NY, 53–67. DOI:
Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. 2018. Matrix Capsules with EM Routing. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=HJWLfGWRb
Torsten Hoefler and Roberto Belli. 2015. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses When Reporting Performance Results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’15). ACM, New York, NY, Article 73, 12 pages. DOI:
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861. Retrieved from http://arxiv.org/abs/1704.04861
Cristina Hristea, Daniel Lenoski, and John Keen. 1997. Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC ’97). ACM, New York, NY, 1–12. DOI:
Wayne Kelly and William Pugh. 1998. A Framework for Unifying Reordering Transformations. Technical Report UMIACS-TR-92-126.1. Digital Repository at the University of Maryland.
Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, and Jacqueline Chame. 2013. A Script-Based Autotuning Compiler System to Generate High-Performance CUDA Code. ACM Transactions on Architecture and Code Optimization 9, 4, Article 31 (Jan. 2013), 25 pages. DOI:
Jinsung Kim, Aravind Sukumaran-Rajam, Vineeth Thumma, Sriram Krishnamoorthy, Ajay Panyala, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2019. A Code Generator for High-Performance Tensor Contractions on GPUs. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’19). 85–95. DOI:
Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The Tensor Algebra Compiler. Proceedings of the ACM on Programming Languages 1, OOPSLA, Article 77 (Oct. 2017), 29 pages. DOI:
Michael Klemm, Alejandro Duran, Xinmin Tian, Hideki Saito, Diego Caballero, and Xavier Martorell. 2012. Extending OpenMP* with Vector Constructs for Modern Multicore SIMD Architectures. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World (IWOMP ’12). Springer-Verlag, Berlin, 59–72. DOI:
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems. Fernando Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger (Eds.), Vol. 25. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
Junjie Lai and André Seznec. 2013. Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’13). 1–10. DOI:
Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’IV). ACM, New York, NY, 63–74. DOI:
Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO ’04). 75–86. DOI:
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’21). 2–14. DOI:
Roland Leißa, Klaas Boesche, Sebastian Hack, Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, André Müller, and Bertil Schmidt. 2018. AnyDSL: A Partial Evaluation Framework for Programming High-Performance Libraries. Proceedings of the ACM on Programming Languages 2, OOPSLA, Article 119 (Oct. 2018), 30 pages. DOI:
Ao Li, Bojian Zheng, Gennady Pekhimenko, and Fan Long. 2022. Automatic Horizontal Fusion for GPU Kernels. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’22). 14–27. DOI:
Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, and P. Sadayappan. 2021. Analytical Characterization and Design Space Exploration for Optimization of CNNs. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21). ACM, New York, NY, 928–942. DOI:
Xiaqing Li, Guangyan Zhang, H. Howie Huang, Zhufan Wang, and Weimin Zheng. 2016. Performance Analysis of GPU-Based Convolutional Neural Networks. In Proceedings of the 45th International Conference on Parallel Processing (ICPP ’16). 67–76. DOI:
Amanda Liu, Gilbert Louis Bernstein, Adam Chlipala, and Jonathan Ragan-Kelley. 2022. Verified Tensor-Program Optimization via High-Level Scheduling Rewrites. Proceedings of the ACM on Programming Languages 6, POPL, Article 55 (Jan. 2022), 28 pages. DOI:
Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. NVIDIA Tensor Core Programmability, Performance & Precision. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW ’18). 522–531. DOI:
Kathryn S. McKinley, Steve Carr, and Chau-Wen Tseng. 1996. Improving Data Locality with Loop Transformations. ACM Transactions on Programming Languages and Systems 18, 4 (Jul. 1996), 424–453. DOI:
Xinxin Mei, Kaiyong Zhao, Chengjian Liu, and Xiaowen Chu. 2014. Benchmarking the Memory Hierarchy of Modern GPUs. In Network and Parallel Computing. Ching-Hsien Hsu, Xuanhua Shi, and Valentina Salapura (Eds.), Springer, Berlin, 144–156.
Benoît Meister, Eric Papenhausen, and Benoît Pradelle. 2019. Polyhedral Tensor Schedulers. In Proceedings of the International Conference on High Performance Computing & Simulation (HPCS ’19). 504–512. DOI:
Stefan K. Muller and Jan Hoffmann. 2021. Modeling and Analyzing Evaluation Cost of CUDA Kernels. Proceedings of the ACM on Programming Languages 5, POPL, Article 25 (Jan. 2021), 31 pages. DOI:
Geraldo F. Oliveira, Juan Gómez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, and Onur Mutlu. 2021. DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access 9 (2021), 134457–134502. DOI:
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems. Hanna Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Roman Garnett (Eds.), Vol. 32. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
Adam Paszke, Daniel D. Johnson, David Duvenaud, Dimitrios Vytiniotis, Alexey Radul, Matthew J. Johnson, Jonathan Ragan-Kelley, and Dougal Maclaurin. 2021. Getting to the Point: Index Sets and Parallelism-Preserving Autodiff for Pointful Array Programming. Proceedings of the ACM on Programming Languages 5, ICFP, Article 88 (Aug. 2021), 29 pages. DOI:
Simon J. Pennycook, Jason D. Sewall, and Victor W. W. Lee. 2019. Implications of a Metric for Performance Portability. Future Generation Computer Systems 92 (2019), 947–958. DOI:
Phitchaya Mangpo Phothilimthana, Archibald Samuel Elliott, An Wang, Abhinav Jangda, Bastian Hagedorn, Henrik Barthels, Samuel J. Kaufman, Vinod Grover, Emina Torlak, and Rastislav Bodik. 2019. Swizzle Inventor: Data Movement Synthesis for GPU Kernels. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). ACM, New York, NY, 65–78. DOI:
Federico Pizzuti, Michel Steuwer, and Christophe Dubach. 2020. Generating Fast Sparse Matrix Vector Multiplication from a High Level Generic Functional IR. In Proceedings of the 29th International Conference on Compiler Construction (CC ’20). ACM, New York, NY, 85–95. DOI:
Markus Püschel, José M. F. Moura, Jeremy R. Johnson, David Padua, Manuela M. Veloso, Bryan W. Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nicholas Rizzolo. 2005. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE 93, 2 (2005), 232–275. DOI:
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13). ACM, New York, NY, 519–530. DOI:
Ari Rasch, Julian Bigge, Martin Wrodarczyk, Richard Schulze, and Sergei Gorlatch. 2020a. dOCAL: High-Level Distributed Programming with OpenCL and CUDA. The Journal of Supercomputing 76, 7 (2020), 5117–5138. DOI:
Ari Rasch and Sergei Gorlatch. 2016. Multi-Dimensional Homomorphisms and Their Implementation in OpenCL. In Proceedings of the International Workshop on High-Level Parallel Programming and Applications (HLPP). 101–119.
Ari Rasch, Richard Schulze, and Sergei Gorlatch. 2019a. Generating Portable High-Performance Code via Multi-Dimensional Homomorphisms. In Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT ’19). 354–369. DOI:
Ari Rasch, Richard Schulze, and Sergei Gorlatch. 2020b. md_poly: A Performance-Portable Polyhedral Compiler Based on Multi-Dimensional Homomorphisms. In Proceedings of the International Workshop on Polyhedral Compilation Techniques (IMPACT 20). 1–4.
Ari Rasch, Richard Schulze, and Sergei Gorlatch. 2020c. md_poly: A Performance-Portable Polyhedral Compiler based on Multi-Dimensional Homomorphisms. In Proceedings of the ACM SRC Grand Finals Candidates, 2019–2020. 1–5.
Ari Rasch, Richard Schulze, Waldemar Gorus, Jan Hiller, Sebastian Bartholomäus, and Sergei Gorlatch. 2019b. High-Performance Probabilistic Record Linkage via Multi-Dimensional Homomorphisms. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC ’19). ACM, New York, NY, 526–533. DOI:
Ari Rasch, Richard Schulze, Denys Shabalin, Anne Elster, Sergei Gorlatch, and Mary Hall. 2023. (De/Re)-Compositions Expressed Systematically via MDH-Based Schedules. In Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (CC ’23). ACM, New York, NY, 61–72. DOI:
Ari Rasch, Richard Schulze, Michel Steuwer, and Sergei Gorlatch. 2021. Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF). ACM Transactions on Architecture and Code Optimization 18, 1, Article 1 (Jan. 2021), 26 pages. DOI:
Ari Rasch, Martin Wrodarczyk, Richard Schulze, and Sergei Gorlatch. 2018. OCAL: An Abstraction for Host-Code Programming with OpenCL and CUDA. In Proceedings of the IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS ’18). 408–416. DOI:
Chandan Reddy, Michael Kruse, and Albert Cohen. 2016. Reduction Drawing: Language Constructs and Polyhedral Compilation for Reductions on GPU. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT ’16). ACM, New York, NY, 87–97. DOI:
Toomas Remmelg, Thibaut Lutz, Michel Steuwer, and Christophe Dubach. 2016. Performance Portable GPU Code Generation for Matrix Multiplication. In Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit (GPGPU ’16). ACM, New York, NY, 22–31. DOI:
Bruce Sagan. 2001. The Symmetric Group: Representations, Combinatorial Algorithms, and Symmetric Functions, Vol. 203. Springer Science & Business Media.
Caio Salvador Rohwedder, Nathan Henderson, João P. L. De Carvalho, Yufei Chen, and José Nelson Amaral. 2023. To Pack or Not to Pack: A Generalized Packing Analysis and Transformation. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO ’23). ACM, New York, NY, 14–27. DOI:
Paul Springer and Paolo Bientinesi. 2016. Design of a High-Performance GEMM-Like Tensor-Tensor Multiplication. Retrieved from http://arxiv.org/abs/1607.00145
Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach. 2015. Generating Performance Portable Code Using Rewrite Rules: From High-Level Functional Expressions to High-Performance OpenCL Code. In Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming (ICFP ’15). ACM, New York, NY, 205–217. DOI:
Michel Steuwer, Philipp Kegel, and Sergei Gorlatch. 2011. SkelCL - A Portable Skeleton Library for High-Level GPU Programming. In Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 1176–1182. DOI:
Michel Steuwer, Thomas Koehler, Bastian Köpcke, and Federico Pizzuti. 2022. RISE & Shine: Language-Oriented Compiler Design. arXiv:2201.03611. Retrieved from https://arxiv.org/abs/2201.03611
Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2017. LIFT: A Functional Data-Parallel IR For High-Performance GPU Code Generation. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’17). 74–85. DOI:
Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David R. Kaeli. 2019. Summarizing CPU and GPU Design Trends with Product Data. arXiv:1911.11313. Retrieved from http://arxiv.org/abs/1911.11313
Adilla Susungi, Norman A. Rink, Albert Cohen, Jeronimo Castrillon, and Claude Tadonki. 2020. Meta-Programming for Cross-Domain Tensor Optimizations. ACM SIGPLAN Notices 53, 9 (Apr. 2020), 79–92. DOI:
Joel Svensson, Mary Sheeran, and Koen Claessen. 2011. Obsidian: A Domain Specific Embedded Language for Parallel Programming of Graphics Processors. In Implementation and Application of Functional Languages. Sven-Bodo Scholz and Olaf Chitil (Eds.), Springer, Berlin, 156–173.
Philippe Tillet and David Cox. 2017. Input-Aware Auto-Tuning of Compute-Bound HPC Kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’17). ACM, New York, NY, Article 43, 12 pages. DOI:
Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL ’19). ACM, New York, NY, 10–19. DOI:
Uday Bondhugula. 2022. Pluto: An Automatic Polyhedral Parallelizer and Locality Optimizer. Retrieved from https://github.com/bondhugula/pluto (commit 12e075a, October 31, 2021).
Nicolas Vasilache, Oleksandr Zinenko, Aart J. C. Bik, Mahesh Ravishankar, Thomas Raoux, Alexander Belyaev, Matthias Springer, Tobias Gysi, Diego Caballero, Stephan Herhut, Stella Laurenzo, and Albert Cohen. 2022. Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction. arXiv:2202.03293.
Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary Devito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2019. The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically. ACM Transactions on Architecture and Code Optimization 16, 4, Article 38 (Oct. 2019), 26 pages. DOI:
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral Parallel Code Generation for CUDA. ACM Transactions on Architecture and Code Optimization 9, 4, Article 54 (Jan. 2013), 23 pages. DOI:
Sven Verdoolaege and Tobias Grosser. 2012. Polyhedral Extraction Tool. In Proceedings of the International Workshop on Polyhedral Compilation Techniques (IMPACT ’12), Vol. 141.
Mohamed Wahib and Naoya Maruyama. 2014. Scalable Kernel Fusion for Memory-Bound GPU Applications. In SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 191–202. DOI:
Bram Wasti, José Pablo Cambronero, Benoit Steiner, Hugh Leather, and Aleksandar Zlateski. 2022. LoopStack: A Lightweight Tensor Algebra Compiler Stack. arXiv:2205.00618v1. Retrieved from https://doi.org/10.48550/ARXIV.2205.00618
R. Clinton Whaley and Jack J. Dongarra. 1998. Automatically Tuned Linear Algebra Software. In SC ’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. 38–38. DOI:
Michael E. Wolf and Monica S. Lam. 1991. A Loop Transformation Theory and an Algorithm to Maximize Parallelism. IEEE Transactions on Parallel and Distributed Systems 2, 4 (1991), 452–471. DOI:
Rohan Yadav, Alex Aiken, and Fredrik Kjolstad. 2022. DISTAL: The Distributed Tensor Algebra Compiler. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI ’22). ACM, New York, NY, 286–300. https://doi.org/10.1145/3519939.3523437
Cambridge Yang, Eric Atkinson, and Michael Carbin. 2021. Simplifying Dependent Reductions in the Polyhedral Model. Proceedings of the ACM on Programming Languages 5, POPL, Article 20 (Jan. 2021), 33 pages. DOI:
Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020a. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20). USENIX Association, 863–879. Retrieved from https://www.usenix.org/conference/osdi20/presentation/zheng
Size Zheng, Renze Chen, Anjiang Wei, Yicheng Jin, Qin Han, Liqiang Lu, Bingyang Wu, Xiuhong Li, Shengen Yan, and Yun Liang. 2022. AMOS: Enabling Automatic Mapping for Tensor Computations on Spatial Accelerators with Hardware Abstraction. In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA ’22). ACM, New York, NY, 874–887. DOI:
Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020b. FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 859–873. DOI: