We define multi-GPU libraries as libraries that allow input data to reside on host memory, GPU memory, or a combination of both and internally manage data distribution and computation on multiple devices. Most existing multi-GPU BLAS libraries target Level-3 BLAS routines [3, 4, 5, 6, 7, 8, 9, 10], relieving the programmer from the complex optimizations required on multi-GPU systems, with the optimization of Level-1 and Level-2 BLAS still left to the programmer due to their usually smaller impact on total application performance. In this section, we present the performance bottlenecks of multi-GPU Level-3 BLAS, as well as the algorithms and optimizations used by current state-of-the-art libraries to alleviate them. We additionally discuss the limitations of current optimization approaches concerning the initial data distribution and the interconnect heterogeneity. Finally, we discuss the lack of consideration for resource utilization in existing multi-GPU BLAS libraries, and the relevant efficiency and heterogeneity challenges.
The key architectural features that influence performance, and thus library design, in multi-GPU setups are the distinct GPU memories and the increasingly complex underlying interconnect. Consequently, unlike single-GPU setups, where algorithmic optimizations mainly target the internals of the BLAS kernels, multi-GPU setups include multiple devices acting as parallel workers, introducing the notions of data decomposition, reuse, communication, scheduling, and load balancing. For simplicity, we categorize the optimization space for multi-GPU BLAS into a) data domain decomposition, i.e., splitting the initial problem into sub-problems/tasks (henceforth sub-kernels) and their distribution, b) communication overlap, avoidance, and routing, and c) load balancing between GPUs.
2.1 Level-3 BLAS Decomposition and Distribution
Level-3 BLAS routines operate on matrices, and typically the problem is decomposed into data-parallel tasks, following some matrix decomposition scheme. On multi-GPU setups, this is necessary to exploit the multiple devices as parallel workers, similarly to multi-core or multi-node execution [11]. Decomposition and distribution schemes are important, as they determine the required amount of communication between workers. Early multi-GPU BLAS libraries, such as the CUDA-based cuBLASXt [7] and its LAPACK-compatible wrapper NVBLAS [8], use a simple round-robin scheme to distribute 2D tile-based sub-kernels to devices, resulting in unnecessary communication. State-of-the-art libraries such as BLASX [9] and XKBLAS [10] borrow the 2D block-cyclic decomposition and distribution from distributed computing, which achieves a good homogeneous distribution of communication on a virtual 2D grid of workers.
In this work, we also employ the state-of-practice 2D block-cyclic decomposition/distribution. Figure 2 shows an example of the 2D block-cyclic distribution used for Level-3 BLAS matrix-matrix multiplication (GEMM). The available system devices are organized in a virtual 2D grid, which should be as square as possible - for example, a 2 × 2 grid for four devices, a 4 × 2 or 2 × 4 grid for eight devices, a 3 × 3 grid for nine devices, and so on.
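To make the mapping concrete, the sketch below illustrates the 2D block-cyclic rule for the output matrix: tile (i, j) is owned by the device at position (i mod grid_rows, j mod grid_cols) of the virtual device grid. This is a minimal illustration under our own naming assumptions (tile_owner, grid_rows, grid_cols, the example sizes), not the actual implementation of any of the libraries discussed.

```cpp
#include <cstdio>

// 2D block-cyclic assignment of T x T output tiles to a virtual
// grid_rows x grid_cols device grid (illustrative sketch only).
int tile_owner(int tile_row, int tile_col, int grid_rows, int grid_cols) {
    int dev_row = tile_row % grid_rows;    // row position in the virtual grid
    int dev_col = tile_col % grid_cols;    // column position in the virtual grid
    return dev_row * grid_cols + dev_col;  // flattened device id
}

// Example: an M x N output matrix C split into T x T tiles on a 2 x 2 grid
// of four GPUs; each tile is printed next to the GPU that owns it.
int main() {
    int M = 8192, N = 8192, T = 2048, grid_rows = 2, grid_cols = 2;
    for (int i = 0; i < (M + T - 1) / T; ++i)
        for (int j = 0; j < (N + T - 1) / T; ++j)
            std::printf("C tile (%d,%d) -> GPU %d\n", i, j,
                        tile_owner(i, j, grid_rows, grid_cols));
    return 0;
}
```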
2.2 Communication Optimization
The 2D block-cyclic decomposition in GEMM depicted in Figure 2 results in a favorable communication pattern for the read-only tiles of matrices A and B. Every GPU requires the same number of “row” tiles and “column” tiles from each of the input matrices. This is the basis for enabling communication optimization, as communication is the main bottleneck in multi-GPU BLAS performance on modern systems [9, 10, 12]. While different libraries have used different approaches for improving communication performance, the optimization targets can be roughly classified into communication overlap, avoidance, and routing.
Overlap: Overlap refers both to computation-communication overlap and to communication-communication overlap between different devices. Computation-communication overlap is a common technique in GPU offload, in both single- and multi-GPU setups, and it has been extensively explored in the past [12, 13, 14, 15]. CUDA versions also frequently increase the overlap potential by enabling additional GPU copy engines [16]. In this work, we also try to maximize overlap to improve the performance of Level-3 BLAS.
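As a minimal illustration of computation-communication overlap (a generic CUDA sketch under the assumptions stated in the comments, not the internal pipeline of any of the libraries above), the host can stage the next A tile on a copy stream while the current tile's GEMM runs on a compute stream:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Overlap host-to-device tile transfers with per-tile GEMM calls using two
// CUDA streams and double-buffered device storage. Sketch only: error checks,
// leading dimensions and the final d2h copy of C are omitted, and hA must be
// pinned (cudaHostAlloc/cudaHostRegister) for the copies to be asynchronous.
void overlapped_tiles(cublasHandle_t handle, const double* hA, double* dA[2],
                      const double* dB, double* dC, int T, int num_tiles) {
    cudaStream_t copy_s, comp_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&comp_s);
    cudaEvent_t ready[2], consumed[2];
    for (int b = 0; b < 2; ++b) {
        cudaEventCreate(&ready[b]);
        cudaEventCreate(&consumed[b]);
    }
    const double one = 1.0;

    for (int t = 0; t < num_tiles; ++t) {
        int buf = t % 2;
        // Before overwriting this buffer, wait until the GEMM that last
        // read it (two iterations ago) has finished.
        if (t >= 2) cudaStreamWaitEvent(copy_s, consumed[buf], 0);
        cudaMemcpyAsync(dA[buf], hA + (size_t)t * T * T, sizeof(double) * T * T,
                        cudaMemcpyHostToDevice, copy_s);
        cudaEventRecord(ready[buf], copy_s);
        // The compute stream waits only for this tile's transfer, so the
        // copy of tile t+1 can proceed while tile t is being multiplied.
        cudaStreamWaitEvent(comp_s, ready[buf], 0);
        cublasSetStream(handle, comp_s);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, T, T, T,
                    &one, dA[buf], T, dB, T, &one, dC, T);
        cudaEventRecord(consumed[buf], comp_s);
    }
    cudaStreamSynchronize(comp_s);
}
```

Communication-communication overlap follows the same principle: transfers involving different devices are issued on separate streams so that they can proceed concurrently over independent links.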
GPU Data caching: In a multi-GPU setup, the different GPUs have distinct memories. Because of the problem distribution, data needed for a computation often has to be transferred from one device to another. Although the effect of those transfers is mitigated by technologies like RMA, a common approach to reduce redundant communication is data caching/buffering in GPU memory, as it also enables data reuse between subsequent sub-kernels. For example, in Figure 2, if GPU 0 does not cache data tiles, it needs to fetch \(C_{00}\), \(A_{00}\), \(A_{01}\) and \(C_{01}\) twice, resulting in 12 tile fetches instead of 8 (a 50% increase in communication volume), and this overhead can worsen depending on the problem size and the tiling size \(T\).
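The caching idea amounts to a per-GPU lookup before any transfer is issued, sketched below with hypothetical names (TileCache, get_or_fetch); it is an illustrative simplification, not the buffering scheme of any specific library.

```cpp
#include <functional>
#include <map>
#include <utility>

// Minimal per-GPU software tile cache: before fetching tile (i, j), check
// whether a valid copy already resides in this GPU's buffer space, so that
// subsequent sub-kernels reuse it instead of re-transferring it.
struct TileCache {
    std::map<std::pair<int, int>, void*> resident;  // (tile_row, tile_col) -> device pointer

    void* get_or_fetch(int i, int j, const std::function<void*(int, int)>& fetch) {
        auto it = resident.find({i, j});
        if (it != resident.end())
            return it->second;    // hit: reuse the cached copy, no transfer issued
        void* buf = fetch(i, j);  // miss: transfer from the owner GPU or the CPU
        resident[{i, j}] = buf;
        return buf;
    }
};
```

With such a cache in place, GPU 0 in the example above would issue only 8 tile fetches, since the second use of each of \(C_{00}\), \(A_{00}\), \(A_{01}\) and \(C_{01}\) would hit in the cache.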
BLAS libraries initially designed for CPUs [3, 4, 5, 6] use simple buffers in GPU memories to support some data caching, and cuBLASXt [7] follows the same design logic. Unfortunately, simple buffering is only sufficient for internal GPU caching, which fares well on a single GPU. Its performance degrades rapidly as the number of GPUs increases, since it neglects the very fast connections between discrete GPU memories modern interconnects offer [9]. In this work, we employ a GPU data caching scheme to reduce unnecessary communication and maximize data reuse.
Communication routing: Routing in a multi-GPU setup refers to selecting the fastest routes for transfers between GPUs. In the context of Level-3 BLAS, the multiple GPUs require different data tiles, which need to be transferred, and peer-to-peer transfers between GPUs (henceforth d2d) usually have considerably higher bandwidth than CPU-GPU transfers (henceforth h2d/d2h). To enable routing optimizations for multi-GPU BLAS, two components are necessary: 1) a cache consistency-like logic for the GPU buffers, to ensure that data tiles are always up to date and their read/write dependencies are respected, and 2) a way to distinguish the interconnect bandwidth levels in order to select the ‘closest’ fetch location when a read-only (RONLY) tile is available in multiple buffers. Such a solution is implemented in BLASX [9], which provides a hardware abstraction of the underlying interconnect in a hierarchical representation, designed for communication optimization in multi-GPU systems. The hierarchical representation abstracts the system as a cache hierarchy, with the CPU RAM being the main memory and the distinct GPU memories being levels of caches. This representation favors data reuse (last-level cache ‘hits’) and the faster GPU-GPU transfers (lower-level cache ‘hits’) over CPU-GPU transfers (cache ‘misses’), while a MESI-like protocol enables sharing RONLY blocks between GPUs. Still, BLASX performance is hindered by continuously writing back to the CPU and re-fetching output (WR) blocks. Moreover, recent multi-GPU clusters from NVIDIA are connected with NVLink lanes, which may offer more than one distinguishable bandwidth ‘level’, compared to the simple distinction between ‘h2d’ and ‘d2d’ bandwidths. XKBLAS [10] overcomes the write-back bottleneck by performing lazy write-backs of output data. It additionally considers a heuristic ‘ranking’ of distinguishable bandwidth levels using information from the NVIDIA driver interface, favoring transfers over device connections with higher bandwidth. In our work, we also perform communication routing as a communication optimization, through a different hardware abstraction and caching scheme.
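The routing decision itself reduces to a bandwidth-aware source selection, sketched below under our own assumptions (a bandwidth table filled offline by micro-benchmarks or driver queries, and a ‘holders’ set maintained by the consistency logic); it is not the actual BLASX, XKBLAS, or PARALiA code.

```cpp
#include <vector>

// Pick the fetch source for a read-only (RONLY) tile: among all memories
// currently holding a valid copy (GPU buffers or the host), choose the one
// whose link to the requesting device has the highest bandwidth.
// bw[src][dst] holds measured bandwidths in GB/s; the host is the last id.
int pick_fetch_source(int dst, const std::vector<int>& holders,
                      const std::vector<std::vector<double>>& bw) {
    int best = holders.front();  // holders is assumed non-empty
    for (int src : holders)
        if (bw[src][dst] > bw[best][dst])
            best = src;
    return best;
}
```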
The data placement hazard: While all the aforementioned optimizations target the full-offload scenario, where all data are initially located on the CPU, multi-GPU BLAS libraries support input data residing on any GPU. On modern systems, CPU-GPU (h2d/d2h) transfers are inherently much slower than peer-to-peer GPU-GPU (d2d) transfers, therefore having part of the input data available on GPUs should lead to a less communication-bound problem.
However, as shown in Figure 1, the performance of BLASX and XKBLAS drops significantly in all but the full-offload scenario, and only PARALiA provides the expected increased performance. We demystify the cause of this counter-intuitive behavior in Figure 3, which shows the communication pattern, number of transfers, and average achieved bandwidth for BLASX, XKBLAS and our proposed runtime, PARALiA. BLASX suffers from excessive costly write-backs to the CPU due to its caching policy, although these h2d/d2h transfers could be completely avoided in the scenario where all data are initially on the GPUs [10]. Additionally, BLASX is bandwidth-agnostic, with transfers passing through a variety of bandwidth levels, since its hierarchical abstraction is only capable of recognizing the difference between h2d/d2h and d2d, and is not sufficient for modern interconnects. XKBLAS, on the other hand, avoids intermediate write-backs to the CPU due to its lazy write-back design and provides a much more balanced communication map in the full-offload scenario, with a clear preference for the higher bandwidths. This desirable behavior does not extend to the scenario where data initially populate GPUs 0, 1 and 2. On the contrary, the communication map becomes denser around these locations and creates a communication bottleneck, with many extremely low-bandwidth transfers. The cause of this meltdown lies at the core of current multi-GPU Level-3 BLAS optimization - the decomposition and distribution method. All usual distribution methods (round-robin, block-cyclic, 2D block-cyclic, etc.) target a homogeneous, balanced scenario, where data - and consequently the communication volume - is distributed between workers as equally as possible. In the full-offload scenario, all data must first be fetched from the host (\(id=8\) in the heatmaps), which in our testbed - and most modern HPC clusters - has the same bandwidth for all h2d/d2h connections and therefore favors a balanced distribution. On the other hand, in the second scenario, some GPUs are closer to the data (higher d2d bandwidth) and some are further from it (lower d2d bandwidth), which results in transfers passing through a variety of connections, some of which are very slow. In our work, PARALiA, we take data placement into consideration and optimize communication, under the assumption of heterogeneity in the underlying interconnect, through an effective abstraction of the hardware, modeling, and autotuning. As shown in Figure 3, our work achieves higher average bandwidth compared to the other techniques regardless of the initial data placement.
2.3 Load Balancing for Heterogeneity
State-of-the-art multi-GPU libraries attempt to mitigate the aforementioned imbalance between transfers, and to adapt to potentially heterogeneous computational capabilities, by using task load-balancing through work-stealing [9, 10]. However, although work-stealing can improve load balancing, it disrupts the temporal locality offered by a good initial task decomposition and distribution, such as the 2D block-cyclic decomposition. Moreover, although work-stealing may address the load imbalance caused by small performance differences between GPUs, it is not sufficient for heterogeneous workload distribution, as we show in Section 4.4. Finally, the work-stealing mechanisms integrated in existing approaches are performance-agnostic; they balance work between devices without taking into account the computational capabilities of the devices or the communication properties of the problem. This monolithic design cannot adapt resource allocation to the characteristics of the BLAS problem, resulting in inefficient resource allocation and usage, and is incompatible with future heterogeneous systems.
In this work, we argue that the typical homogeneous distribution coupled with work-stealing cannot simultaneously account for the interconnect layout, problem size, and data placement, and therefore cannot effectively handle the built-in heterogeneity of modern multi-GPU HPC systems for the case of BLAS. Instead, we propose a model-based approach that adjusts the decomposition, communication, and task allocation to the characteristics of 1) each system, through offline micro-benchmarks, and 2) each BLAS problem and data placement, at runtime.
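The intuition behind such a model-based allocation can be conveyed with a deliberately simplified sketch (our own rough proxy, not PARALiA's actual model; the per-device throughput and feed-bandwidth values are assumed to come from offline micro-benchmarks): instead of a uniform split, each device receives a share of the tile rows proportional to an effective rate that combines its compute capability with the bandwidth of the link that feeds it its data.

```cpp
#include <vector>

// Split 'total_rows' tile-rows of work across devices in proportion to a
// per-device effective rate derived from benchmarked GEMM throughput and
// the bandwidth of the link feeding each device. Purely illustrative: real
// models also account for overlap, data reuse and the full topology.
std::vector<int> heterogeneous_split(int total_rows,
                                     const std::vector<double>& gflops,   // offline compute benchmark
                                     const std::vector<double>& feed_bw)  // GB/s from the data source
{
    size_t n = gflops.size();
    std::vector<double> rate(n);
    double sum = 0.0;
    for (size_t d = 0; d < n; ++d) {
        // Harmonic combination as a rough proxy: a device is only as fast
        // as the slower of its compute and the communication feeding it.
        rate[d] = 1.0 / (1.0 / gflops[d] + 1.0 / feed_bw[d]);
        sum += rate[d];
    }
    std::vector<int> rows(n);
    int assigned = 0;
    for (size_t d = 0; d + 1 < n; ++d) {
        rows[d] = static_cast<int>(total_rows * rate[d] / sum);
        assigned += rows[d];
    }
    rows[n - 1] = total_rows - assigned;  // remainder goes to the last device
    return rows;
}
```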