1 Introduction
The execution time of modern scientific and engineering applications is dominated by the manipulation of large sparse matrices, i.e., matrices whose number of non-zero entries
NNZ is much smaller than the total number of entries
\(M \times N\) . Matrices derived from computational physics problems are often sparse and regular because they arise from a relatively small number of differential terms. Algebraic graph problems produce matrices which are still sparse but usually less regular [
16]. Similarly, the weight and activation matrices used by
Convolutional Neural Networks (
CNN) may hold millions of terms, but most of them are explicitly set to zero (e.g.,
rectified) without loss of information [
30]. It was recognized early in the 1960s [
69] that working on these sparse datasets using dense array representations was inefficient. The seminal work of OSKI [
83] in 2003 formalized the sparse representation alternatives and identified many optimizations, paving the way for specialized software libraries for sparse algebra [
84]. However, the single-core performance of software implementations is often disappointing, frequently reaching only 10% of peak performance [
30].
The main matrix multiplication kernels that are of interest are
matrix-vector (
MV) and
matrix-matrix (
MM), which include SpMV and SpMSpV (
\(A \times b = c\) ), SpGEMV (
\(A \times b + d = c\) ), SpMM and SpMSpM (
\(A \times B = C\) ), and SpGEMM (
\(A \times B + D = C\) ), with
Sp for
sparse and
GE for
general. SpMV and its variant SpGEMV are by far the dominant operations of modern algebraic kernels (linear solvers, eigensolvers), which are explicitly based on them [
25,
34,
39,
44,
49,
66,
87]. Since these kernels are ubiquitous, their efficient implementation is essential. In contrast, the SpMSpM multiplication is less frequent in standard algebraic kernels, but it is extensively used in algebraic graph manipulation [
5,
14,
17,
23,
33,
46,
67], as well as in multigrid solvers [
4,
13,
53,
68]. The growing importance of large graphs has motivated new research on optimizing SpMSpM applications.
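To fix notation, the following sketch expresses these kernels with SciPy's sparse module; this is merely an illustrative software baseline (the library choice, matrix sizes, and densities are our own assumptions, not tied to any accelerator discussed below).

```python
# Illustrative only: the sparse kernels above expressed with SciPy.
import numpy as np
import scipy.sparse as sp

M = N = 1000
A = sp.random(M, N, density=0.01, format="csr", random_state=0)     # sparse matrix A
B = sp.random(N, N, density=0.01, format="csr", random_state=1)     # sparse matrix B
D = sp.random(M, N, density=0.01, format="csr", random_state=2)     # sparse matrix D
b = np.random.default_rng(3).standard_normal(N)                     # dense vector b
b_sp = sp.random(N, 1, density=0.01, format="csr", random_state=4)  # sparse vector
d = np.random.default_rng(5).standard_normal(M)                     # dense vector d

c = A @ b            # SpMV:    A x b = c
c_sp = A @ b_sp      # SpMSpV:  sparse-vector variant
c2 = A @ b + d       # SpGEMV:  A x b + d = c
C = A @ B            # SpMSpM:  A x B = C (SpMM if B were dense)
C2 = A @ B + D       # SpGEMM:  A x B + D = C
```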
The two matrix operations, SpMV and SpMSpM, involve three largely independent computational aspects: (1) access to sparse input data, (2)
reduction operations on the intermediate result, and (3) formatting and writing back the result to memory. The different algorithms, which are detailed in Section
2, are built on different articulations of these three aspects. The sparse data storage scheme used in memory dictates the first aspect and is briefly explored in Section
2.1. The third aspect, i.e., result storage, is sometimes overlooked when the output is dense and not too large. However, on a conventional CPU, which is memory-bound, it becomes crucial to streamline the write accesses in order to optimize cache usage when writing a large dense vector output. The second aspect, i.e., the
reduction operations, calls for different solutions according to the execution platform. On a conventional CPU, the bottleneck is memory throughput. Thus, most software implementations of SpMV rely on
a priori reorganization of the input matrices (row ordering, blocking, and loop optimization) to improve the data locality of operations. The perspective for SpMV is different on GPU platforms: the challenge shifts to finding a balanced assignment of the numerous threads and warps, and to the use of local memory resources [
57]. SpMSpM is far more complex: the intermediate
reductions introduce new
NNZ values at random locations, which trigger costly memory allocations and list management operations [
31]. As a consequence, this phase dominates the execution time and becomes the main memory bottleneck on a regular CPU. The issue is similar on GPUs because of the limited size of the scratchpad memory for processing large matrices, which requires complex memory management and load balancing [
22,
62,
88].
The main motivation to use custom hardware accelerators is to make better use of the intermediate memory hierarchy, ensuring the input data is available near the cores and also off-loading the accumulation and
reduction operations from the processors. The focus of this article is to review these architectures, to compare them, and to analyze their performance. Our main focus is on accelerators that eventually target custom or ASIC implementations, although we do touch on the contributions from accelerators that were prototyped on FPGAs. The reader is referred to [
58] for a survey of acceleration techniques for CPUs, GPUs, and FPGAs. Section
2 reviews the storage schemes and the three basic algorithms with a focus on the data access patterns and their impact on the memory hierarchy. In Section
3, we study how the different algorithms stress the memory systems and propose an analytical model based on these insights. In Section
4, we present the operation of several state-of-the-art hardware accelerators, in the light of these hardware challenges. In Section
5, we systematically compare these accelerators. Throughout this study, we try to be consistent, and we hope that our nomenclature will be useful for architects and designers who intend to implement dedicated hardware for problems with sparse datasets.
3 Hardware Challenges for Sparse Matrices and Vectors
In the previous section, we presented a brief background on matrix multiplication without considering the underlying hardware. Although the various sparse formats reduce the memory footprint, they require indirect accesses, which create data dependencies and reduce the effectiveness of caches. As we have seen, with all three algorithms, at least some of the accesses are non-sequential. These accesses lack spatial locality: a full cache line is loaded, but only a few entries are read or modified. Moreover, the matrices of interest are extremely large, and even the non-zero elements and their associated metadata generally cannot fit in the cache. Depending on the algorithm, successive accesses to the same element may be too far apart to benefit from temporal locality.
In an IP algorithm, the A-metadata needs to be read to indirectly access the B-matrix (or b-vector), which may itself be stored in a sparse format. In this case, the algorithm needs to check if the non-zero A-indices are present in B (or b). If an index is not present in both lists, there is no need to perform the multiplication (ineffectual operation). More generally, this problem applies to fibers and is called the intersection search. As a minimum, this requires reading all the metadata for A and B (or b), and identifying the indices present in both.
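As a point of reference, the intersection search reduces, in software, to a merge-style co-iteration of two sorted index lists; the sketch below (our own simplification, with hypothetical names) returns only the coordinates that contribute effectual multiplications.

```python
def intersect(coords_a, coords_b):
    """Merge-style intersection of two sorted coordinate (index) lists.

    Returns the indices present in both fibers; only these positions
    contribute effectual multiplications in an inner-product traversal.
    """
    i = j = 0
    matches = []
    while i < len(coords_a) and j < len(coords_b):
        if coords_a[i] == coords_b[j]:
            matches.append(coords_a[i])
            i += 1
            j += 1
        elif coords_a[i] < coords_b[j]:
            i += 1
        else:
            j += 1
    return matches

# Example: column indices of an A-row against the row indices of a sparse b-vector.
print(intersect([0, 3, 7, 9], [1, 3, 4, 9]))  # -> [3, 9]
```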
In an
OP algorithm, the inputs are accessed sequentially, but the updates of the output require indirect accesses. The pattern of the accesses, while the output is generated, is non-sequential and determined by the structure of the input. Furthermore, if these
random3 updates provoke cache misses where only one word within a cache-line is modified, the bandwidth to external memory is poorly utilized. The
OP algorithm generates several
POs that must be accumulated to obtain the final output. We can identify two strategies:
immediate update and
deferred update. In the former, the final output is immediately updated as each
partial sum is generated. If the output is stored using a
sparse format, either the index being updated does not yet exist and the new
partial sum must be inserted, or, if it exists, a lookup must be performed to locate it, the update applied, and the result written back. The problem associated with combining the
POs is called
reduction.
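A minimal software sketch of the immediate-update strategy, assuming the sparse output is represented by a hash map (a simplification of the lookup/insert structure an implementation would actually use):

```python
def outer_product_immediate(A_cols, B_rows):
    """Outer-product multiply with immediate updates into a sparse accumulator.

    A_cols[k] and B_rows[k] are dicts {index: value} holding the non-zeros of
    the k-th A-column and k-th B-row.  Every partial sum is reduced into the
    output as soon as it is produced: an existing (i, j) entry is looked up and
    accumulated, a new one is inserted.
    """
    C = {}  # sparse output, keyed by (row, col)
    for k in A_cols:
        for i, a in A_cols[k].items():
            for j, b in B_rows.get(k, {}).items():
                key = (i, j)
                C[key] = C.get(key, 0.0) + a * b  # lookup-or-insert, then update
    return C
```

Every multiplication triggers a lookup in C, which is exactly the indirect-access cost that the deferred strategy below tries to avoid.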
To avoid the complexities of the indirect accesses with
immediate updates, the
partial sums can be stored and the accumulation can be deferred. One such technique is called
output batching [
48]. With this technique, tuples containing the index and the
partial sum are streamed sequentially to memory. Then, during a second pass, they are read back and the accumulations are performed. This approach results in sequential memory accesses during the first pass, at the cost of an increased volume of data transferred to memory. In Table
1, we summarize the previously discussed advantages and hardware challenges for each algorithm, including
Gust which presents a tradeoff between
IP and
OP.
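For contrast, a sketch of the deferred-update (output batching) strategy under the same assumptions as above: the first pass only appends tuples, and all reductions happen in a second pass.

```python
def outer_product_batched(A_cols, B_rows):
    """Deferred update ("output batching"): pass 1 streams (index, value) tuples
    sequentially; pass 2 reads them back and performs the reductions."""
    # Pass 1: generate partial outputs as a flat, sequentially written list.
    batch = []
    for k in A_cols:
        for i, a in A_cols[k].items():
            for j, b in B_rows.get(k, {}).items():
                batch.append(((i, j), a * b))   # sequential writes, no lookups

    # Pass 2: read the batch back and accumulate duplicates.
    C = {}
    for key, p in batch:
        C[key] = C.get(key, 0.0) + p
    return C
```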
The issues presented with IP and OP can be generalized to two concepts: gather and scatter. The gather problem exists when the inputs are indirectly accessed, and need to be read from memory using metadata to present the processor with the correct terms for computing the partial sums. Typically, with IP, the gather problem is more difficult. The scatter problem exists when the order in which the output matrix/vector is generated is not sequential and is typically associated with the OP algorithm. In the case of tiling, a hardware accelerator must handle both gather and scatter, creating a need for hardware support for both intersection searches and reductions.
3.1 SpMSpM Algorithm Analysis
To better understand how the algorithms impact the number and types of memory accesses, we developed a model based on memory access counts, presented in Figure
4. We identify five memory access patterns: (1) those where a matrix is read only once, thus there is no opportunity for reuse, (2) where the accesses are sequential, but a line is read multiple times, (3) where lines are read consecutively, but the accesses within a line are
random, (4) where the lines are accessed
randomly, but within a line the accesses are sequential, and (5) where the matrix elements are accessed multiple times, and with a
random access pattern.
For each of the three algorithms, we counted the number of read and write accesses, based on the density of the input matrices. We plotted the number of memory accesses on a logarithmic scale for SpMSpM, based on a square matrix with \(M = 10,\!000\) . The number of non-zero elements in C depends on the structure of the input matrices; in this model, we have assumed that the non-zero entries in the input matrices are distributed uniformly at random. This corresponds to a worst case for sparse matrices (no spatial and temporal locality). Banded input matrices would, for example, produce fewer non-zero elements in C. In each graph, we have separated the accesses into A (bottom), B (middle), and C (top). The color code shows the access pattern, based on the five categories described above.
In the top-row of Figure
4 (graphs ➀, ➁, and ➂), we start by considering a naïve accelerator which has no local storage, thus all data comes from main memory. From these figures, we clearly see that
IP requires more accesses, as the
A-matrix must be read N times (graph ➀), whereas for
OP and
Gust, the
A-matrix is read only once (thus the grey color). For
OP and
Gust, the total number of accesses is the same; however, with
OP (graph ➁), the accesses to
C are
random (red). With
Gust (graph ➂), the accesses to
B are sequential within the rows (orange) and the
C-rows are generated sequentially (yellow), thus
Gust avoids having any fully
random accesses (red).
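The following sketch is our simplified reconstruction of the access counts behind these top-row graphs, assuming uniformly distributed non-zeros and counting one main-memory access per element touched; the exact bookkeeping in Figure 4 may differ.

```python
def access_counts(M, density):
    """Rough main-memory access counts for a square M x M SpMSpM with no local
    storage and uniformly distributed non-zeros.  A simplified reconstruction
    of the top-row graphs of Figure 4, not the paper's exact model."""
    nnz = density * M * M              # non-zeros per input matrix
    nnz_per_fiber = density * M        # expected non-zeros per row or column
    mults = nnz * nnz_per_fiber        # effectual multiplications
    nnz_c = M * M * (1.0 - (1.0 - density * density) ** M)  # expected NNZ(C)

    ip = {"A_reads": M * nnz,          # A is re-read for every output column
          "B_reads": M * nnz,          # B is re-read for every output row
          "C_writes": nnz_c}           # each output element written once
    op = {"A_reads": nnz,              # A streamed once
          "B_reads": mults,            # one B element read per multiplication
          "C_updates": 2 * mults}      # random read-modify-write per partial sum
    gust = {"A_reads": nnz,            # A streamed once
            "B_reads": mults,          # B-rows re-read, sequential within a row
            "C_updates": 2 * mults}    # row-local read-modify-write per partial sum
    return ip, op, gust

ip, op, gust = access_counts(M=10_000, density=1e-3)
```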
Since main memory accesses are the principal bottleneck, accelerators integrate local storage (such as buffers or caches). Thus, we consider the case of an accelerator with sufficient local memory to store one full row per matrix. We assume this local storage is perfect (no misses) and re-calculate the number of remaining main memory accesses, as shown in the middle-row of Figure
4 (graphs ➃, ➄, and ➅). The matrices are assumed to be initially stored in main memory, thus they must be accessed at least once. As can be seen, for
IP (graph ➃), the number of
A-accesses is reduced, because rows are reused, thus the total number of
A-accesses becomes equal to that in
OP and
Gust. For
OP (graph ➄), the single row buffer greatly reduces the number of
B-accesses, however, it does nothing for
C (despite the appearance, due to the logarithmic scale). For
Gust (graph ➅), the single row buffer is sufficient to cache the
reduction operations in
C, and thus, it now has fewer overall accesses than
OP. It is also interesting to note that
OP and
Gust scale better than
IP with lower densities, as seen by the shallower slope.
Finally, we have considered the extreme case where an accelerator has a buffer of sufficient size to store two matrices, which is shown in the bottom-row of Figure
4 (graphs ➆, ➇, and ➈). Although this case is not practical for large matrices, it can be achieved at the level of a tile through the proper use of tiling to locally reduce the dimensions of a sub-matrix. At this point, the total number of external accesses is equal for the three algorithms, as each matrix is accessed only once from main memory. For
IP (graph ➆), it is interesting to note that such a large buffer is needed to locally address the
intersection search in
B. Similarly for
OP (graph ➇), a large storage is required to address the
reductions in
C. For
Gust, in fact, it is not necessary to buffer the entire
B-matrix. Indeed, buffering a few rows (corresponding to
\(NNZ_{r}(A)\) ) is sufficient to achieve the number of accesses shown in graph ➈. This corresponds to an intermediate design point (multiple row buffers for
B) not explicitly shown in the figure (between graphs ➅ and ➈).
When viewed overall, Figure
4 highlights the benefits of
Gust. We clearly see that
Gust does not require any fully
random accesses (red). With a single row buffer, it is possible to address the
reductions in
C. And, with a limited number of row buffers, the number of external memory accesses for
B can be reduced. Based on this analysis, it is clear that
Gust is the best algorithm for hardware implementation of SpMSpM.
4 Hardware Accelerators for Sparse Matrices and Vectors
In this section, we describe recent (2018–2023) hardware accelerators for sparse MM and MV targeting ASIC-type implementations. We try to be consistent in notation and architecture schemes to simplify the comparison. Before going into the detailed overviews of the accelerators, in Table
2 we present a criteria-based classification of the designs.
A first criterion to classify the accelerators is between
standalone accelerators that address the full kernel without the assist of a processor (e.g., OuterSPACE [
64]—Section
4.1) and ones that address only a part of the kernel and need a processor to manage the computation (e.g., PHI [
60]—Section
4.9).
Standalone accelerators are directly connected to the main memory and internally manage their own custom memories. The other accelerators actually intervene at different levels of the memory hierarchy, giving a second classification criterion. Certain accelerators intervene near the processors (e.g., ASA [
89]—Section
4.8), near the cache (e.g., PHI and SortCache [
77]—Sections
4.9 and
4.7), or near the main memory (e.g., SPiDRE [
10] and SpaceA [
86]—Section
4.9).
Each accelerator is optimized for a specific algorithm (
IP,
OP, or
Gust), our third criterion. For example, some accelerators (e.g., PHI and SortCache) address the
reduction problem and are adapted to
OP algorithms but
reductions are also required with
Gust and highly-
tiled algorithms. ExTensor [
40] (Section
4.2) is a general purpose accelerator that is demonstrated for a highly-
tiled OP but also makes major contributions for
IP.
From a memory perspective, the two main problems are
gather and
scatter, our fourth criterion. Certain accelerators offer solutions for one, the other, or both. For example, some accelerators (e.g., SpArch [
93]—Section
4.3) are specialized for an
OP algorithm but optimize both
gather and
scatter. Finally, each accelerator preferentially targets sparse MM, MV, or both—our final criterion.
After this first comparison, in Sections
4.1 to
4.6, we present the
standalone accelerators, followed by the
processor assist accelerators in Sections
4.7 to
4.8. In Section
4.9, we briefly touch on other accelerators.
4.1 OuterSPACE
OuterSPACE [
64] is an
OP accelerator with a focus on SpMSpM kernels, although it can also handle SpMV. It is a highly parallelized architecture which employs
Single-Program Multiple-Data (
SPMD)-style
processing elements (
PEs) that fetch data from main memory through a two-level highly-banked cache hierarchy as shown in Figure
5. The input
A- and
B-matrices are stored in
(UOP, CP) column- and
row-major formats, respectively, to match the read access patterns of the
OP algorithm. The
partial and final outputs are stored in
(UOP, CP) format.
The computation is temporally divided into the
multiply and
merge phases (Figure
6). During the first phase, inputs are divided into
tiles of several rows (
A) and columns (
B) and distributed through the
processing tiles (
PTs) by a
control unit. Within each
PT, each
PE processes multiplications of one element of the
\(k^{th}\) A-column with all elements of the
\(k^{th}\) B-row, and an L0 cache in the
PT facilitates the reuse of the
B-rows between
PEs. This phase generates a
PO per
PT, corresponding to the
tile of rows (
A) and columns (
B) that it was assigned.
Then, during the merge phase, the POs are combined to generate the final C-matrix. Each PT locally sorts the POs, and then the PTs work together to merge them. This algorithm was chosen to maximize data locality and parallelism, although it is less efficient. The merge phase uses only half of the PEs, so the others are disabled (clock-gated) and the caches are reconfigured into a scratchpad mode to save energy.
Note that if the NNZ elements in the POs exceed the L0 cache capacity, they overflow to the last-level cache, which reduces performance. OuterSPACE performs best when \(NNZ_c(A)\) and \(NNZ_r(B)\) are both uniform, which ensures a uniform distribution of work across the large number of PTs and PEs.
4.2 ExTensor
The authors created ExTensor [
40] as a general purpose sparse tensor algebra accelerator (including SpMSpM). It is an
OP accelerator but due to its support for three levels of fixed
tiling of the inputs, it optimizes the
intersection search between non-zero elements or
tiles.4 By doing the
intersections ahead of time, at different levels of the memory hierarchy (which correspond to the
tile levels), ExTensor is able to efficiently eliminate
ineffectual operations, thus addressing the main issue in
IP or complex
tiling algorithms. Indeed, the input matrices would typically be stored in a
(UOP, UOP, UOP, CP) format, corresponding to three levels of
tiling. ExTensor is designed with a flexible architecture composed of different bricks (see Figure
7). To aid understanding, the interactions between the bricks are described using function calls:
—
The Scanner is a storage unit that delivers a stream of coordinates in increasing order, fetching them from memory. The coordinates are delivered through the Iterate() operation call.
—
The Intersect is a hardware block that co-iterates over several Scanners in parallel and compares the coordinate streams. This unit performs the intersection search required by IP type algorithms, delivering a single stream with the matching coordinates, which are the coordinates of non-zero partial sums. To optimize the intersection step, it is possible to skip a range of coordinates in a stream depending on the current coordinate of another stream. Thus, the Intersect sends skip messages with SkipTo() operations to the other Scanners.
—
The
Coordinator is a hardware unit that includes
Scanners and
Intersect units, and manages the input and output data depending on the current processed
tile, as shown in Figure
7. Data and
metadata are, respectively, stored in storage units and
Scanners, while a
Sequencer manages the
Iterate() calls so that the
Intersect brick delivers the stream of effective coordinates. In addition, in the
Tile Coord brick, the offsets of the
tile are added back, so the output of the
Coordinator is in absolute coordinates.
These bricks can be placed at different positions in a memory hierarchy, depending on the tiling and selected hardware. Thus, the Coordinator is able to intersect at coarse-granularity with output data being metadata of the next tile, skipping an entire tile when appropriate, with low computation cost, and also at fine-granularity when output data are the non-zero elements that need to be multiplied.
ExTensor is designed with three levels of tiling that each have either input or output sequentiality. First, a Sequencer plus Scanners block ➀ is placed close to the main memory to determine which tiles to send to the next storage, the Last Level Buffer (LLB). Then, for each tile, a Coordinator ➁ in the LLB intersects the next level tiles, which are sent to several PEs that will process in parallel. Finally, in each PE, a Coordinator ➂ intersects the non-zero elements of the computed tile and sends them to the arithmetic unit of the PE for multiplication.
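As a software analogue of the Scanner/Intersect interaction, the sketch below co-iterates two coordinate streams and issues skip requests; the Iterate() and SkipTo() names follow the paper, while the class structure and example data are ours.

```python
class Scanner:
    """Delivers a stream of coordinates in increasing order (Iterate) and can
    jump forward past uninteresting coordinates (SkipTo).  A software stand-in
    for the ExTensor storage brick."""
    def __init__(self, coords):
        self.coords = coords
        self.pos = 0

    def iterate(self):
        return self.coords[self.pos] if self.pos < len(self.coords) else None

    def advance(self):
        self.pos += 1

    def skip_to(self, coord):
        # Skip every coordinate strictly smaller than `coord`.
        while self.pos < len(self.coords) and self.coords[self.pos] < coord:
            self.pos += 1


def intersect(scan_a, scan_b):
    """Co-iterate two Scanners, emitting only matching coordinates and sending
    SkipTo() messages so the lagging stream jumps ahead."""
    out = []
    while True:
        a, b = scan_a.iterate(), scan_b.iterate()
        if a is None or b is None:
            return out
        if a == b:
            out.append(a)
            scan_a.advance()
            scan_b.advance()
        elif a < b:
            scan_a.skip_to(b)   # skip the range [a, b) in stream A
        else:
            scan_b.skip_to(a)

print(intersect(Scanner([0, 2, 5, 8, 13]), Scanner([2, 3, 8, 21])))  # -> [2, 8]
```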
In addition to these bricks, which are the main contributions, ExTensor still needs to perform the merge phase, including the reduction of the POs, which are stored in CP \(^{2}\) format. The Partial Output Buffer (POB) is managed by tiles, and the content of a tile is stored on-chip using a Content Addressable Memory (CAM) for quick access, enabling reductions to be performed. On overflow, the tiles are flushed to main memory. To benefit from ExTensor’s support for tiling, the input matrices need preprocessing, making it difficult to directly compare its performance with other accelerators.
4.3 SpArch
SpArch [
93] uses an
OP algorithm for SpGEMM in order to benefit from the simplified access pattern for the inputs. To address the management of the large volume of
POs, SpArch pipelines the
multiply and
merge phases and uses condensing methods to reduce the number of
POs and scheduling methods to reduce memory traffic. Unlike other
OP accelerators [
64], with SpArch, both
A and
B are stored in
(UOP, CP) row-major format. As a consequence of this
CSR format, the
A-rows can easily be condensed to the left (see Figure
8). Thus, when a condensed
A-column is read, it will require reading all the
B-rows corresponding to the different
A-columns that have been combined. The
MatA Column Fetcher reads the combined
A-columns and the
MatB Row Prefetcher reads the necessary
B-rows (see Figure
9), thus masking the memory latency. The
MatB Row Prefetcher stores multiple
B-rows in a cache, and it uses information about the upcoming
A-columns (provided by the
Distance List Builder) to ensure the needed
B-rows are kept in the cache.
A key contribution of SpArch is the merge unit that is able to merge 64
POs in parallel, through a binary tree type architecture. The core of the merger is a brick which compares and merges 4 × 4
\((index, value)\) pairs in one cycle. These are combined to build a hierarchical merger array whose output is then written to main memory, and read back for further merging by the
Partial Matrix Fetcher. The authors note that the order in which the
reductions are performed is important: specifically, sparser
POs should be merged first, and they propose a Huffman tree scheduler to determine the order that minimizes memory accesses. Overall, SpArch is able to compute an
OP SpGEMM
5 kernel with reduced main memory traffic for the
POs mainly by addressing the
scatter problem with a highly parallelized merger.
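The hardware merger operates on many streams in parallel; as a purely sequential stand-in, the merge-and-reduce of sorted partial-output streams can be sketched as follows (the stream representation is our assumption).

```python
import heapq

def merge_partial_outputs(po_streams):
    """Merge several partial-output streams, each sorted by index, into one
    stream, accumulating values that share the same index.  A sequential
    sketch of what SpArch's hierarchical merger array does in parallel."""
    merged = heapq.merge(*po_streams)          # streams of (index, value) pairs
    out = []
    for index, value in merged:
        if out and out[-1][0] == index:
            out[-1][1] += value                # reduction of duplicate indices
        else:
            out.append([index, value])
    return out

# Two partial outputs of the same C-row, sorted by column index.
print(merge_partial_outputs([[(0, 1.0), (4, 2.0)], [(4, 0.5), (7, 3.0)]]))
# -> [[0, 1.0], [4, 2.5], [7, 3.0]]
```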
4.4 MatRaptor
MatRaptor [
79] is an accelerator designed to compute SpMSpM kernels. It is based on
the Gust algorithm,
6 which makes it possible to process multiple
A-rows (2 to 8 for MatRaptor) in parallel. In MatRaptor (see Figure
10) a
SpAL (
Sparse Matrix A Loader) fetches the non-zero elements of an
A-row and sends them to a
SpBL (
Sparse Matrix B Loader). With the
metadata of this element, the
SpBL fetches non-zero elements of the corresponding
B-rows and sends pairs of
A- and
B-elements to a
PE. This
PE computes the
partial sums and sends the results into queues for the
merge process. Instead of separating
multiply and
merge phases, several queues are used to store
partial sums and partially merged
partial sums, allowing the two phases to be pipelined. The
PE merges all the
partial sums resulting from an entire
A-row, before sending the final output (the corresponding
C-row) to main memory.
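The row-wise dataflow of the SpAL/SpBL/PE pipeline corresponds, in software, to the classic Gustavson row computation sketched below; the dict-based accumulator is our stand-in for MatRaptor's sorted queues.

```python
def gustavson_row(a_row, B_rows):
    """Compute one C-row in the row-wise (Gustavson) manner: for each non-zero
    A[i][k], scale B-row k and merge the partial sums into the output row."""
    c_row = {}                                        # accumulator for one output row
    for k, a_val in a_row.items():                    # SpAL: non-zeros of the A-row
        for j, b_val in B_rows.get(k, {}).items():    # SpBL: matching B-row
            c_row[j] = c_row.get(j, 0.0) + a_val * b_val  # PE: multiply + merge
    return dict(sorted(c_row.items()))                # final C-row, ordered by column

# A-row {k: value} and B stored row-wise as {k: {j: value}}.
print(gustavson_row({0: 2.0, 3: 1.0}, {0: {1: 4.0, 5: 1.0}, 3: {1: 3.0}}))
# -> {1: 11.0, 5: 2.0}
```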
In MatRaptor, multiple
A-rows are processed independently and in parallel in different
PEs and each
PE is assigned to a channel in an HBM memory [
45]. It is important that each
PE can read its
A-rows without interfering with the other
PEs. With the classic
(UOP, CP) row-major format, the rows are consecutive in memory and are not aligned to channel boundaries. To solve this problem, the authors propose a new format called
C \(^2\) SR (
(FL, UOP, CP) channel-major) that is based on a
(UOP, CP) row-major. For example, with two
PEs, all the elements of the even rows are stored in one channel, and the elements of the odd rows are in another channel, thus isolating the data. With this approach, the row pointers (
UOP) are no longer monotonic, which means that the length of a row can not be deduced by subtracting adjacent row pointers. This requires the addition of a new table (which we call
fiber length FL), which stores the number of non-zero elements in each row (see Figure
11). This modified storage format enables multiple
PEs to make effective use of the HBM.
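A rough sketch of how such a channel-major layout could be built in software, based on our reading of the format description (round-robin row-to-channel assignment; field names are ours):

```python
def build_c2sr(rows, num_channels):
    """Sketch of a C^2SR-like ((FL, UOP, CP) channel-major) layout: row i is
    assigned to channel i % num_channels, its non-zeros are appended to that
    channel's column-index/value arrays, and because row pointers now index
    into different channels, a separate fiber-length (FL) table records the
    number of non-zeros per row."""
    chan_cols = [[] for _ in range(num_channels)]   # CP: column indices, per channel
    chan_vals = [[] for _ in range(num_channels)]   # values, per channel
    row_ptr, row_len, row_chan = [], [], []         # UOP, FL, owning channel
    for i, row in enumerate(rows):                  # row = list of (col, val)
        ch = i % num_channels
        row_chan.append(ch)
        row_ptr.append(len(chan_cols[ch]))          # offset inside channel ch
        row_len.append(len(row))                    # FL entry: length of the row
        for col, val in row:
            chan_cols[ch].append(col)
            chan_vals[ch].append(val)
    return chan_cols, chan_vals, row_ptr, row_len, row_chan
```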
MatRaptor is one of the first accelerators to use the Gust algorithm; however, it has some limitations regarding effective reuse of B-rows, both between parallel SpBLs and temporally, as A is traversed. A key contribution is the proposed C \(^2\) SR input format, which separates rows into memory channels to improve performance.
4.5 InnerSP
InnerSP [
7] is a SpMSpM accelerator based on the
Gust algorithm.
7 The authors of InnerSP justify their choice by showing quantitatively that with the
OP algorithm, the
POs represent a large volume of data, and that for many matrices the
reductions can not be done fully on-chip and thus must be buffered in main memory. Normally, with
the Gust algorithm,
B must be read repeatedly, but InnerSP proposes a special cache for
B which exploits the spatial locality.
The architecture of InnerSP is shown in Figure
12. The
A reader reads the
A-rows. Two separate caches are used for
B, one that stores the row pointers and the other which stores column indices and values. The
B reader reads the required
B-rows, hopefully getting the data from the caches. Parallel multipliers perform the multiplications and then a hash-unit generates a hash based on the indices in
C. A
reduction unit based on hash tables contains parallel comparators that search for matching hash values and perform the
reductions. When all the operations for an
A-row are completed, the
C writer writes the
C-row back to main memory.
The architecture is efficient, as long as the Hash Reduction Unit does not overflow. When overflows occur, the extra entries are temporarily stored in main memory, but there is a significant performance hit. To minimize the chance of overflows, InnerSP uses a pre-scanner to estimate the number of non-zero entries in the upcoming C-rows. The estimate primarily considers the NNZ in the A-rows. If the estimated number of non-zero entries in C exceeds the capacity of the Hash Reduction Unit, then the row is split and processed in two passes. Conversely, if the number of non-zero entries in C is small, two A-rows can be processed simultaneously.
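A simplified version of this pre-scanning decision can be sketched as follows; the upper bound on the C-row size and the greedy splitting rule are our simplifications of the paper's estimator.

```python
def plan_a_row(a_col_indices, b_row_nnz, hash_capacity):
    """Pre-scanner heuristic in the spirit of InnerSP: upper-bound the number
    of non-zeros of the upcoming C-row by summing the NNZ of the B-rows
    selected by the A-row, then decide whether the row fits the on-chip hash
    reduction unit or must be split into several passes."""
    estimate = sum(b_row_nnz[k] for k in a_col_indices)  # worst case: no collisions
    if estimate <= hash_capacity:
        return [a_col_indices]                 # single pass
    # Split the A-row greedily so that each chunk stays within capacity.
    passes, current, current_est = [], [], 0
    for k in a_col_indices:
        if current and current_est + b_row_nnz[k] > hash_capacity:
            passes.append(current)
            current, current_est = [], 0
        current.append(k)
        current_est += b_row_nnz[k]
    passes.append(current)
    return passes
```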
To improve the performance of the
B-cache, the authors exploit a technique from [
8]. The non-zero elements in the
A-rows provide direct information about which
B-rows are required. The column index table of
A is already pre-fetched by the
A reader, and when a row must be evicted from the
B-cache, it chooses the ones which will not be required based on the upcoming entries in the column index table of
A. This technique (also used by SpArch [
93]) maximizes the reuse of
B-rows.
InnerSP is an accelerator based on the Gust algorithm that achieves high reuse of the B-rows through a look-ahead in the A-metadata.
4.6 Gamma
Gamma [
90] is a dedicated accelerator for SpMSpM kernels, which uses
the Gust algorithm. The two input matrices are both stored in
(UOP, CP) row-major format and the output is generated in the same format, providing consistency.
The Gamma architecture, presented in Figure
13, is composed of a fetcher for the
A-rows that contains a
Distributor to distribute them to 32
PEs. Each
PE processes an entire
A-row and computes the corresponding
C-row. The
PE fetches the required
B-rows and merges and sorts them in a 64-wide
merge block, before multiplying with the
A-elements. As a result, the output
C-row is generated in order. The
reduction step is, thus, highly simplified, because any duplicate indices are adjacent. Normally, the
C-row can be written back to main memory. However, if the
A-row contains more than 64 non-zero elements, then it must be processed with multiple passes. After each pass, the partial
C-row is stored temporarily. Finally, the temporary
C-rows are read-back and merged by the
PEs. Of course, the
B-rows must be accessed repeatedly by the different
PEs, and Gamma proposes
FiberCache which is a highly banked cache which fetches the
B-rows in advance, based on the column index table of
A. It keeps track of which rows of
B are most needed by the
PEs, and when an eviction is necessary, the victim is the row that is least needed.
Gamma encounters some difficulties when the A-matrix has rows that are too dense, requiring multiple iterations through the PEs, which creates a high load on the cache. The authors propose a software preprocessing technique which splits dense A-rows and rearranges them to make best use of the FiberCache.
4.7 SortCache
The authors of SortCache [
77] note that modern processors are able to load an entire cache line in parallel but in many sparse applications less than 50% of the data within a line is used, even with hand-tuned code. They propose a light SpGEMM accelerator which offloads the
reduction step, based on the
Vectorized Binary Search Tree (
VBST), a data-structure from their earlier work [
76,
78]. The VBST is a standard binary search tree where the size of a node is a multiple of the cache block size and each node contains a sorted vector of
K \((key, value)\) pairs. Associated with each node is a
pivot; the pivots, with left and right child pointers, form a binary tree, as illustrated in Figure
14.
Using the VBST, SortCache focuses on accelerating the scatter updates to the output matrix, which are performed directly in the cache. The authors augment the processor with a single new instruction, RED Key, Value. After K updates have been issued, the hardware launches a reduction in the SortCache, where the K \((key, value)\) pairs are inserted into the VBST. This is done by traversing the tree from the root to the leaves. When duplicate keys are found, the reduction operation is performed in the cache. When new keys are found, they are inserted, which can result in a temporary new node with up to 2K \((key, value)\) pairs. This temporary node is partitioned, with one half having exactly K entries. The insertion procedure may require visiting each level of the tree at least once ( \(\mathcal{O}(\log n)\) ), and it does not ensure the uniqueness of the keys, so an additional full traversal of the tree is required at the end to remove duplicates.
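To illustrate the per-node behaviour, the sketch below merges a batch of K pending updates into one node's sorted vector, reducing duplicate keys and splitting on overflow; tree traversal, pivots, and the final de-duplication pass are omitted, and the node size K is arbitrary.

```python
K = 4  # (key, value) pairs per node; in hardware, a multiple of the cache block size

def reduce_into_node(node, updates):
    """Merge a batch of pending (key, value) updates into a node's sorted
    vector, reducing duplicate keys in place; if the result exceeds K pairs,
    partition it so that one part keeps exactly K entries (the remainder would
    become a new child in the full VBST)."""
    combined = {}
    for key, value in node + updates:
        combined[key] = combined.get(key, 0.0) + value   # in-cache reduction
    merged = sorted(combined.items())
    if len(merged) <= K:
        return merged, None
    return merged[:K], merged[K:]                        # split: node + overflow

node = [(2, 1.0), (5, 2.0), (9, 1.0)]
print(reduce_into_node(node, [(5, 3.0), (7, 1.0), (11, 2.0), (2, 1.0)]))
# -> ([(2, 2.0), (5, 5.0), (7, 1.0), (9, 1.0)], [(11, 2.0)])
```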
The nodes closest to the root of the tree are accessed most frequently and are thus mapped to the caches closest to the processor. The authors propose to re-purpose cache tracking hardware (e.g., TAG values, LRU counters) to store the additional metadata (pivots, pointers). One of the difficulties with SortCache is that, in some cases, a large fraction of the VBST nodes are underutilized.
4.8 ASA
In the scope of graph analytic problems, the authors of ASA [
89] observed that
the Gust algorithm achieves high performance for SpGEMM kernels but note that it still requires acceleration for the
reduction step.
8 The software state-of-the-art for
reduction uses hash tables [
15]. Some hardware accelerators, such as HTA [
92], enhance performance for hash operations, but those solutions are not optimized for SpGEMM kernels. ASA is a low area, near-core extension for a general purpose processor that specifically accelerates the
reduction step for
the Gust algorithm.
ASA accelerates
reduction operations received from software via custom instructions. As presented in Figure
15, the
partial sums from these instructions are stored in a waiting buffer as
\((index, value)\) pairs with a key that is a hash of the index, computed in software to maintain flexibility. The cache line, accessed with the key, is filled with the indices and the
partial sums. If a corresponding
partial sum is already stored, ASA accumulates it and writes back the result. If not, the
partial sum is simply stored into the cache.
To minimize die area, the cache is not dimensioned for an entire
C-row
9 but ASA handles cache overflows in hardware. When a
partial sum needs to be stored in a fully filled cache line, a victim is selected and evicted, based on an LRU policy. The evicted entry is written into a pre-allocated FIFO queue in the memory hierarchy. ASA uses an address generator to manage evicted data and updates head and tail registers for the list. Software then traverses the list to merge duplicate keys and build the final output
C-row. To avoid overflows, the authors advocate
tiling methods to efficiently distribute work through one or several ASA cores.
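The accumulate-or-evict behaviour can be sketched as follows; a single fully associative structure stands in for ASA's set-associative cache, and the hash computation is left out for brevity.

```python
from collections import OrderedDict

class ASACache:
    """Sketch of ASA-style reduction: partial sums arrive as (index, value)
    pairs; matching entries are accumulated in a small cache, and on conflict
    the LRU victim is evicted to an in-memory overflow list that software
    merges later."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()      # index -> partial sum, kept in LRU order
        self.overflow = []              # evicted (index, value) pairs (FIFO in memory)

    def reduce(self, index, value):
        if index in self.lines:
            self.lines[index] += value              # hit: accumulate in place
            self.lines.move_to_end(index)
        else:
            if len(self.lines) >= self.capacity:    # full: evict the LRU entry
                self.overflow.append(self.lines.popitem(last=False))
            self.lines[index] = value               # miss: simply store

    def drain(self):
        """Software pass: merge the overflow list with the cached entries to
        build the final C-row."""
        c_row = dict(self.lines)
        for index, value in self.overflow:
            c_row[index] = c_row.get(index, 0.0) + value
        return dict(sorted(c_row.items()))
```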
ASA is an accelerator which optimizes the scatter aspect of the SpGEMM kernel with low area overhead, while retaining high software flexibility. Unlike other Gust accelerators, there is no hardware gather support for reusing or caching the B-rows.
4.9 Other Accelerators
In the above sections, we have described a selection of the main sparse accelerators that have appeared in the literature, covering different matrix multiplication algorithms and architectures. This list is not exhaustive, and in the following paragraphs, we present several other accelerators which we will cover in less detail, due to space constraints.
4.9.1 SPiDRE.
The authors of SPiDRE (Sparse Data Rearrange Engine) [
10] address the poor utilization of the memory hierarchy in sparse applications by proposing a near-memory data rearrangement engine. This engine can preprocess data, near memory, to create new data structures that are dense and benefit from the memory hierarchy. For example, with a SpMV kernel, the input vector could be rearranged through a user-defined function into a dense format where all non-zero elements appear in the order required by the computation.
4.9.2 PHI.
The authors of PHI [
60] note that many large graph algorithms favor pull implementations (each node reads inputs from its neighbors) over push implementations (each node propagates results to its neighbors), as push implementations require read-modify-writes which access complete cache lines, even though only a small unit of data is modified. The authors include
OP SpMV (a push algorithm) in their evaluation.
PHI is a modified cache optimized to efficiently process commutative scatter updates. When an update is received but the line is not in the cache, the line is not fetched but rather marked as being in update mode, where it simply stores the update. If subsequent updates hit the line, they are stored, or combined via the commutative operator. When a cache line is evicted, if it contains multiple updates, PHI assumes that the coalescing has been effective, and it processes the operations via a read-modify-write from main memory. If a cache line contains few updates when evicted, PHI batches these updates as \((address, value)\) tuples, allocating a new cache line to assemble the batch. Groups of updates are organized into bins corresponding to small memory regions. The batches are streamed to main memory. Later, reading the batches back is also efficient, as they are contiguous.
In multi-core systems, private caches act as local buffers, thus coalescing the updates reduces coherency protocol traffic. This combination of in-cache coalescing and update buffering enables PHI to take advantage of locality when possible (like Coup [
91]), but with the addition of
update batching, PHI improves memory utilization, even when the data being updated is too large to fit in the cache.
4.9.3 GSCSp.
GSCSp [
55] is an FPGA-based accelerator based on
the Gust algorithm. In other
Gust accelerators, a
PE processes a full
A-row but the authors note that when there is variability in the number of elements in the
A-rows,
PEs may be blocked, waiting for another
PE to complete. Thus the authors propose to distribute the elements in an
A-row to different
PEs (
element-wise parallelism). Initially, each
PE buffers its output
C-row, and when a
PE advances to the next row, it pushes its partial
C-row to a final merger. The authors also propose to use separate caches for the
metadata for the
B-rows and the values, and report that combined with the
element-wise parallelism, the approach is effective for reducing the memory traffic for accessing
B-rows.
4.9.4 Flexagon.
Flexagon [
59] is a SpMSpM accelerator for DNN applications. The authors claim that the optimal algorithm (
IP,
OP, or
Gust) depends on the matrix structure, potentially varying between DNN models and layers within a DNN, although in the majority of the cases, their data shows that
Gust is the most efficient. Flexagon is configurable: it can implement all three algorithms. To achieve that, the authors designed a
Merger-Reduction Network (
MRN) that is able to perform multiplications (for
partial sum generation), accumulations (for
reduction operations), and comparisons (for
intersection searches and
merge phases) via a binary tree structure. Flexagon addresses the memory access issues of each algorithm via three custom memories: a FIFO for the
A-matrix that is always read sequentially, a cache for the
B-matrix that is subject to high temporal and spatial reuse in each of the three algorithms, and a PSRAM to store
POs for
OP and
Gust algorithms.
4.9.5 Other Accelerators.
In GraphR [
75], the authors propose to use ReRAM technology and a cross-bar to perform graph operations directly in hardware, but this approach is not currently scalable to large problems. The ITS accelerator [
70] focuses on an
OP implementation of SpMV and applies data compression to the
metadata. The authors of SPU [
20] propose a more general hardware solution to input data dependencies and focus on graph applications. The focus of Alrescha [
3] is an innovative and adaptive preprocessing of the matrix in hardware to simplify the accesses to the
b-vector. The authors of CoSparse [
29] focus on a software layer that selects the hardware acceleration that is best adapted to the given problem.
The SSSR accelerator [
73] for RISC-V processors goes further by addressing
intersection and
union while streaming data to and from memory, and it can be used for multiple kernels; however, the RISC-V performance cannot easily be compared with the other accelerators presented in this article. In passing, we note three other accelerators, EIE [38], SCNN [65], and CANDLES [36], which have hardware support for sparsity but are highly focused on CNNs, as well as SDMA [32], which focuses on GNNs.
The SpaceA [
86] accelerator exploits near-memory computing inside a 3D DRAM memory to implement SpMV using an
IP algorithm. With preprocessing, the rows of the matrix are assigned to banks within specific DRAM dies, while optimizing spatial locality. A separate die is used for storing the input and output vectors. Associated with each DRAM bank which stores the matrix, there is a
PE which reads the non-zero values of the matrix, fetches the required vector entries, and performs the computations. Bringing the compute near the DRAM and exploiting emerging 3D technology is an orthogonal approach to reducing memory access overheads.
4.9.6 Other FPGA-based Accelerators.
Many sparse accelerators have been prototyped on FPGAs [
58]. Indeed, AMD/Xilinx provides a turnkey library for SpMV [
19,
56]. In [
24], the authors propose a higher-performance SpMV using a modified
CSR format where the data is packed for efficient, streaming access in HBM.
10 They also propose a highly-banked on-chip memory for the vector updates, and a hazard resolution strategy that allows efficient pipelining.
5 Comparison of Existing Accelerators
In this section, we analyze and further compare the accelerators that were described in Section
4. We start with Table
3 where we summarize the accelerators, chronologically, showing the research group where they were developed. We see that MIT is very active in the field, as it was involved in developing four different accelerators (ExTensor, PHI, SpArch, and Gamma).
5.1 Benchmarks, Evaluation, and Performance
Up to this point, we have discussed the architecture of the various accelerators, without reporting the performance they achieve. Most of the matrices used for benchmarking come from the SuiteSparse Matrix Collection [
50] that contains a large number of matrices from diverse, real applications. Domain specific matrix libraries are also used [
6,
27,
52,
74], as well as some synthetic matrix generators [
2,
11]. The densities are in general less than 1% and can be as low as
\(10^{-5}\) %. ExTensor, SortCache, and Flexagon also benchmark with denser matrices. In all cases, large matrices with several million
NNZ, exceeding the size of a normal cache, have been evaluated.
In Table
4, we note any required preprocessing, assuming, arbitrarily, that matrices are initially stored in
CSC format.
11 We see that a few accelerators require a conversion to their specific format (e.g., ExTensor and MatRaptor). PHI, SortCache, and ASA are not affected by preprocessing since they only address
scatter. In any case, fair benchmarking must take into account the cost of specialized preprocessing.
Not all accelerators have reached the same level of design maturity, and in Table
4, we have shown how each design was analyzed or simulated. All designs report that they have been simulated at a cycle-accurate level. Most of the accelerators use custom simulators or tools like
gem5 [
26],
ZSim [
71], and STONNE [
61], but unfortunately, there is no common approach, making it difficult to fairly compare the reported performance. RTL descriptions are often only generated for key parts of the design. To estimate power and area, authors used tools such as CACTI [
9] and McPAT [
54].
12 In the last column of the table, we show the size of the local memory storage used in simulation by each accelerator. Except for SPiDRE and SpaceA which work directly in main memory, the local memories are in the range of 0.3 MiB to 38 MiB, corresponding to 27 k to 2.8 M entries. Thus, in the model presented in Section
3.1, most of the accelerators are situated in the middle-row of Figure
4, with a local memory corresponding to a few matrix rows.
The accelerators’ performances are reported as a speedup versus a baseline, which may differ according to their hardware and software platforms: typically the Intel MKL library [
43] and the Armadillo library [
72] for CPUs, or cuSPARSE [
63] and CUSP [
21] for GPUs. Since different baselines with different numbers of threads and cores are used, these speedups need to be interpreted with care. The speedups are in general higher for the
standalone accelerators as they provide optimized hardware for the entire kernel, while
processor assist accelerators have less impressive speedups but aim to be general purpose and have lower area overhead.
The only accelerators that have a comparable baseline are the
standalone ones addressing SpMSpM. They all compare themselves to OuterSPACE with the same benchmark matrices, so we can sort them by their speedup: Gamma (
\(6.6\times\) ),
13 InnerSP (
\(4.57\times\) ), SpArch (
\(4\times\) ), MatRaptor (
\(1.8\times\) ), and OuterSPACE (
\(1\times\) ). Three of the five
standalone accelerators use the
Gust algorithm (Gamma, InnerSP, and MatRaptor). SpArch, an
OP accelerator employing a complex architecture, outperforms MatRaptor, but does not achieve the same performance as InnerSP or Gamma, which suggests that
Gust is best suited to high-performance hardware accelerators.
These measurements are consistent with our model presented in Section
3.1 (Figure
4). OuterSPACE, the baseline, corresponds to graph ➄. MatRaptor, the first
Gust accelerator, was able to achieve higher performance by reducing the number of
random accesses, and creating better regularity (red to yellow), as seen in graph ➅.
SpArch, is an
OP accelerator with a much larger local memory, which is used to address the red region in graph ➄. Their
A-matrix condensing method, allows them to reduce the number of
POs, and approach the performance achievable in graph ➇, although,
ultimately, it is impossible to store all the
POs in local memory. InnerSP and Gamma correspond to optimizations of the
Gust approach in graph ➅, to approach the performance in graph ➈. This is achieved by custom caches for the
B-matrix to address the orange region. With the
Gust algorithm, it is only necessary to store a limited number of
B-rows, which is achievable with the large local memories in these accelerators. The model presented in Section
3.1 is helpful in understanding the local memory requirements for SpMSpM accelerators.
5.2 Analysis for Designers
There is no single “best overall” accelerator. A designer must consider the required performance, the available area, and the class of problems to be addressed. In Figure
16, we present a flow diagram which can guide a designer towards an architecture based on their requirements. Starting from the top-right of the figure, the designer must first decide between a
specialized accelerator for a specific kernel or a
general purpose accelerator. The former is preferable if performance must be maximized, requiring a
standalone accelerator addressing both
gather and
scatter (e.g., MatRaptor, SpArch, Gamma, and InnerSP).
A general purpose accelerator can either take the form of a highly configurable standalone block (e.g., OuterSPACE and ExTensor) or a processor assist that boosts performance for a specific task. With a processor assist (rightmost branch in the figure), the processor still performs important parts of the kernel, typically the multiplications, but off-loads the scatter or gather tasks to the accelerator, as these are the tasks which are the bottleneck with a traditional cache hierarchy. For example, for an IP algorithm, the difficult part of the gather task is performing the intersection (e.g., the ExTensor bricks and SPiDRE). Conversely, with OP or Gust, the challenge with the scatter is to efficiently perform reductions (e.g., PHI, SortCache, and ASA). Finally, the designer must decide whether the storage is done by reusing existing memory resources (e.g., PHI, SPiDRE, and SortCache) or through custom caches/memories.
Going back to the case of a standalone accelerator, the algorithms differ according to the supported kernels. For sparse MV, the accelerators studied here have opted for an OP algorithm. For sparse MM, the designers of three of the five highest performing accelerators have chosen the Gust algorithm, as it provides a good tradeoff: the gather is simplified by reusing the input B-rows, and the row-by-row generation of the output simplifies the scatter aspect. The fact that the outputs are generated row-by-row also facilitates parallel implementations. Gust works well with (UOP, CP), providing consistent input and output formats. Beyond the basic algorithm, tiling approaches, as proposed by ExTensor, can be explored.
Orthogonal to previous choices, other design considerations are important. First, the starting point needs to be defined: some accelerators require software to perform tiling (e.g., ExTensor) or to rearrange the rows to improve performance (e.g., Gamma). The cost of such preprocessing can be high and is usually only justified if the same matrix is reused multiple times.
6 Conclusion
The size of the datasets considered in computational physics, machine learning, and graph analytics continues to grow, and over the last six years we have seen the emergence of many hardware accelerators designed to improve upon the performance of CPUs and GPUs. In this article, we have presented the most important designs and established a comparison, while highlighting the underlying hardware challenges associated with sparse multiplication kernels.
As we have seen, with this class of problem, the challenge is to efficiently use the memory system despite the dereferencing required to access the metadata associated with the sparse data formats. Depending on the selected MM algorithm, the major challenge is either with the gather and intersection or with the scatter updates. We have identified that accelerators based on Gustavson’s algorithm achieve a tradeoff, where, on the gather side, a limited number of B-rows need to be accessed, and on the output side, the result is generated row-by-row, simplifying the scatter updates. The best Gustavson accelerators have implemented further optimizations such as smart cache eviction strategies where the victim is selected by looking ahead at the A-indices. Several of the smaller processor assist accelerators reuse existing cache memories and simply modify the behavior of the control logic (directories, tags, eviction policies) to optimize the operation for sparse matrices. On the other hand, the highest performance accelerators propose dedicated, massively parallel reduction/mergers, to maximize parallelism.
Given the number and diversity of existing designs, it is clear that further hardware optimizations and improvements are possible. As stated earlier, there may never be a single “best” sparse accelerator but rather designs that are best suited to specific classes of problems. We believe that the structure of the matrices is critical in terms of the sizing of caches and other resources and that in the coming years, new accelerators will be increasingly tailored based on the structure of the matrices for specific classes of problems.