
1 Introduction

The sparse matrix vector product (SpMV) is a heavily-used and performance-critical operation in many scientific and industrial applications such as fluid flow simulations, electrochemical analysis, or Google’s PageRank algorithm [11]. Operations involving sparse matrices are typically memory bound on virtually all modern processor technology. With an increasing number of high performance computing (HPC) systems featuring GPU accelerators, significant resources are spent on finding the best way to store a sparse matrix and on optimizing the SpMV kernel for different problems.

In this paper, we present and compare four SpMV strategies (COO, CSR, ELL, and HYB) and their realization on AMD and NVIDIA GPUs. We furthermore assess the performance of each format for 2,800 test matrices on high-end GPUs from AMD and NVIDIA. We also derive performance profiles to investigate how well the distinct kernels generalize. All considered SpMV kernels are integrated into the Ginkgo open-source library, a modern C++ library designed for the iterative solution of sparse linear systems, and we demonstrate that these kernels often outperform their counterparts available in the AMD hipSPARSE and the NVIDIA cuSPARSE vendor libraries.

Given the long list of efforts covering the design and evaluation of SpMV kernels on manycore processors (see [2, 7] for a recent and comprehensive overview of SpMV research), we highlight that this work contains the following novel contributions:

  • We develop new SpMV kernels for COO, CSR, ELL and HYB that are optimized for AMD and NVIDIA GPUs and outperform existing implementations. In particular, we propose algorithmic improvements and tuning parameters to enable performance portability.

  • We evaluate the performance of the new kernels against SpMV kernels available in AMD’s hipSPARSE library and NVIDIA’s cuSPARSE library.

  • Using the 2,800 test matrices from the Suite Sparse Matrix Collection, we derive performance profiles to assess how well the distinct kernels generalize.

  • We compare the SpMV performance limits of high-end GPUs from AMD and NVIDIA.

  • To the best of our knowledge, Ginkgo is the first open-source sparse linear algebra library based on C++ that features multiple SpMV kernels suitable for irregular matrices with back ends for both AMD’s and NVIDIA’s GPUs.

  • We ensure full result reproducibility by making all kernels publicly available as part of the Ginkgo library, and by archiving the performance results in a public repository.

Before providing more details about the sparse matrix formats and the processing strategy of the related SpMV routines in Sect. 3, we recall some basics about sparse matrix formats in Sect. 2. In Sect. 3.4, we combine several basic matrix storage formats into the so-called “hybrid” format (HYB) that splits the matrix into parts to exploit the performance niches of various basic formats. In a comprehensive evaluation in Sect. 4, we first compare the performance of Ginkgo’s SpMV functionality with the SpMV kernels available in NVIDIA’s cuSPARSE library and AMD’s hipSPARSE library, then derive performance profiles to characterize all kernels with respect to specialization and generalization, and finally compare the SpMV performance of AMD’s RadeonVII GPU with NVIDIA’s V100 GPU. We conclude in Sect. 5 with a summary of the observations.

2 Review of Sparse Matrix Formats

For matrices where most elements are zero, which is typical for, e.g., finite element discretizations or network representations, storing all values explicitly is expensive in terms of memory and computational effort. In response, sparse matrix formats reduce the memory footprint and the computational effort by focusing on the nonzero matrix values [3]. In some cases, additionally storing some zero elements can improve memory access and data-parallel processing [4]. While there exists a long and still expanding list of sparse matrix formats (some of them tailored towards specific problems), we illustrate some of the most common basic formats (DENSE, COO, CSR, ELL) in Fig. 1.

Fig. 1. Different storage formats for a sparse matrix of dimension \(m\times n\) containing \(n_z\) nonzeros along with the memory consumption [6].
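To make the memory layouts concrete, the following sketch shows how the basic formats of Fig. 1 can be stored as flat arrays, using 64-bit floating point values and 32-bit indexes as in our experiments. The struct and member names are illustrative only and do not correspond to Ginkgo's actual matrix classes.

```cuda
#include <cstdint>
#include <vector>

// Minimal containers mirroring the formats of Fig. 1 (illustrative only).
struct Coo {                            // memory: n_z * (8 + 4 + 4) bytes
    std::vector<double> values;         // n_z nonzero values
    std::vector<std::int32_t> row_idx;  // row index of each nonzero
    std::vector<std::int32_t> col_idx;  // column index of each nonzero
};

struct Csr {                            // memory: n_z * (8 + 4) + (m + 1) * 4 bytes
    std::vector<double> values;         // nonzero values, stored row by row
    std::vector<std::int32_t> col_idx;  // column index of each nonzero
    std::vector<std::int32_t> row_ptr;  // m + 1 offsets into values/col_idx
};

struct Ell {                            // memory: stride * max_row_nnz * (8 + 4) bytes
    std::int32_t stride;                // typically the number of rows m
    std::int32_t max_row_nnz;           // width of the padded format
    std::vector<double> values;         // column-major, padded with explicit zeros
    std::vector<std::int32_t> col_idx;  // padding entries carry a dummy column index
};
```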

The optimization of the SpMV kernel for manycore GPUs remains a topic of major interest [5, 9, 12]. Many of the most recent algorithm developments increase the efficiency by using prefix-sum computations [13] and intra-warp communication [10] on modern manycore hardware.

3 Sparse Matrix Vector Kernel Designs

We realize all SpMV kernels in the vendors’ native languages: CUDA for NVIDIA GPUs and HIP for AMD GPUs. Given the different hardware characteristics, see Table 1, we optimize kernel parameters like the group size for the distinct architectures. More importantly, for the CSR, ELL, and HYB kernels, we modify the SpMV execution strategy on the AMD architecture compared to the strategy that was previously realized for NVIDIA architectures [2].

3.1 Balancing COO SpMV Kernel

Flegar et al. [6] introduced a load-balancing COO SpMV based on the idea of parallelizing across the nonzeros of a sparse matrix. This way, all threads have the same workload, and coalesced access to the column indexes and the values of the sparse matrix is enabled. At the same time, parallelizing across nonzeros requires the use of atomicAdd operations to avoid race conditions, see Algorithm 1.

Algorithm 1. The load-balancing COO SpMV kernel [6].
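The following CUDA kernel is a deliberately simplified sketch of the load-balancing idea: a grid-stride loop gives every thread (almost) the same number of nonzeros and keeps the accesses to the value and index arrays coalesced, while atomicAdd resolves the race on the output vector. Ginkgo's actual Algorithm 1 [6] additionally performs warp-level segmented reductions to issue far fewer atomics; the kernel name and signature below are ours.

```cuda
// Simplified load-balancing COO SpMV (sketch, not Ginkgo's Algorithm 1).
// Requires double-precision atomicAdd (compute capability >= 6.0 on NVIDIA);
// y must be initialized (e.g., to zero) before the launch.
__global__ void coo_spmv_simple(int nnz, const double *__restrict__ val,
                                const int *__restrict__ row_idx,
                                const int *__restrict__ col_idx,
                                const double *__restrict__ x,
                                double *__restrict__ y)
{
    // Grid-stride loop: adjacent threads read adjacent nonzeros, so accesses
    // to val, col_idx, and row_idx are coalesced and the work is balanced.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nnz;
         i += gridDim.x * blockDim.x) {
        atomicAdd(&y[row_idx[i]], val[i] * x[col_idx[i]]);
    }
}
```

The grid size used to launch such a kernel is where the oversubscribing parameter discussed next comes into play.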

Flegar et al. [6] also introduced an “oversubscribing” parameter \(\omega \) that controls the number of threads allocated to each physical core. Increasing the oversubscribing parameter provides more active threads to hide the latency of data accesses and atomicAdds [1]. At the same time, it increases the number of atomicAdd invocations and the overhead of context switching. Based on an experimental assessment of all 2,800 matrices from the Suite Sparse Matrix Collection, Flegar et al. [6] identified oversubscribing parameters \(\omega _{\text {NVIDIA}}\) that draw a good balance between these aspects. Similarly to Flegar et al. [6], we use experiments to identify good choices \(\omega _{\text {AMD}}\) for AMD architectures by considering oversubscribing parameters \(\omega = 2^k\) \((0 \le k \le 7)\). In the Ginkgo library and our experiments, we use the settings

$$ \omega _{\text {NVIDIA}} = {\left\{ \begin{array}{ll} 8 &{} (n_z<2\cdot 10^5), \\ 32 &{} (2\cdot 10^5 \le n_z<2\cdot 10^6), \\ 128 &{} (2\cdot 10^6 \le n_z), \end{array}\right. } \qquad \omega _{\text {AMD}} = {\left\{ \begin{array}{ll} 2 &{} (n_z<10^5), \\ 8 &{} (10^5 \le n_z<10^7), \\ 32 &{} (10^7 \le n_z). \end{array}\right. } $$
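A host-side helper implementing exactly these thresholds could look as follows; the function name and interface are ours, not Ginkgo's.

```cuda
#include <cstdint>

// Hypothetical selection of the oversubscribing parameter omega according to
// the thresholds listed above.
int coo_oversubscribing(std::int64_t nnz, bool is_amd)
{
    if (is_amd) {
        return nnz < 100000 ? 2 : (nnz < 10000000 ? 8 : 32);
    }
    return nnz < 200000 ? 8 : (nnz < 2000000 ? 32 : 128);
}
```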

3.2 CSR SpMV Kernel

The most basic CSR SpMV kernel (basic CSR) assigns only one thread to each row, which results in notoriously low GPU occupancy. In Algorithm 2, we instead assign a “subwarp” (multiple threads) to each row and use warp reduction mechanisms to accumulate the partial results before writing to the output vector. This classical CSR kernel, which assigns multiple threads to each row, is inspired by the performance improvement of the ELL SpMV in [2]. We adjust the number of threads assigned to each row to the maximum number of nonzeros in a row. We select

$$\text {subwarp size} = 2^k, \quad 0 \le k \le 5\ \text {(NVIDIA)}\ \text { or }\ 0 \le k \le 6\ \text {(AMD)},$$

as the largest power of two smaller than or equal to the maximum number of nonzeros in a row, i.e.,

$$\text {subwarp size} = \max \left\{ 2^t \ \big \vert \ t \in \mathbb {Z},\ 0 \le t \le \log _2(\text {device warp size}),\ 2^t \le max\_row\_nnz \right\} .$$
Algorithm 2. Classical CSR SpMV kernel assigning a subwarp to each row.
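A condensed CUDA sketch of the subwarp-per-row strategy behind Algorithm 2 is shown below (illustrative, not Ginkgo's exact implementation); SUBWARP_SIZE is the power of two chosen by the rule above, and on AMD hardware the shuffle-based reduction would operate on wavefronts of 64 threads.

```cuda
// Classical CSR SpMV assigning one subwarp of SUBWARP_SIZE threads to each
// row; partial sums are combined with warp shuffles, so no atomics are needed.
template <int SUBWARP_SIZE>
__global__ void csr_spmv_classical(int num_rows,
                                   const int *__restrict__ row_ptr,
                                   const int *__restrict__ col_idx,
                                   const double *__restrict__ val,
                                   const double *__restrict__ x,
                                   double *__restrict__ y)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = tid / SUBWARP_SIZE;   // one subwarp per row
    const int lane = tid % SUBWARP_SIZE;  // position within the subwarp
    double sum = 0.0;
    if (row < num_rows) {
        for (int i = row_ptr[row] + lane; i < row_ptr[row + 1];
             i += SUBWARP_SIZE) {
            sum += val[i] * x[col_idx[i]];
        }
    }
    // reduce the partial sums within the subwarp
    for (int offset = SUBWARP_SIZE / 2; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xffffffffu, sum, offset, SUBWARP_SIZE);
    }
    if (row < num_rows && lane == 0) {
        y[row] = sum;
    }
}
```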

In Fig. 5 in Sect. 4, we visualize the performance improvements obtained from assigning multiple threads to each row and observe that the basic CSR SpMV is not always slower. In particular for very unbalanced matrices, assigning the same parallel resources to each row turns out to be inefficient. In response, we design a load-balancing CSRI kernel which follows the strategy of the COO SpMV described in Sect. 3.1 to balance the workload across the compute resources. For the automatic strategy selection in Algorithm 3, we define two variables, \(nnz\_limit\) and \(row\_len\_limit\), to control the kernel selection on NVIDIA and AMD GPUs: \(nnz\_limit\) is the limit on the total nonzero count, and \(row\_len\_limit\) is the limit on the maximum number of stored elements in a row. For AMD GPUs, \(nnz\_limit\) is \(10^8\) and \(row\_len\_limit\) is 768; for NVIDIA GPUs, \(nnz\_limit\) is \(10^6\) and \(row\_len\_limit\) is 1024.

Algorithm 3. Automatic selection of the CSR SpMV strategy.
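The selection logic of Algorithm 3 can be summarized by the following host-side sketch; we assume here that exceeding either limit triggers the load-balancing kernel, and the function and type names are ours.

```cuda
#include <cstdint>

enum class csr_strategy { classical, load_balancing };

// Hypothetical sketch of the CSR kernel selection with the architecture-
// specific limits given in the text.
csr_strategy select_csr_strategy(std::int64_t nnz, std::int64_t max_row_nnz,
                                 bool is_amd)
{
    const std::int64_t nnz_limit = is_amd ? 100000000 : 1000000;  // 1e8 / 1e6
    const std::int64_t row_len_limit = is_amd ? 768 : 1024;
    if (nnz > nnz_limit || max_row_nnz > row_len_limit) {
        return csr_strategy::load_balancing;  // CSRI, COO-like balancing
    }
    return csr_strategy::classical;           // subwarp-per-row kernel
}
```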

3.3 ELL SpMV Kernel

In [2], the authors demonstrated that the ELL SpMV kernel can be accelerated by assigning multiple threads to each row and using an “early stopping” strategy to terminate thread blocks early once they reach the padding part of the ELL format. When porting this strategy to AMD architectures, we discovered that the non-coalesced global memory access that can occur when assigning multiple threads to the rows of the ELL matrix (stored in column-major format) can result in low performance. The reason is that the strategy in [2] uses threads of the same group to handle one row, so adjacent threads always read matrix elements that are m (matrix size or stride) memory locations apart. To overcome this problem, we rearrange the memory access by assigning the threads of the same group to handle one column, like the classical ELL kernel, but assigning several groups to each row to increase GPU utilization. Because the threads handling the elements of a row may belong to the same thread block but are no longer part of the same warp, we cannot use warp reductions for the partial sums but need to invoke atomicAdds on shared memory. Figure 2 visualizes the different memory access strategies.

In our experiments, we set the “group_size” to a multiple of 32 for both AMD and NVIDIA architectures. The group_size is the number of contiguous elements read by a thread block, and num_group is the number of threads in a thread block accessing the same row. We use a block size of 512 in the ELL kernel. Since the group_size must be a multiple of 32, the maximum num_group is \(block\_size/min\_group\_size = 512/32 = 16\). We visualize in Fig. 8 the improvement of the new ELL SpMV kernel over the kernel previously employed [2].

Algorithm 4. Ginkgo’s ELL SpMV kernel.
Algorithm 5. Computation of the number of groups assigned to each row.

In Algorithm 4, we present the ELL SpMV kernel implemented in Ginkgo for SIMD architectures like GPUs. The number of groups assigned to a row is computed via Algorithm 5. Generally, the number of groups increases with the number of nonzero elements accumulated in a single row. However, if \(num\_group = 16\), multiple thread blocks may be assigned to the same row, see line 8 in Algorithm 5. This strategy aims at increasing the occupancy of the GPU multiprocessors when targeting short-and-wide matrices that accumulate many elements in few rows. After the group is determined, the start index for a specific thread is computed in line 2 of Algorithm 4, with a step size equal to the total number of threads accessing the same row. The threads process the data in the loop in lines 6–12. This kernel still uses the early stopping in lines 8–10 introduced in [2]. After completion of the matrix vector multiplication step, the partial sums accumulated in thread-local variables are reduced (line 13) and added to the output vector in global memory, see line 15. Even though this operation requires an atomic operation, as multiple groups (part of distinct thread blocks) may operate on the same row, the chance of atomic collisions is small due to the previous reduction in line 13.
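The following CUDA kernel sketches the group-based access pattern described above; it is illustrative and not Ginkgo's Algorithm 4. We assume that padded entries are marked by a negative column index, that y is zero-initialized, and that the grid contains ceil(num_rows / GROUP_SIZE) * blocks_per_row thread blocks.

```cuda
// Group-based ELL SpMV sketch: a block of BLOCK_SIZE threads covers GROUP_SIZE
// consecutive rows; the BLOCK_SIZE / GROUP_SIZE groups (and possibly several
// blocks) assigned to a row start at different stored columns and step by the
// total number of threads per row. Adjacent threads of a group access adjacent
// rows, so reads from the column-major ELL arrays are coalesced.
template <int BLOCK_SIZE, int GROUP_SIZE>
__global__ void ell_spmv_grouped(int num_rows, int max_row_nnz, int stride,
                                 int blocks_per_row,
                                 const int *__restrict__ col_idx,
                                 const double *__restrict__ val,
                                 const double *__restrict__ x,
                                 double *__restrict__ y)
{
    constexpr int num_groups = BLOCK_SIZE / GROUP_SIZE;
    __shared__ double partial[GROUP_SIZE];      // one accumulator per row
    const int lane = threadIdx.x % GROUP_SIZE;  // position within the group
    const int group = threadIdx.x / GROUP_SIZE;
    const int row = (blockIdx.x / blocks_per_row) * GROUP_SIZE + lane;
    const int start = (blockIdx.x % blocks_per_row) * num_groups + group;
    const int step = num_groups * blocks_per_row;  // threads per row
    if (group == 0) {
        partial[lane] = 0.0;
    }
    __syncthreads();
    double sum = 0.0;
    if (row < num_rows) {
        for (int i = start; i < max_row_nnz; i += step) {
            const int col = col_idx[row + i * stride];
            if (col < 0) break;                 // early stopping at the padding
            sum += val[row + i * stride] * x[col];
        }
        atomicAdd(&partial[lane], sum);         // combine the groups of a row
    }
    __syncthreads();
    if (group == 0 && row < num_rows) {
        atomicAdd(&y[row], partial[lane]);      // other blocks may add to y[row]
    }
}
```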

Fig. 2. Comparison of the memory access for different ELL SpMV kernels.

3.4 Hybrid Matrix Formats and Optimal Matrix Splitting

Ginkgo’s hybrid (“HYB”) format splits the matrix into two parts and stores the regular part in the ELL format and the irregular part in the COO format. Flegar et al. [6] demonstrated that Ginkgo’s COO SpMV achieves good performance for irregular matrices on NVIDIA architectures, and the results in Sect. 4 confirm that Ginkgo’s COO SpMV performs well also on AMD architectures. How the HYB format partitions a matrix into the ELL and the COO part impacts the memory requirements and the performance. Anzt et al. [2] derived strategies basing the partitioning on the nonzeros-per-row distribution of the matrix. We modify this strategy by adding a condition based on the ratio between the maximum nonzeros-per-row and the number of rows. For R being the collection of nonzeros-per-row values (one entry per row), we define the functions \(Q_R\) and \(F_R\):

$$\begin{aligned} Q_R(x) := \min \left\{ t \in \mathbb {N} \ \vert \ x < F_R(t) \right\} ,\qquad F_R(t) := \frac{\left| \{ r \in R \ \vert \ r \le t \} \right| }{\left| R \right| }. \end{aligned}$$

We recall that Anzt et al. [2] introduced hybrid\(\{n\}\), which takes as splitting threshold the number of nonzeros of the row at the \(n\%\)-quantile in the ascending ordering of the nonzeros-per-row values, \(Q_R(n\%)\). A variant denoted as “hybridminstorage” selects

$$n\% = \left\lfloor \frac{\#rows\times sizeof(index)}{sizeof(value)+2\times sizeof(index)} + 1 \right\rfloor $$

according to the (bit-)size of the value and index arrays; i.e., hybridminstorage is hybrid25 when storing the values as 64-bit doubles and the indexes as 32-bit integers [2]. In this paper, we enhance the hybrid\(\{n\}\) partitioning from Anzt et al. [2] by enforcing the limitation that the maximum nonzeros-per-row of the ELL part can be at most \(\#rows \times 0.0001\). We call the resulting strategy “hybridlimit\(\{n\}\)” and select hybridlimit33 (label “HYB”) as our default strategy according to the performance evaluation in Fig. 11 in Sect. 4.
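For illustration, the splitting threshold of hybridlimit\(\{n\}\) can be computed as in the following host-side sketch, where n_percent is the quantile expressed as a fraction (e.g., 0.33 for hybridlimit33); the function name and the edge-case handling (e.g., empty matrices) are ours, not Ginkgo's.

```cuda
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hybridlimit{n} splitting: the ELL width is the nonzeros-per-row
// value at the n%-quantile, capped at 0.0001 * #rows; everything beyond this
// width in a row is stored in the COO part.
int hybrid_ell_width(std::vector<int> row_nnz, double n_percent,
                     std::int64_t num_rows)
{
    std::sort(row_nnz.begin(), row_nnz.end());  // ascending nonzeros-per-row
    const auto pos =
        static_cast<std::size_t>(n_percent * (row_nnz.size() - 1));
    const int quantile = row_nnz[pos];          // Q_R(n%)
    const int limit = static_cast<int>(num_rows * 0.0001);
    return std::min(quantile, limit);
}
```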

4 Experimental Performance Assessment

4.1 Experiment Setup

In this paper, we consider NVIDIA’s V100 (SXM2 16 GB) GPU with support for compute capability 7.0 [14] and AMD’s RadeonVII GPU based on the gfx906 architecture. See Table 1 for some hardware specifications [16]. We note that the AMD RadeonVII is not a server-line GPU, but it provides the same memory bandwidth as the AMD HPC GPU MI50 and thus should be comparable for memory-bound operations such as the SpMV kernels. We use the major programming ecosystems for the distinct architectures: CUDA for NVIDIA GPUs and HIP for AMD GPUs. The CUDA GPU kernels were compiled using CUDA version 9.2, and the HIP GPU kernels were compiled using HIP version 2.8.19361.

Table 1. Specifications of the V100 SXM2 16 GB and the RadeonVII [16].

The performance evaluation covers more than 2,800 test matrices of the Suite Sparse Matrix Collection [15]. Some matrices contain dense rows, which makes the conversion to the ELL format virtually impossible. We ignore those matrices in the performance evaluation of the ELL SpMV kernel.

All experiments are performed in IEEE double precision arithmetic, and the GFLOP/s rates are computed under the assumption that the number of flops is always \(2n_z\), where \(n_z\) is the number of nonzeros of the test matrix (ignoring padding).
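Concretely, the reported rates follow this convention (a hypothetical helper, not the benchmarking harness itself):

```cuda
#include <cstdint>

// GFLOP/s convention used throughout: 2 flops (one multiply, one add) per
// stored nonzero, regardless of any padding introduced by the format.
double spmv_gflops(std::int64_t nnz, double runtime_seconds)
{
    return 2.0 * static_cast<double>(nnz) / runtime_seconds * 1e-9;
}
```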

4.2 COO SpMV Performance Analysis

Fig. 3. Performance of Ginkgo’s and vendors’ COO SpMV.

Fig. 4. Relative performance of Ginkgo’s and vendors’ COO SpMV.

We first evaluate the performance of the load-balancing COO SpMV kernel. In Fig. 3a, we compare against cuSPARSE’s COO kernel (cusparseDhybmv with CUSPARSE_HYB_PARTITION_USER and a threshold of 0), and in Fig. 3b, we compare against hipSPARSE’s COO kernel (hipsparseDhybmv with HIPSPARSE_HYB_PARTITION_USER and a threshold of 0). Each dot reflects one test matrix from the Suite Sparse collection. The x-axis is the nonzero count of the matrix, and the y-axis is the performance in GFLOP/s. In Fig. 4, we present the speedup of Ginkgo’s SpMV over cuSPARSE’s COO implementation and hipSPARSE’s COO implementation, respectively. Red dots reflect test matrices where Ginkgo outperforms the vendor library, green dots reflect cases where the vendor library is faster. Despite the fact that the irregularity of a matrix heavily impacts the SpMV kernels’ efficiency, we can observe that Ginkgo’s COO SpMV achieves much higher performance than both NVIDIA’s and AMD’s COO kernels in most cases. Overall, Ginkgo achieves an average speedup of about 2.5x over cuSPARSE’s COO SpMV and an average speedup of about 1.5x over hipSPARSE’s COO SpMV.

4.3 CSR SpMV Performance Analysis

Fig. 5. Performance improvement of the (current) classical CSR SpMV over the (previous) basic CSR SpMV.

In the CSR SpMV performance analysis, we first demonstrate the improvement of assigning multiple threads to each row (classical CSR) over the implementation assigning only one thread to each row (basic CSR); see Fig. 5 for the NVIDIA and AMD back ends, respectively. For a few matrices with many nonzeros, the basic CSR is 5x–10x faster than the classical CSR. To overcome this problem, Ginkgo uses Algorithm 3, which chooses the load-balancing CSRI algorithm for problems with large nonzero counts.

Fig. 6. Performance of Ginkgo’s and vendors’ CSR SpMV.

Fig. 7. Relative performance of Ginkgo’s and vendors’ CSR SpMV.

Next, we compare the performance of the Ginkgo CSR SpMV (that automatically interfaces to either the load-balancing CSRI kernel or the classical CSR, see Sect. 3.2) with the vendors’ CSR SpMV. Anzt et al. [2] identified the cusp_csr kernel (cusparseDcsrmv) as the overall performance winner among the different NVIDIA CSR implementations. For the AMD CSR SpMV kernel, we use the CSR kernel (hipsparseDcsrmv) provided in hipSPARSE. For completeness, we mention that the rocSPARSE library (outside the HIP ecosystem) contains a CSR kernel that renders better SpMV performance for irregular matrices on AMD GPUs. We refrain from considering it as we want to stay within the HIP ecosystem, which is anticipated to serve as the primary dissemination tool for AMD’s sparse linear algebra technology.

In Fig. 6, we compare the Ginkgo CSR SpMV with the cusparseDcsrmv CSR kernel available in NVIDIA’s cuSPARSE library and the hipsparseDcsrmv CSR kernel available in AMD’s hipSPARSE library, respectively. In the relative performance analysis, Fig. 7, we use the ratio \(\frac{max(row\_n_z)}{num\_rows}\) for the x-axis as this is the parameter used in Ginkgo’s CSR SpMV to decide which CSR algorithm is selected. Ginkgo CSR achieves significant speedups for large x-values (up to 900x speedup on V100 and 700x speedup on RadeonVII). At the same time, there are a few cases where the Ginkgo CSR SpMV is slower than the library implementations (up to 20x slowdown on V100 and 5x slowdown on RadeonVII).

4.4 ELL SpMV Performance Analysis

Fig. 8. Relative performance of Ginkgo’s current ELL SpMV against the previous one.

First, we investigate the performance improvement we obtain by changing the memory access strategy for the ELL SpMV kernel, see Sect. 3. Interestingly, moving to the new ELL SpMV algorithm does not render noteworthy performance improvements on NVIDIA’s V100 GPU, as can be seen in Fig. 8a. At the same time, the performance improvements are significant for AMD’s RadeonVII, as shown in Fig. 8b. In the new ELL SpMV algorithm, we improve the global memory access at the cost of atomicAdd operations on shared memory (which are more expensive than warp reductions). In consequence, the current ELL SpMV is not always faster than the previous ELL SpMV.

Fig. 9. Performance of Ginkgo’s and vendors’ ELL SpMV.

Fig. 10. Relative performance of Ginkgo’s and vendors’ ELL SpMV.

In Fig. 9, we compare Ginkgo’s ELL SpMV kernel against the cuSPARSE ELL kernel (cusparseDhybmv with CUSPARSE_HYB_PARTITION_MAX) and the hipSPARSE ELL kernel (hipsparseDhybmv with HIPSPARSE_HYB_PARTITION_MAX), respectively. The hipSPARSE ELL kernel refuses to process matrices that have more than \(\frac{\#nnz - 1}{\#rows} +1\) elements in a row; thus, we have much fewer data points for the hipSPARSE ELL SpMV (the blue points in Fig. 9b). In Fig. 10, Ginkgo’s ELL is faster than its counterparts available in the vendor libraries if the ratio \(\frac{max(row\_n_z)}{num\_rows} > 10^{-2}\). For the other cases, Ginkgo and the vendor libraries are comparable in their ELL SpMV performance.

4.5 HYB SpMV Performance Analysis

Fig. 11. Performance profile comparing the different Ginkgo HYB splitting strategies.

Before comparing against the vendor implementations, we investigate the performance of our HYB SpMV kernel for the different partitioning strategies denoted by hybrid\(\{n\}\), hybridlimit\(\{n\}\), and hybridminstorage (which is the same as hybrid25), as introduced in Sect. 3.4. We use a performance profile [8] over all Suite Sparse matrices to compare the strategies with respect to specialization and generalization. A performance profile identifies, for a maximum acceptable slowdown compared to the fastest algorithm (x-axis), the share of test problems for which a strategy stays within that slowdown (y-axis). In Fig. 11, we visualize the performance profiles for the V100 and RadeonVII architectures. Although the HYB strategy (which corresponds to hybridlimit33) does not win in terms of specialization (maximum slowdown of 1), we favor this strategy since it provides the best generality: when considering a maximum acceptable slowdown factor of less than 1.75, this format wins in terms of problem share.
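For reference, the value of a performance profile at a given slowdown factor can be computed from per-matrix runtimes as in the following sketch (a generic illustration, not the evaluation script used for the figures):

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// runtimes[m][k] is the runtime of kernel k on matrix m. The profile value of
// kernel k at slowdown factor tau is the share of matrices on which kernel k
// is at most tau times slower than the fastest kernel for that matrix.
double profile_value(const std::vector<std::vector<double>> &runtimes,
                     std::size_t kernel, double tau)
{
    std::size_t within = 0;
    for (const auto &per_matrix : runtimes) {
        const double best =
            *std::min_element(per_matrix.begin(), per_matrix.end());
        if (per_matrix[kernel] <= tau * best) {
            ++within;
        }
    }
    return static_cast<double>(within) / static_cast<double>(runtimes.size());
}
```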

Fig. 12. Performance of Ginkgo’s and vendors’ HYB SpMV.

Fig. 13. Relative performance of Ginkgo’s and vendors’ HYB SpMV.

In Fig. 12, we see that Ginkgo’s HYB SpMV achieves peak performance similar to cuSPARSE’s cusparseDhybmv HYB SpMV and hipSPARSE’s hipsparseDhybmv HYB SpMV, but Ginkgo has a much higher average performance than cuSPARSE or hipSPARSE. Figure 13a and Fig. 13b visualize the HYB SpMV performance relative to the vendor libraries, and we identify significant speedups for most problems and moderate slowdowns for a few cases.

4.6 All SpMV Performance Profile Analysis

Fig. 14. Performance profile comparing multiple SpMV kernels on the V100.

In Fig. 14 and Fig. 15, we use performance profiles to assess the specialization and generalization of all matrix formats we consider. On the V100 (Fig. 14), Ginkgo’s CSR is the fastest for about 30% of the test cases, and Ginkgo’s HYB is the winner in terms of generality (if the acceptable slowdown factor is larger than 1.0625). Very similarly, on the RadeonVII (Fig. 15), Ginkgo’s CSR is the fastest kernel for roughly 30% of the test cases, and Ginkgo’s HYB is the generalization winner if the acceptable slowdown factor is larger than 1.375. We note that the hipSPARSE ELL kernel stays at a low problem ratio as it refuses to process matrices that have more than \(\frac{\#nnz - 1}{\#rows} +1\) elements in a row.

Fig. 15. Performance profile comparing multiple SpMV kernels on the RadeonVII.

The analysis comparing Ginkgo’s different SpMV kernels to the vendor libraries already indicated that AMD’s hipSPARSE library generally features much better-engineered kernels than NVIDIA’s cuSPARSE library. In consequence, the performance profiles of AMD’s SpMV kernels are also much closer to Ginkgo’s SpMV kernel profiles than those of NVIDIA’s SpMV kernels.

4.7 RadeonVII vs V100 SpMV Performance Analysis

We finally compare the SpMV performance limits of RadeonVII and V100 in Fig. 16. We consider both Ginkgo’s back ends for the two architectures, and the SpMV kernels available in the vendor libraries (labeled “Sparselib”).

In most cases, the V100 is faster than the RadeonVII, but the speedup factors are moderate, with an average of around 2x. The RadeonVII shows better performance for matrices that contain many nonzeros. The higher memory bandwidth of the RadeonVII might be one reason for these performance advantages, but as there are typically many factors (such as context switching, warp size, the number of multiprocessors, etc.) affecting the performance of SpMV kernels, identifying the origin of the performance differences is difficult.

While NVIDIA’s V100 outperforms AMD’s RadeonVII in most tests, we acknowledge that the price of a V100 (16 GB SXM2) is currently more than an order of magnitude higher than that of a RadeonVII.

Fig. 16. Comparison of the SpMV kernel implementations of hipSPARSE on RadeonVII and cuSPARSE on V100.

5 Summary and Outlook

In this paper, we have presented a comprehensive evaluation of SpMV kernels for AMD and NVIDIA GPUs, including routines for the CSR, COO, ELL, and HYB formats. We have optimized all kernels for the latest GPU architectures from both vendors, including new algorithmic developments and parameter tuning. All kernels are part of the Ginkgo open-source library and typically outperform their counterparts available in the vendor libraries NVIDIA cuSPARSE and AMD hipSPARSE. We accompany the kernel release with a performance database and a web tool that allows investigating the performance characteristics interactively. We also conducted an extensive SpMV performance comparison on both AMD RadeonVII and NVIDIA V100 hardware. We show that, despite NVIDIA’s V100 providing better performance in many cases, AMD’s RadeonVII with the hipSPARSE library is able to compete against NVIDIA’s V100, in particular for matrices with a high number of nonzero elements. In addition, we note that due to the price discrepancy between the two GPUs (AMD’s RadeonVII costs roughly 6.6% of the price of an NVIDIA V100), the AMD hardware provides a much better performance-per-dollar ratio. This may indicate that, after a long period of NVIDIA dominating the HPC GPU market, AMD is stepping up to reclaim a serious competitor position.