DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication

Published: 11 November 2023

Abstract

Sparse matrix-vector multiplication (SpMV) plays a key role in computational science and engineering, graph processing, and machine learning applications. Much prior work on SpMV has been devoted to problems such as random accesses to the vector x and load imbalance. However, we have experimentally found that the computation of inner products still accounts for a large share of the overhead of the SpMV operation, a cost that has been largely ignored in existing work.
In this paper, we propose DASP, a new algorithm that uses the dense matrix multiply-accumulate (MMA) units of modern GPUs to accelerate the compute part of general SpMV. We analyze the row-wise distribution of nonzeros and group the rows into three categories: long, medium, and short rows. We then organize each category into small blocks of appropriate sizes to meet the shape requirements of MMA computation. For the three categories, DASP offers different strategies that complete SpMV by efficiently utilizing the MMA units.
The experimental results on two recent NVIDIA GPUs, the A100 and the H800, show that DASP in FP64 precision outperforms five state-of-the-art SpMV methods (CSR5, TileSpMV, LSRB-CSR, and the cuSPARSE BSR and CSR formats) by average factors of 1.46x, 2.09x, 3.29x, 2.08x, and 1.52x (up to 12.64x, 17.48x, 90.59x, 283.92x, and 6.94x) on the A100, respectively. For SpMV in FP16 precision, DASP outperforms cuSPARSE by average factors of 1.70x and 1.75x (up to 26.47x and 65.94x) on the A100 and H800, respectively.
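
As a concrete illustration of the row-grouping step described above, the following C++ sketch classifies CSR rows by nonzero count into the three categories DASP distinguishes. This is a minimal sketch under our own assumptions, not the authors' released code: the thresholds long_threshold and short_threshold and the helper classify_rows are hypothetical placeholders, since the paper's tuned block sizes are not given in the abstract.

#include <cstdint>
#include <vector>

// Buckets of row indices, one per DASP row category.
struct RowBuckets {
    std::vector<int> long_rows;    // many nonzeros: split one row across several MMA blocks
    std::vector<int> medium_rows;  // one row fills roughly one MMA block
    std::vector<int> short_rows;   // several rows packed together into one MMA block
};

// Classify the rows of a CSR matrix by their nonzero count.
// row_ptr has n_rows + 1 entries; row i holds row_ptr[i+1] - row_ptr[i] nonzeros.
// The threshold values are illustrative assumptions only.
RowBuckets classify_rows(const std::vector<int64_t>& row_ptr,
                         int64_t long_threshold = 128,
                         int64_t short_threshold = 4) {
    RowBuckets b;
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    for (int i = 0; i < n_rows; ++i) {
        const int64_t nnz = row_ptr[i + 1] - row_ptr[i];
        if (nnz >= long_threshold)      b.long_rows.push_back(i);
        else if (nnz > short_threshold) b.medium_rows.push_back(i);
        else                            b.short_rows.push_back(i);
    }
    return b;
}

Each bucket would then be padded and packed into blocks matching the MMA shape of the target GPU (for example, the m8n8k4 tile of FP64 tensor core MMA on the A100), so that the subsequent inner-product work runs on the dense MMA units rather than on scalar cores.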

Supplemental Material

MP4 File: SC23 paper presentation recording for "DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication", by Yuechen Lu and Weifeng Liu.


Cited By

  • Bitmap-Based Sparse Matrix-Vector Multiplication with Tensor Cores. In Proceedings of the 53rd International Conference on Parallel Processing (ICPP '24), pp. 1135-1144, August 2024. DOI: 10.1145/3673038.3673055
  • CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. In Proceedings of the 53rd International Conference on Parallel Processing (ICPP '24), pp. 640-649, August 2024. DOI: 10.1145/3673038.3673042
  • HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors. In 2023 IEEE International Conference on Cluster Computing (CLUSTER '23), pp. 209-220, October 2023. DOI: 10.1109/CLUSTER52292.2023.00025


Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2023, 1428 pages
ISBN: 9798400701092
DOI: 10.1145/3581784

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. GPU
  2. tensor core
  3. matrix multiply-accumulate
  4. sparse matrix-vector multiplication

Qualifiers

  • Research-article

Conference

SC '23

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Article Metrics

  • Downloads (last 12 months): 899
  • Downloads (last 6 weeks): 70
Reflects downloads up to 21 November 2024.
