DOI: 10.1145/3489517.3530508
Research article, Open access

Heuristic adaptability to input dynamics for SpMM on GPUs

Published: 23 August 2022

Abstract

Sparse Matrix-Matrix Multiplication (SpMM) serves as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while the following issues remain to be solved: (1) Orthogonal design principles considering sparsity: orthogonal design principles for such a sparse problem should be extracted to form different algorithms, which can then be used for performance tuning. (2) Nontrivial implementations in the algorithm space: combining orthogonal design principles to create new algorithms raises new challenges, such as handling thread races. (3) Heuristic adaptability to input dynamics: a heuristic is required to dynamically optimize code for dynamic inputs.
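For context, SpMM computes C = A × B, where A is an m×k sparse matrix (e.g., stored in CSR format) and B and C are dense. Below is a minimal CUDA sketch of a straightforward baseline kernel; it is illustrative only, not the paper's kernel, and the names as well as the one-output-element-per-thread parallelization are our assumptions.

    // Baseline SpMM: C = A * B, with A sparse in CSR and B, C dense (row-major).
    // Hypothetical sketch for illustration; not the paper's DA-SpMM kernel.
    __global__ void spmm_csr_baseline(int m, int n,
                                      const int*   __restrict__ rowptr,  // size m+1
                                      const int*   __restrict__ colidx,  // size nnz
                                      const float* __restrict__ vals,    // size nnz
                                      const float* __restrict__ B,       // k x n
                                      float*       __restrict__ C) {     // m x n
        int row = blockIdx.x;                             // one row of A per block
        int col = blockIdx.y * blockDim.x + threadIdx.x;  // one column of B per thread
        if (row >= m || col >= n) return;
        float acc = 0.0f;
        for (int p = rowptr[row]; p < rowptr[row + 1]; ++p)
            acc += vals[p] * B[colidx[p] * n + col];      // indirect gather from B
        C[row * n + col] = acc;
    }

Adjacent threads read consecutive columns of B, so accesses to B and C coalesce; the irregular part is the indirect gather through colidx. How the loop over nonzeros is split, cached, and reduced is exactly where input-dependent design choices arise.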
To tackle these challenges, we first propose a novel three-loop model to extract orthogonal design principles for SpMM on GPUs. The model not only covers previous SpMM designs, but also yields new designs absent from previous studies. We propose techniques such as conditional reduction to implement algorithms missing from previous studies. We further propose DA-SpMM, a Data-Aware heuristic GPU kernel for SpMM that adaptively optimizes code according to input dynamics. Extensive experimental results show that DA-SpMM achieves a 1.26x-1.37x average speedup over the best NVIDIA cuSPARSE algorithm, and brings up to 5.59x end-to-end speedup to Graph Neural Networks.
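The abstract does not define conditional reduction, but the underlying problem is standard: once a row's nonzeros are split across threads, partial sums targeting the same element of C race with each other. A common GPU remedy, sketched below under our own assumptions (the paper's implementation may differ), is to reduce partial sums within a warp using shuffles and pay for an atomicAdd only when a row actually spans multiple warps.

    // Race-free accumulation for split rows; hypothetical sketch.
    // Assumes all 32 lanes of the warp hold partial sums for the same C element.
    __device__ void reduce_partial(float partial, float* c_elem, bool row_is_split) {
        // Warp-level tree reduction: after 5 shuffle steps, lane 0 holds the sum.
        for (int offset = 16; offset > 0; offset >>= 1)
            partial += __shfl_down_sync(0xffffffff, partial, offset);
        if ((threadIdx.x & 31) == 0) {            // lane 0 writes for the whole warp
            if (row_is_split)
                atomicAdd(c_elem, partial);       // other warps also touch this element
            else
                *c_elem = partial;                // exclusive owner: plain store suffices
        }
    }

The "conditional" part is the final branch: the expensive atomic is paid only for rows that need it, which illustrates how a kernel's reduction strategy can adapt to the sparsity pattern of the input rather than being fixed statically.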

References

[1]
Basic linear algebra for sparse matrices on nvidia gpus. =https://developer.nvidia.com/cusparse.
[2]
Scott P. Kolodziej et al. The suitesparse matrix collection website interface. Journal of Open Source Software, 4(35):1244, 2019.
[3]
Wencong Xiao et al. Tux2: Distributed graph computation for machine learning. In NSDI, pages 669--682, 2017.
[4]
Song Han et al. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv:1510.00149, 2015.
[5]
Song Han et al. Eie: Efficient inference engine on compressed deep neural network. In ISCA, pages 243--254, 2016.
[6]
Victor Sanh et al. Movement pruning: Adaptive sparsity by fine-tuning. arXiv:2005.07683, 2020.
[7]
Guyue Huang et al. Ge-spmm: general-purpose sparse matrix-matrix multiplication on gpus for graph neural networks. In SC, pages 1--12, 2020.
[8]
Bahar Asgari, Ramyad Hadidi, Jiashen Cao, Da Eun Shim, Sung Kyu Lim, and Hyesoon Kim. Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction. HPCA, pages 908--920, 2021.
[9]
Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. arXiv:1910.04732, 2019.
[10]
Changwan Hong et al. Adaptive sparse tiling for sparse matrix multiplication. In PPoPP, pages 300--314, 2019.
[11]
Peng Jiang et al. A novel data transformation and execution strategy for accelerating sparse matrix multiplication on gpus. In PPoPP, pages 376--388, 2020.
[12]
Trevor Gale et al. Sparse gpu kernels for deep learning. In SC, pages 1--14, 2020.
[13]
Ziheng Wang. Sparsert: Accelerating unstructured sparsity on gpus for deep learning inference. In PACT, pages 31--42, 2020.
[14]
William L Hamilton et al. Inductive representation learning on large graphs. In NeurIPS, pages 1025--1035, 2017.
[15]
Carl Yang et al. Design principles for sparse matrix multiplication on the gpu. In Euro-Par, pages 672--687. Springer, 2018.
[16]
Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC, pages 1--11, 2009.
[17]
Tianqi Chen et al. Tvm: An automated end-to-end optimizing compiler for deep learning. In OSDI, pages 578--594, 2018.
[18]
Fredrik Kjolstad et al. Taco: A tool to generate tensor algebra kernels. In ASE, pages 943--948, 2017.
[19]
Joseph L Greathouse and Mayank Daga. Efficient sparse matrix-vector multiplication on gpus using the csr storage format. In SC, pages 769--780, 2014.
[20]
Duane Merrill and Michael Garland. Merge-based parallel sparse matrix-vector multiplication. In SC, pages 678--689, 2016.
[21]
Jee W Choi et al. Model-driven auto tuning of sparse matrix-vector multiply on gpus. ACM Sigplan Notices, 45(5):115--126, 2010.
[22]
Israt Nisa et al. Effective machine learning based format selection and performance modeling for SpMV on GPUs. IPDPSW, pages 1056--1065, 2018.
[23]
Qingxiao Sun et al. Sptfs: Sparse tensor format selection for mttkrp via deep learning. In SC, 2020.
[24]
Lightgbm. https://lightgbm.readthedocs.io/en/latest/.
[25]
Carl Yang et al. Graphblast: A high-performance linear algebra-based graph framework on the gpu. arXiv:1908.01407, 2019.
[26]
Deepayan Chakrabarti et al. R-mat: A recursive model for graph mining. In ICDM, pages 442--446, 2004.
[27]
Minjie Wang et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv:1909.01315, 2019.
[28]
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016.
[29]
William L Hamilton et al. Reddit dataset in graphsage, website, 2017. http://snap.stanford.edu/graphsage/.


Information

Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022, 1462 pages
ISBN: 9781450391429
DOI: 10.1145/3489517

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. GPU
  2. graph neural networks
  3. sparse matrix-matrix multiplication


Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10-14, 2022
San Francisco, California, USA

Acceptance Rates

Overall acceptance rate: 1,770 of 5,499 submissions (32%)


Cited By

  • (2024) DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, pp. 253-267. DOI: 10.1145/3620666.3651378
  • (2024) PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 2-13. DOI: 10.1109/IPDPS57955.2024.00010
  • (2024) STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning. IEEE Access, Vol. 12, pp. 70581-70599. DOI: 10.1109/ACCESS.2024.3402326
  • (2023) SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, pp. 660-678. DOI: 10.1145/3582016.3582047
  • (2023) TANGO: Re-thinking Quantization for Graph Neural Network Training on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-14. DOI: 10.1145/3581784.3607037
  • (2023) TSTC: Two-Level Sparsity Tensor Core Enabling both Algorithm Flexibility and Hardware Efficiency. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), pp. 1-9. DOI: 10.1109/ICCAD57390.2023.10323775
  • (2023) Adaptive Sparse Deep Neural Network Inference on Resource-Constrained Cost-Efficient GPUs. 2023 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. DOI: 10.1109/HPEC58863.2023.10363545
  • (2023) Online Adaptive Mahalanobis Distance Estimation. 2023 IEEE International Conference on Big Data (BigData), pp. 56-65. DOI: 10.1109/BigData59044.2023.10386869
  • (2023) Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU. CCF Transactions on High Performance Computing, 5(2):210-227. DOI: 10.1007/s42514-023-00140-4
