DOI: 10.1145/3489517.3530508
Research article, Open access

Heuristic adaptability to input dynamics for SpMM on GPUs

Published: 23 August 2022

Abstract

Sparse Matrix-Matrix Multiplication (SpMM) serves as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while the following issues remain to be solved: (1) Orthogonal design principles considering sparsity: orthogonal design principles for such a sparse problem should be extracted to form different algorithms, which can then be used for performance tuning. (2) Nontrivial implementations in the algorithm space: combining orthogonal design principles to create new algorithms raises new challenges, such as handling thread races. (3) Heuristic adaptability to input dynamics: a heuristic is required to dynamically optimize code for dynamic inputs.
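For context, SpMM computes C = A × B, where A is an m×k sparse matrix (e.g., stored in CSR format) and B and C are dense. Below is a minimal CUDA sketch of a straightforward baseline kernel; it is illustrative only, not the paper's kernel, and the names as well as the one-output-element-per-thread parallelization are our assumptions.

    // Baseline SpMM: C = A * B, with A sparse in CSR and B, C dense (row-major).
    // Hypothetical sketch for illustration; not the paper's DA-SpMM kernel.
    __global__ void spmm_csr_baseline(int m, int n,
                                      const int*   __restrict__ rowptr,  // size m+1
                                      const int*   __restrict__ colidx,  // size nnz
                                      const float* __restrict__ vals,    // size nnz
                                      const float* __restrict__ B,       // k x n
                                      float*       __restrict__ C) {     // m x n
        int row = blockIdx.x;                             // one row of A per block
        int col = blockIdx.y * blockDim.x + threadIdx.x;  // one column of B per thread
        if (row >= m || col >= n) return;
        float acc = 0.0f;
        for (int p = rowptr[row]; p < rowptr[row + 1]; ++p)
            acc += vals[p] * B[colidx[p] * n + col];      // indirect gather from B
        C[row * n + col] = acc;
    }

Adjacent threads read consecutive columns of B, so accesses to B and C coalesce; the irregular part is the indirect gather through colidx. How the loop over nonzeros is split, cached, and reduced is exactly where input-dependent design choices arise.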
To tackle these challenges, we first propose a novel three-loop model to extract orthogonal design principles for SpMM on GPUs. The model not only covers previous SpMM designs, but also yields new designs absent from previous studies. We propose techniques such as conditional reduction to implement algorithms missing from previous studies. We further propose DA-SpMM, a Data-Aware heuristic GPU kernel for SpMM that adaptively optimizes code according to input dynamics. Extensive experimental results show that DA-SpMM achieves a 1.26x-1.37x average speedup over the best NVIDIA cuSPARSE algorithm, and brings up to 5.59x end-to-end speedup to Graph Neural Networks.
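The abstract does not define conditional reduction, but the underlying problem is standard: once a row's nonzeros are split across threads, partial sums targeting the same element of C race with each other. A common GPU remedy, sketched below under our own assumptions (the paper's implementation may differ), is to reduce partial sums within a warp using shuffles and pay for an atomicAdd only when a row actually spans multiple warps.

    // Race-free accumulation for split rows; hypothetical sketch.
    // Assumes all 32 lanes of the warp hold partial sums for the same C element.
    __device__ void reduce_partial(float partial, float* c_elem, bool row_is_split) {
        // Warp-level tree reduction: after 5 shuffle steps, lane 0 holds the sum.
        for (int offset = 16; offset > 0; offset >>= 1)
            partial += __shfl_down_sync(0xffffffff, partial, offset);
        if ((threadIdx.x & 31) == 0) {            // lane 0 writes for the whole warp
            if (row_is_split)
                atomicAdd(c_elem, partial);       // other warps also touch this element
            else
                *c_elem = partial;                // exclusive owner: plain store suffices
        }
    }

The "conditional" part is the final branch: the expensive atomic is paid only for rows that need it, which illustrates how a kernel's reduction strategy can adapt to the sparsity pattern of the input rather than being fixed statically.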

References

[1]
Basic linear algebra for sparse matrices on nvidia gpus. =https://developer.nvidia.com/cusparse.
[2]
Scott P. Kolodziej et al. The suitesparse matrix collection website interface. Journal of Open Source Software, 4(35):1244, 2019.
[3]
Wencong Xiao et al. Tux2: Distributed graph computation for machine learning. In NSDI, pages 669--682, 2017.
[4]
Song Han et al. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv:1510.00149, 2015.
[5]
Song Han et al. Eie: Efficient inference engine on compressed deep neural network. In ISCA, pages 243--254, 2016.
[6]
Victor Sanh et al. Movement pruning: Adaptive sparsity by fine-tuning. arXiv:2005.07683, 2020.
[7]
Guyue Huang et al. Ge-spmm: general-purpose sparse matrix-matrix multiplication on gpus for graph neural networks. In SC, pages 1--12, 2020.
[8]
Bahar Asgari, Ramyad Hadidi, Jiashen Cao, Da Eun Shim, Sung Kyu Lim, and Hyesoon Kim. Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction. HPCA, pages 908--920, 2021.
[9]
Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. arXiv:1910.04732, 2019.
[10]
Changwan Hong et al. Adaptive sparse tiling for sparse matrix multiplication. In PPoPP, pages 300--314, 2019.
[11]
Peng Jiang et al. A novel data transformation and execution strategy for accelerating sparse matrix multiplication on gpus. In PPoPP, pages 376--388, 2020.
[12]
Trevor Gale et al. Sparse gpu kernels for deep learning. In SC, pages 1--14, 2020.
[13]
Ziheng Wang. Sparsert: Accelerating unstructured sparsity on gpus for deep learning inference. In PACT, pages 31--42, 2020.
[14]
William L Hamilton et al. Inductive representation learning on large graphs. In NeurIPS, pages 1025--1035, 2017.
[15]
Carl Yang et al. Design principles for sparse matrix multiplication on the gpu. In Euro-Par, pages 672--687. Springer, 2018.
[16]
Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC, pages 1--11, 2009.
[17]
Tianqi Chen et al. Tvm: An automated end-to-end optimizing compiler for deep learning. In OSDI, pages 578--594, 2018.
[18]
Fredrik Kjolstad et al. Taco: A tool to generate tensor algebra kernels. In ASE, pages 943--948, 2017.
[19]
Joseph L Greathouse and Mayank Daga. Efficient sparse matrix-vector multiplication on gpus using the csr storage format. In SC, pages 769--780, 2014.
[20]
Duane Merrill and Michael Garland. Merge-based parallel sparse matrix-vector multiplication. In SC, pages 678--689, 2016.
[21]
Jee W Choi et al. Model-driven auto tuning of sparse matrix-vector multiply on gpus. ACM Sigplan Notices, 45(5):115--126, 2010.
[22]
Israt Nisa et al. Effective machine learning based format selection and performance modeling for SpMV on GPUs. IPDPSW, pages 1056--1065, 2018.
[23]
Qingxiao Sun et al. Sptfs: Sparse tensor format selection for mttkrp via deep learning. In SC, 2020.
[24]
Lightgbm. https://lightgbm.readthedocs.io/en/latest/.
[25]
Carl Yang et al. Graphblast: A high-performance linear algebra-based graph framework on the gpu. arXiv:1908.01407, 2019.
[26]
Deepayan Chakrabarti et al. R-mat: A recursive model for graph mining. In ICDM, pages 442--446, 2004.
[27]
Minjie Wang et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv:1909.01315, 2019.
[28]
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016.
[29]
William L Hamilton et al. Reddit dataset in graphsage, website, 2017. http://snap.stanford.edu/graphsage/.


Information

Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022, 1462 pages
ISBN: 9781450391429
DOI: 10.1145/3489517

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. GPU
  2. graph neural networks
  3. sparse matrix-matrix multiplication


Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10-14, 2022
San Francisco, California, USA

Acceptance Rates

Overall acceptance rate: 1,770 of 5,499 submissions (32%)


Cited By

  • (2024) DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, pp. 253-267. DOI: 10.1145/3620666.3651378
  • (2024) PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 2-13. DOI: 10.1109/IPDPS57955.2024.00010
  • (2024) STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning. IEEE Access, Vol. 12, pp. 70581-70599. DOI: 10.1109/ACCESS.2024.3402326
  • (2023) SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, pp. 660-678. DOI: 10.1145/3582016.3582047
  • (2023) TANGO: Re-thinking Quantization for Graph Neural Network Training on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-14. DOI: 10.1145/3581784.3607037
  • (2023) TSTC: Two-Level Sparsity Tensor Core Enabling both Algorithm Flexibility and Hardware Efficiency. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), pp. 1-9. DOI: 10.1109/ICCAD57390.2023.10323775
  • (2023) Adaptive Sparse Deep Neural Network Inference on Resource-Constrained Cost-Efficient GPUs. 2023 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. DOI: 10.1109/HPEC58863.2023.10363545
  • (2023) Online Adaptive Mahalanobis Distance Estimation. 2023 IEEE International Conference on Big Data (BigData), pp. 56-65. DOI: 10.1109/BigData59044.2023.10386869
  • (2023) Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU. CCF Transactions on High Performance Computing, 5(2):210-227. DOI: 10.1007/s42514-023-00140-4
