DOI: 10.1145/3332466.3374546

A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs

Published: 19 February 2020

Abstract

SpMM (multiplication of a sparse matrix by a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications. Because of their irregular memory accesses, the two kernels have poor data locality, and data movement overhead is a bottleneck for their performance. To overcome this issue, previous works have proposed tiling and data reorganization to enhance data reuse. Despite their success in improving the performance for many sparse matrices, we find that the efficacy of existing techniques depends largely on how the non-zeros are distributed in a sparse matrix. In this work, we propose a novel row-reordering technique to improve data locality for SpMM and SDDMM on GPUs. The goal of the reordering is to place similar rows close to each other so that they can be processed together, providing better temporal locality for the values of the dense matrix. We focus on performing the row reordering efficiently, using a hierarchical clustering procedure optimized with locality-sensitive hashing. We also investigate when row reordering is useful and which factors correlate with the performance gains it delivers. Experimental evaluation on 1084 sparse matrices from the SuiteSparse collection and the Network Repository shows that our technique achieves up to 2.91x speedup for SpMM and up to 3.19x speedup for SDDMM over state-of-the-art alternatives on an NVIDIA P100 GPU.
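To make the abstract concrete, the sketch below illustrates the two kernels and the reordering idea in plain Python. It is a minimal sketch, not the authors' implementation: the MinHash-based signature bucketing is a generic, single-level stand-in for the paper's LSH-optimized hierarchical clustering, and the function names (spmm, sddmm, lsh_row_order) and their parameters are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the paper's implementation. The MinHash
# scheme and single-level bucketing stand in for the paper's hierarchical
# clustering optimized by locality-sensitive hashing.
import numpy as np
import scipy.sparse as sp

def spmm(A, B):
    """SpMM: sparse A (m x k) times dense B (k x n) -> dense (m x n)."""
    return A @ B

def sddmm(S, U, V):
    """SDDMM: compute (U @ V)[i, j] only where S[i, j] != 0,
    scaling each sampled product by S[i, j]."""
    out = S.tocsr(copy=True).astype(np.float64)
    for i in range(out.shape[0]):
        lo, hi = out.indptr[i], out.indptr[i + 1]
        cols = out.indices[lo:hi]
        # Dense dot products restricted to row i's non-zero columns.
        out.data[lo:hi] *= U[i] @ V[:, cols]
    return out

def lsh_row_order(A, num_hashes=4, seed=0):
    """Return a row permutation placing rows with similar column-index
    sets next to each other. MinHash signatures approximate Jaccard
    similarity of the column sets, so sorting rows by signature groups
    similar rows (a one-level stand-in for hierarchical clustering)."""
    rng = np.random.default_rng(seed)
    p = 2_147_483_647  # large prime for the universal hash family
    a = rng.integers(1, p, size=num_hashes)
    b = rng.integers(0, p, size=num_hashes)
    sigs = np.full((A.shape[0], num_hashes), p, dtype=np.int64)
    for i in range(A.shape[0]):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]].astype(np.int64)
        if cols.size:
            # MinHash: minimum of each universal hash over the column set.
            sigs[i] = ((a[:, None] * cols[None, :] + b[:, None]) % p).min(axis=1)
    # Lexicographic sort on the signatures keeps each bucket contiguous.
    return np.lexsort(sigs.T[::-1])

# Usage: reorder the rows, then run the kernels on the permuted matrix.
A = sp.random(1024, 1024, density=0.01, format='csr', random_state=0)
perm = lsh_row_order(A)
A_perm = A[perm]                 # similar rows are now adjacent
C = spmm(A_perm, np.random.rand(1024, 64))
D = sddmm(A_perm, np.random.rand(1024, 16), np.random.rand(16, 1024))
```

The design intent carries over to the GPU setting described in the abstract: after permuting, rows that touch similar columns of the dense matrix end up adjacent, so consecutive row groups reuse the same dense-matrix values while they are still resident in cache or shared memory.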



Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020, 454 pages
ISBN: 9781450368186
DOI: 10.1145/3332466

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 19 February 2020


Author Tags

  1. GPUs
  2. sparse matrix multiplication

Qualifiers

  • Research-article

Conference

PPoPP '20

Acceptance Rates

PPoPP '20 paper acceptance rate: 28 of 121 submissions (23%)
Overall acceptance rate: 230 of 1,014 submissions (23%)

Article Metrics

  • Downloads (last 12 months): 277
  • Downloads (last 6 weeks): 43
Reflects downloads up to 19 Nov 2024

Cited By

  • (2024) Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks. Electronics 13(20), 3981. DOI: 10.3390/electronics13203981. Online: 10 Oct 2024
  • (2024) Workload Scheduling on Heterogeneous Devices. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 1-11. DOI: 10.23919/ISC.2024.10528933. Online: May 2024
  • (2024) A Coordinated Strategy for GNN Combining Computational Graph and Operator Optimizations. Proceedings of the 38th ACM International Conference on Supercomputing, 460-472. DOI: 10.1145/3650200.3661896. Online: 30 May 2024
  • (2024) A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 377-389. DOI: 10.1145/3627535.3638470. Online: 2 Mar 2024
  • (2024) DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 253-267. DOI: 10.1145/3620666.3651378. Online: 27 Apr 2024
  • (2024) FSpGEMM: A Framework for Accelerating Sparse General Matrix–Matrix Multiplication Using Gustavson's Algorithm on FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 32(4), 633-644. DOI: 10.1109/TVLSI.2024.3355499. Online: Apr 2024
  • (2024) Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication. IEEE Transactions on Parallel and Distributed Systems, 1-12. DOI: 10.1109/TPDS.2024.3452478. Online: 2024
  • (2024) Exploiting Tensor Cores in Sparse Matrix-Multivector Multiplication via Block-Sparsity-Aware Clustering. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1181-1183. DOI: 10.1109/IPDPSW63119.2024.00199. Online: 27 May 2024
  • (2024) Secure and efficient general matrix multiplication on cloud using homomorphic encryption. The Journal of Supercomputing 80(18), 26394-26434. DOI: 10.1007/s11227-024-06428-8. Online: 26 Aug 2024
  • (2024) Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. Euro-Par 2024: Parallel Processing, 3-16. DOI: 10.1007/978-3-031-69583-4_1. Online: 26 Aug 2024
