DOI: 10.1145/3332466.3374546

A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs

Published: 19 February 2020

Abstract

SpMM (multiplication of a sparse matrix by a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications. Because of their irregular memory accesses, the two kernels have poor data locality, and data movement overhead is a bottleneck for their performance. To overcome this issue, previous works have proposed tiling and data reorganization to enhance data reuse. Despite their success in improving the performance for many sparse matrices, we find that the efficacy of existing techniques depends largely on how the non-zeros are distributed in a sparse matrix. In this work, we propose a novel row-reordering technique to improve data locality for SpMM and SDDMM on GPUs. The goal of the reordering is to place similar rows close to each other so that they can be processed together, providing better temporal locality for the values of the dense matrix. We focus on performing the row reordering efficiently, using a hierarchical clustering procedure optimized with locality-sensitive hashing. We also investigate when row reordering is useful and which factors correlate with the performance gains it delivers. Experimental evaluation on 1084 sparse matrices from the SuiteSparse collection and the Network Repository shows that our technique achieves up to 2.91x speedup for SpMM and up to 3.19x speedup for SDDMM over state-of-the-art alternatives on an NVIDIA P100 GPU.
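To make the abstract concrete, the sketch below illustrates the two kernels and the reordering idea in plain Python. It is a minimal sketch, not the authors' implementation: the MinHash-based signature bucketing is a generic, single-level stand-in for the paper's LSH-optimized hierarchical clustering, and the function names (spmm, sddmm, lsh_row_order) and their parameters are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the paper's implementation. The MinHash
# scheme and single-level bucketing stand in for the paper's hierarchical
# clustering optimized by locality-sensitive hashing.
import numpy as np
import scipy.sparse as sp

def spmm(A, B):
    """SpMM: sparse A (m x k) times dense B (k x n) -> dense (m x n)."""
    return A @ B

def sddmm(S, U, V):
    """SDDMM: compute (U @ V)[i, j] only where S[i, j] != 0,
    scaling each sampled product by S[i, j]."""
    out = S.tocsr(copy=True).astype(np.float64)
    for i in range(out.shape[0]):
        lo, hi = out.indptr[i], out.indptr[i + 1]
        cols = out.indices[lo:hi]
        # Dense dot products restricted to row i's non-zero columns.
        out.data[lo:hi] *= U[i] @ V[:, cols]
    return out

def lsh_row_order(A, num_hashes=4, seed=0):
    """Return a row permutation placing rows with similar column-index
    sets next to each other. MinHash signatures approximate Jaccard
    similarity of the column sets, so sorting rows by signature groups
    similar rows (a one-level stand-in for hierarchical clustering)."""
    rng = np.random.default_rng(seed)
    p = 2_147_483_647  # large prime for the universal hash family
    a = rng.integers(1, p, size=num_hashes)
    b = rng.integers(0, p, size=num_hashes)
    sigs = np.full((A.shape[0], num_hashes), p, dtype=np.int64)
    for i in range(A.shape[0]):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]].astype(np.int64)
        if cols.size:
            # MinHash: minimum of each universal hash over the column set.
            sigs[i] = ((a[:, None] * cols[None, :] + b[:, None]) % p).min(axis=1)
    # Lexicographic sort on the signatures keeps each bucket contiguous.
    return np.lexsort(sigs.T[::-1])

# Usage: reorder the rows, then run the kernels on the permuted matrix.
A = sp.random(1024, 1024, density=0.01, format='csr', random_state=0)
perm = lsh_row_order(A)
A_perm = A[perm]                 # similar rows are now adjacent
C = spmm(A_perm, np.random.rand(1024, 64))
D = sddmm(A_perm, np.random.rand(1024, 16), np.random.rand(16, 1024))
```

The design intent carries over to the GPU setting described in the abstract: after permuting, rows that touch similar columns of the dense matrix end up adjacent, so consecutive row groups reuse the same dense-matrix values while they are still resident in cache or shared memory.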



Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020, 454 pages
ISBN: 9781450368186
DOI: 10.1145/3332466

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 19 February 2020


Author Tags

  1. GPUs
  2. sparse matrix multiplication

Qualifiers

  • Research-article

Conference

PPoPP '20

Acceptance Rates

PPoPP '20 paper acceptance rate: 28 of 121 submissions (23%)
Overall acceptance rate: 230 of 1,014 submissions (23%)

Article Metrics

  • Downloads (last 12 months): 277
  • Downloads (last 6 weeks): 43
Reflects downloads up to 19 Nov 2024

Cited By

  • (2024) Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks. Electronics 13(20), 3981. DOI: 10.3390/electronics13203981. Online: 10 Oct 2024
  • (2024) Workload Scheduling on Heterogeneous Devices. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 1-11. DOI: 10.23919/ISC.2024.10528933. Online: May 2024
  • (2024) A Coordinated Strategy for GNN Combining Computational Graph and Operator Optimizations. Proceedings of the 38th ACM International Conference on Supercomputing, 460-472. DOI: 10.1145/3650200.3661896. Online: 30 May 2024
  • (2024) A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 377-389. DOI: 10.1145/3627535.3638470. Online: 2 Mar 2024
  • (2024) DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 253-267. DOI: 10.1145/3620666.3651378. Online: 27 Apr 2024
  • (2024) FSpGEMM: A Framework for Accelerating Sparse General Matrix–Matrix Multiplication Using Gustavson's Algorithm on FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 32(4), 633-644. DOI: 10.1109/TVLSI.2024.3355499. Online: Apr 2024
  • (2024) Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication. IEEE Transactions on Parallel and Distributed Systems, 1-12. DOI: 10.1109/TPDS.2024.3452478. Online: 2024
  • (2024) Exploiting Tensor Cores in Sparse Matrix-Multivector Multiplication via Block-Sparsity-Aware Clustering. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1181-1183. DOI: 10.1109/IPDPSW63119.2024.00199. Online: 27 May 2024
  • (2024) Secure and efficient general matrix multiplication on cloud using homomorphic encryption. The Journal of Supercomputing 80(18), 26394-26434. DOI: 10.1007/s11227-024-06428-8. Online: 26 Aug 2024
  • (2024) Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. Euro-Par 2024: Parallel Processing, 3-16. DOI: 10.1007/978-3-031-69583-4_1. Online: 26 Aug 2024
