research-article

Merge-based parallel sparse matrix-vector multiplication

Authors:

Michael GarlandAuthors Info & Claims

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 58, Pages 1 - 12

Published: 13 November 2016 Publication History

Abstract

We present a strictly balanced method for the parallel computation of sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format without preprocessing, inspection, reformatting, or supplemental encoding. Regardless of nonzero structure, our equitable 2D merge-based decomposition tightly bounds the workload assigned to each processing element. Furthermore, our technique is suitable for recursively partitioning CSR datasets themselves into multi-scale, distributed, NUMA, and GPU environments that are constrained by fixed-size local memories.

We evaluate our method on both CPU and GPU microarchitectures across a very large corpus of diverse sparse matrix datasets. We show that traditional CsrMV methods are inconsistent performers, often subject to order-of-magnitude performance variation across similarly-sized datasets. In comparison, our method provides predictable performance that is substantially uncorrelated to the distribution of nonzeros among rows and broadly improves upon that of current CsrMV methods.

References

[1]

J. Kepner, D. A. Bader, A. Buluç, J. R. Gilbert, T. G. Mattson, and H. Meyerhenke, "Graphs, Matrices, and the GraphBLAS: Seven Good Reasons," CoRR, vol. abs/1504.01039, 2015.

[2]

J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics, 2011.

Digital Library

[3]

M. Mohri, "Semiring Frameworks and Algorithms for Shortest-distance Problems," J. Autom. Lang. Comb., vol. 7, no. 3, pp. 321--350, Jan. 2002.

[4]

R. W. Vuduc, "Automatic Performance Tuning of Sparse Matrix Kernels," Ph.D. Dissertation, University of California, Berkeley, 2003.

Digital Library

[5]

S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, and R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators," in Scientific Computing with Multicore and Accelerators, J. Kurzak, D. A. Bader, and J. Dongarra, Eds. Taylor & Francis, 2011, pp. 83--109.

[6]

S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo, "Sparse Matrix-Vector Multiplication on GPGPUs," ACM Trans. Math. Softw., To Appear.

[7]

N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, New York, NY, USA, 2009, pp. 18:1--18:11.

Digital Library

[8]

J. L. Greathouse and M. Daga, "Efficient Sparse Matrix-vector Multiplication on GPUs Using the csr Storage Format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Piscataway, NJ, USA, 2014, pp. 769--780.

Digital Library

[9]

A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan, "Fast Sparse Matrix-vector Multiplication on GPUs for Graph Applications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Piscataway, NJ, USA, 2014, pp. 781--792.

Digital Library

[10]

Intel Math Kernel Library (MKL) v11.3. Intel Corporation, 2015.

[11]

NVIDIA cuSPARSE v7.5. NVIDIA Corporation, 2013.

[12]

S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk, "Merge path - Parallel Merging Made Simple," in Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Washington, DC, USA, 2012, pp. 1611--1618.

Digital Library

[13]

S. Baxter and D. Merrill, "Efficient Merge, Search, and Set Operations on GPUs," Mar-2013. {Online}. Available: http://on-demand.gputechconf.com/gtc/2013/presentations/S3414-Efficient-Merge-Search-Set-Operations.pdf.

[14]

N. Deo, A. Jain, and M. Medidi, "An optimal parallel algorithm for merging using multiselection," Inf. Process. Lett., vol. 50, no. 2, pp. 81--87, Apr. 1994.

Digital Library

[15]

A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks," in Proc. SPAA, Calgary, Canada, 2009.

Digital Library

[16]

A. Jain, "pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures," Master's Thesis, University of California at Berkeley, 2008.

[17]

S. Yan, C. Li, Y. Zhang, and H. Zhou, "yaSpMV: Yet Another SpMV Framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 2014, pp. 107--118.

Digital Library

[18]

G. E. Blelloch, M. A. Heroux, and M. Zagha, "Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors," School of Computer Science, Carnegie Mellon University, CMU-CS-93--173, Aug. 1993.

Digital Library

[19]

D. Merrill, "Allocation-oriented Algorithm Design with Application to GPU Computing," Ph.D. Dissertation, University of Virginia, 2011.

Digital Library

[20]

S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, "Scan primitives for GPU computing," in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, Aire-la-Ville, Switzerland, Switzerland, 2007, pp. 97--106.

Digital Library

[21]

N. Bell and M. Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," NVIDIA Corporation, NVIDIA Technical Report NVR-2008--004, Dec. 2008.

[22]

A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures," in High Performance Embedded Architectures and Compilers, vol. 5952, Y. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds. Springer Berlin Heidelberg, 2010, pp. 111--125.

Digital Library

[23]

W. Tang, W. Tan, R. Goh, S. Turner, and W. Wong, "A Family of Bit-Representation-Optimized Formats for Fast Sparse Matrix-Vector Multiplication on the GPU," Parallel and Distributed Systems, IEEE Transactions on, vol. pp, no. 99, pp. 1--1, 2014.

[24]

W. T. Tang, W. J. Tan, R. Ray, Y. W. Wong, W. Chen, S. Kuo, R. S. M. Goh, S. J. Turner, and W.-F. Wong, "Accelerating Sparse Matrix-vector Multiplication on GPUs Using Bit-representation-optimized Schemes," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, New York, NY, USA, 2013, pp. 26:1--26:12.

Digital Library

[25]

B.-Y. Su and K. Keutzer, "clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs," in Proceedings of the 26th ACM International Conference on Supercomputing, New York, NY, USA, 2012, pp. 353--364.

Digital Library

[26]

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, "Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms," in Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2007, pp. 38:1--38:12.

Digital Library

[27]

S. Dalton, S. Baxter, D. Merrill, L. Olson, and M. Garland, "Optimizing Sparse Matrix Operations on GPUs Using Merge Path," in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, 2015, pp. 407--416.

Digital Library

[28]

S. Baxter, Modern GPU. NVIDIA Research, 2013.

[29]

A. Buluc and J. R. Gilbert, "On the representation and multiplication of hypersparse matrices," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, 2008, pp. 1--11.

[30]

D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A Recursive Model for Graph Mining," in SIAM International Conference on Data Mining, 2004.

[31]

T. Davis and Y. Hu, "University of Florida Sparse Matrix Collection." {Online}. Available: http://www.cise.ufl.edu/research/sparse/matrices/. {Accessed: 11-Jul-2011}.

[32]

NVIDIA, "CUDA." {Online}. Available: http://www.nvidia.com/object/cuda_home_new.html. {Accessed: 25-Aug-2011}.

[33]

D. Merrill, CUB v1.5.3: CUDA Unbound, a library of warp-wide, block-wide, and device-wide GPU parallel primitives. NVIDIA Research, 2015.

[34]

J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19--25, Dec. 1995.

[35]

X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, "Efficient Sparse Matrix-vector Multiplication on x86-based Many-core Processors," in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, New York, NY, USA, 2013, pp. 273--282.

Digital Library

Cited By

Hwang SBaek DPark JHuh J(2024)Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector MultiplicationACM Transactions on Architecture and Code Optimization10.1145/365302021:2(1-24)Online publication date: 21-May-2024
https://dl.acm.org/doi/10.1145/3653020
Jiang JHuang JBian H(2023)GTLB:A Load-Balanced SpMV Computation Method on GPUProceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications10.1145/3606043.3606057(101-107)Online publication date: 17-Jun-2023
https://dl.acm.org/doi/10.1145/3606043.3606057
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU ApplicationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593705(167-179)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593705
Show More Cited By

Recommendations

Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for ...
Merge-based sparse matrix-vector multiplication (SpMV) using the CSR storage format
PPoPP '16

We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for ...
Vectorized Parallel Sparse Matrix-Vector Multiplication in PETSc Using AVX-512
ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

Emerging many-core CPU architectures with high degrees of single-instruction, multiple data (SIMD) parallelism promise to enable increasingly ambitious simulations based on partial differential equations (PDEs) via extreme-scale computing. However, such ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2016

1034 pages

ISBN:9781467388153

Conference Chair:
John West
University of Texas at Austin

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing

Publisher

IEEE Press

Publication History

Published: 13 November 2016

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SC16

Sponsor:

SIGARCH
IEEE-CS

SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 13 - 18, 2016

Utah, Salt Lake City

Acceptance Rates

SC '16 Paper Acceptance Rate 81 of 442 submissions, 18%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

17
Total Citations
View Citations
824
Total Downloads

Downloads (Last 12 months)17
Downloads (Last 6 weeks)1

Reflects downloads up to 15 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Hwang SBaek DPark JHuh J(2024)Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector MultiplicationACM Transactions on Architecture and Code Optimization10.1145/365302021:2(1-24)Online publication date: 21-May-2024
https://dl.acm.org/doi/10.1145/3653020
Jiang JHuang JBian H(2023)GTLB:A Load-Balanced SpMV Computation Method on GPUProceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications10.1145/3606043.3606057(101-107)Online publication date: 17-Jun-2023
https://dl.acm.org/doi/10.1145/3606043.3606057
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU ApplicationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593705(167-179)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593705
Anzt HCojean TYen-Chen CDongarra JFlegar GNayak PTomov STsai YWang W(2020)Load-balancing Sparse Matrix Vector Product Kernels on GPUsACM Transactions on Parallel Computing10.1145/33809307:1(1-26)Online publication date: 29-Mar-2020
https://dl.acm.org/doi/10.1145/3380930
Boukaram WTurkiyyah GKeyes D(2019)Hierarchical Matrix Operations on GPUsACM Transactions on Mathematical Software10.1145/323285045:1(1-28)Online publication date: 24-Feb-2019
https://dl.acm.org/doi/10.1145/3232850
Cheshmi KKamil SStrout MDehnavi M(2018)ParSyProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.5555/3291656.3291739(1-15)Online publication date: 11-Nov-2018
https://dl.acm.org/doi/10.5555/3291656.3291739
Li JSun JVuduc R(2018)HiCOOProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.5555/3291656.3291682(1-15)Online publication date: 11-Nov-2018
https://dl.acm.org/doi/10.5555/3291656.3291682
Hu YLiu HHuang H(2018)TriCoreProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.5555/3291656.3291675(1-12)Online publication date: 11-Nov-2018
https://dl.acm.org/doi/10.5555/3291656.3291675
Benatia AJi WWang YShi F(2018)BestSFACM Transactions on Architecture and Code Optimization10.1145/322622815:3(1-27)Online publication date: 4-Sep-2018
https://dl.acm.org/doi/10.1145/3226228
Yang CBuluç AOwens J(2018)Implementing Push-Pull Efficiently in GraphBLASProceedings of the 47th International Conference on Parallel Processing10.1145/3225058.3225122(1-11)Online publication date: 13-Aug-2018
https://dl.acm.org/doi/10.1145/3225058.3225122
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten