Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

Published: 01 November 2015

Highlights

A speculative segmented sum strategy for the CSR-based SpMV.
Utilizing both GPU cores and CPU cores in a heterogeneous processor.
No format conversion or tuning overhead for input sparse matrices in the CSR format.
High speedup over the CSR-vector algorithm on irregular matrices.
No performance penalty for most regular matrices.

Abstract

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores have attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU–GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. The CPU part of the same chip is then triggered to re-arrange the predicted partial sums into the correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.
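For readers who want to see the segmented-sum formulation of CSR-based SpMV that the paper builds on, below is a minimal serial sketch in Python. It only illustrates the idea of summing the element-wise products per row segment while handling empty rows; it is not the authors' parallel two-phase GPU/CPU algorithm, and all function and variable names are illustrative rather than taken from the paper.

# Minimal serial sketch of CSR-based SpMV expressed as a segmented sum over
# the element-wise products val[i] * x[col_idx[i]]. Illustrative only; this
# is not the authors' parallel two-phase GPU/CPU algorithm.

def spmv_csr_segmented_sum(row_ptr, col_idx, val, x):
    m = len(row_ptr) - 1                          # number of rows
    products = [val[i] * x[col_idx[i]] for i in range(len(val))]

    # Segment heads: flags[i] is True where a non-empty row starts.
    flags = [False] * len(val)
    for r in range(m):
        if row_ptr[r] < row_ptr[r + 1]:           # skip empty rows
            flags[row_ptr[r]] = True

    # Segmented sum: accumulate products, restarting at each segment head.
    y = [0.0] * m
    r = -1
    for i, p in enumerate(products):
        if flags[i]:
            r += 1
            while row_ptr[r] == row_ptr[r + 1]:   # empty rows stay zero
                r += 1
        y[r] += p
    return y

# Example: a 4x4 matrix with one empty row (row 1).
row_ptr = [0, 2, 2, 4, 5]
col_idx = [0, 2, 1, 3, 0]
val     = [1.0, 2.0, 3.0, 4.0, 5.0]
x       = [1.0, 1.0, 1.0, 1.0]
print(spmv_csr_segmented_sum(row_ptr, col_idx, val, x))

Running the example prints [3.0, 0.0, 7.0, 5.0]; the zero entry corresponds to the empty row, which is exactly the case a purely nonzero-oriented formulation has to treat with care.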

References

[1]
N. Bell, M. Garland, Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, 2009, pp. 18:1–18:11.
[2]
B.-Y. Su, K. Keutzer, clSpMV: A cross-platform OpenCL SpMV framework on GPUs, Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, 2012, pp. 353–364.
[3]
R. Li, Y. Saad, GPU-accelerated preconditioned iterative linear solvers, J Supercomput 63 (2) (2013) 443–466.
[4]
X. Liu, M. Smelyanskiy, E. Chow, P. Dubey, Efficient sparse matrix-vector multiplication on x86-based many-core processors, Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, 2013, pp. 273–282.
[5]
S. Yan, C. Li, Y. Zhang, H. Zhou, yaSpMV: yet another SpMV framework on GPUs, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, 2014, pp. 107–118.
[6]
A. Ashari, N. Sedaghati, J. Eisenlohr, P. Sadayappan, An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs, Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, 2014, pp. 273–282.
[7]
S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L.C. McInnes, K. Rupp, B.F. Smith, H. Zhang, PETSc Users Manual, Tech. Rep. ANL-95/11 - Revision 3.5, Argonne National Laboratory, 2014.
[8]
V. Minden, B. Smith, M. Knepley, D.A. Yuen, L. Wang, X. Chi, L. Johnsson, W. Ge, Preliminary implementation of PETSc using GPUs, in: Y. Shi (Ed.), GPU Solutions to Multi-scale Problems in Science and Engineering, Lecture Notes in Earth System Sciences, Springer Berlin Heidelberg, 2013, pp. 131–140.
[9]
P. Kumbhar, Performance of PETSc GPU Implementation with Sparse Matrix Storage Schemes, Master’s thesis, The University of Edinburgh, 2011.
[10]
D. Langr, P. Tvrdík, Evaluation Criteria for Sparse Matrix Storage Formats, IEEE Transactions on Parallel and Distributed Systems in press.
[11]
W. Liu, B. Vinter, An efficient GPU general sparse matrix-matrix multiplication for irregular data, Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS ’14, 2014, pp. 370–381.
[12]
W. Liu, B. Vinter, CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication, Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15, 2015.
[13]
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, Optimization of sparse matrix-vector multiplication on emerging multicore platforms, Parallel Comput 35 (3) (2009) 178–194.
[14]
J.L. Greathouse, M. Daga, Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, 2014, pp. 769–780.
[15]
A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, P. Sadayappan, Fast sparse matrix-vector multiplication on GPUs for graph applications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, 2014, pp. 781–792.
[16]
G.E. Blelloch, M.A. Heroux, M. Zagha, Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors, Tech. Rep. CMU-CS-93-173, Carnegie Mellon University, 1993.
[17]
M. Harris, J.D. Owens, S. Sengupta, CUDPP Documentation, second ed., nVidia, 2014.
[18]
S. Baxter, Modern GPU, nVidia, 2013.
[19]
S. Sengupta, M. Harris, Y. Zhang, J.D. Owens, Scan primitives for GPU computing, Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH ’07, 2007, pp. 97–106.
[20]
M. Garland, Sparse matrix computations on many core GPU’s, Proceedings of the 45th Annual Design Automation Conference, DAC ’08, 2008, pp. 2–6.
[21]
R. Kumar, D. Tullsen, N. Jouppi, P. Ranganathan, Heterogeneous chip multiprocessors, Computer 38 (11) (2005) 32–38.
[22]
S. Keckler, W. Dally, B. Khailany, M. Garland, D. Glasco, GPUs and the future of parallel computing, IEEE Micro 31 (5) (2011) 7–17.
[23]
A. Branover, D. Foley, M. Steinman, AMD Fusion APU: Llano, IEEE Micro 32 (2) (2012) 28–37.
[24]
AMD, White Paper: Compute Cores (Jan 2014).
[25]
S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey, S. Sarkar, S. Siers, I. Stolero, A. Subbiah, A 22nm IA multi-CPU and GPU system-on-chip, Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, 2012, pp. 56–57.
[26]
nVidia, NVIDIA Tegra K1: A New Era in Mobile Computing, 1st ed., 2014.
[27]
Qualcomm, Qualcomm Snapdragon 800 Product Brief (Aug 2013).
[28]
E. Chung, P. Milder, J. Hoe, K. Mai, Single-chip heterogeneous computing: does the future include custom logic, FPGAs, and GPGPUs?, Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International Symposium on, 2010, pp. 225–236.
[29]
HSA Foundation, HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG), 0th ed. (May 2013).
[30]
A. Munshi, The OpenCL Specification, 2nd ed., Khronos OpenCL Working Group, 2014.
[31]
D. Negrut, R. Serban, A. Li, A. Seidl, Unified Memory in CUDA 6: A Brief Overview and Related Data Access/Transfer Issues, Tech. Rep. TR-2014-09, University of Wisconsin–Madison, 2014.
[32]
Y.S. Deng, B.D. Wang, S. Mu, Taming irregular EDA applications on GPUs, Proceedings of the 2009 International Conference on Computer-Aided Design, ICCAD ’09, 2009, pp. 539–546.
[33]
C. Gregg, K. Hazelwood, Where is the data? Why you cannot debate CPU vs. GPU performance without the answer, Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, 2011, pp. 134–144.
[34]
Y. Dotsenko, N.K. Govindaraju, P.-P. Sloan, C. Boyd, J. Manferdelli, Fast scan algorithms on graphics processors, Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, 2008, pp. 205–213.
[35]
T.A. Davis, Y. Hu, The University of Florida sparse matrix collection, ACM Trans. Math. Softw. 38 (1) (2011) 1:1–1:25.
[36]
J.-H. Byun, R. Lin, J.W. Demmel, K.A. Yelick, pOSKI: Parallel Optimized Sparse Kernel Interface Library User’s Guide, 1st ed., University of California, Berkeley, 2012.
[37]
R. Vuduc, J.W. Demmel, K.A. Yelick, OSKI: A library of automatically tuned sparse matrix kernels, J Phys: Conference Series 16 (1) (2005) 521–530.
[38]
D. Lukarski, N. Trost, PARALUTION–User Manual, Tech. Rep. 1.0.0, PARALUTION Labs UG (haftungsbeschränkt) Co. KG, 2015.
[39]
M.M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, A compiler framework for optimization of Affine Loop Nests for GPGPUs, Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, 2008, pp. 225–234.
[40]
J. Fang, A. Varbanescu, H. Sips, A comprehensive performance comparison of CUDA and OpenCL, Parallel Processing (ICPP), 2011 International Conference on, 2011, pp. 216–225.
[41]
R. Vuduc, H.-J. Moon, Fast sparse matrix-vector multiplication by exploiting variable block structure, High Performance Computing and Communications, Vol. 3726 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2005, pp. 807–816.
[42]
R.W. Vuduc, Automatic Performance Tuning of Sparse Matrix Kernels, Ph.D. thesis, University of California, Berkeley, 2003.
[43]
A. Buluç, J.T. Fineman, M. Frigo, J.R. Gilbert, C.E. Leiserson, Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks, Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, 2009, pp. 233–244.
[44]
A. Buluç, S. Williams, L. Oliker, J. Demmel, Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication, Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011, pp. 721–733.
[45]
J.W. Choi, A. Singh, R.W. Vuduc, Model-driven autotuning of sparse matrix-vector multiply on GPUs, Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, 2010, pp. 115–126.
[46]
M. Martone, Efficient multithreaded untransposed, transposed or symmetric sparse matrix-vector multiplication with the recursive sparse blocks format, Parallel Comput 40 (7) (2014) 251–270.
[47]
R. Nishtala, R. Vuduc, J. Demmel, K. Yelick, When cache blocking of sparse matrix vector multiply works and why, Applicable Algebra in Engineering, Communication and Computing 18 (3) (2007) 297–311.
[48]
J.C. Pichel, F.F. Rivera, M. Fernández, A. Rodríguez, Optimization of sparse matrix-vector multiplication using reordering techniques on GPUs, Microprocess Microsyst 36 (2) (2012) 65–77.
[49]
M.M. Baskaran, R. Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, Tech. Rep. RC24704, IBM, 2008.
[50]
I. Reguly, M. Giles, Efficient sparse matrix-vector multiplication on cache-based GPUs, Innovative Parallel Computing (InPar), 2012, pp. 1–12.
[51]
N. Zhang, A novel parallel scan for multicore processors and its application in sparse matrix-vector multiplication, IEEE Transactions on Parallel and Distributed Systems 23 (3) (2012) 397–404.
[52]
J. Lee, M. Samadi, Y. Park, S. Mahlke, Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, 2013, pp. 245–256.
[53]
R. Kaleem, R. Barik, T. Shpeisman, B.T. Lewis, C. Hu, K. Pingali, Adaptive heterogeneous scheduling for integrated GPUs, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, 2014, pp. 151–162.
[54]
J. Shen, J. Fang, A.L. Varbanescu, H. Sips, An application-centric evaluation of OpenCL on multi-core CPUs, Parallel Computing 39 (12) (2013) 834–850.
[55]
J. Shen, A.L. Varbanescu, P. Zou, Y. Lu, H. Sips, Improving performance by matching imbalanced workloads with heterogeneous platforms, Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, 2014, pp. 241–250.
[56]
Y. Liu, B. Schmidt, LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs, Proceedings of the 26th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP ’15, 2015.



          Reviews

          Henk Sips

Sparse matrix-vector multiplication (SpMV) is a key operation in many scientific and graph applications. It is also challenging because of data compression and load balancing issues: data compression is needed to store the matrix elements efficiently, and load balancing is needed to avoid idle processing capacity. Because the sparsity structure varies greatly across applications, static solutions generally do not work very well. Programming languages also offer little support for expressing sparse structures, and compiler optimizations are generally very difficult due to the indirect array accesses.

A completely new challenge is to use graphics processing units (GPUs) for SpMV operations. GPUs are at their very best when all operations take the same amount of time, a static workload partition is applied, and little synchronization is needed; all of this seems contradictory to the requirements of SpMV. One solution is to adapt the data compression scheme so that workload partitioning becomes easier. When many iterations are needed, this might be an applicable solution; however, if only a few iterations per cycle are done, the conversion might not be worth the effort. The widely applied data compression technique is the compressed sparse row (CSR) format. CSR does not naturally reveal load balancing information, so adapted formats have been used, but not without flaws; for example, they are not able to detect empty rows.

The authors follow a different approach and make use of the intrinsic differences between central processing units (CPUs) and GPUs; that is, they propose a heterogeneous solution. The core of their approach is to first speculatively execute the SpMV operation on the GPU and then rearrange the resulting vectors on the CPU, because CPUs are better suited than GPUs to the final irregular accesses and global synchronization. The algorithm is given in the appendix of the paper and is very clearly illustrated by a simple sparse SpMV example.

The paper contains an extensive experimental section, using 20 sparse matrix structures from the University of Florida Sparse Matrix Collection and three CPU-GPU platforms from Intel, NVIDIA, and AMD, respectively. Speedups of 2.6 on average, with no obvious losses for dense systems, are reported. What remains is the setting of four tuning parameters. The W parameter (workload per thread) in particular can be critical: if it is chosen too high, performance may degrade. The authors do not provide users with a tuning recipe to prevent this; users have to tune their application themselves.

Last but not least, the heterogeneous processing operates in alternating mode, meaning the GPUs and CPUs are not used at the same time. The authors hint in their conclusion that this can be further investigated. Doing so requires a workload model to partition the work over the CPUs and GPUs (for example, see reference 55 in the paper); given the irregular nature of the memory accesses, this is a true challenge. Overall, I liked the paper very much. It is a very valuable contribution to the field of sparse matrix computation. The experiments are extensive and convincing, and I am looking forward to a total heterogeneous solution.

Online Computing Reviews Service
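To make the reviewer's point about load balancing and empty rows concrete, here is a small illustrative sketch (the matrix, worker count, and names are invented for illustration and are not taken from the paper): when the CSR nonzeros are split evenly among workers, the row that owns a tile boundary must be recovered from row_ptr by a search, and empty rows appear only as repeated row_ptr entries, never in the nonzero stream itself.

# Illustrative sketch of the load-balancing issue described in the review:
# an even split of the nonzeros gives each worker equal work, but the row
# containing a tile boundary must be looked up in row_ptr, and empty rows
# (repeated row_ptr entries) are invisible in the nonzero stream.

from bisect import bisect_right

row_ptr = [0, 2, 2, 4, 5]               # 4 rows; row 1 is empty (row_ptr[1] == row_ptr[2])
nnz = row_ptr[-1]
workers = 2
tile = (nnz + workers - 1) // workers   # even split of nonzeros, rounded up

for w in range(workers):
    start, stop = w * tile, min((w + 1) * tile, nnz)
    # The row containing this worker's first nonzero is found by searching row_ptr.
    first_row = bisect_right(row_ptr, start) - 1
    print(f"worker {w}: nonzeros [{start}, {stop}) start in row {first_row}")

# Output:
#   worker 0: nonzeros [0, 3) start in row 0
#   worker 1: nonzeros [3, 5) start in row 2
# The empty row (row 1) never shows up in either worker's range of nonzeros,
# so a purely nonzero-based partition has to patch such rows up afterwards.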


          Information & Contributors

          Information

          Published In

          Parallel Computing  Volume 49, Issue C
          Nov 2015
          193 pages

          Publisher

          Elsevier Science Publishers B. V.

          Netherlands

          Publication History

          Published: 01 November 2015

          Author Tags

          1. Sparse matrices
          2. Sparse matrix-vector multiplication
          3. Compressed sparse row
          4. Speculative execution
          5. Segmented sum
          6. Heterogeneous processors

          Qualifiers

          • Research-article


          Cited By

• AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs, ACM Transactions on Architecture and Code Optimization 21 (4) (2024) 1–25. https://doi.org/10.1145/3674911
• Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication, ACM Transactions on Architecture and Code Optimization 21 (2) (2024) 1–24. https://doi.org/10.1145/3653020
• DyLaClass: Dynamic Labeling Based Classification for Optimal Sparse Matrix Format Selection in Accelerating SpMV, IEEE Transactions on Parallel and Distributed Systems 35 (12) (2024) 2624–2639. https://doi.org/10.1109/TPDS.2024.3488053
• A Conflict-aware Divide-and-Conquer Algorithm for Symmetric Sparse Matrix-Vector Multiplication, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (2024) 1–15. https://doi.org/10.1109/SC41406.2024.00054
• Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture, Parallel Computing 120 (2024). https://doi.org/10.1016/j.parco.2024.103080
• A Survey of Accelerating Parallel Sparse Linear Algebra, ACM Computing Surveys 56 (1) (2023) 1–38. https://doi.org/10.1145/3604606
• Efficiently Running SpMV on Multi-core DSPs for Banded Matrix, Algorithms and Architectures for Parallel Processing (2023) 201–220. https://doi.org/10.1007/978-981-97-0808-6_12
• A Pattern-Based SpGEMM Library for Multi-Core and Many-Core Architectures, IEEE Transactions on Parallel and Distributed Systems 33 (1) (2022) 159–175. https://doi.org/10.1109/TPDS.2021.3090328
• Selecting optimal SpMV realizations for GPUs via machine learning, International Journal of High Performance Computing Applications 35 (3) (2021) 254–267. https://doi.org/10.1177/1094342021990738
• YuenyeungSpTRSV: A Thread-Level and Warp-Level Fusion Synchronization-Free Sparse Triangular Solve, IEEE Transactions on Parallel and Distributed Systems 32 (9) (2021) 2321–2337. https://doi.org/10.1109/TPDS.2021.3066635
