Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

Published: 01 November 2015

Highlights

A speculative segmented sum strategy for the CSR-based SpMV.
Utilizing both GPU cores and CPU cores in a heterogeneous processor.
No format conversion or tuning overhead for input sparse matrices in the CSR format.
High speedup over the CSR-vector algorithm on irregular matrices.
No performance penalty for most regular matrices.

Abstract

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores have attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU–GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. The CPU part of the same chip is then triggered to re-arrange the predicted partial sums into the correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.
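For readers who want to see the segmented-sum formulation of CSR-based SpMV that the paper builds on, below is a minimal serial sketch in Python. It only illustrates the idea of summing the element-wise products per row segment while handling empty rows; it is not the authors' parallel two-phase GPU/CPU algorithm, and all function and variable names are illustrative rather than taken from the paper.

# Minimal serial sketch of CSR-based SpMV expressed as a segmented sum over
# the element-wise products val[i] * x[col_idx[i]]. Illustrative only; this
# is not the authors' parallel two-phase GPU/CPU algorithm.

def spmv_csr_segmented_sum(row_ptr, col_idx, val, x):
    m = len(row_ptr) - 1                          # number of rows
    products = [val[i] * x[col_idx[i]] for i in range(len(val))]

    # Segment heads: flags[i] is True where a non-empty row starts.
    flags = [False] * len(val)
    for r in range(m):
        if row_ptr[r] < row_ptr[r + 1]:           # skip empty rows
            flags[row_ptr[r]] = True

    # Segmented sum: accumulate products, restarting at each segment head.
    y = [0.0] * m
    r = -1
    for i, p in enumerate(products):
        if flags[i]:
            r += 1
            while row_ptr[r] == row_ptr[r + 1]:   # empty rows stay zero
                r += 1
        y[r] += p
    return y

# Example: a 4x4 matrix with one empty row (row 1).
row_ptr = [0, 2, 2, 4, 5]
col_idx = [0, 2, 1, 3, 0]
val     = [1.0, 2.0, 3.0, 4.0, 5.0]
x       = [1.0, 1.0, 1.0, 1.0]
print(spmv_csr_segmented_sum(row_ptr, col_idx, val, x))

Running the example prints [3.0, 0.0, 7.0, 5.0]; the zero entry corresponds to the empty row, which is exactly the case a purely nonzero-oriented formulation has to treat with care.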

References

[1]
N. Bell, M. Garland, Implementing sparse matrix-vector multiplication on throughput-oriented processors, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, 2009, pp. 18:1–18:11.
[2]
B.-Y. Su, K. Keutzer, clSpMV: A cross-platform OpenCL SpMV framework on GPUs, Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, 2012, pp. 353–364.
[3]
R. Li, Y. Saad, GPU-accelerated preconditioned iterative linear solvers, J Supercomput 63 (2) (2013) 443–466.
[4]
X. Liu, M. Smelyanskiy, E. Chow, P. Dubey, Efficient sparse matrix-vector multiplication on x86-based many-core processors, Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, 2013, pp. 273–282.
[5]
S. Yan, C. Li, Y. Zhang, H. Zhou, yaSpMV: yet another SpMV framework on GPUs, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, 2014, pp. 107–118.
[6]
A. Ashari, N. Sedaghati, J. Eisenlohr, P. Sadayappan, An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs, Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, 2014, pp. 273–282.
[7]
S. Balay, S. Abhyankar, M.F. Adams, J. Brown, P. Brune, K. Buschelman, V. Eijkhout, W.D. Gropp, D. Kaushik, M.G. Knepley, L.C. McInnes, K. Rupp, B.F. Smith, H. Zhang, PETSc Users Manual, Tech. Rep. ANL-95/11 - Revision 3.5, Argonne National Laboratory, 2014.
[8]
V. Minden, B. Smith, M. Knepley, D.A. Yuen, L. Wang, X. Chi, L. Johnsson, W. Ge, Preliminary implementation of PETSc using GPUs, in: Y. Shi (Ed.), GPU Solutions to Multi-scale Problems in Science and Engineering, Lecture Notes in Earth System Sciences, Springer Berlin Heidelberg, 2013, pp. 131–140.
[9]
P. Kumbhar, Performance of PETSc GPU Implementation with Sparse Matrix Storage Schemes, Master’s thesis, The University of Edinburgh, 2011.
[10]
D. Langr, P. Tvrdík, Evaluation Criteria for Sparse Matrix Storage Formats, IEEE Transactions on Parallel and Distributed Systems in press.
[11]
W. Liu, B. Vinter, An efficient GPU general sparse matrix-matrix multiplication for irregular data, Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS ’14, 2014, pp. 370–381.
[12]
W. Liu, B. Vinter, CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication, Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15, 2015.
[13]
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, Optimization of sparse matrix-vector multiplication on emerging multicore platforms, Parallel Comput 35 (3) (2009) 178–194.
[14]
J.L. Greathouse, M. Daga, Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, 2014, pp. 769–780.
[15]
A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, P. Sadayappan, Fast sparse matrix-vector multiplication on GPUs for graph applications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, 2014, pp. 781–792.
[16]
G.E. Blelloch, M.A. Heroux, M. Zagha, Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors, Tech. Rep. CMU-CS-93-173, Carnegie Mellon University, 1993.
[17]
M. Harris, J.D. Owens, S. Sengupta, CUDPP Documentation, second ed., nVidia, 2014.
[18]
S. Baxter, Modern GPU, nVidia, 2013.
[19]
S. Sengupta, M. Harris, Y. Zhang, J.D. Owens, Scan primitives for GPU computing, Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH ’07, 2007, pp. 97–106.
[20]
M. Garland, Sparse matrix computations on many core GPU’s, Proceedings of the 45th Annual Design Automation Conference, DAC ’08, 2008, pp. 2–6.
[21]
R. Kumar, D. Tullsen, N. Jouppi, P. Ranganathan, Heterogeneous chip multiprocessors, Computer 38 (11) (2005) 32–38.
[22]
S. Keckler, W. Dally, B. Khailany, M. Garland, D. Glasco, GPUs and the future of parallel computing, IEEE Micro 31 (5) (2011) 7–17.
[23]
A. Branover, D. Foley, M. Steinman, AMD Fusion APU: Llano, IEEE Micro 32 (2) (2012) 28–37.
[24]
AMD, White Paper: Compute Cores (Jan 2014).
[25]
S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey, S. Sarkar, S. Siers, I. Stolero, A. Subbiah, A 22nm IA multi-CPU and GPU system-on-chip, Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, 2012, pp. 56–57.
[26]
nVidia, NVIDIA Tegra K1: A New Era in Mobile Computing, 1st ed., 2014.
[27]
Qualcomm, Qualcomm Snapdragon 800 Product Brief (Aug 2013).
[28]
E. Chung, P. Milder, J. Hoe, K. Mai, Single-chip heterogeneous computing: does the future include custom logic, FPGAs, and GPGPUs?, Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International Symposium on, 2010, pp. 225–236.
[29]
HSA Foundation, HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG), 0th ed. (May 2013).
[30]
A. Munshi, The OpenCL Specification, 2nd ed., Khronos OpenCL Working Group, 2014.
[31]
D. Negrut, R. Serban, A. Li, A. Seidl, Unified Memory in CUDA 6: A Brief Overview and Related Data Access/Transfer Issues, Tech. Rep. TR-2014-09, University of Wisconsin–Madison, 2014.
[32]
Y.S. Deng, B.D. Wang, S. Mu, Taming irregular EDA applications on GPUs, Proceedings of the 2009 International Conference on Computer-Aided Design, ICCAD ’09, 2009, pp. 539–546.
[33]
C. Gregg, K. Hazelwood, Where is the data? Why you cannot debate CPU vs. GPU performance without the answer, Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, 2011, pp. 134–144.
[34]
Y. Dotsenko, N.K. Govindaraju, P.-P. Sloan, C. Boyd, J. Manferdelli, Fast scan algorithms on graphics processors, Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, 2008, pp. 205–213.
[35]
T.A. Davis, Y. Hu, The University of Florida sparse matrix collection, ACM Trans. Math. Softw. 38 (1) (2011) 1:1–1:25.
[36]
J.-H. Byun, R. Lin, J.W. Demmel, K.A. Yelick, pOSKI: Parallel Optimized Sparse Kernel Interface Library User’s Guide, 1st ed., University of California, Berkeley, 2012.
[37]
R. Vuduc, J.W. Demmel, K.A. Yelick, OSKI: A library of automatically tuned sparse matrix kernels, J Phys: Conference Series 16 (1) (2005) 521–530.
[38]
D. Lukarski, N. Trost, PARALUTION–User Manual, Tech. Rep. 1.0.0, PARALUTION Labs UG (haftungsbeschränkt) Co. KG, 2015.
[39]
M.M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, A compiler framework for optimization of Affine Loop Nests for GPGPUs, Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, 2008, pp. 225–234.
[40]
J. Fang, A. Varbanescu, H. Sips, A comprehensive performance comparison of CUDA and OpenCL, Parallel Processing (ICPP), 2011 International Conference on, 2011, pp. 216–225.
[41]
R. Vuduc, H.-J. Moon, Fast sparse matrix-vector multiplication by exploiting variable block structure, High Performance Computing and Communications, Vol. 3726 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2005, pp. 807–816.
[42]
R.W. Vuduc, Automatic Performance Tuning of Sparse Matrix Kernels, Ph.D. thesis, University of California, Berkeley, 2003.
[43]
A. Buluç, J.T. Fineman, M. Frigo, J.R. Gilbert, C.E. Leiserson, Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks, Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, 2009, pp. 233–244.
[44]
A. Buluç, S. Williams, L. Oliker, J. Demmel, Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication, Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011, pp. 721–733.
[45]
J.W. Choi, A. Singh, R.W. Vuduc, Model-driven autotuning of sparse matrix-vector multiply on GPUs, Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, 2010, pp. 115–126.
[46]
M. Martone, Efficient multithreaded untransposed, transposed or symmetric sparse matrix-vector multiplication with the recursive sparse blocks format, Parallel Comput 40 (7) (2014) 251–270.
[47]
R. Nishtala, R. Vuduc, J. Demmel, K. Yelick, When cache blocking of sparse matrix vector multiply works and why, Applicable Algebra in Engineering, Communication and Computing 18 (3) (2007) 297–311.
[48]
J.C. Pichel, F.F. Rivera, M. Fernández, A. Rodríguez, Optimization of sparse matrix-vector multiplication using reordering techniques on GPUs, Microprocess Microsyst 36 (2) (2012) 65–77.
[49]
M.M. Baskaran, R. Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, Tech. Rep. RC24704, IBM, 2008.
[50]
I. Reguly, M. Giles, Efficient sparse matrix-vector multiplication on cache-based GPUs, Innovative Parallel Computing (InPar), 2012, pp. 1–12.
[51]
N. Zhang, A novel parallel scan for multicore processors and its application in sparse matrix-vector multiplication, IEEE Transactions on Parallel and Distributed Systems 23 (3) (2012) 397–404.
[52]
J. Lee, M. Samadi, Y. Park, S. Mahlke, Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, 2013, pp. 245–256.
[53]
R. Kaleem, R. Barik, T. Shpeisman, B.T. Lewis, C. Hu, K. Pingali, Adaptive heterogeneous scheduling for integrated GPUs, Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, 2014, pp. 151–162.
[54]
J. Shen, J. Fang, A.L. Varbanescu, H. Sips, An application-centric evaluation of OpenCL on multi-core CPUs, Parallel Computing 39 (12) (2013) 834–850.
[55]
J. Shen, A.L. Varbanescu, P. Zou, Y. Lu, H. Sips, Improving performance by matching imbalanced workloads with heterogeneous platforms, Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, 2014, pp. 241–250.
[56]
Y. Liu, B. Schmidt, LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs, Proceedings of the 26th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP ’15, 2015.



          Reviews

          Henk Sips

Sparse matrix-vector multiplication (SpMV) is a key operation in many scientific and graph applications. It is also challenging because of data compression and load balancing issues: data compression is needed to store the matrix elements efficiently, and load balancing is needed to avoid idle processing capacity. Because the sparsity structure varies greatly across applications, static solutions generally do not work very well. Programming languages also offer little support for expressing sparse structures, and compiler optimizations are generally very difficult due to the indirect array accesses.

A completely new challenge is to use graphics processing units (GPUs) for SpMV operations. GPUs are at their very best when all operations take the same amount of time, a static workload partition is applied, and little synchronization is needed; all of this seems contradictory to the requirements of SpMV. One solution is to adapt the data compression scheme so that workload partitioning becomes easier. When many iterations are needed, this might be an applicable solution; however, if only a few iterations per cycle are done, the conversion might not be worth the effort. The widely applied data compression technique is the compressed sparse row (CSR) format. CSR does not naturally reveal load balancing information, so adapted formats have been used, but not without flaws; for example, they are not able to detect empty rows.

The authors follow a different approach and make use of the intrinsic differences between central processing units (CPUs) and GPUs; that is, they propose a heterogeneous solution. The core of their approach is to first speculatively execute the SpMV operation on the GPU and then rearrange the resulting vectors on the CPU, because CPUs are better suited than GPUs to the final irregular accesses and global synchronization. The algorithm is given in the appendix of the paper and is very clearly illustrated by a simple sparse SpMV example.

The paper contains an extensive experimental section, using 20 sparse matrix structures from the University of Florida Sparse Matrix Collection and three CPU-GPU platforms from Intel, NVIDIA, and AMD, respectively. Speedups of 2.6 on average, with no obvious losses for dense systems, are reported. What remains is the setting of four tuning parameters. The W parameter (workload per thread) in particular can be critical: if it is chosen too high, performance may degrade. The authors do not provide users with a tuning recipe to prevent this; users have to tune their application themselves.

Last but not least, the heterogeneous processing operates in alternating mode, meaning the GPUs and CPUs are not used at the same time. The authors hint in their conclusion that this can be further investigated. Doing so requires a workload model to partition the work over the CPUs and GPUs (for example, see reference 55 in the paper); given the irregular nature of the memory accesses, this is a true challenge. Overall, I liked the paper very much. It is a very valuable contribution to the field of sparse matrix computation. The experiments are extensive and convincing, and I am looking forward to a total heterogeneous solution.

Online Computing Reviews Service
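To make the reviewer's point about load balancing and empty rows concrete, here is a small illustrative sketch (the matrix, worker count, and names are invented for illustration and are not taken from the paper): when the CSR nonzeros are split evenly among workers, the row that owns a tile boundary must be recovered from row_ptr by a search, and empty rows appear only as repeated row_ptr entries, never in the nonzero stream itself.

# Illustrative sketch of the load-balancing issue described in the review:
# an even split of the nonzeros gives each worker equal work, but the row
# containing a tile boundary must be looked up in row_ptr, and empty rows
# (repeated row_ptr entries) are invisible in the nonzero stream.

from bisect import bisect_right

row_ptr = [0, 2, 2, 4, 5]               # 4 rows; row 1 is empty (row_ptr[1] == row_ptr[2])
nnz = row_ptr[-1]
workers = 2
tile = (nnz + workers - 1) // workers   # even split of nonzeros, rounded up

for w in range(workers):
    start, stop = w * tile, min((w + 1) * tile, nnz)
    # The row containing this worker's first nonzero is found by searching row_ptr.
    first_row = bisect_right(row_ptr, start) - 1
    print(f"worker {w}: nonzeros [{start}, {stop}) start in row {first_row}")

# Output:
#   worker 0: nonzeros [0, 3) start in row 0
#   worker 1: nonzeros [3, 5) start in row 2
# The empty row (row 1) never shows up in either worker's range of nonzeros,
# so a purely nonzero-based partition has to patch such rows up afterwards.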


          Information & Contributors

          Information

          Published In

          Parallel Computing  Volume 49, Issue C
          Nov 2015
          193 pages

          Publisher

          Elsevier Science Publishers B. V.

          Netherlands

          Publication History

          Published: 01 November 2015

          Author Tags

          1. Sparse matrices
          2. Sparse matrix-vector multiplication
          3. Compressed sparse row
          4. Speculative execution
          5. Segmented sum
          6. Heterogeneous processors

          Qualifiers

          • Research-article


          Cited By

• AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs, ACM Transactions on Architecture and Code Optimization 21 (4) (2024) 1–25. https://doi.org/10.1145/3674911
• Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication, ACM Transactions on Architecture and Code Optimization 21 (2) (2024) 1–24. https://doi.org/10.1145/3653020
• DyLaClass: Dynamic Labeling Based Classification for Optimal Sparse Matrix Format Selection in Accelerating SpMV, IEEE Transactions on Parallel and Distributed Systems 35 (12) (2024) 2624–2639. https://doi.org/10.1109/TPDS.2024.3488053
• A Conflict-aware Divide-and-Conquer Algorithm for Symmetric Sparse Matrix-Vector Multiplication, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (2024) 1–15. https://doi.org/10.1109/SC41406.2024.00054
• Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture, Parallel Computing 120 (2024). https://doi.org/10.1016/j.parco.2024.103080
• A Survey of Accelerating Parallel Sparse Linear Algebra, ACM Computing Surveys 56 (1) (2023) 1–38. https://doi.org/10.1145/3604606
• Efficiently Running SpMV on Multi-core DSPs for Banded Matrix, Algorithms and Architectures for Parallel Processing (2023) 201–220. https://doi.org/10.1007/978-981-97-0808-6_12
• A Pattern-Based SpGEMM Library for Multi-Core and Many-Core Architectures, IEEE Transactions on Parallel and Distributed Systems 33 (1) (2022) 159–175. https://doi.org/10.1109/TPDS.2021.3090328
• Selecting optimal SpMV realizations for GPUs via machine learning, International Journal of High Performance Computing Applications 35 (3) (2021) 254–267. https://doi.org/10.1177/1094342021990738
• YuenyeungSpTRSV: A Thread-Level and Warp-Level Fusion Synchronization-Free Sparse Triangular Solve, IEEE Transactions on Parallel and Distributed Systems 32 (9) (2021) 2321–2337. https://doi.org/10.1109/TPDS.2021.3066635
