DOI: 10.1145/2712386.2712387
Research article · Public Access
Energy efficiency and performance frontiers for sparse computations on GPU supercomputers

Published: 07 February 2015

Abstract

In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigensolver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a sparse matrix-matrix product (SpMM), which achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.
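The blocking technique the abstract describes, fusing multiple memory-bound SpMVs into a single SpMM, can be sketched in a few lines of SciPy. This is a minimal sketch, not the paper's GPU kernels: the matrix size, sparsity density, and block width k below are illustrative assumptions. The point is that applying A to a block of k vectors at once computes the same result as k separate SpMVs, while streaming the matrix from memory only once instead of k times.

```python
import numpy as np
import scipy.sparse as sp

n, k = 1000, 8  # matrix dimension and block width (illustrative values)
rng = np.random.default_rng(0)

# Random sparse test matrix in CSR format; the paper's matrices come
# from real applications, this one is purely for demonstration.
A = sp.random(n, n, density=0.01, format="csr", random_state=0)
X = rng.standard_normal((n, k))

# k separate memory-bound SpMVs: A is read from memory k times.
Y_spmv = np.column_stack([A @ X[:, j] for j in range(k)])

# One blocked SpMM: A is read once and each nonzero is reused k times,
# which is the source of the higher throughput reported in the paper.
Y_spmm = A @ X

# Both formulations produce the same result up to rounding error.
assert np.allclose(Y_spmv, Y_spmm)
```

On hardware, the win comes from data reuse: SpMV is bandwidth-bound, so amortizing each matrix read over k right-hand sides raises arithmetic intensity, which is what enables the up-to-6x SpMM speedup reported above.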




Published In

PMAM '15: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores
February 2015, 186 pages
ISBN: 9781450334044
DOI: 10.1145/2712386

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. GPU supercomputer
    2. LOBPCG
    3. blocked sparse matrix vector product
    4. energy efficiency
    5. sparse eigensolver


Conference

PPoPP '15

Acceptance Rates

PMAM '15: 19 of 34 submissions accepted, 56%
Overall: 53 of 97 submissions accepted, 55%


Cited By

    • (2021) ALBUS: a method for efficiently processing SpMV using SIMD and load balancing. Future Generation Computer Systems, vol. 116, pp. 371–392, Mar. 2021. DOI: 10.1016/j.future.2020.10.036
    • (2019) Adaptive sparse matrix-vector multiplication on CPU-GPU heterogeneous architecture. In Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, pp. 6–10, Jun. 2019. DOI: 10.1145/3341069.3341072
    • (2018) BestSF. ACM Transactions on Architecture and Code Optimization, vol. 15, no. 3, pp. 1–27, Sep. 2018. DOI: 10.1145/3226228
    • (2017) A high performance block eigensolver for nuclear configuration interaction calculations. IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1550–1563, Jun. 2017. DOI: 10.1109/TPDS.2016.2630699
    • (2017) Accelerating the conjugate gradient algorithm with GPUs in CFD simulations. In High Performance Computing for Computational Science – VECPAR 2016, pp. 35–43, Jul. 2017. DOI: 10.1007/978-3-319-61982-8_5
    • (2016) On the performance and energy efficiency of sparse linear algebra on GPUs. The International Journal of High Performance Computing Applications, vol. 31, no. 5, pp. 375–390, Oct. 2016. DOI: 10.1177/1094342016672081
    • (2016) Energy evaluation of sparse matrix-vector multiplication on GPU. In 2016 Seventh International Green and Sustainable Computing Conference (IGSC), pp. 1–6, 2016. DOI: 10.1109/IGCC.2016.7892595
    • (2016) Performance modeling of hyper-scale custom machine for the principal steps in block Wiedemann algorithm. The Journal of Supercomputing, vol. 72, no. 11, pp. 4181–4203, Nov. 2016. DOI: 10.1007/s11227-016-1767-y
    • (2015) Acceleration of GPU-based Krylov solvers via data transfer reduction. The International Journal of High Performance Computing Applications, vol. 29, no. 3, pp. 366–383, Apr. 2015. DOI: 10.1177/1094342015580139