DOI: 10.1145/2712386.2712387
Research article · Public Access
Energy efficiency and performance frontiers for sparse computations on GPU supercomputers

Published: 07 February 2015

Abstract

In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigensolver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a sparse matrix-matrix product (SpMM), which achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.
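The blocking technique the abstract describes, fusing multiple memory-bound SpMVs into a single SpMM, can be sketched in a few lines of SciPy. This is a minimal sketch, not the paper's GPU kernels: the matrix size, sparsity density, and block width k below are illustrative assumptions. The point is that applying A to a block of k vectors at once computes the same result as k separate SpMVs, while streaming the matrix from memory only once instead of k times.

```python
import numpy as np
import scipy.sparse as sp

n, k = 1000, 8  # matrix dimension and block width (illustrative values)
rng = np.random.default_rng(0)

# Random sparse test matrix in CSR format; the paper's matrices come
# from real applications, this one is purely for demonstration.
A = sp.random(n, n, density=0.01, format="csr", random_state=0)
X = rng.standard_normal((n, k))

# k separate memory-bound SpMVs: A is read from memory k times.
Y_spmv = np.column_stack([A @ X[:, j] for j in range(k)])

# One blocked SpMM: A is read once and each nonzero is reused k times,
# which is the source of the higher throughput reported in the paper.
Y_spmm = A @ X

# Both formulations produce the same result up to rounding error.
assert np.allclose(Y_spmv, Y_spmm)
```

On hardware, the win comes from data reuse: SpMV is bandwidth-bound, so amortizing each matrix read over k right-hand sides raises arithmetic intensity, which is what enables the up-to-6x SpMM speedup reported above.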




Published In

PMAM '15: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores
February 2015, 186 pages
ISBN: 9781450334044
DOI: 10.1145/2712386

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. GPU supercomputer
    2. LOBPCG
    3. blocked sparse matrix vector product
    4. energy efficiency
    5. sparse eigensolver


Conference

PPoPP '15

Acceptance Rates

PMAM '15: 19 of 34 submissions accepted, 56%
Overall: 53 of 97 submissions accepted, 55%


Cited By

    • (2021) ALBUS: a method for efficiently processing SpMV using SIMD and load balancing. Future Generation Computer Systems, vol. 116, pp. 371–392, Mar. 2021. DOI: 10.1016/j.future.2020.10.036
    • (2019) Adaptive sparse matrix-vector multiplication on CPU-GPU heterogeneous architecture. In Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies Conference, pp. 6–10, Jun. 2019. DOI: 10.1145/3341069.3341072
    • (2018) BestSF. ACM Transactions on Architecture and Code Optimization, vol. 15, no. 3, pp. 1–27, Sep. 2018. DOI: 10.1145/3226228
    • (2017) A high performance block eigensolver for nuclear configuration interaction calculations. IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 6, pp. 1550–1563, Jun. 2017. DOI: 10.1109/TPDS.2016.2630699
    • (2017) Accelerating the conjugate gradient algorithm with GPUs in CFD simulations. In High Performance Computing for Computational Science – VECPAR 2016, pp. 35–43, Jul. 2017. DOI: 10.1007/978-3-319-61982-8_5
    • (2016) On the performance and energy efficiency of sparse linear algebra on GPUs. The International Journal of High Performance Computing Applications, vol. 31, no. 5, pp. 375–390, Oct. 2016. DOI: 10.1177/1094342016672081
    • (2016) Energy evaluation of sparse matrix-vector multiplication on GPU. In 2016 Seventh International Green and Sustainable Computing Conference (IGSC), pp. 1–6, 2016. DOI: 10.1109/IGCC.2016.7892595
    • (2016) Performance modeling of hyper-scale custom machine for the principal steps in block Wiedemann algorithm. The Journal of Supercomputing, vol. 72, no. 11, pp. 4181–4203, Nov. 2016. DOI: 10.1007/s11227-016-1767-y
    • (2015) Acceleration of GPU-based Krylov solvers via data transfer reduction. The International Journal of High Performance Computing Applications, vol. 29, no. 3, pp. 366–383, Apr. 2015. DOI: 10.1177/1094342015580139