research-article

Performance analysis of the high-performance conjugate gradient benchmark on GPUs

Authors:

Everett Phillips,

Massimiliano Fatica

International Journal of High Performance Computing Applications, Volume 30, Issue 1

Pages 28 - 38

https://doi.org/10.1177/1094342015599239

Published: 24 November 2019

Abstract

Graphics processing unit (GPU) accelerated supercomputers have proved to be very effective, especially with regard to power efficiency, for accelerating compute-intensive applications such as the high-performance Linpack (HPL) benchmark used in the TOP500 list. This paper presents the details of a CUDA implementation of the high-performance conjugate gradient (HPCG) benchmark, a newly proposed benchmark that better represents modern application workloads, which rely more heavily on memory system and network performance than HPL does. The results obtained at full scale on the largest GPU supercomputers in the world (Titan, the Cray XK7 at ORNL, and Piz Daint, the Cray XC30 at CSCS) indicate that GPU-accelerated supercomputers are also very effective for this type of workload. A comparison with other architectures is also presented, showing that GPUs, with their high memory bandwidth, are the highest-performing devices for this new benchmark.
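
To make the abstract's point about memory-bound workloads concrete, the following is a minimal sketch of a plain (unpreconditioned) conjugate gradient solve on a small sparse matrix. It is not the HPCG reference code and not the CUDA implementation the paper describes; the 1D Poisson test matrix, the CSR layout, and every function name (`poisson_1d_csr`, `spmv`, `conjugate_gradient`) are assumptions chosen only for illustration. The benchmark itself and the paper's code are considerably more elaborate.

```python
# Illustrative sketch only: plain conjugate gradient on a tiny 1D Poisson
# matrix in CSR form. All names here are hypothetical, not taken from HPCG.

def poisson_1d_csr(n):
    """Build the n x n tridiagonal matrix [-1, 2, -1] in CSR form."""
    row_ptr, col_idx, values = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                col_idx.append(j)
                values.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


def spmv(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A x.

    Each stored nonzero costs one multiply and one add, but requires
    reading a matrix value, a column index, and an entry of x, so the
    kernel tends to be limited by memory traffic rather than arithmetic.
    """
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(row_ptr, col_idx, values, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, starting from x = 0.

    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual r = b - A x, with x = 0
    p = list(r)            # initial search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if rr ** 0.5 < tol:
            return x, it
        Ap = spmv(row_ptr, col_idx, values, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


if __name__ == "__main__":
    rp, ci, va = poisson_1d_csr(8)
    b = spmv(rp, ci, va, [1.0] * 8)   # right-hand side whose solution is all ones
    x, iters = conjugate_gradient(rp, ci, va, b)
    print(iters, max(abs(xi - 1.0) for xi in x))
```

Every step in the loop (the sparse matrix-vector product, the two dot products, and the three vector updates) streams through its operands once and performs only a few floating-point operations per value read, which is one way to see why memory bandwidth rather than peak arithmetic rate tends to bound this kind of solver.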


Cited By

Yang X, Li S, Yuan F, Dong D (2024) DBSR: An Efficient Storage Format for Vectorizing Sparse Triangular Solvers on Structured Grids. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-14. doi: 10.1109/SC41406.2024.00065. Online publication date: 17-Nov-2024.
https://dl.acm.org/doi/10.1109/SC41406.2024.00065

Gómez C, Mantovani F, Focht E, Casas M (2023) HPCG on long-vector architectures. Future Generation Computer Systems 143:C, pp. 152-162. doi: 10.1016/j.future.2023.01.015. Online publication date: 1-Jun-2023.
https://dl.acm.org/doi/10.1016/j.future.2023.01.015

Published In

International Journal of High Performance Computing Applications, Volume 30, Issue 1

February 2016

131 pages

Publisher

Sage Publications, Inc.

United States
