research-article

Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices

Authors:

Mikhail Smelyanskiy,

Karthikeyan Vaidyanathan,

Alexander Heinecke,

Dhiraj D. Kalamkar,

Md. Mosotofa Ali Patwary,

Pradeep DubeyAuthors Info & Claims

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 945 - 955

https://doi.org/10.1109/SC.2014.82

Published: 16 November 2014 Publication History

Abstract

A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well-known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.

References

[1]

A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, "HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers," http://www.netlib.org/benchmark/hpl/.

[2]

J. Dongarra and M. A. Heroux, "Toward a New Metric for Ranking High Performance Computing Systems," Sandia National Laboratories, Tech. Rep. 4744, 2013.

[3]

"Workshop on Extreme-Scale Solvers: Transition to Future Architectures," in American Geophysical Union, Washington, DC, 2012.

[4]

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems," 2008, www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

[5]

E. Rothberg and A. Gupta, "Parallel ICCG on a Hierarchical Memory Multiprocessor - Addressing the Triangular Solve Bottleneck," Parallel Computing, vol. 18, no. 7, 1992.

[6]

V. E. Henson and U. M. Yang, "BoomerAMG: a Parallel Algebraic Multigrid Solver and Preconditioner," Applied Numerical Mathematics, vol. 41, pp. 155--177, 2000.

Digital Library

[7]

J. Park, M. Smelyanskiy, N. Sundaram, and P. Dubey, "Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver," in International Supercomputing Conference (ISC), 2014.

[8]

T. Iwashita, H. Nakashima, and Y. Takahashi, "Algebraic Block Multi-Color Ordering Method for Parallel Multi-Threaded Sparse Triangular Solver in ICCG Method," in International Symposium on Parallel and Distributed Processing (IPDPS), 2012.

Digital Library

[9]

X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, "Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors," in International Conference on Supercomputing (ICS), 2013.

Digital Library

[10]

R. S. Daniel Molka, Daniel Hackenberg and M. S. Müller, "Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2009.

Digital Library

[11]

E. Anderson and Y. Saad, "Solving Sparse Triangular Linear Systems on Parallel Computers," International Journal of High Speed Computing, vol. 1, no. 1, 1989.

Digital Library

[12]

J. H. Saltz, "Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors," SIAM Journal of Scientific and Statistical Computing, vol. 11, no. 1, 1990.

Digital Library

[13]

M. Naumov, "Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU," NVIDIA Corporation, Tech. Rep. 001, 2011.

[14]

Y. Saad, Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.

Digital Library

[15]

E. L. Poole and J. M. Ortega, "Multicolor ICCG Methods for Vector Computers," SIAM Journal on Numerical Analysis, vol. 24, no. 6, 1987.

Digital Library

[16]

H. C. Elman and E. Agrón, "Ordering techniques for the preconditioned conjugate gradient method on parallel computers," Computer Physics Communications, vol. 53, no. 1, 1989.

[17]

T. Iwashita and M. Shimasaki, "Block Red-Black Ordering: A New Ordering Strategy for Parallelization of ICCG Method," International Journal of Parallel Programming, vol. 31, no. 1, 2003.

Digital Library

[18]

A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures," in International Conference on High Performance Embedded Architectures and Compilers (HiPEAC), 2010.

Digital Library

[19]

D. LaSalle and G. Karypis, "Multi-Threaded Graph Partitioning," in International Symposium on Parallel and Distributed Processing (IPDPS), 2013.

Digital Library

[20]

A. H. Gebremedhin, F. Manne, and A. Pothen, "What Color Is Your Jacobian? Graph Coloring for Computing Derivatives," SIAM review, vol. 47, no. 4, 2005.

Digital Library

[21]

A. H. Gebremedhin, D. Nguyen, M. M. A. Patwary, and A. Pothen, "ColPack: Software for Graph Coloring and Related Problems in Scientific Computing," ACM Transactions on Mathematical Software (TOMS), vol. 40, no. 1, 2013.

Digital Library

[22]

M. M. A. Patwary, A. H. Gebremedhin, and A. Pothen, "New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures," in Proceedings of 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), 2011.

Digital Library

[23]

M. M. A. Patwary, A. H. Gebremedhin, F. Manne, and A. Pothen, "Paradigms for Effective Parallelization of Inherently Sequential Graph Algorithms on Multi-core Architectures," 2013, ACM Transactions on Parallel Computing, under review.

[24]

S. Lee and R. Eigenmann, "Adaptive Runtime Tuning of Parallel Sparse Matrix-Vector Multiplication on Distributed Memory Systems," in International Conference on Supercomputing (ICS), 2008.

Digital Library

[25]

A. Pinar and C. Aykanat, "Fast Optimal Load Balancing Algorithms for 1D Partitioning," Journal of Parallel and Distributed Computing, 2004.

Digital Library

[26]

J. D. McCalpin, "STREAM: Sustainable Memory Bandwidth in High Performance Computers," http://www.cs.virginia.edu/stream.

[27]

K. Raman, "Optimizing Memory Bandwidth on Stream Triad," https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad, 2013.

[28]

T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Transactions on Mathematical Software, vol. 15, no. 1, 2011, http://www.cise.ufl.edu/research/sparse/matrices.

Digital Library

[29]

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS Parallel Benchmarks," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 1991.

[30]

J. Dongarra, V. Eijkhout, and H. van der Vorst, "An iterative solver benchmark," Sci. Program., vol. 9, no. 4, Dec. 2001.

Digital Library

[31]

R. Li and Y. Saad, "GPU-Accelerated Preconditioned Iterative Linear Solvers," Technical report, University of Minnesota, Tech. Rep., 2010.

[32]

V. Heuveline, D. Lukarski, and J.-P. Weiss, "Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs -- The Power(q)-pattern Method," Preprint Series of the Engineering Mathematics and Computing Lab, vol. 0, no. 08, 2011.

[33]

A. H. Baker, R. D. Falgout, T. Gamblin, T. V. Koleg, M. Schulz, and U. M. Yang, "Scaling Algebraic Multigrid Solvers: On the Road to Exascale," Competence in High Performance Computing 2010, vol. 31, 2012.

[34]

J. Bolz, I. Farmer, E. Grinspun, and P. Schröoder, "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," ACM Trans. Graph., vol. 22, no. 3, Jul. 2003.

Digital Library

[35]

N. Bell, S. Dalton, and L. Olson, "Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods," NVIDIA Corporation, NVIDIA Technical Report NVR-2011-002, Jun. 2011.

[36]

S. Williams, D. D. Kalamkar, A. Singh, A. M. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, and L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.

Digital Library

[37]

E. Anderson and Y. Saad, "Solving Sparse Triangular Linear Systems on Parallel Computers," Int. J. High Speed Comput., vol. 1, no. 1, Apr. 1989.

Digital Library

[38]

D. Wallin, H. Löf, E. Hagersten, and S. Holmgren, "Multigrid and Gauss-Seidel Smoothers Revisited: Parallelization on Chip Multiprocessors," in International Conference on Supercomputing (ICS), 2006.

Digital Library

[39]

M. M. Wolf, M. A. Heroux, and E. G. Boman, "Factors Impacting Performance of Multithreaded Sparse Triangular Solve," in Proceedings of the 9th International Conference on High Performance Computing for Computational Science, 2011.

Digital Library

Cited By

Qiu HXu CFang JDeng LZhang JWang QDing YDai ZChe YChen SLiu JLee IChabbi MSteuwer M(2024)Towards Scalable Unstructured Mesh Computations on Shared Memory Many-CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638473(109-119)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638473
Yang XLi SYuan FDong DHuang CWang ZGallivan KNikolopoulos DBeivide RGallopoulos E(2023)Optimizing Multi-grid Computation and Parallelization on Multi-coresProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593726(227-239)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593726
Yuan LCao HZhang YLi KLu PYue Yde Supinski BHall MGamblin T(2021)Temporal vectorization for stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476149(1-13)Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476149
Show More Cited By

Index Terms

Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices

Recommendations

A hierarchical grid algorithm for accelerating high-performance conjugate gradient benchmark on sunway many-core processor
ICCIP '17: Proceedings of the 3rd International Conference on Communication and Information Processing

This paper presents analysis and optimizations for High Performance Conjugate Gradient benchmark (HPCG) on the Sunway many-core processor. For modern multi-core and many-core processors, HPCG always presents a poor performance and under-utilizes ...
A CUDA Implementation of the High Performance Conjugate Gradient Benchmark
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation
Abstract
The High Performance Conjugate Gradient (HPCG) benchmark has been recently proposed as a complement to the High Performance Linpack (HPL) benchmark currently used to rank supercomputers in the Top500 list. This new benchmark solves a large sparse ...
Performance analysis of the high-performance conjugate gradient benchmark on GPUs

Graphics processing unit accelerated supercomputers have proved to be very effective, especially with regard to power efficiency, for accelerating compute intensive applications like the high-performance Linpack used in the TOP500 list. This paper ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2014

1054 pages

ISBN:9781479955008

General Chair:
Trish Damkroger
Lawrence Livermore National Laboratory, Livermore, California
,
Program Chair:
Jack Dongarra
University of Tennessee, Knoxville, Tennessee

Sponsors

Publisher

IEEE Press

Publication History

Published: 16 November 2014

Check for updates

Qualifiers

Research-article

Conference

SC '14

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC '14: International Conference for High Performance Computing, Networking, Storage and Analysis

November 16 - 21, 2014

Louisana, New Orleans

Acceptance Rates

SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

19
Total Citations
View Citations
315
Total Downloads

Downloads (Last 12 months)8
Downloads (Last 6 weeks)0

Reflects downloads up to 24 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Qiu HXu CFang JDeng LZhang JWang QDing YDai ZChe YChen SLiu JLee IChabbi MSteuwer M(2024)Towards Scalable Unstructured Mesh Computations on Shared Memory Many-CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638473(109-119)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638473
Yang XLi SYuan FDong DHuang CWang ZGallivan KNikolopoulos DBeivide RGallopoulos E(2023)Optimizing Multi-grid Computation and Parallelization on Multi-coresProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593726(227-239)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593726
Yuan LCao HZhang YLi KLu PYue Yde Supinski BHall MGamblin T(2021)Temporal vectorization for stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476149(1-13)Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476149
Alappat CBasermann ABishop AFehske HHager GSchenk OThies JWellein G(2020)A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector MultiplicationACM Transactions on Parallel Computing10.1145/33997327:3(1-37)Online publication date: 29-Jun-2020
https://dl.acm.org/doi/10.1145/3399732
Mohammadi MYuki TCheshmi KDavis EHall MDehnavi MNandy POlschanowsky CVenkat AStrout MMcKinley KFisher K(2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
https://dl.acm.org/doi/10.1145/3314221.3314646
Akin BChou CPark JHughes CAgarwal RJacob B(2018)Dynamic fine-grained sparse memory accessesProceedings of the International Symposium on Memory Systems10.1145/3240302.3240416(85-97)Online publication date: 1-Oct-2018
https://dl.acm.org/doi/10.1145/3240302.3240416
Ao YYang CLiu FYin WJiang LSun Q(2018)Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight SupercomputerACM Transactions on Architecture and Code Optimization10.1145/318217715:1(1-20)Online publication date: 22-Mar-2018
https://dl.acm.org/doi/10.1145/3182177
Liao CChen JHan WCao HSu ZYin WAn HBen-Othman JGang FLiu JArai M(2017)A hierarchical grid algorithm for accelerating high-performance conjugate gradient benchmark on sunway many-core processorProceedings of the 3rd International Conference on Communication and Information Processing10.1145/3162957.3163049(361-368)Online publication date: 24-Nov-2017
https://dl.acm.org/doi/10.1145/3162957.3163049
Malaya NMcDougall DMichoski CLee MSimmons CHart D(2017)Experiences Porting Scientific Applications to the Intel (KNL) Xeon Phi PlatformPractice and Experience in Advanced Research Computing 2017: Sustainability, Success and Impact10.1145/3093338.3093371(1-8)Online publication date: 9-Jul-2017
https://dl.acm.org/doi/10.1145/3093338.3093371
Zivanovic DPavlovic MRadulovic MShin HSon JMckee SCarpenter PRadojković PAyguadé E(2017)Main Memory in HPCACM Transactions on Architecture and Code Optimization10.1145/302336214:1(1-26)Online publication date: 6-Mar-2017
https://dl.acm.org/doi/10.1145/3023362
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents