Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1109/SC.2014.82acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices

Published: 16 November 2014 Publication History

Abstract

A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well-known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.

References

[1]
A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, "HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers," http://www.netlib.org/benchmark/hpl/.
[2]
J. Dongarra and M. A. Heroux, "Toward a New Metric for Ranking High Performance Computing Systems," Sandia National Laboratories, Tech. Rep. 4744, 2013.
[3]
"Workshop on Extreme-Scale Solvers: Transition to Future Architectures," in American Geophysical Union, Washington, DC, 2012.
[4]
P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems," 2008, www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.
[5]
E. Rothberg and A. Gupta, "Parallel ICCG on a Hierarchical Memory Multiprocessor - Addressing the Triangular Solve Bottleneck," Parallel Computing, vol. 18, no. 7, 1992.
[6]
V. E. Henson and U. M. Yang, "BoomerAMG: a Parallel Algebraic Multigrid Solver and Preconditioner," Applied Numerical Mathematics, vol. 41, pp. 155--177, 2000.
[7]
J. Park, M. Smelyanskiy, N. Sundaram, and P. Dubey, "Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver," in International Supercomputing Conference (ISC), 2014.
[8]
T. Iwashita, H. Nakashima, and Y. Takahashi, "Algebraic Block Multi-Color Ordering Method for Parallel Multi-Threaded Sparse Triangular Solver in ICCG Method," in International Symposium on Parallel and Distributed Processing (IPDPS), 2012.
[9]
X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, "Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors," in International Conference on Supercomputing (ICS), 2013.
[10]
R. S. Daniel Molka, Daniel Hackenberg and M. S. Müller, "Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2009.
[11]
E. Anderson and Y. Saad, "Solving Sparse Triangular Linear Systems on Parallel Computers," International Journal of High Speed Computing, vol. 1, no. 1, 1989.
[12]
J. H. Saltz, "Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors," SIAM Journal of Scientific and Statistical Computing, vol. 11, no. 1, 1990.
[13]
M. Naumov, "Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU," NVIDIA Corporation, Tech. Rep. 001, 2011.
[14]
Y. Saad, Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.
[15]
E. L. Poole and J. M. Ortega, "Multicolor ICCG Methods for Vector Computers," SIAM Journal on Numerical Analysis, vol. 24, no. 6, 1987.
[16]
H. C. Elman and E. Agrón, "Ordering techniques for the preconditioned conjugate gradient method on parallel computers," Computer Physics Communications, vol. 53, no. 1, 1989.
[17]
T. Iwashita and M. Shimasaki, "Block Red-Black Ordering: A New Ordering Strategy for Parallelization of ICCG Method," International Journal of Parallel Programming, vol. 31, no. 1, 2003.
[18]
A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures," in International Conference on High Performance Embedded Architectures and Compilers (HiPEAC), 2010.
[19]
D. LaSalle and G. Karypis, "Multi-Threaded Graph Partitioning," in International Symposium on Parallel and Distributed Processing (IPDPS), 2013.
[20]
A. H. Gebremedhin, F. Manne, and A. Pothen, "What Color Is Your Jacobian? Graph Coloring for Computing Derivatives," SIAM review, vol. 47, no. 4, 2005.
[21]
A. H. Gebremedhin, D. Nguyen, M. M. A. Patwary, and A. Pothen, "ColPack: Software for Graph Coloring and Related Problems in Scientific Computing," ACM Transactions on Mathematical Software (TOMS), vol. 40, no. 1, 2013.
[22]
M. M. A. Patwary, A. H. Gebremedhin, and A. Pothen, "New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures," in Proceedings of 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), 2011.
[23]
M. M. A. Patwary, A. H. Gebremedhin, F. Manne, and A. Pothen, "Paradigms for Effective Parallelization of Inherently Sequential Graph Algorithms on Multi-core Architectures," 2013, ACM Transactions on Parallel Computing, under review.
[24]
S. Lee and R. Eigenmann, "Adaptive Runtime Tuning of Parallel Sparse Matrix-Vector Multiplication on Distributed Memory Systems," in International Conference on Supercomputing (ICS), 2008.
[25]
A. Pinar and C. Aykanat, "Fast Optimal Load Balancing Algorithms for 1D Partitioning," Journal of Parallel and Distributed Computing, 2004.
[26]
J. D. McCalpin, "STREAM: Sustainable Memory Bandwidth in High Performance Computers," http://www.cs.virginia.edu/stream.
[27]
K. Raman, "Optimizing Memory Bandwidth on Stream Triad," https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad, 2013.
[28]
T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Transactions on Mathematical Software, vol. 15, no. 1, 2011, http://www.cise.ufl.edu/research/sparse/matrices.
[29]
D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS Parallel Benchmarks," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 1991.
[30]
J. Dongarra, V. Eijkhout, and H. van der Vorst, "An iterative solver benchmark," Sci. Program., vol. 9, no. 4, Dec. 2001.
[31]
R. Li and Y. Saad, "GPU-Accelerated Preconditioned Iterative Linear Solvers," Technical report, University of Minnesota, Tech. Rep., 2010.
[32]
V. Heuveline, D. Lukarski, and J.-P. Weiss, "Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs -- The Power(q)-pattern Method," Preprint Series of the Engineering Mathematics and Computing Lab, vol. 0, no. 08, 2011.
[33]
A. H. Baker, R. D. Falgout, T. Gamblin, T. V. Koleg, M. Schulz, and U. M. Yang, "Scaling Algebraic Multigrid Solvers: On the Road to Exascale," Competence in High Performance Computing 2010, vol. 31, 2012.
[34]
J. Bolz, I. Farmer, E. Grinspun, and P. Schröoder, "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," ACM Trans. Graph., vol. 22, no. 3, Jul. 2003.
[35]
N. Bell, S. Dalton, and L. Olson, "Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods," NVIDIA Corporation, NVIDIA Technical Report NVR-2011-002, Jun. 2011.
[36]
S. Williams, D. D. Kalamkar, A. Singh, A. M. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, and L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[37]
E. Anderson and Y. Saad, "Solving Sparse Triangular Linear Systems on Parallel Computers," Int. J. High Speed Comput., vol. 1, no. 1, Apr. 1989.
[38]
D. Wallin, H. Löf, E. Hagersten, and S. Holmgren, "Multigrid and Gauss-Seidel Smoothers Revisited: Parallelization on Chip Multiprocessors," in International Conference on Supercomputing (ICS), 2006.
[39]
M. M. Wolf, M. A. Heroux, and E. G. Boman, "Factors Impacting Performance of Multithreaded Sparse Triangular Solve," in Proceedings of the 9th International Conference on High Performance Computing for Computational Science, 2011.

Cited By

View all
  • (2024)Towards Scalable Unstructured Mesh Computations on Shared Memory Many-CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638473(109-119)Online publication date: 2-Mar-2024
  • (2023)Optimizing Multi-grid Computation and Parallelization on Multi-coresProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593726(227-239)Online publication date: 21-Jun-2023
  • (2021)Temporal vectorization for stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476149(1-13)Online publication date: 14-Nov-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2014
1054 pages
ISBN:9781479955008
  • General Chair:
  • Trish Damkroger,
  • Program Chair:
  • Jack Dongarra

Sponsors

Publisher

IEEE Press

Publication History

Published: 16 November 2014

Check for updates

Qualifiers

  • Research-article

Conference

SC '14
Sponsor:

Acceptance Rates

SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)8
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Towards Scalable Unstructured Mesh Computations on Shared Memory Many-CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638473(109-119)Online publication date: 2-Mar-2024
  • (2023)Optimizing Multi-grid Computation and Parallelization on Multi-coresProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593726(227-239)Online publication date: 21-Jun-2023
  • (2021)Temporal vectorization for stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476149(1-13)Online publication date: 14-Nov-2021
  • (2020)A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector MultiplicationACM Transactions on Parallel Computing10.1145/33997327:3(1-37)Online publication date: 29-Jun-2020
  • (2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
  • (2018)Dynamic fine-grained sparse memory accessesProceedings of the International Symposium on Memory Systems10.1145/3240302.3240416(85-97)Online publication date: 1-Oct-2018
  • (2018)Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight SupercomputerACM Transactions on Architecture and Code Optimization10.1145/318217715:1(1-20)Online publication date: 22-Mar-2018
  • (2017)A hierarchical grid algorithm for accelerating high-performance conjugate gradient benchmark on sunway many-core processorProceedings of the 3rd International Conference on Communication and Information Processing10.1145/3162957.3163049(361-368)Online publication date: 24-Nov-2017
  • (2017)Experiences Porting Scientific Applications to the Intel (KNL) Xeon Phi PlatformPractice and Experience in Advanced Research Computing 2017: Sustainability, Success and Impact10.1145/3093338.3093371(1-8)Online publication date: 9-Jul-2017
  • (2017)Main Memory in HPCACM Transactions on Architecture and Code Optimization10.1145/302336214:1(1-26)Online publication date: 6-Mar-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media