DOI: 10.1145/2532352.2532357
research-article

Overcoming extreme-scale reproducibility challenges through a unified, targeted, and multilevel toolset

Published: 17 November 2013

Abstract

Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is crucial for computational science and engineering applications. However, non-determinism in concurrency scheduling often hampers achieving this ability on high performance computing (HPC) systems. To aid in managing the adverse effects of non-determinism, prior work has provided techniques to achieve bit-precise reproducibility, but most of them focus only on small-scale parallelism. While scalable techniques recently emerged, they are disparate and target special purposes, e.g., single-schedule domains. On current systems with O(10^6) compute cores and future ones with O(10^9), any technique that does not embrace a unified, targeted, and multilevel approach will fall short of providing reproducibility. In this paper, we argue for a common toolset that embodies this approach, where programmers select and compose complementary tools and can effectively, yet scalably, analyze, control, and eliminate sources of non-determinism at scale. This allows users to gain reproducibility only to the levels demanded by specific code development needs. We present our research agenda and ongoing work toward this goal.
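
As a rough illustration of why scheduling order can affect bit-precise results, the sketch below sums the same three values in two different orders, standing in for partial results that arrive in different orders across runs. It is not taken from the paper and does not represent the proposed toolset; the helper `reduce_in_order` and the sample values are hypothetical, chosen only to make the effect of floating-point non-associativity visible. The final check uses `math.fsum` from the Python standard library as one possible order-independent summation for this small case.

```python
import itertools
import math


def reduce_in_order(values, order):
    """Sum values in the given index order (hypothetical helper).

    Each order mimics one possible arrival order of partial results.
    """
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Sample values (an assumption for illustration): in IEEE 754 doubles,
# adding 1.0 to 1e16 is rounded away, so the association order matters.
values = [1e16, 1.0, -1e16]

left_to_right = reduce_in_order(values, [0, 1, 2])  # (1e16 + 1.0) + (-1e16)
reordered = reduce_in_order(values, [0, 2, 1])      # (1e16 + (-1e16)) + 1.0

# math.fsum tracks exact partial sums, so every ordering agrees here.
fsum_results = {math.fsum(p) for p in itertools.permutations(values)}

print(left_to_right, reordered, fsum_results)
```

The two naive sums differ even though the inputs are identical, which is the kind of run-to-run variation a non-deterministic reduction order could introduce. Fixing the order or using a compensated summation removes the variation in this toy case, typically at some cost in performance.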




Published In

SE-HPCCSE '13: Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering
November 2013
49 pages
ISBN:9781450324991
DOI:10.1145/2532352

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SC13

Acceptance Rates

SE-HPCCSE '13 Paper Acceptance Rate 7 of 12 submissions, 58%;
Overall Acceptance Rate 7 of 12 submissions, 58%


Cited By

  • (2023) A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing. The International Journal of High Performance Computing Applications, 37:3-4 (306-327). DOI: 10.1177/10943420231166610. Online publication date: 5-Apr-2023
  • (2019) A three-phase workflow for general and expressive representations of nondeterminism in HPC applications. The International Journal of High Performance Computing Applications (109434201986882). DOI: 10.1177/1094342019868826. Online publication date: 20-Aug-2019
  • (2018) Record-and-Replay Techniques for HPC Systems: A Survey. Supercomputing Frontiers and Innovations: an International Journal, 5:1 (11-30). DOI: 10.14529/jsfi180102. Online publication date: 15-Mar-2018
  • (2014) Getting ready for approximate computing. Proceedings of the 11th ACM Conference on Computing Frontiers (1-10). DOI: 10.1145/2597917.2597936. Online publication date: 20-May-2014
  • (2014) Flux. Proceedings of the 2014 43rd International Conference on Parallel Processing Workshops (9-17). DOI: 10.1109/ICPPW.2014.15. Online publication date: 9-Sep-2014
