DOI: 10.1145/1345206.1345253
poster

Compiler-enhanced incremental checkpointing for OpenMP applications

Published: 20 February 2008

Abstract

As modern supercomputing systems reach petaflop performance, they grow in both size and complexity and become increasingly vulnerable to failures. Checkpointing is a popular technique for tolerating such failures. Although a variety of automated system-level checkpointing solutions are available to HPC users, manual application-level checkpointing remains more popular because of its superior performance. This paper improves the performance of automated checkpointing by presenting a compiler analysis for incremental checkpointing. The analysis, which works with both sequential and OpenMP applications, significantly reduces checkpoint sizes and enables asynchronous checkpointing.
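
To make the idea of incremental checkpointing concrete, the following is a minimal sketch, not the paper's method. It assumes a toy application whose state is a dictionary of named variables, and it tracks writes at run time through a hypothetical `write` method, whereas the abstract describes a compiler analysis that would derive such information statically. The class name `IncrementalCheckpointer` and all variable names are illustrative assumptions.

```python
import copy


class IncrementalCheckpointer:
    """Toy incremental checkpointer: saves only variables written since the last checkpoint."""

    def __init__(self, state):
        self.state = state           # live application state: name -> value
        self.dirty = set(state)      # everything counts as modified before the first checkpoint
        self.checkpoints = []        # partial snapshots (deltas), oldest first

    def write(self, name, value):
        """Update a variable and record that it changed (hypothetical run-time tracking)."""
        self.state[name] = value
        self.dirty.add(name)

    def checkpoint(self):
        """Save a deep copy of only the modified variables; return how many were saved."""
        delta = {name: copy.deepcopy(self.state[name]) for name in self.dirty}
        self.checkpoints.append(delta)
        self.dirty.clear()
        return len(delta)

    def restore(self):
        """Rebuild the latest checkpointed state by replaying the deltas in order."""
        restored = {}
        for delta in self.checkpoints:
            restored.update(copy.deepcopy(delta))
        return restored


# Usage: the read-only "coeffs" array is saved once and skipped afterwards.
ckpt = IncrementalCheckpointer({"grid": [0.0] * 4, "coeffs": [1.0, 2.0], "step": 0})
first = ckpt.checkpoint()        # full checkpoint: all three variables
ckpt.write("grid", [1.0] * 4)
ckpt.write("step", 1)
second = ckpt.checkpoint()       # incremental checkpoint: only "grid" and "step"
recovered = ckpt.restore()
```

In this sketch the second checkpoint is smaller than the first because the unmodified variable is skipped, which is the size reduction the abstract refers to. Knowing ahead of time which data cannot change is also what would let a checkpoint be written while the computation continues.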

Published In

PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2008
308 pages
ISBN: 9781595937957
DOI: 10.1145/1345206

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. checkpointing
  2. fault tolerance
  3. performance optimization
  4. static analysis

Qualifiers

  • Poster

Conference

PPoPP08

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Cited By

  • (2022) Software approaches for resilience of high performance computing systems: a survey. Frontiers of Computer Science 17:4. DOI: 10.1007/s11704-022-2096-3. Online publication date: 12-Dec-2022
  • (2020) ACR: Amnesic Checkpointing and Recovery. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (30-43). DOI: 10.1109/HPCA47549.2020.00013. Online publication date: Feb-2020
  • (2019) Fault Tolerant High Performance Solver for Linear Equation Systems. 2019 38th Symposium on Reliable Distributed Systems (SRDS) (113-11309). DOI: 10.1109/SRDS47363.2019.00022. Online publication date: Oct-2019
  • (2017) ITALC. Proceedings of the Fourth International Workshop on HPC User Support Tools (1-11). DOI: 10.1145/3152493.3152558. Online publication date: 12-Nov-2017
  • (2014) Accelerating incremental checkpointing for extreme-scale computing. Future Generation Computer Systems 30:C (66-77). DOI: 10.5555/2747903.2748199. Online publication date: 1-Jan-2014
  • (2014) Toward Exascale Resilience. Supercomputing Frontiers and Innovations: an International Journal 1:1 (5-28). DOI: 10.14529/jsfi140101. Online publication date: 6-Apr-2014
  • (2014) Fault tolerance for remote memory access programming models. Proceedings of the 23rd international symposium on High-performance parallel and distributed computing (37-48). DOI: 10.1145/2600212.2600224. Online publication date: 23-Jun-2014
  • (2014) Accelerating incremental checkpointing for extreme-scale computing. Future Generation Computer Systems 30 (66-77). DOI: 10.1016/j.future.2013.04.017. Online publication date: Jan-2014
  • (2013) Containment domains. Scientific Programming 21:3-4 (197-212). DOI: 10.1155/2013/473915. Online publication date: 1-Jul-2013
  • (2012) Containment domains. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (1-11). DOI: 10.5555/2388996.2389075. Online publication date: 10-Nov-2012
