An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart

Authors:

Kurt B. Ferreira

FTXS '16: Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

Pages 35 - 42

https://doi.org/10.1145/2909428.2909430

Published: 31 May 2016

Abstract

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties.

In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases) the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution.

Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases when the failure distribution has relatively little impact.
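
The kind of comparison described above can be illustrated with a small Monte Carlo sketch. This is not the model used in the paper; it is a simplified simulation under stated assumptions: the checkpoint interval is computed with Young's first-order approximation (interval = sqrt(2 × checkpoint cost × MTBF)), no failure strikes during a restart, and the Weibull shape of 0.7 is an arbitrary illustrative value. All function names and parameter values are hypothetical.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def weibull_scale(mtbf, shape):
    """Scale parameter that gives a Weibull distribution the requested mean."""
    return mtbf / math.gamma(1.0 + 1.0 / shape)


def simulate(work, interval, checkpoint_cost, restart_cost, draw_failure, rng):
    """Wall-clock time to finish `work` seconds of computation.

    Assumption: failures cannot occur while the application is restarting.
    """
    now = 0.0
    done = 0.0
    next_failure = draw_failure(rng)
    while done < work:
        segment = min(interval, work - done)
        # The final segment does not need a trailing checkpoint.
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else checkpoint_cost)
        if now + cost <= next_failure:
            now += cost
            done += segment
        else:
            # Work since the last checkpoint is lost; restart, then resume.
            now = next_failure + restart_cost
            next_failure = now + draw_failure(rng)
    return now


def mean_time(trials, seed, **kwargs):
    """Average time-to-solution over `trials` independent runs."""
    rng = random.Random(seed)
    return sum(simulate(rng=rng, **kwargs) for _ in range(trials)) / trials


if __name__ == "__main__":
    mtbf, delta, restart, work = 4 * 3600.0, 300.0, 600.0, 24 * 3600.0
    shape = 0.7  # illustrative assumption, not taken from the paper
    tau = young_interval(delta, mtbf)
    common = dict(work=work, interval=tau, checkpoint_cost=delta,
                  restart_cost=restart)
    exp_mean = mean_time(200, 1, draw_failure=lambda r: r.expovariate(1.0 / mtbf),
                         **common)
    scale = weibull_scale(mtbf, shape)
    wbl_mean = mean_time(200, 1, draw_failure=lambda r: r.weibullvariate(scale, shape),
                         **common)
    print(f"interval={tau:.0f}s exponential={exp_mean:.0f}s weibull={wbl_mean:.0f}s")
```

Both failure processes share the same mean time between failures, so any difference in the averaged time-to-solution comes from the shape of the distribution rather than from the failure rate. Lowering `mtbf` or changing `shape` gives a rough feel for the sensitivity the abstract describes.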

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
        Checkpoint / restart

Published In

FTXS '16: Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

May 2016

58 pages

ISBN: 9781450343497

DOI: 10.1145/2909428

Program Chair:
Nathan DeBardeleben
Los Alamos National Laboratory, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC'16

Sponsor:

University of Arizona
SIGARCH

HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing

May 31, 2016

Kyoto, Japan

Acceptance Rates

Overall Acceptance Rate 16 of 25 submissions, 64%

Bibliometrics

Total Citations: 8
Total Downloads: 97
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Feb 2025

Cited By

Benoit A, Perotin L, Robert Y, Vivien F (2024). Checkpointing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms. ACM Transactions on Parallel Computing 11:1, pages 1-26. Online publication date: 11-Mar-2024.
https://dl.acm.org/doi/10.1145/3624560
Kumar M, Engelmann C (2022). RDPM: An Extensible Tool for Resilience Design Patterns Modelling. Euro-Par 2021: Parallel Processing Workshops, pages 283-297. Online publication date: 9-Jun-2022.
https://doi.org/10.1007/978-3-031-06156-1_23
Kumar M, Engelmann C (2020). Models for Resilience Design Patterns. 2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pages 21-30. Online publication date: Nov-2020.
https://doi.org/10.1109/FTXS51974.2020.00008
Rojas E, Meneses E, Jones T, Maxwell D (2020). Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers. Euro-Par 2020: Parallel Processing, pages 37-51. Online publication date: 24-Aug-2020.
https://dl.acm.org/doi/10.1007/978-3-030-57675-2_3
Levy S, Ferreira K, DeBardeleben N, Siddiqua T, Sridharan V, Baseman E (2018). Lessons learned from memory errors observed over the lifetime of Cielo. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-12. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.5555/3291656.3291714
Levy S, Ferreira K, DeBardeleben N, Siddiqua T, Sridharan V, Baseman E (2018). Lessons learned from memory errors observed over the lifetime of Cielo. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-12. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.1109/SC.2018.00046
Levy S, Ferreira K, Widener P (2018). The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints. Concurrency and Computation: Practice and Experience 32:3. Online publication date: 9-Sep-2018.
https://doi.org/10.1002/cpe.4890
Liu J, Agrawal G (2016). Soft Error Detection for Iterative Applications Using Offline Training. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), pages 2-11. Online publication date: Dec-2016.
https://doi.org/10.1109/HiPC.2016.011
