DOI: 10.1145/2909428.2909430

An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart

Published: 31 May 2016

Abstract

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties.
In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times become less tightly grouped (i.e., as the standard deviation increases), the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has only a modest impact on application performance, even when failures are drawn from a different distribution.
Our work provides critical analysis and guidance for the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper help to distinguish cases where the failure distribution has a strong influence on application performance from cases where it has relatively little impact.
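
The kind of comparison described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' model: it assumes a deliberately simplified application (fixed checkpoint and restart costs, no failures during restart, no checkpoint after the final work segment), picks the checkpoint interval with Young's first-order approximation sqrt(2 * checkpoint cost * MTBF), and compares exponentially distributed failures with Weibull failures of the same mean. All function names and parameter values below are hypothetical.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def weibull_scale_for_mean(mean, shape):
    """Scale parameter that gives a Weibull distribution the requested mean."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def simulate_time_to_solution(work, interval, checkpoint_cost, restart_cost,
                              draw_failure, rng):
    """Simulate one run of a coordinated checkpoint/restart application.

    Assumptions (for illustration only): a failure loses the segment in
    progress, the failure clock restarts once the restart completes, and
    the final segment is not followed by a checkpoint.
    """
    elapsed = 0.0
    remaining = work
    next_failure = draw_failure(rng)  # absolute time of the next failure
    while remaining > 0:
        segment = min(interval, remaining)
        is_last = segment >= remaining
        cost = segment + (0.0 if is_last else checkpoint_cost)
        if elapsed + cost <= next_failure:
            # Segment and its checkpoint completed before the next failure.
            elapsed += cost
            remaining -= segment
        else:
            # Failure: discard the segment in progress and pay the restart cost.
            elapsed = next_failure + restart_cost
            next_failure = elapsed + draw_failure(rng)
    return elapsed


def mean_time_to_solution(draw_failure, trials, seed, work, interval,
                          checkpoint_cost, restart_cost):
    """Average the simulated time-to-solution over several independent runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_time_to_solution(work, interval, checkpoint_cost,
                                  restart_cost, draw_failure, rng)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # Hypothetical parameters, in arbitrary time units.
    work, checkpoint_cost, restart_cost, mtbf = 10_000.0, 5.0, 10.0, 1_000.0
    interval = young_interval(checkpoint_cost, mtbf)

    shape = 0.7  # Weibull shape < 1: decreasing hazard rate
    scale = weibull_scale_for_mean(mtbf, shape)
    distributions = {
        "exponential": lambda rng: rng.expovariate(1.0 / mtbf),
        "weibull": lambda rng: rng.weibullvariate(scale, shape),
    }
    for name, draw in distributions.items():
        mean = mean_time_to_solution(draw, 200, 42, work, interval,
                                     checkpoint_cost, restart_cost)
        print(f"{name}: mean time-to-solution = {mean:.1f}")
```

Lowering `mtbf` or changing `shape` in this sketch gives a rough feel for how the gap between distributions varies with failure frequency and spread, under the stated simplifications.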


Published In

FTXS '16: Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
May 2016
58 pages
ISBN:9781450343497
DOI:10.1145/2909428
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fault tolerance
  2. high-performance computing
  3. resilience

Qualifiers

  • Research-article

Conference

HPDC'16
Acceptance Rates

Overall Acceptance Rate 16 of 25 submissions, 64%

Cited By

  • (2024) Checkpointing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms. ACM Transactions on Parallel Computing 11:1 (1-26). DOI: 10.1145/3624560. Online publication date: 11-Mar-2024
  • (2022) RDPM: An Extensible Tool for Resilience Design Patterns Modelling. Euro-Par 2021: Parallel Processing Workshops (283-297). DOI: 10.1007/978-3-031-06156-1_23. Online publication date: 9-Jun-2022
  • (2020) Models for Resilience Design Patterns. 2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) (21-30). DOI: 10.1109/FTXS51974.2020.00008. Online publication date: Nov-2020
  • (2020) Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers. Euro-Par 2020: Parallel Processing (37-51). DOI: 10.1007/978-3-030-57675-2_3. Online publication date: 24-Aug-2020
  • (2018) Lessons learned from memory errors observed over the lifetime of Cielo. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-12). DOI: 10.5555/3291656.3291714. Online publication date: 11-Nov-2018
  • (2018) Lessons learned from memory errors observed over the lifetime of Cielo. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-12). DOI: 10.1109/SC.2018.00046. Online publication date: 11-Nov-2018
  • (2018) The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints. Concurrency and Computation: Practice and Experience 32:3. DOI: 10.1002/cpe.4890. Online publication date: 9-Sep-2018
  • (2016) Soft Error Detection for Iterative Applications Using Offline Training. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC) (2-11). DOI: 10.1109/HiPC.2016.011. Online publication date: Dec-2016
