Using Replication for Resilience on Exascale Systems

Henri Casanova,
Frédéric Vivien &
Dounia Zaidouni

Part of the book series: Computer Communications and Networks (CCN)

Abstract

High-performance computing applications must be resilient to faults. The traditional fault tolerance solution is checkpoint–recovery, in which application state is periodically saved to secondary storage during execution and restored from it after a failure. It has been shown that, even with an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault tolerance mechanisms must therefore be used. One such mechanism is replication, which can be combined with checkpoint–recovery. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily cause an application failure. Although replication may seem wasteful at first glance, it can be significantly more efficient at large scale than checkpoint–recovery alone. In this work we investigate two approaches to replication. In the first approach, entire application instances are replicated. In the second approach, each process in a single application instance is (transparently) replicated. We provide a theoretical study of these two approaches, comparing them to the pure checkpoint–recovery approach in terms of expected application execution time.
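
The intuition that a processor failure need not mean application failure under replication can be illustrated with a small calculation. The sketch below is not the chapter's model. It assumes independent, exponentially distributed processor failures, no repair or restart within the time window considered, and process replication with exactly two replicas per process. The function names and the parameter values (processor count, per-processor MTBF, window length) are hypothetical and chosen only for illustration.

```python
import math


def survival_no_replication(n_procs: int, mtbf: float, t: float) -> float:
    """Probability that none of n_procs processors fails during a window t.

    Assumes iid exponential failures with per-processor mean `mtbf`;
    without replication, any single failure interrupts the application.
    """
    return math.exp(-n_procs * t / mtbf)


def survival_dual_replication(n_procs: int, mtbf: float, t: float) -> float:
    """Probability that the application survives a window t when the
    n_procs processors are grouped into n_procs // 2 replica pairs.

    The application is interrupted only if both replicas of some pair
    fail within the window (failed replicas are assumed not to restart).
    """
    pairs = n_procs // 2
    p_fail = -math.expm1(-t / mtbf)  # P(one given processor fails within t)
    # (1 - p_fail^2)^pairs, evaluated in log space for numerical accuracy.
    return math.exp(pairs * math.log1p(-p_fail * p_fail))


if __name__ == "__main__":
    n = 2 ** 20                        # hypothetical processor count
    mu = 10 * 365 * 24 * 3600.0        # hypothetical 10-year MTBF, in seconds
    window = 3600.0                    # one hour of execution
    print("no replication  :", survival_no_replication(n, mu, window))
    print("dual replication:", survival_dual_replication(n, mu, window))
```

Under these assumptions the replicated run uses only half of the processors for distinct work, but it is far more likely to get through the window without an interruption. That gap is what can make replication combined with checkpoint–recovery competitive at large scale.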

Author information

Authors and Affiliations

University of Hawai‘i, Manoa, USA
Henri Casanova
INRIA & Ecole Normale Supérieure de Lyon, Lyon, France
Frédéric Vivien & Dounia Zaidouni

Corresponding author

Correspondence to Frédéric Vivien.

Editor information

Editors and Affiliations

University of Tennessee, Knoxville, Tennessee, USA
Thomas Herault
Ecole Normale Supérieure de Lyon, Lyon, France
Yves Robert

About this chapter

Cite this chapter

Casanova, H., Vivien, F., Zaidouni, D. (2015). Using Replication for Resilience on Exascale Systems. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_4

DOI: https://doi.org/10.1007/978-3-319-20943-2_4
Published: 02 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20942-5
Online ISBN: 978-3-319-20943-2
eBook Packages: Computer Science, Computer Science (R0)
