Abstract
High-performance computing applications must be resilient to faults. The traditional fault tolerance solution is checkpoint–recovery, by which application state is saved to secondary storage throughout execution and recovered from it after a failure. It has been shown that, even with an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault tolerance mechanisms must thus be used. One such mechanism is replication, which can be used in addition to checkpoint–recovery. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily mean application failure. While at first glance replication may seem wasteful, at large scale it may be significantly more efficient than checkpoint–recovery alone. In this work we investigate two approaches to replication. In the first approach, entire application instances are replicated. In the second approach, each process in a single application instance is (transparently) replicated. We provide a theoretical study of these two approaches, comparing them to the pure checkpoint–recovery approach in terms of expected application execution times.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Casanova, H., Vivien, F., Zaidouni, D. (2015). Using Replication for Resilience on Exascale Systems. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_4
DOI: https://doi.org/10.1007/978-3-319-20943-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20942-5
Online ISBN: 978-3-319-20943-2
eBook Packages: Computer Science, Computer Science (R0)