Using Replication for Resilience on Exascale Systems

  • Chapter
Fault-Tolerance Techniques for High-Performance Computing

Part of the book series: Computer Communications and Networks (CCN)

Abstract

High-performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint–recovery, in which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even with an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. One such mechanism is replication, which can be used in addition to checkpoint–recovery. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily mean application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using checkpoint–recovery alone at large scale. In this work, we investigate two approaches to replication. In the first approach, entire application instances are replicated. In the second approach, each process in a single application instance is (transparently) replicated. We provide a theoretical study of these two approaches, comparing them to the pure checkpoint–recovery approach in terms of expected application execution times.
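
The claim that checkpointing overhead grows with scale can be illustrated with a rough sketch. The code below is not the model studied in the chapter. It assumes independent processors with exponentially distributed failures (so the platform mean time between failures is the individual one divided by the processor count), uses the well-known first-order approximation of the optimal checkpointing period, sqrt(2 × checkpoint cost × platform MTBF), and estimates the fraction of time lost as checkpoint cost / period + period / (2 × platform MTBF). All function names and numeric values are hypothetical, and the approximation is only meaningful while the checkpoint cost is much smaller than the platform MTBF.

```python
import math


def platform_mtbf(individual_mtbf_s, num_processors):
    """Platform MTBF, assuming independent, exponentially distributed failures."""
    return individual_mtbf_s / num_processors


def first_order_period(checkpoint_cost_s, mtbf_s):
    """First-order approximation of the optimal checkpointing period."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(checkpoint_cost_s, mtbf_s):
    """Approximate fraction of time lost to checkpoints and re-execution.

    Only valid when checkpoint_cost_s is much smaller than mtbf_s.
    """
    period = first_order_period(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / period + period / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical values: 10-year individual MTBF, 60-second checkpoints.
    individual_mtbf_s = 10 * 365 * 24 * 3600
    checkpoint_cost_s = 60.0
    for num_processors in (1_000, 10_000, 100_000):
        mtbf_s = platform_mtbf(individual_mtbf_s, num_processors)
        waste = first_order_waste(checkpoint_cost_s, mtbf_s)
        print(f"{num_processors:>7} processors: waste ~ {waste:.3f}")
```

Under these assumptions the estimated waste rises as the processor count grows, which is the motivation for complementing checkpoint–recovery with replication.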

Author information

Corresponding author

Correspondence to Frédéric Vivien.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Casanova, H., Vivien, F., Zaidouni, D. (2015). Using Replication for Resilience on Exascale Systems. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_4

  • DOI: https://doi.org/10.1007/978-3-319-20943-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20942-5

  • Online ISBN: 978-3-319-20943-2

  • eBook Packages: Computer Science (R0)
