Simulating Application Resilience at Exascale

Rolf Riesen, Kurt B. Ferreira, Maria Ruiz Varela, Michela Taufer & Arun Rodrigues

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7156)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

The reliability mechanisms for future exascale systems will be a key aspect of their scalability and performance. With the expected jump in hardware component counts, faults will become increasingly common compared to today’s systems. Under these circumstances, the costs of current and emergent resilience methods need to be reevaluated. This includes the cost of recovery, which is often ignored in current work, and the impact of hardware features such as heterogeneous computing elements and non-volatile memory devices. We describe a simulation and modeling framework that enables the measurement of various resilience algorithms with varying application characteristics. For this framework we outline the simulator’s requirements, its application communication pattern generators, and a few of the key hardware component models.

Author information

Authors and Affiliations

IBM Research, Ireland
Rolf Riesen
Sandia National Laboratories, Albuquerque, NM, 87123, USA
Kurt B. Ferreira & Arun Rodrigues
University of Delaware, USA
Maria Ruiz Varela & Michela Taufer

Editor information

Editors and Affiliations

Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria
Michael Alexander
ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy
Pasqua D’Ambra
University of Amsterdam, 1090, Amsterdam, Netherlands
Adam Belloum
Innovative Computing Laboratory, The University of Tennessee, US
George Bosilca
Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy
Mario Cannataro
Computer Science Department, University of Pisa, Italy
Marco Danelutto
Second University of Naples, Italy
Beniamino Di Martino
TU München, Boltzmannstr. 3, 85748, Garching, Germany
Michael Gerndt
Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Emmanuel Jeannot & Raymond Namyst
Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Jean Roman
Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA
Stephen L. Scott
Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria
Jesper Larsson Traff
Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA
Geoffroy Vallée
Technische Universität München, Germany
Josef Weidendorfer

Cite this paper

Riesen, R., Ferreira, K.B., Varela, M.R., Taufer, M., Rodrigues, A. (2012). Simulating Application Resilience at Exascale. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_26

DOI: https://doi.org/10.1007/978-3-642-29740-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
eBook Packages: Computer Science, Computer Science (R0)
