Abstract
The reliability mechanisms for future exascale systems will be a key aspect of their scalability and performance. With the expected jump in hardware component counts, faults will become increasingly common compared to today’s systems. Under these circumstances, the costs of current and emergent resilience methods need to be reevaluated. This includes the cost of recovery, which is often ignored in current work, and the impact of hardware features such as heterogeneous computing elements and non-volatile memory devices. We describe a simulation and modeling framework that enables the measurement of various resilience algorithms with varying application characteristics. For this framework we outline the simulator’s requirements, its application communication pattern generators, and a few of the key hardware component models.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Riesen, R., Ferreira, K.B., Varela, M.R., Taufer, M., Rodrigues, A. (2012). Simulating Application Resilience at Exascale. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_26
DOI: https://doi.org/10.1007/978-3-642-29740-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
eBook Packages: Computer Science, Computer Science (R0)