SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing

Authors:

Tatiana V. Martsinkevich,

Amina Guermouche,

André Schiper,

Franck Cappello

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 8, Pages 1 - 12

https://doi.org/10.1145/2503210.2503271

Published: 17 November 2013

Abstract

The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation between events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
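
To make the notion of channel-determinism more concrete, the following minimal sketch assumes the usual reading of the term: for a fixed input, the sequence of messages sent on each (sender, receiver) channel is the same in every execution, even when messages on different channels interleave differently. The abstract does not spell out the definition, so this reading, the trace format, and the helper names `per_channel_sequences` and `channel_equivalent` are illustrative assumptions and not part of SPBC itself.

```python
from collections import defaultdict


def per_channel_sequences(trace):
    """Group a global send trace into ordered per-channel payload lists.

    `trace` is a list of (sender, receiver, payload) tuples in the order
    the sends were observed. This format is hypothetical, chosen only
    for this sketch.
    """
    channels = defaultdict(list)
    for sender, receiver, payload in trace:
        channels[(sender, receiver)].append(payload)
    return dict(channels)


def channel_equivalent(trace_a, trace_b):
    """Return True if both traces carry the same sequence on every channel."""
    return per_channel_sequences(trace_a) == per_channel_sequences(trace_b)


# Two executions that differ only in how channels interleave.
run_1 = [(0, 1, "a"), (2, 1, "x"), (0, 1, "b")]
run_2 = [(2, 1, "x"), (0, 1, "a"), (0, 1, "b")]
# An execution where the order on channel 0 -> 1 itself changes.
run_3 = [(0, 1, "b"), (2, 1, "x"), (0, 1, "a")]

print(channel_equivalent(run_1, run_2))  # True
print(channel_equivalent(run_1, run_3))  # False
```

Under this assumed definition, `run_1` and `run_2` would both be acceptable executions of a channel-deterministic application, while `run_3` would not, because the order of messages on a single channel differs.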

Cited By

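Agullo E, Altenbernd M, Anzt H, Bautista-Gomez L, Benacchio T, Bonaventura L, Bungartz H, Chatterjee S, Ciorba F, DeBardeleben N, Drzisga D, Eibl S, Engelmann C, Gansterer W, Giraud L, Göddeke D, Heisig M, Jézéquel F, Kohl N, Li X, Lion R, Mehl M, Mycek P, Obersteiner M, Quintana-Ortí E, Rizzi F, Rüde U, Schulz M, Fung F, Speck R, Stals L, Teranishi K, Thibault S, Thönnes D, Wagner A, Wohlmuth B (2022). Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications, 36:2 (251-285). Online publication date: 1-Mar-2022.
https://dl.acm.org/doi/10.1177/10943420211055188

Anthony Q, Dai D (2021). Evaluating Multi-Level Checkpointing for Distributed Deep Neural Network Training. 2021 SC Workshops Supplementary Proceedings (SCWS) (60-67). Online publication date: Nov-2021.
https://doi.org/10.1109/SCWS55283.2021.00018

Coti C, Malony A (2021). DiPOSH: A portable OpenSHMEM implementation for short API-to-network path. Concurrency and Computation: Practice and Experience, 33:11. Online publication date: 4-Feb-2021.
https://doi.org/10.1002/cpe.6179

Losada N, Bouteiller A, Bosilca G (2019). Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications. 2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) (1-10). Online publication date: Nov-2019.
https://doi.org/10.1109/FTXS49593.2019.00006

Camargo E, Duarte E (2018). Running resilient MPI applications on a Dynamic Group of Recommended Processes. Journal of the Brazilian Computer Society, 24:1. Online publication date: 12-Mar-2018.
https://doi.org/10.1186/s13173-018-0069-z

Huber M, Rüde U, Wohlmuth B (2018). Adaptive control in roll-forward recovery for extreme scale multigrid. The International Journal of High Performance Computing Applications (109434201881708). Online publication date: 25-Dec-2018.
https://doi.org/10.1177/1094342018817088

Altenbernd M, Göddeke D (2018). Soft fault detection and correction for multigrid. International Journal of High Performance Computing Applications, 32:6 (897-912). Online publication date: 1-Nov-2018.
https://dl.acm.org/doi/10.1177/1094342016684006

Subasi O, Martsinkevich T, Zyulkyarov F, Unsal O, Labarta J, Cappello F (2018). Unified fault-tolerance framework for hybrid task-parallel message-passing applications. International Journal of High Performance Computing Applications, 32:5 (641-657). Online publication date: 1-Sep-2018.
https://dl.acm.org/doi/10.1177/1094342016669416

Hussain Z, Cui X, Znati T, Melhem R (2018). CoLoR: Co-Located Rescuers for Fault Tolerance in HPC Systems. 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS) (569-576). Online publication date: Dec-2018.
https://doi.org/10.1109/PADSW.2018.8644528

Gamell M, Teranishi K, Mayo J, Kolla H, Heroux M, Chen J, Parashar M (2017). Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales. IEEE Transactions on Parallel and Distributed Systems, 28:10 (2881-2895). Online publication date: 1-Oct-2017.
https://doi.org/10.1109/TPDS.2017.2696538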

Index Terms

SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
1. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
        Checkpoint / restart

Information & Contributors

Information

Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2013

1123 pages

ISBN:9781450323789

DOI:10.1145/2503210

General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC13

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

SC '13 Paper Acceptance Rate: 91 of 449 submissions, 20%

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
