DOI: 10.1145/2503210.2503271
research-article

SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing

Published: 17 November 2013

Abstract

The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation between the events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, hierarchically combines coordinated checkpointing and message logging. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
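
One possible reading of the hierarchical combination described in the abstract can be sketched as a toy Python model. Everything in it is an illustrative assumption rather than the paper's actual protocol: the `Process` class, the function names, the grouping of ranks into clusters, and the choice to keep inter-cluster message logs in the sender's volatile memory (picked here because the abstract states that nothing apart from checkpoints is logged reliably). Within a cluster, processes checkpoint together and messages are not logged; across clusters, the sender keeps a copy so that a failed cluster can be restarted alone.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical toy process; not the data structure used by SPBC."""
    rank: int
    cluster: int
    # sender rank -> number of messages delivered so far
    received: dict = field(default_factory=dict)
    # destination rank -> payloads sent to another cluster (kept in memory only)
    sender_log: dict = field(default_factory=dict)
    # copy of `received` taken at the last cluster checkpoint
    saved_received: dict = field(default_factory=dict)


def send(src, dst, payload):
    """Deliver a message; log it at the sender only if it crosses clusters."""
    if src.cluster != dst.cluster:
        src.sender_log.setdefault(dst.rank, []).append(payload)
    dst.received[src.rank] = dst.received.get(src.rank, 0) + 1


def checkpoint_cluster(procs, cluster):
    """Coordinated checkpoint restricted to the members of one cluster."""
    for p in procs:
        if p.cluster == cluster:
            p.saved_received = dict(p.received)


def recover_cluster(procs, cluster):
    """Roll back only the failed cluster and list the messages to replay.

    Returns (ranks rolled back, [(sender, receiver, payload), ...]).
    """
    members = [p for p in procs if p.cluster == cluster]
    for p in members:
        p.received = dict(p.saved_received)
    replay = []
    for s in procs:
        if s.cluster == cluster:
            continue  # surviving senders only
        for p in members:
            log = s.sender_log.get(p.rank, [])
            already = p.received.get(s.rank, 0)  # delivered before checkpoint
            replay.extend((s.rank, p.rank, m) for m in log[already:])
    return [p.rank for p in members], replay


# Two clusters of two processes each.
procs = [Process(0, 0), Process(1, 0), Process(2, 1), Process(3, 1)]
send(procs[2], procs[0], "a")   # inter-cluster, before the checkpoint
checkpoint_cluster(procs, 0)
send(procs[2], procs[0], "b")   # inter-cluster, after the checkpoint
send(procs[1], procs[0], "c")   # intra-cluster, never logged
send(procs[3], procs[1], "d")   # inter-cluster, after the checkpoint
rolled_back, replay = recover_cluster(procs, 0)
print(rolled_back, replay)
```

In this toy model, the processes of cluster 1 never roll back, which is the failure-containment property the abstract refers to. The sketch leaves out what a real protocol must also handle, such as discarding duplicate messages that the restarted cluster re-sends and garbage-collecting logs after a checkpoint.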

    Published In

    SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2013
    1123 pages
    ISBN:9781450323789
    DOI:10.1145/2503210
    General Chair: William Gropp
    Program Chair: Satoshi Matsuoka

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    SC13

    Acceptance Rates

    SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications 36:2 (251-285). DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022
    • (2021) Evaluating Multi-Level Checkpointing for Distributed Deep Neural Network Training. 2021 SC Workshops Supplementary Proceedings (SCWS) (60-67). DOI: 10.1109/SCWS55283.2021.00018. Online publication date: Nov-2021
    • (2021) DiPOSH: A portable OpenSHMEM implementation for short API‐to‐network path. Concurrency and Computation: Practice and Experience 33:11. DOI: 10.1002/cpe.6179. Online publication date: 4-Feb-2021
    • (2019) Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications. 2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) (1-10). DOI: 10.1109/FTXS49593.2019.00006. Online publication date: Nov-2019
    • (2018) Running resilient MPI applications on a Dynamic Group of Recommended Processes. Journal of the Brazilian Computer Society 24:1. DOI: 10.1186/s13173-018-0069-z. Online publication date: 12-Mar-2018
    • (2018) Adaptive control in roll-forward recovery for extreme scale multigrid. The International Journal of High Performance Computing Applications (109434201881708). DOI: 10.1177/1094342018817088. Online publication date: 25-Dec-2018
    • (2018) Soft fault detection and correction for multigrid. International Journal of High Performance Computing Applications 32:6 (897-912). DOI: 10.1177/1094342016684006. Online publication date: 1-Nov-2018
    • (2018) Unified fault-tolerance framework for hybrid task-parallel message-passing applications. International Journal of High Performance Computing Applications 32:5 (641-657). DOI: 10.1177/1094342016669416. Online publication date: 1-Sep-2018
    • (2018) CoLoR: Co-Located Rescuers for Fault Tolerance in HPC Systems. 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS) (569-576). DOI: 10.1109/PADSW.2018.8644528. Online publication date: Dec-2018
    • (2017) Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales. IEEE Transactions on Parallel and Distributed Systems 28:10 (2881-2895). DOI: 10.1109/TPDS.2017.2696538. Online publication date: 1-Oct-2017
