Abstract
Fault tolerance (FT) is a common concern in HPC environments. Since the Message Passing Interface (MPI) is an HPC tool of paramount importance, one would expect FT to be a solved problem for it. It turns out that the relationship between FT and MPI is intricate: although FT is effectively a reality in these environments, it is usually implemented by hand, and the few available exceptions tie users to specific MPI implementations. This work proposes OCFTL, an implementation-independent FT library for MPI. OCFTL detects and propagates failures; handles false positives; exchanges a reduced number of messages during failure propagation; employs checkpointing to reduce the impact of failures; and detects sequential simultaneous failures with a shorter delay than related work.
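The failure detection and propagation the abstract mentions can be illustrated with a small sketch. This is not OCFTL's actual implementation; it is a generic heartbeat-style detector (all names are illustrative) showing timeout-based suspicion, rehabilitation of false positives, and propagation of failure notifications to observers:

```python
import threading
import time

class HeartbeatDetector:
    """Toy heartbeat failure detector: each rank is expected to report
    heartbeats; a rank silent for longer than `timeout` seconds is
    suspected, and the suspicion is propagated to registered observers."""

    def __init__(self, ranks, timeout=0.5):
        self.timeout = timeout
        self.last_seen = {r: time.monotonic() for r in ranks}
        self.suspected = set()
        self.observers = []          # callbacks notified on failure (propagation)
        self.lock = threading.Lock()

    def heartbeat(self, rank):
        with self.lock:
            self.last_seen[rank] = time.monotonic()
            # a late heartbeat from a suspected rank is a false positive:
            # rehabilitate the rank instead of keeping it marked as failed
            self.suspected.discard(rank)

    def check(self):
        now = time.monotonic()
        with self.lock:
            newly = [r for r, t in self.last_seen.items()
                     if now - t > self.timeout and r not in self.suspected]
            self.suspected.update(newly)
        for r in newly:
            for cb in self.observers:
                cb(r)                # propagate the failure notification

# usage: rank 1 stops sending heartbeats and is eventually suspected
det = HeartbeatDetector(ranks=[0, 1, 2], timeout=0.5)
failed = []
det.observers.append(failed.append)
for _ in range(3):                   # ranks 0 and 2 keep reporting
    det.heartbeat(0)
    det.heartbeat(2)
    time.sleep(0.2)
det.check()
print(sorted(det.suspected))         # only rank 1 exceeded the timeout
```

A real detector would of course drive `heartbeat` from network messages and run `check` periodically; the sketch only conveys the suspicion/propagation logic.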
Notes
- 1.
Coordinated checkpointing requires synchronization of checkpoint functions in every process, while uncoordinated checkpointing does not require synchronization.
- 2.
- 3.
The source code of OCFTL’s wrappers can be found in the library implementation, available at https://gitlab.com/phrosso/ftmpi-tests/-/tree/master/OCFTLBench/ftlib.
- 4.
The tests and the library are available at: https://gitlab.com/phrosso/ftmpi-tests.
- 5.
UCX is an optimized framework for communications between nodes in high-bandwidth and low-latency networks, which is commonly used with MPI.
- 6.
We consider an MPI program to be stressed when it is overloaded by MPI message exchanges. Benchmarks are examples of stressed MPI programs, since they overload the MPI runtime with MPI operations to establish the limits of an MPI distribution.
- 7.
Available at https://github.com/intel/mpi-benchmarks.
- 8.
All test results are available in the aforementioned Git repository. Due to space limitations, here we show only the results with the most significant overheads (i.e., the overheads for the remaining configurations are lower than those presented here).
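The coordinated/uncoordinated distinction from Note 1 can be sketched without MPI. In the following illustrative snippet (threads stand in for MPI processes; all names are made up), a barrier is the synchronization step that makes every checkpoint in the set belong to the same consistent global state; uncoordinated checkpointing would simply omit the barrier:

```python
import threading

NPROCS = 4
barrier = threading.Barrier(NPROCS)   # the coordination point
checkpoints = {}
lock = threading.Lock()

def worker(rank, state):
    # ... compute phase would run here ...
    barrier.wait()                    # all processes synchronize first
    with lock:
        checkpoints[rank] = state     # then each writes its checkpoint

threads = [threading.Thread(target=worker, args=(r, r * 10))
           for r in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(checkpoints.items()))    # one checkpoint per rank, same epoch
```

In an MPI setting the barrier would typically be `MPI_Barrier` (or an equivalent agreement step) and the state would go to stable storage; the sketch only shows why coordination yields a consistent checkpoint set.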
Acknowledgment
The authors are grateful to the Center of Petroleum Studies (CEPETRO-Unicamp/Brazil) and PETROBRAS S/A for supporting this work as part of the BRCloud Project.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Di Francia Rosso, P.H., Francesquini, E. (2022). OCFTL: An MPI Implementation-Independent Fault Tolerance Library for Task-Based Applications. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_10
Print ISBN: 978-3-031-04208-9
Online ISBN: 978-3-031-04209-6