Abstract
Fault tolerance (FT) is a common concern in HPC environments. Since the Message Passing Interface (MPI) is an HPC tool of paramount importance, one would expect FT to be a solved problem for it. It turns out that the relationship between FT and MPI is intricate: although FT is effectively a reality in these environments, it is usually implemented by hand, and the few available exceptions tie users to specific MPI implementations. This work proposes OCFTL, an implementation-independent FT library for MPI. OCFTL detects and propagates failures; handles false positives; exchanges a reduced number of messages during failure propagation; employs checkpointing to reduce the impact of failures; and detects sequential simultaneous failures with a shorter delay than related work.
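The failure detection and propagation the abstract mentions can be illustrated with a small sketch. This is not OCFTL's actual implementation; it is a generic heartbeat-style detector (all names are illustrative) showing timeout-based suspicion, rehabilitation of false positives, and propagation of failure notifications to observers:

```python
import threading
import time

class HeartbeatDetector:
    """Toy heartbeat failure detector: each rank is expected to report
    heartbeats; a rank silent for longer than `timeout` seconds is
    suspected, and the suspicion is propagated to registered observers."""

    def __init__(self, ranks, timeout=0.5):
        self.timeout = timeout
        self.last_seen = {r: time.monotonic() for r in ranks}
        self.suspected = set()
        self.observers = []          # callbacks notified on failure (propagation)
        self.lock = threading.Lock()

    def heartbeat(self, rank):
        with self.lock:
            self.last_seen[rank] = time.monotonic()
            # a late heartbeat from a suspected rank is a false positive:
            # rehabilitate the rank instead of keeping it marked as failed
            self.suspected.discard(rank)

    def check(self):
        now = time.monotonic()
        with self.lock:
            newly = [r for r, t in self.last_seen.items()
                     if now - t > self.timeout and r not in self.suspected]
            self.suspected.update(newly)
        for r in newly:
            for cb in self.observers:
                cb(r)                # propagate the failure notification

# usage: rank 1 stops sending heartbeats and is eventually suspected
det = HeartbeatDetector(ranks=[0, 1, 2], timeout=0.5)
failed = []
det.observers.append(failed.append)
for _ in range(3):                   # ranks 0 and 2 keep reporting
    det.heartbeat(0)
    det.heartbeat(2)
    time.sleep(0.2)
det.check()
print(sorted(det.suspected))         # only rank 1 exceeded the timeout
```

A real detector would of course drive `heartbeat` from network messages and run `check` periodically; the sketch only conveys the suspicion/propagation logic.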
Notes
- 1.
Coordinated checkpointing requires synchronization of checkpoint functions in every process, while uncoordinated checkpointing does not require synchronization.
- 2.
- 3.
The source code of OCFTL’s wrappers can be found in the library implementation, available at https://gitlab.com/phrosso/ftmpi-tests/-/tree/master/OCFTLBench/ftlib.
- 4.
The tests and the library are available at: https://gitlab.com/phrosso/ftmpi-tests.
- 5.
UCX is an optimized framework for communications between nodes in high-bandwidth and low-latency networks, which is commonly used with MPI.
- 6.
We consider an MPI program to be stressed when it is overloaded by MPI message exchanges. Benchmarks are examples of stressed MPI programs, since they overload the MPI runtime with MPI operations to establish the limits of an MPI distribution.
- 7.
Available at https://github.com/intel/mpi-benchmarks.
- 8.
All test results are available in the aforementioned Git repository. Due to space limitations, here we show only the results with the most significant overheads (i.e., the overheads for the remaining configurations are lower than those presented here).
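The coordinated/uncoordinated distinction from Note 1 can be sketched without MPI. In the following illustrative snippet (threads stand in for MPI processes; all names are made up), a barrier is the synchronization step that makes every checkpoint in the set belong to the same consistent global state; uncoordinated checkpointing would simply omit the barrier:

```python
import threading

NPROCS = 4
barrier = threading.Barrier(NPROCS)   # the coordination point
checkpoints = {}
lock = threading.Lock()

def worker(rank, state):
    # ... compute phase would run here ...
    barrier.wait()                    # all processes synchronize first
    with lock:
        checkpoints[rank] = state     # then each writes its checkpoint

threads = [threading.Thread(target=worker, args=(r, r * 10))
           for r in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(checkpoints.items()))    # one checkpoint per rank, same epoch
```

In an MPI setting the barrier would typically be `MPI_Barrier` (or an equivalent agreement step) and the state would go to stable storage; the sketch only shows why coordination yields a consistent checkpoint set.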
Acknowledgment
The authors are grateful to the Center of Petroleum Studies (CEPETRO-Unicamp/Brazil) and PETROBRAS S/A for supporting this work as part of the BRCloud Project.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Di Francia Rosso, P.H., Francesquini, E. (2022). OCFTL: An MPI Implementation-Independent Fault Tolerance Library for Task-Based Applications. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_10
Print ISBN: 978-3-031-04208-9
Online ISBN: 978-3-031-04209-6