OCFTL: An MPI Implementation-Independent Fault Tolerance Library for Task-Based Applications

  • Conference paper
  • First Online:
High Performance Computing (CARLA 2021)

Abstract

Fault tolerance (FT) is a common concern in HPC environments. One would expect that, where the Message Passing Interface (MPI), an HPC tool of paramount importance, is concerned, FT would be a solved problem. It turns out that the scenario for FT and MPI is intricate: while FT is effectively a reality in these environments, it is usually implemented by hand, and the few ready-made exceptions tie MPI users to specific MPI implementations. This work proposes OCFTL, an implementation-independent FT library for MPI. OCFTL can detect and propagate failures; provides false-positive detection; exchanges a reduced number of messages during failure propagation; employs checkpointing to reduce the impact of failures; and shows a lower delay when detecting sequential simultaneous failures than related works.
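To make the abstract's detection-and-propagation vocabulary concrete, the sketch below shows a ring-style heartbeat failure detector written directly against plain MPI. It is a hedged illustration only: the constants (HB_PERIOD, HB_TIMEOUT), the ring topology, and the single-threaded polling loop are assumptions made for this example and do not describe OCFTL's actual design or API.

```c
/* A minimal, self-contained sketch of heartbeat-based failure detection.
 * This is NOT OCFTL's API: names, constants, and the ring topology are
 * assumptions made for illustration only.                                 */
#include <mpi.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define HB_TAG     100   /* tag reserved for heartbeat messages (assumed) */
#define HB_PERIOD  0.5   /* seconds between heartbeats (assumed)          */
#define HB_TIMEOUT 2.0   /* silence threshold before suspicion (assumed)  */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int succ = (rank + 1) % size;          /* I send heartbeats to succ   */
    int pred = (rank + size - 1) % size;   /* I watch pred for heartbeats */

    char ping = 1, pong;
    MPI_Request recv_req;
    MPI_Irecv(&pong, 1, MPI_CHAR, pred, HB_TAG, MPI_COMM_WORLD, &recv_req);

    double last_sent  = MPI_Wtime() - HB_PERIOD;   /* send right away     */
    double last_heard = MPI_Wtime();
    bool   pred_suspected = false;

    /* A real library runs this loop in a helper thread next to the
     * application; here it simply runs for a fixed number of iterations. */
    for (int it = 0; it < 50 && !pred_suspected; ++it) {
        double now = MPI_Wtime();

        if (now - last_sent >= HB_PERIOD) {         /* emit my heartbeat  */
            MPI_Send(&ping, 1, MPI_CHAR, succ, HB_TAG, MPI_COMM_WORLD);
            last_sent = now;
        }

        int arrived = 0;
        MPI_Test(&recv_req, &arrived, MPI_STATUS_IGNORE);
        if (arrived) {                              /* predecessor alive  */
            last_heard = now;
            MPI_Irecv(&pong, 1, MPI_CHAR, pred, HB_TAG, MPI_COMM_WORLD,
                      &recv_req);
        } else if (now - last_heard > HB_TIMEOUT) {
            /* A full detector would now propagate the suspicion to the
             * remaining ranks (e.g. by forwarding it along the ring).    */
            pred_suspected = true;
            printf("rank %d suspects rank %d has failed\n", rank, pred);
        }

        usleep(100000);   /* 100 ms polling interval (assumed) */
    }

    MPI_Cancel(&recv_req);                    /* clean up the pending recv */
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
```

A production-quality library such as OCFTL additionally has to propagate suspicions to every rank with few messages, filter false positives, and keep working while the application holds the MPI runtime busy; none of that is shown in this sketch.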


Notes

  1. Coordinated checkpointing requires synchronization of the checkpoint functions in every process, while uncoordinated checkpointing does not require synchronization (a minimal illustration appears after this list).

  2. https://www.open-mpi.org/faq/?category=ft#ft-future.

  3. The source code of OCFTL's wrappers can be found in the library implementation, available at https://gitlab.com/phrosso/ftmpi-tests/-/tree/master/OCFTLBench/ftlib.

  4. The tests and the library are available at https://gitlab.com/phrosso/ftmpi-tests.

  5. UCX is an optimized framework for communication between nodes in high-bandwidth, low-latency networks, commonly used with MPI.

  6. We consider an MPI program stressed when it is overloaded by MPI message exchanges. Benchmarks are examples of stressed MPI programs, since they overload the MPI runtime with MPI operations to establish the limits of an MPI distribution.

  7. Available at https://github.com/intel/mpi-benchmarks.

  8. All test results are available in the aforementioned Git repository. Due to space limitations, we show only the results with the most significant overheads (i.e., the overheads for the remaining configurations are lower than those presented here).
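The distinction drawn in note 1 fits in a few lines of MPI. The sketch below is an illustration under assumed names and a trivial payload, not OCFTL's checkpointing code; the only structural difference between the two flavours is the synchronization step.

```c
/* Illustrative sketch for note 1 (not OCFTL code). File names, helper
 * names, and the checkpoint payload are hypothetical.                    */
#include <mpi.h>
#include <stdio.h>

/* Write this rank's local state into its own checkpoint file. */
static void write_local_checkpoint(int rank, const double *state, int n)
{
    char path[64];
    snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
    FILE *f = fopen(path, "wb");
    if (f != NULL) {
        fwrite(state, sizeof(double), (size_t)n, f);
        fclose(f);
    }
}

/* Coordinated: every process takes the checkpoint at the same logical
 * point, so the per-rank files form one consistent global snapshot.      */
static void coordinated_checkpoint(MPI_Comm comm, const double *state, int n)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Barrier(comm);            /* synchronize before writing           */
    write_local_checkpoint(rank, state, n);
    MPI_Barrier(comm);            /* every file is complete past here     */
}

/* Uncoordinated: each process checkpoints on its own schedule. No
 * synchronization is needed, but recovery must reconcile snapshots
 * taken at different points in time.                                     */
static void uncoordinated_checkpoint(MPI_Comm comm, const double *state, int n)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    write_local_checkpoint(rank, state, n);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double state[4] = {0.0, 1.0, 2.0, 3.0};   /* stand-in application state */
    coordinated_checkpoint(MPI_COMM_WORLD, state, 4);
    uncoordinated_checkpoint(MPI_COMM_WORLD, state, 4);
    MPI_Finalize();
    return 0;
}
```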


Acknowledgment

The authors are grateful to the Center of Petroleum Studies (CEPETRO-Unicamp/Brazil) and PETROBRAS S/A for their support of this work as part of the BRCloud Project.

Author information


Corresponding author

Correspondence to Pedro Henrique Di Francia Rosso.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Di Francia Rosso, P.H., Francesquini, E. (2022). OCFTL: An MPI Implementation-Independent Fault Tolerance Library for Task-Based Applications. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_10


  • DOI: https://doi.org/10.1007/978-3-031-04209-6_10

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04208-9

  • Online ISBN: 978-3-031-04209-6

  • eBook Packages: Computer Science, Computer Science (R0)
