Fault Tolerant MPI for the HARNESS Meta-computing System

Graham E. Fagg, Antonin Bukovsky & Jack J. Dongarra

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2073)

Included in the following conference series:

International Conference on Computational Science

Abstract

Initial versions of MPI were designed to work efficiently on multiprocessors that had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model suitable for use on clusters or distributed systems would have reduced their performance. As current collaborative HPC applications increase in size and distribution, the potential for node and network failures increases, and the need arises for new fault-tolerant systems. Here we present a new implementation of MPI called FT-MPI that allows the semantics and associated modes of failure to be explicitly controlled by an application via a modified MPI API. We give an overview of the FT-MPI semantics, design, and example applications, and discuss some performance issues such as efficient group communications and complex data handling.

Author information

Authors and Affiliations

Department of Computer Science, University of Tennessee, Suite 203, 1122 Volunteer Blvd., Knoxville, TN-37996-3450, USA
Graham E. Fagg, Antonin Bukovsky & Jack J. Dongarra

Editor information

Editors and Affiliations

School of Computer Science, Cybernetics and Electronic Engineering, University of Reading, Whiteknights, P.O. Box 225, Reading, RG6 6AY, UK
Vassil N. Alexandrov
Innovative Computing Lab, Computer Science Department, University of Tennessee, 1122 Volunteer Blvd, Knoxville, TN, 37996-3450, USA
Jack J. Dongarra
Computer Science Department, California State University, Chico, CA, 95929-0410, USA
Benjoe A. Juliano & René S. Renner
School of Computer Science, The Queen’s University of Belfast, Belfast, BT7 1NN, Northern Ireland, UK
C. J. Kenneth Tan

About this paper

Cite this paper

Fagg, G.E., Bukovsky, A., Dongarra, J.J. (2001). Fault Tolerant MPI for the HARNESS Meta-computing System. In: Alexandrov, V.N., Dongarra, J.J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds) Computational Science — ICCS 2001. ICCS 2001. Lecture Notes in Computer Science, vol 2073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45545-0_44

DOI: https://doi.org/10.1007/3-540-45545-0_44
Published: 17 July 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42232-7
Online ISBN: 978-3-540-45545-5
eBook Packages: Springer Book Archive
