Algorithm-based fault tolerance applied to high performance computing

Published: 01 April 2009

Abstract

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518-528] to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine, and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication achieves 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even though one process failure occurred during the run. This represents 65% of the machine's peak performance and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
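To make the checksum idea behind Algorithm-Based Fault Tolerance concrete, the following is a minimal serial sketch of the classical Huang and Abraham encoding for matrix multiplication. It is not the parallel subroutine described in the abstract: the function names are hypothetical, the matrices are small Python lists, and only a single corrupted data entry is handled. A column-checksum version of A multiplied by a row-checksum version of B yields a full-checksum product, whose extra row and column let a single wrong entry be located and recomputed.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row[:] + [sum(row)] for row in b]


def locate_and_correct(cf, tol=1e-9):
    """Verify a full-checksum matrix and repair one corrupted data entry.

    Returns the (row, col) that was repaired, or None if all checksums agree.
    Raises ValueError for any other error pattern (assumption of this sketch).
    """
    n, m = len(cf) - 1, len(cf[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n)
                if abs(sum(cf[i][:m]) - cf[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(cf[i][j] for i in range(n)) - cf[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from the row checksum and the other entries in the row.
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return (i, j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
cf = matmul(add_column_checksum(a), add_row_checksum(b))
cf[1][0] += 1024                 # simulate a bit flip in one result entry
print(locate_and_correct(cf))    # -> (1, 0)
print(cf[1][0])                  # -> 43
```

The example uses integers so the checksum comparison is exact; with floating-point data the tolerance `tol` would have to be chosen relative to the magnitude of the entries and the expected rounding error.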

References

[1] A. Agbaria, R. Friedman, Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations, in: 8th IEEE International Symposium on High Performance Distributed Computing, 1999
[2] Banerjee, P. and Abraham, J.A., Bounds on algorithm-based fault tolerance in multiple processor systems. IEEE Transactions on Computers. v35 i4. 296-306.
[3] Banerjee, P., Rahmeh, J.T., Stunkel, C., Nair, V.S., Roy, K., Balasubramanian, V. and Abraham, J.A., Algorithm-based fault-tolerance on a hypercube multiprocessor. IEEE Transactions on Computers. v35 i9. 1132-1145.
[4] A. Bouteiller, G. Bosilca, J. Dongarra, Redesigning the message logging model for high performance, in: ISC 2008, International Supercomputing Conference, Dresden, Germany, June 17-20, 2008
[5] A. Bouteiller, P. Lemarinier, G. Krawezik, F. Cappello, Coordinated checkpoint versus message log for fault tolerant MPI, in: Proceedings of Cluster 2003, Hong Kong, December 2003
[6] G. Burns, R. Daoud, J. Vaigl, LAM: An open cluster environment for MPI, in: Proceedings of Supercomputing Symposium, 1994, pp. 379-386
[7] J. Casas, et al. MIST: PVM with transparent migration and checkpointing, 3rd Annual PVM Users' Group Meeting, Pittsburgh, PA, 1995
[8] Charm++ web site. http://charm.cs.uiuc.edu/u/ft/
[9] Chen, Z. and Dongarra, J., Condition numbers of Gaussian random matrices. SIAM Journal on Matrix Analysis and Applications. v27 i3. 603-620.
[10] T. Chiueh, P. Deng, Efficient checkpoint mechanisms for massively parallel machines, in: 26th International Symposium on Fault-Tolerant Computing, Sendai, June 1996
[11] A. Clematis, V. Gianuzzi, CPVM - Extending PVM for consistent checkpointing, in: 4th Euromicro Workshop on Parallel and Distributed Processing, Braga, January 1996
[12] Elnozahy, E.N., Alvisi, L., Wang, Y.-M. and Johnson, D.B., A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys. v34 i3. 375-408.
[13] E.N. Elnozahy, D.B. Johnson, W. Zwaenepoel, The performance of consistent checkpointing, in: 11th Symposium on Reliable Distributed Systems, October 1992
[14] Elnozahy, E.N. and Plank, J.S., Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Transactions on Dependable and Secure Computing. v1 i2. 97-108.
[15] Engelmann, C. and Geist, G.A., A diskless checkpointing algorithm for super-scale architectures applied to the fast Fourier transform. In: Challenges of Large Applications in Distributed Environments,
[16] Huang, K. and Abraham, J., Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.). v33. 518-528.
[17] Y. Huang, Y.-M. Wang, Why optimistic message logging has not been used in telecommunication systems, 1995
[18] Y. Kim, J.S. Plank, J. Dongarra, Fault tolerant matrix operations for networks of workstations using multiple checkpointing, in: High Performance Computing on the Information Superhighway, HPC Asia'97, Seoul, Korea, 1997
[19] J. Leon, A.L. Fisher, P. Steenkiste, Fail-safe PVM: A portable package for distributed programming with transparent recovery, Technical Report CMU-CS-93-124, Carnegie Mellon University, February 1993
[20] W.-J. Li, J.-J. Tsay, Checkpointing message-passing interface (MPI) parallel programs, in: Pacific Rim International Symposium on Fault-Tolerant Systems, December 1997
[21] C.D. Lu, Scalable diskless checkpointing for large parallel systems, Ph.D. Dissertation, University of Illinois at Urbana-Champaign, 2005
[22] Luk, F.T. and Park, H., An analysis of algorithm-based fault tolerance techniques. Journal of Parallel and Distributed Computing. v5. 172-184.
[23] V.K. Naik, S.P. Midkiff, J.E. Moreira, A checkpointing strategy for scalable recovery on distributed parallel systems, in: SC97: High Performance Networking and Computing, San Jose, CA, 1997
[24] J.S. Plank, Y. Kim, J. Dongarra, Algorithm-based diskless checkpointing for fault tolerant matrix operations, in: 25th International Symposium on Fault-Tolerant Computing, Pasadena, CA, June 1995
[25] J. Pruyne, M. Livny, Managing checkpoints for parallel programs, in: Workshop on Job Scheduling Strategies for Parallel Processing, IPPS'96, 1996
[26] A. Roy-Chowdhury, P. Banerjee, Algorithm-based fault location and recovery for matrix computations, in: 24th International Symposium on Fault-Tolerant Computing, Austin, TX, 1994
[27] L.M. Silva, J.G. Silva, Global checkpoints for distributed programs, in: 11th Symposium on Reliable Distributed Systems, Houston, TX, 1992
[28] L.M. Silva, B. Veer, J.G. Silva, Checkpointing SPMD applications on transputer networks, in: Scalable High Performance Computing Conference, Knoxville, TN, May 1994
[29] G. Stellner, Consistent checkpoints of PVM applications, in: First European PVM User Group Meeting, Rome, Italy, 1994
[30] G. Stellner, CoCheck: Checkpointing and process migration for MPI, in: Proceedings of the 10th International Parallel Processing Symposium, IPPS'96, Honolulu, Hawaii, April 1996
[31] N.H. Vaidya, A case for two-level distributed recovery schemes, in: ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, Ottawa, CA, May 1995
[32] van de Geijn, R.A. and Watts, J., SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience. v9 i4. 255-274.


Published In

Journal of Parallel and Distributed Computing, Volume 69, Issue 4
April, 2009
93 pages

Publisher

Academic Press, Inc.

United States

Author Tags

  1. Fault tolerance
  2. High performance computing
  3. Linear algebra

Qualifiers

  • Article

Cited By

  • (2024) AutoCheck: Automatically Identifying Variables for Checkpointing by Data Dependency Analysis. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-16). DOI: 10.1109/SC41406.2024.00105. Online publication date: 17-Nov-2024
  • (2024) A survey on checkpointing strategies. Future Generation Computer Systems 161:C (315-328). DOI: 10.1016/j.future.2024.07.022. Online publication date: 1-Dec-2024
  • (2023) Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators. ACM Transactions on Architecture and Code Optimization 21:2 (1-22). DOI: 10.1145/3633332. Online publication date: 22-Nov-2023
  • (2023) Automatic Algorithm-Based Fault Tolerance (AABFT) of Stencil Computations. Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques (187-198). DOI: 10.1109/PACT58117.2023.00024. Online publication date: 21-Oct-2023
  • (2023) Response of HPC hardware to neutron radiation at the dawn of exascale. The Journal of Supercomputing 79:12 (13817-13838). DOI: 10.1007/s11227-023-05199-y. Online publication date: 30-Mar-2023
  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications 36:2 (251-285). DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022
  • (2022) Checkpointing à la Young/Daly: An Overview. Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing (701-710). DOI: 10.1145/3549206.3549328. Online publication date: 4-Aug-2022
  • (2022) Root cause analysis of soft-error-induced failures from hardware and software perspectives. Journal of Systems Architecture: the EUROMICRO Journal 130:C. DOI: 10.1016/j.sysarc.2022.102652. Online publication date: 1-Sep-2022
  • (2021) Adaptive Erasure Coded Fault Tolerant Linear System Solver. ACM Transactions on Parallel Computing 8:4 (1-19). DOI: 10.1145/3490557. Online publication date: 9-Dec-2021
  • (2021) Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). DOI: 10.1145/3458817.3476184. Online publication date: 14-Nov-2021
