Article

Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

Authors: Zizhong Chen, Jack Dongarra

IPDPS'06: Proceedings of the 20th international conference on Parallel and distributed processing

Page 97

Published: 25 April 2006


Abstract

As the size of today's high performance computers increases from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance methods for matrix computations are often derived using algorithms that are seldom used in the practice of today's high performance matrix computations and have mostly focused on platforms where failed processors produce incorrect calculations.

To fill this gap, this paper extends the existing algorithm-based fault tolerance to the volatile computing platform where the failed processor stops working and applies it to scalable high performance matrix computations with two-dimensional block cyclic data distribution. We show the practicality of this technique by applying it to the ScaLAPACK/PBLAS matrix-matrix multiplication kernel. Experimental results demonstrate that the proposed approach is able to survive process failures with a very low performance overhead.
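The checksum idea behind algorithm-based fault tolerance can be illustrated with a small sketch. The code below is not the paper's ScaLAPACK/PBLAS implementation and does not use a block cyclic distribution. It is a minimal, single-process illustration that assumes each row of the product is held by one "processor" and that at most one of them stops and loses its row. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append one extra column holding the sum of each row of `b`."""
    return [row + [sum(row)] for row in b]


def recover_row(c_full, lost):
    """Rebuild data row `lost` from the checksum row and the surviving rows.

    Assumption: exactly one data row is missing (a fail-stop failure),
    and the checksum row itself survived.
    """
    n = len(c_full) - 1                      # number of data rows
    checksum = c_full[n]
    survivors = [c_full[i] for i in range(n) if i != lost]
    return [checksum[j] - sum(r[j] for r in survivors)
            for j in range(len(checksum))]


# Encode the inputs, multiply, then simulate the loss of one row.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(add_column_checksum(A), add_row_checksum(B))

expected_row = C_full[1]
C_full[1] = None                             # the "processor" holding row 1 stops
C_full[1] = recover_row(C_full, lost=1)
```

Because the product of a column-checksum matrix and a row-checksum matrix is itself a full-checksum matrix, the lost row is recovered by subtraction alone, with no checkpoint taken. Integer entries are used here to sidestep floating-point round-off, which a real implementation would have to account for.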

Published In

IPDPS'06: Proceedings of the 20th international conference on Parallel and distributed processing

April 2006

399 pages

ISBN:1424400546

In-Cooperation

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE Computer Society Technical Committee on Distributed Processing

Publisher

IEEE Computer Society

United States

Cited By

Alshboul M, Tuck J, Solihin Y (2018). Lazy persistency. Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 439-451. Online publication date: 2-Jun-2018.
https://dl.acm.org/doi/10.1109/ISCA.2018.00044

Levy S, Ferreira K, Bridges P, West J (2016). Improving application resilience to memory errors with lightweight compression. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. Online publication date: 13-Nov-2016.
https://dl.acm.org/doi/10.5555/3014904.3014942

Bouteiller A, Herault T, Bosilca G, Du P, Dongarra J (2015). Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy. ACM Transactions on Parallel Computing, 1:2, pages 1-28. Online publication date: 18-Feb-2015.
https://dl.acm.org/doi/10.1145/2686892

Engelmann C (2014). Scaling to a million cores and beyond. Future Generation Computer Systems, 30:C, pages 59-65. Online publication date: 1-Jan-2014.
https://dl.acm.org/doi/10.5555/2747903.2748198

Cappello F, Al G, Gropp W, Kale S, Kramer B, Snir M (2014). Toward Exascale Resilience. Supercomputing Frontiers and Innovations: an International Journal, 1:1, pages 5-28. Online publication date: 6-Apr-2014.
https://dl.acm.org/doi/10.14529/jsfi140101

Laguna I, León E, Schulz M, Stephenson M (2013). A study of application-level recovery methods for transient network faults. Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pages 1-8. Online publication date: 17-Nov-2013.
https://dl.acm.org/doi/10.1145/2530268.2530271

Chen Z (2013). Online-ABFT. ACM SIGPLAN Notices, 48:8, pages 167-176. Online publication date: 23-Feb-2013.
https://dl.acm.org/doi/10.1145/2517327.2442533

Levy S, Bridges P, Ferreira K, Thompson A, Trott C, Hoefler T, Iskra K (2013). Evaluating the feasibility of using memory content similarity to improve system resilience. Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers, pages 1-8. Online publication date: 10-Jun-2013.
https://dl.acm.org/doi/10.1145/2491661.2481432

Chen Z, Nicolau A, Shen X, Amarasinghe S, Vuduc R (2013). Online-ABFT. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 167-176. Online publication date: 23-Feb-2013.
https://dl.acm.org/doi/10.1145/2442516.2442533

Du P, Bouteiller A, Bosilca G, Herault T, Dongarra J (2012). Algorithm-based fault tolerance for dense matrix factorizations. ACM SIGPLAN Notices, 47:8, pages 225-234. Online publication date: 25-Feb-2012.
https://dl.acm.org/doi/10.1145/2370036.2145845
