Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

Published: 25 April 2006

Abstract

As the size of today's high performance computers increases from hundreds to thousands and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance methods for matrix computations are often derived using algorithms that are seldom used in the practice of today's high performance matrix computations, and they have mostly focused on platforms where failed processors produce incorrect calculations.
To fill this gap, this paper extends the existing algorithm-based fault tolerance to the volatile computing platform, where a failed processor stops working, and applies it to scalable high performance matrix computations with two-dimensional block cyclic data distribution. We show the practicality of this technique by applying it to the ScaLAPACK/PBLAS matrix-matrix multiplication kernel. Experimental results demonstrate that the proposed approach is able to survive process failures with a very low performance overhead.
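
The checksum idea behind algorithm-based fault tolerance can be illustrated with a small, single-process sketch. It is not the paper's ScaLAPACK/PBLAS implementation: it assumes, purely for illustration, that each "processor" owns one row of the result, that exactly one processor is lost, and that entries are integers so that recovery is exact. The helper names (`add_column_checksum`, `add_row_checksum`, `recover_row`) are hypothetical. Multiplying a column-checksum version of A by a row-checksum version of B yields a product that carries its own checksums, so the data of a processor that stops working can be rebuilt from the survivors without a checkpoint.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_column_checksum(a):
    """Append one extra row holding the sum of every column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append one extra column holding the sum of every row of b."""
    return [row + [sum(row)] for row in b]


def recover_row(c_full, lost):
    """Rebuild the data row `lost` from the checksum row (the last row).

    Assumption: exactly one data row is missing and the checksum row
    itself survived. The lost row is never read.
    """
    data_rows = len(c_full) - 1
    checksum = c_full[-1]
    survivors = [c_full[i] for i in range(data_rows) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]

# The product of the encoded inputs is A*B bordered by its checksums.
C_full = matmul(add_column_checksum(A), add_row_checksum(B))

# Simulate a fail-stop failure: the owner of row 1 disappears.
damaged = [row[:] for row in C_full]
damaged[1] = None

recovered = recover_row(damaged, lost=1)
```

With floating-point data the same subtraction recovers the lost row only up to rounding error, and in a distributed setting the sum over survivors would be a reduction across processes rather than a local loop.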



Published In

IPDPS'06: Proceedings of the 20th international conference on Parallel and distributed processing
April 2006
399 pages
ISBN:1424400546

Sponsors

  • IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article
