DOI: 10.1145/1995896.1995955
poster

SRC: soft error detection and recovery for high performance linpack

Published: 31 May 2011

Abstract

In high-performance computing, the probability of failure grows with system size. Calculation errors may occur that no other mechanism detects. To address this problem, we present a checksum-based approach that detects and recovers from calculation errors, and we apply it to the LU factorization algorithm used by High Performance Linpack. The approach has low overhead: in contrast to existing approaches that require the calculation to be repeated, it repeats only a fraction of the calculation during recovery. The frequency of checking can be adjusted to match the error rate, resulting in a flexible method of fault tolerance.
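The abstract does not spell out the checksum scheme, so the following is only a minimal sketch of the general idea, in the spirit of algorithm-based fault tolerance, and not the authors' implementation. It assumes a row-checksum column appended to the matrix, Gaussian elimination without pivoting (High Performance Linpack itself uses partial pivoting), and a snapshot-and-redo recovery step. The names `lu_with_checksum`, `consistent`, `check_every`, and `inject` are hypothetical, and only the U factor is returned to keep the example short.

```python
import copy


def consistent(w, n, tol):
    """Check that each row's checksum entry still equals the sum of the row."""
    for row in w:
        scale = 1.0 + sum(abs(x) for x in row[:n])
        if abs(sum(row[:n]) - row[n]) > tol * scale:
            return False
    return True


def lu_with_checksum(a, check_every=2, inject=None, tol=1e-9):
    """Eliminate on [A | row sums]; return (U, number_of_recoveries).

    Assumptions (not from the source): no pivoting, nonzero pivots,
    and recovery by rolling back to the last verified snapshot.
    """
    n = len(a)
    # Append a checksum column holding the sum of each row of A.
    w = [row[:] + [sum(row)] for row in a]
    snapshot, snap_step = copy.deepcopy(w), 0  # last state known to be good
    recoveries = 0
    k = 0
    while k < n - 1:
        for i in range(k + 1, n):
            m = w[i][k] / w[k][k]
            # The row update also covers the checksum column (index n),
            # so "checksum == sum of row" is preserved up to rounding.
            for j in range(k, n + 1):
                w[i][j] -= m * w[k][j]
            w[i][k] = 0.0
        if inject is not None:
            inject(k, w)  # hypothetical hook used to simulate a soft error
        k += 1
        if k % check_every == 0 or k == n - 1:
            if consistent(w, n, tol):
                snapshot, snap_step = copy.deepcopy(w), k
            else:
                # Redo only the steps performed since the last good check.
                w, k = copy.deepcopy(snapshot), snap_step
                recoveries += 1
    return [row[:n] for row in w], recoveries


# Diagonally dominant example matrix, so elimination needs no pivoting.
A = [[4.0, 1.0, 2.0, 0.0],
     [1.0, 5.0, 1.0, 1.0],
     [2.0, 1.0, 6.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

fired = []


def flip_once(k, w):
    """Corrupt one entry after elimination step 1, a single time."""
    if k == 1 and not fired:
        w[2][3] += 1.0
        fired.append(k)


U_clean, clean_recoveries = lu_with_checksum(A)
U_faulty, faulty_recoveries = lu_with_checksum(A, inject=flip_once)
```

In this sketch a smaller `check_every` means more checksum comparisons but less work to redo after a detected error, while a larger value does the opposite; that is one way to read the adjustable checking frequency the abstract mentions.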


Cited By

  • (2015) Bit Flipping Errors in High Performance Linpack at Exascale and Beyond. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP), pp. 420-429. DOI: 10.1109/ICPP.2015.51. Online publication date: 1-Sep-2015


Published In

ICS '11: Proceedings of the international conference on Supercomputing
May 2011
398 pages
ISBN: 9781450301022
DOI: 10.1145/1995896

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. algorithm-based recovery
  2. fault tolerance
  3. high performance linpack benchmark
  4. lu factorization

Qualifiers

  • Poster

Conference

ICS '11: International Conference on Supercomputing
May 31 - June 4, 2011
Tucson, Arizona, USA

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
