DOI: 10.1145/1995896.1995955
poster

SRC: soft error detection and recovery for high performance linpack

Published: 31 May 2011

Abstract

In high-performance computing, the probability of failure grows with system size. Errors in calculations may occur that cannot be detected by any other means. To address this problem, we create a checksum-based approach that detects and recovers from calculation errors. We apply this approach to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead. Unlike existing approaches that repeat the entire calculation, it repeats only a fraction of the calculation during recovery. The frequency of checking can be adjusted to match the error rate, resulting in a flexible method of fault tolerance.
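The checksum idea can be illustrated with a minimal sketch in the style of Huang and Abraham's algorithm-based fault tolerance. It is not the authors' implementation: it uses plain Gaussian elimination without pivoting on a small dense matrix, and the names `lu_with_checksum`, `check_every`, and `inject` are hypothetical. A checksum column holding each row's sum is appended to the matrix and updated along with the row, so a row whose entries no longer add up to its checksum signals a corrupted value. The `check_every` parameter stands in for the adjustable checking frequency mentioned in the abstract.

```python
# Illustrative sketch only: row-checksum elimination (no pivoting), pure Python.
# All names (lu_with_checksum, check_every, inject) are hypothetical.

def lu_with_checksum(a, check_every=1, tol=1e-9, inject=None):
    """Eliminate on [A | row sums]; return (working matrix, detections).

    inject: optional (step, row, col, delta) that corrupts one entry right
    after the given elimination step, to simulate a soft error.
    detections: list of (step, row) pairs where a checksum mismatch was seen.
    """
    n = len(a)
    # Append a checksum column holding the sum of each row.
    m = [row[:] + [sum(row)] for row in a]
    detected = []
    for k in range(n - 1):
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            # Update the whole row, checksum column included.
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
        if inject is not None and inject[0] == k:
            _, r, c, delta = inject
            m[r][c] += delta  # simulated soft error
        # Verify the checksum relation only every `check_every` steps.
        if (k + 1) % check_every == 0:
            for i in range(n):
                expected = m[i][n]
                if abs(sum(m[i][:n]) - expected) > tol * max(1.0, abs(expected)):
                    detected.append((k, i))
    return m, detected


a = [[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]]
_, clean = lu_with_checksum(a)
_, faulty = lu_with_checksum(a, inject=(0, 2, 2, 1.0))
print(clean)   # []
print(faulty)  # [(0, 2), (1, 2)]
```

This sketch covers detection only. A recovery step along the lines the abstract describes would presumably return to the last verified state and redo only the elimination steps performed since then, which is why a smaller `check_every` trades more checking overhead for less recomputation.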

References

[1] K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33:518–528, 1984.
[2] F. T. Luk and H. Park. An analysis of algorithm-based fault tolerance techniques. Journal of Parallel and Distributed Computing, 5(2):172–184, 1988.

Cited By

  • (2015) Bit Flipping Errors in High Performance Linpack at Exascale and Beyond. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP), pp. 420-429. DOI: 10.1109/ICPP.2015.51. Online publication date: 1-Sep-2015

    Published In

    ICS '11: Proceedings of the international conference on Supercomputing
    May 2011
    398 pages
    ISBN:9781450301022
    DOI:10.1145/1995896

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. algorithm-based recovery
    2. fault tolerance
    3. High Performance Linpack benchmark
    4. LU factorization

    Qualifiers

    • Poster

    Conference

    ICS '11: International Conference on Supercomputing
    May 31 - June 4, 2011
    Tucson, Arizona, USA

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%
