Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications

Published: 15 June 2015

Abstract

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Consequently, the number of soft errors is expected to increase dramatically in the coming years. In this respect, techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. In this paper, we present a pointwise detection model with two phases: one that predicts the next expected value in the time series for each data point, and another that determines a range (i.e., a normal value interval) around the predicted next-step value. We show that dataset correlation can be used to detect corruptions indirectly and to limit the size of the dataset to monitor, taking advantage of the underlying physics of the simulation. Our results show that, using our techniques, we can detect a large number of corruptions (i.e., above 90% in some cases) with 84% memory overhead and 13.75% extra computation time.
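The two-phase pointwise model described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the predictor or how the normal value interval is sized, so everything concrete here is an assumption made for illustration: linear extrapolation from the two previous time steps as the predictor, and a radius equal to a slack factor times the largest prediction error seen on a trusted step. The function names (`predict_next`, `update_radius`, `detect_outliers`) and the `slack` and `floor` parameters are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def predict_next(prev2: Sequence[float], prev1: Sequence[float]) -> List[float]:
    """Phase 1 (assumed predictor): per-point linear extrapolation.

    prev2 and prev1 hold the dataset at time steps t-2 and t-1.
    """
    return [2.0 * b - a for a, b in zip(prev2, prev1)]


def update_radius(
    predicted: Sequence[float],
    observed: Sequence[float],
    slack: float = 2.0,
    floor: float = 1e-12,
) -> float:
    """Phase 2 (assumed rule): size the normal value interval.

    Uses slack times the largest prediction error on a step that is
    trusted to be free of corruption; floor avoids a zero-width range.
    """
    worst = max(abs(o - p) for p, o in zip(predicted, observed))
    return max(slack * worst, floor)


def detect_outliers(
    predicted: Sequence[float],
    observed: Sequence[float],
    radius: float,
) -> List[int]:
    """Return indices whose observed value lies outside
    [predicted - radius, predicted + radius]."""
    return [
        i
        for i, (p, o) in enumerate(zip(predicted, observed))
        if abs(o - p) > radius
    ]


if __name__ == "__main__":
    # Smoothly evolving toy dataset: three data points over four time steps.
    steps = [
        [0.0, 1.0, 2.0],
        [0.1, 1.1, 2.1],
        [0.4, 1.4, 2.4],
        [0.9, 1.9, 2.9],
    ]
    # Calibrate the range on step 2, which is assumed to be correct.
    radius = update_radius(predict_next(steps[0], steps[1]), steps[2])

    # Emulate a silent corruption in data point 1 at step 3.
    corrupted = list(steps[3])
    corrupted[1] += 64.0

    predicted = predict_next(steps[1], steps[2])
    print("clean step:", detect_outliers(predicted, steps[3], radius))
    print("corrupted step:", detect_outliers(predicted, corrupted, radius))
```

In this sketch the detector keeps the two previous time steps of every monitored data point, which hints at why the memory overhead of such a scheme grows with the size of the monitored dataset; the overhead figures quoted in the abstract refer to the authors' own implementation, not to this example.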

    Published In

    HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
    June 2015
    296 pages
    ISBN:9781450335508
    DOI:10.1145/2749246

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. fault tolerance
    2. high-performance computing
    3. resilience
    4. silent data corruption
    5. soft errors
    6. time series

    Qualifiers

    • Short-paper

    Acceptance Rates

    HPDC '15 Paper Acceptance Rate 19 of 116 submissions, 16%;
    Overall Acceptance Rate 166 of 966 submissions, 17%
