A Large-Scale Study of Failures in High-Performance Computing Systems

Published: 01 October 2010

Abstract

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20 to 1,000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
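The two distributional claims in the abstract can be illustrated with a minimal sketch. It assumes the standard two-parameter Weibull and lognormal forms; the function names and the parameter values (`SHAPE`, `SCALE`) are hypothetical and chosen only for illustration, not taken from the paper's fitted results. A Weibull shape parameter below 1 gives a hazard rate that decreases with time since the last failure, and a lognormal distribution is right-skewed, so its mean exceeds its median.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def lognormal_median(mu, sigma):
    """Median of a lognormal distribution: exp(mu)."""
    return math.exp(mu)


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


# Hypothetical parameters for illustration only (not from the paper).
SHAPE = 0.7    # shape < 1 implies a decreasing hazard rate
SCALE = 100.0  # scale, in the same time unit as t (e.g. hours)

# Hazard evaluated at increasing times since the last failure.
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in (1.0, 10.0, 100.0, 1000.0)]

if __name__ == "__main__":
    for t, h in zip((1.0, 10.0, 100.0, 1000.0), hazards):
        print(f"t={t:>7.1f}  hazard={h:.6f}")
    print("lognormal mean / median:",
          lognormal_mean(0.0, 1.0) / lognormal_median(0.0, 1.0))
```

With a shape of exactly 1 the hazard is constant and the Weibull reduces to the exponential distribution, which is why a fitted shape below 1 is the signature of a decreasing hazard rate.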


    Published In

    IEEE Transactions on Dependable and Secure Computing, Volume 7, Issue 4
    October 2010
    110 pages

    Publisher

    IEEE Computer Society Press

    Washington, DC, United States

    Author Tags

    1. Large-scale systems
    2. empirical study
    3. failures
    4. field study
    5. high-performance computing
    6. node outages
    7. reliability
    8. repair time
    9. root cause
    10. supercomputing
    11. time between failures

    Qualifiers

    • Research-article

    Cited By

    • (2024) Removing obstacles before breaking through the memory wall. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (851-867). doi:10.5555/3691992.3692044. Online publication date: 10-Jul-2024
    • (2024) A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks. Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems (171-182). doi:10.1145/3629104.3666040. Online publication date: 24-Jun-2024
    • (2024) Reinforcement Learning-based Adaptive Mitigation of Uncorrected DRAM Errors in the Field. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing (240-252). doi:10.1145/3625549.3658686. Online publication date: 3-Jun-2024
    • (2024) Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint. IEEE Transactions on Parallel and Distributed Systems 35:7 (1307-1319). doi:10.1109/TPDS.2024.3400365. Online publication date: 13-May-2024
    • (2024) Unavailability-Aware Backup Allocation Model Based on Two-Stage Shared Protection for Middleboxes. IEEE Transactions on Network and Service Management 21:1 (70-87). doi:10.1109/TNSM.2023.3285278. Online publication date: 1-Feb-2024
    • (2024) Reliability-Aware Proactive Placement of Microservices-Based IoT Applications in Fog Computing Environments. IEEE Transactions on Mobile Computing 23:12 (11326-11341). doi:10.1109/TMC.2024.3394486. Online publication date: 1-Dec-2024
    • (2024) From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (484-495). doi:10.1109/SCW63240.2024.00070. Online publication date: 17-Nov-2024
    • (2024) Resilient Virtualization. Computer 57:2 (70-78). doi:10.1109/MC.2023.3306617. Online publication date: 31-Jan-2024
    • (2024) Analyzing and predicting job failures from HPC system log. The Journal of Supercomputing 80:1 (435-462). doi:10.1007/s11227-023-05482-y. Online publication date: 1-Jan-2024
    • (2024) Adopting automated bug assignment in practice: a longitudinal case study at Ericsson. Empirical Software Engineering 29:5. doi:10.1007/s10664-024-10507-y. Online publication date: 30-Jul-2024
