DOI: 10.1109/DSN.2014.62
Article

Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters

Published: 23 June 2014

Abstract

This paper analyzes failures and their impact on Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include i) a characterization of the root causes of single-node failures, ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency, and iii) an analysis of system-wide outages. The major findings are as follows. Hardware is not the main cause of system downtime: although hardware-related failures account for 42% of all failures, they were responsible for only 23% of the total repair time. This is partly because processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) can handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% over a set of more than 1.5 million analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to node repair hours (53%), despite causing only 20% of all failures. A total of 29 out of 39 system-wide outages involved the Lustre file system, with 42% of them caused by the inadequacy of the automated failover procedures.
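The contrast between failure counts and repair time can be made concrete with a little arithmetic. The sketch below uses only the four percentages quoted in the abstract; the "repair weight" ratio and all names in it are hypothetical illustrations, not a metric defined by the paper, and it assumes that "total repair time" and "node repair hours" refer to the same measure.

```python
# Shares quoted in the abstract: percent of all failures and percent of repair time.
# Assumption: "total repair time" and "node repair hours" are the same measure.
shares = {
    "hardware": {"failures_pct": 42, "repair_pct": 23},
    "software": {"failures_pct": 20, "repair_pct": 53},
}


def repair_weight(failures_pct, repair_pct):
    """Ratio of repair-time share to failure-count share (hypothetical metric).

    A value above 1 means a cause consumes more repair time than its
    share of failures alone would suggest.
    """
    return repair_pct / failures_pct


weights = {cause: repair_weight(**s) for cause, s in shares.items()}
for cause, weight in weights.items():
    print(f"{cause}: {weight:.2f}")

# How much more repair time per unit of failure share software takes than hardware.
relative = weights["software"] / weights["hardware"]
print(f"software vs hardware: {relative:.1f}x")
```

Under these assumptions the ratio is roughly 0.55 for hardware and 2.65 for software, which suggests that a typical software failure cost several times more repair time than a typical hardware failure.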


        Published In

        DSN '14: Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
        June 2014
        801 pages
        ISBN:9781479922338

        Publisher

        IEEE Computer Society

        United States

        Author Tags

        Failure Analysis, Failure Reports, Cray XE6, Cray XK7, Supercomputer, Machine Check, Nvidia GPU errors

        Cited By

        • (2023) Using Benford's Law to Identify Unusual Failure Regions. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 516-519. DOI: 10.1145/3624062.3624121. Online publication date: 12-Nov-2023
        • (2023) Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17. DOI: 10.1145/3581784.3607084. Online publication date: 12-Nov-2023
        • (2023) Predicting GPU Failures With High Precision Under Deep Learning Workloads. Proceedings of the 16th ACM International Conference on Systems and Storage, pp. 124-135. DOI: 10.1145/3579370.3594777. Online publication date: 5-Jun-2023
        • (2022) GPU Devices for Safety-Critical Systems: A Survey. ACM Computing Surveys 55:7, pp. 1-37. DOI: 10.1145/3549526. Online publication date: 15-Dec-2022
        • (2021) Leveraging NVMe SSDs for Building a Fast, Cost-effective, LSM-tree-based KV Store. ACM Transactions on Storage 17:4, pp. 1-29. DOI: 10.1145/3480963. Online publication date: 15-Oct-2021
        • (2021) Characterizing and Mitigating Soft Errors in GPU DRAM. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 641-653. DOI: 10.1145/3466752.3480111. Online publication date: 18-Oct-2021
        • (2021) ARC. Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, pp. 57-68. DOI: 10.1145/3431379.3460638. Online publication date: 21-Jun-2021
        • (2020) GPU lifetimes on titan supercomputer. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.5555/3433701.3433755. Online publication date: 9-Nov-2020
        • (2020) Intermittently failing tests in the embedded systems domain. Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 337-348. DOI: 10.1145/3395363.3397359. Online publication date: 18-Jul-2020
        • (2020) A Methodology for Comparing the Reliability of GPU-Based and CPU-Based HPCs. ACM Computing Surveys 53:1, pp. 1-33. DOI: 10.1145/3372790. Online publication date: 6-Feb-2020
