DOI: 10.1109/DSN.2005.50

Filtering Failure Logs for a BlueGene/L Prototype

Published: 28 June 2005

Abstract

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We apply a three-step filtering algorithm to these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
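The three filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no record layout, severity labels, or thresholds, so the `Event` fields, the `FAILURE_SEVERITIES` set, the use of the message text as the error identity, and the 300-second windows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    """One log entry. All field names are hypothetical, not the real log schema."""
    timestamp: float  # seconds since an arbitrary epoch
    location: str     # identifier of the reporting component
    severity: str     # severity label attached to the entry
    message: str      # entry text, used here as the identity of the error


# Assumed labels that mark an entry as a failure event.
FAILURE_SEVERITIES = {"FATAL", "FAILURE"}


def extract_failures(events: Iterable[Event]) -> List[Event]:
    """Step 1: keep only failure events, ordered by time."""
    failures = [e for e in events if e.severity in FAILURE_SEVERITIES]
    return sorted(failures, key=lambda e: e.timestamp)


def temporal_filter(events: List[Event], window: float) -> List[Event]:
    """Step 2: drop repeated reports of the same error from the same location.

    A report is dropped when the previous report with the same
    (location, message) key arrived at most `window` seconds earlier.
    The last-seen time is refreshed even for dropped reports, so a long
    burst of closely spaced reports collapses into a single entry.
    """
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Event] = []
    for e in events:
        key = (e.location, e.message)
        prev = last_seen.get(key)
        if prev is None or e.timestamp - prev > window:
            kept.append(e)
        last_seen[key] = e.timestamp
    return kept


def spatial_filter(events: List[Event], window: float) -> List[Event]:
    """Step 3: coalesce reports of the same error across different locations.

    Same rule as the temporal step, but keyed on the message alone.
    """
    last_seen: Dict[str, float] = {}
    kept: List[Event] = []
    for e in events:
        prev = last_seen.get(e.message)
        if prev is None or e.timestamp - prev > window:
            kept.append(e)
        last_seen[e.message] = e.timestamp
    return kept


def filter_log(events: Iterable[Event],
               temporal_window: float = 300.0,
               spatial_window: float = 300.0) -> List[Event]:
    """Run the three steps in order; the default windows are arbitrary."""
    failures = extract_failures(events)
    deduplicated = temporal_filter(failures, temporal_window)
    return spatial_filter(deduplicated, spatial_window)
```

Calling `filter_log` on a list of `Event` records returns the entries that survive all three steps. Chaining the steps in this order means the cross-location pass only sees entries that have already been deduplicated per location.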

Published In

DSN '05: Proceedings of the 2005 International Conference on Dependable Systems and Networks
June 2005
778 pages
ISBN: 0769522823

Publisher

IEEE Computer Society

United States

Cited By

  • (2022) Understanding Memory Failures on a Petascale Arm System. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 84-96. DOI: 10.1145/3502181.3531465. Online publication date: 27-Jun-2022
  • (2018) Lessons learned from memory errors observed over the lifetime of Cielo. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-12. DOI: 10.5555/3291656.3291714. Online publication date: 11-Nov-2018
  • (2018) Lessons learned from memory errors observed over the lifetime of Cielo. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-12. DOI: 10.1109/SC.2018.00046. Online publication date: 11-Nov-2018
  • (2018) Fault site pruning for practical reliability analysis of GPGPU applications. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 749-761. DOI: 10.1109/MICRO.2018.00066. Online publication date: 20-Oct-2018
  • (2017) Failures in large scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/3126908.3126937. Online publication date: 12-Nov-2017
  • (2016) Dynamic prediction & estimation of intentional failures in HPCs. Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 1244-1250. DOI: 10.5555/3192424.3192653. Online publication date: 18-Aug-2016
  • (2015) Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/2807591.2807666. Online publication date: 15-Nov-2015
  • (2015) LogDiver. Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, pp. 11-18. DOI: 10.1145/2751504.2751511. Online publication date: 15-Jun-2015
  • (2015) A Principled Approach to HPC Event Monitoring. Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, pp. 3-10. DOI: 10.1145/2751504.2751506. Online publication date: 15-Jun-2015
  • (2015) Analyzing and Predicting Failure in Hadoop Clusters Using Distributed Hidden Markov Model. Revised Selected Papers of the Second International Conference on Cloud Computing and Big Data - Volume 9106, pp. 232-246. DOI: 10.1007/978-3-319-28430-9_18. Online publication date: 17-Jun-2015
