DOI: 10.1109/DSN.2011.5958210
Article

Improving Log-based Field Failure Data Analysis of multi-node computing systems

Published: 27 June 2011

Abstract

Log-based Field Failure Data Analysis (FFDA) is a widely adopted methodology for assessing the dependability properties of an operational system. A key step in FFDA is filtering the log to remove both entries that are not useful and redundant error entries. The latter is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtaining realistic measurements. This paper addresses the shortcomings of the tuple heuristic, which is used to group error entries in the log, when it is applied to multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly, and we therefore propose an improved heuristic that adopts statistical indicators. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with the two heuristics. The analysis covers the log of the Mercury cluster at the National Center for Supercomputing Applications.
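
The abstract does not spell out how the tuple heuristic works. As a rough illustration only, the sketch below assumes the commonly described time-based form, in which an entry joins the current tuple if it occurs within a fixed coalescence window of the previous entry. The names `LogEntry` and `group_into_tuples`, the 60-second window, and the sample entries are all hypothetical; none of them come from the paper or from the Mercury log. The improved heuristic based on statistical indicators is not sketched, since the abstract gives no detail on it.

```python
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: float  # seconds since an arbitrary origin (hypothetical data)
    node: str         # node that emitted the entry
    message: str


def group_into_tuples(entries: List[LogEntry], window: float) -> List[List[LogEntry]]:
    """Time-based coalescence (assumed form of the tuple heuristic).

    An entry is appended to the current tuple when its timestamp is at most
    `window` seconds after the previous entry; otherwise a new tuple starts.
    The node that produced the entry is deliberately ignored.
    """
    tuples: List[List[LogEntry]] = []
    for entry in sorted(entries, key=lambda e: e.timestamp):
        if tuples and entry.timestamp - tuples[-1][-1].timestamp <= window:
            tuples[-1].append(entry)
        else:
            tuples.append([entry])
    return tuples


# Made-up entries from two nodes, for illustration only.
sample_log = [
    LogEntry(0.0, "node-a", "disk I/O error"),
    LogEntry(12.0, "node-a", "filesystem remounted read-only"),
    LogEntry(30.0, "node-b", "memory ECC error"),
    LogEntry(900.0, "node-a", "disk I/O error"),
]

for group in group_into_tuples(sample_log, window=60.0):
    print([(e.node, e.message) for e in group])
```

With these made-up entries, the first three land in one tuple purely because they are close in time, even though the `node-b` entry may stem from an unrelated fault. This is the kind of incorrect grouping in multi-node logs that the abstract refers to.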



Published In

DSN '11: Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks
June 2011
597 pages
ISBN: 9781424492329

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Cited By

  • (2018) Desh. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 40-51. DOI: 10.1145/3208040.3208051. Online publication date: 11-Jun-2018
  • (2018) Fault site pruning for practical reliability analysis of GPGPU applications. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 749-761. DOI: 10.1109/MICRO.2018.00066. Online publication date: 20-Oct-2018
  • (2017) Failures in large scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/3126908.3126937. Online publication date: 12-Nov-2017
  • (2016) A web interface for XALT log data analysis. Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale, pp. 1-8. DOI: 10.1145/2949550.2949560. Online publication date: 17-Jul-2016
  • (2015) Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/2807591.2807666. Online publication date: 15-Nov-2015
  • (2015) LogDiver. Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, pp. 11-18. DOI: 10.1145/2751504.2751511. Online publication date: 15-Jun-2015
  • (2015) Automating Crash Report Analysis Using 'Exception-based Patterns' & 'Reference Assembly mapping'. Proceedings of the 8th India Software Engineering Conference, pp. 70-79. DOI: 10.1145/2723742.2723749. Online publication date: 18-Feb-2015
  • (2012) Fault prediction under the microscope. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1-11. DOI: 10.5555/2388996.2389101. Online publication date: 10-Nov-2012
  • (2012) 3-Dimensional root cause diagnosis via co-analysis. Proceedings of the 9th international conference on Autonomic computing, pp. 181-190. DOI: 10.1145/2371536.2371571. Online publication date: 18-Sep-2012
