3-Dimensional root cause diagnosis via co-analysis

Published: 18 September 2012
DOI: 10.1145/2371536.2371571

Abstract

With the growth of system size and complexity, reliability has become a major concern for large-scale systems. When a failure occurs, system administrators typically trace the events in Reliability, Availability, and Serviceability (RAS) logs for root cause diagnosis. However, the RAS log contains only limited diagnostic information, and manual processing is time-consuming, error-prone, and does not scale. To address this problem, we present an automated root cause diagnosis mechanism for large-scale HPC systems. Our mechanism examines multiple logs to provide a 3-D, fine-grained root cause analysis. Here, 3-D means that the analysis pinpoints the failure layer, the time, and the location of the event that causes the problem.
We evaluate our mechanism using real logs collected from a production IBM Blue Gene/P system at Oak Ridge National Laboratory. It successfully identifies failure-layer information for the failures observed during a 23-month period. Furthermore, it effectively identifies the triggering events, together with their time and location, even when those events occur hundreds of hours before the resulting failures.
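As a rough illustration of the co-analysis idea, the sketch below shows one way a time-and-location correlation step could be coded. It is a hypothetical Python illustration, not the paper's implementation: the event schema, layer labels, location identifiers, and the 300-hour look-back window are all assumptions.

# Hypothetical sketch: field names, window size, and the selection rule are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class LogEvent:
    timestamp: datetime   # when the event was recorded
    location: str         # e.g., a midplane/node-card identifier
    layer: str            # e.g., "hardware", "kernel", "application"
    message: str

def find_trigger(failure: LogEvent, events: List[LogEvent],
                 lookback_hours: float = 300.0) -> Optional[LogEvent]:
    """Return the earliest event at the failure's location within the look-back
    window; its (layer, timestamp, location) is the 3-D diagnosis candidate."""
    window_start = failure.timestamp - timedelta(hours=lookback_hours)
    candidates = [e for e in events
                  if e.location == failure.location
                  and window_start <= e.timestamp < failure.timestamp]
    return min(candidates, key=lambda e: e.timestamp) if candidates else None

# Toy example: a kernel-layer warning hundreds of hours before the failure.
events = [
    LogEvent(datetime(2012, 1, 2, 4, 0), "R00-M0", "kernel",
             "correctable memory error threshold exceeded"),
    LogEvent(datetime(2012, 1, 10, 9, 30), "R01-M1", "application",
             "job exited abnormally"),
]
failure = LogEvent(datetime(2012, 1, 14, 12, 0), "R00-M0", "hardware",
                   "node card fatal error")
trigger = find_trigger(failure, events)
if trigger is not None:
    print(f"layer={trigger.layer} time={trigger.timestamp} location={trigger.location}")

In the paper's setting, the events would come from co-analyzing multiple sources (for example RAS and job logs) rather than a single list, and candidate selection would be more sophisticated; the sketch only shows the basic time-and-location correlation step.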



Published In

ICAC '12: Proceedings of the 9th international conference on Autonomic computing
September 2012
222 pages
ISBN:9781450315203
DOI:10.1145/2371536
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. co-analysis
  2. diagnosis
  3. large-scale system

Qualifiers

  • Research-article

Conference

ICAC '12: 9th International Conference on Autonomic Computing
September 18 - 20, 2012
San Jose, California, USA


Cited By

  • (2023) An empirical study of major page faults for failure diagnosis in cluster systems. The Journal of Supercomputing 79:16 (18445-18479). DOI: 10.1007/s11227-023-05366-1. Online publication date: 15-May-2023
  • (2022) Comparative Causal Analysis of Network Log Data in Two Large ISPs. NOMS 2022 - IEEE/IFIP Network Operations and Management Symposium (1-6). DOI: 10.1109/NOMS54207.2022.9789823. Online publication date: 25-Apr-2022
  • (2022) A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems. IEEE Access 10 (133487-133503). DOI: 10.1109/ACCESS.2022.3231454. Online publication date: 2022
  • (2021) LogFlow: Simplified Log Analysis for Large Scale Systems. Proceedings of the 22nd International Conference on Distributed Computing and Networking (116-125). DOI: 10.1145/3427796.3427808. Online publication date: 5-Jan-2021
  • (2021) Failure Diagnosis for Cluster Systems using Partial Correlations. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom) (1091-1101). DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151. Online publication date: Sep-2021
  • (2021) Systemic Assessment of Node Failures in HPC Production Platforms. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (267-276). DOI: 10.1109/IPDPS49936.2021.00035. Online publication date: May-2021
  • (2021) A Quantitative Causal Analysis for Network Log Data. 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC) (1437-1442). DOI: 10.1109/COMPSAC51774.2021.00213. Online publication date: Jul-2021
  • (2021) amulog: A general log analysis framework for comparison and combination of diverse template generation methods. International Journal of Network Management 32:4. DOI: 10.1002/nem.2195. Online publication date: 19-Dec-2021
  • (2020) amulog: A General Log Analysis Framework for Diverse Template Generation Methods. 2020 16th International Conference on Network and Service Management (CNSM) (1-5). DOI: 10.23919/CNSM50824.2020.9269049. Online publication date: 2-Nov-2020
  • (2020) Introducing an Unsupervised Automated Solution for Root Cause Diagnosis in Mobile Networks. IEEE Transactions on Network and Service Management 17:1 (547-561). DOI: 10.1109/TNSM.2019.2954340. Online publication date: Mar-2020
