3-Dimensional root cause diagnosis via co-analysis

Published: 18 September 2012
DOI: 10.1145/2371536.2371571

Abstract

With the growth of system size and complexity, reliability has become a major concern for large-scale systems. When a failure occurs, system administrators typically trace the events in Reliability, Availability, and Serviceability (RAS) logs for root cause diagnosis. However, the RAS log contains only limited diagnostic information, and manual processing is time-consuming, error-prone, and does not scale. To address this problem, we present an automated root cause diagnosis mechanism for large-scale HPC systems. Our mechanism examines multiple logs to provide a 3-D, fine-grained root cause analysis. Here, 3-D means that the analysis pinpoints the failure layer, the time, and the location of the event that causes the problem.
We evaluate our mechanism using real logs collected from a production IBM Blue Gene/P system at Oak Ridge National Laboratory. It successfully identifies failure-layer information for the failures observed during a 23-month period. Furthermore, it effectively identifies the triggering events, together with their time and location, even when those events occur hundreds of hours before the resulting failures.
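As a rough illustration of the co-analysis idea, the sketch below shows one way a time-and-location correlation step could be coded. It is a hypothetical Python illustration, not the paper's implementation: the event schema, layer labels, location identifiers, and the 300-hour look-back window are all assumptions.

# Hypothetical sketch: field names, window size, and the selection rule are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class LogEvent:
    timestamp: datetime   # when the event was recorded
    location: str         # e.g., a midplane/node-card identifier
    layer: str            # e.g., "hardware", "kernel", "application"
    message: str

def find_trigger(failure: LogEvent, events: List[LogEvent],
                 lookback_hours: float = 300.0) -> Optional[LogEvent]:
    """Return the earliest event at the failure's location within the look-back
    window; its (layer, timestamp, location) is the 3-D diagnosis candidate."""
    window_start = failure.timestamp - timedelta(hours=lookback_hours)
    candidates = [e for e in events
                  if e.location == failure.location
                  and window_start <= e.timestamp < failure.timestamp]
    return min(candidates, key=lambda e: e.timestamp) if candidates else None

# Toy example: a kernel-layer warning hundreds of hours before the failure.
events = [
    LogEvent(datetime(2012, 1, 2, 4, 0), "R00-M0", "kernel",
             "correctable memory error threshold exceeded"),
    LogEvent(datetime(2012, 1, 10, 9, 30), "R01-M1", "application",
             "job exited abnormally"),
]
failure = LogEvent(datetime(2012, 1, 14, 12, 0), "R00-M0", "hardware",
                   "node card fatal error")
trigger = find_trigger(failure, events)
if trigger is not None:
    print(f"layer={trigger.layer} time={trigger.timestamp} location={trigger.location}")

In the paper's setting, the events would come from co-analyzing multiple sources (for example RAS and job logs) rather than a single list, and candidate selection would be more sophisticated; the sketch only shows the basic time-and-location correlation step.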



Published In

ICAC '12: Proceedings of the 9th international conference on Autonomic computing
September 2012
222 pages
ISBN:9781450315203
DOI:10.1145/2371536
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. co-analysis
  2. diagnosis
  3. large-scale system

Qualifiers

  • Research-article

Conference

ICAC '12: 9th International Conference on Autonomic Computing
September 18 - 20, 2012
San Jose, California, USA


Cited By

  • (2023) An empirical study of major page faults for failure diagnosis in cluster systems. The Journal of Supercomputing 79:16 (18445-18479). DOI: 10.1007/s11227-023-05366-1. Online publication date: 15-May-2023
  • (2022) Comparative Causal Analysis of Network Log Data in Two Large ISPs. NOMS 2022 - IEEE/IFIP Network Operations and Management Symposium (1-6). DOI: 10.1109/NOMS54207.2022.9789823. Online publication date: 25-Apr-2022
  • (2022) A Survey of Log-Correlation Tools for Failure Diagnosis and Prediction in Cluster Systems. IEEE Access 10 (133487-133503). DOI: 10.1109/ACCESS.2022.3231454. Online publication date: 2022
  • (2021) LogFlow: Simplified Log Analysis for Large Scale Systems. Proceedings of the 22nd International Conference on Distributed Computing and Networking (116-125). DOI: 10.1145/3427796.3427808. Online publication date: 5-Jan-2021
  • (2021) Failure Diagnosis for Cluster Systems using Partial Correlations. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom) (1091-1101). DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151. Online publication date: Sep-2021
  • (2021) Systemic Assessment of Node Failures in HPC Production Platforms. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (267-276). DOI: 10.1109/IPDPS49936.2021.00035. Online publication date: May-2021
  • (2021) A Quantitative Causal Analysis for Network Log Data. 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC) (1437-1442). DOI: 10.1109/COMPSAC51774.2021.00213. Online publication date: Jul-2021
  • (2021) amulog: A general log analysis framework for comparison and combination of diverse template generation methods. International Journal of Network Management 32:4. DOI: 10.1002/nem.2195. Online publication date: 19-Dec-2021
  • (2020) amulog: A General Log Analysis Framework for Diverse Template Generation Methods. 2020 16th International Conference on Network and Service Management (CNSM) (1-5). DOI: 10.23919/CNSM50824.2020.9269049. Online publication date: 2-Nov-2020
  • (2020) Introducing an Unsupervised Automated Solution for Root Cause Diagnosis in Mobile Networks. IEEE Transactions on Network and Service Management 17:1 (547-561). DOI: 10.1109/TNSM.2019.2954340. Online publication date: Mar-2020
