research-article

Open access

Actionable and interpretable fault localization for recurring failures in online service systems

Authors:

Li Cao,

Xu Du,

Dan PeiAuthors Info & Claims

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Pages 996 - 1008

https://doi.org/10.1145/3540250.3549092

Published: 09 November 2022 Publication History

PDF eReader

Abstract

Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across/within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions when pinpointing a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how one failure might affect the components in the whole system.

Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted both globally and locally (thus interpretable). Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can rank the ground truths at 1.66∼5.03-th among a long candidate list on average, outperforming baselines by 54.52%.

References

[1]

2021. PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration. https://github.com/pytorch/pytorch

Abstract

References

Cited By

Index Terms

Recommendations

Making Fault Localization in Online Service Systems More Actionable and Interpretable

Fault density, fault types, and spectra-based fault localization

Factorising the Multiple Fault Localization Problem: Adapting Single-Fault Localizer to Multi-fault Programs

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Badges

Author Tags

Qualifiers

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations