Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3540250.3549092acmconferencesArticle/Chapter ViewAbstractPublication PagesfseConference Proceedingsconference-collections
research-article
Open access

Actionable and interpretable fault localization for recurring failures in online service systems

Published: 09 November 2022 Publication History

Abstract

Fault localization is challenging in an online service system due to its monitoring data's large volume and variety and complex dependencies across/within its components (e.g., services or databases). Furthermore, engineers require fault localization solutions to be actionable and interpretable, which existing research approaches cannot satisfy. Therefore, the common industry practice is that, for a specific online service system, its experienced engineers focus on localization for recurring failures based on the knowledge accumulated about the system and historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions when pinpointing a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how one failure might affect the components in the whole system.
Although the above common practice is actionable and interpretable, it is largely manual, thus slow and sometimes inaccurate. In this paper, we aim to automate this practice through machine learning. That is, we propose an actionable and interpretable fault localization approach, DejaVu, for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and dependencies in the system as input and trains a localization model offline; for an incoming failure, the trained model online recommends where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics) (thus actionable), which are further interpreted both globally and locally (thus interpretable). Based on the evaluation on 601 failures from three production systems and one open-source benchmark, in less than one second, DejaVu can rank the ground truths at 1.66∼5.03-th among a long candidate list on average, outperforming baselines by 54.52%.

References

[1]
2021. PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration. https://github.com/pytorch/pytorch
[2]
2021. Tsfresh. Blue Yonder GmbH. https://github.com/blue-yonder/tsfresh
[3]
2022. Chaos-Mesh/Chaos-Mesh. Chaos Mesh. https://github.com/chaos-mesh/chaos-mesh
[4]
2022. DejaVu. NetManAIOps. https://github.com/NetManAIOps/DejaVu
[5]
David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network Dissection: Quantifying Interpretability of Deep Visual Representations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. https://doi.org/10.1109/CVPR.2017.354
[6]
Peter Bodik, Moises Goldszmidt, Armando Fox, Dawn B. Woodard, and Hans Andersen. 2010. Fingerprinting the Datacenter: Automated Classification of Performance Crises. In Proceedings of the 5th European Conference on Computer Systems. https://doi.org/10.1145/1755913.1755926
[7]
Álvaro Brandón, Marc Solé, Alberto Huélamo, David Solans, María S. Pérez, and Victor Muntés-Mulero. 2020. Graph-Based Root Cause Analysis for Service-Oriented and Microservice Architectures. Journal of Systems and Software, 159 (2020), Jan., https://doi.org/10.1016/j.jss.2019.110432
[8]
Leo Breiman. 2001. Random Forests. Machine Learning, 45, 1 (2001), https://doi.org/10.1023/A:1010933404324
[9]
Yang Cai, Biao Han, Jie Li, Na Zhao, and Jinshu Su. 2021. ModelCoder: A Fault Model Based Automatic Root Cause Localization Framework for Microservice Systems. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). https://doi.org/10.1109/IWQOS52092.2021.9521318
[10]
Junjie Chen, Xiaoting He, Qingwei Lin, Yong Xu, Hongyu Zhang, Dan Hao, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. 2019. An Empirical Investigation of Incident Triage for Online Service Systems. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice. https://doi.org/10.1109/ICSE-SEIP.2019.00020
[11]
Junjie Chen, Xiaoting He, Qingwei Lin, Hongyu Zhang, Dan Hao, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. 2019. Continuous Incident Triage for Large-Scale Online Service Systems. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). https://doi.org/10.1109/ASE.2019.00042
[12]
Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang, Hui Xu, Yangfan Zhou, Li Yang, Jeffrey Sun, Zhangwei Xu, Yingnong Dang, Feng Gao, Pu Zhao, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Michael R Lyu. 2020. Towards Intelligent Incident Management: Why We Need It and How We Make It. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’20). https://doi.org/10.1145/3368089.3417055
[13]
Zhuangbin Chen, Jinyang Liu, Yuxin Su, Hongyu Zhang, Xiao Ling, Yongqiang Yang, and Michael R. Lyu. 2022. Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching. In The 44th International Conference on Software Engineering (ICSE 2022). https://doi.org/10.1145/3510003.3510085
[14]
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations Using RNN Encoder– Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.3115/v1/D14-1179
[15]
J. Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.
[16]
Yingnong Dang, Qingwei Lin, and Peng Huang. 2019. AIOps: Real-World Challenges and Research Innovations. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). https://doi.org/10.1109/ICSE-Companion.2019.00023
[17]
Vincenzo Di Cicco, Donatella Firmani, Nick Koudas, Paolo Merialdo, and Divesh Srivastava. 2019. Interpreting Deep Learning Models for Entity Resolution: An Experience Report Using LIME. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. https://doi.org/10.1145/3329859.3329878
[18]
Pradeep Dogga, Karthik Narasimhan, Anirudh Sivaraman, Shiv Kumar Saini, George Varghese, and Ravi Netravali. 2022. Revelio: ML-Generated Debugging Queries for Finding Root Causes in Distributed Systems. In Proceedings of Machine Learning and Systems 2022, MLSys 2022, Santa Clara, CA, USA, August 29 - September 1, 2022, Diana Marculescu, Yuejie Chi, and Carole-Jean Wu (Eds.).
[19]
Vincent Dumoulin and Francesco Visin. 2016. A Guide to Convolution Arithmetic for Deep Learning. arXiv preprint arXiv:1603.07285.
[20]
Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. 2021. Sage: Practical and Scalable ML-driven Performance Debugging in Microservices. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. https://doi.org/10.1145/3445814.3446700
[21]
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/
[22]
Xiaofeng Guo, Xin Peng, Hanzhang Wang, Wanxue Li, Huai Jiang, Dan Ding, Tao Xie, and Liangfei Su. 2020. Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. https://doi.org/10.1145/3368089.3417066
[23]
William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017.
[24]
Shilin He, Qingwei Lin, Jian-Guang Lou, Hongyu Zhang, Michael R. Lyu, and Dongmei Zhang. 2018. Identifying Impactful Service System Problems via Log Analysis. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. https://doi.org/10.1145/3236024.3236083
[25]
Dan Hendrycks and Kevin Gimpel. 2020. Gaussian Error Linear Units (GELUs). arXiv:1606.08415 [cs], July.
[26]
Xuan Huo, Ferdian Thung, Ming Li, David Lo, and Shu-Ting Shi. 2021. Deep Transfer Bug Localization. IEEE Transactions on Software Engineering, 47, 7 (2021), July, https://doi.org/10.1109/TSE.2019.2920771
[27]
Myunghwan Kim, Roshan Sumbaly, and Sam Shah. 2013. Root Cause Detection in a Service-Oriented Architecture. In ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’13, Pittsburgh, PA, USA, June 17-21, 2013, Mor Harchol-Balter, John R. Douceur, and Jun Xu (Eds.). https://doi.org/10.1145/2465529.2465753
[28]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
[29]
Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
[30]
An Ngoc Lam, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2017. Bug Localization with Combination of Deep Learning and Information Retrieval. In 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). https://doi.org/10.1109/ICPC.2017.24
[31]
I. Lee and R.K. Iyer. 2000. Diagnosing Rediscovered Software Problems Using Symptoms. IEEE Transactions on Software Engineering, 26, 2 (2000), https://doi.org/10.1109/32.841113
[32]
Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2019. DeepGCNs: Can GCNs Go as Deep as CNNs? In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV.2019.00936
[33]
Xia Li, Wei Li, Yuqun Zhang, and Lingming Zhang. 2019. DeepFL: Integrating Multiple Fault Diagnosis Dimensions for Deep Fault Localization. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis - ISSTA 2019. https://doi.org/10.1145/3293882.3330574
[34]
Zeyan Li, Junjie Chen, Rui Jiao, Nengwen Zhao, Zhijun Wang, Shuwei Zhang, Yanjun Wu, Long Jiang, Leiqin Yan, Zikai Wang, Zhekang Chen, Wenchi Zhang, Xiaohui Nie, Kaixin Sui, and Dan Pei. 2021. Practical Root Cause Localization for Microservice Systems via Trace Analysis. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). https://doi.org/10.1109/IWQOS52092.2021.9521340
[35]
Q. Lin, H. Zhang, J. Lou, Y. Zhang, and X. Chen. 2016. Log Clustering Based Problem Identification for Online Service Systems. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C). https://doi.org/10.1145/2889160.2889232
[36]
Dewei Liu. 2020. MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems. In ICSE 2021 Software Engineering in Practice. https://doi.org/10.1109/ICSE-SEIP52600.2021.00043
[37]
Ping Liu, Yu Chen, Xiaohui Nie, Jing Zhu, Shenglin Zhang, Kaixin Sui, Ming Zhang, and Dan Pei. 2019. FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation. In 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE). https://doi.org/10.1109/ISSRE.2019.00014
[38]
P. Liu, H. Xu, Q. Ouyang, R. Jiao, Z. Chen, S. Zhang, J. Yang, L. Mo, J. Zeng, W. Xue, and D. Pei. 2020. Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE). https://doi.org/10.1109/ISSRE5003.2020.00014
[39]
Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang, and Tao Xie. 2013. Software Analytics for Incident Management of Online Services: An Experience Report. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). https://doi.org/10.1109/ASE.2013.6693105
[40]
Meng Ma, Jingmin Xu, Yuan Wang, Pengfei Chen, Zonghua Zhang, and Ping Wang. 2020. AutoMAP: Diagnose Your Microservice-based Web Applications Automatically. In Proceedings of The Web Conference 2020. https://doi.org/10.1145/3366423.3380111
[41]
Minghua Ma, Zheng Yin, Shenglin Zhang, Sheng Wang, Christopher Zheng, Xinhao Jiang, Hanwen Hu, Cheng Luo, Yilin Li, Nengjun Qiu, Feifei Li, Changcheng Chen, and Dan Pei. 2020. Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases. Proceedings of the VLDB Endowment, 13, 8 (2020), April, https://doi.org/10.14778/3389133.3389136
[42]
Leonardo Mariani, Mauro Pezzè, Oliviero Riganelli, and Rui Xin. 2020. Predicting Failures in Multi-Tier Distributed Systems. Journal of Systems and Software, 161 (2020), March, https://doi.org/10.1016/j.jss.2019.110464
[43]
Zili Meng, Minhu Wang, Jiasong Bai, Mingwei Xu, Hongzi Mao, and Hongxin Hu. 2020. Interpreting Deep Learning-Based Networking Systems. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication. https://doi.org/10.1145/3387514.3405859
[44]
Emanuel Metzenthin. 2022. LIME For Time. https://github.com/emanuel-metzenthin/Lime-For-Time
[45]
Christoph Molnar. 2020. Interpretable Machine Learning. Lulu.com.
[46]
Animesh Nandi, Atri Mandal, Shubham Atreja, Gargi B. Dasgupta, and Subhrajit Bhattacharya. 2016. Anomaly Detection Using Program Control Flow Graph Mining From Execution Logs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939712
[47]
Cuong Pham, Long Wang, Byung Chul Tak, Salman Baset, Chunqiang Tang, Zbigniew Kalbarczyk, and Ravishankar K Iyer. 2017. Failure Diagnosis for Distributed Systems Using Targeted Fault Injection. IEEE Transactions on Parallel and Distributed Systems, 28, 2 (2017), https://doi.org/10.1109/TPDS.2016.2575829
[48]
Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi (Eds.). https://doi.org/10.1145/2939672.2939778
[49]
Jörg Thalheim, Antonio Rodrigues, Istemi Ekin Akkus, Pramod Bhatotia, Ruichuan Chen, Bimal Viswanath, Lei Jiao, and Christof Fetzer. 2017. Sieve: Actionable Insights from Monitored Metrics in Distributed Systems. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference. https://doi.org/10.1145/3135974.3135977
[50]
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
[51]
Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv preprint arXiv:1909.01315.
[52]
Canhua Wu, Nengwen Zhao, Lixin Wang, Xiaoqin Yang, Shining Li, Ming Zhang, Xing Jin, Xidao Wen, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, and Dan Pei. 2021. Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems. In 2021 IEEE 32th International Symposium on Software Reliability Engineering (ISSRE). https://doi.org/10.1109/ISSRE52982.2021.00022
[53]
Li Wu, Johan Tordsson, Jasmin Bogatinovski, Erik Elmroth, and Odej Kao. 2021. MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems. In ICSE21 Workshop on Cloud Intelligence.
[54]
Yanming Yang, Xin Xia, David Lo, and John Grundy. 2021. A Survey on Deep Learning for Software Engineering. Comput. Surveys, Dec., https://doi.org/10.1145/3505243
[55]
Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. 2019. Gnnexplainer: Generating Explanations for Graph Neural Networks. Advances in neural information processing systems, 32 (2019).
[56]
Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. 2021. MicroRank: End-to-end Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments. In Proceedings of the Web Conference 2021. https://doi.org/10.1145/3442381.3449905
[57]
Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). 8689, https://doi.org/10.1007/978-3-319-10590-1_53
[58]
Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng, Ze Li, Junjie Chen, Xiaoting He, Randolph Yao, Jian-Guang Lou, Murali Chintalapati, Furao Shen, and Dongmei Zhang. 2019. Robust Log-Based Anomaly Detection on Unstable Log Data. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. https://doi.org/10.1145/3338906.3338931
[59]
Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2018. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. IEEE Transactions on Software Engineering, https://doi.org/10.1109/TSE.2018.2887384
[60]
Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Dewei Liu, Qilin Xiang, and Chuan He. 2019. Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. https://doi.org/10.1145/3338906.3338961
[61]
Zhi-Hua Zhou. 2021. Machine Learning. Springer Singapore. https://doi.org/10.1007/978-981-15-1967-3

Cited By

View all
  • (2024)D-Bot: Database Diagnosis System using Large Language ModelsProceedings of the VLDB Endowment10.14778/3675034.367504317:10(2514-2527)Online publication date: 1-Jun-2024
  • (2024)ART: A Unified Unsupervised Framework for Incident Management in Microservice SystemsProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695495(1183-1194)Online publication date: 27-Oct-2024
  • (2024)Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive OptimizationProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695489(1107-1119)Online publication date: 27-Oct-2024
  • Show More Cited By

Index Terms

  1. Actionable and interpretable fault localization for recurring failures in online service systems

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    November 2022
    1822 pages
    ISBN:9781450394130
    DOI:10.1145/3540250
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 November 2022

    Permissions

    Request permissions for this article.

    Check for updates

    Badges

    Author Tags

    1. Fault Localization
    2. Online Service Systems
    3. Recurring Failures

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE '22
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)620
    • Downloads (Last 6 weeks)86
    Reflects downloads up to 22 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)D-Bot: Database Diagnosis System using Large Language ModelsProceedings of the VLDB Endowment10.14778/3675034.367504317:10(2514-2527)Online publication date: 1-Jun-2024
    • (2024)ART: A Unified Unsupervised Framework for Incident Management in Microservice SystemsProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695495(1183-1194)Online publication date: 27-Oct-2024
    • (2024)Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive OptimizationProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695489(1107-1119)Online publication date: 27-Oct-2024
    • (2024)The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small ClassifierProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695475(931-943)Online publication date: 27-Oct-2024
    • (2024)Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695065(706-715)Online publication date: 27-Oct-2024
    • (2024)SLIM: a Scalable and Interpretable Light-weight Fault Localization Algorithm for Imbalanced Data in MicroserviceProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3694984(27-39)Online publication date: 27-Oct-2024
    • (2024)Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating SystemsCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering10.1145/3663529.3663834(126-137)Online publication date: 10-Jul-2024
    • (2024)Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal GraphCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering10.1145/3663529.3663827(50-61)Online publication date: 10-Jul-2024
    • (2024)ChangeRCA: Finding Root Causes from Software Changes in Large Online SystemsProceedings of the ACM on Software Engineering10.1145/36437281:FSE(24-46)Online publication date: 12-Jul-2024
    • (2024)FaultInsight: Interpreting Hyperscale Data Center Host FaultsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3672051(141-152)Online publication date: 25-Aug-2024
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media