DOI: 10.1145/3580305.3599934
research-article
Free access

Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback

Published: 04 August 2023

Abstract

In microservice systems, identifying the root causes of anomalies is essential for service reliability and for limiting business impact. This process is typically divided into two phases: (i) constructing a service dependency graph that outlines the sequence and structure of the system components that are invoked, and (ii) localizing the root cause components using the graph, traces, logs, and Key Performance Indicators (KPIs) such as latency. However, neither phase is straightforward, owing to the highly dynamic and complex nature of such systems, particularly in large-scale commercial architectures like Microsoft Exchange.
In this paper, we propose a new framework that employs Hierarchical Reinforcement Learning from Human Feedback (HRLHF) to address these challenges. Our framework leverages the static topology of the microservice system and uses feedback from engineers to reduce uncertainty in the discovery of the service dependency graph. The framework utilizes reinforcement learning to reduce the number of queries required from O(N^2) to O(1), enabling the construction of the dependency graph with high accuracy and minimal human effort. Additionally, we extend the discovered dependency graphs to window causal graphs that capture the characteristics of time series over a specified time period, resulting in improved root cause analysis accuracy and robustness. Evaluations on both real datasets from Microsoft Exchange and synthetic datasets with injected anomalies demonstrate superior performance on various metrics compared to state-of-the-art methods. Notably, our framework has been integrated as a crucial component of the Microsoft M365 Exchange service.
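
The two graph-related ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes that N denotes the number of services, that the O(N^2) baseline corresponds to asking an engineer about every ordered pair of services, and that a window causal graph can be approximated by unrolling each dependency edge over lags 0 to max_lag. The function names and the service names are hypothetical.

```python
from itertools import permutations


def naive_query_count(services):
    """Count ordered service pairs (cause, effect) with cause != effect.

    Assumption: a brute-force baseline asks one question per ordered pair,
    which gives N * (N - 1) queries, i.e. O(N^2).
    """
    return sum(1 for _ in permutations(services, 2))


def window_causal_graph(dependency_edges, max_lag):
    """Unroll a summary dependency graph into lagged edges.

    Nodes are (service, time_offset) tuples, where offset 0 is the current
    time step and -k is k steps in the past. Each summary edge
    (cause, effect) becomes one edge per lag in 0..max_lag, pointing from
    the lagged cause to the current effect. Self-lagged edges are omitted.
    """
    edges = set()
    for cause, effect in dependency_edges:
        for lag in range(max_lag + 1):
            edges.add(((cause, -lag), (effect, 0)))
    return edges


# Hypothetical services and one hypothetical dependency edge.
services = ["frontend", "auth", "mailbox", "storage"]
print(naive_query_count(services))
print(sorted(window_causal_graph([("frontend", "auth")], max_lag=2)))
```

The unrolled representation keeps one node per service and time offset, so a time-series method can score each lagged edge separately; how the paper actually weights or prunes these edges is not stated in the abstract.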

Supplementary Material

MP4 File (rca_agent_video.mp4)
Presentation video - short version of Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback

    Published In

    KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2023
    5996 pages
    ISBN: 9798400701030
    DOI: 10.1145/3580305

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. causal discovery
    2. reinforcement learning from human feedback
    3. root cause analysis

    Qualifiers

    • Research-article

    Conference

    KDD '23
    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Article Metrics

    • Downloads (Last 12 months): 727
    • Downloads (Last 6 weeks): 82
    Reflects downloads up to 25 Nov 2024

    Cited By

    • (2024) Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure. Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 361-379. DOI: 10.1145/3698038.3698534. Online publication date: 20-Nov-2024
    • (2024) Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 706-715. DOI: 10.1145/3691620.3695065. Online publication date: 27-Oct-2024
    • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 126-137. DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024
    • (2024) On the Fly Detection of Root Causes from Observed Data with Application to IT Systems. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 5062-5069. DOI: 10.1145/3627673.3680010. Online publication date: 21-Oct-2024
    • (2024) MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems. Proceedings of the ACM Web Conference 2024, pages 4107-4116. DOI: 10.1145/3589334.3645442. Online publication date: 13-May-2024
    • (2024) Elevating Strategy with Cloud-Based Reinforcement Learning for Dynamic Decision-Making in Modern Enterprises. 2024 3rd International Conference for Innovation in Technology (INOCON), pages 1-6. DOI: 10.1109/INOCON60754.2024.10511401. Online publication date: 1-Mar-2024
    • (2023) ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection. Proceedings of the VLDB Endowment, 17:3, pages 359-372. DOI: 10.14778/3632093.3632101. Online publication date: 1-Nov-2023
    • (2023) TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1762-1773. DOI: 10.1145/3611643.3613864. Online publication date: 30-Nov-2023
