Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback

Authors:

Saravan Rajmohan,

Dongmei Zhang

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 5116 - 5125

https://doi.org/10.1145/3580305.3599934

Published: 04 August 2023

Abstract

In microservice systems, identifying the root causes of anomalies is essential for maintaining service reliability and limiting business impact. This process is typically divided into two phases: (i) constructing a service dependency graph that outlines the sequence and structure of the system components that are invoked, and (ii) localizing the root cause components using the graph, traces, logs, and Key Performance Indicators (KPIs) such as latency. However, neither phase is straightforward because the system is highly dynamic and complex, particularly in large-scale commercial architectures like Microsoft Exchange.

In this paper, we propose a new framework that employs Hierarchical Reinforcement Learning from Human Feedback (HRLHF) to address these challenges. Our framework leverages the static topology of the microservice system and uses engineers' feedback efficiently to reduce uncertainty in the discovery of the service dependency graph. The framework uses reinforcement learning to reduce the number of queries required from O(N²) to O(1), enabling the construction of the dependency graph with high accuracy and minimal human effort. Additionally, we extend the discovered dependency graphs to window causal graphs that capture the characteristics of time series over a specified time period, resulting in improved root cause analysis accuracy and robustness. Evaluations on both real datasets from Microsoft Exchange and synthetic datasets with injected anomalies demonstrate superior performance on various metrics compared to state-of-the-art methods. Our framework has been integrated as a crucial component of the Microsoft M365 Exchange service.
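
As a rough illustration of the window causal graph idea mentioned in the abstract, the following Python sketch unrolls a summary dependency graph over a fixed number of time lags. The function name `window_causal_edges`, the service names, and the rule of attaching every lag to each summary edge are assumptions made for illustration; the abstract does not specify how the paper constructs or prunes these graphs.

```python
def window_causal_edges(services, summary_edges, max_lag, include_self=True):
    """Unroll a summary dependency graph into candidate window-graph edges.

    Hypothetical helper, not the paper's method.
    services      : iterable of service names
    summary_edges : set of (cause, effect) pairs between services
    max_lag       : number of past time steps kept in the window
    Nodes are (service, lag) pairs, where lag 0 is the current time step.
    """
    edges = set()
    for cause, effect in summary_edges:
        for lag in range(max_lag + 1):
            # Skip an instantaneous self-loop, which would make the graph cyclic.
            if cause == effect and lag == 0:
                continue
            edges.add(((cause, lag), (effect, 0)))
    if include_self:
        # Assumption: each KPI series may depend on its own recent past.
        for service in services:
            for lag in range(1, max_lag + 1):
                edges.add(((service, lag), (service, 0)))
    return edges


# Toy example: service "A" influences service "B", window of two past steps.
edges = window_causal_edges(["A", "B"], {("A", "B")}, max_lag=2)
for edge in sorted(edges):
    print(edge)
```

The result is only a candidate edge set: a summary edge says that one service can influence another, while the window graph additionally records at which lag the influence may occur. A learning step would be expected to decide which of these lagged edges to keep.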

Supplementary Material

MP4 File (rca_agent_video.mp4)

Presentation video - short version of Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback

Index Terms

Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Reinforcement learning
        Sequential decision making

Information & Contributors

Information

Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2023

5996 pages

ISBN:9798400701030

DOI:10.1145/3580305

General Chairs:
Ambuj Singh (UC Santa Barbara, USA)
Yizhou Sun (UC Los Angeles, USA)

Program Chairs:
Leman Akoglu (Carnegie Mellon University, USA)
Dimitrios Gunopulos (University of Athens, Greece)
Xifeng Yan (UC Santa Barbara, USA)
Ravi Kumar (Google, USA)
Fatma Ozcan (Google, USA)
Jieping Ye (Alibaba DAMO Academy)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD '23

KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 6 - 10, 2023

Long Beach, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Article Metrics

Total Citations: 8
Total Downloads: 1,002
Downloads (Last 12 months): 727
Downloads (Last 6 weeks): 82

Reflects downloads up to 25 Nov 2024

Citations

Cited By

Zhang C, Yao R, Qin S, Li Z, Agrawal S, Mishra B, Tran T, Ma M, Lin Q, Chintalapati M, Zhang D (2024). Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure. Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 361-379. doi:10.1145/3698038.3698534. Online publication date: 20-Nov-2024.
https://dl.acm.org/doi/10.1145/3698038.3698534
Pham L, Ha H, Zhang H, Filkov V, Ray B, Zhou M (2024). Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 706-715. doi:10.1145/3691620.3695065. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695065
Zhang S, Zhao Y, Xiong X, Sun Y, Nie X, Zhang J, Wang F, Zheng X, Zhang Y, Pei D, d'Amorim M (2024). Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 126-137. doi:10.1145/3663529.3663834. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3663529.3663834
Zan L, Assaad C, Devijver E, Gaussier E, Aït-Bachir A, Serra E, Spezzano F (2024). On the Fly Detection of Root Causes from Observed Data with Application to IT Systems. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 5062-5069. doi:10.1145/3627673.3680010. Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3680010
Zheng L, Chen Z, He J, Chen H, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H (2024). MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems. Proceedings of the ACM Web Conference 2024, pages 4107-4116. doi:10.1145/3589334.3645442. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645442
Shanmugapriya M, Venkatramaraju D (2024). Elevating Strategy with Cloud-Based Reinforcement Learning for Dynamic Decision-Making in Modern Enterprises. 2024 3rd International Conference for Innovation in Technology (INOCON), pages 1-6. doi:10.1109/INOCON60754.2024.10511401. Online publication date: 1-Mar-2024.
https://doi.org/10.1109/INOCON60754.2024.10511401
Chen Y, Zhang C, Ma M, Liu Y, Ding R, Li B, He S, Rajmohan S, Lin Q, Zhang D (2023). ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection. Proceedings of the VLDB Endowment, 17:3, pages 359-372. doi:10.14778/3632093.3632101. Online publication date: 1-Nov-2023.
https://dl.acm.org/doi/10.14778/3632093.3632101
Ding R, Zhang C, Wang L, Xu Y, Ma M, Wu X, Zhang M, Chen Q, Gao X, Gao X, Fan H, Rajmohan S, Lin Q, Zhang D, Chandra S, Blincoe K, Tonella P (2023). TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1762-1773. doi:10.1145/3611643.3613864. Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3611643.3613864
