HEAL: Performance Troubleshooting Deep inside Data Center Hosts

Published: 12 December 2023

Abstract

This study presents the salient facts and challenges of handling host failures in hyperscale data centers. A single host incident can involve hundreds of distinct host-level metrics covering many aspects of the system. The fault mechanism inside the host links these heterogeneous metrics through direct and indirect correlations, making it extremely difficult to untangle the propagation path and the root cause from the intertwined indicators. To understand the failure mechanism inside the host in depth, we develop HEAL, a novel host metrics analysis toolkit. HEAL discovers dynamic causality in sparse, heterogeneous host metrics by combining the strengths of time series analysis and random variable analysis. It also proactively extracts hints about causal direction from the asymmetry of causality and from historical knowledge. Together, these techniques help HEAL produce accurate results even from undesirable inputs. Extensive experiments in our production environment verify that HEAL provides significantly better accuracy and full-process interpretability than the state-of-the-art (SOTA) baselines. With these advantages, HEAL serves our data center and worldwide product operations and contributes to many other workflows.
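
The abstract does not spell out HEAL's algorithm, so the following is only a rough illustration of the kind of time series causality test that the author tags name (Granger causality analysis). It is a minimal sketch, not HEAL's method. The metric names (cpu, latency), the synthetic data, and the helper functions lagged and granger_f are all hypothetical. The test asks whether past values of one metric improve a linear prediction of another beyond that metric's own history, and compares both directions to obtain a directional hint.

```python
import numpy as np


def lagged(series, lags):
    """Return columns series[t-1], ..., series[t-lags] for t = lags..n-1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])


def granger_f(cause, effect, lags=2):
    """F-statistic for 'cause Granger-causes effect' with a linear AR model."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    target = effect[lags:]
    ones = np.ones((len(target), 1))
    own = lagged(effect, lags)      # the effect's own history
    other = lagged(cause, lags)     # the candidate cause's history

    def rss(design):
        # Residual sum of squares of an ordinary least-squares fit.
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_restricted = rss(np.hstack([ones, own]))
    full = np.hstack([ones, own, other])
    rss_full = rss(full)
    dof = len(target) - full.shape[1]
    return ((rss_restricted - rss_full) / lags) / (rss_full / dof)


# Synthetic example (assumed data): latency follows cpu with a one-step delay.
rng = np.random.default_rng(0)
cpu = rng.normal(size=500)
latency = np.zeros(500)
latency[1:] = 0.8 * cpu[:-1] + 0.1 * rng.normal(size=499)

f_forward = granger_f(cpu, latency)    # does cpu help predict latency?
f_backward = granger_f(latency, cpu)   # does latency help predict cpu?
print(f"cpu -> latency F = {f_forward:.1f}")
print(f"latency -> cpu F = {f_backward:.1f}")
```

A much larger F-statistic in one direction than in the other suggests a causal direction under the linear, stationary assumptions of the test. Real host metrics are often sparse and non-stationary, which is the setting the abstract says HEAL targets.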


Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 7, Issue 3
POMACS
December 2023
599 pages
EISSN: 2476-1249
DOI: 10.1145/3637453

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Granger causality analysis
  2. dynamic causality
  3. host machines
  4. monitoring metrics
  5. root cause analysis

Qualifiers

  • Research-article
