HEAL: Performance Troubleshooting Deep inside Data Center Hosts

Authors: Xiangzhuang Shen, Ping Wang

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 7, Issue 3

Article No.: 54, Pages 1 - 24

https://doi.org/10.1145/3626785

Published: 12 December 2023

Abstract

This study demonstrates the salient facts and challenges of host failure operations in hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics covering a broad range of aspects. The fault mechanism inside the host links these heterogeneous metrics through direct and indirect correlations, making it extremely difficult to untangle the propagation paths and the root cause from the intertwined indicators. To understand the failure mechanism inside the host in depth, we develop HEAL, a novel host metrics analysis toolkit. HEAL discovers dynamic causality in sparse, heterogeneous host metrics by combining the strengths of time series analysis and random variable analysis. It also proactively extracts hints about causal direction from the asymmetry of causality and from historical knowledge. Together, these techniques help HEAL produce accurate results even from undesirable inputs. Extensive experiments in our production environment verify that HEAL provides significantly better accuracy and full-process interpretability than the state-of-the-art baselines. With these advantages, HEAL serves our data center and worldwide product operations and contributes to many other workflows.
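The abstract does not spell out how directional hints are extracted, so the following is only an illustrative sketch of the general idea that asymmetry between two metrics can suggest a causal direction. It is not HEAL's algorithm. It compares lagged Pearson correlations in both directions on synthetic data; the metric names `disk_queue` and `io_wait`, the 2-sample delay, and the helper names `lagged_corr` and `direction_hint` are all assumptions made for this example.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def lagged_corr(cause, effect, lag):
    """Correlation between cause[t - lag] and effect[t] (lag >= 1)."""
    return pearson(cause[:-lag], effect[lag:])


def direction_hint(x, y, max_lag=5):
    """Return (strongest |corr| with x leading y, strongest |corr| with y leading x)."""
    x_leads = max(abs(lagged_corr(x, y, k)) for k in range(1, max_lag + 1))
    y_leads = max(abs(lagged_corr(y, x, k)) for k in range(1, max_lag + 1))
    return x_leads, y_leads


# Synthetic, hypothetical metrics: io_wait follows disk_queue with a 2-sample delay.
rng = random.Random(0)
disk_queue = [rng.gauss(0, 1) for _ in range(300)]
io_wait = [0.0, 0.0] + [
    0.8 * disk_queue[t - 2] + 0.2 * rng.gauss(0, 1) for t in range(2, 300)
]

x_leads, y_leads = direction_hint(disk_queue, io_wait)
hint = "disk_queue -> io_wait" if x_leads > y_leads else "io_wait -> disk_queue"
print(hint)
```

A lagged correlation is only a weak hint: it ignores confounders and nonlinear dependence, which is presumably why a toolkit of this kind would combine several kinds of analysis with historical knowledge rather than rely on one statistic.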

Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems Volume 7, Issue 3

POMACS

December 2023

599 pages

EISSN: 2476-1249

DOI: 10.1145/3637453

Editors:
Augustin Chaintreau, Columbia University
Leana Golubchik, University of Southern California, United States
Zhi-Li Zhang, University of Minnesota, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Qiyuan Lab Innovation Fund
ByteDance University Research Project
National Natural Science Foundation of China
