research-article

Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

Published: 12 June 2020

Abstract

Root cause analysis in a large-scale production environment is challenging due to the complexity of the services running across global data centers. Because a large-scale system is distributed, the various hardware, software, and tooling logs are often maintained separately, making it difficult to review the logs jointly to understand production issues. Another challenge in reviewing the logs to identify issues is scale: there could easily be millions of entities, each described by hundreds of features. In this paper we present a fast dimensional analysis framework that automates root cause analysis on structured logs with improved scalability. We first explore item-sets, i.e., combinations of feature values, that could identify groups of samples with sufficient support for the target failures, using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them to structured logs, we select the item-sets that are most unique to the target failures based on lift. We propose pre-processing steps that use a large-scale real-time database, together with post-processing techniques and parallelism, to further speed up the analysis and improve interpretability, and we demonstrate that such optimizations are necessary for handling large-scale production datasets. We have successfully rolled out this approach for root cause investigation within Facebook's infrastructure. We also present the setup and results from multiple production use cases in this paper.
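
The item-set idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it is a plain Apriori-style level-wise search over a toy structured log, and the function name `mine_itemsets`, the feature names (`kernel`, `rack`), and the thresholds are all hypothetical. It also assumes one particular reading of the two measures: support is the fraction of all samples that contain the item-set and failed, and lift is P(failure | item-set) / P(failure).

```python
from itertools import combinations


def mine_itemsets(rows, failed, min_support=0.2, min_lift=1.5, max_len=2):
    """Return (itemset, support, lift) tuples, highest lift first.

    rows   : list of dicts, feature name -> value (one dict per sample)
    failed : list of bools, True where the sample is a target failure
    Assumed definitions (illustrative, not taken from the paper):
      support = fraction of all samples that contain the item-set AND failed
      lift    = P(failure | item-set) / P(failure)
    """
    n = len(rows)
    p_fail = sum(failed) / n if n else 0.0
    if p_fail == 0.0:
        return []
    # Encode every sample as a set of (feature, value) items.
    transactions = [frozenset(row.items()) for row in rows]

    def stats(itemset):
        # Failure flags of the samples that contain every item in the set.
        hits = [f for tx, f in zip(transactions, failed) if itemset <= tx]
        n_fail = sum(hits)
        support = n_fail / n
        lift = (n_fail / len(hits)) / p_fail if hits else 0.0
        return support, lift

    results = []
    # Level 1 candidates: every single (feature, value) item seen in the log.
    level = {frozenset([item]) for tx in transactions for item in tx}
    for size in range(1, max_len + 1):
        survivors = []
        for cand in level:
            support, lift = stats(cand)
            if support >= min_support:
                survivors.append(cand)
                if lift >= min_lift:
                    results.append((cand, support, lift))
        # Apriori step: only item-sets with enough support are joined into
        # larger candidates, since a superset can never have higher support.
        level = {a | b for a, b in combinations(survivors, 2)
                 if len(a | b) == size + 1}
    results.sort(key=lambda r: (-r[2], -r[1], sorted(r[0])))
    return results


# Toy structured log: three of the four failures share kernel 4.16.
rows = [
    {"kernel": "4.16", "rack": "A"},
    {"kernel": "4.16", "rack": "B"},
    {"kernel": "4.16", "rack": "A"},
    {"kernel": "4.11", "rack": "A"},
    {"kernel": "4.11", "rack": "B"},
    {"kernel": "4.11", "rack": "B"},
    {"kernel": "4.11", "rack": "A"},
    {"kernel": "4.11", "rack": "B"},
]
failed = [True, True, True, False, False, True, False, False]

results = mine_itemsets(rows, failed)
for itemset, support, lift in results:
    print(sorted(itemset), support, lift)
```

On this toy log the search keeps only item-sets involving `kernel = 4.16`, since the rack values alone fail at the same rate as the whole population (lift 1.0). A production-scale version would replace the brute-force counting with FP-Growth and add the pre- and post-processing the abstract mentions.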



Information & Contributors

Information

Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 4, Issue 2
SIGMETRICS
June 2020
623 pages
EISSN:2476-1249
DOI:10.1145/3405833
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2020
Online AM: 07 May 2020
Published in POMACS Volume 4, Issue 2

Author Tags

  1. anomaly detection
  2. dimension correlation analysis
  3. investigation analysis
  4. large-scale service environment
  5. root cause analysis

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 65
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 23 Nov 2024

Citations

Cited By

  • (2025) SinkFlow: Fast and traceable root-cause localization for multidimensional anomaly events. Engineering Applications of Artificial Intelligence 139 (109582). DOI: 10.1016/j.engappai.2024.109582. Online publication date: Jan-2025
  • (2024) The Diagnosis-Effective Sampling of Application Traces. Applied Sciences 14:13 (5779). DOI: 10.3390/app14135779. Online publication date: 2-Jul-2024
  • (2024) Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639088. Online publication date: 20-May-2024
  • (2024) Data-driven root cause analysis via causal discovery using time-to-event data. Computers & Industrial Engineering 190 (109974). DOI: 10.1016/j.cie.2024.109974. Online publication date: Apr-2024
  • (2023) Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (553-565). DOI: 10.1145/3611643.3616249. Online publication date: 30-Nov-2023
  • (2023) LogRule: Efficient Structured Log Mining for Root Cause Analysis. IEEE Transactions on Network and Service Management 20:4 (4231-4243). DOI: 10.1109/TNSM.2023.3282270. Online publication date: Dec-2023
  • (2023) MicroKGCL: A Knowledge Graph for Root Cause Localization of Feedback Issues in Microservices. 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS) (628-637). DOI: 10.1109/QRS60937.2023.00067. Online publication date: 22-Oct-2023
  • (2023) FTM-RCA: A Fast Two-Stage Multi-dimensional Root-Cause Analysis of Network Anomalies. 2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS) (01-10). DOI: 10.1109/IWQoS57198.2023.10188732. Online publication date: 19-Jun-2023
  • (2023) Measurement of Student Physical Fitness Based on Association Rule Algorithm. 2023 International Conference on Applied Intelligence and Sustainable Computing (ICAISC) (1-8). DOI: 10.1109/ICAISC58445.2023.10199789. Online publication date: 16-Jun-2023
  • (2023) Generic and robust root cause localization for multi-dimensional data in online service systems. Journal of Systems and Software 203:C. DOI: 10.1016/j.jss.2023.111748. Online publication date: 13-Jul-2023
