research-article

Ganesha: Black-Box Diagnosis of MapReduce Systems

Published: 21 January 2010

Abstract

Ganesha aims to diagnose faults transparently (in a black-box manner) in MapReduce systems, by analyzing OS-level metrics. Ganesha's approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. We evaluate Ganesha by diagnosing Hadoop problems for the Gridmix Hadoop benchmark on 10-node and 50-node MapReduce clusters on Amazon's EC2. We also candidly highlight faults that escape Ganesha's diagnosis.
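
The peer-symmetry idea can be illustrated with a small sketch. It is not the paper's algorithm: it assumes one OS-level metric sample per node, and it uses a median-based deviation score as a stand-in for whatever comparison Ganesha actually performs. The function name `flag_asymmetric_nodes`, the node names, the metric values, and the threshold of 3.0 are all hypothetical.

```python
from statistics import median


def flag_asymmetric_nodes(samples, threshold=3.0):
    """Return the nodes whose metric deviates strongly from their peers.

    samples: dict mapping a node name to one OS-level metric value
             (for example, CPU utilization over the same time window).
    threshold: hypothetical cutoff, in units of median absolute deviation.
    """
    values = list(samples.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that a single
    # misbehaving node cannot easily distort.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Peers agree exactly; any node that differs at all stands out.
        return sorted(n for n, v in samples.items() if v != center)
    return sorted(
        n for n, v in samples.items() if abs(v - center) / mad > threshold
    )


# Hypothetical CPU utilization (%) on five slave nodes in one window.
cpu = {"node1": 21.0, "node2": 19.5, "node3": 20.5, "node4": 20.0, "node5": 95.0}
print(flag_asymmetric_nodes(cpu))  # ['node5']
```

A sketch like this can only flag a node that behaves differently from most of its peers. A fault that affects every node in the same way leaves the peers symmetric and goes unnoticed, which is consistent with the abstract's remark that some faults escape diagnosis.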

Published In

ACM SIGMETRICS Performance Evaluation Review, Volume 37, Issue 3
December 2009
70 pages
ISSN: 0163-5999
DOI: 10.1145/1710115

Publisher

Association for Computing Machinery

New York, NY, United States

Article Metrics

  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) KBJNet: Kinematic Bi-Joint Temporal Convolutional Network Attention for Anomaly Detection in Multivariate Time Series Data. Data Science Journal, 23. DOI: 10.5334/dsj-2024-010. Online publication date: 4-Mar-2024
  • (2024) Automatic Debugging of Design Faults in MapReduce Applications. IEEE Transactions on Software Engineering, 50:4 (956-978). DOI: 10.1109/TSE.2024.3369766. Online publication date: 26-Feb-2024
  • (2024) DTAAD: Dual Tcn-attention networks for anomaly detection in multivariate time series data. Knowledge-Based Systems, 295 (111849). DOI: 10.1016/j.knosys.2024.111849. Online publication date: Jul-2024
  • (2023) HEAL: Performance Troubleshooting Deep inside Data Center Hosts. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7:3 (1-24). DOI: 10.1145/3626785. Online publication date: 7-Dec-2023
  • (2023) Locating Anomaly Clues for Atypical Anomalous Services: An Industrial Exploration. IEEE Transactions on Dependable and Secure Computing, 20:4 (2746-2761). DOI: 10.1109/TDSC.2022.3181143. Online publication date: 1-Jul-2023
  • (2023) Context-aware Outlier Detection for Sensor Data Stream Processing. 2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops) (540-545). DOI: 10.1109/PerComWorkshops56833.2023.10150217. Online publication date: 13-Mar-2023
  • (2023) Anonymous Decentralized Chat Engine Over Overlay Networks. 2023 4th International Conference for Emerging Technology (INCET) (1-6). DOI: 10.1109/INCET57972.2023.10170194. Online publication date: 26-May-2023
  • (2023) Autonomous anomaly detection on traffic flow time series with reinforcement learning. Transportation Research Part C: Emerging Technologies, 150 (104089). DOI: 10.1016/j.trc.2023.104089. Online publication date: May-2023
  • (2023) Machine learning job failure analysis and prediction model for the cloud environment. High-Confidence Computing, 3:4 (100165). DOI: 10.1016/j.hcc.2023.100165. Online publication date: Dec-2023
  • (2022) Analysis of Job Failure and Prediction Model for Cloud Computing Using Machine Learning. Sensors, 22:5 (2035). DOI: 10.3390/s22052035. Online publication date: 5-Mar-2022
