DOI: 10.1145/2371536.2371572
research-article

UBL: unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems

Published: 18 September 2012

Abstract

Infrastructure-as-a-Service (IaaS) clouds are prone to performance anomalies due to their complex nature. Although previous work has shown the effectiveness of using statistical learning to detect performance anomalies, existing schemes often assume labelled training data, which requires significant human effort and can only handle previously known anomalies. We present an Unsupervised Behavior Learning (UBL) system for IaaS cloud computing infrastructures. UBL leverages Self-Organizing Maps to capture emergent system behaviors and predict unknown anomalies. For scalability, UBL uses residual resources in the cloud infrastructure for behavior learning and anomaly prediction with little add-on cost. We have implemented a prototype of the UBL system on top of the Xen platform and conducted extensive experiments using a range of distributed systems. Our results show that UBL can predict performance anomalies with high accuracy and achieve sufficient lead time for automatic anomaly prevention. UBL supports large-scale infrastructure-wide behavior learning with negligible overhead.
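
To make the idea of learning normal behavior with a Self-Organizing Map concrete, here is a minimal sketch in Python with NumPy. It is a generic SOM, not the UBL implementation: the map size, the training schedule, the synthetic metric vectors, and the use of quantization error (distance to the best-matching neuron) as the anomaly score are all assumptions made for illustration, and the abstract does not state which decision rule UBL applies.

```python
import numpy as np


def train_som(samples, rows=4, cols=4, epochs=30, lr=0.5, sigma=1.5, seed=0):
    """Train a small SOM on normalized metric vectors (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, samples.shape[1]))
    # Grid coordinates of every neuron, used by the neighborhood function.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for epoch in range(epochs):
        # Shrink the learning rate and neighborhood radius as training proceeds.
        frac = 1.0 - epoch / epochs
        cur_lr = lr * frac
        cur_sigma = max(sigma * frac, 0.3)
        for x in rng.permutation(samples):
            # Best-matching unit: the neuron whose weight vector is closest to x.
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Pull the BMU and its grid neighbors toward x (Gaussian neighborhood).
            grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            influence = np.exp(-grid_d2 / (2 * cur_sigma ** 2))
            weights += cur_lr * influence[..., None] * (x - weights)
    return weights


def quantization_error(weights, x):
    """Distance from x to its best-matching neuron; larger means less familiar."""
    return float(np.min(np.linalg.norm(weights - x, axis=2)))


# Hypothetical data: 200 samples of three normalized metrics (e.g. CPU, memory,
# network) clustered around 0.3, standing in for unlabeled normal operation.
data_rng = np.random.default_rng(1)
normal = data_rng.normal(loc=0.3, scale=0.05, size=(200, 3)).clip(0, 1)

som = train_som(normal)
# Assumed rule: flag any sample that fits worse than every training sample did.
threshold = max(quantization_error(som, x) for x in normal)
suspect = np.array([0.95, 0.90, 0.99])
print(quantization_error(som, suspect) > threshold)
```

No labels are needed at any point: the map is fitted to whatever the system normally does, and a sample is treated as suspicious only because it lies far from everything the map has learned.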

Published In

ICAC '12: Proceedings of the 9th international conference on Autonomic computing
September 2012
222 pages
ISBN: 9781450315203
DOI: 10.1145/2371536

Sponsors

In-Cooperation

  • IEEE

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anomaly prediction
  2. cloud computing
  3. unsupervised system behavior learning

Qualifiers

  • Research-article

Conference

ICAC '12: 9th International Conference on Autonomic Computing
September 18 - 20, 2012
San Jose, California, USA

Article Metrics

  • Downloads (last 12 months): 28
  • Downloads (last 6 weeks): 3
Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems. Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 341-360. DOI: 10.1145/3698038.3698568. Online publication date: 20-Nov-2024
  • (2024) Probabilistic Temporal Fusion Transformers for Large-Scale KPI Anomaly Detection. IEEE Access 12, pp. 9123-9137. DOI: 10.1109/ACCESS.2024.3353201. Online publication date: 2024
  • (2023) Hi-RCA: A Hierarchy Anomaly Diagnosis Framework Based on Causality and Correlation Analysis. Applied Sciences 13:22 (12126). DOI: 10.3390/app132212126. Online publication date: 8-Nov-2023
  • (2023) FSFP: A Fine-Grained Online Service System Performance Fault Prediction Method Based on Cross-attention. 2023 30th Asia-Pacific Software Engineering Conference (APSEC), pp. 81-90. DOI: 10.1109/APSEC60848.2023.00018. Online publication date: 4-Dec-2023
  • (2023) Production-Run Noise Detection. Performance Analysis of Parallel Applications for HPC, pp. 199-224. DOI: 10.1007/978-981-99-4366-1_8. Online publication date: 19-Jun-2023
  • (2023) Anomaly Detection Using Machine Learning Techniques: A Systematic Review. Advances in Data-Driven Computing and Intelligent Systems, pp. 553-572. DOI: 10.1007/978-981-99-3250-4_42. Online publication date: 4-Aug-2023
  • (2022) The Vision of Self-Management in Cognitive Organic Power Distribution Systems. Energies 15:3 (881). DOI: 10.3390/en15030881. Online publication date: 26-Jan-2022
  • (2022) BatchLens: A Visualization Approach for Analyzing Batch Jobs in Cloud Systems. 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 108-111. DOI: 10.23919/DATE54114.2022.9774668. Online publication date: 14-Mar-2022
  • (2022) Vapro. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 150-162. DOI: 10.1145/3503221.3508411. Online publication date: 2-Apr-2022
  • (2022) Detecting Performance Variance for Parallel Applications Without Source Code. IEEE Transactions on Parallel and Distributed Systems 33:12, pp. 4239-4255. DOI: 10.1109/TPDS.2022.3181799. Online publication date: 1-Dec-2022
