Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach

Authors:

Pooja Srinivas,

Anjaly Parayil,

Saravan Rajmohan

ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice

Pages 381 - 391

https://doi.org/10.1145/3639477.3639753

Published: 31 May 2024

Abstract

Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can lead to delays in incident detection and significant negative customer impact. The current process of monitor creation is ad hoc and reactive. Developers create monitors using their tribal knowledge and, primarily, a trial-and-error process. As a result, monitors often have incomplete coverage, which leads to production issues, or redundancy, which results in noise and wasted effort.

In this work, we address this issue by proposing an intelligent monitoring framework that recommends monitors for cloud services based on their service properties. We start by mining the attributes of 30,000+ monitors from 791 production services at Microsoft and derive a structured ontology for monitors. We focus on two crucial dimensions: what to monitor (resources) and which metrics to monitor. We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology. Using these insights, we propose a deep learning-based framework that recommends monitors based on the service properties. Finally, we conduct a user study with engineers from Microsoft, which demonstrates the usefulness of the proposed framework. The proposed framework, along with the ontology-driven projections, succeeded in creating production-quality recommendations for the majority of resource classes. This was also validated by the users from the study, who rated the framework's usefulness as 4.27 out of 5.
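To make the two ontology dimensions concrete, the following is a minimal Python sketch, not the authors' implementation. It represents each monitor as a (resource, metric) pair and ranks candidate monitors for a service by how often they co-occur with that service's properties. The simple co-occurrence count stands in for the deep learning model described above, and every class name, property name, and data point here is a hypothetical placeholder rather than something taken from the paper.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    """One monitor described along the two ontology dimensions."""
    resource: str  # what to monitor (hypothetical class name)
    metric: str    # which metric to monitor (hypothetical class name)


def fit_counts(services):
    """Count how often each monitor appears alongside each service property.

    `services` is a list of (properties, monitors) pairs, where `properties`
    is a set of strings and `monitors` is a set of Monitor objects.
    """
    counts = defaultdict(Counter)
    for properties, monitors in services:
        for prop in properties:
            counts[prop].update(monitors)
    return counts


def recommend(counts, properties, top_k=3):
    """Rank monitors by total co-occurrence with the given properties."""
    scores = Counter()
    for prop in properties:
        scores.update(counts.get(prop, Counter()))
    # Sort by descending score, then by name so that ties are deterministic.
    ranked = sorted(
        scores.items(),
        key=lambda kv: (-kv[1], kv[0].resource, kv[0].metric),
    )
    return [monitor for monitor, _ in ranked[:top_k]]


# Toy, made-up data: (service properties, monitors already in use).
TOY_SERVICES = [
    ({"uses_storage", "has_api"},
     {Monitor("storage", "latency"), Monitor("api", "error_rate")}),
    ({"uses_storage"},
     {Monitor("storage", "latency"), Monitor("storage", "capacity")}),
    ({"has_api"},
     {Monitor("api", "error_rate"), Monitor("api", "latency")}),
]

counts = fit_counts(TOY_SERVICES)
suggestions = recommend(counts, {"uses_storage"}, top_k=2)
```

A frequency baseline like this only illustrates the input and output shape of the recommendation problem (service properties in, ranked resource and metric pairs out). The abstract does not describe the learned model's architecture, so none is assumed here.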

Index Terms

Information systems > Information systems applications > Decision support systems > Expert systems

Published In

ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice

April 2024

480 pages

ISBN:9798400705014

DOI:10.1145/3639477

Co-chairs: Ana Paiva; Rui Abreu; Maurício Aniche (Delft University of Technology, Netherlands); Nachiappan Nagappan (Meta, USA)

Program Co-chairs: Abhik Roychoudhury; Margaret Storey

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICSE-SEIP '24

Sponsor:

SIGSOFT

ICSE-SEIP '24: 46th International Conference on Software Engineering: Software Engineering in Practice

April 14 - 20, 2024

Lisbon, Portugal
