Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3639477.3639753acmconferencesArticle/Chapter ViewAbstractPublication PagesicseConference Proceedingsconference-collections
research-article
Open access

Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach

Published: 31 May 2024 Publication History

Abstract

Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can lead to delay in incident detection and significant negative customer impact. Current process of monitor creation is ad-hoc and reactive in nature. Developers create monitors using their tribal knowledge and, primarily, a trial and error based process. As a result, monitors often have incomplete coverage which leads to production issues, or, redundancy which results in noise and wasted effort.
In this work, we address this issue by proposing an intelligent monitoring framework that recommends monitors for cloud services based on their service properties. We start by mining the attributes of 30,000+ monitors from 791 production services at Microsoft and derive a structured ontology for monitors. We focus on two crucial dimensions: what to monitor (resources) and which metrics to monitor. We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology. Using these insights, we propose a deep learning based framework that recommends monitors based on the service properties. Finally, we conduct a user study with engineers from Microsoft which demonstrates the usefulness of the proposed framework. The proposed framework along with the ontology driven projections, succeeded in creating production quality recommendations for majority of resource classes. This was also validated by the users from the study who rated the framework's usefulness as 4.27 out of 5.

References

[1]
Giuseppe Aceto, Alessio Botta, Walter De Donato, and Antonio Pescapè. 2013. Cloud monitoring: A survey. Computer Networks 57, 9 (2013), 2093--2115.
[2]
Betsy Beyer, Niall Richard Murphy, David K Rensin, Kent Kawahara, and Stephen Thorne. 2018. The site reliability workbook: practical ways to implement SRE. " O'Reilly Media, Inc.".
[3]
Ayush Bhardwaj, Zhenyu Zhou, and Theophilus A Benson. 2021. A Comprehensive Study of Bugs in Software Defined Networks. In 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 101--115.
[4]
Arnamoy Bhattacharyya, Seyed Ali Jokar Jandaghi, Stelios Sotiriadis, and Cristiana Amza. 2016. Semantic aware online detection of resource anomalies on the cloud. In 2016 IEEE international conference on cloud computing technology and science (CloudCom). IEEE, 134--143.
[5]
Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang, Hui Xu, Yangfan Zhou, Li Yang, Jeffrey Sun, Zhangwei Xu, et al. 2020. Towards intelligent incident management: why we need it and how we make it. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1487--1497.
[6]
Jianru Ding. 2020. Characterizing Service Level Objectives for Cloud Services: Motivation of Short-Term Cache Allocation Performance Modeling. Ph. D. Dissertation. The Ohio State University.
[7]
Jianru Ding, Ruiqi Cao, Indrajeet Saravanan, Nathaniel Morris, and Christopher Stewart. 2019. Characterizing service level objectives for cloud services: Realities and myths. In 2019 IEEE International Conference on Autonomic Computing (ICAC). IEEE, 200--206.
[8]
Qiang Fu, Jian-Guang Lou, Qing-Wei Lin, Rui Ding, Dongmei Zhang, Zihao Ye, and Tao Xie. 2012. Performance issue diagnosis for online service systems. In 2012 IEEE 31st Symposium on Reliable Distributed Systems. IEEE, 273--278.
[9]
Yu Gao, Wensheng Dou, Feng Qin, Chushu Gao, Dong Wang, Jun Wei, Ruirui Huang, Li Zhou, and Yongming Wu. 2018. An empirical study on crash recovery bugs in large-scale distributed systems. In Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. 539--550.
[10]
Supriyo Ghosh, Manish Shetty, Chetan Bansal, and Suman Nath. 2022. How to fight production incidents? an empirical study on a large-scale cloud service. In Proceedings of the 13th Symposium on Cloud Computing. 126--141.
[11]
Google Cloud. 2020. Adopt SLOs. https://cloud.google.com/architecture/framework/reliability/adopting-slos/.
[12]
Haryadi S Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J Eliazar, Agung Laksono, Jeffrey F Lukman, Vincentius Martin, et al. 2014. What bugs live in the cloud? a study of 3000+ issues in cloud systems. In Proceedings of the ACM symposium on cloud computing. 1--14.
[13]
Haryadi S Gunawi, Mingzhe Hao, Riza O Suminto, Agung Laksono, Anang D Satria, Jeffry Adityatama, and Kurnia J Eliazar. 2016. Why does the cloud stop computing? lessons from hundreds of service outages. In Proceedings of the Seventh ACM Symposium on Cloud Computing. 1--16.
[14]
Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, and Ida Momennejad. 2023. ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning. arXiv:2309.13701 [cs.CL]
[15]
Jørgen Hilden and Paul Glasziou. 1996. Regret graphs, diagnostic uncertainty and Youden's Index. Statistics in medicine 15, 10 (1996), 969--986.
[16]
Hermann O Hirschfeld. 1935. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 31. Cambridge University Press, 520--524.
[17]
Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2015. Response time service level agreements for cloud-hosted web applications. In Proceedings of the Sixth ACM Symposium on Cloud Computing. 315--328.
[18]
Luuk Klaver, Thijs van der Knaap, Johan van der Geest, Edwin Harmsma, Bram van der Waaij, and Paolo Pileggi. 2021. Towards independent run-time cloud monitoring. In Companion of the ACM/SPEC International Conference on Performance Engineering. 21--26.
[19]
Liqun Li, Xu Zhang, Xin Zhao, Hongyu Zhang, Yu Kang, Pu Zhao, Bo Qiao, Shilin He, Pochian Lee, Jeffrey Sun, et al. 2021. Fighting the fog of war: Automated incident detection for cloud systems. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 131--146.
[20]
Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. 2018. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[21]
Yichen Li, Xu Zhang, Shilin He, Zhuangbin Chen, Yu Kang, Jinyang Liu, Liqun Li, Yingnong Dang, Feng Gao, Zhangwei Xu, et al. 2022. An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection. ACM SIGOPS Operating Systems Review 56, 1 (2022), 1--7.
[22]
Derek Lin, Rashmi Raghu, Vivek Ramamurthy, Jin Yu, Regunathan Radhakrishnan, and Joseph Fernandez. 2014. Unveiling clusters of events for alert and incident management in large-scale enterprise it. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1630--1639.
[23]
Haopeng Liu, Shan Lu, Madan Musuvathi, and Suman Nath. 2019. What bugs cause production cloud incidents?. In Proceedings of the Workshop on Hot Topics in Operating Systems. 155--162.
[24]
Jeffrey C Mogul, Rebecca Isaacs, and Brent Welch. 2017. Thinking about availability in large service infrastructures. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems. 12--17.
[25]
Jeffrey C Mogul and John Wilkes. 2019. Nines are not enough: Meaningful metrics for clouds. In Proceedings of the Workshop on Hot Topics in Operating Systems. 136--141.
[26]
Jesús Montes, Alberto Sánchez, Bunjamin Memishi, María S Pérez, and Gabriel Antoniu. 2013. GMonE: A complete approach to cloud monitoring. Future Generation Computer Systems 29, 8 (2013), 2026--2040.
[27]
Vinod Nair, Ameya Raul, Shwetabh Khanduja, Vikas Bahirwani, Qihong Shao, Sundararajan Sellamanickam, Sathiya Keerthi, Steve Herbert, and Sudheer Dhulipalla. 2015. Learning a hierarchical monitoring system for detecting and diagnosing service issues. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 2029--2038.
[28]
Stefan Nastic, Andrea Morichetta, Thomas Pusztai, Schahram Dustdar, Xiaoning Ding, Deepak Vij, and Ying Xiong. 2020. Sloc: Service level objectives for next generation cloud computing. IEEE Internet Computing 24, 3 (2020), 39--50.
[29]
Panagiotis Patros, Kenneth B Kent, and Michael Dawson. 2017. SLO request modeling, reordering and scaling. In Proceedings of the 27th annual international conference on computer science and software engineering. 180--191.
[30]
Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157--175.
[31]
Haoran Qiu, Subho S Banerjee, Saurabh Jha, Zbigniew T Kalbarczyk, and Ravishankar K Iyer. 2020. {FIRM}: An intelligent fine-grained resource management framework for {SLO-Oriented} microservices. In 14th USENIX symposium on operating systems design and implementation (OSDI 20). 805--825.
[32]
Chellammal Surianarayanan, Pethuru Raj Chelliah, Chellammal Surianarayanan, and Pethuru Raj Chelliah. 2019. Cloud Monitoring. Essentials of Cloud Computing: A Holistic Perspective (2019), 241--254.
[33]
Zibo Wang, Pinghe Li, Chieh-Jan Mike Liang, Feng Wu, and Francis Y Yan. 2022. Autothrottle: A Practical Framework for Harvesting CPUs from SLO-Targeted Microservices. arXiv preprint arXiv:2212.12180 (2022).
[34]
Nengwen Zhao, Junjie Chen, Xiao Peng, Honglin Wang, Xinya Wu, Yuanzong Zhang, Zikai Chen, Xiangzhong Zheng, Xiaohui Nie, Gang Wang, et al. 2020. Understanding and handling alert storm for online service systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice. 162--171.
[35]
Nengwen Zhao, Panshi Jin, Lixin Wang, Xiaoqin Yang, Rong Liu, Wenchi Zhang, Kaixin Sui, and Dan Pei. 2020. Automatically and adaptively identifying severe alerts for online service systems. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, 2420--2429.

Index Terms

  1. Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice
    April 2024
    480 pages
    ISBN:9798400705014
    DOI:10.1145/3639477
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    In-Cooperation

    • Faculty of Engineering of University of Porto

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 May 2024

    Check for updates

    Author Tags

    1. cloud services
    2. reliability
    3. intelligent monitoring

    Qualifiers

    • Research-article

    Conference

    ICSE-SEIP '24
    Sponsor:

    Upcoming Conference

    ICSE 2025

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 190
      Total Downloads
    • Downloads (Last 12 months)190
    • Downloads (Last 6 weeks)31
    Reflects downloads up to 21 Nov 2024

    Other Metrics

    Citations

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media