
Research Article

Run-time failure detection via non-intrusive event analysis in a large-scale cloud computing platform

Published: 01 April 2023

Abstract

Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, which hinders timely detection and recovery.
In this work, we propose an approach to run-time failure detection tailored for monitoring multi-tenant and concurrent cloud computing systems. The approach uses a non-intrusive form of event tracing, without manual changes to the system’s internals to propagate session identifiers (IDs), and builds a set of lightweight monitoring rules from fault-free executions. We evaluated the effectiveness of the approach at detecting failures in the OpenStack cloud computing platform, a complex and “off-the-shelf” distributed system, by executing a campaign of fault injection experiments in a multi-tenant scenario. Our experiments show that the approach detects failures with a higher F1 score (0.85) and accuracy (0.77) than the OpenStack failure logging mechanisms (0.53 and 0.50) and two non-session-aware run-time verification approaches (both lower than 0.15). Moreover, the approach significantly decreases the average time to detect failures at run-time (∼114 seconds) compared to the OpenStack logging mechanisms.
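
To make the idea concrete, below is a minimal sketch, not the paper’s actual implementation, of the learn-then-check pattern the abstract describes: event transitions observed during fault-free executions become the monitoring rules, and run-time traces containing transitions never seen in training are flagged as suspected failures. The bigram-based rule format and all names (Event, learn_rules, check) are illustrative assumptions, not the paper’s rule language.

```python
# Sketch: derive lightweight monitoring rules from fault-free event
# traces, then flag run-time traces that deviate from them.
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, str]  # (component, operation), e.g. ("nova-api", "create")

def learn_rules(fault_free_traces: Iterable[List[Event]]) -> Dict[Event, Set[Event]]:
    """Collect, per event, the successor events observed in fault-free runs."""
    allowed: Dict[Event, Set[Event]] = defaultdict(set)
    for trace in fault_free_traces:
        for prev, curr in zip(trace, trace[1:]):
            allowed[prev].add(curr)
    return allowed

def check(trace: List[Event], allowed: Dict[Event, Set[Event]]) -> List[Tuple[Event, Event]]:
    """Return the transitions in a run-time trace never seen during training."""
    return [(prev, curr) for prev, curr in zip(trace, trace[1:])
            if curr not in allowed.get(prev, set())]

# Usage: an unseen transition is reported as a suspected failure,
# without requiring session IDs to be propagated in the events.
rules = learn_rules([[("nova-api", "create"), ("scheduler", "pick"), ("compute", "boot")]])
print(check([("nova-api", "create"), ("compute", "boot")], rules))
# [(('nova-api', 'create'), ('compute', 'boot'))]
```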

Highlights

The approach performs run-time verification without using session IDs.
The approach improves the failure detection of the OpenStack cloud computing system.
The approach can be used in combination with the system failure logging mechanisms.
The approach decreases the average time to detect failures at run-time.
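
For readers interpreting the detection figures quoted in the abstract, the following recalls how the F1 score and accuracy are conventionally computed from true/false positives and negatives. The counts below are placeholders chosen only to reproduce the quoted values (0.85 and 0.77); they are not the paper’s experimental data.

```python
# Standard definitions of the two metrics quoted in the abstract.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

print(f1_score(tp=85, fp=15, fn=15))         # 0.85
print(accuracy(tp=85, tn=69, fp=15, fn=31))  # 0.77
```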


Published In

Journal of Systems and Software, Volume 198, Issue C (April 2023), 492 pages

Publisher

Elsevier Science Inc., United States

    Author Tags

    1. Run-time verification
    2. Failure detection
    3. Cloud computing
    4. OpenStack
    5. Fault injection

