research-article

Enhancing the analysis of software failures in cloud computing systems with deep learning

Authors: Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

Volume 181, Issue C

https://doi.org/10.1016/j.jss.2021.111043

Published: 01 November 2021

Abstract

Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems and the large volume and noisiness of failure data. This paper presents a novel approach for analyzing failure data from cloud systems, in order to relieve human analysts from manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning, which uses an autoencoder to optimize data dimensionality and inter-cluster variance. We applied the approach in the context of the OpenStack cloud computing platform, both on the raw failure data and in combination with an anomaly detection pre-processing algorithm. The results show that the performance of the proposed approach, in terms of purity of clusters, is comparable to, or in some cases even better than, manually fine-tuned clustering, thus avoiding the need for deep domain knowledge and reducing the effort needed to perform the analysis. In all cases, the proposed approach performs better than unsupervised clustering when no feature engineering is applied to the data. Moreover, the distribution of failure modes obtained with the proposed approach is closer to the actual frequency of the failure modes.
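The clustering step of DEC can be illustrated with a minimal sketch. It assumes the formulation commonly associated with the original DEC algorithm (a Student's t kernel for soft assignments and a sharpened target distribution, trained by minimizing a KL divergence); the abstract does not say which DEC variant or hyperparameters the authors use, so this should not be read as their implementation. The autoencoder is omitted, and the embedded points, centroids, and function names are hypothetical.

```python
import math

def soft_assign(embedded, centroids, alpha=1.0):
    """Soft assignment q[i][j] of embedded point i to centroid j (Student's t kernel)."""
    q = []
    for z in embedded:
        row = []
        for mu in centroids:
            dist2 = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q

def target_distribution(q):
    """Sharpened target p[i][j], proportional to q[i][j]^2 / (sum over i of q[i][j])."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all points: the clustering loss in this sketch."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )

# Hypothetical 2-D embeddings (stand-ins for autoencoder output) and two centroids.
embedded = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]

q = soft_assign(embedded, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

In a full pipeline, the loss would be back-propagated into both the centroids and the encoder weights, which is the sense in which the representation and the clusters are optimized jointly.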

Highlights

• The approach aids the development of fault-tolerance mechanisms in cloud systems.

• The approach can cluster failure modes without manual fine-tuning of features.

• The approach achieves a higher cluster purity compared to traditional clustering.

• The approach can be used in combination with anomaly detection with high accuracy.
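Cluster purity, the metric named above, can be computed as follows. This is a minimal sketch using the standard textbook definition (the fraction of samples that belong to the majority class of their cluster); the page does not give the authors' exact formula, and the failure-mode labels below are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose true label is the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one sample is required")

    # Group the true labels by the cluster each sample was assigned to.
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)

    # For each cluster, count the samples that carry its most frequent label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)

# Hypothetical example: six failed experiments grouped into three clusters.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["timeout", "timeout", "crash", "crash", "crash", "hang"]
purity = cluster_purity(clusters, labels)  # 5 of 6 samples match their cluster's majority
```

Purity rises trivially as the number of clusters grows (one cluster per sample gives 1.0), so it is usually compared across methods at a similar number of clusters.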


Published In

Journal of Systems and Software Volume 181, Issue C

Nov 2021

331 pages

ISSN:0164-1212

Elsevier Inc.

Publisher

Elsevier Science Inc.

United States
