Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3338906.3338916acmconferencesArticle/Chapter ViewAbstractPublication PagesfseConference Proceedingsconference-collections
research-article

How bad can a bug get? an empirical analysis of software failures in the OpenStack cloud computing platform

Published: 12 August 2019 Publication History

Abstract

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which call for more thorough run-time checks and fault containment.

References

[1]
J.H. Andrews, L.C. Briand, and Y. Labiche. 2005. Is mutation an appropriate tool for testing experiments?. In Proc. Intl. Conf. on Software Engineering. 402–411.
[2]
Jean Arlat, J-C Fabre, and Manuel Rodríguez. 2002. Dependability of COTS microkernel-based systems. IEEE Transactions on computers 51, 2 (2002), 138–163.
[3]
Eric Bauer and Randee Adams. 2012. Reliability and Availability of Cloud Computing (1st ed.). Wiley-IEEE Press.
[4]
Black Duck Software, Inc. 2018. The OpenStack Open Source Project on Open Hub. https://www.openhub.net/p/openstack
[5]
George Candea and Armando Fox. 2003. Crash-Only Software. In Workshop on Hot Topics in Operating Systems (HotOS), Vol. 3. 67–72.
[6]
Gabriella Carrozza, Domenico Cotroneo, Roberto Natella, Roberto Pietrantuono, and Stefano Russo. 2013. Analysis and prediction of mandelbugs in an industrial software system. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation. IEEE, 262–271.
[7]
Frederico Cerveira, Raul Barbosa, Henrique Madeira, and Filipe Araujo. 2015. Recovery for Virtualized Environments. In Proc. EDCC. 25–36.
[8]
Feng Chen and Grigore Roşu. 2007. MOP: An efficient and generic runtime verification framework. In Acm Sigplan Notices, Vol. 42. ACM, 569–588.
[9]
J. Christmansson and R. Chillarege. 1996. Generation of an Error Set that Emulates Software Faults based on Field Data. In Digest of Papers, Intl. Symp. on Fault-Tolerant Computing. 304–313.
[10]
Jörgen Christmansson and Ram Chillarege. 1996. Generation of an error set that emulates software faults based on field data. In Fault Tolerant Computing, 1996., Proceedings of Annual Symposium on. IEEE, 304–313.
[11]
Domenico Cotroneo, Roberto Pietrantuono, and Stefano Russo. 2013. Combining operational and debug testing for improving reliability. IEEE Transactions on Reliability 62, 2 (2013), 408–423.
[12]
M. Daran and P. Thévenod-Fosse. 1996. Software Error Analysis: A Real Case Study Involving Real Faults and Mutations. ACM Soft. Eng. Notes 21, 3 (1996), 158–171.
[13]
Nelly Delgado, Ann Q Gates, and Steve Roach. 2004. A taxonomy and catalog of runtime software-fault monitoring tools. IEEE Transactions on software Engineering 30, 12 (2004), 859–872.
[14]
James Denton. 2015. Learning OpenStack Networking. Packt Publishing Ltd.
[15]
Brendan Dolan-Gavitt, Patrick Hulin, Engin Kirda, Tim Leek, Andrea Mambretti, Wil Robertson, Frederick Ulrich, and Ryan Whelan. 2016. Lava: Large-scale automated vulnerability addition. In 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 110–121.
[16]
Joao A Duraes and Henrique S Madeira. 2006. Emulation of Software Faults: A Field Data Study and a Practical Approach. IEEE Transactions on Software Engineering 32, 11 (2006), 849.
[17]
Mostafa Farshchi, Jean-Guy Schneider, Ingo Weber, and John Grundy. 2018. Metric selection and anomaly detection for cloud operations using log and metric correlation analysis. Journal of Systems and Software 137 (2018), 531–549.
[18]
Vincenzo De Florio and Chris Blondia. 2008. A survey of linguistic structures for application-level fault tolerance. ACM Computing Surveys (CSUR) 40, 2 (2008), 6.
[19]
Min Fu, Liming Zhu, Ingo Weber, Len Bass, Anna Liu, and Xiwei Xu. 2016. Process-Oriented Non-intrusive Recovery for Sporadic Operations on Cloud. In Dependable Systems and Networks (DSN), 2016 46th Annual IEEE/IFIP International Conference on. IEEE, 85–96.
[20]
Cristiano Giuffrida, Anton Kuijsten, and Andrew S Tanenbaum. 2013. EDFI: A dependable fault injection tool for dependability benchmarking experiments. In 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE, 31–40.
[21]
Jim Gray. 1986. Why do computers stop and what can be done about it?. In Symposium on Reliability in Distributed Software and Database Systems. 3–12.
[22]
Michael Grottke and Kishor S Trivedi. 2007. Fighting bugs: Remove, retry, replicate, and rejuvenate. IEEE Computer 40, 2 (2007).
[23]
Haryadi S Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M Hellerstein, Andrea C Arpaci-Dusseau, Remzi H Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. 2011. FATE and DESTINI: A Framework for Cloud Recovery Testing. In Proc. NSDI.
[24]
Haryadi S Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patanaanake,ThanhDo,JeffryAdityatama,KurniaJEliazar,AgungLaksono,JeffreyFLukman, Vincentius Martin, et al. 2014. What bugs live in the cloud? A study of 3000+ issues in cloud systems. In Proceedings of the ACM Symposium on Cloud Computing.
[25]
Haryadi S Gunawi, Agung Laksono, Riza O Suminto, Mingzhe Hao, et al. 2016. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. In Proc. SoCC.
[26]
Jorrit N Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S Tanenbaum. 2009. Fault isolation for device drivers. In IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). IEEE, 33–42.
[27]
Yue Jia and Mark Harman. 2010. An analysis and survey of the development of mutation testing. IEEE transactions on software engineering 37, 5 (2010), 649–678.
[28]
Yue Jia and Mark Harman. 2011. An Analysis and Survey of the Development of Mutation Testing. 37, 5 (2011), 649–678.
[29]
Pallavi Joshi, Haryadi S. Gunawi, and Koushik Sen. 2011. PREFAIL: A Programmable Tool for Multiple-failure Injection. In Proc. OOPSLA.
[30]
Xiaoen Ju, Livio Soares, Kang G Shin, Kyung Dong Ryu, and Dilma Da Silva. 2013. On fault resilience of OpenStack. In Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC). ACM, 2.
[31]
Xiaoen Ju, Livio Soares, Kang G. Shin, Kyung Dong Ryu, and Dilma Da Silva. 2013. On fault resilience of OpenStack. In Proc. SoCC.
[32]
René Just, Darioush Jalali, Laura Inozemtseva, Michael D Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing?. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 654–665.
[33]
Anna Lanzaro, Roberto Natella, Stefan Winter, Domenico Cotroneo, and Neeraj Suri. 2014. An empirical study of injected versus actual interface errors. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, 397–408.
[34]
Inhwan Lee and Ravishankar K Iyer. 1993. Faults, symptoms, and software fault tolerance in the tandem guardian90 operating system. In Fault-Tolerant Computing, 1993. FTCS-23. Digest of Papers., The Twenty-Third International Symposium on. IEEE, 20–29.
[35]
Heng Li, Weiyi Shang, and Ahmed E Hassan. 2017. Which log level should developers choose for a new logging statement? Empirical Software Engineering 22, 4 (2017), 1684–1716.
[36]
Zhongwei Li, Qinghua Lu, Liming Zhu, Xiwei Xu, Yue Liu, and Weishan Zhang. 2018. An Empirical Study of Cloud API Issues. IEEE Cloud Computing 5, 2 (2018), 58–72.
[37]
libvirt. 2018. libvirt Home Page. https://www.libvirt.org/
[38]
Michael R Lyu. 2007. Software reliability engineering: A roadmap. In Future of Software Engineering, 2007. FOSE’07. IEEE, 153–170.
[39]
Andrey Markelov. 2016. How to Build Your Own Virtual Test Environment. Apress.
[40]
Matias Martinez, Laurence Duchien, and Martin Monperrus. 2013. Automatically extracting instances of code change patterns with ast analysis. In 2013 IEEE international conference on software maintenance. IEEE, 388–391.
[41]
Steve McConnell. 2004. Code complete. Pearson Education.
[42]
Microsoft Corp. 2017. Bulkhead pattern. https://docs.microsoft.com/enus/azure/architecture/patterns/bulkhead
[43]
Microsoft Corp. 2017. Circuit Breaker pattern. https://docs.microsoft.com/enus/azure/architecture/patterns/circuit-breaker
[44]
Pooya Musavi, Bram Adams, and Foutse Khomh. 2016. Experience Report: An Empirical Study of API Failures in OpenStack Cloud Environments. In Software Reliability Engineering (ISSRE), 2016 IEEE 27th International Symposium on. IEEE, 424–434.
[45]
Roberto Natella, Domenico Cotroneo, and Henrique S Madeira. 2016. Assessing dependability with software fault injection: A survey. ACM Computing Surveys (CSUR) 48, 3 (2016), 44.
[46]
Netflix. 2017. The Chaos Monkey. https://github.com/Netflix/SimianArmy/wiki/ Chaos-Monkey
[47]
Netflix Inc. 2017. Hystrix Wiki - How It Works. https://github.com/Netflix/ Hystrix/wiki/How-it-Works
[48]
W.T. Ng and P.M. Chen. 2001. The Design and Verification of the Rio File Cache. IEEE Trans. on Computers 50, 4 (2001), 322–337.
[49]
OpenStack. 2018. OpenStack. http://www.openstack.org/
[50]
OpenStack. 2018. OpenStack issue tracker. https://bugs.launchpad.net/openstack
[51]
OpenStack. 2018. Tempest Testing Project. https://docs.openstack.org/tempest
[52]
OpenStack. 2018. Virtual Machine States and Transitions. https: //docs.openstack.org/nova/latest/reference/vm-states.html
[53]
OpenStack project. 2018. OpenStack in Launchpad. https://launchpad.net/ openstack
[54]
OpenStack project. 2018. The OpenStack Marketplace. https: //www.openstack.org/marketplace/
[55]
OpenStack project. 2018. Stackalytics. https://www.stackalytics.com
[56]
OpenStack project. 2018. User Stories Showing How The World #RunsOnOpen-Stack. https://www.openstack.org/user-stories/
[57]
David Oppenheimer, Archana Ganapathi, and David A Patterson. 2003. Why do Internet services fail, and what can be done about it?. In USENIX symposium on internet technologies and systems, Vol. 67. Seattle, WA.
[58]
Matteo Orrú, Ewan D Tempero, Michele Marchesi, Roberto Tonelli, and Giuseppe Destefanis. 2015. A Curated Benchmark Collection of Python Systems for Empirical Studies on Software Engineering. In PROMISE. 2–1.
[59]
Kai Pan, Sunghun Kim, and E James Whitehead. 2009. Toward an understanding of bug fix patterns. Empirical Software Engineering 14, 3 (2009), 286–315.
[60]
Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation testing advances: An analysis and survey. In Advances in Computers. Vol. 112. Elsevier, 275–378.
[61]
Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. 2018. Are mutation scores correlated with real fault detection? A large scale empirical study on the relationship between mutants and real faults. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 537–548.
[62]
Cuong Pham, Daniel Chen, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. 2011. CloudVal: A framework for validation of virtualization environment in cloud infrastructure. In Proc. DSN.
[63]
Cuong Pham, Long Wang, Byung-Chul Tak, Salman Baset, Chunqiang Tang, Zbigniew T Kalbarczyk, and Ravishankar K Iyer. 2017. Failure Diagnosis for Distributed Systems Using Targeted Fault Injection. IEEE Trans. Parallel Distrib. Syst. 28, 2 (2017), 503–516.
[64]
Rick Rabiser, Sam Guinea, Michael Vierhauser, Luciano Baresi, and Paul Grünbacher. 2017. A comparison framework for runtime monitoring approaches. Journal of Systems and Software 125 (2017), 309–321. ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, and Nematollah Bidokhti
[65]
RDO. 2018. Packstack. https://www.rdoproject.org/install/packstack/
[66]
Red Hat, Inc. 2018. Evaluating OpenStack: Single-Node Deployment. https://access.redhat.com/articles/1127153
[67]
Gema Rodríguez-Pérez, Andy Zaidman, Alexander Serebrenik, Gregorio Robles, and Jesús M González-Barahona. 2018. What if a bug has a different origin?: making sense of bugs without an explicit bug introducing change. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 52.
[68]
Subhajit Roy, Awanish Pandey, Brendan Dolan-Gavitt, and Yu Hu. 2018. Bug synthesis: challenging bug-finding tools with deep faults. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 224–234.
[69]
Suhrid Satyal, Ingo Weber, Len Bass, and Min Fu. 2017. Rollback Mechanisms for Cloud Management APIs using AI planning. IEEE Transactions on Dependable and Secure Computing (2017).
[70]
M. Solberg. 2017. OpenStack for Architects. Packt Publishing.
[71]
Michael M Swift, Muthukaruppan Annamalai, Brian N Bershad, and Henry M Levy. 2006. Recovering device drivers. ACM Transactions on Computer Systems (TOCS) 24, 4 (2006), 333–360.
[72]
Andrew S Tanenbaum, Jorrit N Herder, and Herbert Bos. 2006. Can we make operating systems reliable and secure? Computer 39, 5 (2006), 44–51.
[73]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2018. Learning How to Mutate Source Code from Bug-Fixes. arXiv preprint arXiv:1812.10772 (2018).
[74]
J.M. Voas, F. Charron, G. McGraw, K. Miller, and M. Friedman. 1997. Predicting How Badly "Good" Software Can Behave. IEEE Software 14, 4 (1997), 73–83.
[75]
Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu, and Len Bass. 2012. Automatic Undo for Cloud Management via AI Planning. In HotDep.
[76]
Ding Yuan, Soyeon Park, Peng Huang, Yang Liu, Michael Mihn-Jong Lee, Xiaoming Tang, Yuanyuan Zhou, and Stefan Savage. 2012. Be Conservative: Enhancing Failure Diagnosis with Proactive Logging. In OSDI, Vol. 12. 293–306.
[77]
Ding Yuan, Jing Zheng, Soyeon Park, Yuanyuan Zhou, and Stefan Savage. 2012. Improving software diagnosability via log enhancement. ACM Transactions on Computer Systems (TOCS) 30, 1 (2012), 4.
[78]
Hao Zhong and Na Meng. 2018. Towards reusing hints from past fixes. Empirical Software Engineering 23, 5 (2018), 2521–2549.
[79]
Jingwen Zhou, Zhenbang Chen, Ji Wang, Zibin Zheng, and Wei Dong. 2014. A runtime verification based trace-oriented monitoring framework for cloud systems. In Software Reliability Engineering Workshops (ISSREW), 2014 IEEE International Symposium on. IEEE, 152–155.

Cited By

View all
  • (2024)EEG as a potential ground truth for the assessment of cognitive state in software development activities: A multimodal imaging studyPLOS ONE10.1371/journal.pone.029910819:3(e0299108)Online publication date: 7-Mar-2024
  • (2024)State Reconciliation Defects in Infrastructure as CodeProceedings of the ACM on Software Engineering10.1145/36607901:FSE(1865-1888)Online publication date: 12-Jul-2024
  • (2024)ChangeRCA: Finding Root Causes from Software Changes in Large Online SystemsProceedings of the ACM on Software Engineering10.1145/36437281:FSE(24-46)Online publication date: 12-Jul-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2019
1264 pages
ISBN:9781450355728
DOI:10.1145/3338906
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 August 2019

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. Bug analysis
  2. Fault injection
  3. OpenStack

Qualifiers

  • Research-article

Conference

ESEC/FSE '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)147
  • Downloads (Last 6 weeks)14
Reflects downloads up to 23 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)EEG as a potential ground truth for the assessment of cognitive state in software development activities: A multimodal imaging studyPLOS ONE10.1371/journal.pone.029910819:3(e0299108)Online publication date: 7-Mar-2024
  • (2024)State Reconciliation Defects in Infrastructure as CodeProceedings of the ACM on Software Engineering10.1145/36607901:FSE(1865-1888)Online publication date: 12-Jul-2024
  • (2024)ChangeRCA: Finding Root Causes from Software Changes in Large Online SystemsProceedings of the ACM on Software Engineering10.1145/36437281:FSE(24-46)Online publication date: 12-Jul-2024
  • (2024)MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice ApplicationsIEEE Transactions on Dependable and Secure Computing10.1109/TDSC.2024.3363902(1-18)Online publication date: 2024
  • (2024)Resilient VirtualizationComputer10.1109/MC.2023.330661757:2(70-78)Online publication date: 31-Jan-2024
  • (2024)Mutiny! How Does Kubernetes Fail, and What Can We Do About It?2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)10.1109/DSN58291.2024.00016(1-14)Online publication date: 24-Jun-2024
  • (2024)Neural Fault Injection: Generating Software Faults from Natural Language2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S)10.1109/DSN-S60304.2024.00016(23-27)Online publication date: 24-Jun-2024
  • (2024)PMTT: Parallel multi-scale temporal convolution network and transformer for predicting the time to aging failure of software systemsJournal of Systems and Software10.1016/j.jss.2024.112167217(112167)Online publication date: Nov-2024
  • (2024)TTAFPred: Prediction of time to aging failure for software systems based on a two-stream multi-scale features fusion networkSoftware Quality Journal10.1007/s11219-024-09692-232:4(1481-1513)Online publication date: 22-Jul-2024
  • (2024)Systematic Evaluation of Deep Learning Models for Log-based Failure PredictionEmpirical Software Engineering10.1007/s10664-024-10501-429:5Online publication date: 20-Jun-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media