DOI: 10.1145/3472883.3487005
research-article
Open access

Service-Level Fault Injection Testing

Published: 01 November 2021

Abstract

Companies today increasingly rely on microservice architectures to deliver service for their large-scale mobile or web applications. However, not all developers working on these applications are distributed systems engineers, and so may not anticipate partial failure, in which one or more of the dependencies of their service are unavailable once deployed into production. It is therefore paramount that these issues be raised early and often, ideally in a testing environment or before the code ships to production.
In this paper, we present an approach called service-level fault injection testing and a prototype implementation, Filibuster, that can be used to systematically identify resilience issues early in the development of microservice applications. Filibuster combines static analysis and concolic-style execution with a novel dynamic reduction algorithm to extend existing functional test suites to cover failure scenarios with minimal developer effort. To demonstrate the applicability of our tool, we present a corpus of 4 real-world industrial microservice applications containing bugs. These applications and bugs are taken from publicly available information about chaos engineering experiments run by large companies in production. We then demonstrate how all of these chaos experiments could have been run during development instead, and how the bugs they discovered could have been detected long before they reached production.
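
To make the idea of extending a functional test to cover failure scenarios concrete, the following is a minimal Python sketch. It is not Filibuster's API: `FaultInjector`, `checkout`, `explore`, and the service names `pricing` and `coupons` are all hypothetical, and the sketch only injects one fault at a time. The static analysis, concolic-style search over fault combinations, and dynamic reduction described in the abstract are deliberately left out.

```python
"""Toy sketch of service-level fault injection (not Filibuster's actual API)."""


class ServiceUnavailable(Exception):
    """Simulated failure of a remote dependency."""


class FaultInjector:
    """Stands in for instrumentation around a service's outgoing calls."""

    def __init__(self, fail=()):
        self.fail = set(fail)   # names of services whose calls should fail
        self.observed = []      # services contacted during this run

    def call(self, service, handler):
        self.observed.append(service)
        if service in self.fail:
            raise ServiceUnavailable(service)
        return handler()


def checkout(injector):
    """Hypothetical service under test with two dependencies."""
    price = injector.call("pricing", lambda: 42)
    try:
        discount = injector.call("coupons", lambda: 2)
    except ServiceUnavailable:
        discount = 0  # graceful degradation: coupons are optional
    return price - discount


def functional_test(injector):
    """An existing functional test, reused unchanged for the faulty runs."""
    assert checkout(injector) in (40, 42)


def explore(test):
    """Run the test fault-free, then once per observed dependency with it failing."""
    baseline = FaultInjector()
    test(baseline)  # the passing run discovers which services are called
    results = {}
    for service in dict.fromkeys(baseline.observed):  # de-duplicate, keep order
        try:
            test(FaultInjector(fail={service}))
            results[service] = "handled"
        except ServiceUnavailable:
            results[service] = "unhandled"
    return results


results = explore(functional_test)
print(results)
```

In this toy setup the `coupons` failure is tolerated because `checkout` falls back to a zero discount, while the `pricing` failure propagates out of the test, which is the kind of resilience issue the abstract says should surface before production.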

Supplementary Material

MP4 File (Day2_8-3.mp4)
Presentation video

Information & Contributors

Information

Published In

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
November 2021
685 pages
ISBN:9781450386388
DOI:10.1145/3472883
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fault injection
  2. fault tolerance
  3. verification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '21: ACM Symposium on Cloud Computing
November 1 - 4, 2021
Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1,155
  • Downloads (Last 6 weeks): 43
Reflects downloads up to 27 Jan 2025

Cited By

  • EXCHAIN. 2024. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2047-2062. DOI: 10.5555/3691825.3691937. Online publication date: 16-Apr-2024.
  • Efficient exposure of partial failure bugs in distributed systems with inferred abstract states. 2024. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 1267-1283. DOI: 10.5555/3691825.3691895. Online publication date: 16-Apr-2024.
  • Building AI Agents for Autonomous Clouds: Challenges and Design Principles. 2024. Proceedings of the 2024 ACM Symposium on Cloud Computing, 99-110. DOI: 10.1145/3698038.3698525. Online publication date: 20-Nov-2024.
  • Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection. 2024. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 46-62. DOI: 10.1145/3694715.3695979. Online publication date: 4-Nov-2024.
  • If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems. 2024. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 63-78. DOI: 10.1145/3694715.3695971. Online publication date: 4-Nov-2024.
  • MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. 2024. IEEE Transactions on Dependable and Secure Computing, 1-18. DOI: 10.1109/TDSC.2024.3363902. Online publication date: 2024.
  • Guardian of the Resiliency: Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient. 2024. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), 1-10. DOI: 10.1109/IWQoS61813.2024.10682951. Online publication date: 19-Jun-2024.
  • Informed and Assessable Observability Design Decisions in Cloud-Native Microservice Applications. 2024. 2024 IEEE 21st International Conference on Software Architecture (ICSA), 69-78. DOI: 10.1109/ICSA59870.2024.00015. Online publication date: 4-Jun-2024.
  • OXN - Automated Observability Assessments for Cloud-Native Applications. 2024. 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C), 167-170. DOI: 10.1109/ICSA-C63560.2024.00035. Online publication date: 4-Jun-2024.
  • Towards antifragility of cloud systems. 2024. Information and Software Technology 174:C. DOI: 10.1016/j.infsof.2024.107519. Online publication date: 1-Oct-2024.
