SETSUDŌ: perturbation-based testing framework for scalable distributed systems

Authors: Gogul Balakrishnan, Nadia Papakonstantinou

TRIOS '13: Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems

Article No.: 7, Pages 1 - 14

https://doi.org/10.1145/2524211.2524217

Published: 03 November 2013

Abstract

Modern scalable distributed systems are designed to be partition-tolerant. They are often required to support increasing load in service requests elastically, and to provide seamless services even when some servers malfunction. Partition-tolerance enables such systems to withstand arbitrary loss of messages as "perceived" by the communicating nodes. However, partition-tolerance and robustness are not tested rigorously in practice. Often severe system-level design defects stay hidden even after deployment, possibly resulting in loss of revenue or customer satisfaction.

We propose a novel, rigorous perturbation-based testing framework, named SETSUDŌ, targeted at exposing system-level defects in scalable distributed systems. It applies perturbations (i.e., controlled changes) from the environment of a system during testing, and leverages awareness of system-internal states to precisely control their timing. It uses a flexible instrumentation framework to select relevant internal states and to implement the system code for perturbations. It also provides a test policy language framework, in which high-level sequences of perturbation scenarios are converted automatically to system-level test code. This test code is woven automatically into the application code during testing, and any observed defects are reported. We have implemented our perturbation testing framework and evaluated it on several open source projects, where it exposed known defects as well as some previously unknown ones. Our framework leverages small-scale testing, and avoids the upfront infrastructure costs typically needed for large-scale stress testing.
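The idea of timing a perturbation by an observed internal state can be illustrated with a small sketch. It is not SETSUDŌ's implementation or policy language: `ToyNode`, `Perturbation`, `PerturbationHarness`, and the state names are all hypothetical, and the hook that real instrumentation would weave into application code is called by hand here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a server in the system under test."""
    name: str
    alive: bool = True
    state: str = "idle"  # internal state the tester can observe


@dataclass
class Perturbation:
    """A controlled environment change, fired once when `trigger` first holds."""
    label: str
    trigger: Callable[[ToyNode], bool]  # predicate over internal state
    action: Callable[[ToyNode], None]   # e.g. crash the node
    fired: bool = False


@dataclass
class PerturbationHarness:
    """Hypothetical harness: checks triggers at every state transition."""
    perturbations: List[Perturbation]
    log: List[str] = field(default_factory=list)

    def on_state_change(self, node: ToyNode, new_state: str) -> None:
        # In a real setup, instrumentation would call this hook;
        # in this sketch the scenario below calls it directly.
        node.state = new_state
        for p in self.perturbations:
            if not p.fired and p.trigger(node):
                p.fired = True
                p.action(node)
                self.log.append(f"{p.label}@{node.name}:{new_state}")


def crash(node: ToyNode) -> None:
    """Simulated perturbation: mark the node as crashed."""
    node.alive = False


def run_scenario() -> PerturbationHarness:
    """Crash the leader exactly when it enters the (assumed) 'syncing' state."""
    leader = ToyNode("n1")
    harness = PerturbationHarness([
        Perturbation("crash-leader",
                     trigger=lambda n: n.state == "syncing",
                     action=crash),
    ])
    for state in ["electing", "syncing", "serving"]:
        if not leader.alive:
            break  # a crashed node makes no further transitions
        harness.on_state_change(leader, state)
    return harness
```

Calling `run_scenario()` returns a harness whose `log` records which perturbation fired and at which state. Tying the trigger to a state predicate, rather than to a wall-clock delay, is what makes the timing repeatable from run to run in this toy setting.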


Published In

TRIOS '13: Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems

November 2013

155 pages

ISBN: 9781450324632

DOI: 10.1145/2524211

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SOSP '13

Sponsor:

SIGOPS

SOSP '13: ACM SIGOPS 24th Symposium on Operating Systems Principles

November 3 - 6, 2013

Pennsylvania, Farmington
