DOI: 10.1145/3236024.3236030

An empirical study on crash recovery bugs in large-scale distributed systems

Published: 26 October 2018

Abstract

In large-scale distributed systems, node crashes are inevitable and can happen at any time. As such, distributed systems are usually designed to be resilient to node crashes through various crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoff in Cassandra. However, faults in crash recovery mechanisms and their implementations can introduce intricate crash recovery bugs and lead to severe consequences.
In this paper, we present CREB, the most comprehensive study on 103 Crash REcovery Bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For all the studied bugs, we analyze their root causes, triggering conditions, impacts, and fixes. Through this study, we obtain many interesting findings that can open up new research directions for combating crash recovery bugs.

Published In

ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
October 2018
987 pages
ISBN:9781450355735
DOI:10.1145/3236024
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed systems
  2. crash recovery bugs
  3. empirical study

Qualifiers

  • Research-article

Conference

ESEC/FSE '18
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Article Metrics

  • Downloads (Last 12 months): 91
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 25 Nov 2024

Cited By

  • (2024) X-Lifecycle Learning for Cloud Incident Management using LLMs. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 417-428. DOI: 10.1145/3663529.3663861. Online publication date: 10-Jul-2024
  • (2024) Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 266-277. DOI: 10.1145/3663529.3663846. Online publication date: 10-Jul-2024
  • (2024) State Reconciliation Defects in Infrastructure as Code. Proceedings of the ACM on Software Engineering 1:FSE, 1865-1888. DOI: 10.1145/3660790. Online publication date: 12-Jul-2024
  • (2024) An Empirical Study on Kubernetes Operator Bugs. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1746-1758. DOI: 10.1145/3650212.3680396. Online publication date: 11-Sep-2024
  • (2024) Bugs in Pods: Understanding Bugs in Container Runtime Systems. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1364-1376. DOI: 10.1145/3650212.3680366. Online publication date: 11-Sep-2024
  • (2024) FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 129-133. DOI: 10.1145/3639478.3640036. Online publication date: 14-Apr-2024
  • (2024) Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, 381-391. DOI: 10.1145/3639477.3639753. Online publication date: 14-Apr-2024
  • (2024) Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. Proceedings of the Nineteenth European Conference on Computer Systems, 674-688. DOI: 10.1145/3627703.3629553. Online publication date: 22-Apr-2024
  • (2024) Understanding Transaction Bugs in Database Systems. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-13. DOI: 10.1145/3597503.3639207. Online publication date: 20-May-2024
  • (2024) What's Wrong With Low-Code Development Platforms? An Empirical Study of Low-Code Development Platform Bugs. IEEE Transactions on Reliability 73:1, 695-709. DOI: 10.1109/TR.2023.3295009. Online publication date: Mar-2024
