Protocol-Aware Recovery for Consensus-Based Distributed Storage

Published: 03 October 2018

Abstract

We introduce protocol-aware recovery (Par), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of Par through the design and implementation of corruption-tolerant replication (Ctrl), a Par mechanism specific to replicated state machine (RSM) systems. We experimentally show that the Ctrl versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the Ctrl versions achieve this reliability with little performance overhead.


Published In

ACM Transactions on Storage, Volume 14, Issue 3
Special Issue on FAST 2018 and Regular Papers
August 2018
210 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3282875
Editor: Sam H. Noh

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 October 2018
Accepted: 01 July 2018
Received: 01 May 2018
Published in TOS Volume 14, Issue 3


Author Tags

  1. Storage faults
  2. consensus
  3. data corruption
  4. fault tolerance

Qualifiers

  • Research-article
  • Research
  • Refereed

