Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Public Access

Building Consistent Transactions with Inconsistent Replication

Published: 16 December 2018 Publication History

Abstract

Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google’s Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use—in part, because they require costly replication protocols, like Paxos, for fault tolerance. In this article, we present a new approach that makes transactional storage systems more affordable: We eliminate consistency from the replication protocol, while still providing distributed transactions with strong consistency to applications.
We present the Transactional Application Protocol for Inconsistent Replication (TAPIR), the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and better throughput.

References

[1]
Atul Adya, Robert Gruber, Barbara Liskov, and Umesh Maheshwari. 1995. Efficient optimistic concurrency control using loosely synchronized clocks. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’95).
[2]
Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, and Christos Karamanolis. 2007. Sinfonia: A new paradigm for building scalable distributed systems. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’07).
[3]
Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. 2014. Highly available transactions: Virtues and limitations. In Proceedings of the Conference on Very Large Databases (VLDB’14).
[4]
Jason Baker, Chris Bond, James Corbett, J. J. Furman, Andrey Khorlin, James Larson, Jean-Michel Léon, Yawei Li, Alexander Lloyd, and Vadim Yushprakh. 2011. Megastore: Providing scalable, highly available storage for interactive services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR’11).
[5]
Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. 2013. Tango: Distributed data structures over a shared log. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’13).
[6]
Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. 1995. A critique of ANSI SQL isolation levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data. ACM.
[7]
Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. 1987. Concurrency Control and Recovery in Database Systems. Addison Wesley.
[8]
Ken Birman and Thomas A. Joseph. 1987. Exploiting virtual synchrony in distributed systems. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’87).
[9]
Mike Burrows. 2006. The Chubby lock service for loosely coupled distributed systems. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’06).
[10]
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26.
[11]
Yanzhe Chen, Xinda Wei, Jiaxin Shi, Rong Chen, and Haibo Chen. 2016. Fast and general distributed transactions using RDMA and HTM. In Proceedings of the 11th ACM SIGOPS EuroSys (EuroSys’16). ACM.
[12]
Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. 2013. The scalable commutativity rule: Designing scalable software for multicore processors. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’13).
[13]
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. 2008. PNUTS: Yahoo!’s hosted data serving platform. Proceedings of the Conference on Very Large Databases (VLDB’08).
[14]
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the ACM Symposium on Cloud Computing (SOCC’10).
[15]
James C. Corbett et al. 2012. Spanner: Google’s globally distributed database. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’12).
[16]
James Cowling and Barbara Liskov. 2012. Granola: Low-overhead distributed transaction coordination. In Proceedings of the USENIX Annual Technical Conference (ATC’12).
[17]
James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. 2006. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’06).
[18]
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’07).
[19]
Akon Dey, Alan Fekete, Raghunath Nambiar, and Uwe Rohm. 2014. YCSB+T: Benchmarking web-scale transactional databases. In Proceedings of the International Conference on Data Engineering Workshops (ICDEW’14).
[20]
Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI’14). USENIX.
[21]
Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP’15). ACM.
[22]
Jiaqing Du, Sameh Elnikety, and Willy Zwaenepoel. 2013. Clock-SI: Snapshot isolation for partitioned data stores using loosely synchronized clocks. In Proceedings of the 32nd IEEE Symposium on Reliable Distributed Systems (SRDS’13). IEEE.
[23]
Robert Escriva, Bernard Wong, and Emin Gun Sirer. 2013. Warp: Multi-Key Transactions for Key-Value Stores. Technical Report. Cornell.
[24]
Michael J. Fischer, Nancy A. Lynch, and Michael S. Patterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374--382.
[25]
David K. Gifford. 1979. Weighted voting for replicated data. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’79).
[26]
Jim Gray and Leslie Lamport. 2006. Consensus on transaction commit. ACM Trans. Database Syst. 31, 1 (Mar. 2006), 133--160.
[27]
Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (ATC’10).
[28]
Flavio Junqueira, Yanhua Mao, and Keith Marzullo. 2007. Classic Paxos vs Fast Paxos: Caveat emptor. In Proceedings of the 3rd Workshop on Hot Topics in System Dependability (HotDep’07). USENIX.
[29]
David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. 1997. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proceedings of the ACM Symposium on Theory of Computing (STOC’97).
[30]
Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, and Alan Fekete. 2013. MDCC: Multi-data center consistency. In Proceedings of the ACM SIGOPS EuroSys (EuroSys’13).
[31]
Hsiang-Tsung Kung and John T. Robinson. 1981. On optimistic methods for concurrency control. ACM Trans. Database Syst. 6, 2 (June 1981), 213--226.
[32]
Rivka Ladin, Barbara Liskov, Liuba Shrira, and Sanjay Ghemawat. 1992. Providing high availability using lazy replication. ACM Trans. Comput. Syst. 10, 4 (Nov. 1992), 360--391.
[33]
Avinash Lakshman and Prashant Malik. 2010. Cassandra: A decentralized structured storage system. ACM SIGOPS Operat. Syst. Rev. 44, 2 (Apr. 2010), 35--40.
[34]
Leslie Lamport. 1994. ACM Trans. Prog. Lang. Syst. 16, 3 (May 1994), 872--923.
[35]
Leslie Lamport. 2001. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 51--58.
[36]
Leslie Lamport. 2005. Generalized Consensus and Paxos. Technical Report 2005-33. Microsoft Research.
[37]
Leslie Lamport. 2006a. Fast Paxos. Distrib. Comput. 19, 2 (2006).
[38]
Leslie Lamport. 2006b. Lower bounds for asynchronous consensus. Distrib. Comput. 19, 2 (Oct. 2006), 104--125.
[39]
Costin Leau. 2013. Spring Data Redis--Retwis-J. Retrieved from http://docs.spring.io/spring-data/data-keyvalue/examples/retwisj/current/.
[40]
Jialin Li, Ellis Michael, and Dan R. K. Ports. 2017. Eris: Coordination-free consistent transactions using network multi-sequencing. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP’17). ACM.
[41]
Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K. Ports. 2016. Just say no to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX.
[42]
Barbara Liskov, Miguel Castro, Liuba Shrira, and Atul Adya. 1999. Providing persistent objects in distributed systems. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP’99).
[43]
Barbara Liskov and James Cowling. 2012. Viewstamped replication revisited. Technical report MIT-CSAIL-TR-2012-021. MIT.
[44]
Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. 2011. Don’t settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’11).
[45]
Hatem Mahmoud, Faisal Nawab, Alexander Pucher, Divyakant Agrawal, and Amr El Abbadi. 2013. Low-latency multi-datacenter databases using replicated commit. Proceedings of the Conference on Very Large Databases (VLDB’13).
[46]
Dahlia Malkhi and Michael Reiter. 1998. Byzantine quorum systems. Distrib. Comput. 11 (1998), 203--213.
[47]
MongoDB. 2013. MongoDB: A open-source document database. Retrieved from http://www.mongodb.org/.
[48]
Iulian Moraru, David G. Andersen, and Michael Kaminsky. 2013. There is more consensus in Egalitarian parliaments. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’13).
[49]
Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, and Jinyang Li. 2014. Extracting more concurrency from distributed transactions. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’14).
[50]
Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li. 2016. Consolidating concurrency control and consensus for commits under conflicts. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX.
[51]
Brian M. Oki and Barbara H. Liskov. 1988. Viewstamped replication: A new primary copy method to support highly available distributed systems. In Proceedings of the ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC’88).
[52]
Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy. 2015. Designing distributed systems using approximate synchrony in data center networks. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI’15).
[53]
Redis. 2013. Redis: Open Source Data Structure Server. Retrieved from http://redis.io/.
[54]
Yasushi Saito and Marc Shapiro. 2005. Optimistic replication. Comput. Surveys 37, 1 (Mar. 2005), 42--81.
[55]
Salvatore Sanfilippo. 2013. WAIT: Synchronous replication for Redis. Retrieved from http://antirez.com/news/66.
[56]
Yee Jiun Song and Robbert van Renesse. 2008. Bosco: One-step Byzantine asynchronous consensus. In Proceedings of the International Symposium on Distributed Computing (DISC’08).
[57]
Yair Sovran, Russell Power, Marcos K. Aguilera, and Jinyang Li. 2011. Transactional storage for geo-replicated systems. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’11).
[58]
Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. 1995. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’95).
[59]
Robert H. Thomas. 1979. A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. Database Syst. 4, 2 (June 1979), 180--209.
[60]
Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP’15). ACM.
[61]
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. 2015a. Building consistent transactions with inconsistent replication. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’15).
[62]
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. 2015b. Building Consistent Transactions with Inconsistent Replication (extended version). Technical Report 2014-12-01 v2. University of Washington. Retrieved from http://syslab.cs.washington.edu/papers/tapir-tr-v2.pdf.
[63]
Yang Zhang, Russell Power, Siyuan Zhou, Yair Sovran, Marcos K. Aguilera, and Jinyang Li. 2013. Transaction chains: Achieving serializability with low latency in geo-distributed storage systems. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’13).

Cited By

View all
  • (2024)GaussDB-Global: A Geographically Distributed Database System2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00383(5111-5118)Online publication date: 13-May-2024
  • (2024)FC: Adaptive Atomic Commit via Failure Detection2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00162(2026-2039)Online publication date: 13-May-2024
  • (2024)Efficient Partial Order Based Transaction Processing for Permissioned Blockchains2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00152(1888-1901)Online publication date: 13-May-2024
  • Show More Cited By

Recommendations

Reviews

George R. Mayforth

Large-scale transaction processing systems must balance the need for minimal response times with maintaining the integrity of the data being processed. In theory, transactions are processed sequentially in the order received. Very large systems exceed the capacity of even a supercomputer; therefore, the data processing must be divided among multiple systems. Consider the classic example: an airline reservation system. Multiple agents globally enter hundreds of thousands (or millions) of transactions daily, linking customers to available seats on specific flights. The traditional way to handle such a huge transaction load takes two forms. First, the total database is divided into "shards," that is, segments that are parts of the whole but can be treated independently. Within each shard there may be replicas, that is, multiple instances of the same data in that shard. The redundancy ensures the integrity of the data. In this complex system there remains the requirement of apparent serial processing, that is, transactions handled in chronological sequence. This implies coordination and synchronization among the replicas in a shard because their data cannot be treated independently. A strict synchronization protocol further implies fairly large overhead as replicas communicate with each other to maintain the integrity of the process. The paper proposes to reduce processing overhead by means of an "inconsistent" replication protocol that permits replicas to process transactions in different sequences. Maintaining the correct sequence of transaction processing is accomplished by pushing the maintenance of serial integrity up to the transaction processing and application layers. This reduces the replication overhead significantly. The authors provide a detailed description of the inconsistent replication (IR) protocol and associated transaction processing system, transactional application protocol for inconsistent replication (TAPIR), including handling failures and offering proofs of correctness regarding transaction consistency. They compare the proposed system to multiple existing systems, including experimental results running TAPIR-IR plus two similar systems built using traditional protocols. The experiments involved transaction processing and data replication in the US, Europe, and Asia. The experimental results show increased throughput by up to a factor of three times versus the two conventional systems. Given the potential performance increase, real-world testing is needed to verify the experimental results. The text is clearly written, supplemented with useful and readable charts. Many comparisons to earlier systems and a large set of references are provided. This paper should be of interest to those who design and implement large-scale transaction processing systems.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Computer Systems
ACM Transactions on Computer Systems  Volume 35, Issue 4
November 2017
97 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/3297862
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2018
Accepted: 01 August 2018
Revised: 01 July 2018
Received: 01 October 2016
Published in TOCS Volume 35, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Distributed transactional storage
  2. inconsistent replication
  3. strict serializability

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)372
  • Downloads (Last 6 weeks)43
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)GaussDB-Global: A Geographically Distributed Database System2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00383(5111-5118)Online publication date: 13-May-2024
  • (2024)FC: Adaptive Atomic Commit via Failure Detection2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00162(2026-2039)Online publication date: 13-May-2024
  • (2024)Efficient Partial Order Based Transaction Processing for Permissioned Blockchains2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00152(1888-1901)Online publication date: 13-May-2024
  • (2023)PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent ReadsProceedings of the VLDB Endowment10.14778/3611540.361156216:12(3754-3767)Online publication date: 1-Aug-2023
  • (2023)Detock: High Performance Multi-region Transactions at ScaleProceedings of the ACM on Management of Data10.1145/35892931:2(1-27)Online publication date: 20-Jun-2023
  • (2023)Rethink the Linearizability Constraints of Raft for Distributed SystemsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.323539935:11(11815-11829)Online publication date: 1-Nov-2023
  • (2023)Knock Out 2PC with Practicality Intact: a High-performance and General Distributed Transaction Protocol2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00179(2317-2331)Online publication date: Apr-2023
  • (2023)Churn-Tolerant Leader Election Protocols2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS57875.2023.00032(96-107)Online publication date: Jul-2023
  • (2022)CornusProceedings of the VLDB Endowment10.14778/3565816.356583716:2(379-392)Online publication date: 23-Nov-2022
  • (2022)Competitive Consistent Caching for Transactions2022 IEEE 38th International Conference on Data Engineering (ICDE)10.1109/ICDE53745.2022.00207(2154-2167)Online publication date: May-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media