Using Paxos to build a scalable, consistent, and highly available datastore

Published: 01 January 2011

Abstract

Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads. This paper describes Spinnaker's Paxos-based replication protocol. The use of Paxos ensures that a data partition in Spinnaker will be available for reads and writes as long as a majority of its replicas are alive. Unlike traditional master-slave replication, this is true regardless of the failure sequence that occurs. We show that Paxos replication can be competitive with alternatives that provide weaker consistency guarantees. Compared to an eventually consistent datastore, we show that Spinnaker can be as fast or even faster on reads and only 5% to 10% slower on writes.
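
The availability condition stated in the abstract can be made concrete with a little arithmetic. The sketch below is only an illustration and is not taken from Spinnaker's implementation: the function names `majority` and `partition_available` are hypothetical, and it assumes the usual definition of a majority as strictly more than half of the replicas.

```python
def majority(n_replicas: int) -> int:
    """Smallest replica count that is strictly more than half of n_replicas."""
    return n_replicas // 2 + 1


def partition_available(alive: int, n_replicas: int = 3) -> bool:
    """True if the replicas still alive can form a majority quorum.

    The default of 3 mirrors the 3-way replication named in the abstract;
    the check itself is an assumption made for illustration.
    """
    return alive >= majority(n_replicas)


if __name__ == "__main__":
    # Walk through every possible number of live replicas for one partition.
    for alive in range(4):
        print(f"alive={alive} available={partition_available(alive)}")
```

Under these assumptions, 3-way replication gives a quorum size of 2, so a partition stays available after the loss of any single replica and becomes unavailable only when two of its three replicas are down at once.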

Published In

Proceedings of the VLDB Endowment, Volume 4, Issue 4
January 2011
59 pages

Publisher

VLDB Endowment

Qualifiers

  • Research-article

Cited By

  • (2024) Cloud Actor-Oriented Database Transactions in Orleans. Proceedings of the VLDB Endowment 17:12 (3720-3730). DOI: 10.14778/3685800.3685801. Online publication date: 1-Aug-2024
  • (2023) Joining Parallel and Partitioned State Machine Replication Models for Enhanced Shared Logging Performance. Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing (90-99). DOI: 10.1145/3615366.3615422. Online publication date: 16-Oct-2023
  • (2022) In-network leaderless replication for distributed data stores. Proceedings of the VLDB Endowment 15:7 (1337-1349). DOI: 10.14778/3523210.3523213. Online publication date: 22-Jun-2022
  • (2022) High Availability Framework and Query Fault Tolerance for Hybrid Distributed Database Systems. Proceedings of the 31st ACM International Conference on Information & Knowledge Management (3451-3460). DOI: 10.1145/3511808.3557086. Online publication date: 17-Oct-2022
  • (2022) Scalable and adaptive log manager in distributed systems. Frontiers of Computer Science: Selected Publications from Chinese Universities 17:2. DOI: 10.1007/s11704-022-1357-5. Online publication date: 8-Aug-2022
  • (2022) Efficient and stable quorum-based log replication and replay for modern cluster-databases. Frontiers of Computer Science: Selected Publications from Chinese Universities 16:5. DOI: 10.1007/s11704-020-0210-y. Online publication date: 1-Oct-2022
  • (2021) Accurate and efficient follower log repair for Raft-replicated database systems. Frontiers of Computer Science: Selected Publications from Chinese Universities 15:2. DOI: 10.1007/s11704-019-8349-0. Online publication date: 1-Apr-2021
  • (2020) A Framework for supporting DBMS-like indexes in the cloud. Proceedings of the VLDB Endowment 4:11 (702-713). DOI: 10.14778/3402707.3402711. Online publication date: 3-Jun-2020
  • (2020) Meerkat. Proceedings of the Fifteenth European Conference on Computer Systems (1-14). DOI: 10.1145/3342195.3387529. Online publication date: 15-Apr-2020
  • (2020) Dependency Preserved Raft for Transactions. Database Systems for Advanced Applications (228-245). DOI: 10.1007/978-3-030-59410-7_14. Online publication date: 24-Sep-2020
