Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks

Published: 01 December 2022 Publication History

Abstract

This paper presents a high-performance consensus protocol, Nezha, which can be deployed by cloud tenants without support from cloud providers. Nezha bridges the gap between protocols such as Multi-Paxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, that provide better performance, but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have.
Nezha uses a new multicast primitive called deadline-ordered multicast (DOM). DOM uses high-accuracy software clock synchronization to synchronize sender and receiver clocks. Senders tag messages with deadlines in synchronized time; receivers process messages in deadline order, on or after their deadline.
We compare Nezha with Multi-Paxos, Fast Paxos, Raft, (optimized) NOPaxos, and 2 recent protocols, Domino and TOQ-EPaxos, that use synchronized clocks. In throughput, Nezha outperforms all baselines by a median of 5.4X (range: 1.9--20.9X). In latency, Nezha outperforms five baselines by a median of 2.3X (range: 1.3--4.0X), with one exception: it sacrifices 33% of latency compared with our optimized NOPaxos in one test. We also prototype two applications, a key-value store and a fair-access stock exchange, on top of Nezha to show that Nezha only modestly reduces their performance relative to an unreplicated system.

References

[1]
Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, Athanasios Xygkis, and Igor Zablotchi. 2020. Microsecond Consensus for Microsecond Applications. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 599--616. https://www.usenix.org/conference/osdi20/presentation/aguilera
[2]
Ailidani Ailijiang, Aleksey Charapko, and Murat Demirbas. 2019. Dissecting the Performance of Strongly-Consistent Replication Protocols. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1696--1710.
[3]
Martin Biely, Zarko Milosevic, Nuno Santos, and André Schiper. 2012. S-Paxos: Offloading the Leader for High Throughput State Machine Replication. In Proceedings of the 2012 IEEE 31st Symposium on Reliable Distributed Systems. 111--120.
[4]
Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. 1996. The Weakest Failure Detector for Solving Consensus. J. ACM 43, 4 (jul 1996), 685--722.
[5]
Aleksey Charapko, Ailidani Ailijiang, and Murat Demirbas. 2021. PigPaxos: Devouring the Communication Bottlenecks in Distributed Consensus. In Proceedings of the 2021 International Conference on Management of Data. Association for Computing Machinery, New York, NY, USA, 235--247.
[6]
Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. 2013. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (Farminton, Pennsylvania) (SOSP '13). Association for Computing Machinery, New York, NY, USA, 1--17.
[7]
Ben Darnell. [n.d.]. Scaling Raft. https://www.cockroachlabs.com/blog/scaling-raft.
[8]
etcd. [n.d.]. Benchmarking etcd v3. https://etcd.io/docs/v3.5/benchmarks/etcd-3-demo-benchmarks/.
[9]
Jinkun Geng, Anirudh Sivaraman, Balaji Prabhakar, and Mendel Rosenblum. 2022. Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks [Technical Report]. https://arxiv.org/abs/2206.03285
[10]
Yilong Geng. 2018. Self-Programming Networks: Architecture and Algorithms. Ph.D. Dissertation. Stanford University. https://www.proquest.com/dissertations-theses/self-programming-networks-architecture-algorithms/docview/2438700930/se-2?accountid=14026
[11]
Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. 2018. Exploiting a Natural Network Effect for Scalable, Finegrained Clock Synchronization. In Proceedings of the 15th USENIX Conference on Networked Systems Design and Implementation (Renton, WA, USA) (NSDI'18). USENIX Association, Berkeley, CA, USA, 81--94.
[12]
Ahmad Ghalayini, Jinkun Geng, Vighnesh Sachidananda, Vinay Sriram, Yilong Geng, Balaji Prabhakar, Mendel Rosenblum, and Anirudh Sivaraman. 2021. CloudEx: A Fair-Access Financial Exchange in the Cloud. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS '21). 8.
[13]
Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J. Weinberger. 1994. Quickly Generating Billion-Record Synthetic Databases. SIGMOD Rec. 23, 2 (may 1994), 243--252.
[14]
Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: A Correctness Condition for Concurrent Objects. ACM Trans. Program. Lang. Syst. 12, 3 (July 1990), 463--492.
[15]
Keon Jang, Justine Sherry, Hitesh Ballani, and Toby Moncaster. 2015. Silo: Predictable Message Latency in the Cloud. SIGCOMM Comput. Commun. Rev. 45, 4 (Aug. 2015), 435--448.
[16]
Theo Jepsen, Stephen Ibanez, Gregory Valiant, and Nick McKeown. 2022. From Sand to Flour: The Next Leap in Granular Computing with NanoSort. arXiv preprint arXiv:2204.12615 (2022).
[17]
Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soulé, Changhoon Kim, and Ion Stoica. 2018. NetChain: Scale-Free Sub-RTT Coordination. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18).
[18]
Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David Mazières, and Christos Kozyrakis. 2019. Shinjuku: Preemptive Scheduling for microsecond-scale Tail Latency. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 345--360. https://www.usenix.org/conference/nsdi19/presentation/kaffes
[19]
Wayne Kevin. [n.d.]. Longest Increasing Subsequence. https://www.cs.princeton.edu/courses/archive/spring13/cos423/lectures/LongestIncreasingSubsequence.pdf.
[20]
Leslie Lamport. 1998. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133--169.
[21]
Leslie Lamport. 2006. Fast Paxos. Distributed Computing 19 (October 2006), 79--103. https://www.microsoft.com/en-us/research/publication/fast-paxos/
[22]
Leslie Lamport et al. 2001. Paxos made simple. ACM Sigact News 32, 4 (2001), 18--25.
[23]
Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. [n.d.]. NOPaxos Code Repository. https://github.com/UWSysLab/NOPaxos.
[24]
Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. 2016. Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, USA.
[25]
Barbara Liskov. 1991. Practical Uses of Synchronized Clocks in Distributed Systems. In Proceedings of the Tenth Annual ACM Symposium on Principles of Distributed Computing (Montreal, Quebec, Canada) (PODC '91). Association for Computing Machinery, New York, NY, USA, 1--9.
[26]
Barbara Liskov and James Cowling. 2012. Viewstamped replication revisited. (2012).
[27]
Jennifer Lundelius and Nancy Lynch. 1984. A New Fault-Tolerant Algorithm for Clock Synchronization. In Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing (Vancouver, British Columbia, Canada) (PODC '84). Association for Computing Machinery, New York, NY, USA.
[28]
Jennifer Lundelius and Nancy Lynch. 1984. An upper and lower bound for clock synchronization. Information and Control 62, 2 (1984), 190--204.
[29]
Yanhua Mao, Flavio P. Junqueira, and Keith Marzullo. 2008. Mencius: Building Efficient Replicated State Machines for WANs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (San Diego, California) (OSDI'08). USENIX Association, USA, 369--384.
[30]
Whittaker Michael, Ailijiang Ailidani, Charapko Aleksey, Demirbas Murat, Giridharan Neil, Hellerstein Joseph, Howard Heidi, Stoica Ion, and Szekeres. Adriana. 2021. Scaling Replicated State Machines with Compartmentalization. Proc. VLDB Endow. (2021), 12.
[31]
Microsoft. [n.d.]. Global data distribution with Azure Cosmos DB-under the hood. https://docs.microsoft.com/en-us/azure/cosmos-db/global-dist-under-the-hood.
[32]
Microsoft. [n.d.]. NIC series. https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series.
[33]
Jeffrey C. Mogul and Ramana Rao Kompella. 2015. Inferring the Network Latency Requirements of Cloud Tenants. In Proceedings of the 15th Workshop on Hot Topics in Operating Systems (HotOS XV). USENIX Association, Kartause Ittingen, Switzerland. https://www.usenix.org/conference/hotos15/workshop-program/presentation/mogul
[34]
Iulian Moraru, David G. Andersen, and Michael Kaminsky. 2013. There is More Consensus in Egalitarian Parliaments. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (Farminton, Pennsylvania) (SOSP '13). Association for Computing Machinery, New York, NY, USA, 358--372.
[35]
Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li. 2016. Consolidating Concurrency Control and Consensus for Commits under Conflicts. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 517--532. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/mu
[36]
Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC 14). USENIX Association, Philadelphia, PA, 305--319. https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
[37]
John Ousterhout and Diego Ongaro. [n.d.]. Implementing Replicated Logs with Paxos. https://ongardie.net/static/raft/userstudy/paxos.pdf.
[38]
Seo Jin Park and John Ousterhout. 2019. Exploiting Commutativity for Practical Fast Replication. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation (NSDI'19) (Boston, MA, USA). USENIX Association, USA, 47--64.
[39]
PingCap. [n.d.]. TiKV-Data Sharding. https://tikv.org/deep-dive/scalability/data-sharding.
[40]
Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy. 2015. Designing Distributed Systems Using Approximate Synchrony in Data Center Networks. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation (Oakland, CA) (NSDI'15). USENIX Association, USA, 43--57.
[41]
Redis Enterprise. [n.d.]. Redis. https://redis.io.
[42]
C. Schensted. 1961. Longest Increasing and Decreasing Subsequences. Canadian Journal of Mathematics 13 (1961), 179--191.
[43]
Fred B. Schneider. 1993. Replication Management Using the State-Machine Approach. ACM Press/Addison-Wesley Publishing Co., USA, 169--197.
[44]
Chrysoula Stathakopoulou, Matej Pavlovic, and Marko Vukolić. 2022. State Machine Replication Scalability Made Simple. In Proceedings of the Seventeenth European Conference on Computer Systems (Rennes, France) (EuroSys '22). Association for Computing Machinery, New York, NY, USA, 17--33.
[45]
Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. 2020. CockroachDB: The Resilient Geo-Distributed SQL Database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 1493--1509.
[46]
Sarah Tollman, Seo Jin Park, and John Ousterhout. 2021. EPaxos Revisited. In Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21). USENIX Association, 613--632. https://www.usenix.org/conference/nsdi21/presentation/tollman
[47]
Feiran Wang. 2019. Building High-performance Distributed Systems with Synchronized Clocks. Ph.D. Dissertation. Stanford University. https://www.proquest.com/dissertations-theses/building-high-performance-distributed-systems/docview/2467863602/se-2?accountid=14026
[48]
Zhaoguo Wang, Changgeng Zhao, Shuai Mu, Haibo Chen, and Jinyang Li. 2019. On the Parallels between Paxos and Raft, and How to Port Optimizations. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing. Association for Computing Machinery, New York, NY, USA, 445--454.
[49]
Michael Whittaker, Neil Giridharan, Adriana Szekeres, Joseph M Hellerstein, Heidi Howard, Faisal Nawab, and Ion Stoica. 2020. Matchmaker paxos: A reconfigurable consensus protocol [technical report]. arXiv preprint arXiv:2007.09468 (2020).
[50]
Michael Whittaker, Neil Giridharan, Adriana Szekeres, Joseph M Hellerstein, and Ion Stoica. 2020. Bipartisan paxos: A modular state machine replication protocol. arXiv preprint arXiv:2003.00331 (2020).
[51]
Yunjing Xu, Zachary Musgrave, Brian Noble, and Michael Bailey. 2013. Bobtail: Avoiding Long Tails in the Cloud. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX Association, Lombard, IL, 329--341. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/xu_yunjing
[52]
Yahoo! [n.d.]. YCSB Workload. https://github.com/brianfrankcooper/YCSB/tree/master/workloads.
[53]
Xinan Yan, Linguan Yang, and Bernard Wong. 2020. Domino: Using Network Measurements to Reduce State Machine Replication Latency in WANs. In Proceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies (Barcelona, Spain) (CoNEXT '20). Association for Computing Machinery, New York, NY, USA, 351--363.

Cited By

View all
  • (2024)Hybrid Shared-Buffer for Multi-Master DatabasesJournal of Database Management10.4018/JDM.35692035:1(1-27)Online publication date: 7-Nov-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 16, Issue 4
December 2022
426 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 December 2022
Published in PVLDB Volume 16, Issue 4

Check for updates

Badges

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)46
  • Downloads (Last 6 weeks)2
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Hybrid Shared-Buffer for Multi-Master DatabasesJournal of Database Management10.4018/JDM.35692035:1(1-27)Online publication date: 7-Nov-2024

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media