DOI: 10.1145/3423211.3425689
research-article

Resilient Cloud-based Replication with Low Latency

Published: 11 December 2020

Abstract

Existing approaches to tolerating Byzantine faults in geo-replicated environments require systems to execute complex agreement protocols over wide-area links and consequently are often associated with high response times. In this paper, we address this problem with Spider, a resilient replication architecture for geo-distributed systems that leverages the availability characteristics of today's public-cloud infrastructures to minimize complexity and reduce latency. Spider models a system as a collection of loosely coupled replica groups whose members are hosted in different cloud-provided fault domains (i.e., availability zones) of the same geographic region. This structural organization makes it possible to achieve low response times by placing replica groups in close proximity to clients while still enabling the replicas of a group to interact over short-distance links. To handle the inter-group communication necessary for strong consistency, Spider uses a reliable group-to-group message channel with first-in-first-out semantics and built-in flow control that significantly simplifies system design.
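
To make the channel abstraction mentioned at the end of the abstract more concrete, the following is a minimal sketch of first-in-first-out delivery combined with window-based flow control. It is an illustration only and not Spider's channel: the names `Sender`, `Receiver`, `try_send`, `on_message`, `on_ack`, and `demo` are hypothetical, the window-based scheme is an assumption, and the sketch models a single sender and a single receiver in one process without authentication, retransmission, or any handling of Byzantine behavior.

```python
class Sender:
    """Hypothetical sending endpoint with a bounded window of unacknowledged messages."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window      # maximum number of unacknowledged messages
        self.next_seq = 0         # sequence number assigned to the next message
        self.acked_up_to = 0      # every sequence number below this is acknowledged

    def try_send(self, payload):
        """Return (seq, payload) to transmit, or None if the window is full."""
        if self.next_seq - self.acked_up_to >= self.window:
            return None           # flow control: the caller has to wait for an ack
        message = (self.next_seq, payload)
        self.next_seq += 1
        return message

    def on_ack(self, ack: int) -> None:
        """Process a cumulative acknowledgment; stale or invalid values are ignored."""
        if self.acked_up_to < ack <= self.next_seq:
            self.acked_up_to = ack


class Receiver:
    """Hypothetical receiving endpoint that delivers messages in sequence order."""

    def __init__(self) -> None:
        self.expected = 0         # next sequence number to deliver
        self.pending = {}         # out-of-order messages waiting for a gap to close
        self.delivered = []       # payloads in delivery (FIFO) order

    def on_message(self, seq: int, payload) -> int:
        """Buffer a message, deliver any in-order prefix, return the cumulative ack."""
        if seq >= self.expected:
            self.pending[seq] = payload
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return self.expected


def demo():
    sender, receiver = Sender(window=2), Receiver()
    m0 = sender.try_send("a")
    m1 = sender.try_send("b")
    assert sender.try_send("c") is None    # window of 2 is full
    receiver.on_message(*m1)                # arrives early, stays buffered
    ack = receiver.on_message(*m0)          # closes the gap, delivers "a" then "b"
    sender.on_ack(ack)                      # window opens again
    receiver.on_message(*sender.try_send("c"))
    return receiver.delivered


if __name__ == "__main__":
    print(demo())  # ['a', 'b', 'c']
```

In this sketch the sender is throttled purely by the receiver's cumulative acknowledgments, so a slow receiver bounds how much data is in flight, and reordering in transit never changes the delivery order.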

    Published In

    Middleware '20: Proceedings of the 21st International Middleware Conference
    December 2020
    455 pages
    ISBN:9781450381536
    DOI:10.1145/3423211
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Byzantine fault tolerance
    2. geo-replication

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    Middleware '20: 21st International Middleware Conference
    December 7 - 11, 2020
    Delft, Netherlands

    Acceptance Rates

    Overall Acceptance Rate 203 of 948 submissions, 21%

    Cited By

    • (2024) Fault‐tolerance approaches for distributed and cloud computing environments: A systematic review, taxonomy and future directions. Concurrency and Computation: Practice and Experience 36:13. DOI: 10.1002/cpe.8081. Online publication date: 18-Mar-2024
    • (2023) Fluidity: Location-Awareness in Replicated State Machines. Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing (192-201). DOI: 10.1145/3555776.3577763. Online publication date: 27-Mar-2023
    • (2023) Making Intrusion Tolerance Accessible: A Cloud-Based Hybrid Management Approach to Deploying Resilient Systems. 2023 42nd International Symposium on Reliable Distributed Systems (SRDS) (254-267). DOI: 10.1109/SRDS60354.2023.00033. Online publication date: 25-Sep-2023
    • (2023) Managing Consensus in Distributed Transaction Systems. Blockchains (53-89). DOI: 10.1002/9781119781042.ch3. Online publication date: 8-Sep-2023
    • (2021) Mitigating Virtualization Failures Through Migration to a Co-Located Hypervisor. IEEE Access 9 (105255-105269). DOI: 10.1109/ACCESS.2021.3098644. Online publication date: 2021
