
Resilient Cloud-based Replication with Low Latency

Authors:

Michael Eischer,

Tobias Distler

Middleware '20: Proceedings of the 21st International Middleware Conference

Pages 14 - 28

https://doi.org/10.1145/3423211.3425689

Published: 11 December 2020

Abstract

Existing approaches to tolerating Byzantine faults in geo-replicated environments require systems to execute complex agreement protocols over wide-area links and are consequently often associated with high response times. In this paper, we address this problem with Spider, a resilient replication architecture for geo-distributed systems that leverages the availability characteristics of today's public-cloud infrastructures to minimize complexity and reduce latency. Spider models a system as a collection of loosely coupled replica groups whose members are hosted in different cloud-provided fault domains (i.e., availability zones) of the same geographic region. This structural organization makes it possible to achieve low response times by placing replica groups in close proximity to clients while still enabling the replicas of a group to interact over short-distance links. To handle the inter-group communication necessary for strong consistency, Spider uses a reliable group-to-group message channel with first-in-first-out semantics and built-in flow control that significantly simplifies system design.
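
As a rough illustration of what first-in-first-out semantics combined with flow control can mean for a message channel, the following Python sketch models a single sender and a single receiver in memory. It is not Spider's channel design, which the abstract does not describe in detail: the class name `FifoChannel`, the `window` parameter, and the use of delivery progress as an implicit acknowledgement are all assumptions made for this example, and Byzantine faults, replica groups, and networking are deliberately left out.

```python
from typing import Dict, List, Optional, Tuple


class FifoChannel:
    """Toy channel with in-order delivery and a bounded send window.

    Hypothetical illustration only; not the channel from the paper.
    """

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window                # max messages sent but not yet delivered
        self.sent = 0                       # sequence number of the next message to send
        self.delivered = 0                  # sequence number of the next message to deliver
        self.pending: Dict[int, str] = {}   # messages that arrived ahead of their turn

    def send(self, payload: str) -> Optional[Tuple[int, str]]:
        """Return a (seq, payload) packet, or None if the window is full."""
        if self.sent - self.delivered >= self.window:
            return None                     # flow control: the sender has to wait
        packet = (self.sent, payload)
        self.sent += 1
        return packet

    def receive(self, seq: int, payload: str) -> List[str]:
        """Accept a packet in any order; return the payloads that became deliverable."""
        if seq >= self.delivered:           # ignore duplicates of delivered messages
            self.pending[seq] = payload
        ready: List[str] = []
        while self.delivered in self.pending:
            ready.append(self.pending.pop(self.delivered))
            self.delivered += 1             # delivery doubles as the acknowledgement
        return ready


channel = FifoChannel(window=2)
first = channel.send("a")
second = channel.send("b")
blocked = channel.send("c")                 # None: two messages are still outstanding
out_of_order = channel.receive(*second)     # [] because "a" has not arrived yet
in_order = channel.receive(*first)          # ["a", "b"], delivered in send order
```

In this sketch the sender is throttled purely by the number of undelivered messages, which is one simple way to bound buffering at the receiver; a real group-to-group channel would additionally need authentication and agreement among the members of each group.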

Index Terms

Computer systems organization
  Dependable and fault-tolerant systems and networks


Published In

Middleware '20: Proceedings of the 21st International Middleware Conference

December 2020

455 pages

ISBN: 9781450381536

DOI: 10.1145/3423211

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

Middleware '20

Sponsor:

ACM

Middleware '20: 21st International Middleware Conference

December 7 - 11, 2020

Delft, Netherlands

Acceptance Rates

Overall Acceptance Rate 203 of 948 submissions, 21%



