
Exploring Scalability of BFT Blockchain Protocols through Network Simulations

Published: 05 December 2024

Abstract

Novel Byzantine fault-tolerant (BFT) state machine replication protocols improve scalability for their practical use in distributed ledger technology, where hundreds of replicas must reach consensus. Assessing that BFT protocol implementations meet their performance expectations requires careful evaluation. We propose a new methodology using scalable network simulations to predict BFT protocol performance. Our simulation architecture allows for the integration of existing BFT implementations without modification or re-implementation, offering a cost-effective alternative to large-scale cloud experiments. We validate our method by comparing simulation results with real-world cloud deployments, showing that simulations can accurately predict performance at larger scales when network limitations dominate.
In our study, we applied this methodology to assess the performance of several “blockchain-generation” BFT protocols, including HotStuff, Kauri, Narwhal & Tusk, and BullShark, under realistic network conditions (with constrained 25 Mbit/s bandwidth) and induced faults. Kauri emerges as the top performer, achieving 6,742 operations per second (op/s) with 128 replicas, outperforming BullShark (2,318 op/s) and Tusk (1,952 op/s). HotStuff, using secp256k1 and BLS signatures, reaches 494 op/s and 707 op/s, respectively, demonstrating the efficiency of BLS-signature aggregation for saving bandwidth. This study demonstrates that state-of-the-art asynchronous BFT protocols can achieve competitive throughput in large-scale, real-world scenarios.

1 Introduction

Over the past few years, the development of novel distributed ledger technologies (DLTs) has gained momentum and accelerated research on Byzantine fault-tolerant (BFT) blockchain protocols. One of the main ambitions behind using BFT protocols like PBFT [14] instead of the Proof-of-Work [39] mechanism within the consensus layer of a blockchain fabric was to replace the energy-inefficient computation race of Proof-of-Work with a more efficient agreement among all correct blockchain replicas on which block is appended next to the ledger [53]. Traditional and well-studied BFT protocols like PBFT can accomplish this task, yet the cost of running agreements among many replicas results in a sharp performance decline in large-scale systems, as shown in Figure 1. Research on how to address the scalability challenges of PBFT and improve its performance in larger systems led to an abundant variety of new BFT protocol designs that have recently seen the light of day [1, 3, 13, 16, 17, 18, 23, 28, 36, 40, 44, 50, 51, 52, 60].
Fig. 1.
Fig. 1. Simulation results of different BFT protocols for 1 KiB transaction payloads in a single data center environment with 10 Gbit/s bandwidth on every host. This matches the setup in which HotStuff has been evaluated [59].
As of the time of writing, many of these protocols are directly linked to an existing blockchain project (like Avalanche [44]) or are even employed (often as further evolved successors) in multiple blockchains. For instance, HotStuff [60] was advanced by Facebook in its Libra project (later renamed to DiemBFT), which was subsequently discontinued; eventually, the DiemBFT consensus was adopted by another blockchain, Aptos.1 Jolteon [23] is another successor of HotStuff that is currently employed in the Flow2 and Monad3 blockchains. The BFT protocol Narwhal & Tusk [17] (and its successor BullShark [50]) were integrated into the Sui4 blockchain and are currently under development to be used by Aptos and Celo.5

Cloud-scale Deployments.

Ascertaining that these novel BFT protocols can provide sufficient performance in realistic, large-scale systems requires careful evaluation of their runtime behavior. For this purpose, research papers describing these protocols contain evaluations with large-scale deployments that are conducted on cloud infrastructures like AWS, where experiments deploy up to several hundred replicas (for examples, see References [13, 16, 36, 40, 59] and many more) to demonstrate a BFT protocol’s performance at large scale, thus justifying its practicability in a blockchain setting.
Evaluations using real protocol deployments usually offer the best realism but can be costly and time-consuming, especially when testing multiple configurations. The high costs associated with benchmarking a system that runs hundreds of virtual machines on AWS hinder the rapid validation of prototype implementations during their development stage, including testing these implementations in failure scenarios. Thus, an interesting alternative for cheap and rapid validation of novel BFT protocol implementations can be to predict the system performance using simulations.

Simulating BFT Protocols.

BFTSim [48] is the first simulator that was developed for an eye-to-eye comparison of BFT protocols, but it lacks the necessary scalability to be useful for the newer “blockchain generation” of BFT protocols (and apparently only up to \(n=32\) PBFT replicas can be successfully simulated [56]). Moreover, BFTSim demands a BFT protocol to be modeled in the P2 language [37], which is somewhat error-prone considering the complexity of BFT protocols and also time-consuming. A more recent tool [56] allows for scalable simulation of BFT protocols, but it unfortunately also requires a complete re-implementation of the BFT protocol in JavaScript. Further, this tool cannot make predictions on the system throughput and thus its performance evaluation is limited to observing latency.
In our approach, we address a critical gap in the state-of-the-art by introducing a simulation method for predicting BFT system performance that is highly scalable, without necessitating any re-implementation of the protocol. This key feature distinguishes our approach from conventional methods and represents a significant advancement for predicting BFT system performance in a practical way.

Why Not Just Use Network Emulation?.

Emulation tries to duplicate the exact behavior of what is being emulated. Emulators like Kollaps can be used to reproduce AWS-deployed experiments with BFT protocols on a local server farm [25]. A clear advantage of emulation is that it preserves realism: BFT protocols still operate in real time and use real kernel and network protocols. Compared to simulation, however, emulation is less resource-friendly, as it executes the real application code in real time and thus requires many physical machines to conduct large-scale experiments.

Simulation as (Better?) Alternative.

Simulation decouples simulated time from real time and employs abstractions that help accelerate executions: Aspects of interest are captured through a model, which means the simulation only mimics the protocol’s environment (or also its actual protocol behavior if the application model is re-implemented). This has the advantage of easier experimental control, excellent reproducibility (i.e., deterministic protocol runs), and increased scalability when compared to emulation.
These benefits of simulation come at some cost: A potential drawback remains the question of the validity of results, since the model may not reflect reality faithfully enough. Furthermore, another limitation of all current BFT simulators is the need to modify or (usually) re-implement the BFT protocol to use it within a simulation engine.

1.1 Our Contributions

In our approach, we address these limitations and aim to make simulation a useful approach for large-scale BFT protocol performance prediction:
We define a software architecture for high-performance and scalable network simulation, in which we can plug an existing, unmodified BFT protocol implementation into a simulation, without requiring any re-implementation or source code modifications. By doing this, we ensure validity on the application level, since the actual application binaries are used to start real Linux processes that are finally connected into the simulation engine.
A threat to validity is that relying solely on network simulation neglects the impact on performance of processing time spent on CPU-intensive tasks of BFT protocols, namely, signature generation and verification. We conducted experiments that show that the performance results of simulations can be a useful approximation of real measurements in large-scale systems. This is because, at a certain number of replicas (often as soon as \(n \ge 32\)), the overall system performance is mostly dictated by the underlying network, which persists as the performance bottleneck in the system. We provide more detailed insights on this in our validity analysis in Section 5.
To demonstrate the usability of our method, we use BFT protocols from the “blockchain generation,” namely, HotStuff [60], Kauri [40], Narwhal & Tusk [17], and BullShark [50], to conduct simulations at a large scale.
Based on our experimental scalability analysis, we contrast these BFT protocols with each other and discuss potential advantages in their designs. Particularly, we explore if the novel, asynchronous BFT protocols (such as Tusk and BullShark) match the performance of their partially synchronous relatives (HotStuff and Kauri).
This journal article is the extended version of a previously published conference paper [10]. In this version, we extend our research scope towards asynchronous BFT protocols, thus featuring insights derived from our simulations with Narwhal & Tusk and BullShark (the last listed contribution). Apart from novel experimentation and added discussions, this journal article also provides additional elaborations in the methodology section that were omitted for space reasons in the previous conference version.

1.2 Structure

This article is structured as follows:
In Section 2, we summarize related research work and explain how our approach differs from prior methods.
In Section 3, we briefly review the basics of BFT protocols to guide the reader through the different communication strategies that these protocols employ.
In the main part of our article, we explain our methodology, which includes both the design of our simulation architecture (Section 4) and a validity analysis that compares simulation results with real measurements (Section 5).
In the second part of our article (Section 6), we evaluate the performance of selected BFT protocols under varying boundary conditions. With respect to a possible blockchain use case, we construct large-scale and realistic wide-area network scenarios with up to \(n=256\) replicas and heterogeneous network latencies derived from real planetary-scale deployments, and, in some scenarios, with failures. We envision an apples-to-apples comparison of the performance of scalable BFT protocol implementations in realistic networks.
Finally, we conclude in Section 7 and outline directions for future work in Section 8.

2 Related Work

2.1 Simulation or Modeling of BFT Protocols

BFTSim. The first simulator specifically designed for traditional BFT protocols such as PBFT [14] or Zyzzyva [33] is BFTSim [48]. It is tailored for small replica groups, and its limited scalability renders BFTSim impractical for newer larger-scale BFT protocols. BFTSim requires modeling BFT protocols in the P2 language, which introduces error-proneness given the complexity of protocols, such as PBFT’s view change mechanism and Zyzzyva’s numerous corner cases. While capable of simulating faults, it only considers non-malicious behavior, lacking functionality to tackle sophisticated Byzantine attacks. BFTSim uses ns-2 for realistic networking and is resource-friendly, running on a single machine.
In contrast to BFTSim, our method allows us to experiment with real protocol implementations. We believe this has two main benefits: First, in some sense, it strengthens the validity of simulation results, because no error-prone re-implementation or modeling of a (very complex) BFT protocol is needed. Instead, the source code of the implementation serves as the single reference point (i.e., following the principle of code is law). This advantage also means that we can save a lot of time and evaluate a larger number of BFT protocols without much effort. It is easy for the developer of a new BFT protocol to adopt our method, because it only requires writing a new protocol connector; subsequently, the developer can compare their implementation against several already-supported BFT protocols in pre-defined benchmarking scenarios. Second, the presented methodology improves on the scalability aspect, because it can run with hundreds of replicas, which seems to be impossible for BFTSim (according to the experimentation of Reference [56], only up to 32 PBFT replicas and a limited number of clients is supported in BFTSim).
Scalable BFT Simulation. Recently, Wang et al. [56] introduced a BFT simulator that exhibits resource-friendliness, high scalability, and includes an “attacker module” with predefined attacks such as partitioning, adaptive, and rushing attacks. Similar to BFTSim, it requires the re-implementation of a BFT protocol (in JavaScript). One limitation to consider is that this simulator cannot yet measure throughput. Additionally, instead of emulating real network protocols, it assumes a high-level model to capture network characteristics. In this model, message delays are represented by a variable sampled from Gaussian or Poisson distributions, which must be defined in advance. Latencies are not derived from a real-world statistical latency map (thus, not reflecting heterogeneous latencies as they occur in real WANs).
In contrast to the simulator of Wang et al., the simulation method presented in this work does not require a re-implementation of the protocol in JavaScript and can also measure throughput, which is an important performance metric for blockchains. It should be also noted that the goals of the tool of Wang et al. and our tool were somewhat different: The authors of Reference [56] were interested in analyzing the potential impact of problematic network conditions (such as misconfigured timeouts) and attacks on BFT protocol latency, which makes it suitable for a validation purpose. Our method focuses on mimicking realistic deployment scenarios (such as planetary-scale deployments) with authentic latency maps to reason about the system performance (both latency and throughput) in real-world blockchain deployments.
Further Approaches. Furthermore, related work in behavior prediction includes stochastic modeling of BFT protocols [41] and a prior high-level simulation model that examined the impact of various message exchange patterns in BFT protocols, yet it falls short in terms of its suitability for analyzing real-world system metrics [47]. Moreover, validation capabilities of BFT protocols have been improved through the generation of unit tests using the Twins methodology [5].
To the best of our knowledge, the methodology for evaluating BFT protocols as presented in this work is the only one featuring a plug-and-play capability, which not only alleviates the evaluation of a broad variety of BFT protocols but also helps strengthen the validity of results, because no error-prone re-implementation is involved. We present a comparison of our approach with the two main competing methods in terms of features in Table 1.
Table 1.
Feature | BFTSim [48] | Wang et al. [56] | This work
plug & play utility | ✗ | ✗ | ✓
scalability | ✗ | ✓ | ✓
resource-friendly | ✓ | ✓ | ✓
assess throughput | ✓ | ✗ | ✓
assess latency | ✓ | ✓ | ✓
Table 1. Simulation-based Methodologies Envisioned for BFT Protocol Research

2.2 Generic Simulators and Emulators

Additionally, there are tools available for emulating or simulating a wide range of generic distributed systems. These tools can constitute powerful building blocks for evaluating BFT protocols.
Emulators. Emulators such as Mininet [30, 35] and Kollaps [25] create realistic networks that run actual Internet protocols and application code with time synchronized to the wall clock. Both approaches offer a high level of realism but are less resource-friendly. Mininet itself is not scalable, but this issue was addressed with the introduction of MaxiNet [57], which enables distributed emulation across multiple physical machines. Kollaps [25] is a scalable emulator but requires a significant number of physical machines for conducting large-scale experiments.
Simulators. Furthermore, ns-3 [43] is a resource-friendly and scalable network simulator, but it necessitates the development of an application model, thus impeding application layer realism (and preventing plug-and-play utility). Phantom [32] uses a hybrid emulation/simulation architecture: It executes real applications as native OS processes, co-opting the processes into a high-performance network and kernel simulation and thus can scale to large system sizes. The methodology presented in this work utilizes Phantom [32] to conduct the simulations, because its hybrid design allows for high-performance network simulations. We address open challenges that remain when using Phantom for BFT protocols and large-scale systems (see Section 4.2.1).

2.3 Blockchain Research

Several simulators have been developed for blockchain research, including Shadow-Bitcoin [38], the Bitcoin blockchain simulator [24], BlockSim [22], SimBlock [2], and ChainSim [54]. However, these simulators primarily focus on constructing models that accurately represent the features of Proof-of-Work (PoW) consensus mechanisms. As a result, as of the time of writing, they are less suitable for adoption when conducting BFT protocol research. Our proposed methodology is explicitly designed for BFT protocol research and is efficient in conducting large-scale simulations. Recent research also investigated how to faithfully simulate any BFT protocol that uses public coins and shared objects, like common coins as used in Block DAGs [4].

2.4 Tools for Vulnerability Detection or Performance Evaluation of BFT Protocols

Vulnerability Detection. A few tools have been developed that aim to spot vulnerabilities in BFT protocols. Twins [5] is a unit test case generator designed to simulate Byzantine behavior through the replication of cryptographic IDs of replicas, thus producing protocol runs that feature equivocations or forgotten replica states. We plan to explore the integration of Twins within our simulation methodology in future work. Related work also investigates checking the liveness of streamlined BFT protocols [19]: The authors introduced the concepts of partial state and hot state to represent relevant information about protocol executions. It then employs temperature checking and lasso detection to identify recurring states, thus identifying possible liveness issues in an automated way. ByzFuzz [58] is a fuzzing technique designed to automatically detect bugs in implementations of BFT SMR protocols. BFTDiagnosis [55] is a security testing framework that can inject pre-defined “dishonest behavior” in specific replicas.
Performance Evaluation. Further, there are tools dedicated to the performance evaluation of BFT protocols. HyPerf [29] is a hybrid methodology based on model checking and simulation techniques for evaluating the performance of BFT protocols. BFT-Bench [27] is a benchmarking framework for evaluating and comparing BFT protocols in practice by scheduling experiments with BFT protocols on a local cluster of available machines. Diabolo [26] is a benchmarking suite to evaluate blockchain performance under various workload scenarios. Recently, a BFT testing framework was proposed for the Flow blockchain [31].

3 BFT Protocols

BFT state machine replication (SMR) protocols achieve fault tolerance by coordinating client interactions with a set of independent service replicas [46]. The replicated service remains functional as long as the number of faulty replicas does not surpass a threshold t out of n replicas. In BFT SMR, replicas order (blocks of) operations issued by clients through a consensus primitive to preserve consistency among all replicas.

3.1 Synchrony Models

BFT protocols rely on synchrony models to capture temporal behavior and timing assumptions, which are important for the concrete protocol designs: In the asynchronous system model, no assumptions are made about upper bounds for the network transmission time or the time needed to perform local computations. These are said to complete eventually, i.e., they happen after an unknown (but finite) amount of time. In contrast, the synchronous system model assumes the existence of known upper bounds. A sweet spot is captured by the partially synchronous system model [21], which assumes a system may start in asynchrony but bounds exist that only hold eventually, i.e., after some unknown time span that is modeled by the global stabilization time (GST). Partial synchrony is employed in PBFT and many BFT protocols that followed. PBFT guarantees liveness only under partial synchrony but always remains safe even when the system is asynchronous.
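One common way to state the partial synchrony assumption formally (in the style of Reference [21]; individual protocols may use slightly different formulations) is:
\[
\exists\,\Delta,\ \mathit{GST}: \quad \text{every message sent at time } t \text{ is delivered by time } \max(t, \mathit{GST}) + \Delta,
\]
where both \(\Delta\) and GST exist but are unknown to the protocol.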

3.2 Running Agreements

Replicas repeatedly agree on a block of operations. To complete a single agreement instance, replicas usually go through multiple phases. In our approach, we model each of the phases such that it consists of several components:
(1)
Dissemination of one or more proposals: In some cases, this component can be omitted if the phase uses the output of the previous phase as its input.
(2)
Confirmation: This component requires voting to confirm a proposal.
(3)
Aggregation: Votes that match across different replicas are collected to form a quorum certificate.
Figure 2(a) illustrates these phases for the well-known PBFT protocol. The first phase, prepare, encompasses both dissemination (pre-prepare messages) and confirmation (prepare messages). Quorum certificates contain votes from sufficiently many replicas to guarantee that no two different blocks can receive a certificate, thus committing the block. After an agreement completes, replicas execute the operations. An operation completes at the client once the client can verify that it was committed in some block and that the execution result is correct.
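To make the three components concrete, the following TypeScript sketch models a single agreement phase; all type and function names are illustrative and not taken from any of the evaluated implementations.

    // Illustrative model of one agreement phase: dissemination, confirmation, aggregation.
    // All names are hypothetical; real protocols implement these steps very differently.
    interface Proposal { view: number; blockHash: string; }
    interface Vote { replicaId: number; blockHash: string; signature: string; }
    interface QuorumCertificate { blockHash: string; votes: Vote[]; }

    class AgreementPhase {
      constructor(private n: number, private t: number) {}

      // (1) Dissemination: a designated replica sends the proposal to all others
      //     (may be skipped if the phase re-uses the output of the previous phase).
      disseminate(proposal: Proposal, send: (to: number, p: Proposal) => void): void {
        for (let replica = 0; replica < this.n; replica++) send(replica, proposal);
      }

      // (2) Confirmation: each replica votes for the proposal it received.
      confirm(replicaId: number, proposal: Proposal, sign: (h: string) => string): Vote {
        return { replicaId, blockHash: proposal.blockHash, signature: sign(proposal.blockHash) };
      }

      // (3) Aggregation: matching votes from a quorum of n - t replicas form a quorum certificate.
      aggregate(votes: Vote[], blockHash: string): QuorumCertificate | null {
        const matching = votes.filter((v) => v.blockHash === blockHash);
        return matching.length >= this.n - this.t ? { blockHash, votes: matching } : null;
      }
    }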
Fig. 2.
Fig. 2. Communication patterns of different BFT protocols: (b) all-to-all, (c) linear (star), and (d) tree.
In many BFT protocols, replicas operate in views, which behave orthogonally to the agreement instances (see Figure 2(a)). A view defines a composition of replicas, selects leader(s), and in some cases establishes the dissemination pattern. Some protocols re-use a single view under a stable leader (as long as agreement instances finish), as shown in Figure 2(a), but others may change the view for each stage or instance. In leader-driven BFT protocols, a dedicated view change phase synchronizes replicas, replaces the leader, and eventually ensures liveness by electing a new leader under whose regency agreement instances succeed.

3.3 PBFT

In 1999, Castro and Liskov proposed the Practical Byzantine Fault Tolerance (PBFT) protocol, which became known as the first practical approach for tolerating Byzantine faults [14]. PBFT’s practicality comes from its optimal resilience threshold (\(t=\lfloor \frac{n-1}{3}\rfloor\)) and its high performance, comparable to non-replicated systems.6 We show the message flow of PBFT’s normal operation in Figure 2(b). In PBFT, the leader collects client operations in a block and broadcasts the block in a pre-prepare message to all replicas. Subsequently, all replicas vote and collect quorum certificates through the messages prepare and commit, which are realized as all-to-all broadcasts. PBFT does not scale well for larger system sizes, because all operations are tunneled through a single leader, which must disseminate large blocks to all of the other replicas. This makes the leader’s up-link bandwidth a bottleneck for the whole system’s performance. Further, PBFT’s all-to-all broadcasts incur \(O(n^2)\) messages (and authenticators7) to be transmitted in the system. PBFT was followed by a lineage of BFT protocols exploring the possible design space (see References [1, 20] for an overview).
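The standard quorum-intersection argument explains why certificates of \(2t+1\) matching votes with \(n=3t+1\) prevent two conflicting blocks from being certified in the same instance:
\[
|Q_1 \cap Q_2| \;\ge\; 2(2t+1) - n \;=\; 2(2t+1) - (3t+1) \;=\; t+1,
\]
so any two quorums share at least one correct replica, and a correct replica votes for at most one block per instance.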

3.4 HotStuff

HotStuff achieves linear message complexity: The leader gathers votes from all other replicas and disseminates a quorum certificate [60]. To reduce the cost of transmitting message authenticators, the leader can use a simple aggregation technique to compress \(n-t\) signatures into a single fixed-size threshold signature. This threshold signature scheme uses the quorum size as a threshold, and a valid threshold signature implies that a quorum of replicas has signed it. Consequently, the threshold signature has a size of \(O(1)\), which is a significant improvement over transmitting \(O(n)\) individual signatures.
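The following sketch illustrates the size argument only; the “aggregation” routine is a placeholder that models constant output size and is not the API of libhotstuff or any real BLS/threshold-signature library.

    // Sketch only: placeholder "crypto", not a real signature scheme.
    type PartialSig = Uint8Array;   // one replica's signature share, constant size
    type ThresholdSig = Uint8Array; // aggregated signature, also constant size

    // Placeholder: a real scheme combines shares cryptographically; here we only
    // model that the output stays constant-size (e.g., 48 bytes) regardless of n.
    function combineShares(shares: PartialSig[]): ThresholdSig {
      const out = new Uint8Array(48);
      shares.forEach((s, i) => { out[i % out.length] ^= s[0] ?? 0; });
      return out;
    }

    // Leader: instead of forwarding n - t individual signatures (O(n) bytes in
    // total), it ships a single constant-size quorum certificate.
    function buildQuorumCertificate(shares: PartialSig[], quorumSize: number): ThresholdSig {
      if (shares.length < quorumSize) {
        throw new Error("not enough matching votes yet");
      }
      return combineShares(shares.slice(0, quorumSize)); // O(1)-size certificate
    }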
The original implementation of HotStuff (we refer to it as HotStuff-secp256k1) is based on elliptic curves and does not feature signature aggregation (i.e., combining multiple signatures into a single signature of fixed size). Later, an implementation that uses BLS signatures [12] and features signature aggregation was made available by Reference [40] (we refer to it as HotStuff-bls). As depicted in Figure 2(c), the communication flow remains imbalanced: Each follower replica communicates exclusively with the leader (which is the center of a star topology), while the leader has to communicate with all other replicas.

3.5 Kauri

The use of a tree-based communication topology offers an advantage, as it distributes the responsibility of aggregating and disseminating votes and quorum certificates, thus relieving the leader. Kauri [40] is a tree-based BFT SMR protocol (see Figure 2(d)) that organizes all replicas in a 3-layered tree (the root, who is also the leader, internal replicas, and leaf replicas). Kauri introduces a timeout for aggregation to address failures of the leaf replicas. To handle the failure of internal tree replicas, Kauri employs a reconfiguration scheme, which guarantees to find a correct set of internal replicas, given that the number of failures lies below a certain threshold.
The added latency caused by the additional number of communication steps is mitigated through a more sophisticated pipelining approach (that can start several agreement instances per protocol stage) than the pipelining mechanism employed in HotStuff, which launches only a single agreement instance for each protocol stage.

3.6 Narwhal & Tusk and BullShark

Narwhal [17] is an asynchronous BFT mempool protocol, which can be utilized for the reliable dissemination of operations. The main idea for improving scalability in Narwhal & Tusk is to separate the dissemination of data from the actual consensus. The Narwhal primary protocol is used to implement a quorum-based, reliable Byzantine broadcast, which creates a directed acyclic graph (DAG). Every block in the DAG structure needs to be acknowledged by a quorum of \(2t+1\) replicas, whose signatures form a certificate of availability. In Narwhal, each replica runs a primary for the reliable broadcast of small pieces of metadata (the block headers), while the actual blocks are disseminated by the workers (there can be several workers on each replica). The primary protocol as depicted in Figure 3(a) is executed by all replicas (not just by \(r_0\) as illustrated in the figure) to reliably broadcast block headers. The certificates of availability guarantee that missing blocks can be requested from the worker threads of replicas that voted on the respective block headers.
Fig. 3.
Fig. 3. Overview of Narwhal & Tusk.
Tusk [17] is the asynchronous BFT consensus algorithm that runs on Narwhal. To achieve consensus, all that is needed is to interpret the DAG created by Narwhal through a deterministic rule, committing all blocks in a specific order. For this purpose, Tusk uses a shared random coin to select a leader in every second round (this is done in retrospect after two more rounds, i.e., round \(i+2\) selects the leader of round i). If a leader has sufficient support of \(t+1\) replicas (see the blue links in Figure 3(b)), then its block is committed along with all of its causal history of blocks (see the green blocks in Figure 3(b)). In this figure, all five rounds are necessary to commit the green blocks. Round 4 is used to confirm the unknown leader L2 through voting, while round 5 is the round that determines that \(r_2\) is the chosen leader for \(L2\) through the random coin.
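The commit rule can be sketched as follows; this is a simplified illustration with invented names, and wave bookkeeping, tie-breaking, and the recursive commit of earlier skipped leaders are omitted.

    // Simplified sketch of a Tusk-style commit rule over the Narwhal DAG.
    interface Block {
      author: number;
      round: number;
      parents: Block[]; // certified blocks of the previous round referenced by this block
    }

    // Try to commit the leader block of (even) round r, using the implicit votes
    // cast by round r + 1 blocks that reference it. The shared coin is passed in
    // as a function; it yields the same leader on all replicas, but only once
    // round r + 2 has been reached (selection in retrospect).
    function tryCommitLeader(
      dag: Block[][], r: number, t: number,
      sharedCoin: (round: number) => number,
    ): Block[] {
      const leaderId = sharedCoin(r);
      const leader = dag[r]?.find((b) => b.author === leaderId);
      if (!leader) return [];

      const support = (dag[r + 1] ?? []).filter((b) => b.parents.includes(leader)).length;
      if (support < t + 1) return []; // not enough links; leader remains uncommitted for now

      return causalHistory(leader); // commit the leader and its entire causal history
    }

    function causalHistory(block: Block, seen = new Set<Block>()): Block[] {
      if (seen.has(block)) return [];
      seen.add(block);
      return block.parents.flatMap((p) => causalHistory(p, seen)).concat(block);
    }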
BullShark [50] is a successor of Narwhal & Tusk that aims to achieve lower latency in synchronous networks by adding a fast path, so it can be viewed as an optimization of the former for the partially synchronous setting. BullShark works on the same DAG structure, as depicted in Figure 3(b), but interprets it as a succession of waves, where each wave consists of four rounds in the DAG. Different from Tusk, BullShark distinguishes between steady-state leaders and fallback leaders. Each wave contains a single fallback leader (in the wave’s first round but selected in retrospect) and two predefined steady-state leaders (in the first and third rounds). The envisioned reduction in latency comes from the fact that the partially synchronous BullShark can commit one round faster than Tusk (two rounds are needed instead of three), since in Tusk an additional round is necessary to determine the leader in retrospective using the random coin.

4 Methodology

In this section, we first justify and motivate our ambition of evaluating actual BFT protocol implementations through a simulation of the running distributed system. In this simulation, replica and client components are instantiated using the provided protocol implementations and integrated into an event-based simulation that constructs and manages the system environment. Moreover, we explain the selection of protocols that we made in Section 3. All of our code and the experiment description files are open-source and available on GitHub.8 While our evaluations focus on BFT protocols, our proposed methodology can be also applied to crash fault-tolerant (CFT) protocols such as Paxos [34] and Raft [42] without changes.
Subsequently, we describe the software architecture of our simulation approach. The approach involves the user inputting a simple experimental description file (EDF), specified in YAML format, to a frontend. The frontend then prepares all runtime artifacts, creates a realistic network topology, and schedules a new experiment. This scheduling is done by launching an instance of the backend, which runs the network simulation. A detailed overview is displayed in Figure 4. After that, we validate simulation results by comparing them with measurements from real setups that we mimicked.
Fig. 4.
Fig. 4. Architecture employed in our simulation method.

4.1 Why Simulate BFT Protocol Implementations?

One of the main benefits of our concrete methodology is the plug-and-play utility. This means we can guarantee application realism, because the actual implementation is used to start real Linux processes that serve as the application model, thus duplicating actual implementation behavior. In particular, with regard to the application level, the overall approach can be considered an emulation. At the same time, users face no re-implementation or modeling effort, which could easily introduce errors due to the high complexity of BFT protocols and would also be time-consuming.
Furthermore, simulating actual BFT protocol implementations rather than specifically crafted models is useful for the purpose of rapid prototyping and validation. Some implementation-level bugs in BFT systems might not occur in “common” \(n=4\) scenarios, and thus simulations can be utilized to conduct integration tests at a larger scale. Similarly, they can be utilized by developers to create integration/regression tests that answer questions like “did my last commit negatively affect the protocol performance at a larger scale?” Some advantages exist for simulations in general. First, simulations make it easy to investigate the runtime behavior of BFT protocols in an inexpensive way, much cheaper than real-world deployments. Nowadays, more and more BFT protocol implementations are published open-source. Simulations make it easy to compare these protocols under common conditions and fairly reason about their performance in scenarios in which performance becomes network-bound. In particular, it is possible to explore the parameter space of both protocol parameters and network conditions in a systematic way and in a controlled environment, which even produces deterministic results.
Moreover, in our methodology, we can create large network topologies by using latency maps (such as those provided by Wonderproxy9), which can provide more regions than what most cloud providers, e.g., AWS, can offer. Last, simulation can also serve a didactical purpose. Simulations can help to achieve a better understanding of how BFT protocols behave at a large scale. For instance, the methodology can be used to support teaching in distributed system labs at universities by letting students gain hands-on experience with a set of already implemented BFT protocols.

Phantom as Choice for the Backend.

We chose Phantom [32] for the part of the backend that finally conducts the simulations. The main reason lies in its high-performance, hybrid simulation/emulation architecture, which offers the possibility to directly execute applications (thus benefiting realism) while still running them in the context of a cohesive network simulation [32]. We conducted a comparison with other approaches in an earlier workshop paper [9].

4.2 The Frontend: Accelerating Large-scale Simulations of BFT Protocol Implementations

Conducting large-scale simulations of BFT protocols first requires tackling the following set of challenges:

4.2.1 Challenges.

Realistic Network. The simulation quality depends on realistic and large network topologies for arbitrary system sizes. The characteristics of their communication links should ideally resemble real-world deployments. This is crucial to allow realistic simulation of wide-area network environments.
Bootstrapping. We need aid in setting up the BFT protocol implementations for their deployment, since bootstrapping BFT protocol implementations in a plug-and-play manner involves many steps that can be tedious, error-prone, and protocol-specific. Examples include the generation of protocol-specific runtime artifacts like cryptographic key material, or configuration files that differ for every BFT protocol.
Bulk Experimentation. When developing and testing BFT algorithms, different combinations of protocol settings result in numerous experiments being conducted. Since simulations run in virtual time, they can take hours, depending on the host system’s specifications. For the sake of user experience and convenience, we find it necessary for experiments to be specified in bulk and run sequentially without any need for user intervention.
Inducing Faults. Additionally, it should be possible to evaluate the performance of BFT protocols when faults occur during the runtime of the system, thus the capability of seamlessly inducing faults needs to be integrated.
Supervision of the Runtime. We may want to track and evaluate resources needed during simulation runs, such as CPU utilization and memory usage.
Extensibility. It should be easy to integrate new BFT protocol implementations into the methodology.
These reasons led us to develop a frontend, a tool on top of the Phantom simulator [32] to simplify and accelerate the evaluation of unmodified BFT protocol implementations.

4.2.2 Experimental Description and Frontend Design.

The frontend is composed of several components (see Figure 4) and follows a modular architecture, in that it is not tailored to a specific BFT protocol, but is easily extensible.

Scheduler.

The toolchain is administered by a scheduler that manages all tools, i.e., for preparing an environment, configuring runtime artifacts for a BFT protocol, and initializing a resource monitor. The scheduler invokes protocol connectors to set up a BFT protocol and loads experiment description files (see Figure 5 for an example that specifies a single experiment) that contain a set of experiments to be conducted for the specified BFT protocol. Finally, it starts Phantom once an experiment is ready for its execution (addressing ③). The scheduler also initializes a resource monitor to collect information on resource consumption (such as allocated memory and CPU time) during simulation runs, as well as the total simulation time (addressing ⑤). These statistics can serve as indicators of a possible need for vertically scaling the host machine and as estimates for the necessary resources to run large simulations.
Fig. 5.
Fig. 5. Example for a simple experimental description file (EDF) in YAML format.

Environment Generator.

The environment generator creates network topologies10 as a complete graph for any system size, resembling realistic LAN or WAN settings (thus addressing ①). To replicate the geographic dispersion of replicas realistically, the environment generator employs a cloudping component, which retrieves real round-trip latencies between all AWS regions from Cloudping.11 This allows the tool to create network topologies that resemble real BFT protocol deployments on the AWS cloud infrastructure. We also implemented a larger latency map that uses 51 distinct locations sourced from Wonderproxy’s latency statistics. The cloudping component can either load an up-to-date latency map from an online source or use one of the existing ones from the repository. Note that using the same latency map is necessary for maintaining determinism and thus a requirement for reproducibility. The EDF.network description defines the distribution of replicas and clients on a latency map and configures bandwidth and packet loss.
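As an illustration, a topology generator of this kind could look roughly as follows; the names, the intra-region delay, and the mapping of round-trip times to one-way link latencies are our assumptions, not the actual frontend code.

    // Sketch: build a complete-graph topology from a region-to-region RTT map.
    interface Topology {
      hosts: { id: string; region: string; bandwidthMbit: number }[];
      links: { a: string; b: string; latencyMs: number; packetLoss: number }[];
    }

    // latencyMap: region -> region -> round-trip time in ms (e.g., from Cloudping).
    function buildTopology(
      latencyMap: Record<string, Record<string, number>>,
      replicasPerRegion: number,
      bandwidthMbit: number,
      packetLoss = 0,
    ): Topology {
      const regions = Object.keys(latencyMap);
      const hosts = regions.flatMap((region) =>
        Array.from({ length: replicasPerRegion }, (_, i) => ({
          id: `${region}-replica-${i}`, region, bandwidthMbit,
        })),
      );

      // Complete graph: one link per unordered host pair; as a simplifying
      // assumption, one-way link latency is taken as half the measured RTT,
      // and intra-region pairs get a small constant delay.
      const links: Topology["links"] = [];
      for (let i = 0; i < hosts.length; i++) {
        for (let j = i + 1; j < hosts.length; j++) {
          const rtt = hosts[i].region === hosts[j].region
            ? 1
            : latencyMap[hosts[i].region][hosts[j].region];
          links.push({ a: hosts[i].id, b: hosts[j].id, latencyMs: rtt / 2, packetLoss });
        }
      }
      return { hosts, links };
    }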

Protocol Connectors.

For each BFT protocol implementation that we want to simulate, it is necessary to create protocol configuration files and sufficiently many public key pairs required to launch all replicas and clients. Since protocol options and cryptographic primitives vary, depending on the concrete BFT protocol, we implement the protocol-specific setup routine (see ②) as a tool called protocol connector, which is invoked by the scheduler. A connector must implement the methods build() and configure(). This way, it is simple to extend our toolchain and support new BFT protocols, as it only requires writing a new protocol connector (thus addressing ⑥—in our experience, this means writing between 100 and 200 lines of code in JavaScript).
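A connector skeleton could look roughly like this; only build() and configure() are prescribed by our toolchain, and everything else shown (types, fields, file layout) is hypothetical.

    // Illustrative connector skeleton; the actual interface may carry more arguments.
    interface ProtocolConnector {
      build(): Promise<void>;            // compile or fetch the unmodified protocol binaries
      configure(experiment: {            // produce per-replica config files and key material
        replicas: number;
        clients: number;
        outputDir: string;
      }): Promise<void>;
    }

    // Hypothetical connector for a new BFT protocol implementation.
    class MyProtocolConnector implements ProtocolConnector {
      async build(): Promise<void> {
        // e.g., invoke the protocol's own build system (cargo, cmake, ...)
      }

      async configure(experiment: { replicas: number; clients: number; outputDir: string }): Promise<void> {
        for (let i = 0; i < experiment.replicas; i++) {
          // e.g., generate a key pair and write `${experiment.outputDir}/replica-${i}.conf`
        }
      }
    }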

Fault Induction.

Our frontend can also induce faults during simulation runs, which is important to reason about the performance in “uncivil”12 scenarios. Since BFT protocols sometimes employ different resilience thresholds, we allow the user to specify a desired threshold of replicas in which faults are induced (therefore addressing ④). We model a static threshold adversary as often assumed by BFT protocols.
A simple and currently supported fault type is type: crash, which terminates the faulty replica processes at a specific timestamp within the simulation. Another scenario that we can run is a denial-of-service attack by setting type: dos and specifying an overload parameter: A malicious client is instantiated that sends a larger number of requests (a multiple, e.g., \(100\times\), of what the normal clients send) to test whether implementations can withstand such a scenario, e.g., by limiting the number of requests accepted from a single client and ensuring a fair batching strategy. Further, we support packetloss, which describes the ratio of packets to be dropped on the network level (this setting currently needs to be configured in the EDF.network section), to assess how well BFT implementations perform when networks behave lossy.
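For illustration, the fault-related knobs described above could be modeled as follows; the real EDF is a YAML file and its exact field names may differ from this TypeScript rendering.

    // Typed sketch of the options described above (type: crash, type: dos, packetloss).
    type FaultSpec =
      | { type: "crash"; threshold: number; atSecond: number } // crash up to `threshold` of the replicas at a point in simulated time
      | { type: "dos"; overload: number };                     // one malicious client sends `overload` times the normal request rate

    interface NetworkSpec {
      latencyMap: string;     // e.g., "aws21" or a Wonderproxy-based map
      bandwidthMbit: number;  // e.g., 25
      packetloss?: number;    // ratio of packets dropped on the network level
    }

    // Hypothetical example: crash up to the resilience threshold after 60 s of simulated time.
    const exampleFault: FaultSpec = { type: "crash", threshold: 0.33, atSecond: 60 };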
In future work, we want to explore more sophisticated (Byzantine) fault behavior, for instance, by seeking inspiration from the Twins [5] methodology. Twins is a unit test case generator for Byzantine behavior by duplicating cryptographic identities of replicas (e.g., leading to equivocations).

4.3 The Backend: Using Phantom to Simulate BFT Protocols as Native Linux Processes

As backend, we use Phantom, which employs a hybrid simulation/emulation architecture, in which real, unmodified applications execute as normal processes on Linux and are hooked into the simulation through a system call interface using standard kernel facilities [32]. An advantage of this is that this method preserves application layer realism as real BFT protocol implementations are executed. At the same time, Phantom is resource-friendly and runs on a single machine.
By utilizing its hybrid architecture, Phantom occupies a favorable position between the pure simulator ns-3 [43] and the pure emulator Mininet [35]. It maintains sufficient application realism necessary for BFT protocol execution while exhibiting greater resource-friendliness and scalability compared to emulators. Because Phantom strikes a balance that caters to the needs of BFT protocol research, we chose it as the backend for conducting protocol simulations.

Simulated Environment.

In Phantom, a network topology (the environment) can be described by specifying a graph, where virtual hosts are nodes and communication links are edges. The graph is attributed: For instance, virtual hosts specify available uplink/downlink bandwidth and links specify latency and packet loss. Each virtual host can be used to run one or more applications. This results in the creation of real Linux processes that are initialized by the simulator controller process as managed processes (managed by a Phantom worker). The Phantom worker uses LD_PRELOAD to preload a shared library (called the shim) for co-opting its managed processes into the simulation (see Figure 4). LD_PRELOAD is extended by a second interception strategy, which uses seccomp13 for cases in which preloading does not work.

Simulation Engine.

The shim constructs an inter-process communication channel (IPC) to the simulator controller process and intercepts functions at the system call interface. While the shim may directly emulate a few system calls, most system calls are forwarded and handled by the simulator controller process, which simulates kernel and networking functionality, for example, the passage of time, I/O operations on file, socket, pipe, timer, event descriptors, and packet transmissions.

Deterministic Execution.

Throughout the simulation, Phantom preserves determinism: It employs a pseudo-random generator, which is seeded from a configuration file, to emulate all randomness needed during simulation, in particular, the emulation of getrandom or reads of /dev/*random. Each Phantom worker allows only a single thread of execution across all processes it manages, so all remaining managed processes/threads are idle, thus preventing concurrent access to the managed processes’ memory [32].
In our workflow, Phantom is invoked by the Scheduler as soon as a new simulation experiment is ready for its execution and the host’s hardware resources are available.

5 Validation

In this section, we compare measurements of real BFT protocol runs with results that we achieve through simulations.

BFT Protocol Selection.

We justify the selection of BFT protocols in the following way: Our main ambition was to showcase the impact of different communication strategies (i.e., all-to-all, star, and tree) towards system performance, and thus we selected a single “representative” BFT protocol for each strategy (namely, PBFT, HotStuff, and Kauri, respectively). We also employ representatives of asynchronous, leaderless BFT protocols (Narwhal & Tusk and BullShark). We present an overview of the BFT protocols and corresponding implementations that we employ in Table 2.
Table 2.
Framework | BFT Protocol | Language | Repository on github.com
libhotstuff | HotStuff [59] | C++ | /hot-stuff/libhotstuff
themis [45] | PBFT [14] | Rust | /ibr-ds/themis
bft-smart | BFT-SMaRt [11] | Java | /bft-smart/library
hotstuff-bls [40] | HotStuff [59] (with BLS) | C++ | /Raycoms/Kauri-Public/
kauri | Kauri [40] | C++ | /Raycoms/Kauri-Public/
narwhal | Narwhal & Tusk [17] | Rust | /facebookresearch/narwhal/
bullshark | BullShark [50] | Rust | /facebookresearch/narwhal/tree/bullshark
Table 2. BFT Protocol Implementations that We Simulated

Simulator Machine Specification and Data Collection.

Our simulator runs on an Ubuntu 20.04 VM with 214 GB memory and 20 threads (16 threads used for simulation) on a host with an Intel Xeon Gold 6210U CPU with 2.5 GHz and 20 cores. Since the simulation runs in a virtual environment, performance results such as throughput or latency only depend on the experimental description (i.e., the network, replica, and client settings or induced faults at runtime) but not on the specification of the host. If a host with a faster CPU or more cores is utilized, then it can accelerate the time needed to conduct simulations and the host must provide sufficient RAM to support large-scale simulations.

5.1 HotStuff-secp256k1

In our first evaluation, we try to mimic the evaluation setup of the HotStuff paper (the arXiv version; see Reference [59]) to compare their measurements with our simulation results. Their setup consists of more than a hundred virtual machines deployed in an AWS data center; each machine has up to 10 Gbit/s bandwidth and there is less than 1 ms latency between each pair of machines (we use 1 ms in the simulation). The employed block size is 400. We compare against two measurement series: “p1024,” where the payload size of request and responses is 1,024 bytes and “10ms” with an empty payload, but the latency of all communication links is set to 10 ms. Our goal is to investigate how faithfully the performance of HotStuff can be predicted by regarding only the networking capabilities of replicas, which manifests at the point where the network becomes the bottleneck for system performance.
Observations. We display our results in Figure 6. The simulation results for the payload experiment indicate a similar drop in performance as the real measurements for \(n \ge 32\). For a small-sized replica group, the network simulation predicts higher performance: 200k op/s. This equals the theoretical maximum limited only by the 1 ms link latency, which leads to pipelined HotStuff committing a block of 400 requests every 2 ms. The difference in throughput decreases once the performance of HotStuff becomes more bandwidth-throttled (at \(n\ge 32\)). We also achieve close results in the “10ms” setting: 80 ms in the simulation vs. 84.1 ms real, and 20k op/s in the simulation vs. 19.2k op/s real for \(n=4\); but with an increasing difference14 for higher n, i.e., 84 ms vs. 106 ms and 19.2k op/s vs. 15.1k op/s for \(n=128\). The problem is that this experiment does not use any payload, which makes the performance less sensitive to a network bottleneck (which is usually caused by limited available bandwidth).
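This ceiling is simply the block rate imposed by the 2 ms round trip (1 ms per direction):
\[
\frac{400\ \text{operations}}{2\ \text{ms}} = 200{,}000\ \text{op/s}.
\]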
Fig. 6.
Fig. 6. Performance results of HotStuff-secp256k1 vs. its simulated counterpart using a bandwidth of 10 Gbit.

5.2 BFT-SMaRt and PBFT

In our next experiment, we validate BFT-SMaRt and PBFT15 by using measurements taken from Reference [59] and conducting our own experiment on a WAN that is constructed using four different AWS regions.

LAN environment.

To begin with, we mimic the “p1024” setup from Reference [59] (as the real measurement data we found for BFT-SMaRt is from Reference [59]), thus creating an environment with 1 ms network latency and 10 Gbit/s bandwidth. We simulate both protocols, BFT-SMaRt and PBFT, because they utilize an identical normal-case message pattern.
Observations. We display our results in Figure 7. We observe that initially (\(n\le 32\)) the real BFT-SMaRt performance results are considerably lower than what our simulations predict. This changes with increasing n, i.e., at \(n=128\), we observe 9,280 op/s for the simulated BFT-SMaRt (real measurement: 8,557 op/s) and 9,354 op/s for our PBFT implementation. In the latency graph, we observe a noticeable gap between real and simulated BFT-SMaRt. The main reason for this is that we could not exactly reproduce the operation sending rate from Reference [59], as it was not explicitly stated in their experimental setup.
Fig. 7.
Fig. 7. Performance of simulated BFT-SMaRt, simulated PBFT, and a real BFT-SMaRt execution in a 10 Gbit LAN.

Geo-distribution.

Next, we experiment with geographic dispersion, putting each BFT-SMaRt replica in a distinct AWS region. Our experimental setup is similar to experiments found in papers that research latency improvements (see References [6, 7, 49]). We employ an \(n=4\) configuration and choose the regions Oregon, Ireland, São Paulo, and Sydney for the deployment of a replica and a client application each. We run clients one after another, and each samples 1,000 requests without payload and measures end-to-end latency, while the leader replica (in Oregon) measures the system’s consensus latency.
Observations. We notice that consensus latency is only slightly higher in the simulation (237 ms vs. 249 ms), and further, the simulation results also display slightly higher end-to-end request latencies in all clients (see Figure 8). The deviation between simulated and real execution is the lowest in Oregon (1.3%) and the highest in São Paulo (3.5%).
Fig. 8.
Fig. 8. Comparison of a real BFT-SMaRt WAN deployment on the AWS infrastructure with its simulated counterpart.

5.3 Kauri and HotStuff-bls

Moreover, we mimic the global experiment from Kauri [40], which uses a varying number of 100, 200, and 400 replicas. The global setup assumes replicas being connected over a planetary-scale network, in which each replica possesses only 25 Mbit/s bandwidth and has a latency of 100 ms to every other replica. We validate two implementations that were made available by Reference [40]: HotStuff-bls, an implementation of HotStuff that uses BLS instead of secp256k1 (the signature scheme used in the original HotStuff implementation), and Kauri, which enriches HotStuff-bls through tree-based message dissemination (and aggregation) and an enhanced pipelining scheme.
Observations. Figure 9 shows our results. Overall, we observe that for both implementations, the results we could obtain for system throughput are almost identical. At \(n=400\), for Kauri, we observe 4,518 op/s (real: 4,584 op/s), and for HotStuff-bls it is 230 op/s (real: 252.67 op/s).
Fig. 9.
Fig. 9. Reproducing the “global” scenario from Reference [40] that uses 100 ms inter-replica latency and 25 Mbit/s bandwidth.
We also evaluated the latency of Kauri for deciding blocks (as in Figure 8 of Reference [40]), comparing against the \(n=100\), 25 Mbit/s experiment. While the real experiment in the Kauri paper reports a latency of 563 ms, in the simulation, deciding a block seems to take at least 585 ms.

5.4 Tusk and Bullshark

Last but not least, we validate the protocols Tusk and Bullshark, which are our representatives of asynchronous BFT protocols. For this purpose, we collect real system data by running benchmarks on the AWS cloud infrastructure, replicating the geographic deployment also used in References [17, 50], and we set a bandwidth limit of 25 Mbit/s on each host to make it more consistent with the global experiment from Kauri [40]. To reproduce the deployment scenario that was originally proposed in References [17, 50] and ensure geographic dispersion, we utilize 20 replicas and 20 clients, which are evenly distributed among the AWS regions Virginia, Sydney, Stockholm, Tokyo, and California. We repeatedly run the benchmark application from Reference [17] using incrementally higher operation sending rates on the clients until the system throughput is saturated and only latency increases. In the benchmark, clients send requests of 500 bytes in size and replicas employ a block header size of 5 KB.
Observations. Figure 10 displays our observations. As expected, we see that the real experimental data of Bullshark indicates a performance improvement when compared to Tusk, most notably displaying a request latency that is about 2 seconds lower while maintaining similar (or slightly better) throughput levels. The simulation results track the real-world observations quite closely. To be concrete, the throughput results show almost no deviation until the saturation point is reached (around a sending rate of 5k requests/s). Further, we found that the latency results of the simulation engine resemble the real-world observations but display minor yet noticeable deviations: In Bullshark, the simulated latency seems to be about 0.2 s higher than the real measurement data, while the simulated Tusk system displays latencies roughly 1 s higher than what we observed on AWS.
Fig. 10.
Fig. 10. Comparing Tusk and Bullshark simulation results with a real execution by conducting an experiment on AWS using \(n=20\) replicas in 5 different regions and 25 Mbit/s bandwidth.

5.5 Resource Consumption and Implementations

Further, we investigate how resource utilization, i.e., memory usage and simulation time, grows with an increasing system scale. For this purpose, we use the HotStuff-secp256k1 “10ms” simulations (which display a somewhat steady system performance for increasing system scale) on a Ubuntu 20.04 VM with 214 GB memory and 20 threads (16 threads used for simulation) on a host with an Intel Xeon Gold 6210U CPU at 2.5 GHz. We observe that active host memory and elapsed time grow with increasing system scale (see Figure 11). Based on the practically linear increase in memory utilization in Figure 11, we estimate that 512 replicas will need about 64 GiB memory, and it should be feasible to simulate up to 512 HotStuff replicas with a well-equipped host.
Fig. 11.
Fig. 11. Resource consumption of simulations.
During our validations, we tested several open-source BFT frameworks (see Table 2) that have been written in different programming languages (C++, Rust, Java). From our experience, the Java-based BFT-SMaRt library was the most memory-hungry implementation, but we were still able to simulate \(n=128\) replicas on our commodity hardware server without problems, which strengthens our belief in the scalability and resource-friendliness of our methodology.

6 Experimental Results

In this section, we simulate and compare different BFT protocol implementations and gain insights into the performance and behavior of these implementations across various scenarios:
(1)
failure-free: Benchmark protocols in failure-free execution for increasing system size and varying block size;
(2)
packetloss: The simulated network behaves lossy;
(3)
denial-of-service attack: A specific client tries to overload the system by submitting too many operations;
(4)
crashing-replicas: Similar to (1) but with induced crash faults at a certain point of simulated time.
The high-level goal is to present an apples-to-apples comparison of BFT implementations in a controlled environment.

6.1 General Setup

In our controlled environment, we simulate a real planetary-scale network, using 21 existing regions from the AWS cloud infrastructure, as shown in Figure 12. Latencies are retrieved from real latency statistics by the cloudping component of our frontend. Replicas vary in number and are evenly distributed across all regions. We use a 25 Mbit/s bandwidth rate, consistent with related research to model commodity hardware in world-spanning networks (e.g., see the global setup of Reference [40]). We utilize a variable number of clients to submit requests to the replicas. For instance, to unleash the maximum performance of Tusk and Bullshark, both protocols demand at least one client per replica, while the other BFT protocols can often be saturated with fewer clients. In such cases, we carefully select both the client count and concurrent request rate to fully saturate and thereby maximize16 the observable system throughput.
Fig. 12.
Fig. 12. The aws21 map mimics a planetary-scale deployment on the AWS infrastructure, with replicas spread across 21 regions, and clients (c) submitting requests to replicas.
In our simulations, we use the protocols PBFT, HotStuff-secp256k1, HotStuff-bls, and Kauri as representatives for partially synchronous BFT protocols and Tusk and Bullshark as representatives for asynchronous BFT protocols. Operations carry a payload of 500 bytes (roughly the average size of a Bitcoin operation), and the default block size is 1,000 operations for the partially synchronous protocols and 10 operations for the DAG-based protocols Tusk and Bullshark, since in these protocols every replica disseminates blocks. Each simulation deploys replicas and clients and then runs the BFT protocol for at least 120 seconds of simulated time within the environment.

6.2 Failure-free Scenario: Scalability and Block Size

In our first experiment, we evaluate the baseline performance of the BFT protocols in our constructed environment assuming that no failures happen. Further, we repeat each simulation while varying the system size n (namely, using a total of 64, 128, and 256 replicas) and varying the block size (for instance, by using both the default size of 1,000 operations and a block size that is better suited to a protocol with respect to achieving lower latency). We denote variations in the employed block size by appending the postfix “-blockSize” to the protocol name.

Observations for the Partially Synchronous Protocols.

We present our results in Figure 13. In particular, we observe striking differences in BFT protocol performance, spanning several orders of magnitude.
Fig. 13.
Fig. 13. Performance of partially synchronous BFT protocols in a geo-distributed, fault-free scenario using 25 Mbit/s bandwidth.
Kauri-1k displays the highest performance among all protocol implementations, with a throughput of 7,192 op/s at a latency of 535 ms (at \(n=64\)), and still maintains 4,917 op/s at a latency of 765 ms at \(n=256\). HotStuff-secp256k1-1k manages a throughput of 1,266 op/s at a latency of 4.48 s (at \(n=64\)), which decreases to 246 op/s at a latency of 21.681 s at \(n=256\). This means that at \(n=256\), Kauri achieves a throughput increase of almost \(20\times\) over HotStuff-secp256k1 in our heterogeneous setup (4,917 op/s vs. 246 op/s). On a side note, the evaluation of Kauri reports a possible increase of up to \(28\times\) over HotStuff in a setup created entirely with homogeneous latencies [40].
We further observe a surprisingly low performance of the PBFT-1k implementation we tested (75 op/s and a latency of 23.45 s at \(n=64\)). The problem with this implementation is that it includes full operations in its blocks, causing long dissemination delays for large blocks over the limited 25 Mbit/s bandwidth per host. The other protocol implementations include only the SHA-256 hashes of operations in their blocks, which shortens proposal dissemination time.
For a fair comparison, we additionally simulate this PBFT implementation as if it used the “big request” optimization17 [14] (denoted as PBFT-opt), which achieves 466 op/s at a latency of 3.7 s at \(n=64\). HotStuff-secp256k1 still beats the PBFT implementation we used because of its pipelining and its re-use of quorum certificates during aggregation, which improve performance in this setting.
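A back-of-the-envelope model illustrates why inlining full operations is so costly here. Assuming the leader unicasts each block to all \(n-1\) followers over its 25 Mbit/s uplink (ignoring headers, votes, and tree-based dissemination as in Kauri), the achievable throughput is roughly bounded as follows; the numbers are rough upper bounds, not simulation output.

# Back-of-the-envelope estimate of the leader's dissemination cost per block
# under a 25 Mbit/s uplink (ignores headers, votes, and retransmissions).
BANDWIDTH_BPS = 25e6 / 8          # 25 Mbit/s in bytes per second
BLOCK_OPS = 1000
OP_BYTES = 500                    # full operation payload
HASH_BYTES = 32                   # SHA-256 digest instead of the payload

def leader_limited_throughput(n: int, bytes_per_op: int) -> float:
    """Upper bound on op/s if the leader sends each block to n-1 followers."""
    block_bytes = BLOCK_OPS * bytes_per_op
    seconds_per_block = block_bytes * (n - 1) / BANDWIDTH_BPS
    return BLOCK_OPS / seconds_per_block

for n in (64, 256):
    print(n, round(leader_limited_throughput(n, OP_BYTES)),     # full operations
             round(leader_limited_throughput(n, HASH_BYTES)))   # hashes only

For \(n=64\) and full 500-byte operations this bound is already around 100 op/s, in line with the 75 op/s we measured for PBFT-1k, whereas hash-only blocks push the bound into the thousands.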
Notably, varying the block size impacts observed latency. Smaller blocks can be disseminated more quickly, which decreases latency, in particular if clients operate in a closed loop, i.e., issue a constant number of k operations concurrently and wait for the responses to pending operations before submitting new ones. We also observe that at \(n=256\), the BLS implementation HotStuff-400-bls increases throughput by \(1.85\times\) over its secp256k1 counterpart (353 vs. 191 op/s).
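The benefit of BLS at larger scale comes mainly from smaller quorum certificates. The following comparison assumes 64-byte secp256k1 signatures and a single 48-byte aggregated BLS signature plus a signer bitmap; these are typical sizes and may differ from the concrete libraries used by the implementations.

# Rough quorum-certificate (QC) size comparison; signature sizes are typical
# values and may differ from the concrete implementations.
SECP256K1_SIG = 64   # bytes per individual signature
BLS_AGG_SIG = 48     # bytes for one aggregated BLS signature

def qc_size(n: int, scheme: str) -> int:
    quorum = 2 * ((n - 1) // 3) + 1
    if scheme == "secp256k1":
        return quorum * SECP256K1_SIG
    # BLS: one aggregate signature plus an n-bit signer bitmap
    return BLS_AGG_SIG + (n + 7) // 8

for n in (64, 128, 256):
    print(n, qc_size(n, "secp256k1"), qc_size(n, "bls"))

At \(n=256\), an aggregated certificate is roughly two orders of magnitude smaller than a list of individual signatures.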

Observations for the Asynchronous Protocols.

Both asynchronous protocols Tusk and Bullshark achieve performance competitive with their partially synchronous counterparts in the \(n=64\) and \(n=128\) deployments (see Figure 14) but experience a significant drop in performance in the \(n=256\) setting. We think the reason for this drop is that the available bandwidth of 25 Mbit/s becomes exhausted by the Narwhal protocol solely for collecting votes and disseminating certificates, leaving little bandwidth for the actual transmission of operations. To put the results into perspective, at \(n=128\), Tusk and Bullshark achieve around 1.9k and 2.3k op/s, respectively. This means the throughput of Bullshark is \(3.8\times\) higher than the throughput of HotStuff-secp256k1. In the setting with \(n=64\) replicas, Bullshark achieves a latency that is competitive with HotStuff-secp256k1 (5.4 vs. 4.5 seconds), while Tusk is notably slower. At \(n=128\) and beyond, the latency of the asynchronous protocols spikes and is consistently higher than that of HotStuff-secp256k1.
Fig. 14.
Fig. 14. Performance of asynchronous BFT protocols in a geo-distributed, fault-free scenario using 25 Mbit/s bandwidth.
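A rough, simplified estimate illustrates why certificate traffic alone becomes problematic at \(n=256\). We assume 64-byte signatures and 200 bytes of header metadata (both assumptions, not measured values) and only count each replica broadcasting its own certificate of availability to all peers, ignoring headers, votes, and the operation payloads themselves.

# Rough per-replica, per-round upload estimate for Narwhal-style certificates;
# message sizes are assumptions, not measurements.
SIG_BYTES = 64         # assumed size of one ECDSA signature
HEADER_BYTES = 200     # assumed size of header metadata (digests, ids, round)
UPLINK_BPS = 25e6 / 8  # 25 Mbit/s in bytes per second

def certificate_upload_per_round(n: int) -> float:
    """Bytes a single replica uploads per round just to broadcast its own
    certificate of availability (header + 2f+1 signatures) to all peers."""
    quorum = 2 * ((n - 1) // 3) + 1
    certificate = HEADER_BYTES + quorum * SIG_BYTES
    return (n - 1) * certificate

for n in (64, 128, 256):
    upload = certificate_upload_per_round(n)
    print(n, f"{upload / 1e6:.2f} MB", f"{upload / UPLINK_BPS:.2f} s at 25 Mbit/s")

Under these assumptions, certificate broadcasting alone occupies a replica's 25 Mbit/s uplink for almost a second per round at \(n=256\), before any operations are transmitted, which is consistent with the performance drop we observe.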

6.3 Packet Loss

In our next experiment, we study the impact of lossy network links on system throughput. For this purpose, we employ the same environment as in the last section with a fixed size of \(n=128\) replicas but introduce a packet loss of \(2\%\) on every network link.

Observations.

We display our simulation results in Figure 15. Overall, we observe the same protocol performance for PBFT and HotStuff-secp256k1, while the throughput of Kauri drops only slightly. These observations indicate that a small packet loss of only 2% does not impact BFT protocol performance much. All tested protocol implementations use TCP to establish reliable channels. We speculate that the maintained performance is rooted in the redundancy provided by BFT replication algorithms: they only need about 2/3 of the vote messages to make progress and form a quorum certificate. Additionally, vote messages are typically very small and contain only minimal metadata (such as block number and replica ID), a hash value, and a signature, which likely keeps them below the maximum transmission unit limit.
Fig. 15.
Fig. 15. Packet loss.
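A small worked estimate backs up the claim that votes stay below the maximum transmission unit. The field sizes below (block number, replica ID, SHA-256 hash, secp256k1 signature) are assumptions rather than values taken from a specific implementation.

# Illustrative vote-message size estimate; field sizes are assumptions.
BLOCK_NUMBER = 8   # bytes
REPLICA_ID = 2
BLOCK_HASH = 32    # e.g., SHA-256
SIGNATURE = 64     # e.g., secp256k1

vote_bytes = BLOCK_NUMBER + REPLICA_ID + BLOCK_HASH + SIGNATURE
MTU = 1500         # typical Ethernet maximum transmission unit

# 106 bytes: a vote fits into a single packet, so 2% packet loss only causes
# occasional TCP retransmissions of individual (tiny) votes, and 2f+1 out of
# n votes already suffice to form a quorum certificate.
print(vote_bytes, vote_bytes <= MTU)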

6.4 Denial-of-service Attack

In this experiment, we examine how well BFT protocol implementations can tolerate specific clients trying to overload the system. We use the usual \(n=128\) replicas setup. To overload the system, we let a specific “malicious” client submit a larger number (i.e., \(10\times\) more than in Section 6.2) of outstanding operations to the system during a short time interval of 120 s and investigate the impact on request latency that normal clients observe.

Observations.

We show our simulation results in Figure 16. More outstanding client requests lead to higher observed latency in PBFT (4.9 s to 15.2 s) and HotStuff-secp256k1 (4.1 s to 8.2 s). This is because submitted requests are queued and need to wait for an increasing amount of time to be processed. Interestingly, we found almost identical latency results for Kauri and HotStuff-bls. After looking into the specifics of their implemented benchmark application, it became clear to us that these implementations only report consensus latency of replicas and not the end-to-end request latency observed by clients. In this light, the latency results obtained are—at least for this experiment—not helpful for a direct comparison.18
Fig. 16.
Fig. 16. DOS attack.
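For clarity, the following sketch contrasts the two latency metrics; the function and attribute names are hypothetical and not taken from any of the tested benchmark suites.

import time

# Minimal sketch contrasting the two metrics; names are hypothetical.

def end_to_end_latency(client, op):
    """Measured at the client: queuing, transmission, consensus, and reply."""
    start = time.monotonic()
    client.submit(op)
    client.await_response(op)
    return time.monotonic() - start

def consensus_latency(block):
    """Measured at a replica: only proposal time to commit time of a block."""
    return block.commit_time - block.propose_time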
None of the tested implementations of partially synchronous BFT protocols had mechanisms in place that would prevent overloading the system; for instance, limiting the number of requests that are accepted from a single (potentially malicious) client.
Tusk and Bullshark also show no significant increase in latency. This is mainly because, in these two protocols, each client submits operations only to its closest replica and every replica proposes blocks simultaneously, whereas in the partially synchronous protocols all operations are funneled through a single leader. This leader becomes an easy target, because its available bandwidth can quickly be exhausted. The more decentralized nature of Tusk and Bullshark constitutes an inherent defense against targeted DOS attacks. Apart from this, Narwhal does not provide automated load balancing: If clients are not evenly distributed among replica regions (more specifically, the replicas they contact), then a client that contacts the same replica as many other clients may experience higher latency than a client residing in a replica region with few other clients, whose operations can be serviced instantly.
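To illustrate the missing load-balancing feature, the following hypothetical policy (our own illustration, not part of Narwhal) assigns a new client to the least-loaded replica within a latency budget instead of always to the closest one.

# Hypothetical client placement policy (not part of Narwhal): instead of
# always picking the closest replica, pick the least-loaded replica among
# those within a latency budget.
def assign_replica(client_latencies: dict[str, float],
                   replica_load: dict[str, int],
                   budget_ms: float = 50.0) -> str:
    """client_latencies: replica id -> RTT in ms; replica_load: open clients."""
    nearby = [r for r, rtt in client_latencies.items() if rtt <= budget_ms]
    candidates = nearby or list(client_latencies)   # fall back to all replicas
    return min(candidates, key=lambda r: (replica_load[r], client_latencies[r]))

# Example: two nearby replicas, the closest one is already crowded.
latencies = {"eu-central-1": 12.0, "eu-west-1": 28.0, "us-east-1": 95.0}
load = {"eu-central-1": 40, "eu-west-1": 3, "us-east-1": 5}
print(assign_replica(latencies, load))  # -> "eu-west-1"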

6.5 Crashing Replicas

Finally, we investigate the crash fault resilience of the protocol implementations we tested, using our usual \(n=128\) replicas setup. We induce a crash fault at the leader replica at time \(\tau =60\) s. During our simulations, we noticed that the PBFT implementation’s19 view change did not work properly. After contacting the developers, we received a patch (a recent commit was missing in the public GitHub repository) that resolved the issue, at least for smaller systems. This illustrates how our methodology can help detect protocol implementation bugs.

Observations.

We show our results in Figure 17. The failover time of a protocol generally depends on its timeout parameterization, and we cannot exclude that tighter timeouts would have been possible (although, notably, Reference [40] states that Kauri can use more aggressive timeout values than HotStuff). We observe a few interesting behaviors: In HotStuff, after the failover, the new leader pushes a larger block, leading to a throughput spike: It tries to commit the large block quickly by building a three-chain20 in which the large block is followed by empty blocks (this leads to a short, second throughput drop to 0). After that, the throughput stabilizes. In our observation, the new Kauri leader recovers protocol performance more quickly than in HotStuff. The throughput level is also slightly higher, presumably because the new leader is located in a more favorable region of the world (leader location impacts BFT protocol performance).
Fig. 17.
Fig. 17. Inducing a single crash fault in the leader that causes a view change in all leader-based protocols, HotStuff, Kauri and PBFT. Tusk and BullShark do not require a view change because all necessary information to proceed is already contained in the DAG.
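For reference, the sketch below captures a simplified version of the three-chain commit rule mentioned above (see footnote 20); it omits HotStuff's view and locking logic and only checks that three consecutive certified blocks extend each other.

# Simplified sketch of HotStuff's three-chain commit rule (see footnote 20):
# a block is committed once it is the first of three consecutive blocks that
# each carry a quorum certificate (QC) and directly extend their parent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    height: int
    parent: Optional["Block"]
    has_qc: bool   # a quorum certificate for this block has been observed

def commits_three_chain(b3: Block) -> Optional[Block]:
    """If certified blocks b1 <- b2 <- b3 have consecutive heights, b1 commits."""
    b2 = b3.parent
    b1 = b2.parent if b2 else None
    if b1 is None:
        return None
    if all(b.has_qc for b in (b1, b2, b3)) and \
            b2.height == b1.height + 1 and b3.height == b2.height + 1:
        return b1   # b1 and all of its ancestors become committed
    return None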
The PBFT implementation completed the failover for small system sizes only, and with a longer failover time than HotStuff or Kauri. We inspected the concrete behavior of the implementation and noticed that the view change was implemented in a way that is inefficient with respect to utilized bandwidth: For every request that timed out, the new leader would assign a sequence number and instantly propose it in a new block containing only that single request. This way, the cost of running the agreement protocol in our simulated wide-area network does not amortize over multiple (hundreds of) requests, because every timed-out request required collecting its own quorum.
Tusk and BullShark do not suffer from a replica (leader) crash as the previously tested protocols do, because they do not require a leader-change mechanism: all information necessary to achieve consensus is provided by the DAG structure. In Tusk, the leader is selected randomly and in retrospect, and thus cannot be easily targeted by an attacker (the same holds for BullShark and its fallback leader). Figure 17 shows how blocks in Tusk and BullShark are committed over time: If a round completes with a leader having sufficient support, then its block and all referenced blocks can be committed (in the DAG, this corresponds to the green blocks in Figure 3(b)), so many blocks are committed at the same time, while the duration of a single round is relatively long. Overall, this means that the throughput over time is quite unsteady and displays a series of spikes, which mark the times at which blocks can be ordered in the DAG.
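The bursty commit pattern follows directly from how a DAG leader block is ordered together with its causal history. The following simplified sketch (which does not mirror the exact Tusk/BullShark implementations) commits a leader block and all of its not-yet-committed ancestors at once.

# Simplified sketch of ordering a committed DAG leader block together with its
# causal history (all transitively referenced, not yet committed blocks).
from dataclasses import dataclass

@dataclass(frozen=True)
class DagBlock:
    author: str
    round: int
    parents: tuple["DagBlock", ...] = ()

def commit_with_history(leader: DagBlock, committed: set) -> list:
    """Return the leader plus all uncommitted ancestors in a deterministic order."""
    ordered = []

    def visit(block: DagBlock) -> None:
        if block in committed:
            return
        committed.add(block)
        for parent in sorted(block.parents, key=lambda b: (b.round, b.author)):
            visit(parent)
        ordered.append(block)           # parents first, then the block itself

    visit(leader)
    return ordered                      # many blocks become committed at once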

6.6 Discussion and Lessons Learned

After comparing multiple protocols, we conclude that Kauri currently promises the biggest performance advantage, achieving a throughput of over \(10\times\) that of HotStuff. Despite this, the novel asynchronous, DAG-based BFT protocols Tusk and BullShark emerge as serious competitors, since they can make progress even when the network temporarily does not behave synchronously. Their decentralized nature plays a crucial role, as it naturally distributes the operation load more evenly and prevents performance-degradation attacks. The Narwhal mempool protocol on which Tusk and BullShark are based can potentially be improved further: An interesting feature would be automatic load balancing, i.e., coordinating the client workload as evenly as possible across all replicas' available bandwidth. To make Narwhal even more efficient, it could be refined to use the BLS signature scheme instead of ECDSA, so the certificates of availability that Narwhal disseminates are compressed and require less bandwidth when there are many replicas (i.e., \(n\ge 256\)). We found that the choice of cryptographic primitives does make a difference; in particular, the BLS variant of HotStuff surpasses HotStuff (ECDSA) at larger scale. For Kauri, we think future work should address automating the configuration of protocol parameters to optimize performance in a WAN environment; in particular, we experienced that configuring the pipeline parameters by hand can be quite bothersome and results in many dry runs to find a reasonably good configuration. The PBFT implementation that we employed for our tests came with a buggy view-change mechanism, a problem that we were able to resolve quickly thanks to the judicious use of our simulation methodology to run integration tests at large scale, thus spotting implementation-level bugs that cannot easily be found in the commonly tested \(n=4\) test scenario.

7 Conclusions

In this article, we introduced a methodology for evaluating the performance of BFT protocols through network simulations. A major advantage of our methodology compared to related approaches (i.e., References [48, 56]) is that it is the first to seamlessly integrate existing protocol implementations without the need for error-prone and time-consuming re-implementations in modeling languages. Compared to these two related simulation approaches for BFT protocols, our method offers further advantages: it scales to large system sizes (which the approach of Reference [48] does not support) and it can also evaluate system throughput (which is not possible with the tool of Reference [56]).
Furthermore, we found that our method is valuable for studying the performance of protocols as system scale increases and in realistic environments. We conducted a validity analysis to evaluate the accuracy of simulation results for several BFT protocols. Our validity analysis revealed that at larger system scales, system throughput can be predicted quite well, while latency results exhibit deviations to varying degrees. For instance, at \(n=128\) replicas, the difference in observed throughput was 10.3% for HotStuff, 8.4% for BFT-SMaRt, and for Kauri and HotStuff-bls, we observed a 4.5% and 6.1% difference with \(n=100\) replicas, respectively. Additionally, the respective latency deviations we observed in these scenarios ranged between 3.9% for Kauri and 53.4% for BFT-SMaRt, for which we were unable to reproduce the exact operation sending rate that was originally utilized in the real deployment setup.
Moreover, we compared several BFT protocol implementations, including HotStuff, Kauri, Narwhal & Tusk, and BullShark, in a blockchain use case with up to 256 replicas distributed across 21 different regions and 25 Mbit/s network link bandwidth. Our study revealed that Kauri delivers the best performance overall, displaying a throughput of 6,742 op/s at \(n=128\) replicas. The closest competitors are BullShark (2,318 op/s) and Tusk (1,952 op/s), which demonstrates that asynchronous BFT protocols can achieve competitive throughput levels (but at noticeably higher latency). In the same scenario, HotStuff only achieves 494 op/s when implemented with secp256k1 signatures and 707 op/s when using BLS signatures. PBFT resides at the bottom of the pile, managing only 161 op/s.
A further envisioned use case of our method is to spot implementation bugs, as simulations can be used to perform integration tests of distributed systems at a large scale in an inexpensive way. Overall, the proposed simulation-based evaluation method offers a valuable tool for researchers and practitioners working with BFT protocols in the context of DLT applications. It not only streamlines the scalability analysis process but also provides cost-effectiveness and accuracy, enabling the design and deployment of more efficient and resilient BFT-based distributed systems.

8 Future Work

For future work, we plan to expand the fault induction abilities of our method to allow for more complex behavior, e.g., malicious attacks carried out by a fraction of Byzantine replicas such as equivocations. For instance, we aim to explore more sophisticated Byzantine fault behavior by drawing inspiration from the Twins methodology [5]. Twins is a unit test case generator for Byzantine behavior that duplicates cryptographic IDs of replicas, which can lead to equivocations or forgotten protocol states. A further interesting aspect might be the impact of longer communication link breaks (such that TCP connections fail) to see how well such situations are handled in BFT protocol implementations.
We think the accuracy of simulation results, in particular for smaller system sizes, can be further improved by introducing a simple CPU model. This would lead to a better approximation in scenarios where the underlying network is not the dominating factor for performance. For instance, the model could let Phantom record execution times of intercepted system calls and adjust the clock of each simulation process accordingly. A drawback of this approach is that the determinism of simulation runs can no longer be guaranteed, but it could still give us deeper insights into the overhead of CPU-heavy operations of BFT protocols and lead to more accurate benchmarking.
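As a heavily simplified illustration of this idea (and explicitly not Phantom's actual API), a wrapper could measure the host CPU time spent in an intercepted operation and advance the process's simulated clock by that amount.

import time

# Hypothetical illustration of the proposed CPU model (NOT Phantom's API):
# wrap an intercepted operation, measure host CPU time, and advance the
# process's simulated clock by the measured amount.
class SimulatedClock:
    def __init__(self) -> None:
        self.now_ns = 0

    def advance(self, delta_ns: int) -> None:
        self.now_ns += delta_ns

def timed_call(clock: SimulatedClock, fn, *args, **kwargs):
    start = time.process_time_ns()      # CPU time of the calling process
    result = fn(*args, **kwargs)
    clock.advance(time.process_time_ns() - start)
    return result

# Example: a CPU-heavy operation (e.g., signature verification) now also
# consumes simulated time, at the cost of deterministic simulation runs.
clock = SimulatedClock()
timed_call(clock, sum, range(1_000_000))
print(clock.now_ns)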
Last but not least, we plan to extend our evaluations from implementations of permissioned BFT protocols to permissionless/Proof-of-Stake BFT systems, such as Avalanche [44], as deployed in real-world scenarios. We believe that running such systems should be feasible with only minor modifications to our methodology (for instance, a tool that creates the initial distribution of stake/voting power may need to be integrated). This will allow us to mimic actual system deployments and gain more practical insights.

Footnotes

6
PBFT was originally proposed with many optimizations to maintain high performance for its envisioned use in systems with \(n=4\) or \(n=7\) replicas.
7
Authenticators can be either signatures or message authentication codes (MACs), depending on the concrete implementation. The original implementation of PBFT employs MACs for improved performance.
9
See https://wonderproxy.com/blog/a-day-in-the-life-of-the-internet/ to obtain information on these statistics.
10
A network topology defines a fixed latency for each link between two hosts and up and down bandwidth capacity per host.
12
We borrow the definition of uncivil from Reference [15] and refer to BFT protocol runs in which some clients or a bounded number of replicas are faulty.
13
Installing a secure computing (i.e., seccomp) filter in a process allows interposition on system calls that are not preloadable; see Reference [32] for more details.
14
The reason for this is that the “10ms” experiment uses requests with 0-byte payload, thus the CPU processing delays (i.e., for verifying an increasing number of signatures—these delays are not captured in the simulation) seem to become a more relevant factor than delays incurred by the limited network bandwidth of 10 Gbit/s.
15
We use a Rust-based implementation of PBFT (github.com/ibr-ds/themis), since the original by Castro et al. [14] does not compile on modern computers.
16
The number of concurrently submitted requests is sufficient to (1) fill the block size of each block that a BFT protocol processes in parallel and (2) keep a block's worth of pending requests waiting at the leader, so the next block can be disseminated immediately as soon as an ongoing consensus finalizes.
17
The big-request optimization in PBFT substitutes larger requests with a hash value and only inlines small operations into a block (a batch, in PBFT parlance).
18
Note that we would have to modify the benchmarking application of Kauri to obtain request latency results that are comparable with the other protocols. A small takeaway message is that different BFT protocol implementations might use slightly different metrics in benchmark suites. It requires caution when comparing results, but it is not a hindrance to our general approach.
19
This statement only applies to the Rust-based implementation of PBFT (github.com/ibr-ds/themis) we tested, not the original by Castro et al. [14].
20
The three-chain refers to the commit rule in HotStuff that states that a block is only committed if there is a valid three-chain (a chain of three succeeding blocks, each references its predecessor and collected a valid quorum certificate). We refer the interested reader to Reference [60].

References

[1]
Mohammad Javad Amiri, Chenyuan Wu, Divyakant Agrawal, Amr El Abbadi, Boon Thau Loo, and Mohammad Sadoghi. 2024. The bedrock of Byzantine fault tolerance: A unified platform for BFT protocols analysis, implementation, and experimentation. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI’24). 371–400.
[2]
Yusuke Aoki, Kai Otsuki, Takeshi Kaneko, Ryohei Banno, and Kazuyuki Shudo. 2019. SimBlock: A blockchain network simulator. In IEEE Conference on Computer Communications Workshops. IEEE Computer Society, Washington, DC, USA, 325–329.
[3]
Balaji Arun and Binoy Ravindran. 2020. DuoBFT: Resilience vs. Efficiency trade-off in Byzantine fault tolerance. preprint arXiv:2010.01387 (2020).
[4]
Hagit Attiya, Constantin Enea, and Shafik Nassar. 2023. Faithful Simulation of Randomized BFT Protocols on Block DAGs. Cryptology ePrint Archive, Paper 2023/192. Retrieved from https://eprint.iacr.org/2023/192
[5]
Shehar Bano, Alberto Sonnino, Andrey Chursin, Dmitri Perelman, Zekun Li, Avery Ching, and Dahlia Malkhi. 2022. Twins: BFT systems made robust. In 25th International Conference on Principles of Distributed Systems, Vol. 217. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 7:1–7:29.
[6]
Christian Berger, Hans P. Reiser, and Alysson Bessani. 2021. Making reads in BFT state machine replication fast, linearizable, and live. In 40th International Symposium on Reliable Distributed Systems. IEEE Computer Society, Washington, DC, USA, 1–12.
[7]
Christian Berger, Hans P. Reiser, João Sousa, and Alysson Neves Bessani. 2022. AWARE: Adaptive wide-area replication for fast and resilient Byzantine consensus. IEEE Trans. Depend. Secure Comput. 19, 3 (2022), 1605–1620.
[8]
Christian Berger, Signe Schwarz-Rüsch, Arne Vogel, Kai Bleeke, Leander Jehl, Hans P. Reiser, and Rüdiger Kapitza. 2023. SoK: Scalability techniques for BFT consensus. In IEEE International Conference on Blockchain and Cryptocurrency (ICBC’23). IEEE.
[9]
Christian Berger, Sadok Ben Toumia, and Hans P. Reiser. 2022. Does my BFT protocol implementation scale? In 3rd International Workshop on Distributed Infrastructure for the Common Good. 19–24.
[10]
Christian Berger, Sadok Ben Toumia, and Hans P. Reiser. 2023. Scalable performance evaluation of Byzantine fault-tolerant systems using network simulation. In IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC’23). IEEE, 180–190.
[11]
Alysson Bessani, João Sousa, and Eduardo E. P. Alchieri. 2014. State machine replication for the masses with BFT-SMaRt. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’14). IEEE Computer Society, Washington, DC, USA, 355–362.
[12]
Dan Boneh, Ben Lynn, and Hovav Shacham. 2004. Short signatures from the Weil pairing. J. Cryptol. 17 (2004), 297–319.
[13]
Daniel Cason, Enrique Fynn, Nenad Milosevic, Zarko Milosevic, Ethan Buchman, and Fernando Pedone. 2021. The design, architecture and performance of the Tendermint Blockchain Network. In 40th International Symposium on Reliable Distributed Systems. IEEE Computer Society, Washington, DC, USA, 23–33.
[14]
Miguel Castro and Barbara Liskov. 1999. Practical Byzantine fault tolerance. In USENIX Conference on Operating Systems Design and Implementation (OSDI’99). USENIX Association, Berkeley, CA, USA, 173–186.
[15]
Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. 2009. Making byzantine fault tolerant systems tolerate byzantine faults. In 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI’09). USENIX Association, Boston, MA. Retrieved from https://www.usenix.org/conference/nsdi-09/making-byzantine-fault-tolerant-systems-tolerate-byzantine-faults
[16]
Tyler Crain, Christopher Natoli, and Vincent Gramoli. 2021. Red Belly: A secure, fair and scalable open blockchain. In IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, USA, 466–483.
[17]
George Danezis, Lefteris Kokoris-Kogias, Alberto Sonnino, and Alexander Spiegelman. 2022. Narwhal and Tusk: A DAG-based mempool and efficient BFT consensus. In 17th European Conference on Computer Systems. 34–50.
[18]
Jérémie Decouchant, David Kozhaya, Vincent Rahli, and Jiangshan Yu. 2022. DAMYSUS: Streamlined BFT consensus leveraging trusted components. In European Conference on Computer Systems (EuroSys’22). Association for Computing Machinery, New York, NY, USA, 1–16.
[19]
Jérémie Decouchant, Burcu Kulahcioglu Ozkan, and Yanzhuo Zhou. 2023. Liveness checking of the HotStuff protocol family. In 28th Pacific Rim International Symposium on Dependable Computing (PRDC’23). IEEE, 168–179.
[20]
Tobias Distler. 2021. Byzantine fault-tolerant state-machine replication from a systems perspective. ACM Comput. Surv. 54, 1 (2021), 1–38.
[21]
Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. 1988. Consensus in the presence of partial synchrony. J. ACM 35, 2 (1988), 288–323.
[22]
Carlos Faria and Miguel Correia. 2019. BlockSim: Blockchain simulator. In IEEE International Conference on Blockchain. IEEE Computer Society, Washington, DC, USA, 439–446.
[23]
Rati Gelashvili, Lefteris Kokoris-Kogias, Alberto Sonnino, Alexander Spiegelman, and Zhuolun Xiang. 2022. Jolteon and Ditto: Network-adaptive efficient consensus with asynchronous fallback. In International Conference on Financial Cryptography and Data Security. Springer, 296–315.
[24]
Arthur Gervais, Ghassan O. Karame, Karl Wüst, Vasileios Glykantzis, Hubert Ritzdorf, and Srdjan Capkun. 2016. On the security and performance of proof of work blockchains. In ACM Conference on Computer and Communications Security (CCS’16). ACM, New York, NY, 3–16.
[25]
Paulo Gouveia, João Neves, Carlos Segarra, Luca Liechti, Shady Issa, Valerio Schiavoni, and Miguel Matos. 2020. Kollaps: Decentralized and dynamic topology emulation. In 15th European Conference on Computer Systems (EuroSys’20). ACM, New York, NY, 1–16.
[26]
Vincent Gramoli, Rachid Guerraoui, Andrei Lebedev, Chris Natoli, and Gauthier Voron. 2023. Diablo: A benchmark suite for blockchains. In Conference on European Computer Systems (EuroSys’23). 540–556.
[27]
Divya Gupta, Lucas Perronne, and Sara Bouchenak. 2016. BFT-Bench: Towards a practical evaluation of robustness and effectiveness of BFT protocols. In 16th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems (DAIS’16), Held as Part of the 11th International Federated Conference on Distributed Computing Techniques (DisCoTec’16). Springer, 115–128.
[28]
Suyash Gupta, Sajjad Rahnama, Jelle Hellings, and Mohammad Sadoghi. 2020. ResilientDB: Global scale resilient blockchain fabric. arXiv preprint arXiv:2002.00160 (2020).
[29]
Raluca Halalai, Thomas A. Henzinger, and Vasu Singh. 2011. Quantitative evaluation of BFT protocols. In 8th International Conference on Quantitative Evaluation of Systems. IEEE Computer Society, Washington, DC, USA, 255–264.
[30]
Nikhil Handigol, Brandon Heller, Vimalkumar Jeyakumar, Bob Lantz, and Nick McKeown. 2012. Reproducible network experiments using container-based emulation. In 8th International Conference on Emerging Networking Experiments and Technologies. ACM, New York, NY, 253–264.
[31]
Yahya Hassanzadeh-Nazarabadi, Misha Rybalov, and Khalil Claybon. 2023. BFT testing framework for flow blockchain. In International Congress on Blockchain and Applications. Springer, 338–347.
[32]
Rob Jansen, Jim Newsome, and Ryan Wails. 2022. Co-opting Linux processes for high-performance network simulation. In USENIX Annual Technical Conference (USENIX ATC’22). USENIX Association, Berkeley, CA, USA, 327–350.
[33]
Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. 2007. Zyzzyva: Speculative Byzantine fault tolerance. In 21st ACM Symposium on Operating Systems Principles (SOSP’07). ACM, New York, NY, 45–58.
[34]
Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (1998), 133–169.
[35]
Bob Lantz, Brandon Heller, and Nick McKeown. 2010. A network in a laptop: Rapid prototyping for software-defined networks. In 9th ACM SIGCOMM Workshop on Hot Topics in Networks. ACM, New York, NY, 1–6.
[36]
Peilun Li, Guosai Wang, Xiaoqi Chen, Fan Long, and Wei Xu. 2020. Gosig: A scalable and high-performance Byzantine consensus for consortium blockchains. In 11th ACM Symposium on Cloud Computing. ACM, New York, NY, 223–237.
[37]
Boon Thau Loo, Tyson Condie, Joseph M. Hellerstein, Petros Maniatis, Timothy Roscoe, and Ion Stoica. 2005. Implementing declarative overlays. In 12th ACM SIGOPS Symposium on Operating Systems Principles (SOSP’05). ACM, New York, NY, 75–90.
[38]
Andrew Miller and Rob Jansen. 2015. Shadow-Bitcoin: Scalable simulation via direct execution of multi-threaded applications. In 8th Workshop on Cyber Security Experimentation and Test. USENIX Association, Berkeley, CA, USA.
[39]
Satoshi Nakamoto. 2008. Bitcoin: A Peer-to-peer Electronic Cash System. Retrieved from https://bitcoin.org/bitcoin.pdf
[40]
Ray Neiheiser, Miguel Matos, and Luís Rodrigues. 2021. Kauri: Scalable BFT consensus with pipelined tree-based dissemination and aggregation. In 28th ACM SIGOPS Symposium on Operating Systems Principles (SOSP’21). ACM, New York, NY, 35–48.
[41]
Martin Nischwitz, Marko Esche, and Florian Tschorsch. 2021. Bernoulli meets PBFT: Modeling BFT protocols in the presence of dynamic failures. In 16th Conference on Computer Science and Intelligence Systems. IEEE Computer Society, Washington, DC, USA, 291–300.
[42]
Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In USENIX Annual Technical Conference (USENIX ATC’14). 305–319.
[43]
George F. Riley and Thomas R. Henderson. 2010. The ns-3 network simulator. In Modeling and Tools for Network Simulation. Springer, Cham, 15–34.
[44]
Team Rocket, Maofan Yin, Kevin Sekniqi, Robbert van Renesse, and Emin Gün Sirer. 2019. Scalable and probabilistic leaderless BFT consensus through metastability. arXiv preprint arXiv:1906.08936 (2019).
[45]
Signe Rüsch, Kai Bleeke, and Rüdiger Kapitza. 2019. Themis: An efficient and memory-safe BFT framework in Rust: Research statement. In 3rd Workshop on Scalable and Resilient Infrastructures for Distributed Ledgers. ACM, New York, NY, 9–10.
[46]
Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (1990), 299–319.
[47]
Fábio Silva, Ana Alonso, José Pereira, and Rui Oliveira. 2020. A comparison of message exchange patterns in BFT protocols. In IFIP International Conference on Distributed Applications and Interoperable Systems. Springer, 104–120.
[48]
Atul Singh, Tathagata Das, Petros Maniatis, Peter Druschel, and Timothy Roscoe. 2008. BFT protocols under fire. In Conference on Networked Systems Design & Implementation (NSDI’08). USENIX Association, Berkeley, CA, USA, 189–204.
[49]
João Sousa and Alysson Bessani. 2015. Separating the WHEAT from the chaff: An empirical design for geo-replicated state machines. In 34th IEEE Symposium on Reliable Distributed Systems. IEEE Computer Society, Washington, DC, USA, 146–155.
[50]
Alexander Spiegelman, Neil Giridharan, Alberto Sonnino, and Lefteris Kokoris-Kogias. 2022. BullShark: DAG BFT protocols made practical. In ACM SIGSAC Conference on Computer and Communications Security. 2705–2718.
[51]
Chrysoula Stathakopoulou, Tudor David, and Marko Vukolić. 2019. Mir-BFT: High-throughput BFT for blockchains. arXiv preprint arXiv:1906.05552 (2019).
[52]
Xiao Sui, Sisi Duan, and Haibin Zhang. 2022. Marlin: Two-phase BFT with linearity. In 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’22). IEEE, 54–66.
[53]
Marko Vukolić. 2015. The quest for scalable blockchain fabric: Proof-of-work vs. BFT replication. In International Workshop on Open Problems in Network Security. Springer, Cham, 112–125.
[54]
Bozhi Wang, Shiping Chen, Lina Yao, and Qin Wang. 2020. ChainSim: A P2P blockchain simulation framework. In CCF China Blockchain Conference. Springer, 1–16.
[55]
Jitao Wang, Bo Zhang, Kai Wang, Yuzhou Wang, and Weili Han. 2024. BFTDiagnosis: An automated security testing framework with malicious behavior injection for BFT protocols. Computer Networks 249 (2024), 110404.
[56]
Ping-Lun Wang, Tzu-Wei Chao, Chia-Chien Wu, and Hsu-Chun Hsiao. 2022. Tool: An efficient and flexible simulator for Byzantine fault-tolerant protocols. In 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE Computer Society, Washington, DC, USA, 287–294.
[57]
Philip Wette, Martin Dräxler, Arne Schwabe, Felix Wallaschek, Mohammad Hassan Zahraee, and Holger Karl. 2014. MaxiNet: Distributed emulation of software-defined networks. In IFIP Networking Conference. IEEE Computer Society, Washington, DC, USA, 1–9.
[58]
Levin N. Winter, Florena Buse, Daan De Graaf, Klaus Von Gleissenthall, and Burcu Kulahcioglu Ozkan. 2023. Randomized testing of Byzantine fault tolerant algorithms. Proc. ACM Program. Lang. 7, OOPSLA1 (2023), 757–788.
[59]
Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham. 2018. HotStuff: BFT consensus in the lens of blockchain. arXiv preprint arXiv:1803.05069 (2018).
[60]
Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham. 2019. HotStuff: BFT consensus with linearity and responsiveness. In ACM Symposium on Principles of Distributed Computing (PODC’19). 347–356.
