


RITAS: Services for Randomized Intrusion Tolerance

Henrique Moniz, Student Member, IEEE, Nuno Ferreira Neves, Member, IEEE, Miguel Correia, Member, IEEE, and Paulo Verissimo, Fellow, IEEE

Abstract— Randomized agreement protocols have been around for more than two decades. Often assumed to be inefficient due to their high expected communication and computation complexities, they have remained overlooked by the community at large as a valid solution for the deployment of fault-tolerant distributed systems. This paper aims to demonstrate that randomization can be a very competitive approach even in hostile environments where arbitrary faults can occur. A stack of randomized intrusion-tolerant protocols is described and its performance evaluated under several settings in both LAN and WAN environments. The stack provides a set of relevant services ranging from basic communication primitives up through atomic broadcast. The experimental evaluation shows that the protocols are efficient, especially in LAN environments, where no performance reduction is observed under certain Byzantine faults.

Index Terms— Intrusion tolerance, Byzantine agreement, randomized protocols, performance evaluation.

I. INTRODUCTION

Modern society has been growing increasingly dependent on networked computer systems. The availability, confidentiality, and integrity of data and services are crucial attributes that must be enforced by real-world distributed systems. The typical approach to securing these systems has been one of almost complete prevention, i.e., to avoid successful attacks, or intrusions, at all cost. Once a breach occurs, manual intervention is necessary to restore system correctness. A different approach to dealing with attacks has been gaining momentum within the scientific community: intrusion tolerance. Arising from the intersection of two classical areas of computer science, fault tolerance and security, its objective is to guarantee the correct behavior of a system even if some of its components are compromised and controlled by an intelligent adversary [1], [2], [3].

Within this domain of fault- and intrusion-tolerant distributed systems, there is an essential problem: consensus. This problem has been specified in different ways, but basically it aims to ensure that n processes are able to propose some values and then all agree on one of these values. Consensus has been shown to be equivalent to fundamental problems, such as state machine replication [4], group membership [5], and atomic broadcast [6], [7]. Hence, the relevance of consensus is noteworthy because it is a building block of several important distributed systems services. For example, to maintain data consistency in a replicated database, some form of consensus between the sites is needed. Synchronization of clocks, leader election, or practically any kind of coordinated activity between the various nodes of a distributed system can be built using consensus. Unsurprisingly, the consensus problem has received a lot of attention from the research community.

Consensus, however, is impossible to solve deterministically in asynchronous systems (i.e., systems where there are no bounds on communication delays and computation times) if a single process can crash (also known as the FLP result [8]). This is a significant result, in particular for intrusion-tolerant systems, because they usually assume an asynchronous model in order to avoid time dependencies. Time assumptions can often be broken, for example, with denial-of-service attacks.

Throughout the years, several researchers have investigated techniques to circumvent the FLP result. Most of these solutions, however, required changes to the basic system model, either with the explicit inclusion of stronger time assumptions (e.g., partial synchrony models [9], [10]) or by augmenting the system with devices that hide these assumptions in their implementation (e.g., failure detectors [11], [12], [13] or wormholes [14]). Randomization is another technique that has been around for more than two decades [15], [16]. One important advantage of this technique is that no additional timing assumptions are needed. To circumvent the FLP result, randomization uses a probabilistic approach where the termination of consensus is ensured with probability 1.
Although this line of research has produced a number of important theoretical results, including several algorithms, randomization has been historically overlooked as far as the implementation of practical applications is concerned, because it has usually been considered too inefficient. The reasons for the assertion that "randomization is inefficient in practice" are simple to summarize. Randomized consensus algorithms, which are the most common form of these algorithms, usually have a large expected number of communication steps, i.e., a large time complexity. Even when this complexity is constant, the expected number of communication steps is traditionally significant even for small numbers of processes, when compared, for instance, with solutions based on failure detectors (see footnote 1). Many of those algorithms also rely on public-key cryptography, which increases the performance costs, especially for LANs or MANs in which the time to compute a digital signature is usually much higher than the network delay.

Nevertheless, two important points have been chronically ignored. First, consensus algorithms are not usually executed in oblivion; they are run in the context of a higher-level problem (e.g., atomic broadcast) that can provide a friendly environment for the "lucky" event needed for faster termination (e.g., many processes proposing the same value can lead to a quicker conclusion [20]).

[Footnote: This work was partially supported by the EU through NoE IST-4-026764-NOE (RESIST) and project IST-4-027513-STP (CRUTIAL), and by the FCT through project POSI/EIA/60334/2004 (RITAS) and the Large-Scale Informatic Systems Laboratory (LASIGE).]

[Footnote 1: An exception is the stack of randomized protocols proposed by Cachin et al. [17], [18], which terminate in a low expected number of communication steps. They, however, depend heavily on public-key cryptography, which may seriously affect their performance [19].]

Fig. 1. The RITAS protocol stack. (From top to bottom: Application; the RITAS protocol suite, comprising Atomic Broadcast and Vector Consensus, Multi-valued Consensus, Binary Consensus, and Reliable Broadcast and Echo Broadcast; and Standard Internet Services, i.e., TCP and IPSec AH.)
Second, for the sake of theoretical interest, the proposed adversary models usually assume a strong adversary that completely controls the scheduling of the network and decides which processes receive which messages and in what order. In practice, a real adversary does not possess this ability, but if it does, it will probably perform attacks in a distinct (and much simpler) manner to prevent the conclusion of the algorithm – for example, it can block the communication entirely. Therefore, in practice, the network scheduling can be "nice" and lead to a speedy termination.

This paper describes the implementation of a stack of randomized intrusion-tolerant protocols and evaluates their performance under different fault loads. One of the main purposes is to show that randomization can be efficient and should be regarded as a valid solution for practical intrusion-tolerant distributed systems. This implementation is called RITAS, which stands for Randomized Intrusion-Tolerant Asynchronous Services.

At the lowest level of the stack (see Figure 1) there are two broadcast primitives: reliable broadcast and echo broadcast. On top of these primitives, the most basic form of consensus is available, binary consensus. This protocol lets processes decide on a single bit and is, in fact, the only randomized algorithm of the stack; the rest of the protocols are built on top of it. Building on the binary consensus layer, multi-valued consensus allows agreement on values of arbitrary range. At the highest level there are vector consensus, which lets processes decide on a vector with values proposed by a subset of the processes, and atomic broadcast, which ensures total order. The protocol stack is executed over a reliable channel abstraction provided by standard Internet protocols – TCP ensures reliability, and IPSec guarantees cryptographic message integrity [21]. The protocols in RITAS have been previously described in the literature [22], [23], [7]. The implemented protocols are, in most cases, optimized versions of the original proposals that significantly improve the overall performance.

The protocols of RITAS share a set of important structural properties. (1) They are asynchronous in the sense that no assumptions are made on the processes' relative execution and communication delays, thus preventing attacks against assumptions in the domain of time (a known problem in some protocols presented in the past). (2) They attain optimal resilience, tolerating up to f = ⌊(n−1)/3⌋ malicious processes out of a total of n processes, which is important since the cost of each additional replica has a significant impact in a real-world application. (3) They are signature-free, meaning that no expensive public-key cryptography is used anywhere in the protocol stack, which is relevant in terms of performance since this type of cryptography is several orders of magnitude slower than symmetric cryptography. (4) They take decisions in a distributed way (there is no leader), thus avoiding the costly operation of detecting the failure of a leader, an event that can considerably delay the execution.
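To make the resilience figures concrete, the small C program below derives, for the group sizes used later in the evaluation, the fault bound f = ⌊(n−1)/3⌋ together with the message thresholds the stack's protocols wait for. It is a minimal illustration of our own; the function names are not part of the RITAS API.

#include <stdio.h>

/* Illustrative helpers (names are ours): for a group of n processes, derive
 * the fault bound f = floor((n-1)/3) and the quorum sizes used by the stack. */
static int max_faults(int n)         { return (n - 1) / 3; }       /* f                      */
static int echo_quorum(int n, int f) { return (n + f) / 2 + 1; }   /* ECHO quorum (Bracha)   */
static int ready_quorum(int f)       { return 2 * f + 1; }         /* READY quorum to deliver */

int main(void) {
    for (int n = 4; n <= 10; n += 3) {          /* the group sizes used in Section VI */
        int f = max_faults(n);
        printf("n=%d f=%d n-f=%d echo>=%d ready>=%d\n",
               n, f, n - f, echo_quorum(n, f), ready_quorum(f));
    }
    return 0;                                   /* e.g., n=4: f=1, n-f=3, echo>=3, ready>=3 */
}

For n = 4, 7, and 10 this yields f = 1, 2, and 3, respectively.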
The paper has two main contributions: (1) it presents the design and implementation of a stack of randomized intrusion-tolerant protocols, discussing several optimizations – to the best of our knowledge, the implementation of a stack with the four structural properties above is novel; (2) it provides a detailed evaluation of RITAS in both LAN and WAN settings, showing that it has interesting latency and throughput values.

II. RELATED WORK

Randomized intrusion-tolerant protocols have been around since Ben-Or's and Rabin's seminal consensus protocols [15], [16]. These papers defined the two approaches that the subsequent works followed. Essentially all randomized protocols rely on a coin-tossing scheme that generates random bits. Ben-Or's approach relies on a local coin toss, while in Rabin's approach shares of the coins are distributed by a trusted dealer before the execution of the protocol and, therefore, all processes see the same coins.

Although many randomized asynchronous protocols have been designed throughout the years [15], [16], [22], [24], [25], [26], only recently has an implementation of a stack of randomized multicast and agreement protocols been reported, SINTRA [18]. These protocols are built on top of a binary consensus protocol that follows a Rabin-style approach and, in practice, terminates in one or two communication steps [17]. The protocols, however, depend heavily on public-key cryptography primitives such as digital and threshold signatures. The implementation of the stack is in Java and uses several threads. RITAS uses a different approach, Ben-Or-style, and resorts only to fast cryptographic operations such as hash functions.

Randomization is only one of the techniques that can be used to circumvent the FLP impossibility result. Other techniques include failure detectors [12], [27], [28], [20], partial synchrony [9], and distributed wormholes [29], [14]. Some of these techniques have been employed in the past to build other intrusion-tolerant protocol suites.

The first evaluation of a set of asynchronous Byzantine protocols (reliable and atomic broadcast) was made for the Rampart toolkit [23]. The reliable broadcast is implemented by Reiter's echo broadcast (see Section IV-C) and the order is defined by a leader that also echo-broadcasts the ordering information. Even with such a simple protocol, and using small RSA keys (300 bits), the paper acknowledges that "public-key operations still dominate the latency of reliable multicast, at least for small messages". Moreover, if a process does not echo-broadcast a message to all, or if a malicious leader performs some attack against the ordering of the messages, these events have to be detected and the corrupt process removed from the group. This implies that liveness depends on the cost of this detection [30] and that synchrony assumptions are required about the network delay, allowing attacks in which malicious processes delay others in order to force their removal. For this reason, Rampart relies on a group membership protocol not only to handle voluntary joins and leaves from the group, but also to detect and remove corrupt processes. This is a necessity (that emerges out of the design) as much as it is a feature.
RITAS can indeed be extended with a group membership protocol that handles dynamic groups; however, it does not require one for its protocols to make progress because decisions are made in a decentralized way.

Like Rampart, SecureRing is an intrusion-tolerant group communication system [31]. It relies on a token that rotates among the processes to decide the order of message deliveries. This signed token carries message digests, a solution that allows a lower number of signatures and an improvement in performance when compared to Rampart. In SecureRing, malicious behavior also has to be detected for the protocols to make progress, which means that it suffers from problems similar to Rampart's.

Worm-IT uses the wormhole abstraction to provide a membership service and a view-synchronous atomic multicast primitive [32]. It is designed under a hybrid system model. The system is considered to be asynchronous and subject to Byzantine failures, with the exception of a small subset, the wormhole, that is assumed to be secure (i.e., it can only crash) and synchronous. Critical steps of the protocols that require stronger environmental properties (such as agreement tasks) are executed inside the wormhole.

Byzantine JazzEnsemble is another group communication system that resists Byzantine failures [33]. It relies on fuzzy mute and fuzzy verbose failure detectors to detect mute failures (i.e., a process neglecting to send messages) and verbose failures (i.e., a process sending too many messages), respectively. These kinds of failures can be identified based on locally observed events, which motivates the use of such failure detectors. Moreover, the system provides a vector consensus protocol and a uniform broadcast protocol, as well as modifications at each layer to overcome potential Byzantine attacks.

BFT, while not a protocol stack, is an algorithm that provides a Byzantine-fault-tolerant state machine replication service [34]. In BFT, there are clients and servers. The clients issue requests to the servers, the requests are processed by the servers in total order, and a reply is returned to the clients. Servers are either primary or backup, and there is only one primary at any given moment in the system. Client requests are issued directly to the primary, which in turn multicasts the request to the backups. The replies are transmitted to the clients by all servers. A client waits for f+1 replies with the same result in order to obtain the response. This comprises the normal operation of the algorithm. In case a primary fails, a view change must occur and the servers must agree on a new primary. View changes are triggered by timeouts. After a view change, the service resumes its normal operation.

It is hard to compare BFT and RITAS because they are designed with different assumptions and goals in mind. BFT is centralized and requires synchrony for liveness, while RITAS is decentralized and completely asynchronous. BFT is a system designed to perform a very specific task (i.e., state machine replication), while RITAS is a stack that provides several general broadcast and consensus protocols that can be applied to a multitude of scenarios, including state machine replication. For instance, the Reliable Broadcast protocol in RITAS could be used as a primitive to implement state machine replication (more specifically, the dissemination of requests from the primary to the backups).

III. SYSTEM MODEL

The system is composed of a group of n processes P = {p0, p1, ..., pn−1}.
Group membership is static, i.e., the group is predefined and there can be no joins or leaves during system operation. Processes are fully connected. There are no constraints on the kind of faults that can occur in the system. This class of unconstrained faults is usually called arbitrary or Byzantine. Processes are said to be correct if they do not fail, i.e., if they follow their protocol until termination. Processes that fail are called corrupt. No assumptions are made about the behavior of corrupt processes – they can, for instance, stop executing, omit messages, or send invalid messages, either alone or in collusion with other corrupt processes. It is assumed that at most f = ⌊(n−1)/3⌋ processes can be corrupt.

The system is completely asynchronous. Therefore, there are no assumptions whatsoever about bounds on processing times or communication delays.

Each pair of processes (pi, pj) shares a secret key sij. It is out of the scope of this work to present a solution for distributing these keys, but this may require a trusted dealer or some kind of key distribution protocol based on public-key cryptography. Nevertheless, this is a long-term operation, normally performed before the execution of the protocols, and it does not interfere with their performance. Each process has access to a random bit generator that returns unbiased bits observable only by the process (if the process is correct).

Some protocols use a cryptographic hash function H(m) that maps an input m of arbitrary length into a fixed-length output. We assume that it is impossible (1) to find two values m ≠ m′ such that H(m) = H(m′), and (2) given a certain output, to find an input that produces that output. The output of the function is often called a hash.

IV. PROTOCOL STACK

This section briefly describes the function of each protocol and how it works. Since all protocols have already been described in the literature, no formal specifications are given, and some details are only provided to explain the optimizations. We have developed formal proofs showing that the optimized protocols behave according to their specification, but we could not present them in the paper due to lack of space [35].

A. Reliable Channel

The two layers at the bottom of the stack implement a reliable channel (see Figure 1). This abstraction provides a point-to-point communication channel between a pair of correct processes with two properties: reliability and integrity. Reliability means that messages are eventually received, and integrity says that messages are not modified in the channel. In practical terms, these properties can be enforced using standard Internet protocols: reliability is provided by TCP, and integrity by the IPSec Authentication Header (AH) protocol [21].

B. Reliable Broadcast

The reliable broadcast primitive ensures two properties: (1) all correct processes deliver the same messages; (2) if the sender is correct, then the message is delivered. The implemented protocol was originally proposed by Bracha [22]. The protocol starts with the sender broadcasting a message (INIT, m) to all processes. Upon receiving this message, a process sends an (ECHO, m) message to all processes. It then waits for at least ⌊(n+f)/2⌋ + 1 (ECHO, m) messages or f + 1 (READY, m) messages, and then transmits a (READY, m) message to all processes. Finally, a process waits for 2f + 1 (READY, m) messages to deliver m. Figure 2 illustrates the three communication steps of the protocol.

Fig. 2. Overview of the messages exchanged and best-case number of communication steps in each protocol.
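The per-instance decision logic of this broadcast can be sketched in C as follows. This is a simplified illustration of our own, not the RITAS source: duplicate filtering, message identifiers, and the network layer are assumed to be handled elsewhere and are hidden behind the extern functions.

/* Simplified, single-instance sketch of Bracha's reliable broadcast logic.
 * Duplicate messages from the same sender are assumed to be filtered out
 * by the caller; broadcast() and deliver() are assumed primitives. */
#include <stdbool.h>

enum msg_type { INIT, ECHO, READY };

struct rb_instance {
    int n, f;            /* group size and fault bound f = (n-1)/3          */
    int echo_count;      /* distinct processes from which ECHO was received */
    int ready_count;     /* distinct processes from which READY was received */
    bool ready_sent;
    bool delivered;
};

extern void broadcast(enum msg_type type, const void *m);  /* send to all processes */
extern void deliver(const void *m);                        /* hand m to the upper layer */

void rb_receive(struct rb_instance *rb, enum msg_type type, const void *m)
{
    switch (type) {
    case INIT:                       /* step 1: echo the sender's message */
        broadcast(ECHO, m);
        break;
    case ECHO:  rb->echo_count++;  break;
    case READY: rb->ready_count++; break;
    }

    /* step 2: send READY after floor((n+f)/2)+1 ECHOs or f+1 READYs */
    if (!rb->ready_sent &&
        (rb->echo_count >= (rb->n + rb->f) / 2 + 1 ||
         rb->ready_count >= rb->f + 1)) {
        rb->ready_sent = true;
        broadcast(READY, m);
    }

    /* step 3: deliver after 2f+1 READYs */
    if (!rb->delivered && rb->ready_count >= 2 * rb->f + 1) {
        rb->delivered = true;
        deliver(m);
    }
}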
C. Echo Broadcast

The echo broadcast primitive is a weaker and more efficient version of the reliable broadcast. Its properties are somewhat similar; however, it does not guarantee that all correct processes deliver a broadcast message if the sender is corrupt [24]. In this case, the protocol only ensures that the subset of correct processes that deliver will do so for the same message. The protocol is essentially the described reliable broadcast algorithm with the last communication step omitted. An instance of the protocol is started with the sender broadcasting a message (INITIAL, m) to all processes. When a process receives this message, it broadcasts an (ECHO, m) message to all processes. It then waits for more than (n+f)/2 (ECHO, m) messages to accept and deliver m.

D. Binary Consensus

A binary consensus allows correct processes to agree on a binary value. The implemented protocol is adapted from a randomized algorithm by Bracha [22]. Each process pi proposes a value vi ∈ {0, 1}, and then all correct processes decide on the same value b ∈ {0, 1}. In addition, if all correct processes propose the same value v, then the decision must be v. The protocol has an expected number of communication steps for a decision of 2^(n−f), and uses the underlying reliable broadcast as the basic communication primitive.

The protocol proceeds in 3-step rounds, running as many rounds as necessary for a decision to be reached. In the first step, each process pi (reliably) broadcasts its proposal vi, waits for n−f valid messages (the definition of valid is given in the next paragraph), and changes vi to reflect the majority of the received values. In the second step, pi broadcasts vi, waits for the arrival of n−f valid messages, and if more than half of the received values are equal, vi is set to that value; otherwise vi is set to the undefined value ⊥. Finally, in the third step, pi broadcasts vi, waits for n−f valid messages, and decides if at least 2f+1 messages have the same value v ≠ ⊥. Otherwise, if at least f+1 messages have the same value v ≠ ⊥, then vi is set to v and a new round is initiated. If none of the above conditions applies, then vi is set to a random bit with value 0 or 1, each with probability 1/2, and a new round is initiated.

A message received in the first step of the first round is always considered valid. A message received in any other step k, for k > 1, is valid if its value is congruent with any subset of n−f values accepted at step k−1. For example, suppose that process pi receives n−f messages at step 1, where the majority has value 1. Then, at step 2, it receives a message with value 0 from process pj. Remember that the message a process pj broadcasts at step 2 is the majority value of the messages it received at step 1. That message cannot be considered valid by pi, since value 0 could never be derived by a correct process pj that received the same n−f messages at step 1 as process pi (i.e., value 0 is not congruent). If process pj is correct, then pi will eventually receive the necessary messages for step 1, which will enable it to form a subset of n−f messages that validate the message with value 0. This validation technique has the effect of causing the processes that do not follow the protocol to be ignored.
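The round structure just described can be summarized by the following C sketch (our own simplification, not the RITAS code). The reliable broadcasts and the congruence-based validation rule are hidden behind exchange(), which is assumed to broadcast the current value for a given step and return the n−f valid values gathered for that step.

#include <stdlib.h>

#define UNDEF (-1)                 /* stands in for the undefined value ⊥   */
#define MAX_N 64                   /* illustrative bound on the group size  */

extern int exchange(int step, int v, int vals[], int n, int f);  /* assumed primitive */

static void tally(const int vals[], int cnt, int *zeros, int *ones)
{
    *zeros = *ones = 0;
    for (int i = 0; i < cnt; i++) {
        if (vals[i] == 0) (*zeros)++;
        else if (vals[i] == 1) (*ones)++;
    }
}

/* Runs one round; returns 0 or 1 if a decision was reached, UNDEF otherwise
 * (in which case *v already holds the proposal for the next round). */
int bc_round(int *v, int n, int f)
{
    int vals[MAX_N], cnt, zeros, ones;

    cnt = exchange(1, *v, vals, n, f);            /* step 1: adopt the majority value     */
    tally(vals, cnt, &zeros, &ones);
    *v = (ones > zeros) ? 1 : 0;

    cnt = exchange(2, *v, vals, n, f);            /* step 2: keep a value only if it has  */
    tally(vals, cnt, &zeros, &ones);              /* more than half of the n-f messages   */
    if (ones * 2 > cnt)       *v = 1;
    else if (zeros * 2 > cnt) *v = 0;
    else                      *v = UNDEF;

    cnt = exchange(3, *v, vals, n, f);            /* step 3: try to decide                */
    tally(vals, cnt, &zeros, &ones);
    if (ones >= 2 * f + 1)  return 1;             /* 2f+1 equal values: decide            */
    if (zeros >= 2 * f + 1) return 0;
    if (ones >= f + 1)       *v = 1;              /* f+1 equal values: adopt and retry    */
    else if (zeros >= f + 1) *v = 0;
    else                     *v = rand() & 1;     /* otherwise flip the local coin        */
    return UNDEF;
}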
E. Multi-valued Consensus

A multi-valued consensus allows processes to propose a value v ∈ V of arbitrary length. The decision is either one of the proposed values or a default value ⊥ ∉ V. The implemented protocol is based on the multi-valued consensus proposed by Correia et al. [7]. It uses the services of the underlying reliable broadcast, echo broadcast, and binary consensus layers. The main differences from the original protocol are the use of echo broadcast instead of reliable broadcast at a specific point, and a simplification of the validation of the vectors used to justify the proposed values.

The protocol starts when every process pi announces its proposal value vi by reliably broadcasting an (INIT, vi) message. The processes then wait for the reception of n−f INIT messages and store the received values in a vector Vi. If a process receives at least n−2f messages with the same value v, it echo-broadcasts a (VECT, v, Vi) message containing this value together with the vector Vi that justifies the value. Otherwise, it echo-broadcasts the default value ⊥, which does not require justification.

The next step is to wait for the reception of n−f valid VECT messages. A VECT message received from process pj, containing value vj and vector Vj, is considered valid if one of two conditions holds: (a) vj = ⊥; (b) there are at least n−2f elements Vi[k] ∈ V such that Vi[k] = Vj[k] = vj. If a process does not receive two valid VECT messages with different values, and it received at least n−2f valid VECT messages with the same value, it proposes 1 for an execution of the binary consensus; otherwise it proposes 0. If the binary consensus returns 0, the process decides on the default value ⊥. If the binary consensus returns 1, the process waits until it receives n−2f valid VECT messages (if it has not done so already) with the same value v and decides on that value.
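As an illustration of the simplified vector validation, the predicate below checks conditions (a) and (b) for a received VECT message. It is a sketch under our own naming (NONE stands for the default value ⊥ and for vector entries that were never received); it does not reproduce the actual RITAS data layout.

#include <stdbool.h>
#include <string.h>

#define NONE NULL   /* default value / missing vector entry */

/* Vi: local vector built from the INIT messages; Vj, vj: vector and value
 * carried by process pj's VECT message; n, f: group size and fault bound. */
bool vect_is_valid(char *const Vi[], char *const Vj[], const char *vj, int n, int f)
{
    if (vj == NONE)                 /* (a) the default value needs no justification */
        return true;

    int matches = 0;                /* (b) count positions that justify vj          */
    for (int k = 0; k < n; k++) {
        if (Vi[k] != NONE && Vj[k] != NONE &&
            strcmp(Vi[k], vj) == 0 && strcmp(Vj[k], vj) == 0)
            matches++;
    }
    return matches >= n - 2 * f;    /* at least n-2f entries must agree with vj     */
}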
F. Vector Consensus

Vector consensus allows processes to agree on a vector with a subset of the proposed values. The protocol is the one described in [7] and uses reliable broadcast and multi-valued consensus as underlying primitives. It ensures that every correct process decides on the same vector V of size n; if a process pi is correct, then V[i] is either the value proposed by pi or the default value ⊥, and at least f+1 elements of V were proposed by correct processes.

The protocol starts with each process reliably broadcasting a message containing its proposed value and setting the round number ri to 0. The protocol then proceeds in up to f rounds until a decision is reached. Each round is carried out as follows. A process waits until n−f+ri messages have been received and constructs a vector Wi of size n with the received values. The indexes of the vector for which a message has not been received hold the value ⊥. The vector Wi is proposed as input for the multi-valued consensus. If it decides on a value Vi ≠ ⊥, then the process decides Vi. Otherwise, the round number ri is incremented and a new round is initiated.

G. Atomic Broadcast

An atomic broadcast protocol delivers messages in the same order to all processes. One can see atomic broadcast as a reliable broadcast plus the total order property. The implemented protocol was adapted from [7]. The main difference is that it has been changed to use multi-valued consensus instead of vector consensus and to utilize message identifiers for the agreement task instead of cryptographic hashes. These changes were made for efficiency and have been proved not to compromise the correctness of the protocol. The protocol uses reliable broadcast and multi-valued consensus as primitives.

The atomic broadcast protocol is divided into two tasks (see Figure 2): (1) the broadcasting of messages, and (2) the agreement over which messages should be delivered. When a process pi wishes to broadcast a message m, it simply uses the reliable broadcast to send an (AB_MSG, i, rbid, m) message, where rbid is a local identifier for the message. Every message in the system can be uniquely identified by the tuple (i, rbid).

The agreement task (2) is performed in rounds. A process pi starts by waiting for AB_MSG messages to arrive. When such a message arrives, pi constructs a vector Vi with the identifiers of the received AB_MSG messages and reliably broadcasts an (AB_VECT, i, r, Vi) message, where r is the round for which the message is to be processed. It then waits for n−f AB_VECT messages (and the corresponding Vj vectors) to be delivered and constructs a new vector Wi with the identifiers that appear in f+1 or more Vj vectors. The vector Wi is then proposed as input to the multi-valued consensus protocol, and if the decided value W′ is not ⊥, then the messages whose identifiers are in the vector W′ can be deterministically delivered by the process.
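One agreement round of this task can be sketched as follows. This is an illustration of our own; the identifier bounds and the extern primitives are assumptions, not the RITAS interfaces.

#include <stdbool.h>

#define MAX_N   16        /* illustrative bound on the group size        */
#define MAX_IDS 128       /* illustrative bound on identifiers per round */

struct msg_id { int sender; int rbid; };     /* (i, rbid) uniquely names a message */

/* Assumed lower-layer primitives. */
extern int  collect_ab_vects(int round, struct msg_id vects[][MAX_IDS], int counts[],
                             int n, int f);               /* waits for n-f AB_VECT messages */
extern bool mv_consensus(int round, const struct msg_id *proposal, int len,
                         struct msg_id *decision, int *dlen);  /* false if it decided ⊥      */
extern void deliver_in_order(const struct msg_id *ids, int len);

static bool id_eq(struct msg_id a, struct msg_id b)
{ return a.sender == b.sender && a.rbid == b.rbid; }

void ab_agreement_round(int round, int n, int f)
{
    struct msg_id vects[MAX_N][MAX_IDS], W[MAX_IDS], decision[MAX_IDS];
    int counts[MAX_N], wlen = 0, dlen;
    int received = collect_ab_vects(round, vects, counts, n, f);

    /* Keep the identifiers that appear in f+1 or more of the received vectors,
       i.e., that were reported by at least one correct process. */
    for (int j = 0; j < received; j++) {
        for (int k = 0; k < counts[j]; k++) {
            struct msg_id id = vects[j][k];
            int seen = 0, dup = 0;
            for (int w = 0; w < wlen; w++) dup |= id_eq(W[w], id);
            if (dup) continue;
            for (int jj = 0; jj < received; jj++)
                for (int kk = 0; kk < counts[jj]; kk++)
                    if (id_eq(vects[jj][kk], id)) { seen++; break; }
            if (seen >= f + 1 && wlen < MAX_IDS) W[wlen++] = id;
        }
    }

    /* Propose W; if the consensus does not decide the default value, the
       messages named in the decision can be delivered deterministically. */
    if (mv_consensus(round, W, wlen, decision, &dlen))
        deliver_in_order(decision, dlen);
}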
V. IMPLEMENTATION

This section describes the internal structure of the protocol stack and provides an insight into the design considerations and practical issues that arose during the development of RITAS. The protocol stack was implemented in the C language and was packaged as a shared library with the goal of offering a simple interface to applications wishing to use the protocols. Some of the concepts presented here have been studied in other group communication systems, such as Horus and Ensemble [36], [37].

A. Single-threaded Operation

When developing a software component such as a protocol stack, there are two possible options regarding its operation: multi-threaded or single-threaded. The RITAS protocol stack runs in a single thread, independent of the application threads.

In a typical multi-threaded protocol stack, every instance of a specific protocol is handled by a separate thread. Usually, there is a pivotal thread that reads messages from the network and instantiates protocol threads to handle messages that are specific to them. Another option is to avoid the pivotal thread and have the protocol threads read messages directly from the network. The multi-threaded approach may be simpler to implement, since each context is self-contained in a given thread and there is virtually no need for protocol demultiplexing, because messages can be addressed directly to the threads handling them. This leads to a cleaner implementation (i.e., more verbatim translations from pseudocode) because the protocol code only has to deal with one protocol instance (the context is implicit). Nevertheless, in a loaded system, with potentially several hundreds of threads, the constant context switching and synchronization between threads has a serious performance impact on the stack, and may provoke an unfair internal scheduling.

A single-threaded approach, while more complex to develop, allows a much more efficient stack operation when properly implemented. A single-threaded protocol stack ensures a fair first-come, first-served scheduling, as messages are processed by the relevant protocol instances one by one as they are received. But this approach poses additional challenges. The contexts for the different protocol instances are not self-contained and require explicit management, which adds complexity to tasks such as message passing, protocol demultiplexing, and packet construction. The protocol-specific code also becomes harder to implement since it has to juggle multiple contexts. Since one of the main goals of RITAS was the implementation of an efficient protocol stack, the extra complexity of a single-threaded approach was outweighed by its potential performance advantages.

B. Message Buffers

In a multi-layered network protocol stack, messages have to be passed back and forth. A certain degree of flexibility is needed to manipulate the buffers that hold the messages, because data may need to be prepended or appended to these buffers, and existing data may need to be transformed or deleted. Additionally, the number of operations that actually copy data has to be kept to a minimum to reduce performance penalties.

In RITAS, information is passed along the protocol stack using message buffers (mbuf for short). An mbuf is used to store a message and several pieces of metadata related to its management. All communication between the different layers is done by passing pointers to mbufs. This way, it is possible both to eliminate the need to copy large chunks of data when passing messages from one layer to another, and to have a data structure that facilitates the manipulation of messages. This data structure was inspired by the TCP/IP implementation in the Net/3 operating system kernel [38].

An mbuf is usually created when a new message arrives from the network. The RITAS network scheduler creates an mbuf, reads the message from the socket directly into the mbuf, and passes the mbuf to the appropriate protocol layer. An mbuf can also be created by a specific protocol layer, for instance if it needs to send a message to other processes. Every mbuf is reutilized as much as possible.

Fig. 3. Communication flow among the protocol layers during an atomic broadcast.

There are also specific rules as to when an mbuf should be destroyed. An outbound mbuf should be destroyed immediately after its message is sent to all relevant processes. The exception is when the RITAS_MBUF_PROTECTED flag is set. In this case, the mbuf was explicitly marked for no destruction by a particular protocol layer, which then becomes solely responsible for the mbuf's destruction. For an inbound mbuf, the last protocol to which the mbuf is going to be passed is responsible for its management. A protocol layer has three options, which are mutually exclusive, after it has processed the message contained in the mbuf: it passes the mbuf to an upper-layer protocol, it destroys the mbuf, or it reuses the mbuf to transmit a new message. The chosen action depends on the semantics of the protocol and the current state of the particular protocol instance context to which the mbuf is relevant.
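A structure along the following lines conveys the idea of an mbuf; the field names merely mirror the description above and, apart from the flag mentioned in the text, are not the actual RITAS definitions.

#include <stddef.h>
#include <stdint.h>

#define RITAS_MBUF_PROTECTED 0x01   /* owning layer keeps the buffer alive after sending */

struct mbuf {
    uint8_t *data;        /* start of the message within the buffer           */
    size_t   len;         /* current message length                           */
    size_t   headroom;    /* free space before data, for prepending headers   */
    size_t   tailroom;    /* free space after data, for appending             */
    uint32_t flags;       /* e.g., RITAS_MBUF_PROTECTED                       */
    void    *owner_ctx;   /* protocol instance currently responsible for it   */
    struct mbuf *next;    /* optional chaining of buffers                     */
};

/* Because layers exchange pointers to such buffers, prepending a protocol
 * header is a constant-time pointer adjustment instead of a data copy. */
static inline uint8_t *mbuf_prepend(struct mbuf *m, size_t hdr_len)
{
    if (m->headroom < hdr_len)
        return NULL;                 /* caller must obtain a larger buffer */
    m->data     -= hdr_len;
    m->headroom -= hdr_len;
    m->len      += hdr_len;
    return m->data;
}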
C. Control Blocks and Protocol Handlers

Each protocol implemented in RITAS is formed by two protocol-specific components: the control block and the protocol handler. The control block is a data structure that holds the state of a specific instance of the protocol. It keeps track of things like the instance identification, the current protocol step, and the values received so far. The protocol handler is the set of functions that implement the operation of the protocol. It is formed by initialization and destruction functions, input and output functions, and one or more functions that export the protocol functionality. The purpose of the initialization and destruction functions is, respectively, to allocate a new control block and initialize all its variables and data structures, and to destroy the internal data structures and the control block itself. The input and output functions are used for inter-protocol communication, and both receive as parameters the respective control block and the mbuf to be processed. The communication between the protocols is depicted in Figure 3.
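The split between control blocks and protocol handlers can be pictured with the following C types. This is an illustrative sketch with names of our choosing, not the RITAS headers.

struct mbuf;                                    /* defined in the message-buffer layer */

struct control_block {
    unsigned instance_id;                       /* identifies this protocol instance   */
    int step;                                   /* current protocol step               */
    void *state;                                /* protocol-specific values received   */
    const struct protocol_handler *handler;     /* the functions that drive it         */
    struct control_block **children;            /* chained lower-layer control blocks  */
    int nchildren;
};

struct protocol_handler {
    struct control_block *(*init)(unsigned instance_id);      /* allocate and set up state */
    void (*destroy)(struct control_block *cb);                 /* tear down state           */
    int  (*input)(struct control_block *cb, struct mbuf *m);   /* message from below        */
    int  (*output)(struct control_block *cb, struct mbuf *m);  /* message toward the network */
};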
D. The RITAS Channel and Control Block Chaining

Since applications might perform several broadcast and/or agreement operations simultaneously, the ability to execute multiple instances of the same protocol is a requisite. Therefore, one needs to support many contexts for the different protocol instances. When a message is passed to a given protocol layer, that layer must be able to identify the relevant context for the message and process the message according to it. This hints at the necessity of having each protocol instance uniquely identified, and of having messages addressed to specific protocol instances to avoid the overlapping of multiple instances. Two techniques in RITAS make the efficient implementation of this functionality possible: the RITAS Channel and Control Block Chaining.

1) RITAS Channel: This is a special protocol handler that sits between the broadcast layers and the Reliable Channel layer (the Reliable Channel layer corresponds to the implementation of TCP and IPSec that is accessed through the socket interface) (see Figure 3). It is the first layer to process messages after they are read from the network, and the last one before they are written to the network. The purpose of the RITAS Channel is to build a header containing a unique identifier for each message. Messages are always addressed to a given RITAS Channel. The message is then passed along the appropriate protocol instances by a mechanism called control block chaining.

2) Control Block Chaining: This mechanism manages the linking of different protocol instances, solving several problems: it gives a means to unambiguously identify all messages, provides for seamless protocol demultiplexing, and facilitates control block management.

Control block chaining works in the following way. Suppose an application executes an atomic broadcast. The creation of the atomic broadcast protocol instance is done by calling the corresponding initialization function, which returns a pointer to a control block responsible for that instance. Since atomic broadcast uses multi-valued consensus and reliable broadcast as primitives, the atomic broadcast initialization function also calls the initialization functions for those protocols in order to create as many instances of them as needed. The returned control blocks are kept and managed in the atomic broadcast control block. This mechanism is recursive, since second-order protocol instances may need to use other protocols as primitives, and so on. The result is a tree of control blocks that has its root at the protocol called by the application and goes down all the way, having control blocks for RITAS Channels as the leaf nodes.

A unique identifier is given to each outbound message when the associated mbuf reaches the RITAS Channel layer. The tree is traversed bottom-up, starting at the RITAS Channel control block and ending at the root control block. The message identifier is generated by appending the protocol instance ID of each traversed node to a local message identifier that was set by the node that created the mbuf. Protocol demultiplexing is done seamlessly. When a message arrives, its identification defines an association with a particular RITAS Channel control block. The RITAS Channel passes the mbuf to the upper layer by calling the appropriate input function of its parent control block. The message is processed by that layer and the mbuf keeps being passed in the same fashion.

E. Out-of-context Messages

The asynchronous nature of the protocol stack allows scenarios where a process receives messages for a protocol instance whose context has not yet been created. These messages – called out-of-context (OOC) messages – have no context to handle them, though they eventually will. Since the correctness of the protocols depends on the eventual delivery of these messages, they cannot simply be discarded. All OOC messages are stored in a hash table. When a RITAS Channel is created, it checks this hash table for relevant messages. If any relevant messages exist, they are promptly delivered to the upper protocol instance.

It is also possible for a protocol instance to be destroyed before consuming all of its OOC messages. To avoid scenarios where OOC messages are kept indefinitely in the hash table, upon the destruction of a protocol instance the hash table is checked and all relevant messages are deleted. This is not a solution for the case where a malicious process sends bogus OOC messages that will never have a context. The problem of finite memory in Byzantine message-passing systems is an open issue in the research community. In principle, RITAS and other group communication systems could benefit from an approach such as the one in [39].

VI. PERFORMANCE EVALUATION

This section describes the performance evaluation of the protocol stack in both local-area-network (LAN) and wide-area-network (WAN) environments. Two different performance analyses are made. First, a comparative evaluation is presented in order to gain insight into the stack and into how the protocols relate to and build on one another performance-wise. Second, an in-depth analysis is conducted on how atomic broadcast performs under various conditions. This protocol is arguably the most interesting candidate for a detailed study because it utilizes all other protocols as primitives, either directly or indirectly, and it can be used for many practical applications [34], [32], [40].

A. Testbeds

The experiments were carried out on three different testbeds. Two represent LAN environments, which differ in the hardware and the number of nodes they accommodate, and the third represents a WAN environment with four nodes. The first LAN testbed, which will be referred to as tb-lan-slow, consisted of four 500 MHz Dell Pentium III PCs with 128 MB of RAM, running Linux kernel 2.6.5.
The PCs were connected by a 100 Mbps HP ProCurve 2424M network switch. Bandwidth tests taken with the network performance tool iperf showed a consistent throughput of 9.1 MB/s in full-duplex mode.

The second LAN testbed, which will be referred to as tb-lan-fast, consisted of 10 Dell PowerEdge 850 servers. These servers have Pentium 4 CPUs with a clock speed of 2.8 GHz and 2 GB of RAM. They were connected by a Dell PowerConnect 2724 network switch with 10/100/1000 Mbps of bandwidth capacity. The operating system was Linux 2.6.11. Bandwidth tests showed a consistent throughput of 1.16 MB/s for the 10 Mbps setting, 11.5 MB/s for the 100 Mbps setting, and 67.88 MB/s for the 1000 Mbps setting. All values were taken in full-duplex mode, which was used in the experiments.

The WAN testbed, which will be referred to as tb-wan, consisted of four nodes, each located on a different continent: a European node in Lisbon, Portugal (P4, 3 GHz, 1 GB RAM); a North American node in Berkeley, California (P4, 2.4 GHz, 1 GB RAM); a South American node in Campinas, Brazil (Xeon, 3 GHz, 1.5 GB RAM); and an Asian node in Ishikawa, Japan (P4, 3.4 GHz, 3.5 GB RAM). These nodes belong to the PlanetLab platform [41], and their operating system was Linux 2.6.12. Table I shows the round-trip latency and bandwidth measurements taken between each pair of nodes.

TABLE I
AVERAGE ROUND-TRIP LATENCY AND BANDWIDTH BETWEEN EVERY PAIR OF NODES IN TESTBED tb-wan (VARIANCE IS SHOWN IN PARENTHESES)

                        Latency (ms)   Bandwidth (Kb/s)
Berkeley - Ishikawa     131 (0.26)     1894
Lisbon - Berkeley       210 (1.12)     1167
Berkeley - Campinas     243 (1.37)     990
Lisbon - Campinas       281 (1.24)     845
Lisbon - Ishikawa       322 (1.78)     740
Campinas - Ishikawa     472 (0.85)     165

For testbeds tb-lan-fast and tb-lan-slow, the IPSec implementation used was the one available in the Linux kernel, and the reliable channels established between every pair of processes employed the IPSec AH protocol (with SHA-1) in transport mode [21]. For testbed tb-wan there was no IPSec available, so the experiments for this testbed were carried out with regular IP. This makes the protocols insecure, since the integrity property of the channels is not provided, but our interest here is to evaluate the performance of the protocols. In practice, this is not affected, because the latency added by the cryptographic operations (on the order of microseconds) is negligible compared to the latency of the WAN links (around hundreds of milliseconds).

B. Stack Analysis

In order to get a better understanding of the relative overheads of each layer of the stack, we ran a set of experiments to determine the latencies of the protocols.
These measurements were carried out in the following manner: a signaling machine, which does not participate in the protocols, is selected to control the benchmark execution. It starts by sending a 1-byte UDP message to the n processes to indicate which specific protocol instance they should create. Then, it transmits M messages, each one separated by a two-second interval (in our case M was set to 100). Whenever one of these messages arrives, a process runs the protocol, either a broadcast or a consensus. In the case of a broadcast, the process with the lowest identifier acts as the sender, while the others act as receivers. In the case of a consensus, all processes propose identical initial values (see footnote 2). The broadcast messages and the consensus proposals all carry a 10-byte payload (except for binary consensus, where the payload is 1 byte). The latency of each instance was obtained at a specific process. This process records the instant when the signal message arrives and the instant when it either delivers a message (for broadcast protocols) or reaches a decision (for consensus protocols). The measured latency is the interval between these two instants. The average latency is obtained by taking the mean value of the sample of measured values. Outliers were identified and excluded from the sample.

[Footnote 2: The only protocol whose performance may directly suffer from different initial values is binary consensus, since its termination is probabilistic. This protocol has been subject to a thorough evaluation in a different paper, and its performance has been shown not to be significantly affected by different initial values [19].]

TABLE II
AVERAGE LATENCY FOR ISOLATED EXECUTIONS OF EACH PROTOCOL IN TESTBED tb-lan-slow (100 Mbps) WITH FOUR PROCESSES

                         w/ IPSec (µs)   w/o IPSec (µs)   Overhead
Echo Broadcast           1724            1497             15%
Reliable Broadcast       2134            1641             30%
Binary Consensus         8922            6816             30%
Multi-valued Consensus   16359           11186            46%
Vector Consensus         20673           15382            34%
Atomic Broadcast         23744           18604            27%

The results for testbed tb-lan-slow with four processes, shown in Table II, demonstrate the interdependencies among the protocols and how much time is spent in each protocol. For example, in a single atomic broadcast instance, roughly 2/3 of the time is taken running a multi-valued consensus. For a multi-valued consensus, about 1/2 of the time is used by the binary consensus. And for vector consensus, about 3/4 of the time is utilized by the multi-valued consensus. The experiments also demonstrated that the consensus protocols were always able to reach a decision in one round because the initial proposals were identical.

The table also shows the cost of using IPSec. This overhead could in part be attributed to the cryptographic calculations, but most of it is due to the increase in the size of the messages. For example, the total size of any Reliable Broadcast message – including the Ethernet, IP, and TCP headers – carrying a 10-byte payload is 80 bytes. The IPSec AH header adds another 24 bytes, which accounts for an extra 30%.

TABLE III
AVERAGE LATENCY AND RELATIVE SLOWDOWN (W.R.T. THE FOUR-PROCESS SCENARIO) FOR ISOLATED EXECUTIONS OF EACH PROTOCOL (WITH IPSEC) IN TESTBED tb-lan-fast (1000 Mbps)

                         n = 4 (µs)   n = 7 (µs)   slowdown   n = 10 (µs)   slowdown
Echo Broadcast           584          805          38%        1045          79%
Reliable Broadcast       667          907          36%        1172          76%
Binary Consensus         3094         8991         190%       19741         538%
Multi-valued Consensus   4952         13335        169%       25652         418%
Vector Consensus         6022         16826        179%       32674         443%
Atomic Broadcast         6467         18496        186%       33474         418%

Table III shows the performance results for testbed tb-lan-fast. The average latency for all protocols is presented for three different group sizes: 4, 7, and 10 processes. The relative slowdown with respect to the four-process scenario is also shown for each protocol. The first conclusion that can be extracted from these results is that the protocols in this testbed exhibit much better performance than in the previous testbed.
The use of more powerful hardware had a significant impact on the performance of all protocols. For instance, in the case of binary consensus the performance improved three-fold, while for atomic broadcast the performance increased almost four times. The network switch with increased bandwidth capacity, the network interface cards with better performance, and the machines in general with greater computational power are the obvious candidates to justify the performance gain. It is unclear, however, what the relative weight of the various hardware components is in the faster protocol execution. Later experiments isolate some of these parameters and demonstrate in greater depth the impact of network bandwidth and host computational power on the protocol stack performance.

Another interesting observation from the results in Table III is the relative slowdown of each protocol when the group size increases. The reliable and echo broadcast protocols were less sensitive to a larger group size, while the slowdown for the remaining protocols was considerably more accentuated due to the increase in the number of exchanged messages. Reliable and echo broadcast exchange O(n²) messages per communication step, while the remaining protocols exchange O(n³), thus being more sensitive to increasing group sizes.

The results for the WAN environment are shown in Table IV. The performance of the protocols is significantly affected by the higher-latency, lower-bandwidth links of this testbed. As expected, the protocols with a larger number of message exchanges suffer more due to the network delays.

TABLE IV
AVERAGE LATENCY FOR ISOLATED EXECUTIONS OF EACH PROTOCOL IN TESTBED tb-wan (WITH FOUR PROCESSES)

                         Latency (ms)
Echo Broadcast           312.62
Reliable Broadcast       486.24
Binary Consensus         1408.75
Multi-valued Consensus   2232.30
Vector Consensus         2629.34
Atomic Broadcast         2998.70

C. Atomic Broadcast Analysis

This section evaluates the atomic broadcast protocol in more detail. The experiments were carried out by having the n processes send a burst of k messages and measuring the interval between the beginning of the burst and the delivery of the last message. The benchmark was performed in the following way: processes wait for a 1-byte UDP message from the signaling machine, and then each one atomically broadcasts a burst of k/n messages. Messages have a fixed size of m bytes. For every tested workload, the obtained measurement reflects the average value of 10 executions.

Two metrics are used to assess the performance of the atomic broadcast: burst latency (Lburst) and maximum throughput (Tmax). The burst latency is always measured at a specific process and is the interval between the instant when it receives the signal message and the moment when it delivers the k-th message. The throughput for a specific burst is the burst size k divided by the burst latency Lburst (in seconds). The maximum throughput Tmax can be inferred as the value at which the throughput stabilizes (i.e., does not change with increasing burst sizes). Although no graphs for the burst latency are provided due to space constraints, by dividing the burst size by the throughput value one can obtain the corresponding burst latency in seconds.
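As a concrete illustration of the relation between the two metrics, consider the fault-free LAN run with four processes reported in the next subsection: the throughput stabilizes at Tmax ≈ 2800 msgs/s, so a burst of k = 1000 messages corresponds to Lburst ≈ 1000 / 2800 ≈ 0.36 s, consistent with the measured burst latency of 354 ms.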
The measurements were taken by varying several system parameters: group size, network bandwidth, fault load, and message payload size. In the LAN environment, the impact of all these parameters is tested. In the WAN environment, only the fault load and the payload size are tested.

The group size defines the number of processes n in the system and can assume three values: 4, 7, and 10. The network bandwidth is the amount of data that can be passed between every pair of processes in a given period of time. It can take three values: 10 Mbps, 100 Mbps, and 1000 Mbps. The fault load defines the types of faults that are injected in the system during its execution. The measurements were obtained under three fault loads. In the fault-free fault load, all processes behave correctly. In the fail-stop fault load, f processes crash before the measurements are taken (f is always set to the maximum number of processes that can fail as dictated by the system model, which means that f = ⌊(n−1)/3⌋). Finally, in the Byzantine fault load, f processes permanently try to disrupt the behavior of the protocols. At the binary consensus layer, they always propose zero, trying to impose a zero decision. At the multi-valued consensus layer, they always propose the default value in both INIT and VECT messages, trying to force the correct processes to decide on the default value. The impact of any such attack, if successful, would be that correct processes do not reach an agreement over which messages should be delivered by the atomic broadcast protocol and, consequently, would have to start a new agreement round. The message payload size is the length of the data transmitted in each atomic broadcast (excluding protocol headers). Four values were used in the experiments: 10 bytes, 100 bytes, 1 kilobyte, and 10 kilobytes.

1) Group Size and Fault Load in LAN: The set of experiments described in this section had the objective of measuring the impact of both the group size and the fault load in a LAN environment. The network bandwidth was fixed at 100 Mbps in testbed tb-lan-slow, and at 1000 Mbps in testbed tb-lan-fast. The message payload size was 100 bytes. The group size was set to 4, 7, and 10 processes. All three fault loads were tested: fault-free, fail-stop, and Byzantine. Figure 4 shows the performance of the atomic broadcast in testbed tb-lan-fast for the three different fault loads. Each curve shows the throughput for a different group size n.

Fig. 4. Throughput for atomic broadcast with different group sizes and fault loads for testbed tb-lan-fast.

Fig. 5. Throughput for atomic broadcast with different bandwidth settings and message sizes for testbed tb-lan-fast.

a) Fault-free fault load: From the graph in Figure 4 it is possible to observe that the stabilization point in the
The burst latency for a burst size of 1000 was 354 ms, 700 ms, and 995 ms for 4, 7, and 10 processes, respectively. The group size had a significant impact on the protocol performance. The maximum throughput dropped almost to half from the four-process to the the seven-process scenario, and then about one third from the seven-process to the ten-process scenario. These results were expected because larger group sizes implicate that a larger number of messages must be exchanged. This imposes a higher load on the network, which decreases the maximum throughput. b) Fail-stop fault load: In this fault load, where f k processes crash, each correct process sends a burst of n−f messages. Looking at the curves, it is possible to conclude that performance is noticeably better with f crashed processes than in the fault-free situation. This happens because with f fewer processes there are fewer messages. The decreased contention, which does not necessarily occur at the network since the individual nodes are also susceptible to resource contention, allows operations to be executed faster. The maximum throughput Tmax is around 3000 messages per second for a group size of 4 processes, 1700 msgs/s for 7 processes, and 1050 msgs/s for 10 processes. The burst latency for a burst size of 1000 was 330 ms, 587 ms, and 989 ms for 4, 7, and 10 processes, respectively. c) Byzantine fault load: In this fault load, f processes try to disrupt the protocol. The maximum throughput Tmax is around 2800 messages per second for a group size of 4 processes, 1500 msgs/s for 7 processes, and 1000 msgs/s for 10 processes. The burst latency for a burst size of 1000 was 355 ms, 704 ms, and 966 ms for 4, 7, and 10 processes, respectively. There is no noticeable performance penalty when compared to the fault-free fault load. An important result is that all the consensus protocols reached agreement within one round, even under Byzantine faults. This can be explained in a intuitive way as follows. The experimental setting was a LAN, which not only provides a low-latency, high-throughput environment, but also keeps the nodes within symmetrical distance of each other. Due to this symmetry, in the atomic broadcast protocol, correct processes maintained a fairly consistent view of the received AB MSG messages because they all received these messages at approximately the same time. Any slight inconsistencies that, on occasion, existed over this view were squandered when processes broadcast vector V (which was built with the identifiers of the received AB MSG messages) and then constructed a new vector W (which serves as the proposal for the multi-valued consensus) with the identifiers that appeared in, at least, f + 1 of those V vectors. This mechanism caused all correct processes to propose identical values in every instance of the multi-value consensus, which allowed one-round decisions. d) Testbed tb-lan-slow vs. tb-lan-fast: Figure 6 (left chart) compares the performance for the fault-free and failstop scenarios with four processes in both testbeds. The curves for the Byzantine scenario were left out for legibility since, as observed above, they are practically the same as for the fault-free scenario. The bandwidth for testbed tb-lan-slow is 100 Mbps, and for tb-lan-fast is set to 1000 Mbps. Unsurprisingly, it can be observed that the performance is clearly superior in testbed tb-lan-fast. The greater computational power and network capacity of tb-lan-fast allows a IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. X, NO. 
d) Testbed tb-lan-slow vs. tb-lan-fast: Figure 6 (left chart) compares the performance for the fault-free and fail-stop scenarios with four processes in both testbeds. The curves for the Byzantine scenario were left out for legibility since, as observed above, they are practically the same as for the fault-free scenario. The bandwidth for testbed tb-lan-slow is 100 Mbps, and for tb-lan-fast it is set to 1000 Mbps. Unsurprisingly, it can be observed that the performance is clearly superior in testbed tb-lan-fast. The greater computational power and network capacity of tb-lan-fast allow a maximum throughput about 4 times larger in the fault-free scenario (2800 msgs/s vs. 650 msgs/s), and 3 times larger in the fail-stop scenario (3000 msgs/s vs. 1000 msgs/s). The performance factor is larger in the fault-free case because the load increase in this scenario, with respect to the fail-stop scenario, pushes tb-lan-slow closer to its limit (i.e., consumes a greater percentage of its resources) than tb-lan-fast.

2) Network Bandwidth and Message Size in LAN: This section analyzes in greater detail the impact of the network bandwidth and of the message payload size on the protocol performance. In these experiments, no faults were injected and the group size was set to four processes. The network bandwidth was 1000, 100, and 10 Mbps for testbed tb-lan-fast, and 100 Mbps for testbed tb-lan-slow. Four message payload sizes were used: 10 bytes, 100 bytes, 1 Kilobyte, and 10 Kilobytes.

Figure 5 shows the performance curves for testbed tb-lan-fast with 10-byte, 100-byte, and 1K-byte message payloads. Each curve represents a different bandwidth value. While there is a clear performance difference between the protocol execution in the three network bandwidth scenarios, it is not as accentuated as one would expect if the bandwidth were the sole performance bottleneck. For instance, while the 1000 Mbps scenario has 100 times more bandwidth than the 10 Mbps scenario, the maximum throughput is only about 1.6 times as high in the 1000 Mbps case (2900 msgs/s vs. 1800 msgs/s) with 10-byte messages. It is only for larger message payloads that the network bandwidth becomes a restricting factor. As later experiments confirm, the processing power of the individual nodes and the network latency considerably affect the performance, especially for small payload sizes.

Finally, the middle and right charts of Figure 6 compare the protocol performance on both testbeds with similar bandwidth values. The purpose is solely to compare the impact of the individual node computational power on the protocol performance. As can be easily observed, testbed tb-lan-fast clearly outperforms testbed tb-lan-slow. It is only for large payloads (e.g., 10 Kbytes) that their performance becomes comparable, as both the network bandwidth and latency become more restricting factors.

3) Fault Load and Message Size in WAN: This section describes the experiments that measure the impact of the message payload size and of different types of faults in the WAN environment (testbed tb-wan).

a) Fault-free fault load: The left chart of Figure 7 shows the performance of the atomic broadcast in testbed tb-wan when no faults occur in the system. Each curve shows the throughput for a different payload size. For 10-byte payloads, the maximum throughput is around 80 msgs/s (burst latency of 13 seconds for k = 1000). For 100-byte payloads, the maximum throughput is around 32 msgs/s (burst latency of 34 seconds for k = 1000). Finally, for 1K-byte payloads, the throughput stabilizes around 2.5 msgs/s (burst latency of 400 seconds for k = 1000). As expected, the throughput in the WAN environment is considerably lower than in the LAN environment. The higher latency and lower bandwidth of such an environment have a negative impact on the atomic broadcast performance, and make it extremely sensitive to the message payload size.
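For the WAN group size of n = 4 (hence f = 1), the burst division works out as follows: in the fail-stop load each of the n − f correct processes sends k/(n − f) ≈ 333 messages for k = 1000, while in the fault-free load the burst is presumably divided among all n processes, i.e., k/n = 250 messages each. Keeping these shares in mind helps in reading the next two fault loads.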
b) Fail-stop fault load: The middle chart of Figure 7 shows the performance of the atomic broadcast (using 100-byte payload messages) in testbed tb-wan when one of the processes fails by crashing. One curve shows the performance impact on the atomic broadcast protocol when the Campinas node crashes, and the other curve when a node other than Campinas crashes. When the crashed node is not Campinas, the performance of the protocol is similar for all the remaining scenarios, with the throughput stabilizing around 14 msgs/s (burst latency of 71 seconds for k = 1000). On the other hand, when the crashed node is Campinas, the performance of the atomic broadcast is boosted to around 120 msgs/s (burst latency of 8 seconds for k = 1000), a significant increase even when compared to the fault-free scenario.

The first observation is that, when the crashed node is not Campinas, the performance is about 50% worse than in the fault-free scenario. Messages from the Campinas node were consistently the last ones to arrive at any given process for any particular communication step. This observation is consistent with the latency and bandwidth measurements taken: the links connecting to the Campinas node had the worst results on average. The conclusion is that the Campinas node is a performance bottleneck. When one process (other than Campinas) crashes, this forces all processes to wait for the Campinas messages at every communication step. In the fault-free scenario this is offset by the fact that the other processes need not wait for the Campinas messages to advance in the execution of the protocols. They only need to wait for the messages that Campinas atomically broadcasts (but not the messages related to agreement executions) since, by definition of the experiment, all processes wait for k/n messages from each process.

The second observation is that the atomic broadcast has a considerably higher throughput when the crashed node is Campinas. This can be explained using the same rationale as the previous observation. Since the process crashed, and by definition of the experiment, the other processes do not expect any messages from the Campinas node (not even atomically broadcast messages). Hence the higher performance in this case, even when compared to the fault-free scenario. What is striking is how much of a performance impact a single slower process can have on the execution of the protocol.

c) Byzantine fault load: The right chart of Figure 7 shows the performance of the atomic broadcast in testbed tb-wan when one of the processes tries to disrupt the execution of the protocol. One curve shows the performance impact on the atomic broadcast protocol when the Campinas node fails, and the other curve when a node other than Campinas fails. When the Byzantine node is not Campinas, the performance of the protocol is again very similar for all the cases, with the maximum throughput being roughly around 12 msgs/s. When the Byzantine node is Campinas, the throughput climbs up to around 35 msgs/s but drops to around 20-25 msgs/s for higher burst sizes (i.e., k > 600). The main observations are similar to those of the fail-stop scenario: the protocol performance is worse when the Byzantine node is not Campinas, and better when it is Campinas.
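The underlying reason is the message-collection rule used at each step of these asynchronous protocols: a process advances as soon as it has gathered messages from n − f distinct senders, so the messages of a consistently slow node are simply not waited for. A minimal sketch of this rule follows (hypothetical names; the actual RITAS code is event-driven rather than blocking):

    # Collect one message per sender until n - f distinct senders have been
    # heard for the current protocol step (illustrative sketch only).
    def collect_step_messages(recv, n, f):
        messages = {}
        while len(messages) < n - f:
            sender, payload = recv()   # blocks until the next message arrives
            messages[sender] = payload
        # Messages from the remaining f senders (e.g., from a slow node such as
        # Campinas) arrive too late to influence this step.
        return messages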
Naturally, this implies that when the Byzantine node is among the n − f fastest, its power to delay the execution of the protocols is greater. Because the messages from the slower node are rarely processed by the other processes, the impact of its Byzantine actions is minimized or even non-existent. The performance of the atomic broadcast when the Byzantine node is Campinas is similar to the fault-free scenario for burst sizes less than or equal to 600. It is only when the burst rises above this threshold that the node begins to show some capacity to delay the protocol execution. When more messages are being processed in the system, there is a higher chance for some of the messages sent by Campinas to be among the first n − f to be received by the other processes.

Fig. 6. Comparative throughput for atomic broadcast for testbeds tb-lan-slow and tb-lan-fast (n = 4; throughput in msgs/s vs. burst size in messages).

Fig. 7. Throughput for atomic broadcast for testbed tb-wan (n = 4; left chart: fault-free fault load with 10-byte, 100-byte, and 1K-byte payloads; middle and right charts: fail-stop and Byzantine fault loads with 100-byte payloads, distinguishing whether the faulty node is Campinas; throughput in msgs/s vs. burst size in messages).

4) Relative Cost of Agreement: In all experiments, only a few agreements were necessary to deliver an entire burst. The observed pattern was that a consensus was initiated immediately after the arrival of the first message. While the agreement task was being run, a significant portion of the burst would arrive, and so on until all the messages were delivered. This has the interesting effect of diluting the cost of the agreements when the load increases.

Figure 8 shows the relative cost of the agreements with respect to the total number of (reliable and echo) broadcasts, as observed in the fault-free scenario with four processes and 100-byte messages in testbed tb-lan-fast. This relative cost is referred to as the efficiency of the atomic broadcast protocol. The curves for the other scenarios are almost identical; none of the testing parameters had a noticeable effect on the efficiency. Basically, two quantities were obtained for the transmission of every burst: the total number of (reliable and echo) broadcasts, and the total number of (reliable and echo) broadcasts that were necessary to execute the agreement operations. The values depicted in the figure are the second quantity divided by the first.

Fig. 8. Relative cost: percentage of (reliable or echo) broadcasts that are due to the agreements when a burst of messages is atomically broadcast.

It is possible to observe that for small burst sizes the cost of agreement is high: in a burst of 4 messages, it represents about 92% of all broadcasts. This number, however, drops exponentially, reaching as low as 6.3% for a burst size of 1000 messages.
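In code, the quantity plotted in Figure 8 is just this ratio; the sketch below uses hypothetical counter names, with the real values coming from instrumentation of the protocol stack:

    # Relative cost of agreement for one burst: the fraction of all (reliable
    # and echo) broadcasts that were performed on behalf of the agreements.
    def agreement_cost(total_broadcasts, agreement_broadcasts):
        return agreement_broadcasts / total_broadcasts

    # Observed endpoints in Figure 8 (fault-free, n = 4, 100-byte messages):
    #   burst of 4 messages    -> about 0.92 (92% of all broadcasts)
    #   burst of 1000 messages -> about 0.063 (6.3% of all broadcasts)

The dilution of the agreement cost with increasing burst size is thus directly visible in this ratio.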
There is a downside to this result, related to the individual message latency under an atomic broadcast burst. According to the observed pattern, for a burst of k messages, k − 1 are delivered exactly at the end of the burst. This means that the individual message latency for those k − 1 messages matches the whole burst latency, and suggests that in certain usage scenarios the protocols could be optimized to spread message deliveries more evenly across a burst (by sacrificing some efficiency).

D. Summary of Results

Some of the conclusions of the experimental evaluation are summarized in the following points:

• The protocols are robust. In LAN environments, performance (and also correctness) is not affected by the tested fault patterns.
• The protocols are efficient with respect to the number of rounds needed to reach agreement. In the experiments with no Byzantine failures, the multi-valued consensus always reached an agreement with a value distinct from the default ⊥, and the binary consensus always terminated within one round.
• Since the protocols do not carry out any recovery actions when a failure occurs, crashes have the effect of making executions faster in LAN environments: fewer processes means less contention on the network.
• The network bandwidth only becomes a serious performance bottleneck when it is relatively small (i.e., 10 Mbps or WAN) or the message payloads are relatively large (i.e., 1 KB and 10 KB).
• The protocols perform much worse in a WAN, due to its higher-latency and lower-bandwidth links.
• For LANs, the computational capability of the individual nodes has a strong influence on the performance of the protocol stack.
• In a WAN, the performance impact of a process crash can be positive or negative, depending on whether the crashed process is relatively slow or relatively fast, respectively.
• A Byzantine process can have a negative impact on the performance of the protocols in a WAN environment, but only if it can consistently broadcast valid messages that are delivered among the first n − f messages of any given step.
• In the atomic broadcast protocol, the cost of the agreements is diluted when the load is high. For a burst of 1000 messages, it represents only 6.3% of all (reliable or echo) broadcasts that were made.

VII. CONCLUSION

The paper presents an implementation and evaluation of a stack of intrusion-tolerant randomized protocols. These protocols have a set of important structural properties, such as not requiring public-key cryptography (relevant for good performance) and having optimal resilience (significant in terms of system cost). The experiments led to several observations. First, randomized binary consensus protocols that in theory may run for a large expected number of steps in practice execute in only a few rounds under realistic conditions. Second, although atomic broadcast is equivalent to consensus, with the right implementation a large number of atomic broadcasts can be performed with a small number of consensus rounds. Consequently, the average cost of an atomic broadcast, in terms of throughput, can be almost as low as that of a reliable broadcast. Third, taking decisions in a decentralized way is important to avoid performance penalties due to the existence of faults. In fact, the performance of our protocols is approximately the same, or even improved, under realistic fault loads.
In conclusion, and contrary to a widespread belief in the scientific community, randomization can in fact be a valid solution for the deployment of efficient distributed systems. This is true even when these systems are deployed in hostile environments where they are subject to malicious attacks.
Henrique Moniz is a Ph.D. student at the Department of Informatics, University of Lisboa. He is a member of the LASIGE laboratory and the Navigators research group. He is also a teaching assistant at the Information Network Institute, Carnegie Mellon University, for activities related to the MSc in Information Technology - Information Security. His research interests are concerned with distributed algorithms in hostile environments. He is involved in several research projects related to intrusion tolerance and security, including the RITAS (FCT), CRUTIAL, and HIDENETS (EC-IST) projects, and the ReSIST NoE. More information about him is available at http://lasige.di.fc.ul.pt/~hmoniz.

Nuno Ferreira Neves is an assistant professor at the Department of Informatics, University of Lisboa, and also adjunct faculty at the Information Network Institute, Carnegie Mellon University, for activities related to the MSc in Information Technology - Information Security. He received a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign (1998). His main research interests are in dependable and secure parallel and distributed systems, and in recent years he has participated in several European and national research projects in this area, namely CRUTIAL, ReSIST, AJECT, RITAS, and MAFTIA. His work has been recognized with the IBM Scientific Prize in 2004 and the William C. Carter award at the IEEE FTCS in 1998. Currently, he is a member of the editorial board of the International Journal of Critical Computer-Based Systems. More information about him is available at http://www.di.fc.ul.pt/~nuno.

Miguel Correia is an Assistant Professor at the Department of Informatics, University of Lisboa, Faculty of Sciences. He received a Ph.D. in Computer Science from the University of Lisboa in 2003. Miguel Correia is a member of the LASIGE research unit and the Navigators research team. He has been involved in several international and national research projects related to intrusion tolerance and security, including the MAFTIA and CRUTIAL EC-IST projects, and the ReSIST NoE. He is currently the coordinator of the University of Lisboa's degree in Informatics Engineering and an instructor at the joint Carnegie Mellon University and University of Lisboa MSc in Information Technology - Information Security.
His main research interests are intrusion tolerance, security, distributed systems, and distributed algorithms. More information about him is available at http://www.di.fc.ul.pt/~mpc.

Paulo Verissimo is currently a professor at the Department of Informatics (DI) of the University of Lisboa, Faculty of Sciences (http://www.di.fc.ul.pt/~pjv), and Director of LASIGE, a research laboratory of the DI (http://lasige.di.fc.ul.pt). He is a Fellow of the IEEE. He is associate editor of the Elsevier International Journal on Critical Infrastructure Protection, and past associate editor of the IEEE Transactions on Dependable and Secure Computing. He belonged to the European Security & Dependability Advisory Board. He is past Chair of the IEEE Technical Committee on Fault Tolerant Computing and of the Steering Committee of the DSN conference, and belonged to the Executive Board of the CaberNet European Network of Excellence. He was coordinator of the CORTEX IST/FET project (http://cortex.di.fc.ul.pt). Paulo Verissimo leads the Navigators research group of LASIGE, and is currently interested in architecture, middleware, and protocols for distributed, pervasive, and embedded systems, in the facets of real-time adaptability and fault/intrusion tolerance. He is the author of more than 130 refereed publications in international scientific conferences and journals in the area, and co-author of five books (e.g., http://www.navigators.di.fc.ul.pt/dssa/).