1 Introduction

In this paper we design a routing protocol for an unreliable and dynamically changing synchronous network that is resilient against malicious insiders who may try to destroy and alter messages or disrupt communication in any way. We model the network as a communication graph G=(V,E) where each vertex/node is a processor and each edge is a communication link. We do not assume that the topology of this graph is fixed or known by the nodes. Rather, we assume a complete graph on n vertices, where some of the edges are “up” and some are “down”, and the status of each edge can change dynamically at any time.

We concentrate on the most basic task, namely how two processors in the network can exchange information. Thus, we assume that there are two designated vertices, called the sender S and the receiver R, who wish to communicate with each other. We assume that the capacity of each edge in the network is fixed, so that any edge in the system that is “up” during some round can transmit exactly P bits of informationFootnote 1 during that round. We assume the sender has bundled the information he wishes to send the receiver into a sequence of “packets” \(\{p_1, p_2, \ldots\}\) of size at most P.

We will evaluate our protocol using the following three considerations:

  1. Correctness. A protocol is correct if the sequence of packets output by the receiver is a prefix of the packets sent by the sender, without duplication or omission.

  2. Throughput. This measures the number of packets that the receiver has obtained as a function of the number of rounds that have passed.

  3. Processor Memory. This measures the memory required of each node by the protocol, independent of the number of packets to be transferred.

All three considerations will be measured in the worst case, as guarantees that hold regardless of adversarial interference. One can also evaluate a protocol based on its dependence on global information to make decisions. In the protocol we present in this paper, we do not assume that any global view of the network is available to the internal nodes. Such protocols are termed “local control” (or “distributed”), in that each node makes all routing decisions based only on the local conditions of its adjacent edges and neighbors.

Our protocol is designed to be resilient against a malicious, polynomially bounded adversary who may attempt to impact the correctness, throughput, and memory of our protocol by disrupting links between the nodes or taking direct control over the nodes and forcing them to deviate from our protocol in any manner the adversary wishes. In order to relate our work to previous results and to clarify the two main forms of adversarial interference, we describe two separate (yet coordinated with each other) adversaries:Footnote 2

Edge-Scheduling Adversary:

This adversary controls the links between nodes at every round. More precisely, at each round, this adversary decides which edges in the network are up and which are down. We will say that the edge-scheduling adversary is conforming if for every round there is at least one path from the sender to the receiver (although the path may change each round).Footnote 3 The adversary may perform arbitrary poly-time computation to maximize interference with routing, so long as it remains conforming.

Node-Controlling Adversary:

This adversary controls the nodes of the network. More precisely, each round this adversary decides which nodes to corrupt. Once corrupted, a node is forever under complete adversarial control and can behave in an arbitrary malicious manner. We say that the node-controlling adversary is conforming if in every round there is a connection between the sender and receiver consisting of edges that are “up” for the round (as specified by the edge-scheduling adversary) and that passes through uncorrupted nodes. We emphasize that this path can change each round, and there is no other restriction on which nodes the node-controlling adversary may corrupt (allowing even a vast majority of corrupt nodes).

Although we could capture the two above forms of adversarial interference with a single adversary, it will be convenient to view these adversaries as distinct, as we deal with the challenges they pose to correctness, throughput, and memory in different ways. Namely, aside from the conforming condition, the edge-scheduling adversary cannot be controlled or eliminated. Edges themselves are not inherently “good” or “bad,” so identifying an edge that has failed does not allow the protocol to permanently exclude that edge: it may come back up at any time (and indeed it could form a crucial link on the path connecting the sender and receiver that the conforming assumption guarantees). In sum, we cannot hope to control or alter the behavior of the edge-scheduling adversary, but must come up with a protocol that works well regardless of the behavior of the ever-present (conforming) edge-scheduling adversary.

By contrast, our protocol will limit the amount of influence the node-controlling adversary has on correctness, throughput, and memory. Specifically, we will show that if a node deviates from the protocol in a sufficiently destructive manner (in a well-defined sense), then our protocol will be able to identify it as corrupted in a timely fashion. Once a corrupt node has been identified, it will be eliminated from the network. Namely, our protocol will call for honest nodes to refuse all communication with nodes that have been identified as corrupt.Footnote 4 Thus, there is an inherent difference in how we handle the edge-scheduling adversary versus how we handle the node-controlling adversary. We can restrict the influence of the latter by eliminating the nodes it has corrupted, while the former must be dealt with in a more enduring manner.

1.1 Previous Work

To motivate the importance of the problem we consider in this paper, and to emphasize the significance of our result, it will be useful to highlight recent works in related areas. To date, routing protocols that consider adversarial networks have been of two main flavors: End-to-End Communication protocols that consider dynamic topologies (a notion captured by our “edge-scheduling adversary”), and Fault Detection and Localization protocols, which handle devious behavior of nodes (as modeled by our “node-controlling adversary”).

End-to-End Communication

One of the research directions most relevant to our paper is the notion of End-to-End communication in distributed networks, considered by a number of authors, including Afek, Gafni and Rosen [2], Afek and Gafni [1], Awerbuch, Mansour and Shavit [6], Afek, Awerbuch, Gafni, Mansour, Rosen, and Shavit [3], and Kushilevitz, Ostrovsky and Rosen [14]. Indeed, our starting point is the Slide protocolFootnote 5 developed in these works. It was designed to perform end-to-end communication with bounded memory in a model where (using our terminology) an edge-scheduling adversary controls the edges (subject to the constraint that there is no permanent cut between the sender and receiver). The Slide protocol has proven to be incredibly useful in a variety of settings, including multi-commodity flow (Awerbuch and Leighton [5]) and in developing routing protocols that compete well (in terms of packet loss) against an online bursty adversary ([4]). However, prior to our work there was no version of the Slide protocol that could handle malicious behavior of the nodes. A comparison of various versions of the Slide protocol and our protocol is featured in Table 1 of Sect. 1.3 below.

Table 1. Comparison of our protocol to related existing protocols and folklore. Above, n represents the number of nodes in the network, and P is the size of a packet.

Fault Detection and Localization Protocols

At the other end, there have been a number of works that explore the possibility of a node-controlling adversary that can corrupt nodes. In particular, there is a line of work that considers a network consisting of a single path from the sender to the receiver, culminating in the recent work of Barak, Goldberg and Xiao [8] (for further background on fault localization see references therein). In this model, the adversary can corrupt any node on the path (except the sender and receiver) in a dynamic and malicious manner. Since corrupting any node on the path will sever the honest connection between S and R, the goal of a protocol in this model is not to guarantee that all messages sent to R are received. Instead, the goal is to detect faults when they occur and to localize the fault to a single edge.

There have been many results that provide Fault Detection (FD) and Fault Localization (FL) in this model. Barak et al. [8] formalize the definitions in this model and the notion of a secure FD/FL protocol, and provide lower bounds on the communication complexity required to guarantee accurate fault detection/localization in the presence of a node-controlling adversary. While the Barak et al. paper has a similar flavor to our paper, we emphasize that their protocol does not seek to guarantee successful or efficient routing between the sender and receiver. Instead, their proof of security guarantees that if a packet is deleted, malicious nodes cannot collude to convince S that no fault occurred, nor can they persuade S into believing that the fault occurred on an honest edge. Localizing the fault in their paper relies on cryptographic tools, and in particular the assumption that one-way functions exist. Although utilizing these tools (such as MACs or Signature Schemes) increases communication cost, it is shown by Goldberg, Xiao, Barak, and Rexford [12] that the existence of a protocol that is able to securely detect faults (in the presence of a node-controlling adversary) implies the existence of one-way functions, and it is shown in Barak et al. [8] that any protocol that is able to securely localize faults necessarily requires the intermediate nodes to have a trusted setup. The proofs of these results do not rely on the fact that there is a single path between S and R, and we can therefore extend them to the more general network encountered in our model to justify our use of cryptographic tools and a trusted-setup assumption (i.e. PKI) to identify malicious behavior.

Another paper that addresses routing in the Byzantine setting is the work of Awerbuch, Holmer, Nita-Rotaru and Rubens [7], though this paper does not have a fully formal treatment of security, and indeed a counter-example that challenges its security is discussed in the appendix of [8].

Error Correction in the Active Setting

Due to space considerations, we will not be able to give a comprehensive account of all the work in this area. Instead we highlight some of the most relevant works and point out how they differ from our setting and results. For a lengthy treatment of error-correcting codes against polynomially bounded adversaries, we refer to the work of Micali et al. [15] and references therein. It is important to note that this work deals with a graph that has a single “noisy” edge, as modeled by an adversary who can partially control and modify information that crosses that edge. In particular, it does not address throughput efficiency or memory considerations in a full communication network, nor does it account for malicious behavior at the vertices. Also of relevance is the work of Rajagopalan and Schulman on error-correcting network coding [17], where they show how to correct noisy edges during distributed computation. Their work does not consider actively malicious nodes nor the question of throughput-efficient network routing, and thus is different from our setting. It should also be noted that their work utilizes Schulman’s tree-codes [18] that allow length-flexible online error correction. The important difference between our work and that of Schulman is that we do not restrict the amount of malicious activity of corrupt nodes.

1.2 Subsequent Work

Subsequent to the submission of this work, there has been a line of research that investigates the feasibility of routing in a network model that is more general than the network model considered here. Specifically, the network model of Bunn and Ostrovsky [10] is susceptible to both edge failures and corruptible nodes (as is the case in the present work), but is different from the model presented in Sect. 3 because of fewer assumptions placed on the overall connectivity of the network: there are no connectivity guarantees, and the network is asynchronous (i.e. there is no universal clock to coordinate transmissions across each network link).

Direct comparison of the present results to those in [10] is not possible, as the less restrictive network model of [10] requires a different metric for evaluating performance of protocols within that model. For example, the lack of any assumption on network connectivity means that all edges of the network may be inactive at all times, and thus all protocols would achieve zero throughput. Instead, competitive analysis is used in [10] to compare a given protocol’s throughput performance to an imaginary protocol that makes optimal routing decisions, based on the network’s actual behavior. Under this metric of performance, it is shown in [10] that for any (deterministic) protocol, there exists network behavior for which that protocol delivers a factor of n fewer packets than an ideal protocol could have delivered under the same network behavior; in other words, the best achievable “competitive-ratio” is 1/n (here, n is the number of nodes in the network). A specific protocol that achieves competitive-ratio of 1/n is also presented in [10].

Thus, the greater generality of the network considered in [10] comes at a cost: in practice, a competitive-ratio of 1/n is unacceptable. Since 1/n is shown to be the best achievable competitive-ratio in the network model of [10], developing protocols with better performance guarantees is only possible if additional assumptions are placed on network behavior.

While the different network model and definition of throughput make direct comparison impossible, many of the techniques used by the protocol in [10] to combat the combined efforts of the edge-scheduling and node-controlling adversaries are extensions of the work presented here.

1.3 Our Results

Prior to our work, no protocol had been demonstrated to be provably secure in networks that are simultaneously susceptible to faults occurring due to edge failures and faults occurring due to malicious activity of corrupt nodes.Footnote 6 The end-to-end communication works are not secure when the nodes are allowed to become corrupted by a node-controlling adversary, and the fault detection and localization works focus on a single path for some duration of time, and do not consider a fully distributed routing protocol that utilizes the entire network and attempts to maximize throughput efficiency while guaranteeing correctness in the presence of edge failures and corrupt nodes. Indeed, our work answers one of the open questions posed in the Barak et al. paper regarding fault localization on multiple paths. In this paper we bridge the gap between these two research areas and obtain the first routing protocol simultaneously secure against both an edge-scheduling adversary and a node-controlling adversary, even if these two adversaries attack the network using an arbitrary coordinated poly-time strategy. Furthermore, our protocol is distributed (local control) and achieves throughput and processor-memory efficiency comparable to state-of-the-art protocols that are not secure against a node-controlling adversary. An informal statement of our result and comparison of our protocol to existing protocols can be found below. Although not included in the table, we emphasize that the linear transmission rate that we achieve (assuming at least \(n^5\) packets are sent) is asymptotically optimal, as any protocol operating in a network with a single path connecting sender and receiver can do no better than one packet per round.

A Routing Theorem For Adversarial Networks (Informal): If one-way functions exist, then for any n-node graph and k sufficiently large, there exists a trusted-setup, local-control protocol that achieves the following properties in networks susceptible to any poly-time conforming Edge-Scheduling and Node-Controlling Adversaries:

  • Correctness. Achieves correctness with all but negligible (in k) probability of failure

  • Throughput. Achieves linear throughput: for any \(x \in \Omega(n^5)\), transmits x packets in O(x) rounds

  • Memory. Requires \(O(n^4(k + \log n))\) memory per processor

While the protocol introduced in this paper provides an important first step in establishing the feasibility of throughput-efficient routing in highly unreliable networks, we emphasize that our protocol falls short of providing a practical routing solution for the following reasons. First, the protocol introduced in Sect. 5 is fairly complex, requiring multiple pages of pseudo-code to describe it in detail. Second, our protocol assumes a synchronous network, and extending it to asynchronous network settings remains an open problem. Finally, our protocol is expensive in terms of both processor memory and computation. As shown in Table 1 and discussed in Sect. 2, both the complexity and the processor costs are the price of guaranteeing optimal throughput (otherwise one could, e.g., use the Flooding + Signatures protocol). As this is primarily a feasibility result, we have put at a premium the requirement of provably correct routing in networks that are susceptible to deliberate, malicious attacks on any subset of the nodes and edges. It is worth noting that in practice one typically values protocols that require less processor memory and computation and enjoy greater simplicity and/or observed throughput, characteristics that can be obtained by relaxing the corruption model and/or removing the requirement of perfect correctness.

2 Challenges and Naïve Solutions

The Slide protocol (and its variants) has been studied in a variety of theoretical contexts, including multi-commodity flow (Awerbuch and Leighton [5]) and networks controlled by an online bursty adversary (Aiello et al. [4]).

Before proceeding, it will be useful to consider a couple of naïve solutions that achieve the goal of correctness (but perform poorly in terms of throughput), and help to illustrate some of the technical challenges that our theorem resolves. Consider the approach of having the sender continuously flood Footnote 7 a single signed packet into the network for n rounds. Since the conforming assumption guarantees that the network provides a path between the sender and receiver through honest nodes at every round, this packet will reach the receiver within n rounds, regardless of adversarial interference. After n rounds, the sender can begin flooding the network with the next packet, and so forth.Footnote 8 Notice that this solution will require each processor to store and continuously broadcast a single packet at any time, and hence this solution achieves excellent efficiency in terms of processor memory. However, notice that the throughput rate is sub-linear, namely after x rounds, only O(x/n) packets have been received by the receiver.
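To make the baseline concrete, here is a minimal sketch of this flooding approach; the `active_neighbors`, `sign`, and `verify` hooks are placeholders for the network interface and the signature scheme, and are assumptions made for illustration rather than part of any protocol specified in this paper.

```python
# Hedged sketch (not the paper's protocol): flood one signed packet at a time.
# `active_neighbors`, `sign`, and `verify` are hypothetical hooks.

def flood_baseline(packets, sender, receiver, n, active_neighbors, sign, verify):
    delivered = []
    for p in packets:
        held = {sender: sign(p)}          # every node stores at most one signed packet
        for rnd in range(n):              # conforming assumption: R is reached within n rounds
            broadcasts = []
            for node, pkt in held.items():
                for nbr in active_neighbors(node, rnd):   # edges that are "up" this round
                    broadcasts.append((nbr, pkt))
            for nbr, pkt in broadcasts:
                if verify(pkt):           # discard anything not signed by the sender
                    held[nbr] = pkt
        if receiver in held:
            delivered.append(held[receiver])
    return delivered                      # throughput: one packet every n rounds
```

Note that each node stores a single packet at a time (excellent memory), but a new packet leaves the sender only once every n rounds, which is exactly the sub-linear O(x/n) throughput described above.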

One idea to try to improve the throughput rate might be to have the sender streamline the process, sending packets with ever-increasing sequence numbers without waiting for n rounds before sending the next packet. In particular, across each of his edges the sender will send every packet Footnote 9 once, waiting only for the neighboring node’s confirmation of receipt before sending the next packet across that edge. The protocol calls for the internal nodes to act similarly. Analysis of this approach shows that not only has the attempt to improve throughput failed (it is still O(x/n) in the worst-case scenarioFootnote 10), but additionally this modification requires arbitrarily largeFootnote 11 processor memory, since achieving correctness in the dynamic topology of the graph will force the nodes to remember all of the packets they see until they broadcast them across all adjacent edges or have seen confirmation of their receipt from the receiver.

2.1 Challenges in Dealing with Node-Controlling Adversaries

In this section, we discuss some potential strategies that the node-controlling and edge-scheduling adversariesFootnote 12 may employ to disrupt network communication. Although our theorem will hold in the presence of arbitrary malicious activity of the adversarially controlled nodes (except with negligible probability), it will be instructive to list a few obvious forms of devious behavior that our protocol must protect against. It is important to stress that this list is not intended to be exhaustive. Indeed, we do not claim to know all the specific ways an arbitrary polynomially bounded adversary may force nodes to deviate from a given protocol; rather, in this paper we prove that our protocol is secure against all possible deviations.

  • Packet Deletion/Modification. Instead of forwarding a packet, a corrupt node “drops it to the floor” (i.e. deletes it or effectively deletes it by forever storing it in memory), or modifies the packet before passing it on. Another manifestation arises when the sender/receiver requests fault-localization information from the internal nodes, such as documentation of their interactions with neighbors. A corrupt node can then block or modify information that passes through it in an attempt to hide malicious activity or to implicate an honest node.

  • Introduction of Junk/Duplicate Packets. The adversary can attempt to disrupt communication flow and “jam” the network by having corrupted nodes introduce junk packets or re-broadcast old packets. Notice that junk packets can be handled by using cryptographic signatures to prevent introduction of “new” packets, but this does not control the re-transmission of old, correctly signed packets.

  • Disobedience of Transfer Rules. If the protocol specifies how nodes should make decisions on where to send packets, etc., then corrupt nodes can disregard these rules. This includes “lying” to adjacent nodes about their current state.

  • Coordination of Edge Failures. The edge-scheduling adversary can attempt to disrupt communication flow by scheduling edge failures in any manner that is consistent with the conforming criterion. Coordinating edge failures can be used to impede correctness, memory, and throughput in various ways: e.g. packets may become lost across a failed edge, stuck at a suddenly isolated node, or arrive at the receiver out of order. A separate issue arises concerning fault localization: when the sender/receiver requests documentation from the internal nodes, the edge-scheduling adversary can slow progress of this information, as well as attempt to protect corrupt nodes by allowing them to “play dead” (setting all of their adjacent edges to be down), so that incriminating evidence cannot reach the sender.

2.2 Highlights of Our Solution

Our starting point is the Slide protocol [3], which has enjoyed practical success in networks with dynamic topologies, but is not secure against nodes that are allowed to behave maliciously. We chose it because of its proven ability to work well in networks with dynamic topology (frequent edge failures); see e.g. [2, 6], and [3]. Furthermore, the protocol has proven readily extendible to more complex network settings (which was important for our goal of extending to networks whose nodes cannot be trusted), such as multi-commodity flow [5], networks controlled by an adversary inserting packets in an attempt to jam the network [4], and networks with more stringent demands on available processor memory [14].

We provide a detailed description of our version of the Slide protocol in Sect. 4, but highlight the main ideas here. Begin by viewing the edges in the graph as consisting of two directed edges, and associate to each end of a directed edge a stack data-structure able to hold 2n packets and to be maintained by the node at that end. The protocol specifies the following simple, local condition for transferring a packet across a directed edge: if there are more packets in the stack at the originating end than the terminating end, transfer a packet across the edge. Similarly, within a node’s local stacks, packets are shuffled to average out the stack heights along each of its edges. Intuitively, packet movement is analogous to the flow of water: high stacks create a pressure that force packets to “flow” to neighboring lower stacks. At the source, the sender maintains the pressure by filling his outgoing stacks (as long as there is room) while the receiver relieves pressure by consuming packets and keeping his stacks empty. Loosely speaking, packets traveling to nodes “near” the sender will therefore require a very large potential, packets traveling to nodes near the receiver will require a small potential, and packet transfers near intermediate nodes will require packets to have a moderate potential. Assuming these potential requirements exist, packets will pass from the sender with a high potential, and then “flow” downwards across nodes requiring less potential, all the way to the receiver.
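As a rough illustration (with simplified data structures that are ours, not the pseudo-code of Sect. 4 or the appendices), the local transfer and re-balancing rules can be sketched as follows:

```python
# Rough sketch of Slide's height-based rules, with stacks as plain Python
# lists; the actual protocol is specified in Sect. 4 and Appendix A.

def maybe_transfer(sending_stack, receiving_stack, capacity):
    """Transfer one packet across a directed edge iff the sending stack is higher."""
    if sending_stack and len(sending_stack) > len(receiving_stack) and len(receiving_stack) < capacity:
        receiving_stack.append(sending_stack.pop())    # packets "flow" from high to low

def rebalance_within_node(stacks):
    """Shuffle packets among one node's stacks so their heights even out."""
    stacks = sorted(stacks, key=len)
    while len(stacks[-1]) - len(stacks[0]) >= 2:
        stacks[0].append(stacks[-1].pop())
        stacks = sorted(stacks, key=len)

# Example: a sender-side stack holding 7 packets facing a neighbor's stack
# holding 3 releases one packet per round until the heights equalize.
```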

Because the Slide protocol provides a fully distributed protocol that works well against an edge-scheduling adversary, our starting point is to extend the protocol by using digital signaturesFootnote 13 to provide resilience against Byzantine attacks and arbitrary malicious behavior of corrupt nodes. This proved to be a highly nontrivial task that required us to develop a lot of additional machinery, both in terms of additional protocol ideas and novel techniques for proving correctness. We give a detailed explanation of our techniques in Sect. 5 and pseudo-code in Appendix C, as well as providing proofs of security in Appendix D. However, below we first give a sample of some of the key ideas we used in ensuring our additional machinery would be provably secure against a node-controlling adversary, and yet not significantly affect throughput or memory, compared to the original Slide protocol:

  • Addressing the “Coordination of Edge-Scheduling” Issues. In the absence of a node-controlling adversary, previous versions of the Slide protocol (e.g. [3]) are secure and efficient against an edge-scheduling adversary. It will therefore be useful to discuss how these earlier versions handled the challenges posed by faulty edges and a dynamic topology. See Sect. 4 for a thorough description of the Slide protocol.

    For the Slide protocol, the total capacity of the stack data-structures is bounded by \(4n^3\) packets. That is, each of the n nodes can hold at most 2n packets in each of its 2n stacks (one along each directed edge) at any time.

    • To handle the loss of packets due to an edge going down while transmitting a packet, a node is required to maintain a copy of each packet it transmits along an edge until it receives confirmation from the neighbor of successful receipt.

    • To handle packets becoming stuck in some internal node’s stack due to edge failures, error correction is utilized. In particular, the sequence of packets \(\{p_1, p_2, \ldots\}\) is partitioned into messages, with each message containing \(\Theta(n^3)\) packets. The messages are then expanded into codewords, and the codewords are divided into codeword packets of size P. The transformation from packets to codeword packets has the property that the receiver can “decode” a message (thus obtaining the original \(\Theta(n^3)\) packets corresponding to the message) even if it is missing a fraction of the codeword packets. In particular, if an error-correcting code tolerating a λ fraction of faults is utilized, then since the capacity of the network is \(4n^3\) packets, if the sender is able to pump \(4n^3/\lambda\) codeword packets into the network and there is no malicious deletion or modification of packets, then the receiver will necessarily have received enough packets to decode the message.

    • The Slide protocol has a natural bound on memory per processor of \(O(n^2 \log n)\) bits, where the bottleneck is the possibility of a node holding up to \(2n^2\) packets in its stacks, each packet requiring \(O(\log n)\) bits to describe its position in the code.

    Of course, these techniques are only valid if nodes are acting honestly, which leads us to our first extension idea.

  • Handling Packet Modification and Introduction of Junk Packets. Before inserting any packets into the network, the sender will authenticate each packet using his digital signature, and intermediate nodes and the receiver never accept or forward packets not appropriately signed. This simultaneously prevents honest nodes from becoming bogged down with junk packets and ensures that, once the receiver has obtained enough authenticated packets to decode, a node-controlling adversary cannot impede the successful decoding of the message, as the integrity of the codeword packets is guaranteed by the unforgeability of the sender’s signature.

  • Fault Detection. In the absence of a node-controlling adversary, our protocol looks almost identical to the Data Dispersal Algorithm (a variant of Slide) of [3], with the addition of signatures that accompany all interactions between two nodes. First, the sender attempts to pump the \(4n^3/\lambda\) codeword packets of the first message into the network, with packet movement exactly as in the original Slide protocol. We consider all possible outcomes:

    1. The sender is able to insert all codeword packets and the receiver is able to decode. In this case, the message was transmitted successfully, and our protocol proceeds to begin transferring the next message.

    2. The sender is able to insert all codeword packets, but the receiver has not received enough to decode. In this case, the receiver floods the network with a single-bit message indicating packet deletion has occurred.

    3. The sender is able to insert all codeword packets, but the receiver cannot decode because he has received duplicated packets. Although the sender’s authenticating signature guarantees the receiver will not receive junk or modified packets, a corrupt node is able to duplicate valid packets. Therefore, the receiver may receive enough packets to decode, but cannot because he has received duplicates. In this case, the receiver floods the network with a single packet indicating the label of a duplicated packet.

    4. After some amount of time, the sender still has not inserted all codeword packets. In this case, the duplication of old packets is so severe that the network has become jammed, and the sender is prevented from inserting packets even along the honest path that the conforming assumption guarantees. If the sender believes the jamming cannot be accounted for by edge failures alone, he will halt transmission and move to localizing a corrupt node.Footnote 14 One contribution this paper makes is to prove a lower bound on the insertion rate of the sender for the Slide protocol in the absence of the node-controlling adversary. This bound not only alerts the sender when the jamming he is experiencing exceeds what can be expected in the absence of corrupt nodes, but it also provides a mechanism for localizing the offending node(s).

    The above four cases exhaust all possibilities. Furthermore, if the transmission of a message is not successful, the sender is not only able to detect the fact that malicious activity has occurred, but he is also able to distinguish the form of the malicious activity, i.e. which of cases 2–4 he is in. Meanwhile, in the first case, our protocol enjoys (within a constant factor) the same throughput rate as the original Slide protocol.

  • Fault Localization. Once a fault has been detected, it remains to describe how to localize the problem to the offending node. To this end, we use digital signatures to achieve a new mechanism we call “Routing with Responsibility.” By forcing nodes to sign key parts of every communication with their neighbors during the transfer of packets, they can later be held accountable for their actions. In particular, once the sender has identified the reason for failure (cases 2–4 above), he will request all internal nodes to return testimonies, which are signatures on the relevant parts of the communication with their neighbors. These testimonies consist of three pieces of information: (1) The net number of packets a node has transferred with each neighbor; (2) For each packet p, the net number of times a node has transferred p with each neighbor; and (3) An integer representing the net “potential drop” a node has had with each neighbor (roughly, this integer relates the relative number of packets each node was storing each time a packet is transferred between them). Note that in terms of memory, the cost of storing these testimonies is \(O(n^4)\), controlled by item (2), which has each node storing \(O(n^3)\) signatures from each neighbor (there are \(O(n^3)\) relevant packets p). A sketch of such a testimony record appears after this list.

    We prove that no matter what the reason for failure, if the sender has the complete testimony from every node, he can with overwhelming probability identify and eliminate a corrupt node. Of course, malicious nodes may choose not to send incriminating information. We handle this separately (see the “Blacklist” bullet below).

  • Processor Memory. The signatures on the communication that a node has with its neighbors for the purpose of fault localization are a burden on the memory required of each processor that is not encountered in the original Slide protocol. One major challenge was to reduce the amount of signed information each node must maintain as much as possible, while still guaranteeing that each node maintains “enough” information to identify a corrupt node in the case of arbitrary malicious activity leading to a failure of type 2–4 above. Theorem 5.1 in Sect. 5 demonstrates that the extra memory required of our protocol is a factor of \(n^2\) higher than that of the original Slide protocol.

  • Incomplete Information. As already mentioned, we show that regardless of the reason of failure 2–4 above, once the sender receives the testimonies from every node, a corrupt node can be identified. However, this relies on the sender obtaining all of the relevant information; the absence of even a single node’s information can prevent the localization of a fault. We address this challenge in the following ways:

    1. We minimize the amount of information the sender requires of each node, so that a node need not be connected to the sender for very many rounds in order for the sender to get its information. Specifically, regardless of the reason for failure 2–4 above, a testimony consists of only n pieces of information from each node: one packet for each of its edges.

    2. If the sender does not have the n pieces of information from a node, it cannot afford to wait indefinitely. After all, the edge-scheduling adversary may keep the node disconnected indefinitely, or a corrupt node may simply refuse to respond. For this purpose, we create a blacklist for non-responding nodes, which will disallow them from transferring codeword packets in the future. This way, anytime the receiver fails to decode a codeword as in cases 2–4 above, the sender can request the information he needs, blacklist nodes not responding within some short amount of time, and then re-attempt to transmit the codeword using only non-blacklisted nodes. Nodes should not transfer codeword packets to blacklisted nodes, but they do still communicate with them to transfer the information the sender has requested. If a new transmission again fails, the sender will only need to request information from nodes that were participating, i.e. he will not need to collect new information from blacklisted nodes (although the nodes will remain blacklisted until the sender gets the original information he requested of them). Nodes will be removed from the blacklist and re-allowed to route codeword packets as soon as the sender receives their information.

  • The Blacklist. Blacklisting nodes is a delicate matter; we want to place malicious nodes “playing dead” on this list, while at the same time we do not want honest nodes that are temporarily disconnected to remain on this list for too long. We show in Theorem 5.2 and Lemma D.31 that the occasional honest node that gets put on the blacklist will not significantly hinder packet transmission. Intuitively, this is true because any honest node that is an important link between the sender and receiver will not remain on the blacklist for very long, as its connection to the sender guarantees the sender will receive all requested information from the node in a timely manner.

    Ultimately, the blacklist allows us to control the amount of malicious activity a single corrupt node can contribute to. Indeed, we show that each failed message transmission (cases 2–4 above) can be localized (eventually) to (at least) one corrupt node. More precisely, the blacklist allows us to argue that malicious activity can cause at most n failed transmissions before a corrupt node can necessarily be identified and eliminated. Since there are at most n corrupt nodes, this bounds the number of failed transmissions at \(n^2\). The result of this is that other than at most \(n^2\) failed message transmissions, our protocol enjoys the same throughput efficiency as the original Slide protocol. The statement of this fact can be found in Theorem 5.2 in Sect. 5.
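The following sketch illustrates the kind of per-neighbor testimony record described in the Fault Localization bullet above; the field and method names are ours for exposition and do not appear in the protocol.

```python
# Hypothetical sketch of the per-neighbor "testimony" a node accumulates for
# fault localization; field and method names are illustrative only.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Testimony:
    net_packets: int = 0                 # (1) net number of packets exchanged with this neighbor
    per_packet: Dict[str, int] = field(default_factory=dict)   # (2) net transfers of each packet p
    potential_drop: int = 0              # (3) net "potential drop" across this edge

    def record_transfer(self, packet_id: str, direction: int, height_gap: int) -> None:
        """direction = +1 for a packet sent, -1 for a packet received;
        height_gap feeds the (signed) potential-drop tally."""
        self.net_packets += direction
        self.per_packet[packet_id] = self.per_packet.get(packet_id, 0) + direction
        self.potential_drop += direction * height_gap
```

In the protocol itself each of these quantities is accompanied by the neighbor's signature, which is what allows the sender to hold nodes accountable for the values they report.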

3 The Formal Model

It will be useful to describe two models in this section, one in the presence of an edge-scheduling adversary (all nodes act “honestly”), and one in the presence of an adversary who may “corrupt” some of the nodes in the network. In Sect. 4 we present an efficient protocol (“Slide”) that works well in the edge-scheduling adversarial model, and then extend this protocol in Sect. 5 to work in the additional presence of the node-controlling adversary.

3.1 The Edge-Scheduling Adversarial Model

We model a communication network by an undirected graph G=(V,E), where |V|=n. Each vertex (or node) represents a processor that is capable of storing information (in its buffers) and passing information to other nodes along the edges. We assume a synchronous network, so that there is a universal clock that each node has access to.Footnote 15 The global time is divided into discrete chunks, called rounds, each of which consists of two equal intervals of unit time called stages, and all nodes are synchronized in terms of when each stage begins and ends.

We do not assume that the topology of the graph is fixed or known by the nodes. Rather, we assume a complete graph on n vertices, where some of the edges are “up” and some are “down”, and the status of each edge can change dynamically at any time. We assume a fixed bandwidth/capacity P for each edge, so that an edge that is “up” during a stage can transmit up to P bits of information across it. Our protocol of Sect. 4 requires that \(P \in \Omega(\log n)\), while the protocol of Sect. 5 requires that \(P \in \Omega(k + \log n)\), where k is the security parameter (discussed in Sect. 3.2 below).

The network is local control, so that the only information that nodes have concerning the state of the network comes from the local communication they have with their neighbors across each edge. During each stage, each node first makes a decision (based on packets it has received in previous stages) about which packets to send across each edge, then it sends these packets, and finally the node receives packets that were sent to it (across edges that were “up” during that stage). In this paper, the only constraint we place on the processors concerns memory; we ignore computation costs and assume that the computation required of each node at the start of each stage (to make its routing decisions) happens instantaneously.

There are two designated vertices, called the sender S and the receiver R, who wish to communicate with each other through this network. We assume the sender has bundled the information he wishes to send the receiver into a sequence of “packets” \(\{p_1, p_2, \ldots\}\) of size at most P. The sole purpose of the network is to transmit the messages from S to R, so S is the only node that introduces new messages into the network, and R is the only node that removes them from the network (although below we introduce a node-controlling adversary who may corrupt the intermediate nodes and attempt to disrupt the network by illegally deleting/introducing messages). As mentioned in the Introduction, the three commodities we care about are Correctness, Throughput, and Processor Memory. We define each of these notions in terms of the network model:

  1. Correctness. A protocol is correct if the sequence of packets output by the receiver is a prefix of the packets sent by the sender, without duplication or omission.

  2. Throughput (Rate). This measures the number of packets that the receiver has obtained as a function of the number of rounds that have passed.

  3. Processor Memory. This measures the memory required of each node by the protocol, i.e. the maximum number of packets and/or control information (measured in bits) an internal node may be required to store at any moment in time. Memory may be a function of the size of the network n, but is independent of the number of packets to be transferred.

Although the edges in our model are bidirectional, it will be useful to consider each link as consisting of two directed edges. Except for the conforming restriction (see below), we allow the edges of our network to fail and resurrect arbitrarily. We model this via an Edge-Scheduling Adversary, who controls the edges of the network: at any time, the edge-scheduling adversary controls whether an edge is able to deliver a packet or not. Note that the edge-scheduling adversary only controls the status of the edges, i.e. he cannot duplicate or alter the packets that pass across the edges (aside from preventing them from being delivered by deactivating an edge). Below, we introduce a second adversary that will be able to modify and duplicate packets (as well as many other forms of destructive behavior); we describe a protocol that handles a coordinated attack by both adversaries in Sect. 5.

We say that an edge is active during a given stage/round if the edge-scheduling adversary allows that edge to remain “up” for the entirety of that stage/round. We impose one restriction on the edge-scheduling adversary:

Definition 3.1

An edge-scheduling adversary is conforming if for every round of the protocol, there exists at least one path between S and R consisting of edges that are active for the entirety of the round.

For a given round t, we will refer to the path guaranteed by the conforming assumption as the active path of round t. Notice that although the conforming assumption guarantees the existence of an active path for each round, it is not assumed that any node (including S and R) is aware of what that path is. Furthermore, this path may change from one round to the next. The edge-scheduling adversary cannot affect the network in any way other than controlling the status of the edges. In the next section, we introduce a node-controlling adversary who can take control of the nodes of the network.Footnote 16

3.2 The Node-Controlling + Edge-Scheduling Adversarial Model

This model begins with the edge-scheduling adversarial model described above, and adds a polynomially bounded Node-Controlling Adversary that is capable of corrupting nodes in the network. The node-controlling adversary is malicious, meaning that the adversary can take complete control over the nodes he corrupts, and can therefore force them to deviate from any protocol in whatever manner he likes. We further assume that the adversary is dynamic, which means that he can corrupt nodes at any stage of the protocol, deciding which nodes to corrupt based on what he has observed thus far.Footnote 17 For a thorough discussion of these notions, see [13] and references therein.

As in Multi-Party Computation (MPC) literature, we will need to specify an “access-structure” for the adversary:

Definition 3.2

A node-controlling adversary is conforming if he does not corrupt any nodes who have been or will be a part of any round’s active path.

Apart from this restriction, the node-controlling adversary may corrupt whoever he likes (i.e. it is not a threshold adversary). Note that the conforming assumption implicitly demands that S and R are incorruptible, since they are always a part of any active path. Also, this restriction on the adversary is really more a statement about when our results remain valid. This is similar to e.g. threshold adversary models, where the results are only valid if the number of corrupted nodes does not exceed some threshold value t. Once corrupted, a node is forever under the control of the node-controlling adversary (although the adversary may choose to have the node behave honestly).

Notice that because correctness, throughput, and memory are the only commodities that our model values, an honest-but-curious adversary is completely benign, as privacy does not need to be protectedFootnote 18 (indeed, any intermediate node is able to read any packet that is passed through it). Our techniques for preventing/detecting malicious behavior will be to incorporate a digital signature scheme that will serve the dual purpose of validating information that is passed between nodes, as well as holding nodes accountable for information that their signature committed them to.

We assume that there is a Public-Key Infrastructure (PKI) that allows digital signatures. In particular, before the protocol begins we choose a security parameter k sufficiently large and run a key generation algorithm for a digital signature scheme, producing n=|G| (secret key, verification key) pairs \((sk_N, vk_N)\). As output to the key generation, each processor \(N \in G\) is given its own private signing key \(sk_N\) and a list of all n signature verification keys \(vk_{\widehat{N}}\) for all nodes \(\widehat{N} \in G\). In particular, this allows the sender and receiver to sign messages to each other that cannot be forged (except with negligible probability in the security parameter) by any other node in the system.
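For concreteness, this trusted setup can be pictured as follows; the sketch uses Ed25519 signatures from the Python `cryptography` package purely as an example, whereas the protocol only assumes some existentially unforgeable signature scheme.

```python
# Illustrative trusted setup; the concrete scheme (Ed25519) is an assumption
# made for the example, not mandated by the protocol.

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

def trusted_setup(node_ids):
    """Generate (sk_N, vk_N) for every node; all nodes learn every vk_N."""
    signing_keys = {N: Ed25519PrivateKey.generate() for N in node_ids}
    verification_keys = {N: sk.public_key() for N, sk in signing_keys.items()}
    return signing_keys, verification_keys

def sign(sk, payload: bytes) -> bytes:
    return sk.sign(payload)

def verify(vk, signature: bytes, payload: bytes) -> bool:
    try:
        vk.verify(signature, payload)
        return True
    except InvalidSignature:
        return False
```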

4 Routing Protocol in the Edge-Scheduling Adversarial Model

In this section we formally describe our edge-scheduling protocol, which is essentially the protocol of [3] (we have modified the specifics in order to fit our network model and allow analysis in this model, but have not changed the original protocol in any substantial way). As mentioned in the Introduction, this protocol is motivated by the Slide protocol developed in [2, 6], and [3], and as such we will refer to the protocol presented in this section as “Slide.”

4.1 Definitions and High-Level Ideas

Recall from Sect. 3.1 that the goal of the protocol is to transmit a sequence of packets \(\{p_1, p_2, \ldots\}\) of size P from the sender S to the receiver R.

To allow for packets to become stuck in isolated nodes (which may be possible based on the dynamic topology of the network graph, as controlled by the edge-scheduling adversary), we will utilize error correction (see e.g. [13]).Footnote 19 Specifically, the packets \(\{p_1, p_2, \ldots\}\) are first grouped together into messages \(\{m_1, m_2, \ldots\}\), with each message consisting of \(\frac{6 \sigma n^{3}}{\lambda }\) packets. The messages are then expanded by a factor of 1/σ into codewords \(\{c_1, c_2, \ldots\}\), and each codeword is partitioned into codeword packets of size P to be transmitted through the network from S to R. We emphasize the distinction between the original packets that the sender is ultimately trying to get to the receiver, versus the codeword packets that are the actual contents relayed through the network. We will frequently use the terminology “packet” when referring to “codeword packets,” relying on context to disambiguate which type of packet we mean.

We assume that the codewords are formed as part of the setup of our protocol. Given the above construction of codewords, we emphasize the following quantity of codeword packets per codeword:

$$ D := \frac{6n^3}{\lambda} = \mbox{number of (codeword) packets per codeword}. $$
(1)

Note that the only “noise” in our network results from undelivered packets or out-dated packets (in the edge-scheduling adversarial model, any packet that R receives has not been altered). Therefore, since each codeword consists of \(D = \frac{6n^{3}}{\lambda}\) packets, by definition of λ, if R receives \((1-\lambda)D = (1-\lambda ) (\frac{6n^{3}}{\lambda} )\) packets corresponding to the same codeword, he will be able to decode:

Fact 1

If the receiver has obtained \(D-6n^{3} = (1-\lambda ) (\frac{6n^{3}}{\lambda} )\) packets from any codeword, he will be able to decode the codeword to obtain the corresponding message.
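The parameter arithmetic behind Fact 1 can be summarized in a few lines; the numeric values at the end are only an illustrative choice of n, σ, and λ, not parameters mandated by the protocol.

```python
# Worked illustration of the coding parameters; n, sigma (expansion rate),
# and lam (tolerable fraction of missing packets) are example values only.

def codeword_parameters(n, sigma, lam):
    message_packets = 6 * sigma * n**3 / lam   # packets grouped into one message
    D = 6 * n**3 / lam                         # codeword packets per codeword, Eq. (1)
    decode_threshold = D - 6 * n**3            # = (1 - lam) * D packets suffice (Fact 1)
    return message_packets, D, decode_threshold

# e.g. n = 10, sigma = 1/2, lam = 1/4 gives:
#   message_packets = 12,000;  D = 24,000;  decode_threshold = 18,000
```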

Each node will maintain a stack (i.e. FILO buffers) along each of its (directed) edges that can hold up to 2n packets concurrently. Because our model allows for edges to go up/down, we force each node to keep incoming and outgoing buffers for every possible edge, even if that edge is not part of the graph at the outset.

We introduce now the notion of height of a buffer, which will be used to determine when packets are transferred and how packets are shuffled between the internal buffers of a node between rounds.

Definition 4.1

The height of an incoming/outgoing buffer is the number of packets currently stored in that buffer. Also, the height of a packet refers to its position in the buffer/stack, i.e. one plus the number of packets below it.

The presence of an edge-scheduling adversary that can force edges to fail at any time complicates the interaction between the nodes. Note that our model does not assume that the nodes are aware of the status of any of their adjacent edges, so failed edges can only be detected when information that was supposed to be passed along the edge does not arrive. We handle edge failures as follows. First, the incoming/outgoing buffers at either end of an edge will be given a status (normal or problem). Also, to account for a packet that may be lost due to edge failure during transmission across that edge, a node at the receiving end of a failed edge may have to leave room in its corresponding incoming buffer.Footnote 20 We refer to this gap as a ghost packet, but emphasize that the height of an incoming buffer is not affected by ghost packets (by definition, height only counts packets that are present in the buffer). Similarly, when a sending node “sends” a packet across an edge, it actually only sends a copy of the packet, leaving the original packet in its outgoing buffer. We will refer to the original copy left in the outgoing buffer as a flagged packet, and note that flagged packets continue to contribute to the height of an outgoing buffer until they are deleted.
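The buffer state just described (status bits, flagged packets, ghost packets) can be pictured with the following simplified sketch; the class and method names are illustrative assumptions, not the data structures of Appendix A.

```python
# Simplified sketch of per-edge buffer state; names are illustrative only.

class OutgoingBuffer:
    def __init__(self, n):
        self.capacity = 2 * n
        self.slots = []                 # stack of packets; flagged packet stays until confirmed
        self.status = "normal"
        self.flagged = None

    def send_copy(self):
        """Send a copy of the top packet; keep the original as a flagged packet."""
        self.flagged = self.slots[-1]
        return self.flagged

    def confirm_receipt(self):
        if self.flagged in self.slots:
            self.slots.remove(self.flagged)   # receipt confirmed: delete the original
        self.flagged, self.status = None, "normal"

class IncomingBuffer:
    def __init__(self, n):
        self.capacity = 2 * n
        self.slots = []                 # height counts only packets actually present
        self.status = "normal"
        self.ghost_slot = False         # gap reserved for a packet possibly lost in transit

    def expect_packet(self):
        self.ghost_slot, self.status = True, "problem"

    def accept(self, packet):
        self.slots.append(packet)
        self.ghost_slot, self.status = False, "normal"
```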

Codewords are transferred sequentially, so that at any time, the sender is only inserting packets corresponding to a single codeword. We refer to the rounds for which the sender is inserting codeword packets corresponding to the i-th codeword as the i-th transmission. Lemma 4.9 below states that after the sender has inserted \(D - 2n^3\) packets corresponding to the same codeword, the receiver can necessarily decode. Therefore, when the sender has inserted \(D - 2n^3\) packets of some codeword, he will clear his outgoing buffers and begin distributing packets corresponding to the next codeword.
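Putting the pieces together, the sender's side of a transmission can be sketched as below; `refill`, `clear`, and `run_round` are hypothetical hooks into the Routing Phase of Sect. 4.2, and the loop bound is the \(D - 2n^3\) insertion threshold just mentioned.

```python
# Sketch of how the sender sequences transmissions; the three hooks are
# assumptions standing in for the actual Routing Phase of Sect. 4.2.

def sender_transmissions(codewords, n, lam, refill, clear, run_round):
    D = 6 * n**3 / lam                       # codeword packets per codeword, Eq. (1)
    for codeword in codewords:               # the i-th loop iteration is the i-th transmission
        inserted = 0
        while inserted < D - 2 * n**3:       # Lemma 4.9: this many insertions guarantee decoding
            refill(codeword)                 # top up outgoing buffers whenever there is room
            inserted += run_round()          # one synchronous round; returns packets newly accepted
        clear()                              # empty outgoing buffers, move to the next codeword
```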

4.2 Detailed Description of the Edge-Scheduling Protocol

We describe now the two main parts of the edge-scheduling adversarial routing protocol: the Setup and the Routing Phase. See Appendix A for pseudo-code.

Setup

Each internal (i.e. not S or R) node has the following buffers:

  1. Incoming Buffers. Recall that we view each bidirectional edge as consisting of two directed edges. Then for each incoming edge, a node will have a buffer that has the capacity to hold 2n packets at any given time. Additionally, each incoming buffer will be able to store a “Status” bit (‘0’ for “normal” and ‘1’ for “problem”), the label of the “Last-Received” packet, and the “Round-Received” index (the round in which this incoming buffer last accepted a packet, see Definition 4.4 below). The way that this additional information is used will be described in the “Routing Rules for Receiving Node” section below.

  2. Outgoing Buffers. For each outgoing edge, a node will have a buffer that can hold up to 2n packets at any given time. Like incoming buffers, each outgoing buffer will also be able to store a status bit, the index label of one packet (called the “Flagged” packet), and a “Problem-Round” index (index of the most recent round in which the status bit switched to ‘1’).

The receiver will only have incoming buffers (with capacity of one) and a large Storage Buffer that can hold up to D packets. Similarly, the sender has only outgoing buffers (with capacity 2n) and the input stream of packets, which are clustered into messages and expanded into codewords. The codeword packets for the current codeword are distributed to the sender’s outgoing buffers whenever there is room for them there.

Also as part of the Setup, all nodes learn the relevant parameters (P, n, λ, and σ).

Routing Phase

As indicated in Sect. 3.1, we assume a synchronous network, so that there are well-defined rounds in which information is passed between nodes. Each round consists of two units of time, called Stages. The formal treatment of the Routing Phase can be found in the pseudo-code in Appendix A. Informally, Fig. 1 below considers a directed edge E(A,B) from A (including A=S) to B (including B=R), and describes what communication each node sends in each stage.

Fig. 1. Description of communication exchange along directed edge E(A,B) during the Routing Phase of any round.

In addition to this communication, each node must update its internal state based on the communication it receives. In particular, from the communication A receives from B in Stage 1 of any round, A can determine if B has received the most recent packet A sent. If so, A will delete this packet and switch the status of the outgoing buffer along this edge to “normal.” If not, A will keep the packet as a flagged packet, and switch the status of the outgoing buffer along this edge to “problem.” At the other end, if B does not receive A’s Stage 1 communication or B does not receive a packet it was expecting from A in Stage 2, then B will leave a gap in its incoming buffer (termed a “ghost packet”) and will switch this buffer’s status to “problem.” On the other hand, if B successfully receives a packet in Stage 2, it will switch the buffer back to “normal” status.
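A compact sketch of this per-round bookkeeping, reusing the illustrative buffer classes from Sect. 4.1 above (the `acknowledges` check on the Stage 1 reply is a placeholder for the actual confirmation message of Appendix A):

```python
# Sketch of the per-round bookkeeping on directed edge E(A,B); message
# formats are placeholders, not the pseudo-code of Appendix A.

def update_sender_side(out_buf, stage1_reply):
    """A learns from B's Stage 1 communication whether its last packet arrived."""
    if stage1_reply is not None and stage1_reply.acknowledges(out_buf.flagged):
        out_buf.confirm_receipt()            # delete the flagged packet, status -> "normal"
    else:
        out_buf.status = "problem"           # keep the flagged copy; the edge may have failed

def update_receiver_side(in_buf, stage1_msg, stage2_packet):
    """B reacts to a missing Stage 1 message or a missing Stage 2 packet."""
    if stage1_msg is None or stage2_packet is None:
        in_buf.expect_packet()               # leave a ghost slot, status -> "problem"
    else:
        in_buf.accept(stage2_packet)         # store the packet, status -> "normal"
```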

Re-Shuffle Rules

At the end of each round, nodes will shuffle the packets they are holding according to the following rules:

  1. Take a packet from the fullest buffer and shuffle it to the emptiest buffer, provided the difference in height is at least two (respectively one) when the packet is moved between two buffers of the same type (respectively when the packet moves from an incoming buffer to an outgoing buffer). Packets will never be re-shuffled from an outgoing buffer to an incoming buffer. If two (or more) buffers are tied for having the most packets, then a packet will preferentially be chosen from incoming buffers over outgoing buffers. Conversely, if two (or more) buffers are tied for the emptiest buffer, then a packet will preferentially be given to outgoing buffers over incoming buffers. Ties for which buffer to choose are broken in a round-robin fashion.

  2. Repeat the above step until the difference between the fullest buffer and the emptiest buffer does not meet the criterion outlined in Step 1.

Recall that when a packet is shuffled locally between two buffers, packets travel in a FILO manner, so that the top-most packet of one buffer is shuffled to the top spot of the next buffer. When an outgoing buffer has a flagged packet or an incoming buffer has a ghost packet, we use instead the following modifications to the above re-shuffle rules. Recall that in terms of measuring a buffer’s height, flagged packets are counted but ghost packets are not.

  • Outgoing buffers do not shuffle flagged packets. In particular, if Rule 1 above selects an outgoing buffer as the source of a transfer, the top-most non-flagged packet will be shuffled. This may mean that a gap is created between the flagged packet and the next non-flagged packet.

  • Incoming buffers do not re-shuffle ghost packets. In particular, ghost packets will remain in the incoming buffer that created them, although we do allow a ghost packet to slide down within its incoming buffer during re-shuffling. Also, packets shuffled into an incoming buffer are not allowed to occupy the same slot as a ghost packetFootnote 21 (they will take the first non-occupied slot).

The sender and receiver have special rules for re-shuffling packets. Namely, during the re-shuffle phase the sender will fill each of his outgoing buffers (in an arbitrary order) with packets corresponding to the current codeword. Meanwhile, the receiver will empty all of its incoming buffers into its storage buffer. If at any time R has received enough packets to decode the current codeword (Fact 1 says this amount is at most D−6n^3), then R outputs the corresponding message, and deletes all packets corresponding to this codeword from its storage buffer (also, R will not store any packets that it receives in future rounds that correspond to this codeword).
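The re-shuffle step can be summarized by the following Python sketch, a simplification that omits flagged and ghost packets and the round-robin tie-breaking, and that simply stops (rather than searching for an alternative pair) when the fullest buffer is outgoing and the emptiest is incoming; the authoritative rules are the pseudo-code of Appendix A:

def reshuffle(incoming, outgoing):
    # incoming, outgoing: lists of buffers; each buffer is a FILO list of packets.
    while True:
        labeled = [(b, "in") for b in incoming] + [(b, "out") for b in outgoing]
        # Fullest buffer, preferring incoming on ties; emptiest, preferring outgoing.
        src, src_kind = max(labeled, key=lambda x: (len(x[0]), x[1] == "in"))
        dst, dst_kind = min(labeled, key=lambda x: (len(x[0]), x[1] == "in"))
        if src_kind == "out" and dst_kind == "in":
            break                  # packets never move from outgoing to incoming
        needed = 1 if (src_kind == "in" and dst_kind == "out") else 2
        if len(src) - len(dst) < needed:
            break                  # buffers are balanced; nothing left to shuffle
        dst.append(src.pop())      # top packet of src becomes top packet of dst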

4.3 Analysis of the Edge-Scheduling Adversarial Protocol

We now evaluate our edge-scheduling protocol in terms of our three measurements of performance: correctness, throughput, and processor memory.Footnote 22 The following theorem concerns the memory requirements of our edge-scheduling protocol, which is bottle-necked by the O(n^2) packets that each internal node has the capacity to store in its buffers.

Theorem 4.2

The edge-scheduling protocol described in Sect. 4.2 (and in the pseudo-code in Appendix A) requires at most O(n^2 P) bits of memory of the internal processors.

Proof

Each internal node needs to hold at most O(n^2) packets of size P at any time (nodes have 2(n−2) buffers, each able to hold 2n packets).Footnote 23 □
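For concreteness, the constants behind this bound are

$$ \underbrace{2(n-2)}_{\text{buffers per node}} \;\cdot\; \underbrace{2n}_{\text{packets per buffer}} \;\cdot\; \underbrace{P}_{\text{bits per packet}} \;\le\; 4n^2P \;=\; O\bigl(n^2P\bigr). $$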

The throughput standard expressed in Theorem 4.3 below will serve an additional purpose when we move to the node-controlling adversary setting: the sender will know that malicious activity has occurred when the throughput standard of Theorem 4.3 is not observed. Note that the theorem below implicitly states that our edge-scheduling protocol is correct.

Theorem 4.3

The transmission of every message m_i takes at most 3D rounds; that is, within 3D rounds of the time the sender begins inserting packets corresponding to any codeword c_i, the receiver will have received enough codeword packets to decode the message. Thus, for any x>3D, after x rounds R has received Ω(x) packets from the input stream of packets {p_1,p_2,…}, and thus our edge-scheduling adversarial protocol enjoys a linear throughput rate.

The first statement of the theorem implies that for any \(y \in\mathbb {N}\), after 3yD rounds R will have received at least y messages. This in turn implies the second statement, since each message comprises Θ(n^3) packets from the original packet stream and \(D=\frac{6n^{3}}{\lambda}=\varTheta(n^{3})\). The proof of the first statement of Theorem 4.3 is rather involved, and will require many lemmas and subclaims that follow from the Routing and Re-Shuffle Rules of Sect. 4.2. We sketch the proof of this theorem below. Pseudo-code, as well as all technical proofs relying heavily on the pseudo-code, can be found in Appendices A and B. We begin with the following definitions:

Definition 4.4

We will say that a packet is accepted by a buffer B in round t if B receives and stores that packet in round t, either due to a packet transfer or re-shuffling.

Definition 4.5

We say that the sender inserts a packet into the network in round t if any internal node (or R) accepts the packet (as in Definition 4.4) in round t. Note that this definition does not require that S receives the verification of receipt, so S may not be aware that a packet was inserted.

Definition 4.6

The sender is blocked in some round t if he is not able to insert any packet (in the sense of Definition 4.5) during round t. Let β_T denote the number of rounds in a transmission T in which the sender was blocked.

The following definition formalizes the notion of “potential,” and will be necessary to prove throughput-performance bounds. A good way to think about potential is to imagine each packet contributes to a buffer’s potential by an amount proportional to the height of the packet in the buffer. This way, when a packet is transferred from (the top of, since packets are transferred FILO) one buffer to (the top of) another buffer, there will be a drop in overall potential (the sending node will decrease in potential by an amount greater than the increase in potential to the receiving node, based on the routing rules of the Slide protocol). This net potential drop for each packet transfer will be important for arguing a linear throughput rate, see e.g. Lemma 4.11.

Definition 4.7

For any buffer B not belonging to S or R that has height h at time t, define the potential of B at time t, denoted by \(\varPhi^{B}_{t}\), to be

$$ \varPhi^B_t := \sum_{i=1}^h i = \frac{h(h+1)}{2}. $$

For any internal node N∈G∖{R,S}, define the node's potential \(\varPhi^{N}_{t}\) to be the sum of its buffers' potentials:

$$ \varPhi^N_t := \sum_{\text{Buffers $B$ of $N$}} \varPhi^B_t. $$

Define the network potential Φ_t at time t to be the sum of all the internal buffers' potentials:

$$ \varPhi_t := \sum_{N \in G \setminus\{R,S\}}\varPhi^N_t. $$

It will be useful to break an internal node’s potential into two parts. The first part, which we term packet duplication potential, is the sum of the heights of the flagged packets in the node’s outgoing buffers that have already been accepted by the neighboring node (as in Definition 4.4). Recall that a flagged packet is a packet that was sent along an outgoing edge, but the sending node is maintaining a copy of the packet until it gets confirmation of receipt. Therefore, the contribution of packet duplication potential to overall network potential is the extraneous potential; it represents the over-counting of duplicated packets. We emphasize that not all flagged packets count towards packet duplication potential, since packets are flagged as soon as the sending node decides to send a packet, but the flagged packet’s height does not count towards packet duplication potential until the receiving node has accepted the packet (which may happen in a later round or not at all).

The other part of network potential will be termed non-duplicated potential, and is the sum of the heights of all non-flagged packets together with flagged packets that have not yet been accepted. Note that the separation of potential into these two parts is purely for analysis of the Slide protocol; indeed the nodes are not able to determine if a given flagged packet contributes to packet duplication or non-duplicated potential. For convenience, we will often refer to (network) non-duplicated potential simply as (network) potential (the meaning should be clear from context).

Notice that when a node accepts a packet, its own (non-duplicated) potential instantaneously increases by the height that this packet assumes in the relevant incoming buffer. Meanwhile, the sending node's non-duplicated potential drops by the height that the packet occupied in its outgoing buffer, and there is a simultaneous and equivalent increase in this sending node's packet duplication potential. Note that without the notion of duplicated potential, a packet transfer could cause a net increase in network potential. In particular, Lemma 4.11 would no longer be valid, and the proof of Theorem 4.3 would become more difficult.
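For concreteness, suppose the top packet of A's outgoing buffer sits at height h and B's corresponding incoming buffer currently holds g < h packets, so that the protocol permits a transfer. Upon acceptance the packet assumes height g+1 ≤ h, and the net change in non-duplicated potential is

$$ \varDelta\varPhi_{\text{non-dup}} \;=\; \underbrace{(g+1)}_{\text{gain at the incoming buffer}} \;-\; \underbrace{h}_{\text{loss at the outgoing buffer}} \;\le\; 0. $$

For example, a packet leaving height 7 and landing at height 4 lowers non-duplicated potential by 3, while A's packet duplication potential rises by 7 until the flagged copy is deleted.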

Definition 4.8

The height of a packet in an incoming/outgoing buffer is the spot it occupies in that buffer.

We now (restate and) prove the main theorem of Sect. 4.3 (the lemmas required in the proof will be stated within the proof, and proven after the proof of the theorem).

Theorem 4.3

The transmission of every message m_i takes at most 3D rounds; that is, within 3D rounds of the time the sender begins inserting packets corresponding to any codeword c_i, the receiver will have received enough codeword packets to decode the message. Thus, for any x>3D, after x rounds R has received Ω(x) packets from the input stream of packets {p_1,p_2,…}, and thus our edge-scheduling adversarial protocol enjoys a linear throughput rate.

Proof

The second statement follows from the first, as discussed above, so it remains to prove the first statement. Let t denote the round that S first tries to insert packets corresponding to a new codeword c_i into the network. Considering the rounds between t and t+3D, we apply the pigeonhole principle to argue that either D rounds pass in which S can insert a packet, or 2D rounds pass in which no packets are inserted. In the former case, R can decode by Lemma 4.9:

Lemma 4.9. If at any time D−2n^3 distinct packets corresponding to some codeword c_i have been inserted into the network, then R can decode message m_i.

It remains to prove the theorem in the latter case. Note that the network non-duplicated potential drops by at least n in each of the 2D rounds in which no packets are inserted (a total drop of 2nD) by Lemma 4.10:

Lemma 4.10. If at any point in any transmission T, the number of blocked rounds is β_T, then there has been a decrease in the network's non-duplicated potential of at least nβ_T.

Meanwhile, the increase to network potential between t and t+3D caused by duplicated potential is at most 2n^3−8n^2+8n by Lemma 4.11:

Lemma 4.11. Every change in network potential comes from one of the following three events:

  1. 1.

    S inserts a packet into the network.

  2. 2.

    R receives a packet.

  3. 3.

    A packet that was sent from one internal node to another is accepted; the verification of packet receipt is received by the sending node; a packet is shuffled between buffers of the same node; or a packet is moved within a buffer.

Furthermore, changes in network potential due to item (1) are strictly non-negative and changes due to item (2) are strictly non-positive. Also, changes in network non-duplicated potential due to item (3) are strictly non-positive. Finally, at all times, network packet duplication potential is bounded between zero and 2n^3−8n^2+8n.

Combining these two facts, we have that (not counting changes in potential caused by packet insertions) the network potential drops by at least 2nD−2n^3+8n^2−8n between t and t+3D. Since network potential can never be negative, we must account for this (non-duplicated) potential drop with positive contributions to potential change.

Note that the potential already in the network at the start of round t can account for at most 4n^4−14n^3+8n^2+8n of this drop:

Claim. The maximum amount of potential (see Definition 4.7) in the internal buffers of the network at any time is 2n(2n+1)(n−2)^2.

Proof. A buffer contributes the most to network potential when it is full, in which case it contributes \(\sum_{i=1}^{2n} i = n(2n+1)\). Since there are 2(n−2) buffers per internal node, and n−2 internal nodes, the maximum amount of potential in the internal buffers is as claimed. □

Therefore, packet insertions must account for the remaining change in potential of (2nD−2n^3+8n^2−8n)−(4n^4−14n^3+8n^2+8n)=2nD−4n^4+(12n^3−16n)≥2nD−4n^4 (where the last inequality assumes n≥3). Lemma 4.11 (stated above) also indicates that the only way network potential can increase (other than the contribution of packet duplication potential, which has already been accounted for) is when S inserts a packet (a maximum increase of 2n per packet), so it must be that S inserted at least (2nD−4n^4)/2n=D−2n^3 packets into the network between t and t+3D, and again R can decode by Lemma 4.9 (stated above). □

The following Lemma will bound the number of rounds that S needs to insert packets corresponding to the same codeword.

Lemma 4.9

If at any time D−2n^3 distinct packets corresponding to some codeword c_i have been inserted into the network, then R can decode message m_i.

Proof

The following claim guarantees that every packet that has been inserted into the network has either reached R or is in the incoming/outgoing buffer of an internal node:

Claim. Before the end of any transmission T, any packet that was inserted into the network during T is either in some buffer (perhaps as a flagged packet) or has been received by R.

Proof. Our protocol dictates that when a node sends a packet to a neighboring node, it maintains a copy of the packet until it gets confirmation of receipt from the neighbor. We restate this as Claim B.14 in Appendix B, where it is proven in terms of the pseudo-code. □

The maximum number of packets that can be stored across the incoming/outgoing buffers of all internal nodes is bounded by 4n^3: Each node has (n−2) outgoing buffers (one to each node except itself and S) and (n−2) incoming buffers (one from each node except itself and R), and thus a total of 2(n−2) buffers. Each of these buffers has capacity 2n, and there are n−2 internal nodes, so the internal buffer capacity of the network is 4n(n−2)^2. Therefore, if D−2n^3 distinct packets corresponding to c_i have been inserted, then R has necessarily received at least \(D - 6n^{3} = (1-\lambda) (\frac {6n^{3}}{\lambda} )\) of these, and so R can decode message m_i by Fact 1. □
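In other words, the count in the proof above is simply

$$ \underbrace{\bigl(D-2n^3\bigr)}_{\text{packets inserted}} \;-\; \underbrace{4n(n-2)^2}_{\text{at most this many held in internal buffers}} \;\ge\; D-2n^3-4n^3 \;=\; D-6n^3 \;=\; (1-\lambda)D, $$

which by Fact 1 suffices for R to decode.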

The following Lemma will be useful in bounding the number of rounds in which no packets are inserted.

Lemma 4.10

If at any point in any transmission T, the number of blocked rounds is β_T, then there has been a decrease in the network's non-duplicated potential of at least nβ_T.

Proof

The idea of the proof is to argue that each blocked round creates a drop in non-duplicated potential of at least n as follows. If the sender is blocked from inserting a packet, the node N adjacent to the sender (along the guaranteed active path for that round) will necessarily have a full incoming buffer along its edge to the sender. The following claim states that buffers are balanced at all times, and hence all of N’s outgoing buffers are also full:

Claim. After re-shuffling (and hence at the very end/beginning of each round), all of the buffers of each node are balanced. In particular, no incoming buffer has height strictly bigger than any outgoing buffer, and the difference in height between any two buffers is at most one.

Proof. The Re-Shuffle rules dictate that if there is ever a buffer whose height is at least two bigger than another buffer, then a packet will be shuffled from the higher buffer to the lower one. Similarly, packets are preferentially taken from incoming buffers and shuffled to outgoing buffers if there is ever an incoming buffer with larger height than an outgoing buffer. We restate this as Claim B.3 in Appendix B, where it is proven in terms of the pseudo-code. □

Meanwhile, at the opposite end of the active honest path, the node adjacent to the receiver will necessarily send a packet to the receiver if there is anything in its outgoing buffer along this edge, and this will result in a drop of potential of whatever height the packet had in the outgoing buffer.

Therefore, near the front-end of the active honest path, the buffers are full, while at the end of the path, a packet will be transferred to height zero (in the receiver’s buffer). Intuitively, it therefore seems that tracking all packet movements along the active honest path should result in a drop of potential of at least 2n. As the counter-example in the footnote shows,Footnote 24 this argument does not work exactly (we are only guaranteed a drop of n), but the structure of the proof is guided by this intuition.

This lemma is restated in Lemma B.18 in Appendix B, where we prove it based on the pseudo-code of our protocol. □

Lemma 4.11

Every change in network potential comes from one of the following three events:

  1. 1.

    S inserts a packet into the network.

  2. 2.

    R receives a packet.

  3. 3.

    A packet that was sent from one internal node to another is accepted; the verification of packet receipt is received by the sending node; a packet is shuffled between buffers of the same node; or a packet is moved within a buffer.

Furthermore, changes in network potential due to item (1) are strictly non-negative and changes due to item (2) are strictly non-positive. Also, changes in network non-duplicated potential due to item (3) are strictly non-positive. Finally, at all times, network packet duplication potential is bounded between zero and 2n^3−8n^2+8n.

Proof

Since network potential counts the heights of the internal nodes’ buffers, it only changes when these heights change, which in turn happens exclusively when there is packet movement. Note that all packet movement falls under one of two categories: (1) Packets transferred between two nodes; and (2) Packets re-shuffled between the buffers on one node. Both of these fall under one of the three items listed in the lemma, thus proving the first statement in the lemma.

That network potential changes due to packet insertion by S are strictly non-negative is obvious (either the receiving node's potential increases by the height the packet assumed, or the receiving node is R and the packet does not contribute to potential). Similarly, that potential change upon packet receipt by R is strictly non-positive is clear, since packets at R do not count towards potential (see Definition 4.7). Also, since only flagged packets (but not necessarily all of them) contribute to network packet duplication potential, the largest value it can attain is the maximum height of a flagged packet times the maximal possible number of flagged packets in the network. By the fact that outgoing buffers have at most one flagged packet at any time,Footnote 25 there are at most (n−2)^2 flagged packets in the network at any given time, and each one has maximal height 2n (the maximum capacity of each buffer), so network packet duplication potential is bounded by 2n^3−8n^2+8n.

It remains to prove that changes in network non-duplicated potential due to item (3) are strictly non-positive. To do this, we look at all places where there is packet movement in our protocol, and argue that each will result in a non-positive change to non-duplicated potential. Clearly potential changes caused by re-shuffling packets are non-positive, since the re-shuffle rules dictate that packets will only be re-shuffled if they decrease in height or stay the same height in the new buffer.

Meanwhile, when a packet is sent across an edge between two nodes, there are two possibilities: the receiving node accepts the packet (as in Definition 4.4), or the receiving node has already accepted this packet (but the sending node is sending it again because it never got confirmation of receipt). In the latter case, the packet is not stored by the receiving node, and the sending node’s copy contributes towards duplicated potential within the outgoing buffer, so non-duplicated potential is not affected. In the former case, the flagged packet in the sending node’s outgoing buffer still counts towards non-duplication potential. Notice that upon receipt there are two changes to network non-duplicated potential: it increases by the height the packet assumes in the incoming buffer it arrived at, and it decreases by the height the packet had in the corresponding outgoing buffer (this decrease is because the flagged packet in the outgoing buffer will count towards packet duplication potential instead of non-duplicated potential the instant the packet is accepted). The decrease outweighs the increase since the packet’s height in the incoming buffer is less than or equal to the height it had in the corresponding outgoing buffer, which follows from the protocol rules which specify a packet transfer should occur if and only if the sending buffer has height strictly larger than the receiving buffer.Footnote 26

This lemma is restated in Lemma B.4 in Appendix B, where we prove it based on the pseudo-code of our protocol. □

5 Routing Against a (Node-Controlling + Edge-Scheduling) Adversary

In this section we introduce a routing protocol for networks susceptible to both edge failures and corruptible nodes. The protocol will be an extension of the Slide protocol presented in Sect. 4, with added mechanisms to handle the fact that nodes cannot be relied upon to behave honestly (i.e. they may deviate from protocol specifications). We will refer to this protocol as “Mal-Slide,” to emphasize it is (as will be proven below) secure against a malicious node-controlling adversary.

The main difference between the Slide protocol of Sect. 4 and the Mal-Slide protocol will be the introduction of control information, which will contain the relevant pieces of (signed) information necessary to identify a corrupt node. Notice that the conforming assumption placed on the node-controlling adversary implicitly states that the sender and receiver are incorruptible. The Mal-Slide protocol makes use of this by dictating that these two nodes (and primarily the sender) bear the burden of detecting faults (transmission failures caused by misbehaving nodes), and localizing the faults to the offending node(s). In particular, the control information used to identify a corrupt node will be collected by the sender, and we must allocate system resources (e.g. processor memory and bandwidth) to store this information and transfer it through the network to the sender.

Intuitively, the Mal-Slide protocol can be thought of as cycling through two disjoint phases:

Routing Phase. :

Codeword packets are transferred through the network from the sender to the receiver, using the same protocol rules as in the Slide protocol of Sect. 4. The only two differences will be: (1) All codeword packets are signed by the sender upon insertion (for authentication, i.e. to protect against packet modification and/or the insertion of junk packets), and this signature accompanies the codeword packet until it is received by the receiver; and (2) One packet of “control information” is piggy-backed onto the transfer of codeword packets, in a manner to be described below.

Regulation Phase. :

When a message transmission fails (the receiver could not decode the current codeword after the Routing Phase), the Regulation Phase begins. The internal nodes send back the control information to the sender, who uses it to identify a corrupt node. Once identified, corrupt nodes are eliminated from the network, meaning all nodes are forbidden to communicate in any way with corrupt nodes.

In actuality, these two phases happen concurrently, so that the Mal-Slide protocol does not waste time waiting for the control information to make its way back to the sender before beginning the next Routing Phase. Instead, the control information will be transmitted to the sender by piggy-backing it onto the ordinary communication of each round, in a manner to be described below.

In the next four subsections, we describe more precisely the Mal-Slide protocol, and in particular exactly what control information is collected, how it is transferred back through the network to the sender, and how this information is used by the sender to identify corrupt nodes. We then state and prove in Sect. 5.5 theorems concerning the correctness, throughput, and memory requirements of the Mal-Slide protocol.

5.1 Control Information

Anytime two nodes transfer a codeword packet between them, they will also transfer (signed) control information that contains values the nodes have been storing and updating through the course of the transmission. In particular, for any pair of nodes (A,B) and directed edge E(A,B):

  1. 1.

    Nodes A and B maintain a running tally of the total number of codeword packets A has sent to B in the current transmission.

  2. 2.

    Every time A sends a packet to B, the packet had some height h in A's outgoing buffer, and the packet assumed some height h′ in B's incoming buffer. The height difference h−h′ represents the potential drop that resulted due to the packet transfer (see Definition 4.7), which was shown in Lemma 4.11 to be always non-negative (assuming honestly behaving nodes). Nodes A and B maintain a running tally of the cumulative potential drop as a result of codeword packets A has sent to B in the current transmission.

  3. 3.

    For every codeword packet p that A has sent to B in the current transmission, A and B maintain a running tally of the total number of times A has passed this specific packet to B.

Node A sends the updated value of each of these three quantities, together with a timestamp indicating the current round and transmission and his signature on all of these items, during Stage Two communication with B (see Fig. 2 below). Note that the values that are sent by A reflect the most recent values of the three quantities in (1)–(3), i.e. the changes made to these quantities based on the current packet being transferred have been updated in the values A sends to B.
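As a rough illustration (field and helper names are hypothetical; the precise format is fixed by the pseudo-code of Appendix C), the control information A signs and sends to B in Stage 2 can be pictured as follows:

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ControlPacket:
    round_index: int                     # timestamp: current round ...
    transmission_index: int              # ... and current transmission
    packets_sent: int = 0                # type (1): codeword packets A -> B so far
    potential_drop: int = 0              # type (2): cumulative drop h - h' over those transfers
    per_packet_counts: Dict[str, int] = field(default_factory=dict)   # type (3): times each packet crossed E(A,B)
    signature: bytes = b""               # A's signature over all of the above

def update_on_transfer(cp, packet_id, h, h_prime, sign):
    # Called by A when it sends one codeword packet to B, from height h to height h'.
    cp.packets_sent += 1
    cp.potential_drop += h - h_prime
    cp.per_packet_counts[packet_id] = cp.per_packet_counts.get(packet_id, 0) + 1
    cp.signature = sign(cp)              # sign the updated values (hypothetical signing routine)
    return cp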

Fig. 2. Communication exchange along directed edge E(A,B) during some round.

Although the routing of codeword packets in the Mal-Slide protocol is the same as the basic Slide protocol, because corrupt nodes are not guaranteed to behave honestly in the present network model, Theorem 4.3 (which guaranteed the receiver could decode the message after 3D rounds) is no longer valid. In particular, after the 3D rounds of a transmission in the Mal-Slide protocol, the receiver might not have received enough (valid) codeword packets to decode the message. We will refer to such transmissions as failed transmissions. Below we outline three types of malicious behavior that corrupt nodes may engage in to force a transmission to fail, and how the control information can be used to identify a corrupt node in each case. Although this list is in no way intended to be comprehensive of all possible forms of malicious behavior, it turns out that the mechanisms we put in place to handle the following three issues will be enough to handle all forms of malicious behavior (not just the ones listed below).

  1. 1.

    Packet Deletion. Suppose that corrupt nodes refuse to forward on the packets they receive, so that by the time the sender has inserted all D of the codeword packets, the receiver has not received enough of them to decode the message.Footnote 27 In this case, there is necessarily a corrupt node that is deleting packets (or equivalently, storing more packets than it is allowed to). The sender can identify the corrupt node if he can find a node who input x packets and output y packets, where x−y>4n(n−1) (the quantity on the right-hand side of the inequality represents the total capacity of an internal node to store packets: it has n−1 incoming and n−1 outgoing buffers, each of capacity 2n). Control information of type (1) provides exactly this information (provided the sender can collect this information from all nodes).

  2. 2.

    Packet Duplication. Suppose corrupt nodes are duplicating packets in such a manner that keeps nodes “near” the sender (assume for the moment a relatively stable network topology) at full capacity, thus making it impossible for the sender to insert packets (even along the active honest path; recall Definition 3.2). In particular, suppose this strategy prevents the sender from inserting all D codeword packets by the time 3D rounds have passed. This means that there have been at least 2D rounds in which the sender was blocked from inserting any packets (see Definition 4.6). Note that Lemma 4.10 of Sect. 4.3 concerns only nodes along the active honest path, and in particular, since these nodes are honest, the lemma remains valid (it is restated and proven in Lemma D.14 in Appendix D for completeness). This lemma states that these (at least) 2D blocked rounds will cause a recorded potential drop of at least 2nD, and this drop is recorded in the control information of type (2). Since the sender inserted fewer than D packets in this case, the cumulative recorded increase in potential as a result of these insertions is less than 2nD (a single packet insertion can cause potential to rise by at most 2n, since this is the highest height a packet may be stored in an incoming buffer). Since the overall decrease in potential (at least 2nD) outweighs the overall increase (less than 2nD), there is necessarily a node that is responsible for more potential loss than gain. With control information of type (2) collected from all of the nodes, the sender can identify such a node, which is necessarily corrupt.

  3. 3.

    Packet Deletion + Packet Duplication. Suppose that the adversary recognizes he will be caught if he only employs a strategy of packet deletion or packet duplication (as in Cases (1)–(2) above). Instead, the adversary forces corrupt nodes to replace valid packets they receive with old packets they have already sent forward. This way, their actions appear consistent in terms of analyzing the control information of types (1) and (2). Notice that transmissions when this occurs can now fail, despite the fact the sender was able to insert all D packets of the codeword and the receiver got (1−λ)D packets (which is ordinarily what is needed to decode): If too many of the received packets are duplications, then the receiver may not have the (1−λ)D distinct packets required for decoding. In this case there is at least one packet p that the receiver has received more than once. Since packets are never duplicated by honest nodes (they will never resend a packet before getting an acknowledgement of receipt from the node they sent the previous copy to), the sender can identify a corrupt node by finding a node that output the duplicated packet p more times than that node input p. Notice that the sender can identify such a node with control information of type (3) (provided the sender can collect this information from all nodes). (A sketch of all three detection checks, under illustrative assumptions, follows this list.)
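The following Python sketch (illustrative only; Theorems D.28, D.33, and D.34 in Appendix D give the actual procedures) shows how the sender might scan the collected testimonies in each of the three cases. The per-node aggregates used below are assumptions: they are assumed to have already been reconstructed from the signed per-edge testimony packets.

def find_deleter(nodes, n):
    # Case (1) / control info type (1): a node holding more packets than its buffers allow.
    for N in nodes:
        if N["packets_in"] - N["packets_out"] > 4 * n * (n - 1):
            return N
    return None

def find_duplicator(nodes):
    # Case (2) / control info type (2): a node responsible for more potential drop than gain.
    for N in nodes:
        if N["recorded_drop"] > N["recorded_gain"]:
            return N
    return None

def find_replayer(nodes, p):
    # Case (3) / control info type (3): a node that output packet p more times than it input p.
    for N in nodes:
        if N["times_sent"].get(p, 0) > N["times_received"].get(p, 0):
            return N
    return None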

The following four cases clearly cover all possible outcomes for a transmission. Notice that the first case corresponds to a successful transmission, while the latter three are failed transmissions, and they roughly correspond to the three above malicious strategies (F2 corresponds to Packet Duplication, F3 to Packet Deletion, and F4 to Packet Deletion + Duplication).

  1. S1.

    The receiver was able to decode the codeword within 3D rounds

  2. F2.

    The receiver could not decode, and the sender inserted fewer than D packets in 3D rounds

  3. F3.

    The receiver could not decode, the sender inserted D packets, and the receiver did not receive any duplicated codeword packets

  4. F4.

    The receiver could not decode and cases F2 and F3 do not happen. In other words, the sender inserted D packets, and the receiver could not decode because he received at least one duplicated packet p.

Notice that there are two forms of control information: (1) The current information, which consists of the three quantities mentioned in the section above, and which are being updated and signed for every new packet transfer between two neighbors; and (2) Control information pertaining to previous (failed) transmissions, which represent the values that each of the three types of quantities had at the end of the earlier failed transmission in question. We emphasize the difference between these two forms of control information: the former kind is being stored and continuously updated between every pair of nodes, and no attempt is made to transmit this information beyond the two nodes it pertains to. If a transmission fails, then the values that each of the three quantities had at the end of that transmission are locked (i.e. they will not be updated/changed in the next transmission), and now (honest) nodes will attempt to transmit the final values of these quantities through the network back to the sender, where they will be collected and used to identify a corrupt node. To distinguish between these two different forms of control information, we will refer to the information for the current transmission as control packets and the information that corresponds to previous failed transmissions as a node’s testimony for the failed transmission in question.

Notice that a node’s testimony consists of n−1 packets, i.e. one packet for each neighbor. In particular, for a given node A and one of its neighbors B, the node will return in its (signed) testimony the final (value, signature) pairs on the relevant three quantities from the control packets.Footnote 28

In the next section, we describe exactly the protocol rules for how the testimonies are communicated through the network. We state once-and-for-all that if a node ever receives a packet from a neighbor that has faulty information (e.g. a control packet that does not reflect accurate values or has a faulty signature, or a codeword packet that does not carry the sender’s signature), then the node ignores all communication from the offending neighbor for that round (treating the edge as having failed for the entirety of the round).Footnote 29

5.2 Gathering Control Information

As mentioned at the beginning of Sect. 5, intuitively the Mal-Slide protocol can be viewed as cycling between a routing phase and a regulation phase. However, if these two phases are separated in practice, a problem is encountered: How long should the sender wait during the regulation phase to gather the testimonies he requires to identify a corrupt node? If the sender waits for all the information he needs before returning to transmitting the next message, then a set of corrupt nodes can refuse to return their testimonies. Since the sender cannot see the status of the edges of the network, he is unable to determine if these nodes are “playing dead” (refusing to use the links available to them to transmit their testimonies) and thus should be eliminated as corrupt, or if they are honest nodes that are (temporarily) disconnected from the rest of the network. Even though the sender will necessarily be getting some control information back (e.g. from the honest nodes that are “near” him on the active honest path), this information may not be enough to identify a corrupt node. Furthermore, the sender cannot simply eliminate non-responding nodes from the network, as it is possible that they are honest, and that at some point in the future they will form a crucial link on the active honest path of later rounds.

The Mal-Slide protocol avoids the problem of an indefinite regulation phase by performing regulation tasks in conjunction with the actual routing. Figure 2 describes how the routing and regulation phases are combined.

Notice that control packets (control information corresponding to the current transmission) are sent in both directions in each round: B sends one to A during Stage 1, and A sends one to B in Stage 2 (the values B uses for each of the three relevant quantities in the control packet it sends to A are current as of the previous packet that B received from A, whereas the values A uses in the control packet it sends to B are current as of the current packet being sent). A relevant testimony packet that B has (either his own or that of another node) can be sent from B to A in Stage 2. As can be seen from Fig. 2, B may potentially have multiple packets of control information (e.g. testimonies) that he would like to send to A during Stage 2. Since B can only send one packet of control information during Stage 2, we describe the rules for how B determines which packet to send below (after describing the remaining forms of control information found in Fig. 2).

There remains one last change between the Slide protocol of Sect. 4 and the Mal-Slide protocol: the identification of corrupt nodes requires that the sender has collected enough control information from the internal nodes. The procedure for how the Mal-Slide protocol handles the collection of this information, and in particular how it prevents a (set of) corrupt node(s) from delaying the collection of this data by refusing to return information that will implicate themselves, is the content of the next section.

5.3 The Blacklist

As discussed above, when a transmission fails as in Case F2, F3, or F4, the sender will request nodes to return testimonies so that the sender can identify a corrupt node. Loosely speaking, the sender maintains a blacklist consisting of all nodes for which the sender has not yet collected their complete testimony (recall that each node’s testimony consists of n−1 packets, corresponding to the three quantities recorded from the node’s communication with its n−1 neighbors). Blacklisted nodes are not allowed to transfer any codeword packets.

More precisely, when a transmission T fails, all nodes who participated in the transmission (i.e. nodes that were off the blacklist for at least one round during T) are added to the blacklist, and the sender records T as the transmission these nodes were added to the blacklist. A node is not removed from the blacklist until the sender receives the node’s complete testimony. Meanwhile, nodes on the blacklist are not allowed to send or receive any codeword packets.Footnote 30

Notice that each failed transmission has a set of nodes that were blacklisted at the end of that particular transmission: the nodes who were not already on a previous blacklist. However, nodes are never on more than one transmission’s blacklist: if a node was blacklisted for the entirety of a failed transmission T, then it is not added to the blacklist for T. After all, the blacklist serves two purposes: (1) It ensures that corrupt nodes cannot continue to cause transmissions to fail while withholding their testimonies from the sender; and (2) It provides a list of nodes that participated in each failed transmission. As will be shown in the next section, the sender will be able to identify a corrupt node as soon as there exists a failed transmission T in which the sender is able to collect the complete testimony from all nodes that participated in the transmission; i.e. nodes that were not on the blacklist during at least one round of T. Because blacklisted nodes are not supposed to be transferring codeword packets, if a transmission T fails, then the sender does not need the testimony from any node that was already blacklisted for the entirety of T (all values in these testimonies should be zero, since the nodes were not allowed to transfer any codeword packets).

After a failed transmission, every node (except for the sender, but including the receiver) is therefore on the blacklist (they were either added at the end of the transmission that just failed, or they were already on the blacklist for an earlier failed transmission). Note that even though nodes on the blacklist are not allowed to transfer codeword packets, they are allowed to transfer control information (see Fig. 2).
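A minimal sketch of the sender's blacklist bookkeeping follows (data-structure and method names are hypothetical; the actual rules are part of the pseudo-code of Appendix C):

class Blacklist:
    def __init__(self):
        self.entries = {}        # node -> index of the failed transmission it is charged to
        self.testimony = {}      # node -> set of neighbors whose testimony packets have arrived

    def on_failed_transmission(self, T, participants):
        # Add every node that participated in T (i.e. was off the blacklist for
        # at least one round of T); nodes already listed keep their old entry.
        for node in participants:
            if node not in self.entries:
                self.entries[node] = T
                self.testimony[node] = set()

    def on_testimony_packet(self, node, neighbor, all_neighbors):
        # A node is cleared only once its complete testimony (one packet per
        # neighbor, i.e. n-1 packets) has reached the sender.
        if node in self.entries:
            self.testimony[node].add(neighbor)
            if self.testimony[node] >= set(all_neighbors):
                del self.entries[node]
                del self.testimony[node]

    def may_route_codeword_packets(self, node):
        return node not in self.entries   # blacklisted nodes move only control information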

In the following section we collect the ideas from Sects. 5.1–5.3 and describe how each transmission T progresses.

5.4 Overview of the Mal-Slide Protocol

The following steps describe the procedure for every transmission of the Mal-Slide protocol (pseudo-code is given in Appendix C; a condensed sketch of the transmission life-cycle follows this list).

  1. 1.

    The sender begins each transmission by forming the Start Of Transmission (SOT) broadcast. This consists of:

    1. (a)

      A single packet indicating how many total packets will comprise the SOT for the current transmission T (i.e. how many eliminated and blacklisted nodes there are)

    2. (b)

      The list of eliminated nodes

    3. (c)

      The list of blacklisted nodes; this includes for each blacklisted node the transmission in which the node was placed on the blacklist

    As mentioned above, if the previous transmission T−1 failed (as in Cases F2–F4 above), then the sender blacklists all nodes that were not already on the blacklist, and indicates these new nodes were blacklisted due to transmission T−1. Additionally, for nodes that were blacklisted as a result of a transmission that failed as in F4, the sender indicates the packet p for which the receiver got at least two copies in that failed transmission (the sender has access to this information from the End Of Transmission packet, see Item (5) below). Notice that the SOT broadcast consists of at most n packets, as every node (other than the sender) is either eliminated or on at most one blacklist.

    Each of the above three items is timestamped with the current transmission index T and signed by the sender. Notice from Fig. 2 that SOT packets are sent during Stage Two communication.

  2. 2.

    Nodes are not allowed to transfer codeword packets until they have received the SOT for the current transmission. This way, no node is (legally) transferring any packets from the current codeword until it has an updated view of the blacklist and eliminated nodes. When the sender has sent the complete SOT to a neighbor, he may begin inserting codeword packets to that neighbor (assuming the neighbor is not blacklisted).

  3. 3.

    If a node ever learns it has been blacklisted (from the SOT broadcast), it will form its testimony: the final values of the three types of control information from each of its neighbors from the indicated transmission. The n−1 packets of the testimony are then queued for delivery to the sender. The mechanism for transmitting testimonies (and indeed all forms of control information sent in Stage Two, see Fig. 2) back to the sender is flooding: a node will send every testimony packet it has seen (its own testimonies and the testimonies it has collected from its neighbors) across every adjacent edge. We note that because a node is on at most one blacklist at any time, and because the testimony of a single node consists of n−1 packets, a node need store and transfer at most n^2 testimony packets at any time. Also, the priority of sending control information in Stage Two (see Fig. 2) ensures that nodes know the most recent transmission that each blacklisted neighbor was placed on the blacklist before touching any of its neighbors’ testimonies (this way a node knows which of its neighbors’ testimony packets are current and valid).

    We note that sending control information back to the sender via flooding does not affect the overall throughput efficiency of Mal-Slide because there is much less control information than codeword packets: O(n^2) versus Θ(n^3), respectively. In particular, a protocol that employs flooding for codeword packets sent from sender to receiver may suffer a loss of a factor of 1/n in terms of throughput efficiency. However, because there are only n^2 packets of control information that need to be transmitted to the sender in any transmission, the fact that transmissions last Θ(n^3) rounds means that even a loss of 1/n in throughput efficiency for these packets will not impede their ability to reach the sender by the end of the transmission.

  4. 4.

    When the sender receives the complete testimony of a blacklisted node, it removes the node from the blacklist. In particular, the sender creates a (signed and timestamped with the index of the current transmission number) packet indicating the node to be cleared from the blacklist, and queues this packet for delivery with the rest of the control information sent during Stage Two (see Fig. 2). If the sender ever has received enough of the testimonies to eliminate a corrupt node, then he starts the current transmission over again, including in the new SOT broadcast the identity of the node that has just been eliminated.

  5. 5.

    If at any time:

    1. (a)

      The receiver can decode the current codeword, OR

    2. (b)

      The receiver has received a duplicated packet p

    then the receiver forms the End Of Transmission (EOT) packet. This packet contains the label of the duplicated packet p (in Case (5b)), or else a bit indicating successful decoding, and is signed and timestamped by the receiver and queued for delivery with the rest of the control information (see Fig. 2). The EOT packet is used to alert the sender to end the current transmission. We restate now (using current terminology) the four ways a transmission can end. Notice that one of them necessarily happens within 3D rounds of starting a transmission, and hence every transmission (whether successful or failed) lasts at most 3D rounds:

    1. S1.

      Sender receives EOT packet indicating Receiver was able to decode current codeword

    2. F2.

      3D rounds have passed, and Sender has not been able to insert D packets, nor has he received an EOT packet

    3. F3.

      Sender has inserted D packets but has not received an EOT packet

    4. F4.

      Sender receives an EOT packet indicating Receiver got some packet p twice
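The steps above can be condensed into the following Python sketch of one transmission (illustrative only and deliberately coarse: the sender object and its methods are hypothetical stand-ins for the state machines given in the pseudo-code of Appendix C):

def run_transmission(sender, T, D):
    sender.broadcast_SOT(T)                      # Step 1: eliminated nodes + blacklist
    for rnd in range(3 * D):
        sender.insert_codeword_packets(T)        # Step 2: routing as in the Slide protocol
        sender.collect_control_information(T)    # Steps 3-4: testimonies flood back; cleared
                                                 # nodes leave the blacklist, and an identified
                                                 # corrupt node restarts the transmission
        eot = sender.poll_EOT()                  # Step 5: receiver signals decode / duplicate
        if eot is not None and eot.decoded:
            return "S1"                          # success
        if eot is not None and eot.duplicate is not None:
            return "F4"                          # receiver saw some packet p twice
        if sender.packets_inserted(T) == D:
            return "F3"                          # all D packets inserted, but no EOT
    return "F2"                                  # 3D rounds passed with fewer than D insertions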

5.5 Analysis of the Mal-Slide Protocol

In this section we present the main ideas for why the Mal-Slide protocol is secure against the combined attack of the edge-scheduling and node-controlling adversaries, and analyze the performance of the Mal-Slide protocol in terms of correctness, memory, and throughput. Pseudo-code for Mal-Slide can be found in Appendix C, and proofs of all of the below lemmas and theorems are proven in terms of the pseudo-code in Appendix D.

Theorem 5.1

The Mal-Slide protocol requires at most O(n^2 P+n^4(k+log n)) bits of memory of the internal nodes.

Proof

The most memory-intensive cost of the Mal-Slide protocol is the requirement for nodes to store control information of type (3): For every codeword packet p, each node must store the total number of times it has transferred p to each neighbor. There are Θ(n^3) packets per codeword, each node has O(n) neighbors, the packet’s position within the codeword and timestampFootnote 31 can be described using Θ(log n) bits, and it costs Θ(k) bits to store signatures on each of these, so the overall cost of storing this information is O(n^4(k+log n)). Note that the other information a node needs to store (current codeword packets and all other kinds of control information) collectively requires only O(n^2 P) bits of memory, where P=Ω(k+log n). □
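Written out, the dominant term in this count is

$$ \underbrace{\varTheta\bigl(n^3\bigr)}_{\text{packets per codeword}} \;\cdot\; \underbrace{O(n)}_{\text{neighbors}} \;\cdot\; \underbrace{\varTheta(k+\log n)}_{\text{counter, position, timestamp, signature}} \;=\; O\bigl(n^4(k+\log n)\bigr), $$

to which the buffers and the remaining control information add only the O(n^2 P) term.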

The following theorem, which states that the Mal-Slide protocol has the same (asymptotic) throughput rate as the basic Slide protocol of Sect. 4, implicitly guarantees that the Mal-Slide protocol is correct.

Theorem 5.2

Except for at most n^2/2 transmissions that may fail due to malicious activity, the Mal-Slide protocol enjoys linear throughput. More precisely, after x transmissions, the receiver has correctly decoded at least x−n^2/2 messages. If the number of transmissions x is quadratic in n or greater, then the failed transmissions due to adversarial behavior become asymptotically negligible. Since a transmission lasts O(n^3) rounds and each message contains Θ(n^3) packets from the original stream of packets, information is transferred through the network at a linear rate.

Proof

Having the sender sign all packets ensures that the final message that the receiver decodes in each of these transmissions is unaltered (ensuring correctness). We showed at the end of the previous section that every transmission lasts at most 3D=Θ(n^3) rounds. Therefore, transmissions that are successful (Case S1) enjoy a linear throughput rate: Θ(n^3) packets from the original input stream have been received in O(n^3) rounds. It remains to show that there are at most n^2/2 failed transmissions, which follows from the following two lemmas (proven after this proof):

Lemma 5.3.There can be at most n−1 failed transmissions before the sender necessarily has all testimonies corresponding to one of these failed transmissions.

Lemma 5.4.If there is a transmission T for which the sender has collected the complete testimonies from every node that participated in T , then the sender can necessarily identify a corrupt node.

Lemma 5.3 guarantees that there can be at most n−1 failed transmissions before the sender has necessarily collected the complete testimony of every node who participated in one of these transmissions, and then the sender will be able to eliminate a node by Lemma 5.4. This effectively reduces the network to n−1 nodes, and we can repeat this argument to ensure that there can be a total of at most n^2/2 failed transmissions. □
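Concretely, each elimination shrinks the network by one node, so the total number of failed transmissions is at most

$$ (n-1) + (n-2) + \cdots + 1 \;=\; \frac{n(n-1)}{2} \;\le\; \frac{n^2}{2}. $$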

Lemma 5.3

There can be at most n−1 failed transmissions before the sender necessarily has all testimonies corresponding to one of these failed transmissions.

Proof

(Sketch) As discussed in Sect. 5.3, a node can be on at most one blacklist at any time. In particular, if a node participated in some failed transmission T, then the node is not allowed to participate in any future transmission until the sender receives the node’s complete testimony. We begin with the following observation:

Observation.The receiver participates in every transmission, i.e. by the end of every transmission, the sender has received the receiver’s complete testimony (if the receiver was on the blacklist Footnote 32 for the previous transmission).

Proof. (Sketch) This follows from the conforming assumption (which guarantees that every round has an active path going through honest nodes) together with how testimonies are relayed through the network to the sender. In particular, in any transmission there are at most O(n^2) packets of control information (includes SOT, EOT, nodes to remove from the blacklist, and testimony packets). Because the receiver is linked to the sender in each of the 3D rounds of the transmission, its information will take at most O(n^3) rounds to reach the sender, where the constant in the O is small enough to ensure the sender has the receiver’s information by the time 3D rounds have passed. This observation is restated and proven in terms of the pseudo-code in Lemma D.8 of Appendix D. □

Therefore, if there have been n−2 failed transmissions and the sender has not collected all of the testimonies from participating nodes for any of the transmissions, then in all subsequent transmissions, either the sender completes his knowledge of the testimonies for a failed transmission, or just the sender and the receiver participate in the transmission. In the latter case, the transmission is guaranteed to be successful, since both the sender and receiver are honest, and the active honest path must have been a direct link between them for the vast majority of this transmission (otherwise, if there is some honest node A that is part of numerous active honest paths during the transmission, then the sender will necessarily receive all of A’s outstanding testimony, completing his knowledge of all testimonies for some failed transmission). This lemma is restated and proven in terms of the pseudo-code in Lemma D.9 of Appendix D. □

Lemma 5.4

If there is a transmission T for which the sender has collected the complete testimonies from every node that participated in T, then the sender can necessarily identify a corrupt node.

Proof

(Sketch) Failed transmissions fall under Case F2, F3, or F4. The mechanism for how the testimonies can be used to identify a corrupt node in each of these cases was described in Sect. 5.1 above. Below we go into a little more detail, but reserve the formal proof with references to pseudo-code in Lemma D.27 in Appendix D.

Case F2. In this case, the Sender has the complete testimony from every participating node for a transmission that failed due to Case F2. We will use control information of type 2 to identify a corrupt node. The key observation is to note that a variant of Lemma 4.10 from Sect. 4 remains valid when applied to the active honest path that exists each round, since all nodes on this path are honest (this is proven in Lemma D.14):

$$ -n \geq\sum_{N \in\mathsf{P}_{\mathtt{t}}} \varDelta\varPhi_N, $$

where t∈β is any blocked round (with β denoting the set of blocked rounds of the transmission), and P_t denotes the set of nodes comprising the active honest path for round t (as guaranteed by the conforming assumption). Since the Sender was unable to insert D packets (otherwise we would be in Case F3), the number of blocked rounds is at least 2D (transmissions that fail as in Case F2 have 3D rounds), and hence:

$$ -2nD \geq\sum_{\mathtt{t} \in\beta} \biggl( \sum _{N \in \mathsf{P}_{\mathtt{t}}} \varDelta\varPhi_N \biggr). $$
(2)

Meanwhile, potential only increases on packets inserted by S; intuitively this is Lemma 4.11, which is proven for the malicious model in Lemma D.11. Since S inserted fewer than D packets (we are in Case F2), and each inserted packet adds at most 2n to total potential (the maximum height a packet can have in a buffer is the capacity of the buffer), we have that the total increase in potential due to packet insertions in the failed transmission in question obeys:

$$ 2nD > \varDelta\varPhi^*, $$
(3)

where ΔΦ^* represents the change in potential caused by all packet insertions. Therefore, we have that the total change in potential during this failed transmission satisfies

$$\begin{aligned} \varDelta\varPhi&= \mbox{(Cumm. inc. from packet insertions)} + \bigl(\mbox{Cumm. change as in }\mathrm{(2)}\bigr) \\&\quad {} + \mbox{(All other changes)} < 0, \end{aligned}$$
(4)

where the inequality comes from (2) and (3), and the fact that all other contributions to potential are strictly non-positive (intuitively this is Lemma 4.11, which is proven for the malicious model in Lemma D.11). (4) shows that the cumulative change in potential is negative, which means there have been more packet transfers than is possible if all nodes behaved honestly. We find the offending node by looking at control information of type 2 to find a node that is responsible for too much potential drop (intuitively, this node is duplicating packets, causing extra packet transfers that artificially increase potential drop). This argument is formalized in Theorem D.28.

Case F3. In this case, the Sender has the complete testimony from every participating node for a transmission that failed due to Case F3. Nodes exchange signatures every time a packet is passed between them, acknowledging the packet transfer; in particular, they keep a running count of the number of packets they have sent to each neighbor, and they refuse to send/receive packets from a given neighbor until they have a current signature from this neighbor on this count (this is control information of type 1). Therefore, for any honest node N in the network, summing over these (signed) counts from all of its neighbors, the total number of packets received by N minus the total number of packets sent by N will equal the number of packets currently stored in N’s buffers (up to an “off-by-one” error along each edge, due to the fact that N may not have the most recent signature from a neighbor reflecting the most recent packet sent/received along the edge). Meanwhile, summing over all nodes (including Sender and Receiver), the net number of packets sent minus the total number of packets received should be zero: every packet sent by some node is received by another (again the “off-by-one” error can lead to a cumulative error of up to 2n on this sum, since it comes from the signed counts):

$$\begin{aligned} 0 &\approx\sum_{N \in G^*} \bigl( (\textrm{Num. packets } \textit{rec'd}\textrm{ by } N \textrm{ from all neighbors})\\ &\quad{} - (\textrm{Num. packets }\textit{sent}\textrm{ by } N \textrm{ to all neighbors}) \bigr), \end{aligned}$$

where G^* represents the subset of nodes that participated in the transmission in question. Separating out the Sender and Receiver’s contribution to this sum:

$$\begin{aligned} 6n^3 &\approx \sum_{N \in G^* \setminus\{S, R\}} \bigl( (\mbox{Num. packets }\textit{rec'd}\mbox{ by $N$ from all neighbors})\\ &\quad{} - (\mbox{Num. packets }\textit{sent}\mbox{ by $N$ to all neighbors}) \bigr), \end{aligned}$$

where we have used the fact that the Sender inserted all D packets and the Receiver got fewer than D−6n^3 of them (otherwise R could have decoded the message, resulting in a successful transmission). By an averaging argument, there is some node N that satisfies

$$\begin{aligned} 6n^2 &\leq \bigl( (\mbox{Num. packets }\textit{rec'd}\mbox{ by $N$ from all neighbors}) \\ &\quad{} - (\mbox{Num. packets }\textit{sent}\mbox{ by $N$ to all neighbors}) \bigr). \end{aligned}$$
(5)

As noted above, the quantity on the right-hand side of (5) represents the number of packets N is currently storing in its buffers, and a value of 6n^2 is more than N is allowed to store (its total buffer capacity is 4n(n−1)<6n^2), implying that N is corrupt. The proof of Theorem D.33 formalizes this argument.

Case F4. In this case, the Sender has the complete testimony from every participating node for a transmission that failed due to Case F4. The proof of this case is analogous to the proof of Case F3 above, except that rather than considering the total number of packets that were sent/received between each node, here the Sender looks at the number of times the packet p was exchanged between each node (available via control information of type 3). Summing over all nodes the number of times p was received minus the number of times p was sent yields zero:Footnote 33

$$\begin{aligned} 0 &= \sum_{N \in G^*} \bigl( (\mbox{Num. times $p$ }\textit{rec'd}\mbox{ by $N$ from all neighbors} ) \\ &\quad{} - (\mbox{Num. times $p$ }\textit{sent}\mbox{ by $N$ to all neighbors}) \bigr) \end{aligned}$$
(6)

But the Receiver's contribution to this sum is at least two, since R received p at least twice (we are in Case F4), while the Sender's contribution is exactly negative one (since S inserted p exactly once and never received it). Removing the Sender's and Receiver's contributions therefore leaves:

$$\begin{aligned} -1 &\geq \sum_{N \in G^* \setminus\{S, R\}} \bigl((\mbox{Num. times $p$ }\textit{rec'd}\mbox{ by $N$ from all neighbors})\\ &\quad{} - (\mbox{Num. times $p$ }\textit{sent}\mbox{ by $N$ to all neighbors}) \bigr). \end{aligned}$$

Using an averaging argument, there exists some node N with

$$\begin{aligned} -1 &\geq \bigl((\mbox{Num. times $p$ }\textit{rec'd}\mbox{ by $N$ from all neighbors})\\ &\quad{} - (\mbox{Num. times $p$ }\textit{sent}\mbox{ by $N$ to all neighbors}) \bigr), \end{aligned}$$

which implies N sent p more times than N received p, something that will never be true for honest nodes. This argument is formalized in Theorem D.34. □
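As a companion to the Case F3 sketch, here is a minimal (and again hypothetical) illustration of the Case F4 check: assuming the per-node counts for the specific packet p have been extracted from control information of type 3 into a dictionary p_counts, the Sender looks for a node that reports sending p more often than it received p.

```python
# Hypothetical sketch of the Case F4 check (illustration only, not the paper's procedure).
# `p_counts[N]` holds, summed over all of N's neighbors, the signed counts of how many
# times the duplicated packet p was received and sent by node N.

def find_duplicating_node(p_counts):
    """Return a node that sent packet p more times than it received it."""
    for node, c in p_counts.items():
        if c["received"] - c["sent"] < 0:
            return node  # an honest node never forwards a packet it has not received
    return None  # no node flagged (should not occur when Case F4 holds and testimonies are complete)
```

Theorem D.34 gives the protocol's actual procedure and its proof of correctness.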

6 Conclusion

In this paper, we have presented a protocol that is secure simultaneously against conforming node-controlling and edge-scheduling adversaries. Our results are of a theoretical nature, with rigorous proofs of correctness and guarantees of performance. Surprisingly, our protocol demonstrates that protection against the node-controlling adversary can be added on top of protection against the edge-scheduling adversary without any additional asymptotic cost in throughput.

While our results do provide a significant step in the search for protocols that work in a dynamic setting (edge failures controlled by the edge-scheduling adversary) where some of the nodes are susceptible to corruption (by a node-controlling adversary), there remain important open questions. The original Slide protocolFootnote 34 requires each internal node to have buffers of size $O(n^2\log n)$, while ours requires $O(n^4\log n)$, though this can be slightly improved with additional assumptions.Footnote 35

In practice, the extra factor of $n^2$ in terms of memory may make our protocol infeasible for implementation, even for overlay networks. While the need for signatures inherently forces an increase in memory per node in our protocol versus the original Slide protocol, this is not what contributes to the extra $O(n^2)$ factor. Rather, the only reason we need the extra memory is to handle the third kind of malicious behavior, which roughly corresponds to the mixed adversarial strategy of a corrupt node replacing a valid packet with an old packet that the node has duplicated. Recall that in order to detect this, for every packet a node sees and for every neighbor, the node must keep a (signed) record of how many times this packet has traversed the adjacent edge (the $\Theta(n^3)$ packets per codeword and $O(n)$ neighbors per node yield the $O(n^4)$ bound on memory). Therefore, one open problem is finding a less memory-intensive way to handle this type of adversarial behavior.
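To make this accounting explicit, the following back-of-the-envelope tally recovers the quoted bound; treating each signed per-packet, per-neighbor record as occupying $O(\log n)$ bits is an assumption of this sketch rather than a statement taken from the formal analysis:

$$\begin{aligned} \mbox{records per node} &= \underbrace{\Theta\bigl(n^3\bigr)}_{\mbox{packets per codeword}} \times \underbrace{O(n)}_{\mbox{neighbors per node}} = O\bigl(n^4\bigr),\\ \mbox{memory per node} &= O\bigl(n^4\bigr) \times \underbrace{O(\log n)}_{\mbox{bits per record}} = O\bigl(n^4 \log n\bigr). \end{aligned}$$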

Apart from its memory costs, the computational costs and overall complexity of our protocol make it unrealistic to deploy in practice. If a practical protocol is to be developed for the network model considered here, further work is needed to reduce these complexities and costs.

Another open question is how altering the assumptions made in the network model affects the optimal protocol performance that is achievable. For example, Bunn and Ostrovsky [9] explore routing in asynchronous networks that have no minimal connectivity assumptions. It would be interesting to see how performance is affected in networks that make different (combinations of) assumptions.

One final open problem is to extend our setting of end-to-end communication to the case of multiple senders and receivers. In particular, can one do better than the trivial extension of our protocol, which would add a factor of $\Theta(n^2)$ to the already burdensome memory cost of $\Theta(n^4)$ per node?