Figure 1: Examples to illustrate IP Multicast, naive unicast and End System Multicast
multicast. It abstracts the physical topology as a Complete Virtual Graph (CVG) as shown in Figure 1(c). Further, it tries to construct a spanning tree of the CVG along which A could send data to other recipients. In particular, this scheme could degenerate to naive unicast transmission as shown in Figure 1(d). Figure 1(e) depicts how naive unicast transmission maps onto the underlying physical network. It is seen that links R1-R2 and A-R1 carry 2 and 3 copies of a transmission by A respectively. We refer to the number of identical copies of a packet carried by a physical link as the stress of a physical link. Thus, unicast in general leads to a high stress on the link nearest the source.

We could build smarter spanning trees of the CVG, such as the one shown in Figure 1(f). Figure 1(g) depicts how this tree maps onto the underlying physical network. It is seen that all physical links have a stress of at most 2. Not only does this tree reduce the worst case physical link stress, but it also reduces the stress of the costlier link R1-R2 to 1. In order to capture this, we introduce the notion of resource usage. We define resource usage as $\sum_{i=1}^{L} d_i s_i$, where L is the number of links active in data transmission, d_i is the delay of link i and s_i is the stress of link i. The resource usage is a metric of the network resources consumed in the process of data delivery to all receivers. Implicit here is the assumption that links with higher delay tend to be associated with higher cost. The resource usage might be computed to be 30 in the case of transmission by DVMRP, 57 for naive unicast and 32 for the smarter tree, shown in Figure 1(b), Figure 1(d) and Figure 1(f) respectively.

While the spanning tree of Figure 1(f) improves link stress and resource usage as compared to naive unicast transmission, it increases delay from the source to some of the recipients. Thus, the delay from A to D has increased to 29 from 27. We refer to the ratio of the delay between two members along the overlay to the unicast delay between them as the Relative Delay Penalty (RDP). Thus, <A,D> has an RDP of 29/27, while <A,B> and <A,C> have an RDP of 1.

The out-degree of a node in the overlay tree is its number of children in the tree. Thus, the out-degree of A in the smarter tree is 2, while it is 3 in naive unicast. The out-degree of a node in the overlay spanning tree directly impacts the stress of the links close to the node.

3. NARADA DESIGN
In this section, we present Narada, a protocol we designed that implements End System Multicast. In designing Narada, we have the following objectives in mind:

Self-organizing: The construction of the end system overlay must take place in a fully distributed fashion and must be robust to dynamic changes in group membership.

Overlay efficiency: The tree constructed must ideally have low stress, low RDP and low resource usage. Further, the out-degree of each member in the overlay must reflect the bandwidth of its connection to the Internet.

Self-improving in an incremental fashion: The overlay construction must include mechanisms by which end systems gather network information in a scalable fashion. The protocol must allow the overlay to incrementally evolve into a better structure as more information becomes available.

There are two basic methods for constructing overlay spanning trees for data delivery. A first approach is to construct the tree directly; that is, members explicitly select their parents from among the members they know [8]. Narada, however, constructs trees in a two-step process. First, it constructs a richer connected graph that we term a mesh. The mesh could in general be an arbitrary connected subgraph of the CVG (though Narada tries to ensure that the mesh has desirable performance properties, as discussed later). In the second step, Narada constructs (reverse) shortest path spanning trees of the mesh, each tree rooted at the corresponding source, using well known routing algorithms. Figure 1(h) presents an example mesh that Narada constructs for the physical topology shown in Figure 1(a), along with the shortest path spanning tree rooted at A. We have several reasons for this two-step process. First, group management functions are abstracted out and handled at the mesh, rather than replicated across multiple (per-source) trees. Second, distributed heuristics for repairing mesh partitions and for mesh optimization are greatly simplified, as loop avoidance is no longer a constraint. Third, we may leverage standard routing algorithms for the construction of data delivery trees. Finally, a mesh is more resilient to the failure of members than a tree, and heavyweight partition repair mechanisms are invoked less frequently.
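The stress, resource-usage and RDP definitions above are straightforward to compute mechanically. The sketch below uses hypothetical link delays and stress values (only the <A,D> delay pair 29/27 is taken from the text), so the resource-usage totals are illustrative rather than the Figure 1 numbers:

```python
# Sketch: computing the metrics defined above. The (delay, stress)
# pairs are hypothetical, not the exact Figure 1 topology.

def resource_usage(active_links):
    """active_links: list of (d_i, s_i) pairs, where d_i is the delay of
    physical link i and s_i its stress (number of identical copies it
    carries). Resource usage = sum of d_i * s_i over active links."""
    return sum(d * s for d, s in active_links)

def rdp(overlay_delay, unicast_delay):
    """Relative Delay Penalty between a pair of members."""
    return overlay_delay / unicast_delay

# Hypothetical naive unicast: an access link (delay 1) carries 3 copies,
# a costly backbone link (delay 25) carries 2 copies.
print(resource_usage([(1, 3), (25, 2)]))   # 53
# The <A,D> example from the text: overlay delay 29 vs. unicast delay 27.
print(round(rdp(29, 27), 3))               # 1.074
```

Note how a high-stress copy on the high-delay link dominates the total, which is exactly why the smarter tree in Figure 1(f) pushes the stress of the costly R1-R2 link down to 1.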
In our approach, there is no control over the resulting spanning trees for a given mesh. Hence, it becomes important to construct a good mesh so that good quality trees may be produced. In particular, we attempt to ensure the following properties: (i) the shortest path delay between any pair of members along the mesh is at most K times the unicast delay between them, where K is a small constant, and (ii) each member has a limited number of neighbors in the mesh, which does not exceed a given (per-member) bound chosen to reflect¹ the bandwidth of the member's connection to the Internet. Limiting the number of neighbors serves two purposes. First, it regulates the fanout of members in the spanning trees. Second, it controls the overhead of running routing algorithms on the mesh. The extreme case where the mesh is chosen to be the Complete Virtual Graph incurs all the overhead of routing with none of its benefits, as the resulting shortest path spanning trees degenerate to naive unicast transmission.

¹An ideal mesh is a "Degree-Bounded K-spanner" [11] of the Complete Virtual Graph. The problem of constructing Degree-Bounded K-spanners of a graph has been widely studied in centralized settings that assume complete information and is NP-complete even in such scenarios [11].

Narada has striking differences from self-configuring protocols developed in other contexts. First, Narada distinguishes itself from normal routing protocols in that it changes the very topology over which routing is performed. Second, most existing self-configuring protocols [12, 13, 14, 22] assume native IP Multicast support. Narada attempts self-configuration in the absence of a lower level multicast service, and this is fundamentally more challenging.

We explain the distributed algorithms that Narada uses to construct and maintain the mesh in Section 3.1. We present heuristics that Narada uses to improve mesh quality in Section 3.2. Narada runs a variant of standard distance vector algorithms on top of the mesh and uses well known algorithms to construct per-source (reverse) shortest path spanning trees for data delivery. We discuss this in Section 3.3.

3.1 Group Management
We have seen that Narada tries to construct a mesh among end systems participating in the multicast group. In this section, we present the mechanisms Narada uses to keep the mesh connected, to incorporate new members into the mesh, and to repair possible partitions that may be caused by members leaving the group or by member failure.

As we do not wish to rely on a single non-failing entity to keep track of group membership, the burden of group maintenance is shared jointly by all members. To achieve a high degree of robustness, our approach is to have every member maintain a list of all other members in the group. Since Narada is targeted towards small sized groups, maintaining the complete group membership list is not a major overhead. Every member's list needs to be updated when a new member joins or an existing member leaves. The challenge is to disseminate changes in group membership efficiently, especially in the absence of a multicast service provided by the lower layer. We tackle this by exploiting the mesh to propagate such information. However, this strategy is complicated by the fact that the mesh might itself become partitioned when a member leaves. To handle this, we require that each member periodically generate a refresh message with a monotonically increasing sequence number, which is disseminated along the mesh. Each member i keeps track of the following information for every other member k in the group: (i) member address k, (ii) the last sequence number s_ki that i knows k has issued, and (iii) the local time at i when i first received information that k issued s_ki. If member i has not received an update from member k for T_m time, then i assumes that k is either dead or potentially partitioned from i. It then initiates a set of actions to determine the existence of a partition and repair it if present, as discussed in Section 3.1.3.

    Let i receive a refresh message from neighbor j at local time t.
    Let <k, s_kj> be an entry in j's refresh message.
    if i does not have an entry for k, then i inserts the entry
        <k, s_kj, t> into its table
    else if i's entry for k is <k, s_ki, t_ki>, then
        if s_ki >= s_kj, i ignores the entry pertaining to k
        else i updates its entry for k to <k, s_kj, t>

Figure 2: Actions taken by a member i on receiving a refresh message from member j.

Propagation of refresh messages from every member along the mesh could potentially be quite expensive. Instead, we require that each member periodically exchange its knowledge of group membership with its neighbors in the mesh. A message from member i to a neighbor j contains a list of entries, one entry for each member k that i knows is part of the group. Each entry has the following fields: (i) member address k and (ii) the last sequence number s_ki that i knows k has issued. On receiving a message from a neighbor j, member i updates its table according to the pseudo code presented in Figure 2.

Finally, given that a distance vector routing algorithm is run on top of the mesh (Section 3.3), routing update messages exchanged between neighbors can include member sequence number information with minimum extra overhead.

3.1.1 Member Join
When a member wishes to join a group, Narada assumes that the member is able to get a list of group members by an out-of-band bootstrap mechanism. The list needs neither be complete nor accurate, but must contain at least one currently active group member. In this paper, we do not address the issue of the bootstrap mechanism. We believe that such a mechanism is application specific and our protocol is able to accommodate different ways of obtaining bootstrap information.

The joining member randomly selects a few group members from the list available to it and sends them messages requesting to be added as a neighbor. It repeats the process until it gets a response from some member, at which point it has successfully joined the group. Having joined, the member then starts exchanging refresh messages with its neighbors. The mechanisms described earlier will ensure that the newly joined member and the rest of the group learn about each other quickly.

3.1.2 Member Leave and Failure
When a member leaves a group, it notifies its neighbors, and this information is propagated to the rest of the group members along the mesh. In Section 3.3, we will describe our enhancement to distance vector routing that requires a leaving member to continue forwarding packets for some time to minimize transient packet loss.
Figure 3: A sample virtual topology (graphic omitted; it depicts members A through G connected by mesh links).

    EvaluateUtility(j) begin
        utility = 0
        for each member m (m != i) begin
            CL = current latency between i and m along mesh
            NL = new latency between i and m along mesh
                 if edge i-j were added
            if (NL < CL) then begin
                utility += (CL - NL) / CL
            end
        end
        return utility
    end
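The pseudo code above translates almost line for line into Python. The latency tables here are hypothetical inputs; in Narada they would come from i's routing state computed with and without the candidate edge i-j:

```python
# Sketch of EvaluateUtility: each member m contributes at most 1,
# namely i's relative decrease in delay to m if edge i-j were added.

def evaluate_utility(cur_latency, new_latency):
    """cur_latency[m]: current mesh latency from i to member m.
    new_latency[m]: mesh latency from i to m if edge i-j were added."""
    utility = 0.0
    for m, cl in cur_latency.items():
        nl = new_latency[m]
        if nl < cl:
            utility += (cl - nl) / cl
    return utility

cur = {"B": 40.0, "C": 10.0}
new = {"B": 20.0, "C": 10.0}
print(evaluate_utility(cur, new))   # 0.5: only B improves, by half
```

Since each term is a relative improvement in (0, 1], the total is bounded by the number of members i knows, matching the maximum value of n stated in the text.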
Figure 5: Algorithm member i uses in determining the utility of adding a link to j.

    Let Q be a queue of members for which i has stopped receiving
    sequence number updates for at least T_m time.
    Let T be the maximum time an entry may remain in Q.
    while(1) begin
        Update Q
        while(!Empty(Q) and
              Head(Q) is present in Q for >= T time) begin
            j = Dequeue(Q)
            Initiate probe cycle to determine if j is dead
            or to add a link to it.
        end
        if(!Empty(Q)) begin
            prob = Length(Q) / GroupSize
            With probability prob begin
                j = Dequeue(Q)
                Initiate probe cycle to determine if j is dead
                or to add a link to it.
            end
        end
        sleep(P).  // Sleep for P seconds
    end

Figure 4: Scheduling algorithm used by member i to repair mesh partition.

We also need to consider the difficult case of abrupt failure. In such a case, failure should be detected locally and propagated to the rest of the group. In this paper, we assume a fail-stop failure model [19], which means that once a member dies, it remains in that state, and the fact that the member is dead is detectable by other members. We explain the actions taken on member death with respect to Figure 3. This example depicts the mesh between group members at a given point in time. Assume that member C dies. Its neighbors in the mesh, A and G, stop receiving refresh messages from C. Each of them independently sends redundant probe messages to C, such that the probability that every probe message (or its reply) is lost is very small. If C does not respond to any probe message, then A and G assume C to be dead and propagate this information throughout the mesh.

Every member needs to retain entries in its group membership table for dead members. Otherwise, it is impossible to distinguish between a refresh announcing a new member and a refresh announcing stale information regarding a dead member. However, dead member information can be flushed after a sufficient amount of time.

…stopped receiving sequence number updates for at least T_m time. It runs a scheduling algorithm that periodically and probabilistically deletes a member from the head of the queue. The deleted member is probed, and it is either determined to be dead or a link is added to it. The scheduling algorithm is adjusted so that no entry remains in the queue for more than a bounded period of time. Further, the probability value is chosen carefully so that, in spite of several members simultaneously attempting to repair a partition, only a small number of new links are added. The algorithm is summarized in Figure 4.

3.2 Improving mesh quality
The constructed mesh can be quite sub-optimal, because (i) initial neighbor selection by a member joining the group is random, given limited availability of topology information at bootstrap, (ii) partition repair might aggressively add edges that are essential for the moment but not useful in the long run, (iii) group membership may change due to dynamic joins and leaves, and (iv) underlying network conditions, routing and load may vary. Narada allows for incremental improvement of mesh quality. Members probe each other at random, and new links may be added depending on the perceived gain in utility in doing so. Further, members continuously monitor the utility of existing links, and drop links perceived as not useful. This dynamic adding and dropping of links in the mesh distinguishes Narada from other topology maintenance protocols.

The issue then is the design of a utility function that reflects mesh quality. A good quality mesh must ensure that the shortest path delay between any pair of members along the mesh is comparable to the unicast delay between them. A member i computes the utility gain if a link is added to member j based on (i) the number of members to which j improves the routing delay of i and (ii) how significant this improvement in delay is. Figure 5 presents the pseudo code that i uses to compute the gain in utility if a link to member j is added. The utility can take a maximum value of n, where n is the number of group members i is aware of. Each member m can contribute a maximum of 1 to the utility, the actual contribution being i's relative decrease in delay to m if the link to j were added.
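The probabilistic half of the Figure 4 loop can be sketched as follows. We omit the forced probing of entries older than T for brevity, and the names (`repair_step`, `probe`) are ours:

```python
# Sketch: one scheduling iteration of partition repair at member i.
# If the queue of silent members is non-empty, probe its head with
# probability len(q)/group_size, as in Figure 4.

import random
from collections import deque

def repair_step(q, group_size, probe):
    """Probe (and dequeue) the head of q with probability len(q)/group_size.
    Returns the member probed, or None."""
    if q and random.random() < len(q) / group_size:
        member = q.popleft()
        probe(member)   # determine whether member is dead, or add a link
        return member
    return None

# With len(q) == group_size the probability is 1, so the head is probed.
q = deque(["X", "Y"])
probed = []
repair_step(q, group_size=2, probe=probed.append)
print(probed)   # ['X']
```

Scaling the probability by the queue length divided by the group size is what keeps the expected number of repair links small even when many members detect the same partition simultaneously.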
Figure 6: Algorithm i uses to determine consensus cost to a neighbor j.

…of previous probes.

When a member i probes a member j, j returns to i a copy of its routing table. i uses this information to compute the expected gain in utility if a link to j is added, as described in Figure 5. i decides to add a link to j if the expected utility gain exceeds a given threshold. The threshold value is a function of i's estimation of group size, and of the current and maximum fanout values of i and j respectively. Finally, i may also add a link to j if the physical delay between them is very low and the current overlay delay between them is very high.

Dropping of links: Ideally, the loss in utility if a link were to be dropped must exactly equal the gain in utility if the same link were immediately re-added. However, this requires estimating the relative increase in delay to a member if a link were dropped, and it is difficult to obtain such information. Instead, we overestimate the actual utility of a link by its cost. The cost of a link between i and j, in i's perception, is the number of group members for which i uses j as the next hop. Periodically, a member computes the consensus cost of its link to every neighbor using the algorithm shown in Figure 6. It then picks the neighbor with the lowest consensus cost and drops it if the consensus cost falls below a certain threshold. The threshold is again computed as a function of the member's estimation of group size and its current and maximum fanout. The consensus cost of a link represents the maximum of the cost of the link in each neighbor's perception. Yet, it might be computed locally, as the mesh runs a distance vector algorithm with path information.

Our heuristics for link-dropping have the following desirable properties:

Stability: A link that Narada drops is unlikely to be added again immediately. This is ensured by several factors: (i) the threshold for dropping a link is less than or equal to the threshold for adding a link, (ii) the utility of an existing link is overestimated by the cost metric, (iii) dropping of links is done considering the perception that both members have regarding link cost, and (iv) a link with small delay is not dropped.

Partition avoidance: We present an informal argument as to why our link dropping algorithm does not cause a partition, assuming steady state conditions and assuming multiple links are not dropped concurrently. Assume that member i drops neighbor j. This could result in at most two partitions. Assume the size of i's partition is S_i and the size…

3.3 Data Delivery
Narada runs a distance vector protocol on top of the mesh. In order to avoid the well-known count-to-infinity problems, it employs a strategy similar to BGP [16]. Each member not only maintains the routing cost to every other member, but also maintains the path that leads to such a cost. Further, routing updates between neighbors contain both the cost to the destination and the path that leads to such a cost. The per-source trees used for data delivery are constructed from the reverse shortest path between each recipient and the source, in identical fashion to DVMRP [4]. A member M that receives a packet from source S through a neighbor N forwards the packet only if N is the next hop on the shortest path from M to S. Further, M forwards the packet to all its neighbors who use M as the next hop to reach S.

The routing metric used in the distance vector protocol is the latency between neighbors. Each endpoint of a link independently estimates the latency of the link and could have different estimates. Using latency as the metric enables routing to adapt to dynamics in the underlying network. However, it also increases the probability of routing instability and oscillations. In our work, we assume that members use an exponential smoothing algorithm to measure latency. Further, the latency estimate is updated only at periodic intervals. The period length can be varied to trade off routing stability against reactivity to changing conditions.

A consequence of running a routing algorithm for data delivery is that there could be packet loss during transient conditions when member routing tables have not yet converged. In particular, there could be packet loss when a member leaves the group or when a link is dropped for performance reasons. To avoid this, data continues to be forwarded along old routes for enough time until routing tables converge. To achieve this, we introduce a new routing cost called Transient Forward (TF). TF is guaranteed to be larger than the cost of a path with a valid route, but smaller than infinite cost. A member M that leaves advertises a cost of TF for all members for which it had a valid route. Normal distance vector operations lead to members choosing alternate valid routes not involving M (as TF is guaranteed to be larger than the cost of any valid route). The leaving member continues to forward packets until it is no longer used by any neighbor as a next hop to reach any member, or until a certain time period expires.

4. SIMULATION EXPERIMENTS
In this section, we evaluate the properties of the overlay structure that Narada produces, and the overheads associated with the Narada protocol.
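The reverse shortest path forwarding rule described above has a compact expression in code. In this sketch, `next_hop(x, s)` stands in for a lookup in x's routing table toward source s, and the toy topology is hypothetical; we also skip the arrival neighbor explicitly, which the rule implies since a neighbor never routes toward S through the member it just delivered to:

```python
# Sketch of the reverse shortest path forwarding rule: member M,
# receiving a packet for source S from neighbor N, accepts it only if
# N is M's next hop toward S, then relays to neighbors whose next hop
# toward S is M.

def forward(member, source, came_from, next_hop, neighbors):
    """Return the list of neighbors the packet is relayed to ([] if dropped)."""
    if next_hop(member, source) != came_from:
        return []   # did not arrive along the reverse shortest path: drop
    return [n for n in neighbors(member)
            if n != came_from and next_hop(n, source) == member]

# Toy mesh: S - M - {A, B}; A routes toward S via M, B via some member C.
hop = {("M", "S"): "S", ("A", "S"): "M", ("B", "S"): "C"}
nbrs = {"M": ["S", "A", "B"]}
print(forward("M", "S", "S", lambda x, s: hop[(x, s)], nbrs.__getitem__))
# ['A']
```

The drop test is what suppresses duplicates on the mesh: a packet that arrives over any edge other than the reverse shortest path edge is discarded rather than re-flooded.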
Figure 7: Cumulative distribution of RDP shown at various snapshots of the simulation. The minutes denote the time after the last join. (plot omitted)

Figure 9: Overlay delay vs. physical delay. Each point denotes the existence of a pair of members with a given physical delay and overlay delay. (plot omitted)

Figure 11: No. of physical links with a given stress vs. stress for naive unicast, Narada and DVMRP. (plot omitted)

Figure 12: 90 percentile RDP vs. group size for topologies from three models. (plot omitted)

Figure 13: Worst case physical link stress vs. group size for topologies from three models. (plot omitted)

Figure 14: Effect of group size on NRU: Narada vs. naive unicast. (plot omitted)
4.6 Impact of factors on performance
We are interested in studying the variation in Narada's performance due to each of the following factors: (i) topology model, (ii) topology size, (iii) group size, and (iv) fanout range. Keeping the other factors fixed at their defaults, we study the influence of each individual factor on Narada's performance. By default, we used a Waxman topology with 1024 nodes and 3145 links, a group size of 128 and a fanout range of <3-6> for all group members. For all results in this section, we compute each data point by conducting 25 independent simulation experiments, and we plot the mean with 95% confidence intervals. Due to space constraints, we present plots of selected experiments and summarize the results of other experiments.

Topology Model and Group Size
We used a Waxman topology consisting of 1024 routers and 3145 links, an ASMap topology consisting of 1024 routers and 3037 links, and a Mapnet topology consisting of 1070 routers and 3170 links.

Figure 12 plots the variation of the 90 percentile RDP with group size for the three topologies. Each curve corresponds to one topology. All the curves are close to each other, indicating that the RDP is not sensitive to the choice of topology model. For all topologies and for a group size of 128 members, the 90 percentile RDP is less than 4. For each topology, the 90 percentile RDP increases with group size. This is because an increase of group size results in an increase of mesh diameter, and hence an increase of RDP.

Figure 13 plots the variation of worst case physical link stress against group size for the three topologies. Each curve corresponds to one topology. We observe that the curves are close to each other for small group sizes but seem to diverge for larger group sizes. Further, for all topologies, worst case stress increases with group size. Thus, for a group size of 64, the mean worst case stress is about 5-7 across the three topologies, while for a group size of 256, it is about 8-14. We believe this increase of stress with group size is an artifact of the small topologies in a simulation environment relative to the actual Internet backbone. We analyze this in detail in Section 4.8.

Figure 14 plots the normalized resource usage (NRU) against group size for the Waxman model alone. The lower and upper curves correspond to Narada and unicast respectively. First, Narada consumes less network resources than naive unicast, and this is consistent for all group sizes. For a group size of 128, the NRU for Narada is about 1.8, versus 2.2 for naive unicast. Second, NRU increases with group size. While these results imply a nearly 20% savings of network resources, we believe that the savings could be even more significant if members are clustered. We have repeated this study with the Mapnet and ASMap topologies and observe similar trends. For all topologies, the NRU is at most 1.8 for a group size of 128.

Topology Size
For each topology model, we generate topologies of sizes varying from about 64 nodes to about 1070 nodes and evaluate the impact on Narada's performance. Figure 15 plots the worst case physical link stress against topology size for
25 than 4. We hypothesize that RDP values might be lower
on the Internet, as Internet routing is policy based and sub-
Waxman
ASMap
Mapnet
optimal while the simulator assumes shortest path routing.
Worst Case Physical Link Stress
20
Preliminary Internet evaluation indicates that the 90 per-
centile RDP for a 13 member group can be as low as 1:5
15
(Section 5).
10
Across a range of topology models, Narada results in a
low worst case stress for small group sizes. For example, for
5 a group size of 16, the worst case stress is about 5. While
for larger group sizes, worst case stress may be higher, it is
still much lower than unicast. For example, for a group of
128 members, Narada reduces worst case stress by a factor
0
0 200 400 600 800 1000 1200
Topology Size
of 14 compared to unicast.
Figure 15: Worst case physical link stress vs. topol- We hypothesize that worst case stress on the Internet is
ogy size for topologies from three models lower than seen in simulations. The largest topologies that
each topology model. Across all topology models, we ob- we use in our simulations (around 1000 nodes) are still orders
serve that the worst case stress increases with decrease in of magnitude smaller than the Internet. Consequently, the
topology size. While the same general trend is observed for ratio of the group size to topology size, which we term den-
all topology models, it seems more pronounced for Waxman. sity, is much higher in simulations than in actual practice.
We analyze the signicance of this result in Section 4.8. Our simulations indicate that higher group density results
in higher worst case link stress. This can be deduced from
We have also studied the eect of topology size on RDP and Figures 13 and 15, where we observe that the worst case
NRU. Across all topology models, RDP appears largely un- stress increases with group size and decreases with topology
aected by topology size, while NRU decreases with increase size. We hypothesize that an increase in group density in-
in topology size. We omit the plots due to space constraint. creases the probability that an internal physical link could
be shared by multiple uncorrelated virtual links. The links
are uncorrelated2 in the sense that they connect distinct pairs
Fanout Range of end systems. This could increase worst case stress with
So far, we have assumed that each member strives to main- Narada because Narada is only able to regulate fanout of
tain <3-6> neighbors in the mesh. We have investigated members and consequently can only control stress of links
the eect of variation of fanout range on Narada's perfor- near member and not stress of internal physical links. For
mance. In summary, when the fanout range increases, mesh the range of group sizes we consider, we expect that the den-
diameter decreases and stress on links close to members sity ratio is much lower on the Internet and thus we expect
increases. Consequently, RDP decreases while worst case lower link stress.
stress increases. For a group of 128 members, as fanout
range increases from <2-4> to <8-16>, the 90 percentile Narada lowers resource usage by at least 20% compared
RDP decreases from about 5:5 to 2 while the worst case to unicast for a range of group sizes. We believe that if
physical stress increases from about 9 to 15. members are clustered, Narada can result in even larger im-
provement in resource usage.
4.7 Protocol Overhead
Narada incurs a protocol overhead for two reasons. First, 5. INTERNET EXPERIMENTS
members periodically exchange routing tables and control We have implemented Narada and we report preliminary
information between each other. Second, members estimate results obtained by conducting experiments on the Internet.
their delays to other members by probing them periodically.
We dene Protocol Overhead Ratio (POR) as the ratio of Our experiments consisted of 13 hosts distributed through-
bytes of non-data trac that enter the network to bytes of out the United States. In each experiment, every host was
data trac. While we do not present results, we nd that initially provided the list of names of all other hosts, and
POR increases linearly with group size. Further, we note all hosts join the group at approximately the same time.
that the protocol trac that Narada introduces is indepen- Each host attempted to maintain a fanout range of <2-4>.
dent of source data rate and thus the POR decreases with A host at CMU was designated the source and sent data
increase in data trac. For a group size of 128 members, at periodic intervals. At present, we assume that Narada
the POR is about 0:25 for a source data rate of 16 kilobits attempts to minimize the propagation delay between hosts
per second (kbps), and less than 0:04 for a source data rate on the mesh. Thus, each host assumes its physical delay to
of 128 kbps. For a 64 member group and a source data rate another host is the minimum delay observed across multiple
of 128 kbps, the POR is hardly 0:02. estimates. Each experiment was run for about 20 minutes.
4.8 Results Summary and Interpretation Figure 16 shows the overlay spanning tree that Narada pro-
In this section, we summarize key results that we have pre- duced, which was used to route data from the source (CMU1)
sented and attempt to explain the results. to other recipients, for a typical experiment. The links of
the tree are labeled by their delays (in milliseconds), as mea-
Across a range of topology models, Narada results in a 2 For example, consider the physical topology shown in Fig-
low RDP for small group sizes. For example, for a group ure 1(a) and assume that Narada constructs a mesh as shown
size of 16, the 90 percentile RDP is less than 2.5. Even for in Figure 1(h). Uncorrelated virtual links A ; C and B ; D
group sizes of 128 members, the 90 percentile RDP is less share the physical link R1 ; R2.
UWisc UMass 100
10 14
UIUC2
sured by one end point of the link. Here, UIUC1 and UIUC2 Figure 17: Cumulative distribution of RDP
Here, UIUC1 and UIUC2 belong to the same campus, as do Virginia1 and Virginia2. Berkeley and UCSB are closer together (on the West coast) as compared to all other hosts. Narada is able to ensure that only a single copy of data from CMU1 is delivered to UIUC, Virginia and the West Coast, and shorter links here are used for distribution of data within the sites.

Narada in fact constructs a mesh, and there are additional edges of the mesh not in the tree which we have omitted in Figure 16. Further, Narada may dynamically add and drop links to improve mesh quality. In this experiment, the mesh did not have links between Berkeley and UCSB, and between UIUC1 and UIUC2, in the first few minutes of the experiment. The self-improving nature of Narada resulted in the addition of these links and improved the efficiency of data delivery. Narada was also able to drop a link between GATech and Delaware which it identified as not useful in data delivery.

[Figure 17: Cumulative distribution of RDP.]

Figure 17 plots the cumulative distribution of the RDP of the mesh that Narada produced. The worst case RDP is only 2.6, and 90% of pairs of members have an RDP of at most 1.5. Further, we find that the delay along the mesh between any pair of members is at most 84 ms, while the worst case unicast delay between any pair of members can be as high as 64 ms (observed between UKY and UCSB). The physical delays used in RDP calculations were the delays estimated by an end point of the link. End points independently estimate link delay and thus might have different estimates. In our experiments, however, we find that link delay estimates by end points are generally consistent with each other and the estimates are very close to the ping times between the machines. Pairs with an RDP of 1 correspond to pairs of hosts with mesh links between them. Interestingly, we find a small number of pairs for which the RDP is slightly less than 1. There are two reasons for this effect. In some cases, this was due to minor inaccuracies in delay measurements. However, in some cases, the reason was more fundamental and due to the policy-based nature of Internet routing. This effect has been reported in [18].

In the future we plan to conduct larger scale Internet experiments with emphasis on studying the dynamics of Narada, transient behavior, and the effects of congestion and packet loss.

6. RELATED WORK
The works that come closest to the End System Multicast architecture we propose here, and which share much of our motivation, are Yallcast [8] and Scattercast [3]. Both projects challenge the appropriateness of using IP Multicast for all forms of group communication. Yallcast argues the need for hosts to auto-configure into tunneled topologies for efficient data dissemination. Yallcast emphasizes architectural aspects and has not yet considered the performance implications of using an overlay. In contrast, our work places a central emphasis on the performance tradeoffs associated with an overlay network, and this greatly influences the design of our self-organization protocol. Yallcast presents the design of a rendezvous service for the bootstrapping of a new member. The bootstrapping mechanism is however orthogonal to the issues we consider in this paper - while the mechanisms suggested in Yallcast could be used, we are open to other application specific and out-of-band bootstrap mechanisms. At the protocol level too, Yallcast and Narada have differences. Narada constructs an efficient mesh among participating members, and then constructs spanning trees of the mesh. In contrast, Yallcast constructs a spanning tree among participating members directly. We believe that a mesh-first approach helps avoid replicating group management functions across multiple (per-source) trees, simplifies overlay maintenance (as loop avoidance is no longer an issue), allows for leveraging standard routing algorithms, and provides a structure more resilient to the failure of members.

Scattercast argues for infrastructure support, where proxies deployed in the network run self-organization protocols on behalf of the clients, while clients subscribe to nearby proxies. Although we have not emphasized infrastructure support in this paper, we wish to make explicit that our notion of an end system is not restricted to clients of a multicast group, and includes network proxies. In particular, we believe that Narada can be used to build an overlay structure among proxies in a scattercast architecture, while the self-organization protocol proposed in Scattercast, called Gossamer, can be adapted to an End System Multicast architecture. At the protocol level, while Gossamer too adopts a mesh-first approach, it relies on centralized rendezvous points for repairing mesh partitions. Although this assumption significantly simplifies the partition recovery mechanisms in Gossamer, the members of the mesh could become partitioned from each other in the event of failure of all rendezvous points.
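The mesh-first approach discussed above can be illustrated with a toy example: once members maintain a mesh with measured link delays, a distribution tree for any source falls out of a standard shortest-path computation. The sketch below uses Dijkstra's algorithm over an invented four-member mesh; it is our own illustration, not code from Narada or Yallcast, and the members and delays are hypothetical.

```python
import heapq

# Toy illustration of the mesh-first idea: derive a per-source
# distribution tree from a shared mesh using a standard routing
# algorithm (Dijkstra's shortest paths over measured link delays).
# The mesh and delay values below are invented for illustration.

def shortest_path_tree(mesh, source):
    """Return {member: parent} for the minimum-delay tree rooted at source."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in mesh[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

# An invented 4-member mesh (symmetric links, delays in ms).
mesh = {
    "A": {"B": 10, "C": 25},
    "B": {"A": 10, "C": 5, "D": 30},
    "C": {"A": 25, "B": 5, "D": 12},
    "D": {"B": 30, "C": 12},
}
tree = shortest_path_tree(mesh, "A")
# Data from A reaches C via B (10 + 5 = 15 ms, versus 25 ms direct).
assert tree["C"] == "B" and tree["D"] == "C"
```

Because each source's tree is just a shortest-path tree over the shared mesh, group management state lives in the mesh rather than being replicated across per-source trees, which is the advantage the mesh-first comparison above points to.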
The MBone [2] and 6Bone [10] are popular existing examples of overlay networks. However, these are statically configured in a manual and ad hoc fashion. Narada, on the other hand, strives for a self-configuring and efficient overlay network. Internet routing protocols are self-configuring. The most striking difference between Narada and standard routing protocols is that while the latter work on a fixed physical topology, Narada alters the very topology over which it routes data. Routing protocols merely route around a link that has failed, and have no notion of dynamically adding or dropping links. Narada might dynamically add links to ensure connectivity of the virtual topology, and drop links it perceives as not useful.

Self-configuration has been proposed in other contexts. AMRoute [1] allows for robust IP Multicast in mobile ad hoc networks by exploiting user-multicast trees. Several reliable IP Multicast protocols [12, 13, 22] involve group members self-organizing into structures that help in data recovery. Adaptive Web Caching [14] is a self-organizing cache hierarchy. The key feature that distinguishes Narada from these protocols is that Narada does not assume a native multicast medium - AMRoute assumes a native wireless broadcast channel, while all other protocols assume the existence of IP Multicast. Self-configuration in the absence of such a native multicast medium is a much harder problem.

7. CONCLUSIONS
We have made two contributions in this paper. First, we have shown that, for small and sparse multicast groups, it is feasible to use an end system overlay approach to efficiently and robustly support all multicast related functionality, including membership management and packet replication. The shifting of multicast support from routers to end systems, while introducing some performance penalties, has the potential to address most problems associated with IP Multicast. We have shown, with both simulation and Internet experiments, that the performance penalties are low in the case of small and sparse groups. We believe that the potential benefits of repartitioning the multicast functionality between end systems and routers significantly outweigh the performance penalty incurred.

Second, we have proposed one of the first self-organizing and self-improving protocols that constructs an overlay network on top of a dynamic, unpredictable and heterogeneous Internet environment without relying on a native multicast medium. We also believe this is among the first works that attempt to systematically evaluate the performance of a self-organizing overlay network protocol and the tradeoffs in using overlay networks. In [20], it was argued that the overlay approach is a fundamental technique to incrementally deploy services and evolve networks. We believe that the techniques and insights developed in this paper are general and can be applied to overlay networks in contexts other than multicast.

In this paper, we emphasize the performance aspect of using an end system overlay approach to support multicast. We are currently extending this work in several dimensions. First, as mentioned in Section 1, the "end system" in the protocol can be either an application module, a host, or a shared proxy. We are exploring architectural issues involved in adapting Narada to each individual context. Second, in its current form, Narada considers latency between members as the sole criterion that needs to be optimized. We are extending this to consider measured packet loss rates and bandwidth availability. Finally, we are studying how support for error, flow, and congestion control functionality can be added in End System Multicast.

8. REFERENCES
[1] E. Bommaiah, A. McAuley, R. Talpade, and M. Liu. AMRoute: Adhoc multicast routing protocol. Internet Draft, Internet Engineering Task Force, August 1998.
[2] S. Casner and S. Deering. First IETF internet audiocast. ACM Computer Communication Review, pages 92-97, 1992.
[3] Y. Chawathe, S. McCanne, and E. A. Brewer. An architecture for internet content distribution as an infrastructure service, February 2000. Unpublished work.
[4] S. Deering. Multicast routing in internetworks and extended LANs. In Proceedings of the ACM SIGCOMM 88, pages 55-64, Stanford, CA, August 1988.
[5] C. Faloutsos, M. Faloutsos, and P. Faloutsos. On power-law relationships of the internet topology. In Proceedings of ACM Sigcomm, August 1999.
[6] National Laboratory for Applied Network Research. Routing data. http://moat.nlanr.net/Routing/rawdata/.
[7] Cooperative Association for Internet Data Analysis. Mapnet project. http://www.caida.org/Tools/Mapnet/Data/.
[8] P. Francis. Yallcast: Extending the internet multicast architecture. http://www.yallcast.com, September 1999.
[9] H. W. Holbrook and D. R. Cheriton. IP multicast channels: EXPRESS support for large-scale single-source applications. In Proceedings of the ACM SIGCOMM 99, August 1999.
[10] IPv6 backbone. http://www.6bone.org/.
[11] G. Kortsarz and D. Peleg. Generating low-degree 2-spanners. SIAM Journal on Computing, Volume 27, Number 5.
[12] B. N. Levine, D. B. Lavo, and J. J. Garcia-Luna-Aceves. The case for concurrent reliable multicasting using shared ACK trees. In Proceedings of ACM Multimedia'96, November 1996.
[13] J. Liebeherr and B. S. Sethi. A scalable control topology for multicast communications. In Proceedings of IEEE Infocom, April 1998.
[14] S. Michel, K. Nguyen, A. Rozenstein, L. Zhang, S. Floyd, and V. Jacobson. Adaptive web caching: towards a new global caching architecture. Computer Networks and ISDN Systems, November 1998.
[15] R. Perlman, C. Lee, T. Ballardie, J. Crowcroft, Z. Wang, T. Maufer, C. Diot, J. Thoo, and M. Green. Simple multicast: A design for simple, low-overhead multicast. Internet Draft, Internet Engineering Task Force, March 1999. Work in progress.
[16] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, March 1995.
[17] J. Saltzer, D. Reed, and D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):195-206, 1984.
[18] S. Savage, A. Collins, E. Hoffman, J. Snell, and T. Anderson. The end-to-end effects of internet path selection. In Proceedings of ACM Sigcomm, August 1999.
[19] F. B. Schneider. Byzantine generals in action: Implementing fail-stop processors. ACM Transactions on Computer Systems, 2(2), pages 145-154, 1984.
[20] K. Sripanidkulchai, A. Myers, and H. Zhang. A third-party value-added network service approach to reliable multicast. In Proceedings of ACM Sigmetrics, August 1999.
[21] I. Stoica, T. S. E. Ng, and H. Zhang. REUNITE: A recursive unicast approach to multicast. In Proceedings of IEEE INFOCOM'00, Tel-Aviv, Israel, March 2000.
[22] R. X. Xu, A. C. Myers, H. Zhang, and R. Yavatkar. Resilient multicast support for continuous-media applications. In Proceedings of NOSSDAV'97, May 1997.
[23] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee. How to model an internetwork. In Proceedings of IEEE Infocom, March 1996.