Figure 1: Examples to illustrate IP Multicast, naive unicast and End System Multicast
multicast. It abstracts the physical topology as a Complete Virtual Graph (CVG) as shown in Figure 1(c). Further, it tries to construct a spanning tree of the CVG along which A could send data to other recipients. In particular, this scheme could degenerate to naive unicast transmission as shown in Figure 1(d). Figure 1(e) depicts how naive unicast transmission maps onto the underlying physical network. It is seen that links R1-R2 and A-R1 carry 2 and 3 copies of a transmission by A respectively. We refer to the number of identical copies of a packet carried by a physical link as the stress of a physical link. Thus, unicast in general leads to a high stress on the link nearest the source.

We could build smarter spanning trees of the CVG, such as the one shown in Figure 1(f). Figure 1(g) depicts how this tree maps onto the underlying physical network. It is seen that all physical links have a stress of at most 2. Not only does this tree reduce the worst case physical link stress, but it also reduces the stress of the costlier link R1-R2 to 1. In order to capture this, we introduce the notion of resource usage. We define resource usage as $\sum_{i=1}^{L} d_i s_i$, where L is the number of links active in data transmission, d_i is the delay of link i and s_i is the stress of link i. The resource usage is a metric of the network resources consumed in the process of data delivery to all receivers. Implicit here is the assumption that links with higher delay tend to be associated with higher cost. The resource usage might be computed to be 30 in the case of transmission by DVMRP, 57 for naive unicast and 32 for the smarter tree, shown in Figure 1(b), Figure 1(d) and Figure 1(f) respectively.

While the spanning tree of Figure 1(f) improves link stress and resource usage as compared to naive unicast transmission, it increases delay from the source to some of the recipients. Thus, the delay from A to D has increased to 29 from 27. We refer to the ratio of the delay between two members along the overlay to the unicast delay between them as the Relative Delay Penalty (RDP). Thus, <A,D> has an RDP of 29/27, while <A,B> and <A,C> have an RDP of 1.

The out-degree of a node in the overlay tree is its number of children in the tree. Thus, the out-degree of A in the smarter tree is 2, while it is 3 in naive unicast. The out-degree of a node in the overlay spanning tree directly impacts the stress of the links close to the node.

3. NARADA DESIGN
In this section, we present Narada, a protocol we designed that implements End System Multicast. In designing Narada, we have the following objectives in mind:

Self-organizing: The construction of the end system overlay must take place in a fully distributed fashion and must be robust to dynamic changes in group membership.

Overlay efficiency: The tree constructed must ideally have low stress, low RDP and low resource usage. Further, the out-degree of each member in the overlay must reflect the bandwidth of its connection to the Internet.

Self-improving in an incremental fashion: The overlay construction must include mechanisms by which end systems gather network information in a scalable fashion. The protocol must allow the overlay to incrementally evolve into a better structure as more information becomes available.

There are two basic methods for constructing overlay spanning trees for data delivery. A first approach is to construct the tree directly; that is, members explicitly select their parents from among the members they know [8]. Narada, however, constructs trees in a two-step process. First, it constructs a richer connected graph that we term a mesh. The mesh could in general be an arbitrary connected subgraph of the CVG (though Narada tries to ensure that the mesh has desirable performance properties, as discussed later). In the second step, Narada constructs (reverse) shortest path spanning trees of the mesh, each tree rooted at the corresponding source, using well known routing algorithms. Figure 1(h) presents an example mesh that Narada constructs for the physical topology shown in Figure 1(a), along with the shortest path spanning tree rooted at A. We have several reasons for this two-step process. First, group management functions are abstracted out and handled at the mesh, rather than replicated across multiple (per-source) trees. Second, distributed heuristics for repairing mesh partitions and for mesh optimization are greatly simplified, as loop avoidance is no longer a constraint. Third, we may leverage standard routing algorithms for the construction of data delivery trees. Finally, a mesh is more resilient to the failure of members than a tree, and heavyweight partition repair mechanisms are invoked less frequently.
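The stress, resource-usage and RDP definitions above are straightforward to compute mechanically. The sketch below uses hypothetical link delays and stress values (only the <A,D> delay pair 29/27 is taken from the text), so the resource-usage totals are illustrative rather than the Figure 1 numbers:

```python
# Sketch: computing the metrics defined above. The (delay, stress)
# pairs are hypothetical, not the exact Figure 1 topology.

def resource_usage(active_links):
    """active_links: list of (d_i, s_i) pairs, where d_i is the delay of
    physical link i and s_i its stress (number of identical copies it
    carries). Resource usage = sum of d_i * s_i over active links."""
    return sum(d * s for d, s in active_links)

def rdp(overlay_delay, unicast_delay):
    """Relative Delay Penalty between a pair of members."""
    return overlay_delay / unicast_delay

# Hypothetical naive unicast: an access link (delay 1) carries 3 copies,
# a costly backbone link (delay 25) carries 2 copies.
print(resource_usage([(1, 3), (25, 2)]))   # 53
# The <A,D> example from the text: overlay delay 29 vs. unicast delay 27.
print(round(rdp(29, 27), 3))               # 1.074
```

Note how a high-stress copy on the high-delay link dominates the total, which is exactly why the smarter tree in Figure 1(f) pushes the stress of the costly R1-R2 link down to 1.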
In our approach, there is no control over the resulting spanning trees for a given mesh. Hence, it becomes important to construct a good mesh so that good quality trees may be produced. In particular, we attempt to ensure the following properties: (i) the shortest path delay between any pair of members along the mesh is at most K times the unicast delay between them, where K is a small constant, and (ii) each member has a limited number of neighbors in the mesh, which does not exceed a given (per-member) bound chosen to reflect¹ the bandwidth of the member's connection to the Internet. Limiting the number of neighbors serves two purposes. First, it regulates the fanout of members in the spanning trees. Second, it controls the overhead of running routing algorithms on the mesh. The extreme case where the mesh is chosen to be the Complete Virtual Graph incurs all the overhead of routing with none of its benefits, as the resulting shortest path spanning trees degenerate to naive unicast transmission.

¹An ideal mesh is a "Degree-Bounded K-spanner" [11] of the Complete Virtual Graph. The problem of constructing Degree-Bounded K-spanners of a graph has been widely studied in centralized settings that assume complete information and is NP-complete even in such scenarios [11].

Narada has striking differences from self-configuring protocols developed in other contexts. First, Narada distinguishes itself from normal routing protocols in that it changes the very topology over which routing is performed. Second, most existing self-configuring protocols [12, 13, 14, 22] assume native IP Multicast support. Narada attempts self-configuration in the absence of a lower level multicast service, and this is fundamentally more challenging.

We explain the distributed algorithms that Narada uses to construct and maintain the mesh in Section 3.1. We present heuristics that Narada uses to improve mesh quality in Section 3.2. Narada runs a variant of standard distance vector algorithms on top of the mesh and uses well known algorithms to construct per-source (reverse) shortest path spanning trees for data delivery. We discuss this in Section 3.3.

3.1 Group Management
We have seen that Narada tries to construct a mesh among end systems participating in the multicast group. In this section, we present the mechanisms Narada uses to keep the mesh connected, to incorporate new members into the mesh, and to repair possible partitions that may be caused by members leaving the group or by member failure.

As we do not wish to rely on a single non-failing entity to keep track of group membership, the burden of group maintenance is shared jointly by all members. To achieve a high degree of robustness, our approach is to have every member maintain a list of all other members in the group. Since Narada is targeted towards small sized groups, maintaining the complete group membership list is not a major overhead. Every member's list needs to be updated when a new member joins or an existing member leaves. The challenge is to disseminate changes in group membership efficiently, especially in the absence of a multicast service provided by the lower layer. We tackle this by exploiting the mesh to propagate such information. However, this strategy is complicated by the fact that the mesh might itself become partitioned when a member leaves. To handle this, we require that each member periodically generate a refresh message with a monotonically increasing sequence number, which is disseminated along the mesh. Each member i keeps track of the following information for every other member k in the group: (i) member address k, (ii) the last sequence number s_ki that i knows k has issued, and (iii) the local time at i when i first received information that k issued s_ki. If member i has not received an update from member k for T_m time, then i assumes that k is either dead or potentially partitioned from i. It then initiates a set of actions to determine the existence of a partition and repair it if present, as discussed in Section 3.1.3.

    Let i receive a refresh message from neighbor j at local time t.
    Let <k, s_kj> be an entry in j's refresh message.
    if i does not have an entry for k, then i inserts the entry
        <k, s_kj, t> into its table
    else if i's entry for k is <k, s_ki, t_ki>, then
        if s_ki >= s_kj, i ignores the entry pertaining to k
        else i updates its entry for k to <k, s_kj, t>

Figure 2: Actions taken by a member i on receiving a refresh message from member j.

Propagation of refresh messages from every member along the mesh could potentially be quite expensive. Instead, we require that each member periodically exchange its knowledge of group membership with its neighbors in the mesh. A message from member i to a neighbor j contains a list of entries, one entry for each member k that i knows is part of the group. Each entry has the following fields: (i) member address k and (ii) the last sequence number s_ki that i knows k has issued. On receiving a message from a neighbor j, member i updates its table according to the pseudo code presented in Figure 2.

Finally, given that a distance vector routing algorithm is run on top of the mesh (Section 3.3), routing update messages exchanged between neighbors can include member sequence number information with minimum extra overhead.

3.1.1 Member Join
When a member wishes to join a group, Narada assumes that the member is able to get a list of group members by an out-of-band bootstrap mechanism. The list needs neither be complete nor accurate, but must contain at least one currently active group member. In this paper, we do not address the issue of the bootstrap mechanism. We believe that such a mechanism is application specific and our protocol is able to accommodate different ways of obtaining bootstrap information.

The joining member randomly selects a few group members from the list available to it and sends them messages requesting to be added as a neighbor. It repeats the process until it gets a response from some member, at which point it has successfully joined the group. Having joined, the member then starts exchanging refresh messages with its neighbors. The mechanisms described earlier will ensure that the newly joined member and the rest of the group learn about each other quickly.

3.1.2 Member Leave and Failure
When a member leaves a group, it notifies its neighbors, and this information is propagated to the rest of the group members along the mesh. In Section 3.3, we will describe our enhancement to distance vector routing that requires a leaving member to continue forwarding packets for some time to minimize transient packet loss.
Figure 3: A sample virtual topology (graphic omitted; it depicts members A through G connected by mesh links).

    EvaluateUtility(j) begin
        utility = 0
        for each member m (m != i) begin
            CL = current latency between i and m along mesh
            NL = new latency between i and m along mesh
                 if edge i-j were added
            if (NL < CL) then begin
                utility += (CL - NL) / CL
            end
        end
        return utility
    end
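The pseudo code above translates almost line for line into Python. The latency tables here are hypothetical inputs; in Narada they would come from i's routing state computed with and without the candidate edge i-j:

```python
# Sketch of EvaluateUtility: each member m contributes at most 1,
# namely i's relative decrease in delay to m if edge i-j were added.

def evaluate_utility(cur_latency, new_latency):
    """cur_latency[m]: current mesh latency from i to member m.
    new_latency[m]: mesh latency from i to m if edge i-j were added."""
    utility = 0.0
    for m, cl in cur_latency.items():
        nl = new_latency[m]
        if nl < cl:
            utility += (cl - nl) / cl
    return utility

cur = {"B": 40.0, "C": 10.0}
new = {"B": 20.0, "C": 10.0}
print(evaluate_utility(cur, new))   # 0.5: only B improves, by half
```

Since each term is a relative improvement in (0, 1], the total is bounded by the number of members i knows, matching the maximum value of n stated in the text.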
Figure 5: Algorithm member i uses in determining the utility of adding a link to j.

    Let Q be a queue of members for which i has stopped receiving
    sequence number updates for at least T_m time.
    Let T be the maximum time an entry may remain in Q.
    while(1) begin
        Update Q
        while(!Empty(Q) and
              Head(Q) is present in Q for >= T time) begin
            j = Dequeue(Q)
            Initiate probe cycle to determine if j is dead
            or to add a link to it.
        end
        if(!Empty(Q)) begin
            prob = Length(Q) / GroupSize
            With probability prob begin
                j = Dequeue(Q)
                Initiate probe cycle to determine if j is dead
                or to add a link to it.
            end
        end
        sleep(P).  // Sleep for P seconds
    end

Figure 4: Scheduling algorithm used by member i to repair mesh partition.

We also need to consider the difficult case of abrupt failure. In such a case, failure should be detected locally and propagated to the rest of the group. In this paper, we assume a fail-stop failure model [19], which means that once a member dies, it remains in that state, and the fact that the member is dead is detectable by other members. We explain the actions taken on member death with respect to Figure 3. This example depicts the mesh between group members at a given point in time. Assume that member C dies. Its neighbors in the mesh, A and G, stop receiving refresh messages from C. Each of them independently sends redundant probe messages to C, such that the probability that every probe message (or its reply) is lost is very small. If C does not respond to any probe message, then A and G assume C to be dead and propagate this information throughout the mesh.

Every member needs to retain entries in its group membership table for dead members. Otherwise, it is impossible to distinguish between a refresh announcing a new member and a refresh announcing stale information regarding a dead member. However, dead member information can be flushed after a sufficient amount of time.

…stopped receiving sequence number updates for at least T_m time. It runs a scheduling algorithm that periodically and probabilistically deletes a member from the head of the queue. The deleted member is probed, and it is either determined to be dead or a link is added to it. The scheduling algorithm is adjusted so that no entry remains in the queue for more than a bounded period of time. Further, the probability value is chosen carefully so that, in spite of several members simultaneously attempting to repair a partition, only a small number of new links are added. The algorithm is summarized in Figure 4.

3.2 Improving mesh quality
The constructed mesh can be quite sub-optimal, because (i) initial neighbor selection by a member joining the group is random, given limited availability of topology information at bootstrap, (ii) partition repair might aggressively add edges that are essential for the moment but not useful in the long run, (iii) group membership may change due to dynamic joins and leaves, and (iv) underlying network conditions, routing and load may vary. Narada allows for incremental improvement of mesh quality. Members probe each other at random, and new links may be added depending on the perceived gain in utility in doing so. Further, members continuously monitor the utility of existing links, and drop links perceived as not useful. This dynamic adding and dropping of links in the mesh distinguishes Narada from other topology maintenance protocols.

The issue then is the design of a utility function that reflects mesh quality. A good quality mesh must ensure that the shortest path delay between any pair of members along the mesh is comparable to the unicast delay between them. A member i computes the utility gain if a link is added to member j based on (i) the number of members to which j improves the routing delay of i and (ii) how significant this improvement in delay is. Figure 5 presents the pseudo code that i uses to compute the gain in utility if a link to member j is added. The utility can take a maximum value of n, where n is the number of group members i is aware of. Each member m can contribute a maximum of 1 to the utility, the actual contribution being i's relative decrease in delay to m if the link to j were added.
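The probabilistic half of the Figure 4 loop can be sketched as follows. We omit the forced probing of entries older than T for brevity, and the names (`repair_step`, `probe`) are ours:

```python
# Sketch: one scheduling iteration of partition repair at member i.
# If the queue of silent members is non-empty, probe its head with
# probability len(q)/group_size, as in Figure 4.

import random
from collections import deque

def repair_step(q, group_size, probe):
    """Probe (and dequeue) the head of q with probability len(q)/group_size.
    Returns the member probed, or None."""
    if q and random.random() < len(q) / group_size:
        member = q.popleft()
        probe(member)   # determine whether member is dead, or add a link
        return member
    return None

# With len(q) == group_size the probability is 1, so the head is probed.
q = deque(["X", "Y"])
probed = []
repair_step(q, group_size=2, probe=probed.append)
print(probed)   # ['X']
```

Scaling the probability by the queue length divided by the group size is what keeps the expected number of repair links small even when many members detect the same partition simultaneously.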
Figure 6: Algorithm i uses to determine consensus cost to a neighbor j.

…of previous probes.

When a member i probes a member j, j returns to i a copy of its routing table. i uses this information to compute the expected gain in utility if a link to j is added, as described in Figure 5. i decides to add a link to j if the expected utility gain exceeds a given threshold. The threshold value is a function of i's estimation of group size, and of the current and maximum fanout values of i and j respectively. Finally, i may also add a link to j if the physical delay between them is very low and the current overlay delay between them is very high.

Dropping of links: Ideally, the loss in utility if a link were to be dropped must exactly equal the gain in utility if the same link were immediately re-added. However, this requires estimating the relative increase in delay to a member if a link were dropped, and it is difficult to obtain such information. Instead, we overestimate the actual utility of a link by its cost. The cost of a link between i and j, in i's perception, is the number of group members for which i uses j as the next hop. Periodically, a member computes the consensus cost of its link to every neighbor using the algorithm shown in Figure 6. It then picks the neighbor with the lowest consensus cost and drops it if the consensus cost falls below a certain threshold. The threshold is again computed as a function of the member's estimation of group size and its current and maximum fanout. The consensus cost of a link represents the maximum of the cost of the link in each neighbor's perception. Yet, it might be computed locally, as the mesh runs a distance vector algorithm with path information.

Our heuristics for link-dropping have the following desirable properties:

Stability: A link that Narada drops is unlikely to be added again immediately. This is ensured by several factors: (i) the threshold for dropping a link is less than or equal to the threshold for adding a link, (ii) the utility of an existing link is overestimated by the cost metric, (iii) dropping of links is done considering the perception that both members have regarding link cost, and (iv) a link with small delay is not dropped.

Partition avoidance: We present an informal argument as to why our link dropping algorithm does not cause a partition, assuming steady state conditions and assuming multiple links are not dropped concurrently. Assume that member i drops neighbor j. This could result in at most two partitions. Assume the size of i's partition is S_i and the size…

3.3 Data Delivery
Narada runs a distance vector protocol on top of the mesh. In order to avoid the well-known count-to-infinity problems, it employs a strategy similar to BGP [16]. Each member not only maintains the routing cost to every other member, but also maintains the path that leads to such a cost. Further, routing updates between neighbors contain both the cost to the destination and the path that leads to such a cost. The per-source trees used for data delivery are constructed from the reverse shortest path between each recipient and the source, in identical fashion to DVMRP [4]. A member M that receives a packet from source S through a neighbor N forwards the packet only if N is the next hop on the shortest path from M to S. Further, M forwards the packet to all its neighbors who use M as the next hop to reach S.

The routing metric used in the distance vector protocol is the latency between neighbors. Each endpoint of a link independently estimates the latency of the link and could have different estimates. Using latency as the metric enables routing to adapt to dynamics in the underlying network. However, it also increases the probability of routing instability and oscillations. In our work, we assume that members use an exponential smoothing algorithm to measure latency. Further, the latency estimate is updated only at periodic intervals. The period length can be varied to trade off routing stability against reactivity to changing conditions.

A consequence of running a routing algorithm for data delivery is that there could be packet loss during transient conditions when member routing tables have not yet converged. In particular, there could be packet loss when a member leaves the group or when a link is dropped for performance reasons. To avoid this, data continues to be forwarded along old routes for enough time until routing tables converge. To achieve this, we introduce a new routing cost called Transient Forward (TF). TF is guaranteed to be larger than the cost of a path with a valid route, but smaller than infinite cost. A member M that leaves advertises a cost of TF for all members for which it had a valid route. Normal distance vector operations lead to members choosing alternate valid routes not involving M (as TF is guaranteed to be larger than the cost of any valid route). The leaving member continues to forward packets until it is no longer used by any neighbor as a next hop to reach any member, or until a certain time period expires.

4. SIMULATION EXPERIMENTS
In this section, we evaluate the properties of the overlay structure that Narada produces, and the overheads associated with the Narada protocol.
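The reverse shortest path forwarding rule described above has a compact expression in code. In this sketch, `next_hop(x, s)` stands in for a lookup in x's routing table toward source s, and the toy topology is hypothetical; we also skip the arrival neighbor explicitly, which the rule implies since a neighbor never routes toward S through the member it just delivered to:

```python
# Sketch of the reverse shortest path forwarding rule: member M,
# receiving a packet for source S from neighbor N, accepts it only if
# N is M's next hop toward S, then relays to neighbors whose next hop
# toward S is M.

def forward(member, source, came_from, next_hop, neighbors):
    """Return the list of neighbors the packet is relayed to ([] if dropped)."""
    if next_hop(member, source) != came_from:
        return []   # did not arrive along the reverse shortest path: drop
    return [n for n in neighbors(member)
            if n != came_from and next_hop(n, source) == member]

# Toy mesh: S - M - {A, B}; A routes toward S via M, B via some member C.
hop = {("M", "S"): "S", ("A", "S"): "M", ("B", "S"): "C"}
nbrs = {"M": ["S", "A", "B"]}
print(forward("M", "S", "S", lambda x, s: hop[(x, s)], nbrs.__getitem__))
# ['A']
```

The drop test is what suppresses duplicates on the mesh: a packet that arrives over any edge other than the reverse shortest path edge is discarded rather than re-flooded.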
Figure 7: Cumulative distribution of RDP shown at various snapshots of the simulation. The minutes denote the time after the last join. (plot omitted)

Figure 9: Overlay delay vs. physical delay. Each point denotes the existence of a pair of members with a given physical delay and overlay delay. (plot omitted)

Figure 11: No. of physical links with a given stress vs. stress for naive unicast, Narada and DVMRP. (plot omitted)

Figure 12: 90 percentile RDP vs. group size for topologies from three models. (plot omitted)

Figure 13: Worst case physical link stress vs. group size for topologies from three models. (plot omitted)

Figure 14: Effect of group size on NRU: Narada vs. naive unicast. (plot omitted)
4.6 Impact of factors on performance
We are interested in studying the variation in Narada's performance due to each of the following factors: (i) topology model, (ii) topology size, (iii) group size, and (iv) fanout range. Keeping the other factors fixed at their defaults, we study the influence of each individual factor on Narada's performance. By default, we used a Waxman topology with 1024 nodes and 3145 links, a group size of 128 and a fanout range of <3-6> for all group members. For all results in this section, we compute each data point by conducting 25 independent simulation experiments, and we plot the mean with 95% confidence intervals. Due to space constraints, we present plots of selected experiments and summarize the results of other experiments.

Topology Model and Group Size
We used a Waxman topology consisting of 1024 routers and 3145 links, an ASMap topology consisting of 1024 routers and 3037 links, and a Mapnet topology consisting of 1070 routers and 3170 links.

Figure 12 plots the variation of the 90 percentile RDP with group size for the three topologies. Each curve corresponds to one topology. All the curves are close to each other, indicating that the RDP is not sensitive to the choice of topology model. For all topologies and for a group size of 128 members, the 90 percentile RDP is less than 4. For each topology, the 90 percentile RDP increases with group size. This is because an increase of group size results in an increase of mesh diameter, and hence an increase of RDP.

Figure 13 plots the variation of worst case physical link stress against group size for the three topologies. Each curve corresponds to one topology. We observe that the curves are close to each other for small group sizes but seem to diverge for larger group sizes. Further, for all topologies, worst case stress increases with group size. Thus, for a group size of 64, the mean worst case stress is about 5-7 across the three topologies, while for a group size of 256, it is about 8-14. We believe this increase of stress with group size is an artifact of the small topologies in a simulation environment relative to the actual Internet backbone. We analyze this in detail in Section 4.8.

Figure 14 plots the normalized resource usage (NRU) against group size for the Waxman model alone. The lower and upper curves correspond to Narada and unicast respectively. First, Narada consumes less network resources than naive unicast, and this is consistent for all group sizes. For a group size of 128, the NRU for Narada is about 1.8, versus 2.2 for naive unicast. Second, NRU increases with group size. While these results imply a nearly 20% savings of network resources, we believe that the savings could be even more significant if members are clustered. We have repeated this study with the Mapnet and ASMap topologies and observe similar trends. For all topologies, the NRU is at most 1.8 for a group size of 128.

Topology Size
For each topology model, we generate topologies of sizes varying from about 64 nodes to about 1070 nodes and evaluate the impact on Narada's performance. Figure 15 plots the worst case physical link stress against topology size for
25 than 4. We hypothesize that RDP values might be lower
on the Internet, as Internet routing is policy based and sub-
Waxman
ASMap
Mapnet
optimal while the simulator assumes shortest path routing.
Worst Case Physical Link Stress
20
Preliminary Internet evaluation indicates that the 90 per-
centile RDP for a 13 member group can be as low as 1:5
15
(Section 5).
10
Across a range of topology models, Narada results in a
low worst case stress for small group sizes. For example, for
5 a group size of 16, the worst case stress is about 5. While
for larger group sizes, worst case stress may be higher, it is
still much lower than unicast. For example, for a group of
128 members, Narada reduces worst case stress by a factor
0
0 200 400 600 800 1000 1200
Topology Size
of 14 compared to unicast.
Figure 15: Worst case physical link stress vs. topol- We hypothesize that worst case stress on the Internet is
ogy size for topologies from three models lower than seen in simulations. The largest topologies that
each topology model. Across all topology models, we ob- we use in our simulations (around 1000 nodes) are still orders
serve that the worst case stress increases with decrease in of magnitude smaller than the Internet. Consequently, the
topology size. While the same general trend is observed for ratio of the group size to topology size, which we term den-
all topology models, it seems more pronounced for Waxman. sity, is much higher in simulations than in actual practice.
We analyze the signicance of this result in Section 4.8. Our simulations indicate that higher group density results
in higher worst case link stress. This can be deduced from
We have also studied the eect of topology size on RDP and Figures 13 and 15, where we observe that the worst case
NRU. Across all topology models, RDP appears largely un- stress increases with group size and decreases with topology
aected by topology size, while NRU decreases with increase size. We hypothesize that an increase in group density in-
in topology size. We omit the plots due to space constraint. creases the probability that an internal physical link could
be shared by multiple uncorrelated virtual links. The links
are uncorrelated2 in the sense that they connect distinct pairs
Fanout Range of end systems. This could increase worst case stress with
So far, we have assumed that each member strives to main- Narada because Narada is only able to regulate fanout of
tain <3-6> neighbors in the mesh. We have investigated members and consequently can only control stress of links
the eect of variation of fanout range on Narada's perfor- near member and not stress of internal physical links. For
mance. In summary, when the fanout range increases, mesh the range of group sizes we consider, we expect that the den-
diameter decreases and stress on links close to members sity ratio is much lower on the Internet and thus we expect
increases. Consequently, RDP decreases while worst case lower link stress.
stress increases. For a group of 128 members, as fanout
range increases from <2-4> to <8-16>, the 90 percentile Narada lowers resource usage by at least 20% compared
RDP decreases from about 5:5 to 2 while the worst case to unicast for a range of group sizes. We believe that if
physical stress increases from about 9 to 15. members are clustered, Narada can result in even larger im-
provement in resource usage.
4.7 Protocol Overhead
Narada incurs a protocol overhead for two reasons. First, 5. INTERNET EXPERIMENTS
members periodically exchange routing tables and control We have implemented Narada and we report preliminary
information between each other. Second, members estimate results obtained by conducting experiments on the Internet.
their delays to other members by probing them periodically.
We dene Protocol Overhead Ratio (POR) as the ratio of Our experiments consisted of 13 hosts distributed through-
bytes of non-data trac that enter the network to bytes of out the United States. In each experiment, every host was
data trac. While we do not present results, we nd that initially provided the list of names of all other hosts, and
POR increases linearly with group size. Further, we note all hosts join the group at approximately the same time.
that the protocol trac that Narada introduces is indepen- Each host attempted to maintain a fanout range of <2-4>.
dent of source data rate and thus the POR decreases with A host at CMU was designated the source and sent data
increase in data trac. For a group size of 128 members, at periodic intervals. At present, we assume that Narada
the POR is about 0:25 for a source data rate of 16 kilobits attempts to minimize the propagation delay between hosts
per second (kbps), and less than 0:04 for a source data rate on the mesh. Thus, each host assumes its physical delay to
of 128 kbps. For a 64 member group and a source data rate another host is the minimum delay observed across multiple
of 128 kbps, the POR is hardly 0:02. estimates. Each experiment was run for about 20 minutes.
4.8 Results Summary and Interpretation Figure 16 shows the overlay spanning tree that Narada pro-
In this section, we summarize key results that we have pre- duced, which was used to route data from the source (CMU1)
sented and attempt to explain the results. to other recipients, for a typical experiment. The links of
the tree are labeled by their delays (in milliseconds), as mea-
Across a range of topology models, Narada results in a 2 For example, consider the physical topology shown in Fig-
low RDP for small group sizes. For example, for a group ure 1(a) and assume that Narada constructs a mesh as shown
size of 16, the 90 percentile RDP is less than 2.5. Even for in Figure 1(h). Uncorrelated virtual links A ; C and B ; D
group sizes of 128 members, the 90 percentile RDP is less share the physical link R1 ; R2.
UWisc UMass 100
10 14
UIUC2
sured by one end point of the link. Here, UIUC1 and UIUC2 Figure 17: Cumulative distribution of RDP
Here, UIUC1 and UIUC2 belong to the same campus, as do Virginia1 and Virginia2. Berkeley and UCSB are closer together (on the West coast) as compared to all other hosts. Narada is able to ensure that only a single copy of data from CMU1 is delivered to UIUC, Virginia and the West Coast, and shorter links here are used for distribution of data within the sites.

Narada in fact constructs a mesh, and there are additional edges of the mesh not in the tree which we have omitted in Figure 16. Further, Narada may dynamically add and drop links to improve mesh quality. In this experiment, the mesh did not have links between Berkeley and UCSB, and between UIUC1 and UIUC2, in the first few minutes of the experiment. The self-improving nature of Narada resulted in the addition of these links and improved the efficiency of data delivery. Narada was also able to drop a link between GATech and Delaware which it identified as not useful in data delivery.

[Figure 17: Cumulative distribution of RDP.]

Figure 17 plots the cumulative distribution of the RDP of the mesh that Narada produced. The worst case RDP is only 2.6, and 90% of pairs of members have an RDP of at most 1.5. Further, we find that the delay along the mesh between any pair of members is at most 84 ms, while the worst case unicast delay between any pair of members can be as high as 64 ms (observed between UKY and UCSB). The physical delays used in RDP calculations were the delays estimated by an end point of the link. End points independently estimate link delay and thus might have different estimates. In our experiments, however, we find that link delay estimates by end points are generally consistent with each other and the estimates are very close to the ping times between the machines. Pairs with an RDP of 1 correspond to pairs of hosts with mesh links between them. Interestingly, we find a small number of pairs for which the RDP is slightly less than 1. There are two reasons for this effect. In some cases, this was due to minor inaccuracies in delay measurements. However, in some cases, the reason was more fundamental and due to the policy-based nature of Internet routing. This effect has been reported in [18].

In the future we plan to conduct larger scale Internet experiments with emphasis on studying the dynamics of Narada, transient behavior, and the effects of congestion and packet loss.

6. RELATED WORK
The works that come closest to the End System Multicast architecture we propose here, and which share much of our motivation, are Yallcast [8] and Scattercast [3]. Both projects challenge the appropriateness of using IP Multicast for all forms of group communication. Yallcast argues the need for hosts to auto-configure into tunneled topologies for efficient data dissemination. Yallcast emphasizes architectural aspects and has not yet considered the performance implications of using an overlay. In contrast, our work places a central emphasis on the performance tradeoffs associated with an overlay network, and this greatly influences the design of our self-organization protocol. Yallcast presents the design of a rendezvous service for the bootstrapping of a new member. The bootstrapping mechanism is however orthogonal to the issues we consider in this paper - while the mechanisms suggested in Yallcast could be used, we are open to other application specific and out-of-band bootstrap mechanisms. At the protocol level too, Yallcast and Narada have differences. Narada constructs an efficient mesh among participating members, and then constructs spanning trees of the mesh. In contrast, Yallcast constructs a spanning tree among participating members directly. We believe that a mesh-first approach helps avoid replicating group management functions across multiple (per-source) trees, simplifies overlay maintenance (as loop avoidance is no longer an issue), allows for leveraging standard routing algorithms, and provides a structure more resilient to the failure of members.

Scattercast argues for infrastructure support, where proxies deployed in the network run self-organization protocols on behalf of the clients, while clients subscribe to nearby proxies. Although we have not emphasized infrastructure support in this paper, we wish to make explicit that our notion of an end system is not restricted to clients of a multicast group, and includes network proxies. In particular, we believe that Narada can be used to build an overlay structure among proxies in a scattercast architecture, while the self-organization protocol proposed in Scattercast, called Gossamer, can be adapted to an End System Multicast architecture. At the protocol level, while Gossamer too adopts a mesh-first approach, it relies on centralized rendezvous points for repairing mesh partitions. Although this assumption significantly simplifies the partition recovery mechanisms in Gossamer, the members of the mesh could become partitioned from each other in the event of failure of all rendezvous points.
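The mesh-first approach discussed above can be illustrated with a toy example: once members maintain a mesh with measured link delays, a distribution tree for any source falls out of a standard shortest-path computation. The sketch below uses Dijkstra's algorithm over an invented four-member mesh; it is our own illustration, not code from Narada or Yallcast, and the members and delays are hypothetical.

```python
import heapq

# Toy illustration of the mesh-first idea: derive a per-source
# distribution tree from a shared mesh using a standard routing
# algorithm (Dijkstra's shortest paths over measured link delays).
# The mesh and delay values below are invented for illustration.

def shortest_path_tree(mesh, source):
    """Return {member: parent} for the minimum-delay tree rooted at source."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in mesh[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

# An invented 4-member mesh (symmetric links, delays in ms).
mesh = {
    "A": {"B": 10, "C": 25},
    "B": {"A": 10, "C": 5, "D": 30},
    "C": {"A": 25, "B": 5, "D": 12},
    "D": {"B": 30, "C": 12},
}
tree = shortest_path_tree(mesh, "A")
# Data from A reaches C via B (10 + 5 = 15 ms, versus 25 ms direct).
assert tree["C"] == "B" and tree["D"] == "C"
```

Because each source's tree is just a shortest-path tree over the shared mesh, group management state lives in the mesh rather than being replicated across per-source trees, which is the advantage the mesh-first comparison above points to.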
The MBone [2] and 6Bone [10] are popular existing examples of overlay networks. However, these are statically configured in a manual and ad hoc fashion. Narada, on the other hand, strives for a self-configuring and efficient overlay network. Internet routing protocols are self-configuring. The most striking difference between Narada and standard routing protocols is that while the latter work on a fixed physical topology, Narada alters the very topology over which it routes data. Routing protocols merely route around a link that has failed, and have no notion of dynamically adding or dropping links. Narada might dynamically add links to ensure connectivity of the virtual topology, and drop links it perceives as not useful.

Self-configuration has been proposed in other contexts. AMRoute [1] allows for robust IP Multicast in mobile ad hoc networks by exploiting user-multicast trees. Several reliable IP Multicast protocols [12, 13, 22] involve group members self-organizing into structures that help in data recovery. Adaptive Web Caching [14] is a self-organizing cache hierarchy. The key feature that distinguishes Narada from these protocols is that Narada does not assume a native multicast medium - AMRoute assumes a native wireless broadcast channel, while all other protocols assume the existence of IP Multicast. Self-configuration in the absence of such a native multicast medium is a much harder problem.

7. CONCLUSIONS
We have made two contributions in this paper. First, we have shown that, for small and sparse multicast groups, it is feasible to use an end system overlay approach to efficiently and robustly support all multicast related functionality, including membership management and packet replication. The shifting of multicast support from routers to end systems, while introducing some performance penalties, has the potential to address most problems associated with IP Multicast. We have shown, with both simulation and Internet experiments, that the performance penalties are low in the case of small and sparse groups. We believe that the potential benefits of repartitioning the multicast functionality between end systems and routers significantly outweigh the performance penalty incurred.

Second, we have proposed one of the first self-organizing and self-improving protocols that constructs an overlay network on top of a dynamic, unpredictable and heterogeneous Internet environment without relying on a native multicast medium. We also believe this is among the first works that attempt to systematically evaluate the performance of a self-organizing overlay network protocol and the tradeoffs in using overlay networks. In [20], it was argued that the overlay approach is a fundamental technique to incrementally deploy services and evolve networks. We believe that the techniques and insights developed in this paper are general and can be applied to overlay networks in contexts other than multicast.

In this paper, we emphasize the performance aspect of using an end system overlay approach to support multicast. We are currently extending this work in several dimensions. First, as mentioned in Section 1, the "end system" in the protocol can be either an application module, a host, or a shared proxy. We are exploring architectural issues involved in adapting Narada to each individual context. Second, in its current form, Narada considers latency between members as the sole criterion that needs to be optimized. We are extending this to consider measured packet loss rates and bandwidth availability. Finally, we are studying how support for error, flow, and congestion control functionality can be added in End System Multicast.

8. REFERENCES
[1] E. Bommaiah, A. McAuley, R. Talpade, and M. Liu. AMRoute: Adhoc multicast routing protocol. Internet Draft, Internet Engineering Task Force, August 1998.
[2] S. Casner and S. Deering. First IETF internet audiocast. ACM Computer Communication Review, pages 92-97, 1992.
[3] Y. Chawathe, S. McCanne, and E. A. Brewer. An architecture for internet content distribution as an infrastructure service, February 2000. Unpublished work.
[4] S. Deering. Multicast routing in internetworks and extended LANs. In Proceedings of the ACM SIGCOMM 88, pages 55-64, Stanford, CA, August 1988.
[5] C. Faloutsos, M. Faloutsos, and P. Faloutsos. On power-law relationships of the internet topology. In Proceedings of ACM Sigcomm, August 1999.
[6] National Laboratory for Applied Network Research. Routing data. http://moat.nlanr.net/Routing/rawdata/.
[7] Cooperative Association for Internet Data Analysis. Mapnet project. http://www.caida.org/Tools/Mapnet/Data/.
[8] P. Francis. Yallcast: Extending the internet multicast architecture. http://www.yallcast.com, September 1999.
[9] H. W. Holbrook and D. R. Cheriton. IP multicast channels: EXPRESS support for large-scale single-source applications. In Proceedings of the ACM SIGCOMM 99, August 1999.
[10] IPv6 backbone. http://www.6bone.org/.
[11] G. Kortsarz and D. Peleg. Generating low-degree 2-spanners. SIAM Journal on Computing, Volume 27, Number 5.
[12] B. N. Levine, D. B. Lavo, and J. J. Garcia-Luna-Aceves. The case for concurrent reliable multicasting using shared ACK trees. In Proceedings of ACM Multimedia'96, November 1996.
[13] J. Liebeherr and B. S. Sethi. A scalable control topology for multicast communications. In Proceedings of IEEE Infocom, April 1998.
[14] S. Michel, K. Nguyen, A. Rozenstein, L. Zhang, S. Floyd, and V. Jacobson. Adaptive web caching: towards a new global caching architecture. Computer Networks and ISDN Systems, November 1998.
[15] R. Perlman, C. Lee, T. Ballardie, J. Crowcroft, Z. Wang, T. Maufer, C. Diot, J. Thoo, and M. Green. Simple multicast: A design for simple, low-overhead multicast. Internet Draft, Internet Engineering Task Force, March 1999. Work in progress.
[16] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, March 1995.
[17] J. Saltzer, D. Reed, and D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):195-206, 1984.
[18] S. Savage, A. Collins, E. Hoffman, J. Snell, and T. Anderson. The end-to-end effects of internet path selection. In Proceedings of ACM Sigcomm, August 1999.
[19] F. B. Schneider. Byzantine generals in action: Implementing fail-stop processors. ACM Transactions on Computer Systems, 2(2), pages 145-154, 1984.
[20] K. Sripanidkulchai, A. Myers, and H. Zhang. A third-party value-added network service approach to reliable multicast. In Proceedings of ACM Sigmetrics, August 1999.
[21] I. Stoica, T. S. E. Ng, and H. Zhang. REUNITE: A recursive unicast approach to multicast. In Proceedings of IEEE INFOCOM'00, Tel-Aviv, Israel, March 2000.
[22] R. X. Xu, A. C. Myers, H. Zhang, and R. Yavatkar. Resilient multicast support for continuous-media applications. In Proceedings of NOSSDAV'97, May 1997.
[23] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee. How to model an internetwork. In Proceedings of IEEE Infocom, March 1996.