
Session 25,26

• IMPROVING NETWORK BANDWIDTH: Large cloud data centers have significant investments in networking equipment and cabling infrastructure and, therefore, want to maximize the utilization of this equipment. But the underlying Ethernet layer 2 network has used the spanning tree protocol (STP) for many years to eliminate loops, which leaves a lot of unused link bandwidth that could otherwise improve overall network performance. Because of this, several industry standards have been developed to get around this limitation, including equal cost multipath (ECMP) routing, transparent interconnection of lots of links (TRILL), and shortest path bridging (SPB). This section provides an overview of these standards.
• Spanning tree: Ethernet started life as a simple architecture with transactions across a shared medium. When switches were introduced and these LANs became more complex, a method had to be developed to prevent packets from circulating through these networks in endless loops. To solve this problem, the IEEE developed STP, the IEEE 802.1D standard for networks connected using layer 2 bridges or switches. In a layer 2 network, a spanning tree selects a subset of links as active, with the rest disabled but available as backups if needed.
The STP operates by first selecting a root bridge using the lowest bridge priority and MAC address values. Next, all other nodes on the network find
the least cost path to the root bridge by using parameters such as number of hops and link bandwidth capability. Finally, all other paths are set to a
disabled mode. When a new bridge or switch is added to the network, the spanning tree algorithm is run again to determine a new set of active links
in the network. As networks grew larger, the time to converge on a new spanning tree became too long, so the rapid spanning tree protocol (RSTP)
algorithm was introduced as IEEE 802.1D-2004, which obsoleted the original STP algorithm. With the wide deployment of VLANs within Ethernet
networks, it became desirable to provide a separate spanning tree for each VLAN. Therefore, the multiple spanning tree protocol (MSTP) was
introduced as IEEE 802.1s but was later merged into IEEE 802.1Q-2005.
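As a concrete illustration of the election rule, the minimal C sketch below compares two bridge IDs the way STP does, treating the 16-bit priority as most significant and breaking ties with the lowest MAC address. It is a toy model of the comparison only, not a protocol implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* A bridge ID as used in STP: a 16-bit priority followed by a 48-bit MAC.
 * The bridge with the numerically lowest ID becomes the root. */
struct bridge_id {
    uint16_t priority;
    uint8_t  mac[6];
};

/* Returns <0 if a wins the root election, >0 if b wins, 0 if equal. */
static int bridge_id_cmp(const struct bridge_id *a, const struct bridge_id *b)
{
    if (a->priority != b->priority)
        return (a->priority < b->priority) ? -1 : 1;
    return memcmp(a->mac, b->mac, 6);  /* tie broken by lowest MAC */
}

int main(void)
{
    struct bridge_id a = { 32768, {0x00, 0x1b, 0x2c, 0x3d, 0x4e, 0x5f} };
    struct bridge_id b = { 4096,  {0x00, 0xaa, 0xbb, 0xcc, 0xdd, 0xee} };
    /* B wins: its lower priority dominates regardless of MAC address. */
    printf("root is %s\n", bridge_id_cmp(&a, &b) < 0 ? "A" : "B");
    return 0;
}
```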
• Equal cost multipath routing: Many data centers use some form of layer 3 routing or tunneling between layer 2 domains in the network. When routing data through layer 3 networks, bandwidth can be improved by using multiple paths. The industry has adopted ECMP routing as a way to implement this, as shown in Figure 5.8, where one router distributes traffic across three different paths; each router in the diagram could further distribute traffic across additional paths. Packets within a given flow should all take the same path: even though protocols such as Transmission Control Protocol/Internet Protocol (TCP/IP) can reorder packets at the network egress, this adds overhead due to differences in network latency and other issues. For this reason, ECMP implementations typically assign each flow to a path using a hash of packet header fields, as sketched below.
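The following minimal sketch illustrates the flow-to-path hashing idea. The `flow_key` fields and the FNV-1a hash are illustrative choices; real routers use hardware hash functions, but the principle that every packet of a flow maps to the same path index is the same.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 5-tuple identifying a flow. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

/* FNV-1a hash over a byte range. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const uint8_t *p = data;
    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Hash each field explicitly; identical flows always yield the same
 * index, so their packets stay in order on a single path. */
static unsigned ecmp_select_path(const struct flow_key *k, unsigned n_paths)
{
    uint32_t h = 2166136261u;
    h = fnv1a(h, &k->src_ip,   sizeof k->src_ip);
    h = fnv1a(h, &k->dst_ip,   sizeof k->dst_ip);
    h = fnv1a(h, &k->src_port, sizeof k->src_port);
    h = fnv1a(h, &k->dst_port, sizeof k->dst_port);
    h = fnv1a(h, &k->protocol, sizeof k->protocol);
    return h % n_paths;
}

int main(void)
{
    struct flow_key k = { 0x0a000001, 0x0a000002, 49152, 80, 6 };
    printf("flow assigned to path %u of 3\n", ecmp_select_path(&k, 3));
    return 0;
}
```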
• Shortest path bridging: MSTP allows a separate spanning tree to be created for each VLAN. The IEEE 802.1aq standard, called shortest path bridging, is a replacement for the spanning tree protocols and allows multiple paths through the network through the use of multiple different spanning trees. To do this, it relies on the capabilities developed for provider bridging (PB) and provider backbone bridging (PBB). Shortest path bridging VID (SPBV) adds an additional VLAN tag to the Ethernet frame that can identify up to 4096 paths through the network. An additional specification called shortest path bridging MAC (SPBM) can scale up to 16M identifiers, versus the 4096 limit of VLAN tags, by using MAC addresses to represent paths through the network. The IS-IS protocol is used to exchange information between switches, which maintain network state information. The ingress switches encapsulate frames with SPBV or SPBM headers and the egress switches remove the headers. Paths through the network can be selected using a hash algorithm, and all frames within a given flow take the same path.
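As a quick check of the scaling numbers above: the 12-bit VLAN ID gives SPBV 2^12 = 4096 identifiers, while the 16M figure for SPBM corresponds to a 24-bit identifier (2^24 ≈ 16.7M), the size of the I-SID field in the 802.1ah PBB encapsulation that SPBM builds on. The trivial sketch below just evaluates both.

```c
#include <stdio.h>

int main(void)
{
    /* SPBV identifies paths with the 12-bit VLAN ID of an 802.1Q tag. */
    unsigned spbv_ids = 1u << 12;   /* 4096 */
    /* SPBM builds on 802.1ah (PBB), whose 24-bit I-SID field yields
     * the "16M" figure quoted above. */
    unsigned spbm_ids = 1u << 24;   /* 16,777,216 */
    printf("SPBV: %u identifiers, SPBM: %u identifiers\n",
           spbv_ids, spbm_ids);
    return 0;
}
```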
• Transparent interconnection of lots of links: Another method to improve bandwidth utilization in data center networks is to use the TRILL standard developed by the IETF. The design goals for TRILL include:
• Minimal or no configuration required
• Load-splitting among multiple paths
• Routing loop mitigation (possibly through a TTL field)
• Support of multiple points of attachment
• Support for broadcast and multicast
• No significant service delay after attachment
• No less secure than existing bridged solutions
• The TRILL standard defines RBridges to tunnel packets through layer 2 networks using multiple paths. It uses a link state protocol that allows all RBridges to
learn about each other and the connectivity through the network. Both a TRILL header and an outer Ethernet header are added to the native Ethernet
frame (that contains an inner Ethernet header). The ingress RBridge encapsulates the frame and the egress RBridge de-encapsulates the frame as shown in
Figure 5.9. This figure shows an example network that includes RBridge components in which a packet is being forwarded from end station A (ES-A) to end
station B (ES-B). The Ingress RBridge encapsulates the packet and forwards it to the next RBridge using the MAC destination address of RBridge 2. The
standard bridges simply forward the packet using the outer Ethernet header. The Relay RBridge acts somewhat like a router, using the egress nickname to
determine the next hop for the packet. The Egress RBridge de-encapsulates the packet, which is forwarded through the standard network using the
original inner Ethernet header.
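For reference, the sketch below lays out the 6-byte TRILL header defined in RFC 6325: version, multi-destination flag, options length, hop count, and the egress and ingress RBridge nicknames the relay RBridges forward on. It is illustrative only; on the wire all fields are carried in network byte order.

```c
#include <stdint.h>
#include <stdio.h>

/* The 6-byte TRILL header (RFC 6325), with the first word holding:
 *   V (2 bits) | R (2) | M (1) | Op-Length (5) | Hop Count (6)
 * followed by the two 16-bit RBridge nicknames. */
struct trill_header {
    uint16_t flags;        /* version, reserved, multi-dest, op-len, hops */
    uint16_t egress_nick;  /* egress RBridge: selects the path/next hop */
    uint16_t ingress_nick; /* the encapsulating (ingress) RBridge */
};

static uint16_t trill_flags(unsigned version, unsigned multi_dest,
                            unsigned op_len, unsigned hop_count)
{
    return (uint16_t)(((version & 0x3u) << 14) |
                      ((multi_dest & 0x1u) << 11) |
                      ((op_len & 0x1fu) << 6) |
                      (hop_count & 0x3fu));
}

int main(void)
{
    /* Ingress RBridge 1 encapsulates a frame toward egress RBridge 2;
     * each relay RBridge decrements the hop count, mitigating loops. */
    struct trill_header h = {
        .flags        = trill_flags(0, 0, 0, 10),
        .egress_nick  = 2,
        .ingress_nick = 1,
    };
    printf("TRILL flags=0x%04x egress=%u ingress=%u\n",
           h.flags, h.egress_nick, h.ingress_nick);
    return 0;
}
```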
REMOTE DIRECT MEMORY ACCESS
• One type of data transport technology that has several applications in the data center and can improve data center performance is remote direct memory access (RDMA). This evolved from direct memory access (DMA), which is used to improve performance in CPU subsystems. Many CPUs, NPUs, and microcontrollers contain DMA engines that can offload the task of moving data in and out of the CPU's main memory without involving the operating system (OS). One example of this is in the network switch. Let's say that the switch chip receives a management packet that needs to be processed by the attached control plane CPU. The switch can send an interrupt to the CPU, which would then need to access the packet data from the switch memory and transfer it to its own main memory.
• Executing this transfer as part of the CPU's main processing thread would slow the execution of other critical tasks. Instead, the CPU can send the starting address and packet length information to an embedded DMA engine, which executes the data transfer from the switch memory to main memory in the background, with little impact on the processor; a sketch of such a descriptor handoff is shown below. Remote DMA (RDMA) is very similar, but instead of automating memory transfers between devices on a given circuit board, the memory transfers are executed across the network.
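The sketch below illustrates the handoff described above. The `dma_descriptor` layout, the addresses, and the completion flag are hypothetical: every DMA engine defines its own descriptor format, and a real driver would write the descriptor to a device register or ring buffer.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical descriptor a CPU might hand to an embedded DMA engine:
 * the engine copies 'length' bytes from 'src' to 'dst' in the
 * background and raises an interrupt when done. */
struct dma_descriptor {
    uint64_t src;      /* e.g., packet buffer in switch memory */
    uint64_t dst;      /* destination in CPU main memory */
    uint32_t length;   /* packet length in bytes */
    uint32_t flags;    /* e.g., bit 0 = interrupt on completion */
};

int main(void)
{
    struct dma_descriptor d = {
        .src = 0x80000000, .dst = 0x10004000,
        .length = 1514, .flags = 1,
    };
    /* Here we only print the handoff; the CPU is then free to run
     * other tasks while the engine moves the data. */
    printf("DMA: copy %u bytes from 0x%llx to 0x%llx\n",
           d.length, (unsigned long long)d.src, (unsigned long long)d.dst);
    return 0;
}
```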
• For example, a server on one side of the network can efficiently transfer data from its own application memory into the application memory of a server on the other side of the network without the need to write the data into kernel memory first. This is known as kernel bypass. This zero-copy approach is much more efficient than requiring multiple data transfers on each side of the network. In most cases, RDMA uses an offload engine located in the network controller attached to the server CPU, where the network controller can directly read and write the CPU's application memory. Today, there are two main competing industry standards for RDMA: the Internet Wide-Area RDMA Protocol (iWARP) and RDMA over Converged Ethernet (RoCE).
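Both iWARP and RoCE adapters are commonly programmed through the verbs API (libibverbs on Linux). The minimal sketch below shows the buffer registration step that makes zero-copy possible: once registered, the NIC can read and write the application buffer directly, and a remote peer can target it by its rkey. Error handling is abbreviated, and the program assumes an RDMA-capable NIC (link with -libverbs).

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>  /* libibverbs, used by iWARP and RoCE NICs */

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register an application buffer so the NIC can access it directly;
     * this is the zero-copy, kernel-bypass step described above. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "registration failed\n"); return 1; }

    /* A remote peer can now target this memory using mr->rkey; data
     * moves NIC-to-NIC without transiting kernel buffers on either side. */
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```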
• Data center requirements: Data center performance can be critical in applications such as compute clustering, big data analytics, and financial services. RDMA technology can dramatically reduce latencies in these applications by eliminating the extra memory transfers required between kernel memory and application memory. The demands placed on data center networks are also changing as the amount of server-to-server (east-west) traffic grows. Some of the changes in data centers that are driving the adoption of RDMA include:
• Web applications that can spawn hundreds of server-to-server workflows, with each
workflow requiring a rapid response in order to maintain a good user experience.
• Low-latency server-to-server transactions that are required in financial trading and big
data applications.
• Internet Wide-Area RDMA Protocol: In 2004, Adaptec®, Broadcom®, Cisco, Dell®, EMC®, Hewlett-Packard, IBM, Intel, Microsoft, and Network Appliance® formed the RDMA Consortium. Their goal was to minimize the impact of TCP/IP overhead, context switching, and multiple buffer copies when transferring data across Ethernet networks. Today, several network adapters are available that conform to the iWARP standard. They use offload engines to move TCP/IP processing out of software, reducing CPU processing overhead.
• Another source of overhead comes from applications sending commands to the network adapter, causing expensive context switching in the OS. iWARP solves this by providing OS bypass, allowing an application running in user space to issue commands directly to the network adapter. Another major area of improvement for iWARP comes from reducing the number of buffer copies required when data moves from the network into an application buffer. Figure 5.10 shows the multiple buffer copies required when using a traditional network adapter versus the single buffer copy used with RDMA protocols such as iWARP. Companies like Intel also provide direct data placement technology, which places data in an application buffer residing in the CPU cache memory to further improve performance.
• RDMA over Converged Ethernet: After the iWARP spec was developed, several companies continued to look for better ways to support RDMA. Although iWARP can run over any TCP/IP network, the TCP/IP protocol adds complexity to network adapters, including functions such as TCP checksum processing, TCP segmentation, and header split. This additional processing can also lead to higher latency. The InfiniBand Trade Association (IBTA) already had a layer 2 RDMA over InfiniBand (IB) specification, so they created a new specification that kept the IB transport and network layers intact but swapped out the IB link layer for an Ethernet link layer. They called the new specification RoCE.
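The layer swap can be pictured as a frame layout. The sketch below is purely illustrative, recording the layering just described along with the EtherType registered for RoCE (0x8915).

```c
#include <stdio.h>

/* RoCE (v1) keeps the IB network and transport headers and carries
 * them directly over Ethernet:
 *
 *   Ethernet header | IB GRH (network) | IB BTH (transport) | payload
 *
 * 0x8915 is the EtherType registered for RoCE. */
#define ETHERTYPE_ROCE 0x8915

int main(void)
{
    printf("RoCE EtherType: 0x%04x\n", ETHERTYPE_ROCE);
    return 0;
}
```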
• There are several applications for RoCE where latency is important, such as high performance computing (HPC) and financial trading, but because RoCE is limited to layer 2, these applications must be confined to a single Ethernet subnet. Also, RoCE requires that all of the Ethernet switches in the subnet support the data center bridging (DCB) standards, which not all legacy networks do.
