
Saturday, February 1, 2025

AI Metrics

AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The dashboard shown above is from a simulated network of 1,000 switches, each with 48 access ports connected to a host. Activity occurs in a 256ms on / off cycle to emulate an AI training run. The metrics include:

  • Total Traffic - Total traffic entering the fabric
  • Operations - Total RoCEv2 operations broken out by type
  • Core Link Traffic - Histogram of load on fabric links
  • Edge Link Traffic - Histogram of load on access ports
  • RDMA Operations - Total RDMA operations
  • RDMA Bytes - Average RDMA operation size
  • Credits - Average number of credits in RoCEv2 acknowledgements
  • Period - Detected period of compute / exchange activity on the fabric (in this case just over 0.5 seconds)
  • Congestion - Total ECN / CNP congestion messages
  • Errors - Total ingress / egress errors
  • Discards - Total ingress / egress discards
  • Drop Reasons - Packet drop reasons

Note: Clicking on peaks in the charts shows values at that time.

This article gives step-by-step instructions to run the AI Metrics application in a production environment and integrate the metrics with back-end Prometheus / Grafana dashboards. Please try AI Metrics out and share your comments so that the set of metrics can be refined and extended to address operational requirements.

docker run -p 8008:8008 -p 6343:6343/udp sflow/ai-metrics
Use Docker to run the pre-built sflow/ai-metrics image and access the web interface on port 8008.
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks and NVIDIA Cumulus Linux 5.11 for AI / ML for examples.
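For example, on Arista EOS switches a minimal sFlow configuration might look like the following sketch; the sampling rate and collector address are placeholders, so use the values from the recommended settings linked above:

sflow sample 40000
sflow polling-interval 20
sflow destination 192.0.2.1
sflow run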

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real-time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport Headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include the packet header, location, and reason for every dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox; a minimal sketch of the file is shown below.
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to find the access switch ports where network addresses are attached.
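As a rough illustration, the core of the topology file is a set of links, each identifying the two switch / port pairs it connects. The node and port names below are placeholders; see the Topology documentation for the full schema:

{
  "links": {
    "link1": {"node1": "leaf1", "port1": "swp49", "node2": "spine1", "port2": "swp1"},
    "link2": {"node1": "leaf2", "port1": "swp49", "node2": "spine1", "port2": "swp2"}
  }
}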

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see charts like the ones at the top of this article in the AI Metrics application Traffic tab.

The AI Metrics application exports the metrics shown above in Prometheus scrape format, see the Help tab for details. The Docker image also includes the sFlow-RT prometheus application that allows flow metrics to be created and exported, see Flow metrics with Prometheus and Grafana.
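As a sketch, a Prometheus scrape job for the application might look like the following; the metrics_path shown is only a placeholder, so use the exact path and parameters listed in the Help tab:

scrape_configs:
  - job_name: 'ai-metrics'
    metrics_path: '/prometheus/metrics/ALL/ALL/txt'
    static_configs:
      - targets: ['sflow-rt:8008']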

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.

Thursday, January 30, 2025

Replay pcap files using sflowtool


It can be very useful to capture sFlow telemetry from production networks so that it can be replayed later to perform off-line analysis, or to develop or evaluate sFlow collection tools.
sudo tcpdump -i any -s 0 -w sflow.pcap udp port 6343
Run the command above on the system you are using to collect sFlow data (if you aren't yet collecting sFlow, see Agents for suggested configuration settings). Type Control-C to end the capture after 5 to 10 minutes.  Copy the resulting sflow.pcap file to your laptop.
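Before copying the file, it can be worth a quick check that the capture actually contains sFlow datagrams, for example by reading the first few records back with tcpdump:

tcpdump -r sflow.pcap -c 5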
docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -P 1
Either compile the latest version of sflowtool or, as shown above, use Docker to run the pre-built sflow/sflowtool image. The -P (Playback) option replays the trace in real-time and displays the contents of each sFlow message. Running sflowtool using Docker provides additional examples, including converting the sFlow messages into JSON format for processing by a Python script. 
docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -f 192.168.4.198/6343 -P 1
The -f (forwarding) option takes an IP address and UDP port number as arguments, in this case the laptop's address, 192.168.4.198, and the standard sFlow port, 6343. Use this option to send the sFlow stream to sFlow analytics software.
For example, Deploy real-time network dashboards using Docker compose, describes how to quickly stand up an sFlow-RT, Prometheus, and Grafana analytics stack.

Monday, November 25, 2024

Topology aware flow analytics with NVIDIA NetQ

NVIDIA Cumulus Linux 5.11 for AI / ML describes how NVIDIA 400/800G Spectrum-X switches combined with the latest Cumulus Linux release deliver enhanced real-time telemetry that is particularly relevant to the AI / machine learning workloads that Spectrum-X switches are designed to handle.

This article shows how to extract Topology from an NVIDIA fabric in order to perform advanced fabric aware analytics, for example: detect flow collisions, trace flow paths, and de-duplicate traffic.

In this example, we will use NVIDIA NetQ, a highly scalable, modern network operations toolset that provides visibility, troubleshooting, and validation of your Cumulus and SONiC fabrics in real time.

netq show lldp json
For example, the NetQ Link Layer Discovery Protocol (LLDP) service simplifies the task of gathering neighbor data from switches in the network, and with the json option, makes the output easy to process with a Python script, for example, lldp-rt.py.

The simplest way to try sFlow-RT is to use the pre-built sflow/topology Docker image that packages sFlow-RT with additional applications that are useful for monitoring network topologies.

docker run -p 6343:6343/udp -p 8008:8008 sflow/topology
Configure Cumulus Linux to stream sFlow telemetry to sFlow-RT on UDP port 6343 (the default for sFlow).
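For example, a minimal NVUE configuration points the switch at the collector and enables the agent (the collector address is a placeholder; the full recommended settings are covered in NVIDIA Cumulus Linux 5.11 for AI / ML below):

nv set system sflow collector 192.0.2.1
nv set system sflow state enabled
nv config apply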
netq show lldp json | ./lldp-rt.py http://sflow-rt:8008/topology/json
The above command puts it all together, taking LLDP data from NetQ, converting it to sFlow-RT format, and posting the fabric topology to the sFlow-RT REST API.
Access the sFlow-RT web interface on port 8008. The Topology application includes a dashboard to verify that all the nodes and links in the topology are fully covered by the sFlow telemetry stream.
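You can also verify that the topology was accepted by reading it back through the REST API, assuming the collector is reachable as sflow-rt on port 8008 as in the command above:

curl http://sflow-rt:8008/topology/json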

Getting Started is a step by step guide to sFlow-RT applications, APIs, and community support.

Thursday, November 21, 2024

SC24 Over 10 Terabits per Second of WAN Traffic

The SC24 WAN Stress Test chart shows 10.3 Terabits per second of WAN traffic to The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference held this week in Atlanta. The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

SC24 Real-time RoCEv2 traffic visibility describes a demonstration of wide area network bulk data transmission using RDMA over Converged Ethernet (RoCEv2) flows typically seen in AI/ML data centers. In the example, sustained transmissions of 3.2 Tbits/second from sources geographically distributed around the United States were demonstrated.

SC24 Dropped packet visibility demonstration shows how the sFlow data model integrates three telemetry streams: counters, packet samples, and packet drop notifications. Each type of data is useful on its own, but together they provide the comprehensive network wide observability needed to drive automation. Real-time network visibility is particularly relevant to AI / ML data center networks, where congestion and dropped packets can result in serious performance degradation, and in this screen capture you can see multiple 400Gbits/s RoCEv2 flows.

SC24 SCinet traffic describes the architecture of the real-time monitoring system used to generate these charts. This chart shows that over 225 Petabytes of data were transferred during the show.

Wednesday, November 20, 2024

SC24 Real-time RoCEv2 traffic visibility

The chart shows eight 400Gbits/s RDMA over Converged Ethernet (RoCEv2) flows, typically seen in AI / ML data centers, totaling 3.2 Tbits/s. The unique challenge in this case is that flows are being routed from locations scattered around the United States to Atlanta, the location of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference.
SC24 Network Research Exhibit: The Resilient, Performant Networks and Distributed Processing demonstration aims to explore performance limitations and enablers for high volume bulk data transfers. Maintaining stable 400Gbits/s RoCEv2 connections over a wide area network is challenging since the packets have to traverse multiple links, avoid contention on links, and deal with buffering associated with transmission latency that is orders of magnitude higher than in the data center environments where RoCEv2 is typically deployed (one way latency across the USA is a minimum of 16 milliseconds due to the speed of light, and in practice quite a bit larger, whereas latency across a leaf and spine data center fabric is measured in microseconds).
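To put the 16 millisecond figure in context, a back of the envelope estimate, assuming a transcontinental path on the order of 4,800 km and propagation at the vacuum speed of light:

t_min = d / c ≈ (4.8 × 10^6 m) / (3 × 10^8 m/s) = 16 ms

Light in optical fiber travels at roughly two thirds of that speed and real circuits rarely follow the shortest path, which is why observed one-way latencies are noticeably higher.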
During setup it was noticed that total throughput with 8 concurrent flows was only 2.7Tbits/s (instead of the expected 3Tbits/s plus). Examining a real-time view of the throughput revealed that the two smallest flows, pink and light green at the top of the chart, were likely sharing a 400Gbits/s path since each flow was only transferring 200Gbits/s. The next flow down, light blue, appeared to be unstable and wasn't maintaining a constant 400Gbits/s.
Drilling down to look at the unstable flow showed that it was oscillating between 280Gbits/s and 400Gbits/s with a period of around 15 seconds. Further investigation revealed that the cause of the instability was a collision with a smaller flow on one of the links traversed by this flow. Once the flow collisions were resolved, all flows achieved close to 400Gbits/s, allowing the full 3Tbits/s transfer rate shown at the top of this article.
In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches, routers, and servers in the SCinet network and creates metrics to drive the real-time charts. Getting Started provides a quick introduction to deploying and using sFlow-RT for real-time network-wide flow analytics.
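As a rough sketch of how such a chart can be driven, a flow definition can be pushed to the sFlow-RT REST API; the flow name and keys below are illustrative, tracking bytes per source / destination address pair for RoCEv2 traffic (UDP port 4791):

curl -X PUT -H "Content-Type:application/json" \
  -d '{"keys":"ipsource,ipdestination","value":"bytes","filter":"udpdestinationport=4791"}' \
  http://localhost:8008/flow/roce/json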

Real-time network visibility is particularly relevant to AI / ML data center networks where congestion and dropped packets can result in serious performance degradation of machine learning tasks. Industry standard sFlow instrumentation is supported by the high speed 400/800G switches currently being deployed in AI / ML data centers. Enabling sFlow analytics provides the visibility needed to optimize performance.

Network visibility complements existing system management tools used to provide visibility into compute nodes, extending visibility into the fabric to directly observe problems in the network that can't easily be inferred from the compute nodes, and providing a second pair of eyes with an independent view of performance.

Finally, check out the SC24 Dropped packet visibility demonstration to learn about one of the newest developments in sFlow monitoring and see a live demonstration.

Tuesday, November 19, 2024

SC24 SCinet traffic

The real-time dashboard shows total network traffic at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference being held this week in Atlanta. The dashboard shows that 31 Petabytes of data have been transferred already and the conference has just started.

The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches, routers, and servers in the SCinet network and creates metrics to drive the real-time charts in the dashboard. Getting Started provides a quick introduction to deploying and using sFlow-RT for real-time network-wide flow analytics.

Finally, check out the SC24 Dropped packet visibility demonstration to learn about one of the newest developments in sFlow monitoring and see a live demonstration.

Monday, November 18, 2024

NVIDIA Cumulus Linux 5.11 for AI / ML


NVIDIA Cumulus Linux 5.11 includes major upgrades to the sFlow agent that fully exposes the advanced instrumentation built into NVIDIA Spectrum-X silicon. The enhanced real-time telemetry is particularly relevant to the AI / machine learning workloads that Spectrum-X is designed to handle.

With Cumulus Linux 5.11, the sFlow agent is easily configured using nvue commands, see Monitoring System Statistics and Network Traffic with sFlow:

nv set system sflow dropmon hw
nv set system sflow poll-interval 20
nv set system sflow collector 192.0.2.1
nv set system sflow state enabled
nv config apply

Note: In this case, enabling dropmon ensures that every dropped packet is captured, along with ingress port and drop reason (e.g. ttl_exceeded).

The same commands should be applied to every switch in the fabric for comprehensive visibility.
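For example, one simple way to push the settings from a management host is to loop over the switches with ssh; the switch names and user below are hypothetical:

# Hypothetical switch names; substitute the switches in your fabric
for sw in leaf01 leaf02 spine01 spine02; do
  ssh admin@$sw "nv set system sflow dropmon hw && nv set system sflow poll-interval 20 && nv set system sflow collector 192.0.2.1 && nv set system sflow state enabled && nv config apply"
done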

RDMA over Converged Ethernet (RoCE) describes how sFlow provides detailed visibility into RoCE flows used to move data between GPUs in an AI / ML data center fabric. The chart above from the RDMA network visibility demonstration at the SC22 conference shows that sFlow monitoring easily scales to the 400/800G speeds needed for machine learning.
In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from all the switches and servers in the fabric. Deploy real-time network dashboards using Docker compose describes how to quickly set up an sFlow-RT, Prometheus, Grafana stack to capture and display metrics. Dropped packet metrics with Prometheus and Grafana describes how to add a dashboard to display packet drop notifications.

If you are standing up a new NVIDIA Spectrum-X / Cumulus Linux network, enable sFlow on all the switches and set up an instance of sFlow-RT for real-time fabric-wide visibility into traffic flows and dropped packets. Real-time network visibility is particularly relevant to AI / ML data center networks where congestion and dropped packets can result in serious performance degradation.