Understanding LLM GPU Clusters, Fabrics, and Traffic for Networkers – Part 2

Unravel the complexities of LLM GPU traffic patterns and learn how cutting-edge network topologies are transforming AI training efficiency.

tl;dr: This blog explores GPU-to-GPU traffic in LLM training clusters, highlighting ultra-high bandwidth needs, synchronized bursty data flows, and challenges like latency, elephant flows, and low entropy causing network inefficiencies. Emerging topologies like Rail-Optimized Fabrics are key to addressing these demands for scalable AI workloads.


This is the second part of a blog series capturing the general network traffic patterns in LLM GPU training clusters. Part 1 focused on providing a brief background on GPU training clusters. This second part explains GPU-to-GPU traffic patterns, the technologies and techniques involved, and the unique network traffic characteristics these patterns drive.

This context paves the way for more in-depth discussions of bandwidth requirements, latency, and emerging network features, which we plan to cover in one or more future posts.

GPUs are not only compute-dense devices; they are communication- and storage-intensive as well. Storing and sharing data is integral to making efficient use of the phenomenal compute resources inside each GPU.

Ultra-High Bandwidth per GPU

Each GPU is connected to an internal dedicated bus (NVLink, in the case of NVIDIA GPUs). This dedicated bus connects it to the other GPUs in the same node (domain). In the current Blackwell generation, each NVLink link uses four lanes of 224G SerDes for roughly 800 Gb/s, i.e., 100 GB/s, per link. Each B200 GPU has 18 of these links, for a total internal bus (NVLink) bandwidth of 1.8 TB/s.

For a quick comparison with the external network (the scale-out fabric): 1.8 TB/s total is 900 GB/s in each direction, or about 7.2 Tb/s, roughly the equivalent of eighteen 400 Gb/s ports. On one hand, this shows how data-intensive these GPUs are. On the other, it shows how quickly the external (scale-out) network has to advance to keep up. Until external network performance catches up, the internal inter-GPU bus within the same domain (server, node, or systems like the NVL72) will always be preferred. This clear preference brings significant changes to the network topologies used to connect GPUs; more on this in the coming section.
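As a sanity check on these numbers, here is a minimal back-of-the-envelope sketch in Python, assuming only the figures quoted above (18 NVLink links at roughly 800 Gb/s each, versus a 400 Gb/s scale-out NIC per GPU):

    # Back-of-the-envelope comparison of scale-up (NVLink) vs. scale-out (NIC) bandwidth.
    # Figures are the ones quoted above; adjust for other GPU generations.

    NVLINK_LINKS_PER_GPU = 18        # B200: 18 NVLink links per GPU
    NVLINK_GBPS_PER_LINK = 800       # ~800 Gb/s aggregate (100 GB/s) per link
    NIC_GBPS_PER_GPU = 400           # typical scale-out NIC today (per direction)

    nvlink_total_gbps = NVLINK_LINKS_PER_GPU * NVLINK_GBPS_PER_LINK  # 14,400 Gb/s aggregate
    nvlink_per_direction_gbps = nvlink_total_gbps / 2                # 7,200 Gb/s each way

    print(f"NVLink aggregate per GPU : {nvlink_total_gbps / 8 / 1000:.1f} TB/s")              # 1.8 TB/s
    print(f"Equivalent 400G ports    : {nvlink_per_direction_gbps / NIC_GBPS_PER_GPU:.0f}")   # ~18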

Diagram from “NVIDIA GB200 Interconnect Architecture Analysis” – by NADDOD Nov. 20, 2024.

Each GPU also connects through a second link to a different bus (PCIe), which leads to the network interface card (NIC). The PCIe/NIC path supports up to 400 Gb/s today, and NIC bandwidth is both expected and much needed to increase over the next few years.

Collective GPU Fabric Topologies

The emergence of new network topologies, such as rail and rail-optimized fabrics, is critically important for supporting collective operation libraries in GPU-centric environments, particularly those powering GenAI workloads. Collective operations, such as all-reduce, all-gather, and broadcast, require high-bandwidth, low-latency, lossless network paths to synchronize data and gradients efficiently across large numbers of GPUs. Traditional topologies such as the fat-tree may struggle to scale or to maintain these properties as cluster size increases, introducing delays and bottlenecks that hamper AI training and inference performance.

Rail and rail-optimized topologies are designed specifically to meet these demanding requirements. By structuring the network fabric to provide predictable, non-blocking connectivity, they enhance parallelism and minimize congestion. They enable more efficient communication between distributed GPUs, ensuring not only robust performance for collective operation libraries but also streamlined operations and management for large-scale AI deployments. This makes rail and rail-optimized topologies essential for modern GPU-based data centers tackling the complex collective communication patterns required by AI workloads.
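To make the idea concrete, here is a minimal sketch (not any vendor's actual wiring plan; node and rail counts are hypothetical) of how a rail-optimized fabric maps GPUs to leaf switches: GPU k on every server attaches to rail k, so the GPUs that exchange the most collective traffic sit one leaf hop apart on the same rail.

    # Minimal sketch of rail-optimized cabling: GPU index k on every node
    # plugs into rail (leaf switch) k. Sizes below are illustrative only.

    GPUS_PER_NODE = 8      # also the number of rails
    NUM_NODES = 16

    def rail_for(node: int, gpu: int) -> int:
        """In a rail-optimized fabric, the rail is simply the local GPU index."""
        return gpu

    # Build the wiring table: rail -> list of (node, gpu) ports attached to it.
    rails = {r: [] for r in range(GPUS_PER_NODE)}
    for node in range(NUM_NODES):
        for gpu in range(GPUS_PER_NODE):
            rails[rail_for(node, gpu)].append((node, gpu))

    # GPU 3 on node 0 and GPU 3 on node 9 share rail 3: one leaf hop apart.
    assert (0, 3) in rails[3] and (9, 3) in rails[3]
    print(f"Rail 3 connects {len(rails[3])} GPUs, one per node")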

Elephant Flows

AI models are trained on clusters of GPUs, which allows large datasets to be broken up across GPU servers, with each server handling a subset. Once a cluster of GPUs finishes processing a batch of data, it sends all of its output in a single burst to the next cluster. Each of these transfers is large enough to generate an "elephant flow" (commonly defined as a flow carrying more than 1 GB within about 10 seconds). The GPU servers in these fabrics connect to the network through very high-bandwidth network interface cards (NICs), ranging from 200 Gb/s up to 800 Gb/s, and those NICs run at essentially 100% utilization while data is being transmitted. A rough sizing sketch follows the examples below.

Examples of elephant flows in AI training include:

    • Distributing training data across multiple GPUs often involves large data transfers.
    • Syncing model parameters between GPUs, especially during distributed training, can result in significant data flows.
    • The exchange of gradients between GPUs during backpropagation is another source of large data transfers.
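As a rough, hypothetical sizing example (the model size, gradient precision, and NIC speed are assumptions for illustration, not measurements), consider the gradient exchange for a 7-billion-parameter model trained in data-parallel fashion:

    # Rough sizing of one gradient all-reduce for a data-parallel training step.
    # Model size, precision, and NIC speed below are illustrative assumptions.

    params = 7e9                 # 7B-parameter model (hypothetical)
    bytes_per_grad = 2           # bf16/fp16 gradients
    nic_gbps = 400               # per-GPU scale-out NIC

    grad_bytes = params * bytes_per_grad       # ~14 GB of gradients per step
    # A ring all-reduce moves roughly 2x the data volume across the network.
    wire_bytes = 2 * grad_bytes                # ~28 GB on the wire
    seconds_at_line_rate = wire_bytes * 8 / (nic_gbps * 1e9)

    print(f"Gradients per step : {grad_bytes / 1e9:.1f} GB")
    print(f"Bytes on the wire  : {wire_bytes / 1e9:.1f} GB (ring all-reduce approx.)")
    print(f"Time at 400 Gb/s   : {seconds_at_line_rate:.2f} s")

Even under these modest assumptions, a single training step pushes tens of gigabytes through each NIC in well under a second: a textbook elephant flow, repeated every step.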

Time-Synchronized Transmission (Bursty)

Non-AI workloads are typically asynchronous: end users issue database queries or requests to a web server, and each request is fulfilled on its own. AI training workloads, however, are synchronous, which means the clusters of GPUs must receive all of their data before they can start their own job. Output from previous steps, such as gradients and model parameters, becomes vital input to subsequent phases.

In AI clusters, collective operations libraries manage the data distribution and time the start of transmission for participating GPUs.
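As an illustration of this synchronization, here is a minimal PyTorch sketch (a toy, single-machine example using the CPU-friendly "gloo" backend so it runs anywhere; real GPU clusters would use the NCCL backend over the scale-out fabric) in which every rank contributes a tensor to an all-reduce and none can proceed until the collective completes:

    # Minimal sketch of a synchronized collective: every rank must call
    # all_reduce, and no rank proceeds until all contributions have arrived.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank: int, world_size: int):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # Each rank holds a "gradient" tensor; all_reduce sums them in place.
        grad = torch.full((4,), float(rank))
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # blocks until every rank contributes

        print(f"rank {rank} sees summed gradients {grad.tolist()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        world = 4
        mp.spawn(worker, args=(world,), nprocs=world)

Every rank blocks at the all_reduce call, so the slowest contributor (or the slowest network path) gates all of them, which is exactly why latency and tail latency matter so much in the next section.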

Latency and Tail Latency

Latency and tail latency are crucial concepts in GPU network fabrics, especially in the context of AI workloads. Latency refers to the average time taken for data packets to travel from the source to the destination. This metric is essential for understanding the baseline efficiency of a network, as it helps indicate the typical speed of data transmission.

Tail latency measures the delay experienced by the slowest packets to arrive, typically the worst few percent of the latency distribution. AI training workloads rely on collective communication libraries to distribute data and synchronize dataset transmissions between GPUs. In these collective environments, GPUs cannot start processing until every dataset has been completely delivered to its destination.

This makes tail latency particularly important in AI networking because it identifies potential bottlenecks and disruptions that could affect overall performance. In distributed AI tasks, a few packets delayed by high tail latency can hinder the entire process. Keeping tail latency low becomes critical for ensuring timely data synchronization across multiple GPUs.
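A toy simulation (with purely illustrative numbers) shows why the tail, not the average, governs collective completion time: a step finishes only when the slowest of N participants finishes, so the larger the cluster, the more likely at least one participant lands in the tail.

    # Toy model: each GPU's transfer time is mostly "fast" but occasionally
    # lands in the tail. A collective completes only when the slowest GPU does.
    # Numbers below are illustrative, not measurements.
    import random

    random.seed(0)

    def transfer_ms() -> float:
        # 99% of transfers take ~10 ms; 1% hit a 10x tail event (congestion, retries).
        return 10.0 if random.random() < 0.99 else 100.0

    def collective_ms(num_gpus: int) -> float:
        return max(transfer_ms() for _ in range(num_gpus))

    for n in (8, 64, 512):
        steps = [collective_ms(n) for _ in range(2000)]
        slowed = sum(1 for s in steps if s > 50) / len(steps)
        print(f"{n:4d} GPUs: {slowed:5.1%} of steps gated by a tail-latency event")

Even though only 1% of individual transfers are slow in this toy model, at a few hundred GPUs nearly every training step ends up gated by a tail event.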

Low Entropy

In GPU-to-GPU traffic, the source and destination IP addresses and port numbers involved in these large data transfers tend to be relatively consistent, which results in a low degree of entropy. This consistency is particularly evident in distributed AI/ML training workloads, where the GPUs in a cluster frequently sync their state using RDMA over Converged Ethernet (RoCEv2), encapsulated within standard UDP/IP packets.

When using Equal-Cost Multi-Path (ECMP) routing in such environments, low entropy can cause multiple traffic flows to be hashed onto the same link. The high-bandwidth flows typical of GPU clusters then pile onto the same path, leading to link congestion and inefficient network utilization. This happens because the ECMP algorithm does not take link utilization into account; it relies on static hashing of packet header fields. The resulting load imbalance can cause packet drops and reduced overall network performance, directly impacting the training performance of distributed AI/ML workloads.
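The sketch below (hypothetical addresses and a stand-in hash function; real switch ASICs use vendor-specific hashes) shows why low entropy defeats ECMP: with RoCEv2 the UDP destination port is fixed at 4791, and only a handful of source/destination pairs exist, so a static 5-tuple hash tends to concentrate the few elephant flows onto a subset of the available links.

    # Illustration of ECMP with low-entropy traffic: few distinct 5-tuples,
    # static hashing, and some uplinks left idle while others carry multiple
    # elephant flows. The hash is a stand-in; the addresses are hypothetical.
    import zlib
    from collections import Counter

    NUM_UPLINKS = 8

    def ecmp_link(src_ip, dst_ip, src_port, dst_port, proto="udp") -> int:
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        return zlib.crc32(key) % NUM_UPLINKS     # static hash, blind to link load

    # Eight GPU servers, each sending one elephant flow to its ring neighbor.
    # RoCEv2 fixes the UDP destination port at 4791; source ports barely vary.
    flows = [
        (f"10.0.0.{i}", f"10.0.0.{(i + 1) % 8}", 49152 + i, 4791)
        for i in range(8)
    ]

    placement = Counter(ecmp_link(*f) for f in flows)
    print("flows per uplink:", dict(placement))
    print("uplinks carrying no traffic:", NUM_UPLINKS - len(placement))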

About the Author: Sam Hassan

With over two decades of experience at the forefront of data networking, Sam has played a pivotal role in the design and presales architecture of enterprise-scale solutions that underpin today's most demanding workloads. His career spans the evolution of Ethernet, from the era when voice first arrived as an application, through the early adoption of HPC clusters, to the rise of AI-optimized data centers. Along the way he has built deep expertise in integrating cutting-edge networking technologies, orchestrating complex deployments, and advising Fortune 500 clients on strategies that maximize performance, security, and scalability. He has partnered with industry leaders to deliver robust solutions built on Ethernet and advanced cluster management platforms, and has been instrumental in helping organizations bridge the gap between legacy infrastructures and next-generation, AI-ready environments.

Today, Sam's focus is the intersection of GenAI fabrics and GPU traffic management: designing and deploying end-to-end infrastructure that accelerates artificial intelligence initiatives. Drawing on an extensive background in AI networking and large-scale GPU cluster integration, he specializes in optimizing fabric architectures for generative AI, ensuring seamless, high-bandwidth, low-latency data movement across hundreds of GPUs. His hands-on engagements include leading data center readiness assessments and architecting GPU-accelerated server networks that enable clients to unlock the full potential of next-generation AI and deep learning workloads.