1. Introduction
In Large Language Model (LLM) clusters, particularly for larger models, the efficiency of distributed AI training workloads is critical. Training efficiency is highly dependent on the network's capacity to transfer large volumes of data among GPUs and nodes. AI training requires the exchange of gradients and weights over high-bandwidth, low-latency network connections.
When a training workload is spread across multiple GPUs or nodes (domains), frequent communication is required to synchronize parameters and share data. If network bandwidth or latency is suboptimal, GPUs can remain idle waiting for these communications instead of performing computations.
As LLMs grow, encompassing billions of parameters, the need for robust network fabrics becomes essential. Lossy, congested or suboptimal networks can become substantial bottlenecks. “Time Spent in Network” can potentially consume up to 50% of workload time as indicated in Meta’s diagram below.

Training workload efficiency refers to how well computational resources are used to achieve high performance. One of the key factors impacting workload efficiency is time spent waiting on the network during distributed training. This "Time Spent in Network" directly reduces hardware utilization, increases overall training time, and elevates operational costs.
Minimizing the time spent in the network, through high-speed interconnects and improved communication protocols, enables more effective parallel computation and maximizes resource use, which significantly improves training efficiency. Hence, the network's performance directly influences overall computational throughput, effectively making the network an integral part of the compute capability in distributed training scenarios.
It is well understood that effective GPU-to-GPU communication demands high bandwidth and low latency, often achieved using technologies like RoCEv2. However, network architects also need to consider additional features such as dynamic load balancing and improved congestion control when designing solutions for GPU networks. Comparing different offerings requires a clear understanding of these features and how they address the specific challenges of distributed training use cases.
As we began drafting the blog, it became evident that a foundational understanding of GPU communications and their typical traffic patterns is essential. This background will allow readers to appreciate the challenges traditional network solutions face in meeting these demands and will highlight the need for enhanced features.
Our first post provides a brief background on GPU training clusters. The second post will explain GPU-to-GPU traffic patterns, the technologies and techniques involved, and the unique network traffic characteristics these patterns drive. This context will pave the way for more in-depth discussions of the required bandwidth, latency and emerging network features, which we plan to cover in one or more future posts.
1.1. Different Parallelism Techniques: Parallelism is a critical technique in the training of Large Language Models (LLMs) due to their vast computational requirements and memory constraints. Models with billions of parameters cannot be trained efficiently on a single GPU. Parallelism distributes the computational and memory workload across multiple devices, which significantly accelerates training time and enhances scalability. The primary types of parallelism include:
1.1.1. Data Parallelism (DP): Each GPU holds a complete copy of the model and processes different data batches. Gradients are synchronized during backpropagation.
1.1.2. Tensor Parallelism (TP): Layers or tensors of the model are partitioned across GPUs. This involves dividing large model layers at the tensor dimension level (see the sketch following the diagram below).
1.1.3. Pipeline Parallelism (PP): Model layers are divided into sequential stages distributed across different GPUs or nodes, forming a pipeline through which data flows.

Diagram source: video training by Sreeram Potluri, NVIS training, May 2024
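To make tensor parallelism (1.1.2 above) more concrete, here is a minimal sketch of a column-parallel linear layer using PyTorch's torch.distributed with the NCCL backend. Each rank owns a slice of the weight matrix, computes a partial output, and an all-gather reassembles the full activation. The shapes, launch command and backend choice are illustrative assumptions, not a production implementation.

```python
# Minimal tensor-parallel (column-split) linear layer sketch.
# Launch with: torchrun --nproc_per_node=<num_gpus> tp_sketch.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

in_features, out_features = 1024, 1024
assert out_features % world == 0
shard = out_features // world

# Each rank holds only its slice of the weight (out_features/world output columns).
w_shard = torch.randn(shard, in_features, device="cuda")
x = torch.randn(32, in_features, device="cuda")   # full input replicated on every rank

partial = x @ w_shard.t()                         # local partial output: (32, shard)

# All-gather the partial outputs so every rank holds the full (32, out_features) activation.
parts = [torch.empty_like(partial) for _ in range(world)]
dist.all_gather(parts, partial)
full_out = torch.cat(parts, dim=-1)

print(f"rank {rank} full output shape: {tuple(full_out.shape)}")
dist.destroy_process_group()
```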
Combining these parallelism techniques can improve the training and deployment of LLMs. This ensures that memory and computation resources are utilized efficiently. For instance, combining data, tensor and pipeline parallelism can handle immense parameter counts and deep architectures effectively.
Different parallelism techniques rely on collective operations to coordinate computation across multiple computational units like GPUs or nodes (as per the diagram above). For example, in data parallelism, each device processes different data batches, but the devices must later synchronize gradients through collective communications (such as all-reduce) before updating model weights to ensure consistency.
In tensor parallelism, tensors are split across devices requiring synchronization at every layer to exchange partial results, again using collective operations to maintain correctness and performance. Pipeline parallelism segments the model across devices and necessitates collective operations to transfer activations and gradients between pipeline stages. Thus, high-speed collective operations are essential to minimize communication overhead and successfully scale LLM training across distributed infrastructure.
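To illustrate how a parallelism technique leans on a collective, here is a hand-rolled data-parallel training step with PyTorch's torch.distributed and the NCCL backend: every rank processes its own batch, then gradients are averaged with all-reduce before the optimizer step. The toy model, batch size and launch command are assumptions for illustration; real workloads usually delegate this to DistributedDataParallel (see 2.3.3 below).

```python
# Hand-rolled data parallelism: explicit gradient all-reduce after backward().
# Launch with: torchrun --nproc_per_node=<num_gpus> dp_sketch.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")              # NCCL carries the GPU collectives
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()           # toy model; a full copy on every rank
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024, device="cuda")             # each rank sees a different data shard
loss = model(x).pow(2).mean()
loss.backward()

# Synchronize gradients across ranks (the all-reduce step described above).
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= dist.get_world_size()                  # average so every rank applies the same update

opt.step()
dist.destroy_process_group()
```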
2. Collective and One-to-One Operations in Deep Learning: Collective and point-to-point (one-to-one) operations are communication primitives, such as all-reduce, all-gather, all-to-all and broadcast, that coordinate and synchronize data (like gradients or parameters) across multiple GPUs or nodes during distributed training.
2.1. Examples of these operations:
All Gather
The Allgather operation is a collective communication primitive critical in distributed large language model (LLM) training. It enables each participating GPU to share its unique piece of data with every other GPU in the group.
During model-parallel training, each worker computes a portion of the model or processes a portion of the dataset. After completing its local computation (such as computing gradients or forward activations), it uses Allgather to exchange its results, so that all workers possess the complete set of data required for subsequent steps. This synchronized data exchange ensures model consistency and enables further computation (e.g., parameter updates) across distributed devices.
Efficient Allgather implementations are essential to maximize throughput and minimize communication overhead in large-scale LLM training.
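As a concrete sketch, the snippet below runs a single Allgather with torch.distributed on the NCCL backend; the tensor contents, shapes and launch command are illustrative assumptions.

```python
# Allgather sketch: each rank contributes one shard, every rank ends up with all shards.
# Launch with: torchrun --nproc_per_node=<num_gpus> allgather_sketch.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Each rank contributes one unique tensor (e.g., a shard of activations or parameters).
local = torch.full((4,), float(rank), device="cuda")

# After all_gather, every rank holds the full list of shards from all ranks.
gathered = [torch.empty_like(local) for _ in range(world)]
dist.all_gather(gathered, local)

print(f"rank {rank} gathered: {[int(t[0].item()) for t in gathered]}")
dist.destroy_process_group()
```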

All Reduce
AllReduce is a collective communication operation essential for distributed training of Large Language Models (LLMs). During each training step, gradients or parameters computed on multiple GPUs are summed across all nodes. The result is then broadcast back to each node, so they all hold the same synchronized, averaged (or otherwise reduced) values.
This process ensures model consistency across distributed processes and is critical for efficient scaling of LLM training on high-performance computing clusters. AllReduce minimizes communication overhead compared to naive approaches and is typically implemented using optimized libraries such as NCCL or MPI.
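The snippet below is a minimal AllReduce sketch with torch.distributed on the NCCL backend: every rank contributes a tensor and every rank receives the element-wise sum, divided to get an average as in gradient synchronization. The values and launch command are illustrative.

```python
# AllReduce sketch: sum a tensor across all ranks in place, then average it.
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_sketch.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

grad_like = torch.full((4,), float(rank + 1), device="cuda")  # stand-in for a local gradient
dist.all_reduce(grad_like, op=dist.ReduceOp.SUM)              # summed in place across all ranks
grad_like /= dist.get_world_size()                            # averaged, as in gradient sync

print(f"rank {rank} synchronized value: {grad_like[0].item()}")
dist.destroy_process_group()
```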

All-to-All
The All-to-All operation refers to a communication pattern in distributed computing where data or gradients must be exchanged between all participating GPUs during the training process. This operation is critical in scenarios like distributed model parallelism (expert parallelism) and fine-tuning.
Each device holds only a portion of the model or dataset. When synchronizing model weights, aggregating gradients or sharing activations, every node needs to send data to and receive data from every other node in the cluster, ensuring consistency and enabling collaborative computation. High-speed interconnects, such as InfiniBand or high-performance Ethernet, are vital for these operations, as inefficient or slow networking can create significant bottlenecks.

Efficient all-to-all operations help maximize resource utilization and minimize training time. Their implementation requires careful coordination to balance communication overhead and computational efficiency, especially as model size scales up. There are several other commonly used collective operations (e.g., ReduceScatter, Scatter, Reduce and Gather).
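As a sketch of this exchange pattern, the snippet below uses torch.distributed.all_to_all_single on the NCCL backend: each rank sends a distinct chunk to every peer and receives one chunk from every peer, roughly how tokens are routed between experts in expert parallelism. The chunk sizes and launch command are illustrative assumptions.

```python
# All-to-All sketch: every rank exchanges one chunk with every other rank.
# Launch with: torchrun --nproc_per_node=<num_gpus> alltoall_sketch.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# world chunks of 2 elements each; chunk i is destined for rank i.
send = torch.arange(world * 2, device="cuda", dtype=torch.float32) + rank * 100
recv = torch.empty_like(send)
dist.all_to_all_single(recv, send)   # recv now holds one chunk from every rank

print(f"rank {rank} received: {recv.tolist()}")
dist.destroy_process_group()
```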
2.2 Collective Communication Libraries in Deep Learning
Collective libraries are specialized software components that efficiently implement these collective and point-to-point operations by leveraging advanced networking technologies and hardware topologies. For example, the NVIDIA Collective Communications Library (NCCL, pronounced "nickel") is widely used in LLM training to accelerate and optimize these operations, ensuring that data flow among GPUs and nodes is both high-bandwidth and low-latency.
Other comparable communication libraries include AMD's RCCL (ROCm Communication Collectives Library) and Microsoft's MS-CCL (Microsoft Collective Communication Library), which targets heterogeneous, multivendor GPU environments.
The choice and effectiveness of a collective library directly impacts the scalability and speed of distributed model training: efficient collective operations implemented via advanced libraries like NCCL are crucial for maintaining synchronization and throughput in large, multinode LLM training clusters.
Here is an overview of other capabilities delivered by NCCL (this section is adapted from NVIDIA's NCCL 2.6.4 developer guide, overview section).
NCCL integrates with popular deep learning frameworks like PyTorch, TensorFlow and MXNet, providing optimized communication routines that accelerate deep learning training on multi-GPU systems.
2.2.1. Automatic Topology Detection: NCCL features advanced topology detection and internal tuning models that automatically select optimal parameters based on several factors. Some of those factors are the number of GPUs, NVLink domain size, NVLink speed and PCIe topology. This automatic detection ensures optimal communication performance.
Collective communication libraries such as NCCL automatically detect the GPU topology, including the physical arrangement of GPUs and the networking interconnects between them, such as PCIe, NVLink, NVSwitch and Mellanox InfiniBand.
This detection process involves querying the system for GPU locations and network adapters to construct an internal map of the topology. This mapping is essential for optimizing communication routes: selecting paths with the highest bandwidth and lowest latency to ensure efficient data synchronization and reduced training times.
- Collective Communication Libraries detect the arrangement of GPUs and the available interconnects (such as PCIe, NVLink, NVSwitch and InfiniBand) within and across nodes.
- By interrogating the system for GPU locations, network adapters and possible communication paths, these libraries construct an internal map of the topology.
- This topology awareness allows the library to optimize data movement by selecting the most efficient communication routes, minimizing latency and maximizing bandwidth.
- The library can also use advanced features like GPUDirect RDMA, which allows direct memory access from one GPU to another across nodes, bypassing the CPU for lower latency.
- NCCL’s topology-aware communication can adapt its algorithms based on detected network paths, preferring NVLink for intranode communication and InfiniBand for internode communication.
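To see this detection in action, the sketch below enables NCCL's debug logging and topology dump before initializing the communicator; NCCL then reports the NVLink, PCIe and NIC paths it discovered and writes the detected topology to an XML file. NCCL_DEBUG, NCCL_DEBUG_SUBSYS and NCCL_TOPO_DUMP_FILE are standard NCCL environment variables; the output path and launch command are illustrative.

```python
# Inspecting NCCL's automatic topology detection via its debug/dump environment variables.
# Launch with: torchrun --nproc_per_node=<num_gpus> topo_sketch.py
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")                         # print detection/tuning decisions
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")            # focus on topology/graph search
os.environ.setdefault("NCCL_TOPO_DUMP_FILE", "/tmp/nccl_topo.xml")  # dump the detected topology

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Any collective forces communicator setup, which triggers detection and logging.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)

dist.destroy_process_group()
```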
2.2.2. Fault Tolerance: NCCL provides a set of features that allow applications to recover from fatal errors such as a network failure, a node failure or a process failure. When such an error occurs, the application can call ncclCommAbort on the communicator to free all resources, then create a new communicator to continue.
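Below is a highly simplified sketch of this abort-and-recreate pattern expressed at the PyTorch level rather than through NCCL's C API. In practice, recovering from node or network failures also requires an elastic launcher (for example torchrun with restart support) and reloading from a checkpoint; this only shows the shape of the flow.

```python
# Simplified recovery sketch: tear down the failed communicator, then create a new one.
import torch
import torch.distributed as dist

def run_step(tensor):
    dist.all_reduce(tensor)          # may raise if the network or a peer fails

def train_with_recovery():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    t = torch.ones(1, device="cuda")
    try:
        run_step(t)
    except RuntimeError:
        # Analogue of ncclCommAbort: free the failed communicator and its resources...
        dist.destroy_process_group()
        # ...then create a new communicator and resume from the last checkpoint.
        dist.init_process_group(backend="nccl")
    dist.destroy_process_group()

if __name__ == "__main__":
    train_with_recovery()
```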
2.2.3. Quality of Service: Applications that overlap communication may benefit from network Quality of Service (QoS) features. NCCL allows an application to assign a traffic class (TC) to each communicator to identify the communication requirements of that communicator. All network operations on a communicator will use the assigned TC.
The meaning of a TC is specific to the network plugin in use by the communicator (for example, InfiniBand networks use the service level, while RoCE networks use the type-of-service field). TCs are defined by the system configuration, so applications must understand the TCs available on a system and their relative behavior to use them effectively.
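For a rough idea of how traffic can be steered onto a QoS class, the sketch below uses NCCL's environment knobs: NCCL_IB_SL selects the InfiniBand service level and NCCL_IB_TC the RoCE traffic class. These are process-wide settings rather than the per-communicator TC assignment described above, and the numeric values are placeholders that must match what the fabric is actually configured to honor.

```python
# Steering NCCL traffic via its IB/RoCE QoS environment variables (process-wide).
import os
os.environ.setdefault("NCCL_IB_SL", "3")     # InfiniBand service level (fabric-specific)
os.environ.setdefault("NCCL_IB_TC", "106")   # RoCE traffic class value (placeholder)

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # NCCL reads the variables at communicator setup
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

t = torch.ones(1, device="cuda")
dist.all_reduce(t)
dist.destroy_process_group()
```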
2.3. Where Does NCCL Run
2.3.1. Within Each Compute Node
NCCL operates on all GPUs installed in a single node. It uses high-speed interconnects like NVLink, PCIe or shared memory to handle communication between the GPUs of that node.
2.3.2. Across Multiple Compute Nodes
For distributed clusters, NCCL enables communication between GPUs located on different physical servers (nodes). To achieve this, NCCL leverages networking technologies such as InfiniBand, RoCE or Ethernet. This internode communication is crucial for scaling large clusters.
2.3.3. Integrated With Deep Learning Frameworks
Deep learning libraries (such as PyTorch’s DistributedDataParallel or TensorFlow’s MultiWorkerMirroredStrategy) invoke NCCL functions transparently. NCCL runs as part of the training processes launched on each GPU — there’s no separate NCCL service. Instead, its routines are called from the same Python (or C++) processes running the training code.
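For comparison with the hand-rolled example earlier, here is a minimal DistributedDataParallel sketch: the framework wraps the model and issues the NCCL all-reduce for gradient synchronization transparently during backward(). The toy model, data and launch command are illustrative.

```python
# DDP sketch: the framework calls NCCL collectives on our behalf.
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")           # selects NCCL for all collectives
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024, device="cuda")          # each rank trains on its own shard
loss = model(x).pow(2).mean()
loss.backward()                                   # DDP triggers the NCCL all-reduce here
opt.step()

dist.destroy_process_group()
```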
2.3.4. Process Context
NCCL runs in the context of the user’s training script or framework process on each GPU. Therefore, every worker process responsible for a model shard or data shard includes NCCL calls for the required collective communication.
Key Takeaways
- Network efficiency is critical for distributed training of large language models, as it determines how quickly data can be shared among GPUs and nodes. Poor network bandwidth or high latency causes idle GPUs and significantly increases training time and operational costs. Using high-speed, low-latency interconnects and advanced network features maximizes resource utilization and overall training performance.
- Efficient distributed LLM training depends on several types of parallelism—including data, tensor, and pipeline parallelism—which enable large models to scale across multiple GPUs by evenly distributing computation and memory loads.
- Key collective communication operations such as allreduce, allgather, and all-to-all synchronize gradients, activations, and model weights between GPUs; their efficiency is essential for maintaining consistency and performance during LLM training.
- High-speed interconnects (like InfiniBand, RoCEv2, NVLink, and NVSwitch) and robust networking technologies are required; poor bandwidth or high latency can lead to significant bottlenecks, with up to 50% of workload time potentially lost to waiting on network traffic.
- Specialized collective communication libraries—such as NVIDIA’s NCCL, AMD’s RCCL, and Microsoft’s MS-CCL—automatically detect GPU/topology arrangements and optimize data movement across clusters using features like topology awareness, GPUDirect RDMA, fault tolerance, and network QoS.
- The effectiveness of these libraries and network infrastructure directly impacts training throughput, scalability, and cost efficiency; optimal GPU cluster traffic flow is foundational for successful large-scale LLM deployments.


