Distributed Training Strategies (Data vs. Model Parallelism)

Distributed training parallelizes deep learning models across multiple GPUs and nodes. Distributed Data Parallelism (DDP) replicates models to process data in parallel, while Model Parallelism splits parameters across devices to handle large models.

Distributed Data Parallelism (DDP)

DDP replicates the model on all available GPUs, splits the input batch, and synchronizes gradients using All-Reduce operations.

Data Partitioning and Synchronization

Distributed Data Parallelism (DDP) is the most common distributed training strategy. In DDP, the entire model is replicated on every available GPU. The global training dataset is partitioned into subsets, and each GPU receives a distinct shard of the batch (a micro-batch) using a distributed sampler. During the forward pass, each GPU independently calculates activations and loss for its micro-batch.

During the backward pass, the GPUs must synchronize their gradients before updating the weights. This is achieved using an All-Reduce communication algorithm (specifically Ring All-Reduce), which averages the gradients across all devices. Once the gradients are synchronized, each GPU executes the optimizer step, ensuring that the model weights remain identical across all replicas. DDP scales efficiently because communication occurs concurrently with the backward pass, minimizing synchronization delays.

PyTorch DDP Setup

This PyTorch script demonstrates how to initialize a distributed process group, wrap a model in DDP, and coordinate training across multiple GPUs:

<pre><code class="language-python">import os import torch import torch.nn as nn import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel as DDP def setup_distributed(rank, world_size): # Configure master address and port for process coordination os.environ['MASTER_ADDR'] = 'localhost' os.environ['MASTER_PORT'] = '12355' # Initialize the distributed process group using the NCCL backend (optimized for NVIDIA GPUs) dist.init_process_group(backend="nccl", rank=rank, world_size=world_size) torch.cuda.set_device(rank) def cleanup_distributed(): dist.destroy_process_group() def train_ddp_node(rank, world_size): setup_distributed(rank, world_size) # Move model to local GPU rank model = nn.Linear(10, 5).to(rank) # Wrap in DDP ddp_model = DDP(model, device_ids=[rank]) loss_fn = nn.MSELoss() optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01) # Generate dummy input on target rank GPU x = torch.randn(8, 10, device=rank) y = torch.randn(8, 5, device=rank) # Forward and backward pass out = ddp_model(x) loss = loss_fn(out, y) optimizer.zero_grad() loss.backward() # Triggers Ring All-Reduce gradient synchronization internally optimizer.step() cleanup_distributed()</pre>

Model Parallelism

Model Parallelism splits large networks across multiple devices using pipeline-level or tensor-level partitioning.

Pipeline Parallelism

When a model is too large to fit into the VRAM of a single GPU (such as modern LLMs with tens of billions of parameters), we must use Model Parallelism. The simplest approach is Pipeline Parallelism, where the model is split vertically along its layers. For example, layers 1-10 are assigned to GPU 0, layers 11-20 to GPU 1, and so on. During training, GPU 0 computes activations for a batch and passes the output to GPU 1. GPU 1 then computes its activations while GPU 0 sits idle waiting for the backward pass, a problem known as the bubble.

To minimize this idle time, pipeline frameworks (like GPipe) split the input batch into smaller micro-batches. GPU 0 processes the first micro-batch and passes it to GPU 1, immediately starting work on the second micro-batch. This pipelining overlaps computations across GPUs, reducing the bubble size and improving hardware utilization.

Tensor Parallelism

Pipeline parallelism splits the model between layers, but it does not address layers that are individually too large for a single GPU (such as linear layers with massive weight matrices). Tensor Parallelism (pioneered by Megatron-LM) solves this by splitting individual weight matrices across multiple GPUs. For a linear layer \\( \\mathbf{Y} = \\mathbf{X} \\mathbf{W} \\), we can split the weight matrix \\( \\mathbf{W} \\) column-wise: \\( \\mathbf{W} = [\\mathbf{W}_1, \\mathbf{W}_2] \\).

Each GPU computes a subset of the output channels: \\( \\mathbf{Y}_1 = \\mathbf{X} \\mathbf{W}_1 \\) and \\( \\mathbf{Y}_2 = \\mathbf{X} \\mathbf{W}_2 \\) in parallel. The outputs are then concatenated using an All-Gather communication step. For subsequent layers, we can split the weights row-wise and perform an All-Reduce step to sum the partial outputs. This tensor-level splitting allows models to scale beyond the memory limits of a single device.

Communication Topologies and Latency

Distributed training efficiency is limited by hardware communication bandwidth, NVLink topologies, and message routing.

NVLink vs. PCIe

In distributed training, the speed of weight synchronization is limited by the interconnect bandwidth between GPUs. Traditional systems transfer data over the PCIe bus, which has a bandwidth limit of 16 to 64 GB/s. This can create a significant bottleneck during the All-Reduce phase of large-scale distributed training.

To address this, NVIDIA developed NVLink, a high-speed GPU interconnect that enables direct GPU-to-GPU communication bypassing the CPU and PCIe bus. NVLink provides bandwidths up to 900 GB/s on modern architectures. Utilizing NVLink topologies reduces communication latency, allowing distributed scaling to maintain near-linear efficiency across hundreds of GPUs.

Optimization Parameters

To optimize distributed training, practitioners must balance communication parameters. One key parameter is the bucket size in PyTorch DDP. Instead of communicating gradients for each parameter individually (which creates high latency overhead), DDP groups gradients into buckets and performs the All-Reduce operation on entire buckets. Tuning the bucket size ensures that communication is overlapped with backward pass computations.

Additionally, selecting the right backend is essential. PyTorch supports Gloo (for CPU nodes) and NCCL (NVIDIA Collective Communications Library, optimized for GPUs). NCCL uses ring-based and tree-based topologies to optimize All-Reduce routing across NVLink connections, maximizing collective communication throughput.