Learning Rate Warm-up Strategies

Learning rate warm-up is a training strategy where the learning rate starts at a small value and increases gradually to its target value. This reduces the risk of divergence caused by large early updates and stabilizes the initial stage of training.


Warm-up Strategies

Warm-up strategies scale up the learning rate during the first few steps or epochs of training to stabilize early parameter updates.

The Motivation for Warm-up

Early in training, the weights are randomly initialized and the model has not yet learned any useful representations. The initial gradients are often large and noisy, so a full-size learning rate can produce very large updates that push the weights into regions where training recovers slowly or diverges.

This instability is particularly severe in deep models like Transformers. Learning rate warm-up acts as a stabilization buffer, keeping updates small during the first few steps while the model adapts to the training data.

Linear Warm-up Schedule

The most common warm-up strategy is linear warm-up. The learning rate increases linearly from a small initial value $\eta_{init}$ to the target value $\eta_{target}$ over the first $W$ steps, so that for $0 \le t \le W$:

$$\eta_t = \eta_{init} + t \times \frac{\eta_{target} - \eta_{init}}{W}$$

Once step $W$ is reached, the scheduler hands over to a standard decay schedule, such as cosine annealing or step decay. Because warm-up ends exactly at $\eta_{target}$, the handover is continuous as long as the decay schedule starts from the same value.
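The linear warm-up formula can be sketched as a small helper. This is a minimal illustration rather than a library API: the function name `linear_warmup_lr` and the choice to hold the rate at $\eta_{target}$ once $t \ge W$ (leaving decay to a separate schedule) are assumptions made for the example.

```python
def linear_warmup_lr(t, warmup_steps, eta_init, eta_target):
    """Learning rate at step t under linear warm-up (hypothetical helper)."""
    if warmup_steps <= 0 or t >= warmup_steps:
        # Warm-up is finished (or disabled): return the target value.
        return eta_target
    # eta_t = eta_init + t * (eta_target - eta_init) / W
    return eta_init + t * (eta_target - eta_init) / warmup_steps


# Example: warm up from 1e-4 to 1e-3 over W = 5 steps.
schedule = [linear_warmup_lr(t, 5, 1e-4, 1e-3) for t in range(7)]
```

With these values the rate rises by $1.8 \times 10^{-4}$ per step and stays at $10^{-3}$ from step 5 onward.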

Advanced Warm-up Dynamics

Warm-up stabilizes running optimizer states and can be combined with cosine decay profiles.

Stabilizing Adaptive Optimizers

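Adaptive optimizers like Adam rely on running averages of the first and second moments of the gradient to adjust step sizes. These averages are initialized at zero, and although Adam's bias correction compensates for that, the second-moment estimate is built from only a handful of gradients early in training, so it is noisy and can produce erratic step scaling.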

By starting with a small learning rate, warm-up prevents these early fluctuations from destabilizing parameter updates. This allows the moments to stabilize before the learning rate reaches its peak value.
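To make this concrete, the sketch below applies the standard Adam update rule to a single scalar parameter. On the first step the bias-corrected ratio reduces to $g / |g|$, so the update has magnitude of roughly the learning rate regardless of the gradient's scale. The learning rate is therefore the main quantity bounding the earliest steps. The function name `adam_updates` and the gradient values are made up for illustration.

```python
import math


def adam_updates(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Updates Adam would apply to one scalar parameter (illustrative sketch)."""
    m, v = 0.0, 0.0  # running first and second moments start at zero
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# The same noisy gradients at the peak rate and at a warmed-up (10x smaller) rate.
grads = [5.0, -3.0, 4.0]
at_peak = adam_updates(grads, lr=1e-3)
warmed = adam_updates(grads, lr=1e-4)
```

Every update in `warmed` is ten times smaller than its counterpart in `at_peak`, which is the damping that warm-up provides while the moment estimates settle.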

Integration with Decay Profiles

Modern pipelines combine warm-up with cosine decay schedules. The learning rate increases linearly during the warm-up phase, and then decays according to a cosine curve during the annealing phase.

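This combined schedule balances early stability with late convergence, which has made it a widely used choice for training large models, particularly GPT-style Transformers. The decay shape varies between recipes: BERT, for example, pairs linear warm-up with linear decay, and ResNet recipes often use step decay.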
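The combined schedule can be written as a single function of the step index. The sketch below is one plausible formulation, not a reference implementation: the name `warmup_cosine_lr` and its parameters (`init_lr`, `peak_lr`, `min_lr`, `total_steps`) are assumptions chosen for the example.

```python
import math


def warmup_cosine_lr(step, warmup_steps, total_steps, peak_lr,
                     init_lr=0.0, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr (illustrative)."""
    if step < warmup_steps:
        # Warm-up phase: straight line from init_lr to peak_lr.
        return init_lr + step * (peak_lr - init_lr) / warmup_steps
    # Annealing phase: progress runs from 0 to 1 over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# 5 warm-up steps followed by 15 annealing steps.
lrs = [
    warmup_cosine_lr(s, warmup_steps=5, total_steps=20,
                     peak_lr=1e-3, init_lr=1e-4, min_lr=1e-5)
    for s in range(21)
]
```

The rate peaks at step 5, where the two phases meet at `peak_lr`, and reaches `min_lr` at step 20.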

PyTorch Implementation

We can implement learning rate warm-up in PyTorch using the LinearLR scheduler or custom scaling wrappers.

Using LinearLR for Warm-up

Here is how to configure a linear warm-up phase followed by a decay phase in PyTorch:

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, SequentialLR, CosineAnnealingLR

model = nn.Linear(5, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Warm-up phase: increase lr from 0.0001 to 0.001 over 5 epochs
warmup_scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=5)

# Decay phase: cosine annealing over 15 epochs
decay_scheduler = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-5)

# Chain schedulers sequentially
scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, decay_scheduler],
    milestones=[5]
)

# Simulated epoch loop
for epoch in range(20):
    optimizer.step()
    scheduler.step()
    print(f"Epoch {epoch} LR: {optimizer.param_groups[0]['lr']:.6f}")
</code></pre>

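In this code, we use SequentialLR to chain the warm-up and decay phases. The milestones parameter specifies the transition epoch, switching from the warm-up scheduler to the decay scheduler at epoch 5. Because the learning rate is printed after scheduler.step(), each printed value is the rate that will be used in the following epoch.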

Custom Warm-up Wrapper

We can implement a custom warm-up step calculation by scaling the optimizer's learning rate manually during the first $W$ iterations. This approach is useful when integrating custom schedules that are not supported by standard PyTorch modules.

Because the learning rate is set directly on the optimizer's parameter groups, this approach works with any optimizer and can be combined with a separately managed decay schedule.
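A manual wrapper along these lines might look like the sketch below. The class `ManualWarmup` and its arguments are hypothetical, not part of PyTorch. It relies only on the `param_groups` attribute (a list of dicts with an `"lr"` key) that PyTorch optimizers expose, and the demo uses a plain stand-in object so that it runs without PyTorch installed.

```python
from types import SimpleNamespace


class ManualWarmup:
    """Hypothetical helper: scales each param group's lr linearly during warm-up."""

    def __init__(self, optimizer, warmup_steps, start_factor=0.1):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.start_factor = start_factor
        self.step_count = 0
        # Remember each group's target lr before scaling it down.
        self.target_lrs = [group["lr"] for group in optimizer.param_groups]
        self._apply()

    def _factor(self):
        if self.warmup_steps <= 0 or self.step_count >= self.warmup_steps:
            return 1.0
        frac = self.step_count / self.warmup_steps
        return self.start_factor + frac * (1.0 - self.start_factor)

    def _apply(self):
        factor = self._factor()
        for group, target in zip(self.optimizer.param_groups, self.target_lrs):
            group["lr"] = target * factor

    def step(self):
        """Call once per iteration, after optimizer.step()."""
        if self.step_count >= self.warmup_steps:
            return  # warm-up finished: leave lr to any later schedule
        self.step_count += 1
        self._apply()


# Stand-in exposing `param_groups` so the sketch runs without PyTorch;
# in practice, pass a real optimizer such as torch.optim.Adam(...).
fake_optimizer = SimpleNamespace(param_groups=[{"lr": 1e-3}])
warmup = ManualWarmup(fake_optimizer, warmup_steps=4, start_factor=0.1)

history = [fake_optimizer.param_groups[0]["lr"]]
for _ in range(6):
    # optimizer.step() would go here in a real training loop
    warmup.step()
    history.append(fake_optimizer.param_groups[0]["lr"])
```

Once the warm-up steps are exhausted, the wrapper stops touching the learning rate, so a decay schedule applied afterwards is not overwritten.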