Optimizers: Learning Rate Schedulers (Step Decay, Cosine Annealing)

Learning rate schedulers adjust the learning rate during training according to a predefined schedule. By decaying the learning rate over time, a scheduler lets the optimizer take large steps early and small steps late, which helps the model settle into a low-loss minimum and often improves generalization.

Decay Schedules

Schedules reduce learning rates progressively to stabilize optimization late in training.

The Need for Decay

Early in training, a high learning rate helps the optimizer cross high-loss barriers and escape poor local minima. However, as the optimization approaches a minimum, a high learning rate makes parameter updates overshoot and oscillate around it, preventing the model from settling into the bottom of the well.

Decaying the learning rate scales down the step size, allowing the optimizer to perform fine adjustments near the minimum and achieve precise convergence.

Step Decay Schedule

Step Decay is a simple schedule that multiplies the learning rate by a factor $\gamma$ (e.g. 0.1) every $S$ epochs, starting from an initial learning rate $\eta_0$:

$$\eta_{epoch} = \eta_0 \times \gamma^{\lfloor \frac{epoch}{S} \rfloor}$$

This schedule creates sudden, step-like decreases in the learning rate. Validation error often drops quickly right after each decay step and then plateaus until the next one, which divides training into distinct stages.
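The step decay formula can be evaluated directly. The following is a minimal sketch; the function name step_decay_lr and the default values are assumptions made for illustration.

```python
def step_decay_lr(epoch, eta_0=0.1, gamma=0.1, step_size=5):
    """Learning rate at `epoch`: eta_0 * gamma ** floor(epoch / step_size)."""
    return eta_0 * gamma ** (epoch // step_size)


# Print the schedule around the decay boundaries
for epoch in (0, 4, 5, 9, 10):
    print(f"epoch {epoch:2d}: lr = {step_decay_lr(epoch):.4f}")
```

With these defaults the learning rate stays at 0.1 for epochs 0 to 4, falls to 0.01 for epochs 5 to 9, and to 0.001 from epoch 10.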

Cosine Annealing

Cosine Annealing uses a smooth cosine curve to adjust the learning rate, avoiding the abrupt changes in step size that Step Decay introduces.

Cosine Annealing Formulation

Cosine Annealing decreases the learning rate smoothly from a maximum value $\eta_{max}$ to a minimum value $\eta_{min}$ along half a period of a cosine curve:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$

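where $T_{cur}$ is the number of epochs elapsed in the current cycle and $T_{max}$ is the total number of epochs in the cycle. At $T_{cur} = 0$ the cosine term equals 1 and $\eta_t = \eta_{max}$; at $T_{cur} = T_{max}$ it equals $-1$ and $\eta_t = \eta_{min}$. The decay is slow at the start and end of the cycle and fastest in the middle, so the step size never changes abruptly.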
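The cosine formula can be checked numerically. The following is a minimal sketch; the function name cosine_annealing_lr and the default values for eta_max and eta_min are assumptions made for illustration.

```python
import math


def cosine_annealing_lr(t_cur, t_max, eta_max=0.1, eta_min=1e-4):
    """Learning rate after t_cur of t_max epochs, following the cosine formula."""
    cosine_term = 1 + math.cos(math.pi * t_cur / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * cosine_term


# Start, midpoint, and end of a 10-epoch cycle
for t_cur in (0, 5, 10):
    print(f"T_cur = {t_cur:2d}: lr = {cosine_annealing_lr(t_cur, t_max=10):.5f}")
```

The three printed values correspond to $\eta_{max}$, the average of $\eta_{max}$ and $\eta_{min}$, and $\eta_{min}$.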

Cosine Annealing with Warm Restarts

Cosine Annealing with Warm Restarts, introduced as SGDR (Stochastic Gradient Descent with Warm Restarts), resets the learning rate to $\eta_{max}$ at the end of each cycle and then anneals it again. The sudden increase in step size helps the model escape local minima and explore different basins of the loss surface.

As training progresses, the cycle length is typically increased, for example by multiplying it by a constant factor after each restart. Longer cycles give the optimizer more time to settle into flatter minima, which tend to generalize better and can improve final test performance.
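The restart logic can be sketched in a few lines of plain Python. The function name sgdr_lr and the parameters t_0 (length of the first cycle) and t_mult (factor by which each cycle grows) are assumptions made for illustration, named to resemble the arguments of PyTorch's CosineAnnealingWarmRestarts.

```python
import math


def sgdr_lr(epoch, eta_max=0.1, eta_min=1e-4, t_0=5, t_mult=2):
    """Learning rate at `epoch` for cosine annealing with warm restarts."""
    t_cur, t_i = epoch, t_0
    # Skip over completed cycles to find the position inside the current one
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult  # each new cycle is t_mult times longer
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# Cycles of length 5, 10, 20, ...: restarts fall on epochs 5, 15, 35, ...
for epoch in range(16):
    print(f"epoch {epoch:2d}: lr = {sgdr_lr(epoch):.5f}")
```

In this variant the rate never lands exactly on eta_min, because the restart replaces the last point of each cycle; that is a consequence of counting in whole epochs.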

PyTorch Implementation

PyTorch provides ready-made schedulers in the torch.optim.lr_scheduler module. Each scheduler wraps an optimizer and updates its learning rate as training proceeds.

Using PyTorch Schedulers

Here is how to configure and apply Step Decay and Cosine Annealing schedulers in PyTorch:

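<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

model = nn.Linear(2, 1)

# Give each scheduler its own optimizer so the two schedules cannot interfere
optimizer_step = optim.SGD(model.parameters(), lr=0.1)
optimizer_cosine = optim.SGD(model.parameters(), lr=0.1)

# StepLR: multiply lr by 0.1 every 5 epochs
scheduler_step = StepLR(optimizer_step, step_size=5, gamma=0.1)

# CosineAnnealingLR: anneal from 0.1 down to eta_min over T_max = 10 epochs
scheduler_cosine = CosineAnnealingLR(optimizer_cosine, T_max=10, eta_min=1e-4)

# Simulated epoch loop using the cosine schedule
# (swap in optimizer_step / scheduler_step to use step decay instead)
for epoch in range(10):
    # Training steps here (forward pass, loss.backward(), ...)
    optimizer_cosine.step()

    # Step the scheduler at the end of each epoch, after optimizer.step()
    scheduler_cosine.step()
    print(f"Epoch {epoch} LR for next epoch: {optimizer_cosine.param_groups[0]['lr']}")
</code></pre>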

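In this code, the scheduler's step() method updates the learning rate stored in the optimizer's parameter groups; it does not touch the model parameters themselves. It is typically called once at the end of each epoch, after optimizer.step(), so the new learning rate takes effect in the next epoch.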

Tracking Learning Rates

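We can track the learning rate by recording, at each epoch, the value returned by the scheduler's get_last_lr() method (or, equivalently, the lr entry of optimizer.param_groups). Comparing the resulting curves for different schedulers helps verify that each schedule behaves as intended and matches our training plan.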

Plotting these learning rate trajectories makes it easy to check that decay steps and warm restarts occur at the intended epochs, which helps catch schedules that decay too early or restart at the wrong time.
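One way to collect these trajectories is sketched below. It assumes PyTorch is installed; the helper record_lrs and its arguments are hypothetical names introduced for this example, while StepLR, CosineAnnealingWarmRestarts, and get_last_lr() are part of torch.optim.lr_scheduler.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingWarmRestarts


def record_lrs(make_scheduler, num_epochs, base_lr=0.1):
    """Run a dummy training loop and return the LR in effect at each epoch."""
    model = nn.Linear(2, 1)
    optimizer = optim.SGD(model.parameters(), lr=base_lr)
    scheduler = make_scheduler(optimizer)
    history = []
    for _ in range(num_epochs):
        history.append(scheduler.get_last_lr()[0])  # LR used during this epoch
        optimizer.step()   # stands in for a real epoch of training
        scheduler.step()   # advance the schedule for the next epoch
    return history


step_history = record_lrs(
    lambda opt: StepLR(opt, step_size=5, gamma=0.1), num_epochs=15
)
restart_history = record_lrs(
    lambda opt: CosineAnnealingWarmRestarts(opt, T_0=5, T_mult=2, eta_min=1e-4),
    num_epochs=15,
)

for epoch, (lr_step, lr_restart) in enumerate(zip(step_history, restart_history)):
    print(f"epoch {epoch:2d}: step = {lr_step:.5f}, warm restarts = {lr_restart:.5f}")
```

The two lists can be passed directly to a plotting library to compare the curves. In the warm-restart history, the learning rate returns to its base value at epoch 5, where the second, longer cycle begins.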