Optimizers: AdaGrad

AdaGrad is an adaptive learning rate optimizer that scales updates for each parameter individually based on its historical gradients. This makes it highly effective for training on sparse datasets where features occur at varying frequencies.

Adaptive Learning Rates

AdaGrad adjusts learning rates parameter-by-parameter, making larger updates for infrequent features.

Motivation for Adaptivity

Standard SGD updates all parameters using a single shared learning rate. This is problematic if features have different frequencies. Frequently occurring features generate frequent gradient updates, which can destabilize training if the learning rate is too large.

Conversely, rare features generate sparse updates, which can be neglected if the learning rate is too small. AdaGrad solves this by scaling updates inversely with the cumulative history of gradients for each parameter.

The AdaGrad Update Equations

AdaGrad updates are governed by a running sum of squared gradients $\mathbf{s}_t$:

$$\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t \odot \mathbf{g}_t$$

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t} + \epsilon} \odot \mathbf{g}_t$$

where $\odot$ represents element-wise multiplication, $\eta$ is the base learning rate, and $\epsilon$ is a small constant (e.g. $10^{-8}$) to prevent division by zero.

Theoretical Trade-offs

AdaGrad simplifies parameter tuning but suffers from a terminal decay in learning rates.

Decoupled Feature Optimization

By scaling updates by $\sqrt{\mathbf{s}_t}$, parameters associated with sparse features (small $\mathbf{s}_t$) receive larger updates, while parameters with frequent features (large $\mathbf{s}_t$) receive smaller updates.

This decoupled scaling eliminates the need to tune the learning rate for individual weights, making AdaGrad highly suited for applications with sparse inputs, such as natural language processing and recommendation systems.

The Diminishing Learning Rate Bottleneck

The primary limitation of AdaGrad is that $\mathbf{s}_t$ accumulates squared gradients monotonically, meaning its values grow continuously. As a result, the effective learning rate scale $\frac{\eta}{\sqrt{\mathbf{s}_t} + \epsilon}$ decays continuously.

Late in training, the learning rate scale becomes so small that parameters stop updating completely (frozen gradients). The model stops learning before reaching the optimum, which requires the development of decay-limited optimizers like RMSprop.

PyTorch Implementation

We can use PyTorch's native Adagrad implementation or build a custom class with squared gradient tracking.

Using torch.optim.Adagrad

Here is how to configure and apply the AdaGrad optimizer in PyTorch:

<pre><code class="language-python">import torch import torch.nn as nn import torch.optim as optim model = nn.Linear(2, 1) # Initialize AdaGrad with initial accumulator value optimizer = optim.Adagrad(model.parameters(), lr=0.01, initial_accumulator_value=0.1) # Training step inputs = torch.randn(4, 2) targets = torch.randn(4, 1) loss = torch.mean((model(inputs) - targets) ** 2) loss.backward() optimizer.step() optimizer.zero_grad()</pre>

In this code, we initialize the AdaGrad optimizer. The initial_accumulator_value prevents large early updates by pre-populating the running sum of squared gradients, stabilizing the initial steps.

Coding a Custom AdaGrad Optimizer

We can implement the AdaGrad accumulation rule in a custom PyTorch optimizer class:

<pre><code class="language-python">class CustomAdaGrad(torch.optim.Optimizer): def __init__(self, params, lr=0.01, eps=1e-8): defaults = dict(lr=lr, eps=eps) super().__init__(params, defaults) def step(self, closure=None): loss = None for group in self.param_groups: lr = group['lr'] eps = group['eps'] for p in group['params']: if p.grad is None: continue state = self.state[p] # Initialize state accumulator if 'sum' not in state: state['sum'] = torch.zeros_like(p.data) s = state['sum'] # Accumulate squared gradients: s = s + grad^2 s.addcmul_(p.grad.data, p.grad.data) # Compute scale factor std = s.sqrt().add_(eps) # Update parameter: p = p - (lr / std) * grad p.data.addcdiv_(p.grad.data, std, value=-lr) return loss custom_opt = CustomAdaGrad(model.parameters(), lr=0.01)</pre>

This implementation utilizes the PyTorch tensor operations addcmul_ (accumulate squared product) and addcdiv_ (add scaled quotient) to run the element-wise operations efficiently in-place, preventing memory overhead.