Optimizers: Adam (Adaptive Moment Estimation)

Adam (Adaptive Moment Estimation) is a widely used optimizer that combines the velocity accumulation of Momentum with the adaptive scaling of RMSprop, utilizing bias correction to stabilize early training steps.

Momentum and RMSprop Integration

Adam tracks both the first moment (mean) and the second moment (uncentered variance) of the gradients.

First and Second Moments

Adam maintains two running statistics per parameter: the first moment $\mathbf{m}_t$ (the exponentially decaying average of past gradients, as in SGD with Momentum) and the second moment $\mathbf{v}_t$ (the exponentially decaying average of past squared gradients, as in RMSprop):

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t$$

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t \odot \mathbf{g}_t$$

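where $\mathbf{g}_t$ is the gradient at step $t$, $\odot$ denotes element-wise multiplication, $\beta_1$ (typically 0.9) controls the decay of the first moment (momentum), and $\beta_2$ (typically 0.999) controls the decay of the second moment.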
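As a minimal sketch of these two recurrences, the following pure-Python loop updates scalar moments for a short gradient sequence. The helper name `update_moments` and the gradient values are illustrative assumptions, not part of any library.

```python
def update_moments(m, v, g, beta1=0.9, beta2=0.999):
    """One step of the first- and second-moment recurrences for a scalar gradient g."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    return m, v


# Hypothetical gradient sequence, chosen only for illustration.
grads = [1.0, 1.0, -0.5, 2.0]

m, v = 0.0, 0.0  # both moments start at zero
history = []
for t, g in enumerate(grads, start=1):
    m, v = update_moments(m, v, g)
    history.append((m, v))
    print(f"t={t}  g={g:+.2f}  m={m:.6f}  v={v:.6f}")
```

After the first step the moments are roughly 0.1 and 0.001 even though the gradient is 1.0, because both averages start from zero.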

Dual Scaling Advantage

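Tracking both moments provides the benefits of both strategies. The first moment acts as a smoothing filter, accelerating updates in consistently signed directions and damping gradient noise. The second moment rescales each coordinate's step by the inverse root-mean-square of its recent gradients, so coordinates with persistently large gradients take proportionally smaller steps and the update size is largely insensitive to the overall gradient scale.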
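One way to see the effect of the second-moment scaling is to feed the same gradient pattern at two very different magnitudes. The sketch below assumes a single scalar coordinate; `normalized_direction` is a hypothetical helper and the gradient values are made up.

```python
import math


def normalized_direction(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return m_t / (sqrt(v_t) + eps) after processing all scalar gradients."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    return m / (math.sqrt(v) + eps)


base = [0.3, 0.5, 0.4, 0.6]          # made-up gradients for one coordinate
scaled = [1000.0 * g for g in base]  # same pattern, 1000x larger

small = normalized_direction(base)
large = normalized_direction(scaled)
print(small, large)
```

The two printed values agree up to the small influence of `eps`, whereas a plain gradient step would differ by a factor of 1000. The raw ratio here is not bias-corrected, so only the comparison between the two runs is meaningful, not its absolute value.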

This combination makes Adam fairly robust across diverse architectures and hyperparameter settings, which is why it is a common default for training deep neural networks, including Transformers.

Bias Correction and Initialization

Adam uses bias correction to compensate for the zero initialization of its moment estimates, which would otherwise distort early training steps.

The Need for Bias Correction

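Because $\mathbf{m}_0$ and $\mathbf{v}_0$ are initialized as zero vectors, the running estimates are biased toward zero during early steps, and more strongly so when the decay rates $\beta_1$ and $\beta_2$ are close to 1. The two moments are shrunk by different amounts: with the typical values above and a first gradient $\mathbf{g}_1$, $\mathbf{m}_1 = 0.1\,\mathbf{g}_1$ while $\sqrt{\mathbf{v}_1} \approx 0.032\,|\mathbf{g}_1|$, so the uncorrected ratio $\mathbf{m}_1 / \sqrt{\mathbf{v}_1}$ is about 3.2 times larger in magnitude than intended. Without correction, early steps would therefore be mis-scaled instead of reflecting the configured learning rate.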

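To correct this, each moment is divided by $1 - \beta^t$, where $t$ is the step count and $\beta$ is the corresponding decay rate ($\beta_1$ for the first moment, $\beta_2$ for the second). This factor is the total weight the running average has assigned to the gradients seen so far; it approaches 1 as $t$ grows, so the correction matters mainly in early steps.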
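The correction factor $1 - \beta_1^t$ can be derived by unrolling the first-moment recurrence from $\mathbf{m}_0 = \mathbf{0}$. As a sketch, assume the gradients share a constant expected value $\mathbb{E}[\mathbf{g}]$, a simplifying assumption that holds only approximately in practice:

$$\mathbf{m}_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} \mathbf{g}_i$$

$$\mathbb{E}[\mathbf{m}_t] = (1 - \beta_1) \left( \sum_{i=1}^{t} \beta_1^{t-i} \right) \mathbb{E}[\mathbf{g}] = (1 - \beta_1^t)\, \mathbb{E}[\mathbf{g}]$$

The last equality uses the geometric sum $\sum_{i=1}^{t} \beta_1^{t-i} = (1 - \beta_1^t) / (1 - \beta_1)$. The same argument applied to $\mathbf{g}_t \odot \mathbf{g}_t$ with $\beta_2$ gives the factor $1 - \beta_2^t$ for the second moment.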

Bias-Corrected Update Equations

The bias-corrected moments $\hat{\mathbf{m}}_t$ and $\hat{\mathbf{v}}_t$ are defined as:

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}$$

$$\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

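The parameters are then updated using the corrected moments, where $\eta$ is the learning rate, $\epsilon$ is a small constant that prevents division by zero, and the square root and division are applied element-wise: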

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \odot \hat{\mathbf{m}}_t$$
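The update can be traced by hand for a single scalar parameter. The following is a minimal sketch, assuming the default coefficients; `adam_step`, the starting value, and the gradient are hypothetical choices for illustration.

```python
import math


def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one bias-corrected Adam update to a scalar parameter."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta0, g = 1.0, 25.0  # hypothetical starting value and gradient
theta1, m1, v1 = adam_step(theta0, g, m=0.0, v=0.0, t=1)

corrected_step = theta0 - theta1
# For comparison: the same step computed from the raw, uncorrected moments.
uncorrected_step = 0.001 * m1 / (math.sqrt(v1) + 1e-8)
print(corrected_step, uncorrected_step)
```

At $t = 1$ the corrected moments reduce to $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so the first step has magnitude close to the learning rate (about 0.001) whatever the size of the gradient. The raw-moment variant comes out near 0.00316, a factor of $\sqrt{10}$ larger.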

PyTorch Implementation

We can use PyTorch's native Adam optimizer or implement the bias-corrected step rules manually.

Using torch.optim.Adam

Here is how to configure and apply the Adam optimizer in PyTorch:

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 1)

# Initialize Adam with default coefficients
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)

loss = torch.mean((model(inputs) - targets) ** 2)
loss.backward()
optimizer.step()
optimizer.zero_grad()</code></pre>

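In this code, the betas tuple specifies $(\beta_1, \beta_2)$, and eps is the $\epsilon$ in the update rule. The values shown are the defaults of torch.optim.Adam; the learning rate $0.001$ is a common starting point, though it usually still benefits from tuning. Calling optimizer.zero_grad() after the step clears the accumulated gradients before the next backward pass.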
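In practice the optimizer step is called repeatedly inside a training loop. The sketch below assumes a synthetic linear-regression problem; the data, the seed, the step count, and the larger learning rate of 0.05 are illustrative choices, not recommendations.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Synthetic regression data: y = 3*x0 - 2*x1 + 1 (coefficients chosen for illustration).
inputs = torch.randn(64, 2)
targets = inputs @ torch.tensor([[3.0], [-2.0]]) + 1.0

model = nn.Linear(2, 1)
# Learning rate raised above the default so this toy problem converges in few steps.
optimizer = optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()                    # clear gradients from the previous iteration
    loss = loss_fn(model(inputs), targets)
    loss.backward()                          # populate .grad on each parameter
    optimizer.step()                         # apply the Adam update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Clearing gradients at the top of the loop is equivalent to clearing them after the step, as long as it happens once per iteration before the backward pass.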

Coding a Custom Adam Optimizer

We can implement the complete Adam update algorithm, including step-dependent bias correction, in PyTorch:

<pre><code class="language-python">import torch


class CustomAdam(torch.optim.Optimizer):
    def __init__(self, params, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, beta1=betas[0], beta2=betas[1], eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # Re-evaluate the loss when a closure is supplied
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            beta1 = group['beta1']
            beta2 = group['beta2']
            eps = group['eps']

            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]

                # Initialize state tracking
                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p)
                    state['exp_avg_sq'] = torch.zeros_like(p)

                state['step'] += 1
                t = state['step']
                m, v = state['exp_avg'], state['exp_avg_sq']

                # Update moments in-place
                m.mul_(beta1).add_(grad, alpha=1.0 - beta1)
                v.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)

                # Compute bias correction terms
                bias_correction1 = 1.0 - beta1 ** t
                bias_correction2 = 1.0 - beta2 ** t

                # denom = sqrt(v_hat) + eps, so eps is added after bias correction
                denom = (v.sqrt() / bias_correction2 ** 0.5).add_(eps)
                step_size = lr / bias_correction1

                # Update parameter: p = p - step_size * (m / denom)
                p.addcdiv_(m, denom, value=-step_size)

        return loss


# `model` is the nn.Linear instance defined in the previous example
custom_opt = CustomAdam(model.parameters(), lr=0.001)</code></pre>

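This custom class tracks the step count $t$ per parameter to compute the bias correction terms. The corrections are applied as scalar factors instead of materializing $\hat{\mathbf{m}}_t$ and $\hat{\mathbf{v}}_t$ as separate tensors. The class is meant to mirror the basic update of PyTorch's Adam; options such as weight decay and AMSGrad are not implemented.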
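One way to sanity-check a hand-written update is to run it next to torch.optim.Adam on identical gradients. The sketch below is self-contained and uses a hypothetical functional helper, `manual_adam_step`, which adds $\epsilon$ after bias-correcting the second moment, as in the update equation above. The random gradients and the choice of ten steps are arbitrary.

```python
import torch


def manual_adam_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical helper: one in-place, bias-corrected Adam update on tensor p."""
    m.mul_(beta1).add_(grad, alpha=1.0 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
    bias_correction1 = 1.0 - beta1 ** t
    bias_correction2 = 1.0 - beta2 ** t
    denom = (v.sqrt() / bias_correction2 ** 0.5).add_(eps)  # sqrt(v_hat) + eps
    p.addcdiv_(m, denom, value=-lr / bias_correction1)


torch.manual_seed(0)
start = torch.randn(5)

# Reference: torch.optim.Adam acting on a leaf tensor.
p_native = start.clone().requires_grad_(True)
native = torch.optim.Adam([p_native], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

# Manual: same starting point, optimizer state kept in plain tensors.
p_manual = start.clone()
m = torch.zeros_like(p_manual)
v = torch.zeros_like(p_manual)

for t in range(1, 11):
    grad = torch.randn(5)         # made-up gradient shared by both updates
    p_native.grad = grad.clone()  # hand the gradient to the native optimizer
    native.step()
    manual_adam_step(p_manual, grad, m, v, t)

max_diff = (p_native.detach() - p_manual).abs().max().item()
print(f"max abs difference after 10 steps: {max_diff:.2e}")
```

The difference should stay at the level of floating-point rounding, well below the per-step change of roughly 0.001. A larger gap would point to a discrepancy in the update rule, such as where $\epsilon$ is added.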