Optimizers: RMSprop

RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimizer proposed by Geoffrey Hinton in his Coursera lecture notes. It addresses AdaGrad's steadily shrinking learning rate by scaling each update with an exponentially decaying average of squared gradients instead of a cumulative sum.

Exponentially Decaying Average

RMSprop scales each parameter's learning rate by a moving average of recent squared gradients, which keeps updates from stalling as training progresses.

Resolving AdaGrad's Decay

AdaGrad accumulates all historical squared gradients, so the denominator that scales each update grows monotonically. The effective learning rate therefore shrinks toward zero, and training can stall before convergence. RMSprop resolves this by replacing the cumulative sum with an exponentially decaying average.

This decaying average allows the optimizer to forget gradients from long ago, focusing on recent gradient statistics. The effective history size is controlled by a decay hyperparameter, ensuring the learning rate scale remains responsive throughout training.
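The difference between the two accumulators can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a constant gradient of 1.0 for every step; the helper names `adagrad_scale` and `rmsprop_scale` are hypothetical and exist only for this example.

```python
import math

def adagrad_scale(grads, eps=1e-8):
    """Per-step multiplier 1 / (sqrt(sum of g^2) + eps), AdaGrad style."""
    total, scales = 0.0, []
    for g in grads:
        total += g * g  # cumulative sum never shrinks
        scales.append(1.0 / (math.sqrt(total) + eps))
    return scales

def rmsprop_scale(grads, beta=0.9, eps=1e-8):
    """Per-step multiplier 1 / (sqrt(v) + eps), with v a decaying average."""
    v, scales = 0.0, []
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g  # old gradients fade out
        scales.append(1.0 / (math.sqrt(v) + eps))
    return scales

grads = [1.0] * 1000  # assumed: the same gradient at every step
ada = adagrad_scale(grads)
rms = rmsprop_scale(grads)
print("AdaGrad multiplier after 1000 steps:", ada[-1])
print("RMSprop multiplier after 1000 steps:", rms[-1])
```

Under this assumption the AdaGrad multiplier keeps falling like $1/\sqrt{t}$, while the RMSprop multiplier settles near 1 because the decaying average converges to the squared gradient.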

The RMSprop Equations

The parameter updates in RMSprop are governed by the following mathematical equations:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1 - \beta) \mathbf{g}_t \odot \mathbf{g}_t$$

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\mathbf{v}_t} + \epsilon} \odot \mathbf{g}_t$$

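where $\mathbf{g}_t$ is the gradient at step $t$, $\mathbf{v}_t$ is the running estimate of the second moment of the gradient (initialized to $\mathbf{v}_0 = \mathbf{0}$), $\beta$ is the decay rate (typically 0.9 or 0.99), $\eta$ is the base learning rate, and $\epsilon$ is a small constant (e.g. $10^{-8}$) that prevents division by zero. The symbol $\odot$ denotes elementwise multiplication; the square root and the division are also applied elementwise.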
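The two equations translate almost line for line into code. The sketch below is a plain-Python illustration on lists of floats, not a library implementation; the function name `rmsprop_step` and the quadratic test objective are assumptions chosen for the example.

```python
import math

def rmsprop_step(theta, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """Apply one RMSprop update elementwise; returns (new_theta, new_v)."""
    # v_t = beta * v_{t-1} + (1 - beta) * g_t^2
    new_v = [beta * vi + (1.0 - beta) * gi * gi for vi, gi in zip(v, grad)]
    # theta_t = theta_{t-1} - lr / (sqrt(v_t) + eps) * g_t
    new_theta = [
        ti - lr / (math.sqrt(vi) + eps) * gi
        for ti, vi, gi in zip(theta, new_v, grad)
    ]
    return new_theta, new_v

# Assumed objective: f(theta) = theta_0^2 + theta_1^2, so grad = 2 * theta
theta, v = [3.0, -2.0], [0.0, 0.0]
first_theta, _ = rmsprop_step(theta, v, [2.0 * t for t in theta])

for _ in range(500):
    grad = [2.0 * t for t in theta]
    theta, v = rmsprop_step(theta, v, grad)

print("after one step: ", first_theta)
print("after 500 steps:", theta)
```

Because $\mathbf{v}_0 = \mathbf{0}$, the first update has magnitude close to $\eta / \sqrt{1 - \beta}$ in every coordinate regardless of the gradient's size, which is worth keeping in mind when choosing $\eta$.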

Dynamics and Hyperparameters

Tuning the decay rate and base learning rate determines the responsiveness of RMSprop updates.

Moving Average Window

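The decay rate $\beta$ determines the relative weight of past gradients compared to the current step. The gradient from $k$ steps ago enters $\mathbf{v}_t$ with weight $(1-\beta)\beta^k$, which corresponds to an effective moving average window of roughly $\frac{1}{1-\beta}$ iterations. For $\beta=0.9$, the window spans about the last 10 steps; for $\beta=0.99$, about the last 100 steps. Older gradients are not dropped outright; their influence decays geometrically.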
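To see how literally the window interpretation should be taken, the following sketch sums the exponential weights that fall inside the nominal window. The helper name `ema_weight_in_window` is hypothetical.

```python
def ema_weight_in_window(beta):
    """Return (window, fraction of total weight on the last `window` steps)."""
    window = round(1.0 / (1.0 - beta))
    # The gradient from k steps ago carries weight (1 - beta) * beta**k
    in_window = sum((1.0 - beta) * beta**k for k in range(window))
    return window, in_window

for beta in (0.9, 0.99):
    window, frac = ema_weight_in_window(beta)
    print(f"beta={beta}: window={window}, weight inside window={frac:.3f}")
```

In both cases a little under two thirds of the total weight sits inside the nominal window, so $\frac{1}{1-\beta}$ is best read as a characteristic time scale rather than a hard cutoff.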

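This windowing allows the optimizer to adjust to changes in the loss surface. If the model enters a steep ravine, gradients along the steep direction grow, the running average for those coordinates rises within a few steps, and the step size in that direction shrinks. This damps the bouncing across the ravine walls and stabilizes parameter trajectories.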
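A small numerical sketch shows the per-coordinate normalization at work. It assumes a hypothetical ravine-shaped objective $f(x, y) = \tfrac{1}{2}(100x^2 + y^2)$ and compares the first step of plain gradient descent with the first RMSprop step; the learning rate of 0.02 is chosen only to make the contrast visible.

```python
import math

def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (100 * x**2 + y**2), a narrow ravine."""
    return 100.0 * x, y

lr, beta, eps = 0.02, 0.9, 1e-8
x, y = 1.0, 1.0
gx, gy = grad(x, y)

# Plain gradient descent: the step is proportional to the raw gradient,
# so the x-step of 2.0 jumps from x = 1 past the minimum at x = 0
sgd_step = (lr * gx, lr * gy)

# RMSprop first step (v starts at zero): each coordinate is divided by
# the root of its own running average
vx = (1.0 - beta) * gx * gx
vy = (1.0 - beta) * gy * gy
rms_step = (lr * gx / (math.sqrt(vx) + eps), lr * gy / (math.sqrt(vy) + eps))

print("gradient descent step:", sgd_step)
print("RMSprop step:         ", rms_step)
```

The gradient descent steps differ by a factor of 100 between the two coordinates, while the RMSprop steps are nearly identical, which is the mechanism that keeps the steep direction from dominating.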

Hyperparameter Interactions

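The choice of $\beta$ interacts with the base learning rate $\eta$ and the mini-batch size. Because the update divides by $\sqrt{\mathbf{v}_t}$, any change in how $\mathbf{v}_t$ is estimated changes the effective step size, so $\beta$ and $\eta$ are best tuned together. Small batches produce noisy gradients, and a larger $\beta$ (closer to 1.0) averages that noise over more steps, giving a steadier scale estimate at the cost of slower adaptation.

Conversely, large batches yield more stable gradient estimates, so a smaller $\beta$ can track changes in the loss surface quickly without making the scale estimate erratic. These are rules of thumb rather than guarantees; a poorly matched combination can cause training to diverge in deep networks.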
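One way to explore how $\beta$ trades smoothness against responsiveness is to feed the accumulator synthetic noisy gradients. The sketch below is a toy simulation, assuming gradients of the form $1 + \text{noise}$ with Gaussian noise as a stand-in for mini-batch noise; the function name `second_moment_trace` is hypothetical.

```python
import random
import statistics

def second_moment_trace(beta, noise_std, steps=5000, seed=0):
    """Track v_t for gradients g_t = 1 + noise; return v after a warm-up."""
    rng = random.Random(seed)
    v, trace = 0.0, []
    for t in range(steps):
        g = 1.0 + rng.gauss(0.0, noise_std)
        v = beta * v + (1.0 - beta) * g * g
        if t >= 1000:  # skip the start-up transient from v_0 = 0
            trace.append(v)
    return trace

for beta in (0.9, 0.99):
    trace = second_moment_trace(beta, noise_std=1.0)
    print(f"beta={beta}: mean v={statistics.mean(trace):.3f}, "
          f"std of v={statistics.pstdev(trace):.3f}")
```

In this toy setting both settings estimate the same average second moment, but the larger $\beta$ fluctuates less around it. Real training noise is not stationary, so the result indicates a tendency rather than a tuning rule.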

PyTorch Implementation

We can use PyTorch's native RMSprop or implement the adaptive update rules in a custom optimizer.

Using torch.optim.RMSprop

Here is how to configure and run the RMSprop optimizer on a PyTorch model:

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 1)

# Initialize RMSprop with decay rate alpha (beta in the equations)
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-8)

inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)

# Mean squared error loss
loss = torch.mean((model(inputs) - targets) ** 2)

loss.backward()        # compute gradients
optimizer.step()       # apply the RMSprop update
optimizer.zero_grad()  # clear gradients for the next iteration</code></pre>

In this code, the parameter alpha corresponds to the decay coefficient $\beta$. The value 0.99, which is also PyTorch's default, tracks a longer gradient history than 0.9, which stabilizes training updates at the expense of adaptation speed.
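The link between alpha and the accumulator $\mathbf{v}_t$ can be checked by inspecting the optimizer's state after one step. This is a sketch that relies on the state key `square_avg`, which is the name PyTorch's RMSprop uses internally; treat it as an implementation detail rather than a stable public API.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(2, 1)
alpha = 0.99
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=alpha, eps=1e-8)

inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)
loss = torch.mean((model(inputs) - targets) ** 2)
loss.backward()

# Keep a copy of the gradient before the optimizer consumes it
grad = model.weight.grad.clone()
optimizer.step()

# The accumulator starts at zero, so after the first step it should equal
# (1 - alpha) * grad^2
square_avg = optimizer.state[model.weight]["square_avg"]
matches = torch.allclose(square_avg, (1 - alpha) * grad * grad)
print("accumulator matches (1 - alpha) * grad^2:", matches)
```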

Coding a Custom RMSprop Optimizer

We can write a custom RMSprop optimizer in PyTorch by managing the second moment tracking manually:

<pre><code class="language-python">import torch

class CustomRMSprop(torch.optim.Optimizer):
    def __init__(self, params, lr=0.01, alpha=0.99, eps=1e-8):
        defaults = dict(lr=lr, alpha=alpha, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        # Optionally re-evaluate the loss; the closure needs gradients enabled
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            alpha = group['alpha']
            eps = group['eps']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                # Initialize moving average
                if 'square_avg' not in state:
                    state['square_avg'] = torch.zeros_like(p)
                v = state['square_avg']
                # Accumulate decayed average: v = alpha * v + (1 - alpha) * grad^2
                v.mul_(alpha).addcmul_(grad, grad, value=1.0 - alpha)
                # Compute denominator scale: sqrt(v) + eps
                std = v.sqrt().add_(eps)
                # Update parameter: p = p - (lr / std) * grad
                p.addcdiv_(grad, std, value=-lr)
        return loss

# `model` is the nn.Linear module defined in the previous example
custom_opt = CustomRMSprop(model.parameters(), lr=0.01)</code></pre>

This implementation updates the accumulator with addcmul_ and the parameters with addcdiv_. By maintaining a square_avg buffer for each parameter tensor, the custom class follows the same update rule as torch.optim.RMSprop in its default configuration (no momentum, no weight decay, not centered).
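The equivalence with the native optimizer can be checked numerically. The sketch below writes the same update rule out by hand on a single tensor and compares it with torch.optim.RMSprop at default settings; the toy loss $\sum_i p_i^2$ and the variable names are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
lr, alpha, eps = 0.01, 0.99, 1e-8

# Two copies of the same parameter: one for the native optimizer, one manual
p_native = torch.randn(5, requires_grad=True)
p_manual = p_native.detach().clone()
v = torch.zeros_like(p_manual)

native = torch.optim.RMSprop([p_native], lr=lr, alpha=alpha, eps=eps)

for _ in range(20):
    # Native path: loss = sum(p^2), so autograd produces the gradient 2 * p
    native.zero_grad()
    (p_native ** 2).sum().backward()
    native.step()

    # Manual path: same gradient, same update rule written out by hand
    g = 2.0 * p_manual
    v.mul_(alpha).addcmul_(g, g, value=1.0 - alpha)
    p_manual.addcdiv_(g, v.sqrt().add_(eps), value=-lr)

same = torch.allclose(p_native.detach(), p_manual, atol=1e-5)
print("manual update matches torch.optim.RMSprop:", same)
```

The comparison only covers the default configuration. Enabling momentum, centered, or weight_decay in the native optimizer adds terms that the hand-written rule does not include.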