Optimizers: Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a foundational optimizer for deep learning. By updating parameters using gradients computed on small subsets of data (mini-batches), SGD makes each update far cheaper than a pass over the full dataset, and the noise in its gradient estimates can act as a mild regularizer.

Mathematical Mechanics

SGD updates model parameters iteratively in the direction of steepest descent on the loss surface.

Gradient Descent vs. SGD

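Batch Gradient Descent computes gradients across the entire training dataset. While accurate, this is computationally expensive for large datasets. SGD approximates the true gradient using a single sample:

$$\mathbf{g}_t = \nabla_\theta \mathcal{L}(\mathbf{x}_t, y_t; \theta_t)$$

In practice the estimate is usually averaged over a mini-batch $\mathcal{B}_t$ of $B$ samples:

$$\mathbf{g}_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_\theta \mathcal{L}(\mathbf{x}_i, y_i; \theta_t)$$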

This approximation reduces the computational complexity per step from $\mathcal{O}(M)$ to $\mathcal{O}(B)$, where $M$ is the dataset size and $B$ is the mini-batch size, enabling faster training updates.
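The cost difference can be illustrated with a minimal sketch in plain Python. The one-parameter model `w * x`, the squared-error loss, and the toy data below are assumptions made for illustration only. Each mini-batch gradient touches $B$ samples instead of all $M$, and averaging the gradients of equally sized, disjoint mini-batches recovers the full-batch gradient.

```python
import random

# Hypothetical toy data: y = 3x plus small noise, M = 12 samples.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(12)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]


def grad(w, batch_x, batch_y):
    """Gradient of the mean squared error of the model w * x over a batch."""
    n = len(batch_x)
    return sum(2.0 * (w * x - y) * x for x, y in zip(batch_x, batch_y)) / n


w = 0.0
B = 4  # mini-batch size

# Full-batch gradient: one pass over all M samples.
full_grad = grad(w, xs, ys)

# Mini-batch gradients: each one only touches B samples.
batch_grads = [grad(w, xs[i:i + B], ys[i:i + B]) for i in range(0, len(xs), B)]
mean_of_batches = sum(batch_grads) / len(batch_grads)

print("full gradient:       ", full_grad)
print("mini-batch gradients:", batch_grads)
print("mean of mini-batches:", mean_of_batches)
```

Individual mini-batch gradients differ from the full gradient, which is the source of the noise discussed later, but they agree with it on average.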

The SGD Update Equation

Once the gradient estimate $\mathbf{g}_t$ is computed, parameters $\theta$ are updated by scaling the gradient by the learning rate $\eta$:

$$\theta_{t+1} = \theta_t - \eta \mathbf{g}_t$$

The learning rate determines the size of the step taken along the loss surface. If the learning rate is too large, the optimizer can overshoot the minimum and may oscillate or diverge; if it is too small, convergence is extremely slow.
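A small worked example makes the learning-rate trade-off concrete. The sketch below assumes the one-dimensional loss $f(\theta) = \theta^2$ (chosen only for illustration), whose gradient is $2\theta$, so each update multiplies $\theta$ by $(1 - 2\eta)$. The function name `run_sgd` is hypothetical.

```python
def run_sgd(lr, steps=50, theta=1.0):
    """Apply theta <- theta - lr * grad for the toy loss f(theta) = theta**2."""
    for _ in range(steps):
        g = 2.0 * theta          # exact gradient of theta**2
        theta = theta - lr * g   # the SGD update rule
    return theta


too_large = run_sgd(lr=1.1)    # per-step factor 1 - 2.2 = -1.2: magnitude grows
moderate = run_sgd(lr=0.1)     # per-step factor 0.8: shrinks quickly toward 0
too_small = run_sgd(lr=0.001)  # per-step factor 0.998: barely moves

print("lr=1.1   ->", too_large)
print("lr=0.1   ->", moderate)
print("lr=0.001 ->", too_small)
```

On this toy problem the update is stable only when $|1 - 2\eta| < 1$; real loss surfaces have no such simple threshold, but the qualitative behavior is the same.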

Optimization Challenges

Although each step is cheap, plain SGD struggles on loss landscapes with noisy gradients and pathological curvature.

Convergence Path and Noise

Because gradient estimates are computed on mini-batches, they contain stochastic noise. This causes the optimization path to zig-zag rather than following a straight path to the minimum.

However, this noise can be beneficial. The fluctuations can help the optimizer escape shallow local minima and saddle points, and they tend to steer it toward flatter minima, which are often associated with better generalization to unseen test data.
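The size of the gradient noise described above depends on the mini-batch size. The following sketch estimates the spread of mini-batch gradients for several batch sizes; the dataset, the model `w * x`, and the helper names `batch_grad` and `grad_std` are hypothetical and serve only as an illustration.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical dataset: 1000 samples of y = 2x plus noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]


def batch_grad(w, idx):
    """Mean squared-error gradient of the model w * x over the samples in idx."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)


def grad_std(w, batch_size, trials=500):
    """Standard deviation of the gradient estimate across random mini-batches."""
    grads = [
        batch_grad(w, rng.sample(range(len(xs)), batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(grads)


w = 0.0
std_by_batch = {b: grad_std(w, b) for b in (1, 8, 64)}
for b, s in std_by_batch.items():
    print(f"batch size {b:3d}: gradient std = {s:.4f}")
```

Larger batches give less noisy estimates at a higher cost per step, so the batch size trades update cost against gradient noise.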

Pathological Curvature and Saddles

SGD performs poorly in regions of pathological curvature, such as ravines, where the loss surface is steep in one direction and gentle in another. SGD tends to bounce back and forth between the steep walls instead of moving along the valley floor.
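The ravine behavior can be reproduced on a toy quadratic. The sketch below assumes the loss $f(x, y) = \tfrac{1}{2}(a x^2 + b y^2)$ with $a = 100$ and $b = 1$ (steep in $x$, gentle in $y$) and uses exact gradients, so the oscillation shown comes from curvature alone, not from mini-batch noise. The function name `ravine_path` is hypothetical.

```python
def ravine_path(lr, steps=20, x=1.0, y=1.0, a=100.0, b=1.0):
    """Gradient descent on f(x, y) = 0.5 * (a * x**2 + b * y**2)."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = a * x, b * y              # gradient of f
        x, y = x - lr * gx, y - lr * gy    # same learning rate in both directions
        path.append((x, y))
    return path


path = ravine_path(lr=0.019)
for step, (x, y) in enumerate(path[:6]):
    print(f"step {step}: x = {x:+.4f}, y = {y:+.4f}")
```

With this learning rate the steep coordinate `x` flips sign at every step (bouncing between the walls), while the gentle coordinate `y` shrinks by only about 2% per step. Lowering the learning rate removes the oscillation but slows progress along the valley floor even further.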

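Additionally, high-dimensional loss surfaces contain many saddle points, where the gradient is zero but the point is not a minimum. Near a saddle point the gradient is very small, so plain SGD without momentum can stall there for many iterations. This highlights the need for more advanced optimization variants.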
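A minimal sketch of stalling near a saddle, assuming the toy loss $f(x, y) = x^2 - y^2$ with a saddle at the origin and exact gradients. The function name `steps_to_escape` and the escape threshold $|y| > 1$ are illustrative choices, not part of any library.

```python
def steps_to_escape(y0, lr=0.1, max_steps=200, x=1.0):
    """Gradient descent on f(x, y) = x**2 - y**2 (saddle point at the origin).

    Returns the first step at which |y| exceeds 1, or None if that never
    happens within max_steps.
    """
    y = y0
    for step in range(1, max_steps + 1):
        # The gradient is (2x, -2y): x is pulled toward 0, y is pushed away from 0.
        x, y = x - lr * 2.0 * x, y + lr * 2.0 * y
        if abs(y) > 1.0:
            return step
    return None


print("y0 = 0    :", steps_to_escape(0.0))    # exactly on the ridge
print("y0 = 1e-6 :", steps_to_escape(1e-6))   # tiny perturbation
print("y0 = 1e-3 :", steps_to_escape(1e-3))   # larger perturbation
```

Starting exactly on the ridge (`y0 = 0`), the iterate converges to the saddle and never leaves. A small perturbation, such as the one mini-batch noise would provide, eventually escapes, but the smaller the perturbation, the more steps it takes because the gradient near the saddle is tiny.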

PyTorch Implementation

We can use PyTorch's native optimizer or write a custom SGD class using the Optimizer base interface.

Using torch.optim.SGD

Here is how to configure PyTorch's SGD optimizer and apply it for a single training step on a small model:

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim

# Simple linear model
model = nn.Linear(2, 1)

# Initialize SGD optimizer with a learning rate of 0.01
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Simulated training step
inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)

predictions = model(inputs)
loss = torch.mean((predictions - targets) ** 2)
loss.backward()

# Update parameters using SGD rule
optimizer.step()

# Reset gradients for next iteration
optimizer.zero_grad()</code></pre>

In this code, we initialize the SGD optimizer. The step() method updates the model's weights and biases using the calculated gradients, and zero_grad() resets parameter gradients to prevent accumulation.
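To check that `step()` applies the update rule $\theta_{t+1} = \theta_t - \eta \mathbf{g}_t$, the expected parameters can be computed by hand before calling it. This is a minimal sketch for plain SGD (no momentum or weight decay); the variable names `expected`, `matches`, and `grads_cleared` are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Linear(2, 1)
lr = 0.01
optimizer = optim.SGD(model.parameters(), lr=lr)

inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)

loss = torch.mean((model(inputs) - targets) ** 2)
loss.backward()

# Expected result of theta - lr * grad, computed by hand before step() runs.
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer.step()

# Compare the optimizer's result with the hand-computed update.
matches = all(
    torch.allclose(p.detach(), e) for p, e in zip(model.parameters(), expected)
)
print("step() matches the manual update:", matches)

# After zero_grad(), gradients are either None or all zeros,
# depending on the PyTorch version and its set_to_none default.
optimizer.zero_grad()
grads_cleared = all(
    p.grad is None or bool(torch.count_nonzero(p.grad) == 0)
    for p in model.parameters()
)
print("gradients cleared:", grads_cleared)
```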

Coding a Custom SGD Optimizer

We can implement a custom SGD optimizer in PyTorch by inheriting from the base Optimizer class:

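<pre><code class="language-python">import torch

class CustomSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.01):
        defaults = dict(lr=lr)
        super().__init__(params, defaults)

    @torch.no_grad()  # parameter updates must not be tracked by autograd
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # The closure re-evaluates the loss and needs gradients enabled
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr = group['lr']
            for p in group['params']:
                if p.grad is None:
                    continue
                # Update parameter in place: p = p - lr * grad
                p.add_(p.grad, alpha=-lr)
        return loss

# Reuses the `model` defined in the previous example
custom_opt = CustomSGD(model.parameters(), lr=0.01)
print("Custom optimizer initialized successfully.")</code></pre>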

This implementation iterates over the parameter groups managed by the optimizer and skips any parameter that has no gradient. The in-place operation add_ updates the weights directly, avoiding a copy. For plain SGD (no momentum, dampening, or weight decay), this is the same update that PyTorch's native optimizer performs.
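One way to check that claim is to train two identical copies of a model, one with a hand-written optimizer and one with `torch.optim.SGD`, and compare the resulting parameters. The sketch below is self-contained and uses a compact restatement of the class above under the hypothetical name `SketchSGD`; the data and the number of steps are arbitrary.

```python
import copy

import torch
import torch.nn as nn


class SketchSGD(torch.optim.Optimizer):
    """Plain SGD without momentum or weight decay (illustrative only)."""

    def __init__(self, params, lr=0.01):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group["lr"])  # p = p - lr * grad
        return loss


torch.manual_seed(0)

# Two models with identical initial weights.
model_a = nn.Linear(2, 1)
model_b = copy.deepcopy(model_a)

opt_a = SketchSGD(model_a.parameters(), lr=0.01)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.01)

inputs, targets = torch.randn(8, 2), torch.randn(8, 1)

# Run the same five training steps with each optimizer.
for _ in range(5):
    for model, opt in ((model_a, opt_a), (model_b, opt_b)):
        opt.zero_grad()
        loss = torch.mean((model(inputs) - targets) ** 2)
        loss.backward()
        opt.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print("custom and native SGD agree:", same)
```

The comparison only holds with the native optimizer's default settings; enabling `momentum` or `weight_decay` in `torch.optim.SGD` changes its update and the two would diverge.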