Optimizers: Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is the foundational optimizer for deep learning. By updating parameters using gradients computed on small subsets of the data (mini-batches), SGD makes each update far cheaper than a full pass over the dataset and introduces gradient noise that can act as an implicit regularizer.
Mathematical Mechanics
SGD updates model parameters iteratively in the direction of steepest descent on the loss surface.
Gradient Descent vs. SGD
Batch Gradient Descent computes gradients across the entire training dataset. While accurate, this is computationally expensive for large datasets. SGD approximates the true gradient using a single sample (or a mini-batch of samples):
$$\mathbf{g}_t = \nabla_\theta \mathcal{L}(\mathbf{x}_t, y_t; \theta_t)$$
This approximation reduces the computational complexity per step from $\mathcal{O}(M)$ to $\mathcal{O}(B)$, where $M$ is the dataset size and $B$ is the mini-batch size, enabling faster training updates.
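The relationship between the full-batch gradient and its mini-batch estimates can be checked numerically. The following is a minimal sketch, assuming a synthetic linear regression problem with a mean squared error loss; the sizes `M = 12` and `B = 4` and the helper `mse_grad` are illustrative choices, not part of any library.

```python
import torch

torch.manual_seed(0)

# Synthetic regression data: M samples with 3 features (arbitrary values).
M, B = 12, 4
X = torch.randn(M, 3)
y = torch.randn(M, 1)
theta = torch.randn(3, 1, requires_grad=True)


def mse_grad(X_part, y_part, theta):
    """Gradient of the mean squared error on the given samples w.r.t. theta."""
    loss = torch.mean((X_part @ theta - y_part) ** 2)
    (grad,) = torch.autograd.grad(loss, theta)
    return grad


# Full-batch gradient: one evaluation that touches all M samples.
full_grad = mse_grad(X, y, theta)

# Mini-batch gradients: each evaluation touches only B samples.
batch_grads = [mse_grad(X[i:i + B], y[i:i + B], theta) for i in range(0, M, B)]
mean_batch_grad = torch.stack(batch_grads).mean(dim=0)

# Each mini-batch gradient is a noisy estimate, but averaging them over an
# equal-sized partition of the data recovers the full-batch gradient.
print(torch.allclose(full_grad, mean_batch_grad, atol=1e-5))
```

Each individual entry of `batch_grads` generally differs from `full_grad`; it is only their average over the whole partition that agrees with it.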
The SGD Update Equation
Once the gradient estimate $\mathbf{g}_t$ is computed, parameters $\theta$ are updated by scaling the gradient by the learning rate $\eta$:
$$\theta_{t+1} = \theta_t - \eta \mathbf{g}_t$$
The learning rate determines the size of the step taken along the loss surface. If the learning rate is too large, the optimizer can overshoot the minimum and may even diverge; if it is too small, convergence becomes extremely slow.
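The effect of the learning rate is easiest to see on a one-dimensional toy problem. The sketch below assumes the loss $\mathcal{L}(\theta) = \theta^2$ with its exact gradient $2\theta$ (so there is no mini-batch noise here); the function name `run_gd` and the three learning rates are chosen purely for illustration.

```python
def run_gd(lr, theta0=1.0, steps=50):
    """Apply theta <- theta - lr * grad on L(theta) = theta**2 (grad = 2 * theta)."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta
        theta = theta - lr * grad
    return theta


# Too small, reasonable, and too large learning rates, in that order.
for lr in (0.001, 0.4, 1.1):
    print(f"lr={lr}: |theta| after 50 steps = {abs(run_gd(lr)):.3e}")
```

On this loss each step multiplies $\theta$ by $(1 - 2\eta)$, so $\eta = 0.001$ shrinks it only slightly per step, $\eta = 0.4$ shrinks it quickly, and $\eta = 1.1$ makes its magnitude grow while the sign flips at every step.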
Optimization Challenges
While efficient, SGD struggles with noisy gradient estimates and with loss landscapes that exhibit pathological curvature.
Convergence Path and Noise
Because gradient estimates are computed on mini-batches, they contain stochastic noise. As a result, the optimization trajectory zig-zags rather than heading straight toward the minimum.
However, this noise can be beneficial. The fluctuations can help the optimizer escape shallow local minima and saddle points, allowing the model to find flatter minima that often generalize better to unseen test data.
Pathological Curvature and Saddles
SGD performs poorly in regions of pathological curvature, such as ravines, where the loss surface is steep in one direction and gentle in another. SGD tends to bounce back and forth between the steep walls instead of moving along the valley floor.
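The bouncing behavior in a ravine can be reproduced with a small experiment. This is a sketch under stated assumptions: the loss is a hypothetical quadratic $f(x, y) = \tfrac{1}{2}(100x^2 + y^2)$, steep along $x$ and gentle along $y$, exact gradients are used, and the learning rate `0.019` is picked to sit just below the stability limit of the steep direction.

```python
def ravine_descent(lr=0.019, steps=20):
    """Gradient descent on f(x, y) = 0.5 * (100 * x**2 + y**2), starting at (1, 1)."""
    x, y = 1.0, 1.0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = 100.0 * x, y          # partial derivatives of f
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


path = ravine_descent()

# Count how often x changes sign between consecutive iterates,
# i.e. how often the iterate jumps across the valley floor.
sign_flips = sum(1 for (x0, _), (x1, _) in zip(path, path[1:]) if x0 * x1 < 0)

print("sign flips along the steep axis:", sign_flips)
print("final point:", path[-1])
```

Along the steep axis the iterate crosses the valley floor on every step, while along the gentle axis it creeps toward the minimum. Lowering the learning rate would stop the bouncing but slow the gentle direction even further.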
Additionally, high-dimensional loss surfaces contain many saddle points, where the gradient is zero. Without momentum, SGD can stall in the nearly flat regions around them, which highlights the need for more advanced optimization variants.
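A two-dimensional toy saddle illustrates why progress stalls. The sketch below assumes the function $f(x, y) = x^2 - y^2$, which has a saddle at the origin, and uses exact gradients without momentum; the helper `steps_to_escape`, the escape threshold `|y| > 1`, and the perturbation `1e-6` are illustrative assumptions.

```python
def steps_to_escape(y0, lr=0.01, max_steps=5000):
    """Run gradient descent on f(x, y) = x**2 - y**2 from (1.0, y0).

    Returns the number of steps taken until |y| exceeds 1 (the iterate has
    left the saddle region), or None if that never happens within max_steps.
    """
    x, y = 1.0, y0
    for step in range(1, max_steps + 1):
        gx, gy = 2.0 * x, -2.0 * y     # partial derivatives of f
        x, y = x - lr * gx, y - lr * gy
        if abs(y) > 1.0:
            return step
    return None


# Starting exactly on the ridge (y = 0), the gradient never points away from it.
print("on the ridge:", steps_to_escape(0.0))

# A tiny perturbation in y is eventually amplified, but only after many steps.
print("perturbed start:", steps_to_escape(1e-6))
```

Starting exactly on the ridge, the iterate slides into the saddle and never leaves. With a small perturbation, standing in here for gradient noise, it does escape, but only after hundreds of steps because the gradient near the saddle is tiny.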
PyTorch Implementation
We can use PyTorch's native optimizer or write a custom SGD class using the Optimizer base interface.
Using torch.optim.SGD
Here is how to configure PyTorch's SGD optimizer and apply it for a single training step on a small model:
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim

# Simple linear model
model = nn.Linear(2, 1)

# Initialize SGD optimizer with a learning rate of 0.01
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Simulated training step
inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)
predictions = model(inputs)
loss = torch.mean((predictions - targets) ** 2)
loss.backward()

# Update parameters using SGD rule
optimizer.step()

# Reset gradients for next iteration
optimizer.zero_grad()</code></pre>
In this code, we initialize the SGD optimizer. The step() method updates the model's weights and biases using the calculated gradients, and zero_grad() resets parameter gradients to prevent accumulation.
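As a sanity check, one can confirm that a single call to step() applies the update rule $\theta_{t+1} = \theta_t - \eta \mathbf{g}_t$ when the optimizer is created with only a learning rate. The sketch below assumes the default settings of `torch.optim.SGD` (no momentum, no weight decay); the variable names `expected`, `matches`, and `grads_cleared` are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

lr = 0.01
model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=lr)

inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)
loss = torch.mean((model(inputs) - targets) ** 2)
loss.backward()

# Apply the update rule by hand before the optimizer modifies the parameters.
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer.step()

# The parameters after step() should equal the hand-computed values.
matches = all(
    torch.allclose(p.detach(), e) for p, e in zip(model.parameters(), expected)
)
print("step() matches theta - lr * grad:", matches)

# After zero_grad(), gradients are either set to None or filled with zeros,
# depending on the PyTorch version and the set_to_none argument.
optimizer.zero_grad()
grads_cleared = all(
    p.grad is None or not p.grad.any() for p in model.parameters()
)
print("gradients cleared:", grads_cleared)
```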
Coding a Custom SGD Optimizer
We can implement a custom SGD optimizer in PyTorch by inheriting from the base Optimizer class:
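A minimal sketch of such a class follows. The class name `CustomSGD` and the default `lr=0.01` are illustrative assumptions, and options offered by `torch.optim.SGD` such as momentum and weight decay are deliberately left out. The second half of the listing compares one update step against the native optimizer on two identically initialized models.

```python
import torch
import torch.nn as nn
from torch.optim import Optimizer


class CustomSGD(Optimizer):
    """Plain SGD without momentum or weight decay (illustrative sketch)."""

    def __init__(self, params, lr=0.01):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        # Per-group defaults; each parameter group may override "lr".
        defaults = dict(lr=lr)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # The closure re-evaluates the loss, so it needs gradients enabled.
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # theta <- theta - lr * grad, performed in place.
                p.add_(p.grad, alpha=-group["lr"])
        return loss


# Compare one step of CustomSGD with torch.optim.SGD on identical models.
torch.manual_seed(0)
model_a = nn.Linear(2, 1)
model_b = nn.Linear(2, 1)
model_b.load_state_dict(model_a.state_dict())

opt_a = CustomSGD(model_a.parameters(), lr=0.01)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.01)

inputs, targets = torch.randn(4, 2), torch.randn(4, 1)
for model, opt in ((model_a, opt_a), (model_b, opt_b)):
    opt.zero_grad()
    loss = torch.mean((model(inputs) - targets) ** 2)
    loss.backward()
    opt.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print("CustomSGD matches torch.optim.SGD:", same)
```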
This implementation traverses the parameter groups managed by the optimizer. The in-place operation add_ is used to update the weights directly, avoiding copying overhead and matching the behavior of PyTorch's native optimizer.