Local Minima, Global Minima, and Saddle Points
Training a deep neural network is like hiking through a dense, foggy mountain range to find the absolute lowest valley. As you navigate the rugged terrain, you will encounter various geographical landmarks: deep basins, shallow depressions, and flat mountain passes. In optimization terms, these correspond to global minima, local minima, and saddle points. Understanding these features is critical for developing algorithms that can successfully navigate to the best set of model parameters.
Valleys in the Landscape: Minima
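A minimum is a point where the gradient of the loss function is zero ($\nabla L = 0$) and the loss does not decrease in any nearby direction. In the typical, non-degenerate case the surface curves upward in all directions, which means the Hessian is positive definite. Because a gradient descent update is proportional to the gradient, the optimizer stops making adjustments once it reaches such a point.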
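The two conditions above can be checked numerically on a toy problem. The following is a minimal sketch, not a general-purpose test: the function `loss`, its minimum at (1.0, -0.5), and the step sizes `h` are all assumptions chosen for illustration, and sampling eight directions only suggests upward curvature rather than proving it.

```python
import math

def loss(x, y):
    # Hypothetical bowl-shaped loss with its minimum at (1.0, -0.5).
    return (x - 1.0) ** 2 + 2.0 * (y + 0.5) ** 2

def gradient(f, x, y, h=1e-5):
    # Central finite differences approximate the two partial derivatives.
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return gx, gy

def directional_curvature(f, x, y, angle, h=1e-3):
    # Second derivative of f along the unit direction (cos(angle), sin(angle)).
    dx, dy = math.cos(angle), math.sin(angle)
    forward = f(x + h * dx, y + h * dy)
    backward = f(x - h * dx, y - h * dy)
    return (forward - 2 * f(x, y) + backward) / h ** 2

# Condition 1: the gradient vanishes at the candidate point.
gx, gy = gradient(loss, 1.0, -0.5)

# Condition 2: the curvature is positive along every sampled direction.
curvatures = [directional_curvature(loss, 1.0, -0.5, k * math.pi / 8) for k in range(8)]

is_minimum = abs(gx) < 1e-6 and abs(gy) < 1e-6 and all(c > 0 for c in curvatures)
print(is_minimum)
```

For this quadratic the curvature along any direction lies between 2 and 4, so every sampled value is positive.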
Global vs. Local Minima
The global minimum is the absolute lowest point on the entire loss landscape, representing the parameter configuration with the lowest possible error. A local minimum is a point that is lower than its immediate neighbors, but not necessarily the lowest point in the entire space.
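A one-dimensional example makes the distinction concrete. This sketch assumes a hypothetical loss $f(x) = x^4 - 3x^2 + x$, which has a global minimum near $x \approx -1.30$ and a shallower local minimum near $x \approx 1.13$; the learning rate and step count are illustrative choices. Plain gradient descent ends up in whichever valley its starting point drains into.

```python
def loss(x):
    # Hypothetical 1-D loss with two valleys of different depth.
    return x ** 4 - 3 * x ** 2 + x

def loss_grad(x):
    # Analytic derivative of the loss above.
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=1000):
    # Repeatedly step against the gradient from the starting point x0.
    x = x0
    for _ in range(steps):
        x -= lr * loss_grad(x)
    return x

x_left = gradient_descent(-2.0)   # settles in the deeper (global) valley
x_right = gradient_descent(2.0)   # settles in the shallower (local) valley

print(x_left, loss(x_left))
print(x_right, loss(x_right))
```

Both runs stop at a point with zero gradient, but only the run started on the left reaches the lower loss value.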
The High-Dimensional Reality
Getting trapped in a poor local minimum is a real concern in low-dimensional problems. In high-dimensional spaces (such as neural networks with millions of parameters), however, empirical and theoretical work suggests that most local minima an optimizer actually reaches are clustered near the bottom of the landscape, with loss values very similar to that of the global minimum.
The Flat Obstacles: Saddle Points
In high-dimensional landscapes, the primary obstacle is not the local minimum, but the saddle point. A saddle point is a location where the gradient is zero, but the surface curves upward in some directions and downward in others.
Structure of a Saddle Point
A saddle point resembles a horse saddle: it is a minimum along one axis (front-to-back) but a maximum along another axis (left-to-right). Mathematically, the Hessian matrix (the matrix of second partial derivatives) at a non-degenerate saddle point has both positive and negative eigenvalues: positive eigenvalues mark directions that curve upward, negative eigenvalues mark directions that curve downward.
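The eigenvalue criterion can be tried on the textbook saddle $f(x, y) = x^2 - y^2$. The sketch below is illustrative only: the helper names `hessian_2d`, `symmetric_eigenvalues`, and `classify` are hypothetical, the Hessian is estimated by finite differences, and the closed-form eigenvalue formula applies only to symmetric 2x2 matrices.

```python
import math

def saddle(x, y):
    # Curves upward along x and downward along y.
    return x ** 2 - y ** 2

def hessian_2d(f, x, y, h=1e-3):
    # Finite-difference estimates of the second partial derivatives.
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return fxx, fxy, fyy

def symmetric_eigenvalues(fxx, fxy, fyy):
    # Closed-form eigenvalues of the symmetric matrix [[fxx, fxy], [fxy, fyy]].
    mean = (fxx + fyy) / 2
    radius = math.sqrt(((fxx - fyy) / 2) ** 2 + fxy ** 2)
    return mean - radius, mean + radius

def classify(eigenvalues, tol=1e-6):
    # Classify a critical point from the signs of its Hessian eigenvalues.
    lo, hi = eigenvalues
    if lo > tol:
        return "minimum"
    if hi < -tol:
        return "maximum"
    if lo < -tol and hi > tol:
        return "saddle"
    return "degenerate"

eigs = symmetric_eigenvalues(*hessian_2d(saddle, 0.0, 0.0))
kind = classify(eigs)
print(eigs, kind)
```

At the origin the estimated eigenvalues are approximately -2 and 2, one of each sign, so the point is classified as a saddle.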
Why Saddle Points Slow Down Training
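In high-dimensional spaces, saddle points are thought to be exponentially more common than local minima. A heuristic reason: at a critical point of a loss with $n$ parameters, all $n$ Hessian eigenvalues must be positive for the point to be a minimum, and if each sign were roughly a coin flip, the chance of that would shrink like $2^{-n}$. Standard gradient descent slows down dramatically near a saddle point because the gradient magnitude approaches zero there, so each update becomes tiny even though a descending direction still exists. Optimizers that use momentum or adaptive step sizes can escape these flat regions faster and continue descending.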
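The slowdown, and the help momentum provides, can be seen on a small example. This is a sketch under stated assumptions: the loss $f(x, y) = x^2 + y^4/4 - y^2/2$ is hypothetical (saddle at the origin, minima at $(0, \pm 1)$), the update is classical heavy-ball momentum, and the escape threshold $|y| > 0.5$, the learning rate, and the starting point are arbitrary illustrative choices. Adaptive step sizes are not shown.

```python
def grad(x, y):
    # Gradient of the hypothetical loss f(x, y) = x**2 + y**4 / 4 - y**2 / 2,
    # which has a saddle at (0, 0) and minima at (0, 1) and (0, -1).
    return 2.0 * x, y ** 3 - y

def steps_to_escape(lr=0.1, beta=0.0, start=(1.0, 1e-6), max_steps=10_000):
    # Heavy-ball update: v = beta * v - lr * g, then p = p + v.
    # With beta = 0.0 this reduces to plain gradient descent.
    x, y = start
    vx, vy = 0.0, 0.0
    for step in range(1, max_steps + 1):
        gx, gy = grad(x, y)
        vx = beta * vx - lr * gx
        vy = beta * vy - lr * gy
        x, y = x + vx, y + vy
        if abs(y) > 0.5:
            # The iterate has left the flat region around the saddle.
            return step
    return max_steps

gd_steps = steps_to_escape(beta=0.0)
momentum_steps = steps_to_escape(beta=0.9)
print(gd_steps, momentum_steps)
```

Both runs start almost exactly on the ridge leading into the saddle, where the downhill component of the gradient is tiny. Momentum accumulates that small but consistent signal across steps, so it needs fewer iterations to leave the flat region than plain gradient descent.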