The Gradient: Pointing to the Steepest Ascent
When we collect all the partial derivatives of a multivariable function into a single vector, we obtain the gradient. The gradient is one of the most powerful concepts in vector calculus and machine learning. In any multi-dimensional space, the gradient vector points in the direction of the steepest increase of the function, and its magnitude represents the rate of that increase. By reversing the gradient, we find the fastest path downhill.
The Gradient Vector
The gradient generalizes the concept of the derivative to functions of multiple variables. It is represented by the Nabla symbol ($\nabla$), which looks like an upside-down triangle.
Mathematical Structure
For a function $f(x_1, x_2, \dots, x_n)$, the gradient vector is defined as:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
This vector lives in the input space of the function and contains the slope of the function along each coordinate axis.
Steepest Ascent and Magnitude
If you stand on a hillside, there are infinitely many directions you could walk. The gradient vector points in the single direction that has the steepest upward slope. The length (or magnitude) of the vector, $||\nabla f||$, tells you how steep that slope is at your current location.
Gradient Descent: The Core Optimizer
In machine learning, we want to minimize the loss function, not maximize it. Therefore, we utilize the gradient to move in the opposite direction.
Moving Downhill
Since the gradient $\nabla L$ points in the direction of steepest ascent, the negative gradient $-\nabla L$ points in the direction of steepest descent. By taking steps in this direction, we can systematically lower the model's error.
The Update Rule
The gradient descent update formula is written as:
$$\theta \leftarrow \theta - \alpha \nabla L(\theta)$$
Here, $\theta$ represents the vector of all model parameters, $\alpha$ is the learning rate (step size), and $\nabla L(\theta)$ is the gradient of the loss function. The learning rate controls how far we step in the direction of the negative gradient.