The Gradient: Pointing to the Steepest Ascent

When we collect all the partial derivatives of a multivariable function into a single vector, we obtain the gradient. The gradient is a central concept in vector calculus and machine learning. At any point where the function is differentiable, the gradient vector points in the direction of the steepest increase of the function, and its magnitude is the rate of increase in that direction. Reversing the gradient gives the direction of steepest local decrease, which is the fastest way downhill from the current point.

The Gradient Vector

The gradient generalizes the concept of the derivative to scalar-valued functions of multiple variables. It is written with the nabla symbol ($\nabla$), which looks like an upside-down triangle.

Mathematical Structure

For a function $f(x_1, x_2, \dots, x_n)$, the gradient vector is defined as: $$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$ This vector lives in the input space of the function and contains the slope of the function along each coordinate axis.
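As a small illustration of the definition, the sketch below computes the gradient of one function in two ways: from the partial derivatives worked out by hand, and from a central finite-difference approximation. The function $f(x, y) = x^2 + 3y$, the evaluation point, and the helper names are assumptions chosen for this example only.

```python
def f(p):
    # Illustrative function f(x, y) = x^2 + 3*y (an assumption for this example).
    x, y = p
    return x ** 2 + 3.0 * y


def analytic_gradient(p):
    # Partial derivatives taken by hand: df/dx = 2x, df/dy = 3.
    x, y = p
    return [2.0 * x, 3.0]


def numerical_gradient(func, p, h=1e-6):
    # Central finite difference along each coordinate axis.
    grad = []
    for i in range(len(p)):
        forward = list(p)
        backward = list(p)
        forward[i] += h
        backward[i] -= h
        grad.append((func(forward) - func(backward)) / (2.0 * h))
    return grad


point = [1.0, 2.0]
grad_exact = analytic_gradient(point)
grad_approx = numerical_gradient(f, point)
print(grad_exact)
print(grad_approx)
```

The two results should agree to within the finite-difference error, and each entry is the slope of $f$ along one coordinate axis.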

Steepest Ascent and Magnitude

If you stand on a hillside, there are infinitely many directions you could walk. Wherever the gradient is nonzero, it points in the single direction that has the steepest upward slope. The length (or magnitude) of the vector, $\|\nabla f\|$, is the slope in that direction, so it tells you how steep the hill is at your current location.
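One way to check this claim numerically is to try many walking directions and compare their slopes. The sketch below assumes the example function $f(x, y) = x^2 + y^2$ at the point $(3, 4)$, where the gradient is $(6, 8)$, and uses the fact that the slope along a unit direction $u$ is the dot product $\nabla f \cdot u$. The one-degree sampling grid is an arbitrary choice for illustration.

```python
import math

# Gradient of the assumed example f(x, y) = x^2 + y^2 at the point (3, 4).
grad = (6.0, 8.0)
grad_norm = math.hypot(grad[0], grad[1])

best_slope = float("-inf")
best_direction = None

# Sample unit directions around the circle in one-degree increments.
for degrees in range(360):
    angle = math.radians(degrees)
    u = (math.cos(angle), math.sin(angle))
    # Slope of f when walking along u (the directional derivative).
    slope = grad[0] * u[0] + grad[1] * u[1]
    if slope > best_slope:
        best_slope = slope
        best_direction = u

# Unit vector along the gradient, for comparison with the best sample.
unit_grad = (grad[0] / grad_norm, grad[1] / grad_norm)

print(best_direction, unit_grad)
print(best_slope, grad_norm)
```

The best sampled direction should lie close to the normalized gradient, and its slope should approach, but never exceed, $\|\nabla f\|$.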

Gradient Descent: The Core Optimizer

In machine learning, we want to minimize the loss function, not maximize it. Therefore, we use the gradient to move in the opposite direction.

Moving Downhill

Since the gradient $\nabla L$ points in the direction of steepest ascent, the negative gradient $-\nabla L$ points in the direction of steepest descent. By taking sufficiently small steps in this direction, we lower the model's error at each step, as long as the gradient is nonzero.
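To make the contrast between the two directions concrete, here is a minimal sketch that takes one small step along $-\nabla L$ and one along $+\nabla L$ from the same starting point. The quadratic loss $L(\theta) = (\theta_1 - 1)^2 + (\theta_2 + 2)^2$, the starting point, and the step size of 0.1 are all assumptions made for illustration.

```python
def loss(theta):
    # Assumed toy loss with its minimum at theta = (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


def loss_gradient(theta):
    # Partial derivatives of the toy loss with respect to each parameter.
    return [2.0 * (theta[0] - 1.0), 2.0 * (theta[1] + 2.0)]


theta = [0.0, 0.0]
step = 0.1  # small step size, chosen arbitrarily for this example
grad = loss_gradient(theta)

# One step against the gradient (descent) and one step along it (ascent).
theta_down = [t - step * g for t, g in zip(theta, grad)]
theta_up = [t + step * g for t, g in zip(theta, grad)]

loss_here = loss(theta)
loss_down = loss(theta_down)
loss_up = loss(theta_up)
print(loss_down, loss_here, loss_up)
```

With these assumed values, the loss after the descent step is below the starting loss, and the loss after the ascent step is above it.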

The Update Rule

The gradient descent update formula is written as: $$\theta \leftarrow \theta - \alpha \nabla L(\theta)$$ Here, $\theta$ represents the vector of all model parameters, $\alpha$ is the learning rate (step size), and $\nabla L(\theta)$ is the gradient of the loss function. The learning rate controls how far we step in the direction of the negative gradient.
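The update rule can be sketched as a short loop. This is a minimal illustration under stated assumptions, not a production optimizer: the toy loss $L(\theta) = (\theta_1 - 1)^2 + (\theta_2 + 2)^2$, the helper name `gradient_descent`, the learning rate of 0.1, and the fixed count of 100 steps are all hypothetical choices.

```python
def loss(theta):
    # Assumed toy loss with its minimum at theta = (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


def loss_gradient(theta):
    # Gradient of the toy loss, one partial derivative per parameter.
    return [2.0 * (theta[0] - 1.0), 2.0 * (theta[1] + 2.0)]


def gradient_descent(grad_fn, theta, alpha, num_steps):
    # Repeatedly apply theta <- theta - alpha * grad L(theta).
    theta = list(theta)
    for _ in range(num_steps):
        grad = grad_fn(theta)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


theta_start = [0.0, 0.0]
theta_final = gradient_descent(loss_gradient, theta_start, alpha=0.1, num_steps=100)
print(theta_final, loss(theta_final))
```

For this particular loss, each update shrinks the distance to the minimizer by a factor of $1 - 2\alpha = 0.8$, so the parameters end up very close to $(1, -2)$. On this same loss, any learning rate above 1 makes that factor larger than 1 in magnitude, and the iterates move away from the minimum instead.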