The Chain Rule: The Math Behind Backpropagation

Deep neural networks are built by nesting functions inside one another. The output of the first layer becomes the input of the second, which flows into the third, and so on. To train the network, we must calculate how a change in the weights of the very first layer affects the final error at the output. The Chain Rule is the mathematical formula that allows us to differentiate these nested, composite functions, making backpropagation possible.

Differentiating Composite Functions

The Chain Rule describes how to find the derivative of a composite function. It states that the rate of change of a nested function is the product of the rate of change of the outer function and the rate of change of the inner function.

The Core Formula

If $y = f(u)$ and $u = g(x)$, then $y$ is a composite function of $x$ (written as $y = f(g(x))$). The Chain Rule states that: $$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$ Alternatively, in prime notation: $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$.

Intuitive Rate Multiplication

Imagine three gears linked together: Gear A, Gear B, and Gear C. If Gear A turns 3 times as fast as Gear B, and Gear B turns 2 times as fast as Gear C, then Gear A turns $3 \times 2 = 6$ times as fast as Gear C. The Chain Rule works the same way: rates of change multiply along the chain.

The Chain Rule in Neural Networks

Backpropagation is simply the practical implementation of the multivariable Chain Rule on a computational graph. It allows error signals to flow backward through the network.

Flow of Gradients

During the backward pass of training, we start at the loss function at the very end of the network. We calculate the derivative of the loss and propagate it backward, multiplying it by the derivatives of each activation function and weight matrix along the way to find the gradient for every parameter.

Computational Graphs

Modern deep learning libraries represent models as directed acyclic graphs of mathematical operations. By keeping track of the local derivative of each node, the library can apply the Chain Rule automatically to calculate gradients for arbitrary architectures.