Backpropagation: Computational Graphs and the Chain Rule

Backpropagation is the algorithm used to calculate gradients of the loss function with respect to a network's weights. By representing execution as a computational graph, backpropagation applies the chain rule of calculus in reverse order to propagate derivatives.

Computational Graphs

Computational graphs represent complex functions as directed acyclic graphs of operations and variables.

Node and Edge Representations

In a computational graph, nodes represent mathematical operators (such as addition, multiplication, or activations), while edges represent tensors flowing between them. The input variables and weights are the leaf nodes of this directed acyclic graph (DAG).

During the forward pass, data flows from inputs to outputs, producing a final scalar value (the loss). The graph structure records the sequence of operations, defining the dependencies needed to compute derivatives during the backward pass.

Topological Sorting and Execution

To execute the forward pass, nodes are evaluated in topological order, ensuring that all inputs to an operation are computed before the node itself is evaluated. The backward pass reverses this order, traversing the graph from output to inputs.

This reverse-mode traversal is highly efficient. By storing intermediate values during the forward pass, we can calculate the gradients of all parameters in a single backward pass, regardless of network depth.

The Chain Rule of Calculus

The chain rule allows us to decompose the gradient of a composite function into a product of simpler derivatives.

Multivariate Chain Rule

For a function $f$ that depends on multiple intermediate variables $y_j$, which in turn depend on $x$, the multivariate chain rule states:

$$\frac{\partial f}{\partial x} = \sum_j \frac{\partial f}{\partial y_j} \frac{\partial y_j}{\partial x}$$

In neural networks, this summation accounts for paths where a node branches to multiple output connections. Backpropagation sums the incoming gradients from all paths, ensuring that the total derivative is computed correctly.

Matrix Calculus and Jacobians

When working with vectors and matrices, derivatives are represented as Jacobian matrices. For a mapping $\mathbf{y} = f(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^N$ and $\mathbf{y} \in \mathbb{R}^M$, the Jacobian $\mathbf{J} \in \mathbb{R}^{M \times N}$ contains all partial derivatives:

$$\mathbf{J}_{ij} = \frac{\partial y_i}{\partial x_j}$$

Gradient propagation is executed as matrix multiplications of these Jacobians. In practice, to save memory, we calculate vector-Jacobian products (VJPs) directly, avoiding the need to store massive, sparse Jacobian matrices in memory.

PyTorch Autograd Engine

PyTorch's Autograd engine constructs dynamic computational graphs during execution, enabling automatic differentiation.

Coding a Custom Autograd Function

We can extend PyTorch's differentiation capabilities by writing a custom autograd function with explicit forward and backward rules:

<pre><code class="language-python">import torch class SquareAndDouble(torch.autograd.Function): @staticmethod def forward(ctx, x): # Save input tensor for backward pass ctx.save_for_backward(x) # Compute forward pass: y = 2 * x^2 return 2.0 * (x ** 2) @staticmethod def backward(ctx, grad_output): # Retrieve saved tensors x, = ctx.saved_tensors # Compute gradient: dy/dx = 4 * x grad_x = 4.0 * x # Apply chain rule: multiply incoming gradient by local gradient return grad_output * grad_x # Test custom function x = torch.tensor([3.0], requires_grad=True) custom_op = SquareAndDouble.apply y = custom_op(x) y.backward() print("Gradient of y wrt x:", x.grad.item()) # 4 * 3 = 12.0</pre>

In this code, we define a custom autograd operation. The ctx context object is used to cache tensors during the forward pass. The backward method implements the local gradient calculation, applying the chain rule by multiplying with grad_output.

Dynamic Graph Construction

Unlike static graph frameworks, PyTorch builds its computational graph dynamically during execution (define-by-run). Every tensor operation involving parameters with requires_grad=True appends a new node to the graph, referenced by the tensor's grad_fn property.

This dynamic architecture allows for conditional logic (e.g. if-statements) and variable-length inputs in the forward pass, as the backward graph is constructed from scratch during each iteration, providing immense design flexibility.