Calculating Derivatives Programmatically

Backpropagation requires computing derivatives of a loss function with respect to every parameter. In theory, you derive these by hand. In practice, you use two approaches: numerical differentiation (approximating with finite differences) for debugging, and automatic differentiation (used by PyTorch/JAX) for production.

Numerical Differentiation (Finite Differences)

The derivative definition is $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$. With a very small $h$ (like $10^{-7}$), we can approximate this numerically. This is useful for gradient checking — verifying your analytical gradient is correct.

A Gradient Checker

<pre><code class="language-python">def numerical_gradient(f, x, h=1e-7): """Approximate the derivative of f at x.""" return (f(x + h) - f(x - h)) / (2 * h) # central difference # Test on f(x) = x^3, true derivative is 3x^2 f = lambda x: x**3 print(numerical_gradient(f, x=2.0)) # ≈ 12.0 (true: 12) print(numerical_gradient(f, x=5.0)) # ≈ 75.0 (true: 75) </pre>

The central difference formula (using $x+h$ and $x-h$) is more accurate than the one-sided formula because its error is $O(h^2)$ instead of $O(h)$.

Computing Gradients for Multivariable Functions

Most AI functions take a vector of parameters. We compute the gradient by calculating the partial derivative for each parameter while holding the rest constant.

Vector Gradient

<pre><code class="language-python">import numpy as np def numerical_gradient_vec(f, x, h=1e-5): grad = np.zeros_like(x) for i in range(len(x)): x_plus = x.copy(); x_plus[i] += h x_minus = x.copy(); x_minus[i] -= h grad[i] = (f(x_plus) - f(x_minus)) / (2 * h) return grad # f(x, y) = x^2 + 3y → gradient should be [2x, 3] f = lambda v: v[0]**2 + 3*v[1] x0 = np.array([4.0, 1.0]) print(numerical_gradient_vec(f, x0)) # [8.0, 3.0] </pre>

Automatic Differentiation with PyTorch

Numerical differentiation is slow — it needs $2n$ function calls for $n$ parameters. Automatic differentiation (autograd) computes exact gradients in a single backward pass by tracking operations in a computation graph. This is what makes training billion-parameter models feasible.

PyTorch Autograd Basics

<pre><code class="language-python">import torch x = torch.tensor(3.0, requires_grad=True) # Define a computation y = x ** 3 + 2 * x # y = x^3 + 2x # Backpropagate to compute dy/dx y.backward() print(x.grad) # tensor(29.) → 3x^2 + 2 at x=3 = 27+2 = 29 ✅ </pre>