Backpropagation: Analytical Gradients vs. Numerical Gradients

Gradients can be calculated analytically, using exact derivative formulas, or numerically, using finite-difference approximations. Comparing the two is the basis of gradient checking, which is used to verify the correctness of custom backpropagation code.

Gradient Estimation Methods

Analytical and numerical differentiation offer different trade-offs in speed, precision, and implementation complexity.

Analytical Gradients

Analytical differentiation uses symbolic calculus to derive exact formulas for gradients. For example, the analytical gradient of $f(x) = x^2$ is exactly $2x$. Once derived, evaluating these formulas is extremely fast, requiring only basic arithmetic operations.

This efficiency is why deep learning frameworks train with analytical gradients: automatic differentiation chains together the exact derivative rule of each operation. However, hand-deriving gradients for complex operations (such as custom CUDA kernels) is prone to human error, so the result needs independent validation.

Numerical Gradients (Finite Differences)

Numerical differentiation approximates derivatives by evaluating the function at nearby points. Using the symmetric (central) finite-difference formula, the gradient is approximated as:

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$

where $h$ is a small step size (e.g., $10^{-5}$). While easy to implement, this method is slow because calculating gradients for a model with $D$ parameters requires $2D$ forward passes, making it impractical for training large networks.
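As a rough sketch of the formula above, the following pure-Python helper estimates a gradient with central differences and counts function evaluations to make the $2D$ cost visible. The names `numerical_gradient`, `counted_f`, and `counter` are illustrative and not taken from any library.

```python
def numerical_gradient(f, params, h=1e-5):
    """Central-difference estimate of the gradient of f at params (a list of floats)."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += h   # perturb only the i-th parameter
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2.0 * h))
    return grad


counter = {"calls": 0}

def counted_f(p):
    """f(p) = sum of squares, instrumented to count evaluations."""
    counter["calls"] += 1
    return sum(v * v for v in p)


point = [1.5, -2.0, 0.5]
grad = numerical_gradient(counted_f, point)
print("numerical gradient:", grad)              # close to [3.0, -4.0, 1.0]
print("function evaluations:", counter["calls"])  # two per parameter
```

Each parameter costs two evaluations of the function, which is why the approach stops being practical once $D$ reaches the millions.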

Gradient Checking (Grad Check)

Gradient checking compares analytical gradients against numerical approximations to catch implementation bugs.

Relative Error Metric

To compare the analytical gradient vector $\mathbf{g}_{ana}$ and the numerical gradient vector $\mathbf{g}_{num}$, we calculate their normalized relative error:

$$E_{rel} = \frac{\|\mathbf{g}_{num} - \mathbf{g}_{ana}\|_2}{\|\mathbf{g}_{num}\|_2 + \|\mathbf{g}_{ana}\|_2}$$

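Because the difference is normalized by the gradient magnitudes, this metric is scale-invariant: the same threshold applies whether the gradients are large or extremely small. (If both gradients are exactly zero, the ratio is undefined and should be treated as zero error.) As a rule of thumb, a relative error below $10^{-7}$ indicates a correct gradient implementation, values above $10^{-4}$ suggest a bug in the analytical derivative code, and values in between warrant a closer look.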
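A minimal sketch of this metric in plain Python is shown below. The helper names `l2_norm` and `relative_error` are illustrative, and returning zero when both gradients vanish is an assumed convention, not something the formula itself prescribes.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def relative_error(g_num, g_ana):
    """Normalized relative error between two gradient vectors."""
    diff = l2_norm([a - b for a, b in zip(g_num, g_ana)])
    denom = l2_norm(g_num) + l2_norm(g_ana)
    if denom == 0.0:
        return 0.0  # assumed convention: both gradients are exactly zero
    return diff / denom


g_ana = [3.0, -4.0]
g_num = [3.0000001, -3.9999999]

err_small = relative_error(g_num, g_ana)
# Scaling both vectors by the same factor leaves the metric essentially unchanged
err_scaled = relative_error([1e-6 * v for v in g_num], [1e-6 * v for v in g_ana])
# A gradient that is off by a factor of two is flagged clearly
err_bug = relative_error(g_num, [1.5, -2.0])

print("matching gradients:", err_small)
print("same gradients, scaled by 1e-6:", err_scaled)
print("buggy analytical gradient:", err_bug)
```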

Common Sources of Checking Failures

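Gradient checking can fail due to precision issues or non-differentiable operations. If the step size $h$ is too small (e.g., $10^{-15}$), $f(x + h)$ and $f(x - h)$ become nearly identical in floating point, and their subtraction loses most of its significant digits (catastrophic cancellation), which corrupts the numerical gradient. If $h$ is too large, the truncation error of the approximation, which grows as $O(h^2)$ for the central difference, will dominate.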
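The trade-off can be observed with a small experiment. The sketch below, which assumes $f(x) = \sin(x)$ as a test function with known derivative $\cos(x)$, sweeps the step size and prints the absolute error of the central difference at each setting.

```python
import math


def central_diff(f, x, h):
    """Symmetric finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 1.0
exact = math.cos(x0)  # exact derivative of sin at x0

errors = {}
for h in (1e-1, 1e-3, 1e-5, 1e-8, 1e-12, 1e-15):
    approx = central_diff(math.sin, x0, h)
    errors[h] = abs(approx - exact)
    print(f"h={h:.0e}  abs error={errors[h]:.3e}")
```

The error first shrinks as $h$ decreases and then grows again once round-off takes over, so a moderate step such as $10^{-5}$ tends to be a reasonable default in double precision.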

Additionally, functions containing non-differentiable points (like ReLU at $x=0$, or max-pooling boundaries) will exhibit large relative errors if the interval $[x - h, x + h]$ straddles the kink. Gradients should therefore be checked at points away from such boundaries.
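To illustrate the kink problem, the following sketch compares the central difference of ReLU at a point within $h$ of zero and at a point well away from it. The helper names are illustrative, and the analytical derivative uses the common convention of returning 0 at $x = 0$.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Analytical derivative; 0 at x = 0 by convention
    return 1.0 if x > 0 else 0.0


def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)


h = 1e-5
near_kink = 1e-6  # x - h is negative, so the step crosses the kink
away = 0.5        # both x - h and x + h stay on the linear side

num_near = central_diff(relu, near_kink, h)
num_away = central_diff(relu, away, h)

print("near kink: numerical", num_near, "analytical", relu_grad(near_kink))  # about 0.55 vs 1.0
print("away:      numerical", num_away, "analytical", relu_grad(away))       # about 1.0 vs 1.0
```

Near the kink the numerical estimate averages the slopes on either side, so the mismatch reflects the checking procedure and not a bug in the derivative.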

PyTorch Implementation

We can use PyTorch's built-in gradcheck utility or write a manual numerical checker to validate gradients.

Using torch.autograd.gradcheck

PyTorch provides a gradient checker, torch.autograd.gradcheck, that verifies custom functions against finite differences and expects double-precision (float64) input tensors:

<pre><code class="language-python">import torch
from torch.autograd import gradcheck

# Define custom operation
def my_op(x):
    return x ** 3

# Inputs must be double precision (float64) for accurate checks
inputs = torch.randn(2, requires_grad=True, dtype=torch.float64)

# Verify the gradients using finite differences
test_passed = gradcheck(my_op, inputs, eps=1e-6, atol=1e-4)
print("PyTorch gradcheck passed?:", test_passed)</code></pre>

In this code, we use gradcheck to validate the derivative of $x^3$. It computes the Jacobian matrix numerically using finite differences and compares it to the analytical Jacobian generated by PyTorch's autograd engine, returning True if they agree within the given tolerances. By default, a mismatch raises an exception instead of returning False.
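Since gradient checking is mainly useful for hand-written backward passes, here is a hedged sketch that applies gradcheck to a custom torch.autograd.Function. The classes `Cube` and `BuggyCube` are hypothetical examples written for this illustration; the second contains a deliberately wrong derivative to show what a failed check looks like.

```python
import torch
from torch.autograd import gradcheck


class Cube(torch.autograd.Function):
    """y = x^3 with a hand-written (correct) backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2


class BuggyCube(torch.autograd.Function):
    """Same forward pass, but the backward pass uses a wrong coefficient."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x ** 2  # deliberate bug: should be 3 * x^2


# Fixed float64 inputs keep the check deterministic
x = torch.tensor([0.5, -1.2, 2.0], dtype=torch.float64, requires_grad=True)

passed = gradcheck(Cube.apply, (x,), eps=1e-6, atol=1e-4)
# raise_exception=False makes gradcheck return False instead of raising
buggy_passed = gradcheck(BuggyCube.apply, (x,), eps=1e-6, atol=1e-4,
                         raise_exception=False)

print("Cube passed?:", passed)
print("BuggyCube passed?:", buggy_passed)
```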

Coding a Manual Gradient Checker

We can implement a manual gradient checking loop in PyTorch to compare analytical gradients with finite difference approximations:

<pre><code class="language-python">import torch

# Function to test
f = lambda x: (x ** 2).sum()

# Use float64: in float32, round-off swamps a step size of 1e-5
x = torch.tensor([1.5, -2.0], requires_grad=True, dtype=torch.float64)

# Analytical gradient
y = f(x)
y.backward()
grad_ana = x.grad.clone()

# Numerical gradient using finite differences
h = 1e-5
grad_num = torch.zeros_like(x)
for i in range(len(x)):
    # Perturb one coordinate at a time on detached copies
    x_plus = x.clone().detach()
    x_minus = x.clone().detach()
    x_plus[i] += h
    x_minus[i] -= h
    grad_num[i] = (f(x_plus) - f(x_minus)) / (2.0 * h)

# Relative error calculation
rel_error = torch.norm(grad_num - grad_ana) / (torch.norm(grad_num) + torch.norm(grad_ana))
print("Relative Error:", rel_error.item())</code></pre>

This loop computes the numerical gradient one dimension at a time. A small relative error between it and the analytical gradient indicates that the analytical backpropagation code is implemented correctly.