Why Standard RNNs Forget (Vanishing Gradients again)
Vanilla RNNs struggle to learn dependencies that span more than a few steps. The mathematical cause is the vanishing gradient problem, where propagating errors backward through many steps causes gradients to decay exponentially to zero.
The Jacobian Chain
The gradient of the loss at step T with respect to the hidden state at step t involves a product of Jacobian matrices representing the derivative of the recurrence relation.
Repeated Matrix Multiplications
To backpropagate from step T to step t, we multiply the transition matrix W_{hh}^T repeatedly: \\frac{\\partial h_T}{\\partial h_t} = \\prod_{j=t+1}^T W_{hh}^T \\text{diag}(1 - \\tanh^2(...)). This acts like exponentiating a matrix.
Vanishing vs. Exploding Conditions
If the largest eigenvalue of W_{hh} is less than 1, the gradient norm decays exponentially to zero (vanishing). If the largest eigenvalue is greater than 1, the gradient norm grows exponentially (exploding), causing training to destabilize.
Practical Remedies
While exploding gradients can be mitigated using gradient clipping, vanishing gradients require structural changes to the cell architecture.
Gradient Clipping
Gradient clipping prevents exploding gradients by scaling down the gradient vector if its norm exceeds a predefined threshold, keeping updates stable.
<pre><code class="language-python">import torch import torch.nn as nn import torch.optim as optim model = nn.RNN(10, 20, batch_first=True) optimizer = optim.Adam(model.parameters(), lr=0.01) # Training step # loss.backward() # Clip gradients to a maximum norm of 1.0 nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) optimizer.step()</pre>Structural Remedies
Gradient clipping does not solve vanishing gradients. To fix vanishing gradients, we must replace the simple multiplicative hidden state recurrence with an additive conveyor belt, which is the foundational design choice of the LSTM and GRU.