Weight Initialization: Symmetry Breaking and Vanishing Gradients
Symmetry breaking means initializing weights with different random values so that neurons in the same layer learn different features. The scale of those random values is a separate concern: it determines whether gradients vanish or explode during backpropagation.
Symmetry Breaking
Random initialization breaks the symmetry of hidden units, allowing them to adapt to different features.
The Identical Update Problem
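If all weights in a layer are initialized to the same value, every neuron in that layer computes the same output, and (when the outgoing weights are also equal) the gradients for those weights are identical. During optimization, the weights update by the same amount, so the neurons remain identical and the layer behaves like a single unit.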
To prevent this identical update problem, we must break the symmetry by initializing parameters with random values drawn from normal or uniform distributions. This ensures that each neuron starts with a different feature filter.
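The identical update problem can be reproduced in a few lines. The following is a minimal sketch, assuming a tiny hypothetical network (4 inputs, 3 tanh hidden units, 1 output) whose parameters are all set to the constant 0.5; the sizes and the constant are illustrative choices, not values from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny MLP: 4 inputs -> 3 hidden units (tanh) -> 1 output.
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))

# Illustrative assumption: every weight and bias gets the same constant value.
for p in model.parameters():
    nn.init.constant_(p, 0.5)

x = torch.randn(8, 4)
y = torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

hidden_grad = model[0].weight.grad  # shape (3, 4): one row per hidden unit
print(hidden_grad)

# Compare the gradient rows of the three hidden units.
rows_identical = bool(
    torch.allclose(hidden_grad[0], hidden_grad[1])
    and torch.allclose(hidden_grad[0], hidden_grad[2])
)
print("Rows identical:", rows_identical)
```

Because every hidden unit sees the same inputs through the same weights and feeds the output through the same weight, each row of the gradient matrix is the same, and a gradient step keeps the units identical.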
Breaking Symmetry via Randomness
We draw weights randomly from a distribution centered at zero, such as $\mathcal{N}(0, \sigma^2)$ or $\mathcal{U}(-a, a)$. This randomness ensures that during the forward pass, different neurons respond to different combinations of inputs.
This diversity of activation values leads to diverse gradient updates during backpropagation, allowing the network's filters to partition the input space and learn rich representations.
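As a rough illustration of the random draws described above, the sketch below initializes one layer from $\mathcal{N}(0, \sigma^2)$ and another from $\mathcal{U}(-a, a)$ using `torch.nn.init`. The network shape and the values $\sigma = 0.5$ and $a = 0.5$ are assumptions made for the example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny MLP: 4 inputs -> 3 hidden units (tanh) -> 1 output.
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))

# Zero-centered random draws; sigma = 0.5 and a = 0.5 are illustrative.
nn.init.normal_(model[0].weight, mean=0.0, std=0.5)  # N(0, sigma^2)
nn.init.uniform_(model[2].weight, a=-0.5, b=0.5)     # U(-a, a)
nn.init.zeros_(model[0].bias)  # biases may stay constant
nn.init.zeros_(model[2].bias)

x = torch.randn(8, 4)
y = torch.randn(8, 1)
nn.functional.mse_loss(model(x), y).backward()

hidden_grad = model[0].weight.grad  # one row per hidden unit
rows_differ = not torch.allclose(hidden_grad[0], hidden_grad[1])
print(hidden_grad)
print("Rows differ:", rows_differ)
```

Note that the biases are left at zero here: random weights alone are enough to give each hidden unit its own gradient.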
Gradient Stability and Scale
The scale of the initial weights determines whether gradients keep a usable magnitude or vanish as they propagate through deep architectures.
Impact of Initial Scale on Gradients
If the scale of the initial weights is too large, the pre-activation inputs $z$ to sigmoid or tanh layers will be large in magnitude, forcing activations into the saturated regions (near $0$ or $1$ for sigmoid, near $-1$ or $1$ for tanh). In these regions, the derivatives are near zero, causing gradients to vanish.
Conversely, if the scale is too small, activations shrink at each layer. Each weight gradient is the product of a backpropagated error and the layer's input activation, and the error itself is repeatedly multiplied by small weights, so gradients are close to zero by the time they reach the early layers and training stalls.
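Both failure modes can be seen on a single tanh layer. This is a minimal sketch, assuming a fan-in of 100, unit-variance Gaussian inputs, and a saturation threshold of $|\tanh(z)| > 0.99$; all three are illustrative choices.

```python
import torch

torch.manual_seed(0)

fan_in = 100
x = torch.randn(1000, fan_in)  # unit-variance inputs

results = {}
for std in (0.01, 0.1, 1.0, 10.0):
    w = torch.randn(fan_in, fan_in) * std
    z = x @ w                    # pre-activations
    a = torch.tanh(z)
    local_grad = 1.0 - a ** 2    # derivative of tanh evaluated at z
    mean_grad = local_grad.mean().item()
    saturated = (a.abs() > 0.99).float().mean().item()
    results[std] = (mean_grad, saturated)
    print(
        f"std={std:<5} activation std={a.std().item():.4f} "
        f"mean tanh'(z)={mean_grad:.4f} saturated={saturated:.1%}"
    )
```

With a large `std` most units saturate and the local derivative collapses; with a small `std` the derivative stays near 1 but the activations themselves become tiny.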
Mathematical Gradient Bounds
For a network with $L$ layers, the gradient of the loss with respect to the first layer's weights contains a product of the weight matrices of all subsequent layers. Ignoring the activation derivatives (which are at most 1 for sigmoid and tanh, so they can only shrink the result further), the dependence is:
$$\nabla_{\mathbf{W}^{[1]}} \mathcal{L} \propto \prod_{l=2}^L \mathbf{W}^{[l]}$$
If the weights are scaled so that the singular values of each $\mathbf{W}^{[l]}$ are below 1.0, the norm of the product decays exponentially with depth, and the gradient vanishes in deep networks. If the singular values are above 1.0, the product grows exponentially and the gradient explodes.
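The exponential behavior of the matrix product can be checked numerically. The sketch below is an illustration under assumptions: it uses random $100 \times 100$ Gaussian matrices, a depth of 30, and a hypothetical `gain` parameter that sets the entry standard deviation to `gain / sqrt(n)`, so each factor scales a typical vector by roughly `gain`.

```python
import torch

torch.manual_seed(0)

n, depth = 100, 30


def product_norm(gain: float) -> float:
    """Spectral norm of a product of `depth` random n x n matrices.

    Entries are drawn from N(0, gain^2 / n), so each factor changes the
    length of a typical vector by a factor of roughly `gain`.
    """
    prod = torch.eye(n, dtype=torch.float64)
    for _ in range(depth):
        w = torch.randn(n, n, dtype=torch.float64) * gain / n ** 0.5
        prod = w @ prod
    return torch.linalg.matrix_norm(prod, ord=2).item()


for gain in (0.5, 1.0, 2.0):
    print(f"gain={gain}: spectral norm of product = {product_norm(gain):.3e}")
```

A gain below 1 drives the norm of the product toward zero, while a gain above 1 makes it grow by many orders of magnitude over the same depth.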
PyTorch Diagnostics
We can diagnose initialization problems in PyTorch by logging activation variance and tracking gradient norms.
Tracking Activation Variances
This PyTorch script tracks how activation variance changes across sequential layers, showing how poor scale choices lead to variance collapse:
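<pre><code class="language-python">import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible run

# Stack of 10 linear layers without activations for simplicity.
# bias=False: PyTorch's default bias init would add variance at every
# layer and mask the decay caused by the small weights.
layers = [nn.Linear(100, 100, bias=False) for _ in range(10)]
x = torch.randn(100, 100)  # Input with variance close to 1.0

# Initialize with very small weights (std = 0.01)
for layer in layers:
    nn.init.normal_(layer.weight, mean=0.0, std=0.01)

# Forward pass tracking variance (no gradients needed for this diagnostic)
current_activation = x
with torch.no_grad():
    for i, layer in enumerate(layers):
        current_activation = layer(current_activation)
        print(f"Layer {i} Variance: {torch.var(current_activation).item():.6e}")</code></pre>
In this code, we observe the activation variance layer by layer. Because the weights are initialized with a standard deviation of 0.01 (too small for a fan-in of 100), the variance shrinks by about two orders of magnitude at every layer and collapses exponentially, illustrating signal decay. Biases are disabled so that the default bias initialization does not mask the effect.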
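The decay rate can also be estimated by hand. As a sketch, assume zero-mean inputs, independent zero-mean weights with variance $\sigma^2$, no bias term, and fan-in $n$ (the symbols $\sigma$, $n$, and $z^{[l]}$ for the output of layer $l$ are introduced here for illustration). Each linear layer then scales the variance by approximately $n\sigma^2$:
$$\operatorname{Var}\big(z^{[l]}\big) \approx n\,\sigma^2\,\operatorname{Var}\big(z^{[l-1]}\big) = 100 \times (0.01)^2 \times \operatorname{Var}\big(z^{[l-1]}\big) = 0.01\,\operatorname{Var}\big(z^{[l-1]}\big)$$
After ten such layers, a unit-variance input is reduced to a variance on the order of $0.01^{10} = 10^{-20}$. Under the same assumptions, the variance is preserved when $n\sigma^2 = 1$, which for $n = 100$ means $\sigma = 0.1$.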
Gradient Tracking during Backpropagation
We can track gradient norms during the backward pass by checking the .grad attributes of parameters. If the norms approach zero for early layers, it indicates vanishing gradients; if they are extremely large, it indicates exploding gradients.
Logging these values during the first few epochs helps detect initialization issues before training stalls or diverges.
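A gradient norm check of this kind might look like the following sketch. The model is a hypothetical 10-layer tanh MLP with width 100, deliberately initialized with a small standard deviation of 0.01; the architecture, batch size, and loss are assumptions chosen only to make the diagnostic visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical deep tanh MLP with deliberately small initial weights.
depth, width = 10, 100
blocks = []
for _ in range(depth):
    linear = nn.Linear(width, width)
    nn.init.normal_(linear.weight, mean=0.0, std=0.01)
    nn.init.zeros_(linear.bias)
    blocks += [linear, nn.Tanh()]
model = nn.Sequential(*blocks, nn.Linear(width, 1))

x = torch.randn(64, width)
y = torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# After backward(), each parameter's .grad holds dLoss/dParam.
grad_norms = {}
for name, param in model.named_parameters():
    if param.grad is not None and name.endswith("weight"):
        grad_norms[name] = param.grad.norm().item()
        print(f"{name:<10} grad norm = {grad_norms[name]:.3e}")
```

In a real training loop, the same loop over `named_parameters()` can run right after `loss.backward()` and before the optimizer step, with the norms sent to whatever logger the project already uses.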