Weight Initialization: Xavier (Glorot) Initialization

Xavier (Glorot) initialization is designed to stabilize activations and gradients in networks with saturating activation functions such as tanh and sigmoid. It scales each layer's weights according to that layer's input and output dimensions.

Mathematical Derivation

Xavier initialization derives weight scales that keep activation and gradient variances constant across layers.

Variance Consistency Assumptions

Glorot and Bengio assumed that the network operates in its linear region, that inputs are independent and identically distributed with zero mean, and that weights are independent with zero mean. Under these assumptions, for a layer with $d_{in}$ inputs and $d_{out}$ outputs, the variance of each activation $a^{[l]}$ is:

$$\text{Var}(a^{[l]}) = d_{in} \times \text{Var}(W^{[l]}) \times \text{Var}(a^{[l-1]})$$

To maintain constant variance across layers, we require $d_{in} \text{Var}(W^{[l]}) = 1$. Similarly, to maintain constant gradient variance in the backward pass, we require $d_{out} \text{Var}(W^{[l]}) = 1$.
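The forward-pass relation can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `preactivation_variance`, the layer width, and the chosen variances are illustrative assumptions, and the activation is taken to be the identity, in line with the linear-region assumption.

```python
import random
import statistics

def preactivation_variance(d_in, var_w, var_a, n_samples=2000, seed=0):
    """Estimate Var(z) for z = sum_i w_i * a_i with zero-mean, independent w and a."""
    rng = random.Random(seed)
    std_w, std_a = var_w ** 0.5, var_a ** 0.5
    samples = []
    for _ in range(n_samples):
        # Fresh weights and inputs for every sample, so the estimate averages over both
        z = sum(rng.gauss(0.0, std_w) * rng.gauss(0.0, std_a) for _ in range(d_in))
        samples.append(z)
    return statistics.pvariance(samples)

d_in = 100
predicted = d_in * 0.01 * 1.5  # d_in * Var(W) * Var(a)
estimated = preactivation_variance(d_in, var_w=0.01, var_a=1.5)
print(f"predicted {predicted:.3f}, estimated {estimated:.3f}")

# With Var(W) = 1 / d_in the input variance should pass through roughly unchanged
preserved = preactivation_variance(d_in, var_w=1.0 / d_in, var_a=1.5)
print(f"Var(W) = 1/d_in: input variance 1.500, output variance {preserved:.3f}")
```

The estimates are noisy (a few percent with these sample sizes), so they should be read as a sanity check rather than a proof.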

Xavier Uniform and Normal Bounds

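The forward ($d_{in}$) and backward ($d_{out}$) constraints can both hold only when $d_{in} = d_{out}$. As a compromise, Xavier initialization sets the weight variance to the reciprocal of the average of the two dimensions, which is also the harmonic mean of the two target variances $1/d_{in}$ and $1/d_{out}$: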

$$\text{Var}(W) = \frac{2}{d_{in} + d_{out}}$$

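For a normal distribution, we draw weights from $\mathcal{N}(0, \sigma^2)$ where $\sigma^2 = \frac{2}{d_{in} + d_{out}}$. For a uniform distribution, we draw from $\mathcal{U}(-a, a)$ where $a = \sqrt{\frac{6}{d_{in} + d_{out}}}$. The factor of 6 arises because a uniform distribution on $[-a, a]$ has variance $a^2 / 3$, so setting $a^2 / 3 = \frac{2}{d_{in} + d_{out}}$ gives the stated bound.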
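As a small worked example, the sketch below computes both scales for a 100-input, 50-output layer and confirms that the uniform bound yields the same variance as the normal version. The helper name `xavier_scales` is hypothetical and the layer size is chosen only for illustration.

```python
import math

def xavier_scales(d_in, d_out):
    """Return (std, bound) for Xavier normal and Xavier uniform initialization."""
    var_w = 2.0 / (d_in + d_out)
    std = math.sqrt(var_w)          # weights drawn from N(0, std^2)
    bound = math.sqrt(3.0 * var_w)  # U(-bound, bound) has variance bound^2 / 3
    return std, bound

std, bound = xavier_scales(100, 50)
print(f"std = {std:.4f}, bound = {bound:.4f}")
print(f"uniform variance = {bound ** 2 / 3:.6f}, target = {2 / 150:.6f}")
```

For this layer the bound is $\sqrt{6/150} = 0.2$, and both distributions share the variance $2/150 \approx 0.0133$.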

Design Trade-offs

Xavier initialization works well with activations such as tanh, but it no longer preserves variance when used with rectified linear units (ReLU).

Balancing Fan-in and Fan-out

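By including both $d_{in}$ (fan-in) and $d_{out}$ (fan-out), Xavier initialization trades off the forward and backward conditions rather than satisfying either one exactly. When $d_{in} \neq d_{out}$, the forward pass scales variance by $\frac{2 d_{in}}{d_{in} + d_{out}}$ and the backward pass by $\frac{2 d_{out}}{d_{in} + d_{out}}$, so neither direction is favored at the other's expense. This balance matters most in networks where layers change dimension, such as compression bottlenecks.

In practice, this keeps gradient magnitudes more comparable across layers of varying width, which can reduce the need to tune learning rates separately for different architectures.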
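To make the balance concrete, the sketch below evaluates the per-layer variance multipliers $d_{in} \text{Var}(W)$ (forward) and $d_{out} \text{Var}(W)$ (backward) for a few layer shapes. The helper name `xavier_gains` and the shapes are illustrative assumptions.

```python
def xavier_gains(d_in, d_out):
    """Per-layer variance multipliers under Var(W) = 2 / (d_in + d_out)."""
    var_w = 2.0 / (d_in + d_out)
    forward = d_in * var_w    # factor applied to activation variance
    backward = d_out * var_w  # factor applied to gradient variance
    return forward, backward

shapes = [(256, 256), (512, 64), (64, 512)]
gains = {shape: xavier_gains(*shape) for shape in shapes}
for (d_in, d_out), (fwd, bwd) in gains.items():
    print(f"{d_in:4d} -> {d_out:4d}: forward x{fwd:.3f}, backward x{bwd:.3f}")
```

Only the square layer gives a multiplier of exactly 1 in both directions. For the 512-to-64 bottleneck the forward multiplier is about 1.78 and the backward multiplier about 0.22, and the two always sum to 2.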

Limitations with ReLU Activations

The Xavier derivation assumes an activation function that behaves like the identity near zero. ReLU instead maps all negative pre-activations to zero, so for a zero-mean, symmetrically distributed pre-activation $z$, only half of the second moment survives: $\mathbb{E}[a^2] = \frac{1}{2} \text{Var}(z)$.

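As a result, using Xavier initialization in deep ReLU networks with equal-width layers causes the mean squared activation to shrink by roughly half at each layer, so signals and gradients vanish as depth grows. ReLU networks therefore need a rescaled scheme, such as He (Kaiming) initialization with $\text{Var}(W) = \frac{2}{d_{in}}$.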
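The decay can be observed in a small simulation. The sketch below pushes random inputs through a stack of equal-width ReLU layers and records the mean squared activation after each one. The function name `relu_second_moments`, the width, depth, and batch size are assumptions made for illustration. The second run rescales the weights with `gain=init.calculate_gain("relu")` (that is, $\sqrt{2}$), which is one possible adjustment for equal-width layers and is shown only for comparison.

```python
import torch
import torch.nn as nn
import torch.nn.init as init

def relu_second_moments(gain, width=512, depth=10, batch=1024, seed=0):
    """Track mean(a^2) through a stack of equal-width ReLU layers."""
    torch.manual_seed(seed)
    a = torch.randn(batch, width)
    moments = [a.pow(2).mean().item()]
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init.xavier_normal_(layer.weight, gain=gain)
            a = torch.relu(layer(a))
            moments.append(a.pow(2).mean().item())
    return moments

plain = relu_second_moments(gain=1.0)
scaled = relu_second_moments(gain=init.calculate_gain("relu"))  # sqrt(2)

# Print every second layer to keep the output short
print("gain=1.0     :", [f"{m:.4f}" for m in plain[::2]])
print("gain=sqrt(2) :", [f"{m:.4f}" for m in scaled[::2]])
```

With a gain of 1.0 the values should fall by roughly a factor of two per layer, while the rescaled run should stay close to its starting level, up to sampling noise.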

PyTorch Implementation

We can apply Xavier initialization using PyTorch's nn.init functions or implement the scaling bounds manually.

Using nn.init.xavier

Here is how to apply Xavier uniform and normal initializations to PyTorch layers:

<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(in_features=100, out_features=50)

# Xavier Uniform Initialization
init.xavier_uniform_(layer.weight, gain=1.0)

# Xavier Normal Initialization
# (overwrites the uniform values above; in practice, pick one of the two)
init.xavier_normal_(layer.weight, gain=1.0)

# Initialize bias to zero
init.constant_(layer.bias, 0.0)</code></pre>

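In this code, the gain parameter multiplies the standard deviation (or the uniform bound) to compensate for specific activation functions. The default gain of 1.0 matches the theoretical derivation, which assumes the linear regime. It is also the value that nn.init.calculate_gain returns for sigmoid, whereas for tanh calculate_gain returns 5/3.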
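The effect of the gain can be inspected directly. The sketch below looks up the gains PyTorch suggests for a few nonlinearities through `init.calculate_gain` and compares the expected standard deviation, `gain * sqrt(2 / (fan_in + fan_out))`, with the empirical one. The tensor shape and seed are arbitrary choices for illustration.

```python
import math
import torch
import torch.nn.init as init

torch.manual_seed(0)
fan_out, fan_in = 400, 600
base_std = math.sqrt(2.0 / (fan_in + fan_out))

results = {}
for name in ["linear", "sigmoid", "tanh", "relu"]:
    gain = init.calculate_gain(name)
    w = torch.empty(fan_out, fan_in)
    init.xavier_normal_(w, gain=gain)
    # Store (gain, expected std, empirical std) for each nonlinearity
    results[name] = (gain, gain * base_std, w.std().item())

for name, (gain, expected, empirical) in results.items():
    print(f"{name:8s} gain={gain:.4f} expected std={expected:.4f} empirical std={empirical:.4f}")
```

The empirical values should agree with the expected ones to within sampling noise, which shows that the gain acts as a plain multiplier on the Xavier scale.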

Coding Xavier Manually

We can write a manual Xavier uniform initialization function in PyTorch using random uniform tensors:

<pre><code class="language-python">import torch
import torch.nn as nn

layer = nn.Linear(in_features=100, out_features=50)

# Dimensions of the layer: weight has shape (out_features, in_features)
d_in = layer.weight.shape[1]
d_out = layer.weight.shape[0]

# Calculate bound: sqrt(6 / (d_in + d_out))
bound = (6.0 / (d_in + d_out)) ** 0.5

with torch.no_grad():
    # Fill weights manually from uniform distribution [-bound, bound]
    layer.weight.uniform_(-bound, bound)

print("Manual Xavier bound:", bound)</code></pre>

This manual calculation reproduces the theoretical uniform bound, which is the same bound that nn.init.xavier_uniform_ uses when gain is 1.0. Drawing random values within this range breaks the symmetry between units while giving the weights the target variance $\frac{2}{d_{in} + d_{out}}$.
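One way to check the agreement is to initialize two layers of the same shape, one with the built-in function and one manually, and compare their statistics. The following is a rough sketch; the layer size, the seed, and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.init as init

torch.manual_seed(0)
builtin = nn.Linear(300, 200)
manual = nn.Linear(300, 200)

# Built-in Xavier uniform with the default gain
init.xavier_uniform_(builtin.weight, gain=1.0)

# Manual version using the same bound formula
d_out, d_in = manual.weight.shape
bound = (6.0 / (d_in + d_out)) ** 0.5
with torch.no_grad():
    manual.weight.uniform_(-bound, bound)

target_var = 2.0 / (d_in + d_out)
stats = {}
for name, layer in [("builtin", builtin), ("manual", manual)]:
    w = layer.weight.detach()
    stats[name] = (w.abs().max().item(), w.var().item())
    print(f"{name}: max|w| = {stats[name][0]:.4f} (bound {bound:.4f}), "
          f"var = {stats[name][1]:.6f} (target {target_var:.6f})")
```

Both layers should stay within the bound and show a variance close to the target. The individual weights differ because each call draws its own random values.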