Weight Initialization: He (Kaiming) Initialization

He (Kaiming) initialization is the standard weight initialization scheme for networks using rectified activation functions like ReLU and Leaky ReLU. It adjusts the scaling factor to compensate for the variance loss in rectified activations.

Mathematical Derivation for ReLU

He initialization doubles the weight variance relative to the $1/d_{in}$ scaling used for symmetric activations, compensating for ReLU zeroing out negative pre-activations.

The ReLU Variance Deficit

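Because ReLU maps negative inputs to zero, it discards half of the second moment of a zero-mean, symmetric pre-activation: $\mathbb{E}[a^2] = \frac{1}{2} \text{Var}(z)$. This is commonly written as $\text{Var}(a) = \frac{1}{2} \text{Var}(z)$, which treats the second moment as the variance; strictly, ReLU outputs have a non-zero mean, so their true variance is smaller, but the second moment is the quantity that propagates to the next layer. Substituting this into the variance equation yields: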

$$\text{Var}(a^{[l]}) = \frac{1}{2} d_{in} \text{Var}(W^{[l]}) \text{Var}(a^{[l-1]})$$

To maintain constant variance across layers, we require $\frac{1}{2} d_{in} \text{Var}(W^{[l]}) = 1$, that is, $\text{Var}(W^{[l]}) = 2/d_{in}$. The weights must therefore have twice the $1/d_{in}$ variance of the fan-in form of Xavier initialization.
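The halving and the resulting condition can be checked numerically. The following is a minimal NumPy sketch, not part of the original derivation: the layer width, depth, batch size, and the helper name `forward_second_moments` are illustrative assumptions. It pushes Gaussian inputs through a stack of bias-free ReLU layers and records the mean squared pre-activation at each depth, once with $\text{Var}(W) = 1/d_{in}$ and once with $\text{Var}(W) = 2/d_{in}$.

```python
import numpy as np

def forward_second_moments(numerator, d=1024, depth=10, batch=256, seed=0):
    """Mean squared pre-activation at each layer of a bias-free ReLU stack.

    Weights are drawn from N(0, numerator / d), so numerator=2.0 is He scaling.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, d))       # unit-variance inputs
    std = np.sqrt(numerator / d)
    moments = []
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * std
        z = a @ W                             # pre-activation
        moments.append(float(np.mean(z ** 2)))
        a = np.maximum(z, 0.0)                # ReLU
    return moments

# Single-layer check: ReLU keeps about half of the second moment
# of a zero-mean, symmetric input.
rng = np.random.default_rng(1)
z = rng.standard_normal(1_000_000)
relu_ratio = float(np.mean(np.maximum(z, 0.0) ** 2) / np.var(z))

fan_in_scaled = forward_second_moments(1.0)   # Var(W) = 1 / d_in
he_scaled = forward_second_moments(2.0)       # Var(W) = 2 / d_in

print(f"E[relu(z)^2] / Var(z): {relu_ratio:.3f}")
print(f"1/d_in scaling, first vs last layer: {fan_in_scaled[0]:.3e} vs {fan_in_scaled[-1]:.3e}")
print(f"2/d_in scaling, first vs last layer: {he_scaled[0]:.3e} vs {he_scaled[-1]:.3e}")
```

Under these assumptions, the $1/d_{in}$ run should shrink by roughly a factor of two per layer, while the $2/d_{in}$ run should stay at the same order of magnitude, up to finite-width noise.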

He Normal and Uniform Bounds

To satisfy the ReLU variance condition, He initialization sets the weight variance to:

$$\text{Var}(W) = \frac{2}{d_{in}}$$

For He Normal initialization, weights are drawn from $\mathcal{N}(0, \sigma^2)$ where $\sigma^2 = \frac{2}{d_{in}}$. For He Uniform initialization, weights are drawn from $\mathcal{U}(-a, a)$. Since the variance of $\mathcal{U}(-a, a)$ is $\frac{a^2}{3}$, setting $\frac{a^2}{3} = \frac{2}{d_{in}}$ gives the bound $a = \sqrt{\frac{6}{d_{in}}}$.
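As a quick sanity check on both distributions, here is a small NumPy sketch. The function names `he_normal` and `he_uniform` and the layer sizes are assumptions made for illustration, not a library API. It samples both variants and compares their empirical variance with the target $2/d_{in}$.

```python
import numpy as np

def he_normal(d_in, d_out, rng):
    """Sample a (d_out, d_in) weight matrix from N(0, 2 / d_in)."""
    std = np.sqrt(2.0 / d_in)
    return rng.normal(loc=0.0, scale=std, size=(d_out, d_in))

def he_uniform(d_in, d_out, rng):
    """Sample a (d_out, d_in) weight matrix from U(-a, a), a = sqrt(6 / d_in)."""
    bound = np.sqrt(6.0 / d_in)
    return rng.uniform(low=-bound, high=bound, size=(d_out, d_in))

rng = np.random.default_rng(0)
d_in, d_out = 400, 2000            # illustrative sizes
target_var = 2.0 / d_in
uniform_bound = np.sqrt(6.0 / d_in)

w_normal = he_normal(d_in, d_out, rng)
w_uniform = he_uniform(d_in, d_out, rng)

print("target variance    :", target_var)
print("He Normal variance :", w_normal.var())
print("He Uniform variance:", w_uniform.var())
print("He Uniform max |w| :", np.abs(w_uniform).max(), "bound:", uniform_bound)
```

Both empirical variances should land close to the target, and no uniform sample should exceed the bound.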

Modes and Activation Adjustments

He initialization supports two fan modes, which determine whether forward or backward variance is preserved, and adjusts for the non-zero negative slope of Leaky ReLU.

Fan-in vs. Fan-out Modes

He initialization can scale weights based on $d_{in}$ (fan-in mode) or $d_{out}$ (fan-out mode). Fan-in mode preserves the variance of activations in the forward pass, which is standard. Fan-out mode preserves the variance of gradients in the backward pass.

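The choice between these modes depends on the architecture. The two coincide only when $d_{in} = d_{out}$; when a layer changes width, the mode that is not chosen sees its variance scaled by $d_{out}/d_{in}$ (or the inverse) at that layer. In modern deep networks, fan-in is generally preferred, but fan-out is sometimes used for convolutional layers, where the fan counts include the kernel area; torchvision's ResNet implementation, for example, initializes its convolutions in fan-out mode.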
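To make the two modes concrete, the sketch below computes both fan values from a weight shape and the resulting He standard deviation for ReLU. The helpers `compute_fans` and `he_std` are hypothetical names written for this example. They assume the `(out, in, *kernel_dims)` weight layout that PyTorch uses for linear and convolutional layers.

```python
import math

def compute_fans(weight_shape):
    """Return (fan_in, fan_out) for a weight of shape (out, in, *kernel_dims)."""
    if len(weight_shape) < 2:
        raise ValueError("fan-in/fan-out need a weight with at least 2 dimensions")
    out_channels, in_channels, *kernel_dims = weight_shape
    receptive_field = math.prod(kernel_dims)   # 1 for a linear layer
    return in_channels * receptive_field, out_channels * receptive_field

def he_std(weight_shape, mode="fan_in"):
    """Standard deviation sqrt(2 / fan) for ReLU under the chosen mode."""
    if mode not in ("fan_in", "fan_out"):
        raise ValueError(f"unknown mode: {mode}")
    fan_in, fan_out = compute_fans(weight_shape)
    fan = fan_in if mode == "fan_in" else fan_out
    return math.sqrt(2.0 / fan)

linear_shape = (50, 100)         # (out_features, in_features)
conv_shape = (256, 64, 3, 3)     # (out_channels, in_channels, kH, kW)

for shape in (linear_shape, conv_shape):
    fan_in, fan_out = compute_fans(shape)
    print(shape, "fan_in =", fan_in, "fan_out =", fan_out,
          "std(fan_in) =", round(he_std(shape, "fan_in"), 4),
          "std(fan_out) =", round(he_std(shape, "fan_out"), 4))
```

For the convolution, which widens from 64 to 256 channels, fan-out mode yields a smaller standard deviation than fan-in mode, because the larger fan sits in the denominator.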

Leaky ReLU Kaiming Adjustment

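When using Leaky ReLU with negative slope $\alpha$, the variance loss is smaller because negative inputs are scaled rather than discarded: the activation retains a fraction $\frac{1 + \alpha^2}{2}$ of the input's second moment instead of $\frac{1}{2}$. The weight variance adjusts to: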

$$\text{Var}(W) = \frac{2}{(1 + \alpha^2) d_{in}}$$

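For a negative slope of $\alpha = 0.01$, the correction factor $1 + \alpha^2 = 1.0001$ is negligible. For larger slopes (e.g. $\alpha = 0.2$ in GANs) the factor is $1.04$: small per layer, but without it the signal's second moment grows by about 4% at every layer, which compounds to roughly $7\times$ over 50 layers.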
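The size of this adjustment is easy to tabulate. The sketch below uses two hypothetical helpers, `leaky_relu_he_variance` and `growth_if_uncorrected`, with an assumed width of 512 and depth of 50. It evaluates the adjusted weight variance and the factor by which the signal would grow if the plain ReLU value $2/d_{in}$ were used with Leaky ReLU.

```python
import math

def leaky_relu_he_variance(d_in, negative_slope):
    """Weight variance 2 / ((1 + alpha^2) * d_in) for Leaky ReLU."""
    return 2.0 / ((1.0 + negative_slope ** 2) * d_in)

def growth_if_uncorrected(negative_slope, depth):
    """Growth of the signal's second moment over `depth` layers when the
    plain ReLU variance 2 / d_in is used with a Leaky ReLU activation.
    Each layer then multiplies the second moment by (1 + alpha^2)."""
    return (1.0 + negative_slope ** 2) ** depth

d_in = 512    # illustrative layer width
depth = 50    # illustrative network depth
for alpha in (0.0, 0.01, 0.2):
    var_w = leaky_relu_he_variance(d_in, alpha)
    growth = growth_if_uncorrected(alpha, depth)
    print(f"alpha={alpha}: Var(W)={var_w:.6f}, std={math.sqrt(var_w):.5f}, "
          f"{depth}-layer growth without correction={growth:.3f}")
```

With $\alpha = 0$ the formula reduces to the ReLU case, so the first row doubles as a consistency check.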

PyTorch Implementation

We can apply Kaiming initialization using PyTorch's native methods or implement the scaling manually.

Using nn.init.kaiming

Here is how to apply Kaiming uniform and normal initializations in PyTorch:

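<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(in_features=100, out_features=50)
leaky_layer = nn.Linear(in_features=100, out_features=50)

# Kaiming Normal (He Normal) for ReLU activations
# (a is ignored when nonlinearity='relu')
init.kaiming_normal_(layer.weight, a=0.0, mode='fan_in', nonlinearity='relu')

# Kaiming Uniform (He Uniform) for Leaky ReLU activations with negative slope 0.2.
# A separate layer is used so this call does not overwrite the weights set above.
init.kaiming_uniform_(leaky_layer.weight, a=0.2, mode='fan_in', nonlinearity='leaky_relu')

# Start both biases at zero
init.constant_(layer.bias, 0.0)
init.constant_(leaky_layer.bias, 0.0)</code></pre>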

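In this code, the <code>a</code> parameter is the negative slope of the activation function; it only affects the result when <code>nonlinearity='leaky_relu'</code> and is ignored for <code>'relu'</code>. The <code>nonlinearity</code> argument tells PyTorch which activation is used so it can compute the correct gain, and <code>mode</code> selects fan-in or fan-out scaling.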
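The gain that PyTorch derives from these arguments can be inspected directly with `nn.init.calculate_gain`. The short sketch below assumes a fan-in of 100 purely for illustration. It shows how the gain becomes the standard deviation used for the normal variant and the bound used for the uniform variant.

```python
import math
import torch.nn as nn

# Gain for ReLU is sqrt(2); for Leaky ReLU it is sqrt(2 / (1 + slope^2)).
relu_gain = nn.init.calculate_gain("relu")
leaky_gain = nn.init.calculate_gain("leaky_relu", 0.2)

fan_in = 100  # illustrative value

# Kaiming Normal uses std = gain / sqrt(fan).
relu_std = relu_gain / math.sqrt(fan_in)
leaky_std = leaky_gain / math.sqrt(fan_in)

# Kaiming Uniform uses bound = sqrt(3) * std, so the variance matches std^2.
leaky_bound = math.sqrt(3.0) * leaky_std

print("ReLU gain      :", relu_gain, "-> std", relu_std)
print("Leaky ReLU gain:", leaky_gain, "-> std", leaky_std)
print("Leaky ReLU uniform bound:", leaky_bound)
```

Squaring `leaky_std` recovers $\frac{2}{(1 + \alpha^2) d_{in}}$ with $\alpha = 0.2$, the same variance as the Leaky ReLU formula above.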

Coding He Initialization Manually

We can manually implement He Normal initialization to verify the variance scaling rules:

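<pre><code class="language-python">import torch
import torch.nn as nn

layer = nn.Linear(in_features=100, out_features=50)

# Extract fan-in; nn.Linear stores its weight as (out_features, in_features)
d_in = layer.weight.shape[1]

# Calculate standard deviation: sqrt(2 / d_in)
std = (2.0 / d_in) ** 0.5

with torch.no_grad():
    # Fill weights in place from a normal distribution with this std
    layer.weight.normal_(mean=0.0, std=std)

print("Manual He Normal std:", std)
print("Empirical weight std:", layer.weight.std().item())</code></pre>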

This manual calculation reproduces the He Normal scaling rule. Drawing weights from a normal distribution with standard deviation $\sqrt{2/d_{in}}$ preserves the signal's second moment in the forward pass through ReLU, matching what PyTorch's native kaiming_normal_ computes with mode='fan_in' and nonlinearity='relu'.
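One way to check this claim empirically is sketched below. The layer size, batch size, and seed are assumptions chosen so the estimates are reasonably tight. It compares the weight standard deviation from `kaiming_normal_` with the manual value, then passes ReLU activations through the manually initialized weights (bias omitted) to see whether the mean squared pre-activation stays near 1.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

d_in, d_out = 1024, 1024   # illustrative sizes
layer_native = nn.Linear(d_in, d_out)
layer_manual = nn.Linear(d_in, d_out)

# Native He Normal initialization.
nn.init.kaiming_normal_(layer_native.weight, mode="fan_in", nonlinearity="relu")

# Manual He Normal initialization with std = sqrt(2 / d_in).
expected_std = math.sqrt(2.0 / d_in)
with torch.no_grad():
    layer_manual.weight.normal_(mean=0.0, std=expected_std)

native_std = layer_native.weight.std().item()
manual_std = layer_manual.weight.std().item()

with torch.no_grad():
    z_prev = torch.randn(4096, d_in)           # previous pre-activations, variance 1
    a_prev = torch.relu(z_prev)                # ReLU activations
    z_next = a_prev @ layer_manual.weight.T    # next pre-activations, bias omitted
    prev_ms = z_prev.pow(2).mean().item()
    next_ms = z_next.pow(2).mean().item()

print("expected std:", expected_std)
print("native std  :", native_std)
print("manual std  :", manual_std)
print("mean square before / after the layer:", prev_ms, next_ms)
```

If the scaling rule holds, both standard deviations should agree with $\sqrt{2/d_{in}}$ to within sampling noise, and the mean square after the layer should stay close to the value before it.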