Leaky ReLU and ELU

Leaky ReLU and the Exponential Linear Unit (ELU) are ReLU variants designed to mitigate the dying ReLU problem by keeping a non-zero gradient for negative inputs.

Leaky ReLU

Leaky ReLU introduces a small slope for negative values to ensure gradients never vanish completely.

The Leaky ReLU Equation

Leaky ReLU is defined as:

$$f(z) = \max(\alpha z, z)$$

where $\alpha$ is a small positive constant, typically set to 0.01. The $\max$ form assumes $0 < \alpha < 1$. For positive inputs, the function acts as the identity; for negative inputs, it scales the input by $\alpha$, allowing a small negative output.
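To make the definition concrete, here is a minimal sketch that evaluates $\max(\alpha z, z)$ directly and compares it with PyTorch's built-in `F.leaky_relu`. The helper name `leaky_relu_manual` and the sample inputs are illustrative assumptions, not part of any library.

```python
import torch
import torch.nn.functional as F


def leaky_relu_manual(z: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    # Hypothetical helper: element-wise max(alpha * z, z).
    # This matches Leaky ReLU only when 0 < alpha < 1.
    return torch.maximum(alpha * z, z)


z = torch.tensor([-3.0, -0.5, 0.0, 2.0])
manual = leaky_relu_manual(z)
builtin = F.leaky_relu(z, negative_slope=0.01)

print("manual: ", manual)
print("builtin:", builtin)
print("match:  ", torch.allclose(manual, builtin))
```

Negative entries are scaled by $\alpha$ while non-negative entries pass through unchanged, so the two results should agree up to floating-point tolerance.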

The Leaky ReLU Derivative

The derivative of Leaky ReLU is piecewise constant, with one value on each side of zero (it is undefined at exactly $z = 0$):

$$f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z < 0 \end{cases}$$

Because the derivative for negative inputs is $\alpha > 0$, the gradient never becomes zero. This ensures that even if a neuron is inactive, it continues to receive updates during backpropagation, resolving the dying ReLU problem.

Exponential Linear Unit (ELU)

ELU smooths the transition for negative values using an exponential function, pushing the mean activation closer to zero.

The ELU Equation

The Exponential Linear Unit (ELU) is formulated as:

$$f(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \le 0 \end{cases}$$

where $\alpha$ is a hyperparameter (usually 1.0) that sets the saturation value of the negative branch: as $z \to -\infty$, $f(z)$ approaches $-\alpha$. The exponential term creates a smooth curve that flattens gradually toward this negative limit.
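The piecewise definition can be sketched directly in PyTorch and checked against the built-in `F.elu`. The helper name `elu_manual` and the sample inputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def elu_manual(z: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # Hypothetical helper: identity for z > 0, alpha * (exp(z) - 1) otherwise.
    # torch.expm1 computes exp(z) - 1; clamping keeps the exponential
    # from being evaluated on large positive inputs it will not use.
    negative_branch = alpha * torch.expm1(torch.clamp(z, max=0.0))
    return torch.where(z > 0, z, negative_branch)


z = torch.tensor([-10.0, -2.0, 0.0, 2.0])
out = elu_manual(z, alpha=1.0)

print("manual: ", out)
print("builtin:", F.elu(z, alpha=1.0))
print("lower bound respected:", bool((out >= -1.0).all()))
```

With $\alpha = 1$, the output for strongly negative inputs gets close to $-1$ but does not go below it, which is the saturation behaviour described above.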

Smoothness and Optimization Benefits

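Unlike ReLU and Leaky ReLU, which both have a kink at $z=0$, ELU with the default $\alpha = 1$ is continuously differentiable everywhere, including at $z=0$. For other values of $\alpha$, the function stays continuous but its slope jumps from $\alpha$ to 1 at the origin. The smooth transition avoids an abrupt change in gradient around zero, which is often credited with helping optimization converge faster.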
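The behaviour at $z=0$ can be checked by differentiating each branch of the ELU definition:

$$f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha e^{z} & \text{if } z < 0 \end{cases}$$

Approaching zero from the right, the slope is 1; approaching from the left, $\alpha e^{z} \to \alpha$. The two one-sided slopes agree only when $\alpha = 1$, which is the case in which the derivative is continuous at the origin.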

Furthermore, ELU's negative values push the average activation of the layer closer to zero, mimicking the zero-centered benefit of tanh without suffering from vanishing gradients for positive inputs. The trade-off is the extra computational cost of calculating the exponential function.
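The zero-centering effect can be illustrated with a small experiment. This is a sketch under the assumption that pre-activations follow a standard normal distribution, which real layers need not satisfy; the sample size and seed are arbitrary choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)        # fixed seed so the run is repeatable
z = torch.randn(100_000)    # assumed pre-activations, drawn from N(0, 1)

relu_mean = F.relu(z).mean().item()
elu_mean = F.elu(z, alpha=1.0).mean().item()

print(f"ReLU mean activation: {relu_mean:.3f}")
print(f"ELU mean activation:  {elu_mean:.3f}")
```

Under this assumption the expected ReLU output is about 0.40, while the expected ELU output is about 0.16, because ELU's negative outputs offset part of the positive mass.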

PyTorch Implementation

We can implement and compare Leaky ReLU and ELU activations in PyTorch.

Coding Leaky ReLU and ELU

This PyTorch model shows how to apply Leaky ReLU and ELU activations to inputs:

<pre><code class="language-python">import torch
import torch.nn as nn


class AdvancedActivations(nn.Module):
    def __init__(self, alpha_leaky=0.01, alpha_elu=1.0):
        super().__init__()
        self.leaky_relu = nn.LeakyReLU(negative_slope=alpha_leaky)
        self.elu = nn.ELU(alpha=alpha_elu)

    def forward(self, x, mode="leaky"):
        # x shape: [batch_size, features]
        if mode == "leaky":
            return self.leaky_relu(x)
        else:
            return self.elu(x)


model = AdvancedActivations()
x = torch.tensor([-2.0, 0.0, 2.0])
print("Leaky ReLU:", model(x, mode="leaky"))
print("ELU:", model(x, mode="elu"))</code></pre>

In this code, we evaluate Leaky ReLU and ELU. For Leaky ReLU, the input $-2.0$ maps to $-0.02$. For ELU, the input $-2.0$ maps to $\alpha(e^{-2} - 1) \approx -0.8647$.

Comparing Gradient Flow

We can verify the gradient flow for both activations. Unlike standard ReLU, both functions preserve non-zero gradients for negative inputs:

<pre><code class="language-python"># Leaky ReLU gradient check
x_leaky = torch.tensor([-2.0], requires_grad=True)
y_leaky = model(x_leaky, mode="leaky")
y_leaky.backward()
print("Leaky ReLU grad at x=-2:", x_leaky.grad.item())  # ~0.01

# ELU gradient check
x_elu = torch.tensor([-2.0], requires_grad=True)
y_elu = model(x_elu, mode="elu")
y_elu.backward()
print("ELU grad at x=-2:", x_elu.grad.item())  # alpha * e^-2 ~= 0.1353</code></pre>

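This check shows that both activations keep gradients flowing for negative inputs. Leaky ReLU holds the gradient constant at $0.01$, while ELU's gradient is $\alpha e^{z}$, which stays positive but shrinks toward zero as the input becomes more negative. In both cases the gradient is never exactly zero, so units are not permanently dead.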
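As a contrast, the following sketch adds standard ReLU to the comparison and also probes a more negative input. The helper `grad_at` is a hypothetical convenience function, and the functional forms `F.relu`, `F.leaky_relu`, and `F.elu` are used with their default parameters (negative slope 0.01 and $\alpha = 1$).

```python
import math

import torch
import torch.nn.functional as F


def grad_at(fn, value: float) -> float:
    # Hypothetical helper: gradient of an element-wise activation `fn`
    # at a single scalar input, computed with autograd.
    x = torch.tensor(value, requires_grad=True)
    fn(x).backward()
    return x.grad.item()


for value in (-2.0, -10.0):
    print(f"z = {value}")
    print("  ReLU grad:      ", grad_at(F.relu, value))
    print("  Leaky ReLU grad:", grad_at(F.leaky_relu, value))
    print("  ELU grad:       ", grad_at(F.elu, value))
    print("  exp(z) for ref.:", math.exp(value))
```

ReLU returns a gradient of exactly zero for both inputs, Leaky ReLU stays at its fixed slope, and ELU tracks $e^{z}$, so its gradient at $z = -10$ is positive but several orders of magnitude smaller than at $z = -2$.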