Activation Functions: ReLU (Rectified Linear Unit)

The Rectified Linear Unit (ReLU) is the most widely used activation function in deep learning. It thresholds values at zero: positive inputs pass through unchanged with a constant gradient of 1, and the function is cheap to evaluate, both of which help deep networks train quickly.

Mathematical Properties of ReLU

ReLU is a piecewise linear function that outputs the input directly if it is positive, and zero otherwise.

The ReLU Equation

The mathematical definition of ReLU is simple:

$$f(z) = \max(0, z)$$

For $z > 0$, the function acts as the identity mapping; for $z \le 0$, the output is 0. This half-rectified behavior introduces non-linearity while remaining computationally trivial to evaluate.
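As a minimal, dependency-free sketch of this definition, the function can be written in a few lines of plain Python. The helper name `relu` is chosen here for illustration and does not come from any library.

```python
def relu(z: float) -> float:
    """Return max(0, z): identity for positive inputs, zero otherwise."""
    return max(0.0, z)


# Probe both sides of the threshold, plus the threshold itself
inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu(z) for z in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 3.0]
```

Every non-positive input collapses to zero, while positive inputs are returned unchanged.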

The ReLU Derivative

The derivative of the ReLU function is constant for positive inputs and zero for negative inputs:

$$f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$$

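At $z = 0$, the function is not differentiable, but in practice, libraries pick a fixed subgradient of 0 or 1 by convention. Because the gradient is exactly 1 for positive inputs, the error signal passes through active units without shrinking. This mitigates the vanishing gradient problem along those paths, although units with negative inputs still pass no gradient at all.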
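The convention at $z = 0$ can be checked empirically. The sketch below assumes PyTorch is installed and uses autograd on three probe points; the variable names are illustrative.

```python
import torch

# Three probe points: negative, exactly zero, positive
z = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)

# Summing gives a scalar, so backward() fills z.grad with f'(z) elementwise
torch.relu(z).sum().backward()

grad_at_zero = z.grad[1].item()
print("Gradients:", z.grad)
print("Gradient reported at z = 0:", grad_at_zero)
```

In PyTorch the value reported at zero is 0, so an input sitting exactly on the threshold is treated like a negative input during backpropagation.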

Advantages and Training Bottlenecks

ReLU provides substantial performance benefits but introduces a specific failure mode during training.

Sparsity and Computational Efficiency

Because ReLU outputs zero for all negative inputs, it naturally induces sparse activations in the network. For any given input, only a subset of neurons is active. This is often compared to the energy-saving sparse firing of biological neurons, and it can reduce computation when the implementation takes advantage of the zeros.
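To make the sparsity claim concrete, the following sketch measures the fraction of exact zeros after ReLU. It assumes standard-normal pre-activations, which is only a stand-in: real layers produce different distributions, so the fraction in a trained network will vary.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Assumption: pre-activations drawn from a standard normal distribution
pre_activations = torch.randn(1000, 256)
activations = torch.relu(pre_activations)

# Fraction of units whose output is exactly zero
sparsity = (activations == 0).float().mean().item()
print(f"Fraction of inactive units: {sparsity:.3f}")
```

With inputs that are symmetric around zero, roughly half of the units are inactive for any given input.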

Additionally, evaluating $\max(0, z)$ requires only a comparison and no exponential calculations, making it cheaper to compute than sigmoid or tanh. In large networks, this shortens the time needed for each forward and backward pass.

The Dying ReLU Problem

The primary weakness of ReLU is the dying ReLU problem. If a large gradient flows through a ReLU neuron (for example, because the learning rate is too high), the resulting update can shift its weights and bias so far that the neuron receives negative inputs across the entire dataset.

Once a neuron's input is always negative, its output is always zero, and its gradient is always zero. The neuron becomes inactive (dead) and can never recover because gradient descent cannot update its weights, reducing the network's capacity.
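The failure mode can be reproduced in a few lines. This is a minimal sketch, assuming PyTorch; the bias value of -100 is a hypothetical stand-in for the result of a bad update, not something a real training run would necessarily produce.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single neuron: linear layer followed by ReLU
layer = nn.Linear(4, 1)
with torch.no_grad():
    layer.bias.fill_(-100.0)  # hypothetical: bias pushed far negative

x = torch.randn(64, 4)
pre_activation = layer(x)
loss = torch.relu(pre_activation).sum()
loss.backward()

# Every input lands in the flat region, so no gradient reaches the parameters
all_negative = bool((pre_activation < 0).all())
weight_grad_is_zero = bool((layer.weight.grad == 0).all())
bias_grad_is_zero = bool((layer.bias.grad == 0).all())

# A plain SGD step therefore leaves the weights exactly where they were
weights_before = layer.weight.detach().clone()
torch.optim.SGD(layer.parameters(), lr=0.1).step()
weights_unchanged = torch.equal(weights_before, layer.weight.detach())

print(all_negative, weight_grad_is_zero, bias_grad_is_zero, weights_unchanged)
```

All four checks print True: the neuron outputs zero for every sample, its gradients are zero, and the optimizer step changes nothing.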

PyTorch Implementation

We can implement and inspect the ReLU activation and its gradient flow in PyTorch.

Using nn.ReLU

In PyTorch, we can apply ReLU using nn.ReLU or torch.relu. We can also use it in-place to save memory:

<pre><code class="language-python">import torch
import torch.nn as nn


class ReLUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # inplace=True overwrites the input tensor directly to save GPU memory
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Input tensor x
        return self.relu(x)


model = ReLUModel()
x = torch.tensor([-5.0, 0.0, 5.0])

# Note: we pass a copy because inplace=True modifies the tensor
print("ReLU outputs:", model(x.clone()))</code></pre>

In this code, the negative value $-5.0$ is mapped directly to zero, while the positive value $5.0$ remains unchanged. The inplace=True parameter is a useful optimization for large models like ResNets to reduce memory overhead, but because it overwrites its input, it is only safe when nothing else needs the original values.
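The effect of inplace=True on the input tensor can be checked directly. The sketch below assumes PyTorch and calls the module on the original tensor, without a clone, to show the overwrite; the variable names are illustrative.

```python
import torch
import torch.nn as nn

x = torch.tensor([-5.0, 0.0, 5.0])
original = x.clone()  # keep a copy for comparison

# No clone here: the module writes its result into x's own memory
y = nn.ReLU(inplace=True)(x)

input_was_modified = not torch.equal(x, original)
shares_storage = y.data_ptr() == x.data_ptr()

print("x after the call:", x)
print("Input modified:", input_was_modified)
print("Output shares memory with input:", shares_storage)
```

The negative entry in x is gone after the call, which is why the earlier example passes x.clone() when the original values are still needed.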

Verifying ReLU Gradients

We can inspect the gradient flow through ReLU in PyTorch, demonstrating how negative inputs result in zero gradients (dead pathways) while positive inputs maintain constant gradient flow:

<pre><code class="language-python">import torch

x = torch.tensor([-2.0, 2.0], requires_grad=True)
y = torch.relu(x)
y.sum().backward()  # sum() gives a scalar so backward() needs no arguments

print("ReLU Gradients:", x.grad)  # tensor([0., 1.])</code></pre>

This code confirms the gradient behavior: the negative element has a gradient of $0.0$, while the positive element has a gradient of $1.0$. This demonstrates both the sparsity benefit and the risk of dead neurons.