Activation Functions: The Sigmoid Function

The sigmoid activation function is a smooth, S-shaped curve that maps real-valued inputs to a range between 0 and 1. It is widely used in binary classification output layers but suffers from limitations in deep hidden layers.

Mathematical Properties of Sigmoid

The sigmoid, or logistic function, maps any real number into the open interval $(0, 1)$, so its output can be interpreted as a probability.

The Sigmoid Equation

The sigmoid function $\sigma(z)$ is defined mathematically as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

For large positive inputs, $\sigma(z)$ approaches 1; for large negative inputs, it approaches 0. When $z=0$, the output is exactly 0.5. This mapping is highly useful for predicting binary probabilities, where the output represents $P(y=1|x)$.
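As a quick numerical check of these limits, the sketch below evaluates the function in plain Python. The helper name `sigmoid` and the branch on the sign of `z` (a common way to keep `math.exp` from overflowing for large negative inputs) are illustrative choices rather than part of the definition above.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow; use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-20.0, -3.0, 0.0, 3.0, 20.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.6f}")
```

The two branches are algebraically identical; the split only changes which exponential is evaluated, so very large inputs of either sign stay within floating-point range.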

The Sigmoid Derivative

An elegant property of the sigmoid function is that its derivative can be expressed in terms of its output:

$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$

This property makes gradient calculation computationally efficient: the derivative can be computed from the value $\sigma(z)$ already produced in the forward pass, using one subtraction and one multiplication. However, the maximum value of this derivative is exactly 0.25 (at $z=0$), which has severe consequences for optimization.
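One way to convince yourself of the identity is to compare it against a finite-difference estimate. The following is a minimal sketch in plain Python; the function names, the grid of test points, and the step size `h` are assumptions made for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Derivative via the identity s * (1 - s), reusing the forward value."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z: float, h: float = 1e-5) -> float:
    """Numerical slope estimate (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Test points from -6.0 to 6.0 in steps of 0.1 (includes z = 0 exactly).
grid = [i / 10.0 for i in range(-60, 61)]
max_error = max(abs(sigmoid_grad(z) - central_difference(sigmoid, z)) for z in grid)
z_peak = max(grid, key=sigmoid_grad)

print("largest disagreement:", max_error)
print("peak derivative:", sigmoid_grad(z_peak), "at z =", z_peak)
```

On this grid the closed-form derivative and the numerical estimate should agree to many decimal places, and the largest slope is found at $z=0$.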

Limitations in Deep Networks

Sigmoid activations introduce training bottlenecks when used in hidden layers of deep neural networks.

The Vanishing Gradient Problem

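Since the derivative of the sigmoid function peaks at 0.25, every sigmoid layer that the error signal passes through during backpropagation multiplies it by a factor of at most 0.25. For a network with $L$ sigmoid layers, the activation derivatives alone therefore contribute a factor of at most $(0.25)^L$ (the weight matrices scale the gradient as well), which approaches zero rapidly as $L$ grows.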

As a result, the weights of the early layers update extremely slowly, preventing the network from learning features in the first layers. This is the vanishing gradient problem, which historically limited the depth of early networks.
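To get a feel for how quickly this factor shrinks, the sketch below multiplies sigmoid derivatives along a single path through the network. It deliberately ignores the weight matrices, and the pre-activation value of 2.0 used for the second column is a hypothetical example, so the numbers illustrate the bound rather than any particular trained model.

```python
import math

def sigmoid_grad(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def activation_factor(pre_activations):
    """Product of sigmoid derivatives along one path through the network."""
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_grad(z)
    return factor

for depth in (1, 5, 10, 20):
    best_case = activation_factor([0.0] * depth)  # every unit at its peak slope
    typical = activation_factor([2.0] * depth)    # hypothetical pre-activation of 2.0
    print(f"L={depth:2d}  best case {best_case:.3e}  z=2.0 everywhere {typical:.3e}")
```

Even in the best case, where every unit sits exactly at $z=0$, the factor equals $(0.25)^L$; any unit away from zero makes it smaller still.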

Saturation and Non-Zero Centered Outputs

When inputs are large in magnitude, whether positive or negative, the sigmoid function saturates, meaning its output is close to 1 or 0. In these regions, the derivative $\sigma'(z)$ is close to zero, which all but freezes weight updates (saturated gradients).
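The size of the effect is easy to tabulate. This short sketch prints the derivative at a few hand-picked inputs (chosen only for illustration) together with its size relative to the peak value of 0.25.

```python
import math

def sigmoid_grad(z: float) -> float:
    # The derivative is symmetric in z, so evaluating at |z| avoids overflow in exp.
    s = 1.0 / (1.0 + math.exp(-abs(z)))
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0, -10.0):
    g = sigmoid_grad(z)
    print(f"z={z:+5.1f}  sigma'(z)={g:.2e}  fraction of peak={g / 0.25:.2e}")
```

At $|z| = 10$ the derivative is already several thousand times smaller than at $z = 0$, so a unit driven that far into saturation passes back almost no gradient.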

Additionally, sigmoid outputs are always positive ($0 < \sigma(z) < 1$), so the activations are not zero-centered. Because every input to a neuron in the next layer is positive, the gradients with respect to all of that neuron's incoming weights share the same sign (the sign of the neuron's upstream error term), leading to zig-zagging weight updates and slower training convergence.
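The same-sign effect follows from the chain rule for a single unit computing $z = \sum_i w_i a_i + b$: the gradient with respect to $w_i$ is $\delta \cdot a_i$, where $\delta$ is the error signal arriving at $z$. A small sketch, with hypothetical pre-activations and error signals chosen purely for illustration:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activations of the previous (sigmoid) layer.
previous_pre_activations = [-4.0, -0.5, 0.0, 1.5, 6.0]
activations = [sigmoid(z) for z in previous_pre_activations]  # all in (0, 1)

def weight_gradients(delta: float, inputs):
    """dLoss/dw_i = delta * a_i for a unit computing z = sum(w_i * a_i) + b."""
    return [delta * a for a in inputs]

# Hypothetical upstream error signals of either sign.
for delta in (0.8, -0.8):
    grads = weight_gradients(delta, activations)
    print(f"delta={delta:+.1f} ->", [f"{g:+.3f}" for g in grads])
```

Whatever the sign of `delta`, every entry of `grads` carries that same sign, so in a single step the unit's incoming weights can only all increase or all decrease together.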

PyTorch Implementation

We can apply the sigmoid activation using PyTorch's built-in function and layer module, and check its derivative with autograd.

Using torch.sigmoid

PyTorch provides the sigmoid activation function via torch.sigmoid and the nn.Sigmoid layer module:

<pre><code class="language-python">import torch
import torch.nn as nn

class SigmoidModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: input tensor; the sigmoid is applied element-wise
        return self.sigmoid(x)

model = SigmoidModel()
x = torch.tensor([-3.0, 0.0, 3.0])
print("Sigmoid outputs:", model(x))</code></pre>

In this code, we evaluate sigmoid activations for negative, zero, and positive values, demonstrating how it compresses the inputs into the $(0,1)$ range.
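The example above goes through the `nn.Sigmoid` module. For comparison, the sketch below calls the function `torch.sigmoid` and the tensor method `Tensor.sigmoid` on the same input and checks that all three give identical results; the variable names are illustrative.

```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, 0.0, 3.0])

functional_out = torch.sigmoid(x)  # plain function call
module_out = nn.Sigmoid()(x)       # layer object, convenient inside nn.Sequential
method_out = x.sigmoid()           # tensor method

print("function vs module:", torch.equal(functional_out, module_out))
print("function vs method:", torch.equal(functional_out, method_out))
```

Which form to use is mostly a matter of style: the module fits naturally into a model definition, while the function is handy for one-off computations.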

Verifying Sigmoid Gradients

We can calculate the gradient of the sigmoid function in PyTorch using the autograd engine, verifying that its value at $z=0$, where the derivative peaks, is indeed 0.25:

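<pre><code class="language-python">import torch

# Leaf tensor at x = 0 with gradient tracking enabled
x_grad = torch.tensor([0.0], requires_grad=True)
y_grad = torch.sigmoid(x_grad)

# y_grad has a single element, so backward() needs no explicit gradient argument
y_grad.backward()
print("Sigmoid gradient at x=0:", x_grad.grad.item())  # 0.25</code></pre>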

This autograd check confirms the analytic derivative at $z=0$. The peak derivative of 0.25 highlights why sigmoid activations are generally restricted to the final classification layer rather than hidden layers.
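A single point does not by itself show that 0.25 is the maximum. As a rough extension, the sketch below runs autograd over a grid of inputs and compares the result with the closed form $\sigma(z)(1 - \sigma(z))$. The grid range, the number of points, and the tolerance are assumptions picked for illustration.

```python
import torch

# Grid of inputs from -6 to 6 in steps of 0.1, with gradient tracking enabled.
z = torch.linspace(-6.0, 6.0, steps=121, requires_grad=True)
s = torch.sigmoid(z)

# Summing gives a scalar; each z[i] only affects s[i], so z.grad[i] = sigmoid'(z[i]).
s.sum().backward()

analytic = (s * (1 - s)).detach()
peak_index = torch.argmax(z.grad)

print("autograd matches s * (1 - s):", torch.allclose(z.grad, analytic, atol=1e-6))
print("peak gradient:", z.grad[peak_index].item(), "at z =", z[peak_index].item())
```

On this grid the largest gradient should appear at the midpoint, where $z$ is zero, and every other entry should be smaller.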