Activation Functions: Step and Linear Functions

Activation functions determine a neuron's output from its weighted input and are the usual source of non-linearity in a network. The step function was used in early perceptrons, while linear functions illustrate why non-linear transformations are mathematically required.


The Step Activation Function

The step function outputs a binary value based on whether the input crosses a threshold, making it the simplest activation model.

Mathematical Formulation

The step function, or Heaviside step function, maps any input to a binary choice:

$$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

This activation function models the all-or-nothing firing threshold of biological neurons. While intuitive, it has a derivative of zero everywhere except at $z=0$, where it is undefined.
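The derivative claim can be checked numerically. The following is a minimal sketch in plain Python, assuming the convention $f(0) = 1$ from the formula above; the helper name central_difference and the step size h are illustrative choices, not part of the original text.

```python
def step(z: float) -> float:
    """Heaviside step with the convention f(0) = 1."""
    return 1.0 if z >= 0 else 0.0


def central_difference(f, z: float, h: float = 1e-4) -> float:
    """Numerical estimate of f'(z) using a symmetric finite difference."""
    return (f(z + h) - f(z - h)) / (2 * h)


# Away from the threshold the function is flat, so the estimate is zero.
for z in (-2.0, -0.5, 0.5, 2.0):
    print(f"z={z:+.1f}  f(z)={step(z)}  f'(z)~{central_difference(step, z)}")

# At z = 0 the estimate equals 1 / (2h), which grows without bound as h
# shrinks. This is the numerical signature of an undefined derivative.
for h in (1e-1, 1e-2, 1e-3):
    print(f"h={h}  estimate at 0: {central_difference(step, 0.0, h)}")
```

The estimate is flat at zero on both sides of the threshold, while at $z=0$ it depends entirely on the step size, so no finite derivative exists there.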

The Zero Gradient Problem

Because the derivative of the step function is zero everywhere except at $z=0$, a network built on it cannot be trained with gradient-based optimization algorithms like backpropagation. By the chain rule, the gradient of the loss with respect to each upstream weight is multiplied by this zero derivative, so the weights receive no update signal.

This limitation is why modern deep learning uses activation functions that are continuous and differentiable almost everywhere, with non-zero gradients over a useful range of inputs (such as Sigmoid or ReLU), so that gradients can flow through the network during backpropagation.
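A small numerical experiment illustrates the contrast. The sketch below assumes a hypothetical one-weight neuron with a squared-error loss; the input x, the target, the starting weight, and the learning rate are arbitrary illustrative values.

```python
import math


def step(z):
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, activation, x=1.5, target=0.0):
    """Squared error of a one-weight neuron: (activation(w * x) - target)^2."""
    return (activation(w * x) - target) ** 2


def numerical_grad(w, activation, h=1e-5):
    """Central-difference estimate of d(loss)/dw."""
    return (loss(w + h, activation) - loss(w - h, activation)) / (2 * h)


w = 0.8  # hypothetical starting weight; the neuron outputs 1 but the target is 0
grad_step = numerical_grad(w, step)
grad_sigmoid = numerical_grad(w, sigmoid)
print("step gradient:   ", grad_step)
print("sigmoid gradient:", grad_sigmoid)

# One gradient descent update with an illustrative learning rate
lr = 0.1
print("updated w (step):   ", w - lr * grad_step)
print("updated w (sigmoid):", w - lr * grad_sigmoid)
```

With the step activation the loss surface is flat around the current weight, so the update leaves it unchanged even though the prediction is wrong. With the sigmoid the gradient is positive, and the update moves the weight toward a lower loss.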

Linear Activation Functions

A linear activation function passes the input to the output unchanged, or scaled by a constant factor.

Formulation and Derivatives

A linear activation is defined as $f(z) = c z$, where $c$ is a constant. If $c=1$, it is the identity function $f(z) = z$. The derivative of a linear activation is constant: $f'(z) = c$.

While this allows gradient flow, the derivative is the same constant for every input. The activation therefore contributes no input-dependent information to the backpropagated gradient: it only rescales it by $c$, whatever the magnitude of the pre-activation.
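To make the constant-gradient point concrete, consider a single neuron with pre-activation $z = wx + b$, output $a = f(z)$, and loss $L$. These symbols are introduced here only for illustration. By the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \, f'(z) \, x = c \, \frac{\partial L}{\partial a} \, x \quad \text{for } f(z) = c z$$

For comparison, the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$ has derivative $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, which varies with $z$, so the factor contributed by the activation changes from input to input.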

Linear Collapse in Deep Networks

If a network uses only linear activations, the entire multi-layer network collapses mathematically into a single-layer linear model. For a two-layer network, the output is $\mathbf{y} = \mathbf{W}_2 ( \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1 ) + \mathbf{b}_2 = (\mathbf{W}_2 \mathbf{W}_1) \mathbf{x} + (\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2)$.

Since the product of two matrices is a new matrix $\mathbf{W}_{eff} = \mathbf{W}_2 \mathbf{W}_1$, the network is equivalent to a single linear layer. Therefore, deep networks require non-linear activations to learn complex, non-linear relationships.
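The collapse can be verified numerically. The following is a minimal sketch, assuming arbitrary layer sizes (3 to 4 to 2) and random weights chosen only for illustration; the name b_eff for the combined bias is introduced here and does not appear in the text above.

```python
import torch

torch.manual_seed(0)

# Hypothetical layer sizes for illustration: 3 inputs -> 4 hidden -> 2 outputs
W1 = torch.randn(4, 3, dtype=torch.float64)
b1 = torch.randn(4, dtype=torch.float64)
W2 = torch.randn(2, 4, dtype=torch.float64)
b2 = torch.randn(2, dtype=torch.float64)

x = torch.randn(3, dtype=torch.float64)

# Two stacked layers with identity activation between them
y_two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W_eff = W2 W1, b_eff = W2 b1 + b2
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
y_one_layer = W_eff @ x + b_eff

print("Two-layer output:", y_two_layer)
print("One-layer output:", y_one_layer)
print("Match:", torch.allclose(y_two_layer, y_one_layer))
```

The hidden width of 4 disappears from the effective model: W_eff has shape 2 by 3, so the extra layer adds parameters without adding any representational power.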

PyTorch Implementation

Let's write a simple model in PyTorch demonstrating the difference between linear and threshold operations.

Coding Threshold vs Linear Pass

The following code manually implements the step activation function and a linear identity pass using PyTorch operations:

<pre><code class="language-python">import torch
import torch.nn as nn


class ActivationComparison(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_layer = nn.Linear(2, 2)

    def forward(self, x, mode="linear"):
        # x shape: [batch_size, 2]
        z = self.linear_layer(x)  # [batch_size, 2]
        if mode == "step":
            # Step function thresholding
            return (z >= 0.0).float()
        else:
            # Linear identity activation
            return z


model = ActivationComparison()
x = torch.randn(3, 2)
print("Linear:", model(x, mode="linear"))
print("Step:", model(x, mode="step"))</code></pre>

In this implementation, the step function converts the continuous outputs of the linear layer into binary activations. The linear mode demonstrates the identity mapping, leaving the outputs unchanged.

Gradient Calculation Failures

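We can verify the gradient behavior by running backpropagation. In linear mode, the gradients are computed as usual. In step mode, the comparison z >= 0.0 is not a differentiable operation, so autograd does not track it: the step output has requires_grad set to False, and calling backward() on a loss built from it raises a RuntimeError instead of producing gradients for the linear layer weights.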

<pre><code class="language-python"># Compute loss on linear mode
out_linear = model(x, mode="linear")
loss_linear = out_linear.sum()
loss_linear.backward()
print("Linear grads:", model.linear_layer.weight.grad)

# Reset grads and attempt on step mode
model.zero_grad()
out_step = model(x, mode="step")
loss_step = out_step.sum()

# The comparison z >= 0.0 is not differentiable, so autograd does not
# track it: the step output is cut off from the computation graph.
print("Step output requires grad:", out_step.requires_grad)

try:
    loss_step.backward()
except RuntimeError as err:
    # Nothing in loss_step requires grad, so backward() cannot run
    print("Backward failed:", err)</code></pre>

This code illustrates why threshold activation functions cannot be used with gradient descent. The non-differentiable step function blocks gradient flow, leaving the weights frozen.