Activation Functions: Hyperbolic Tangent (Tanh)

The hyperbolic tangent (tanh) activation function is a zero-centered, S-shaped curve that maps any real input to the open interval $(-1, 1)$. Because its outputs are centered on zero, it often trains faster than the sigmoid function.

Mathematical Properties of Tanh

The tanh function is a scaled and shifted version of the sigmoid function, offering symmetric output bounds.

The Tanh Equation

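The tanh function $\tanh(z)$ is defined as the ratio of the hyperbolic sine and cosine functions. It can equivalently be written in terms of the sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$: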

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$

As $z \to +\infty$ the output approaches 1; as $z \to -\infty$ it approaches -1; and at $z=0$ the output is exactly 0. Because tanh is an odd function, $\tanh(-z) = -\tanh(z)$, the mapping is symmetric about the origin, which is what makes the activations zero-centered.
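The identity above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `sigmoid`, `tanh_from_exp`, and `tanh_from_sigmoid` are illustrative and not part of any library.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_from_exp(z):
    """tanh written directly from its exponential definition."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def tanh_from_sigmoid(z):
    """tanh written as a scaled and shifted sigmoid: 2 * sigmoid(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    a, b, c = tanh_from_exp(z), tanh_from_sigmoid(z), math.tanh(z)
    # All three forms should agree up to floating-point rounding.
    assert math.isclose(a, c, abs_tol=1e-12)
    assert math.isclose(b, c, abs_tol=1e-12)
    print(f"z={z:+.1f}  tanh(z)={c:+.6f}")
```

The exponential form is fine for illustration, but it overflows for large $|z|$, so in practice a library routine such as `math.tanh` or `torch.tanh` is the safer choice.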

The Tanh Derivative

The derivative of the tanh function can be written efficiently in terms of its output value:

$$\tanh'(z) = 1 - \tanh^2(z)$$

The derivative of tanh peaks at exactly 1.0 (at $z=0$), four times the sigmoid's maximum derivative of 0.25. A larger local derivative attenuates the backpropagated gradient less at each layer, which gives stronger gradient flow in shallow networks.
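To make the comparison concrete, here is a small sketch that evaluates both derivatives in closed form and cross-checks the tanh derivative against a finite-difference estimate. The helper names and the step size `h` are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_grad(z):
    """Analytic derivative: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


def sigmoid_grad(z):
    """Analytic derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_diff(f, z, h=1e-5):
    """Numerical derivative, used only as an independent check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  tanh'={tanh_grad(z):.4f}  sigmoid'={sigmoid_grad(z):.4f}")

# Peak values at z = 0, and agreement with the finite-difference estimate.
assert tanh_grad(0.0) == 1.0
assert sigmoid_grad(0.0) == 0.25
assert abs(central_diff(math.tanh, 1.0) - tanh_grad(1.0)) < 1e-8
```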

The Zero-Centered Advantage

Having zero-centered activations improves optimization dynamics in neural network training.

Centering the Activations

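In sigmoid layers, all activations are positive. The gradient of the loss with respect to each incoming weight of a next-layer neuron is that neuron's error signal multiplied by the corresponding input activation, so for any given neuron these gradients all share one sign: all positive or all negative. The neuron's weights can then only increase together or decrease together in a single update, which causes slow, zig-zagging convergence.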
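The sign restriction can be seen with a toy calculation. This sketch assumes a single next-layer neuron whose weight gradients are $\partial L / \partial w_i = \delta \cdot a_i$; the pre-activation values and the error signal `delta` are made-up numbers chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical pre-activations of the previous layer (illustrative values).
pre_activations = [-1.5, -0.3, 0.4, 2.0]
# Hypothetical error signal dL/dz for one neuron in the next layer.
delta = -0.7


def weight_grads(activation):
    """dL/dw_i = delta * a_i for each incoming weight of the neuron."""
    return [delta * activation(z) for z in pre_activations]


sigmoid_grads = weight_grads(sigmoid)
tanh_grads = weight_grads(math.tanh)

print("sigmoid inputs:", [f"{g:+.3f}" for g in sigmoid_grads])
print("tanh inputs:   ", [f"{g:+.3f}" for g in tanh_grads])
```

With sigmoid inputs every gradient carries the sign of `delta`, while with tanh inputs the signs follow the signs of the individual activations and can differ from weight to weight.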

The tanh function outputs both negative and positive activations. This keeps the mean of the activations closer to zero, lets the weight gradients of a single neuron take mixed signs, and typically speeds up convergence.
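As a rough illustration of the difference in mean activation, the sketch below feeds the same pre-activations, evenly spaced and symmetric about zero, through both functions. The grid on $[-3, 3]$ is an assumption made for the example; real pre-activation distributions will differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# 61 evenly spaced pre-activations on [-3, 3], symmetric about zero.
zs = [(i - 30) / 10 for i in range(61)]

mean_sigmoid = sum(sigmoid(z) for z in zs) / len(zs)
mean_tanh = sum(math.tanh(z) for z in zs) / len(zs)

print(f"mean sigmoid activation: {mean_sigmoid:.4f}")
print(f"mean tanh activation:    {mean_tanh:.4f}")
```

For inputs that are symmetric about zero, the sigmoid mean sits at 0.5 while the tanh mean sits at zero, up to floating-point rounding.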

Vanishing Gradient in Deep Networks

While tanh has a larger peak derivative than sigmoid, it still saturates for large inputs. In the saturated regions ($|z| > 3$, where the derivative is already below 0.01), the derivative approaches zero and almost no gradient passes through the unit.

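In deep networks, backpropagation multiplies in one tanh derivative per layer. Each factor is at most 1, and far smaller for saturated units, so the product still tends toward zero. Tanh therefore usually outperforms sigmoid in hidden layers, but on its own it remains poorly suited to very deep networks.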
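The shrinking product can be sketched with a toy scalar chain $h_{k+1} = \tanh(w \cdot h_k)$. This is a deliberately simplified model, not a real network: the function name `chain_gradient`, the weight `w = 1.0`, and the starting value `h0 = 2.0` are all assumptions made for the example.

```python
import math


def chain_gradient(h0, depth, w=1.0):
    """Return d h_depth / d h_0 for the scalar chain h_{k+1} = tanh(w * h_k)."""
    h, grad = h0, 1.0
    for _ in range(depth):
        h = math.tanh(w * h)
        # Local derivative of tanh(w * h_k) with respect to h_k:
        # w * (1 - tanh(w * h_k)^2), where h now holds tanh(w * h_k).
        grad *= w * (1.0 - h * h)
    return grad


for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  gradient={chain_gradient(2.0, depth):.3e}")
```

Every factor in the product is below 1 here, so the gradient with respect to the first activation can only shrink as layers are added.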

PyTorch Implementation

Let's implement the tanh activation in PyTorch and verify its zero-centered properties.

Using nn.Tanh

We can access the tanh activation in PyTorch using the built-in nn.Tanh module or functional torch.tanh:

<pre><code class="language-python">import torch
import torch.nn as nn


class TanhModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.tanh = nn.Tanh()

    def forward(self, x):
        # tanh is applied elementwise, so any input shape works;
        # a typical input is [batch_size, features].
        return self.tanh(x)


model = TanhModel()
x = torch.tensor([-2.0, 0.0, 2.0])
print("Tanh outputs:", model(x))</code></pre>

Running the model gives outputs of approximately -0.9640, 0.0, and 0.9640: negative inputs map to negative outputs, zero maps to zero, and positive inputs map to positive outputs of the same magnitude, so the mapping is symmetric about the origin.
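The module and functional forms mentioned above should produce identical results. The following sketch checks this on a small batch and also checks the odd symmetry of tanh; the input values are arbitrary.

```python
import torch
import torch.nn as nn

# Arbitrary example batch, shape [batch_size=2, features=3].
x = torch.tensor([[-2.0, 0.0, 2.0],
                  [-0.5, 1.0, 3.0]])

module_out = nn.Tanh()(x)     # module form, convenient inside nn.Sequential
function_out = torch.tanh(x)  # functional form
method_out = x.tanh()         # tensor-method form

# All three apply the same elementwise operation.
assert torch.equal(module_out, function_out)
assert torch.equal(function_out, method_out)

# Odd symmetry: tanh(-x) == -tanh(x).
assert torch.allclose(torch.tanh(-x), -torch.tanh(x))
print(module_out)
```

Which form to use is mostly a matter of style: the module form fits naturally into a layer list, while the functional form is shorter inside a custom `forward` method.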

Verifying Tanh Gradients

We can use autograd to check the gradient of tanh at $x=0$, the point where the derivative reaches its maximum of 1.0:

<pre><code class="language-python">import torch

x_grad = torch.tensor([0.0], requires_grad=True)
y_grad = torch.tanh(x_grad)
y_grad.backward()  # allowed without arguments because y_grad has one element
print("Tanh gradient at x=0:", x_grad.grad.item())  # 1.0</code></pre>

This code shows that at $x=0$, the gradient is exactly 1.0. This larger gradient helps prevent early gradient decay compared to the sigmoid function, although saturation remains a risk.
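To see the saturation risk alongside the peak, the same check can be extended to several points. This is a minimal sketch; the sample points are arbitrary, and the comparison against $1 - \tanh^2(x)$ uses the derivative formula given earlier.

```python
import torch

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0], requires_grad=True)
y = torch.tanh(x)

# Summing gives a scalar, so backward() fills x.grad with d tanh(x_i) / d x_i.
y.sum().backward()

# Closed form from the derivative formula: 1 - tanh^2(x).
analytic = 1.0 - y.detach() ** 2
assert torch.allclose(x.grad, analytic)

for xi, gi in zip(x.detach().tolist(), x.grad.tolist()):
    print(f"x={xi:+.1f}  gradient={gi:.4f}")
```

The gradient is largest at $x=0$ and falls off symmetrically on both sides, dropping below 0.01 by $|x| = 3$.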