Activation Functions: Hyperbolic Tangent (Tanh)
The hyperbolic tangent (tanh) activation function is a zero-centered, S-shaped curve that maps inputs to the open interval (-1, 1). Its zero-centered output often lets networks train faster than they do with the sigmoid function.
Mathematical Properties of Tanh
The tanh function is a scaled and shifted version of the sigmoid function, offering symmetric output bounds.
The Tanh Equation
The tanh function $\tanh(z)$ is defined as the ratio of hyperbolic sine and cosine functions:
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$
For large positive inputs, the output approaches 1; for large negative inputs, it approaches -1; and when $z=0$, the output is exactly 0. Because $\tanh(-z) = -\tanh(z)$, the mapping is symmetric about the origin, which makes the activations zero-centered.
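The identity $\tanh(z) = 2\sigma(2z) - 1$ can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `sigmoid` and `tanh_via_sigmoid` and the sample inputs are illustrative choices, not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z: float) -> float:
    """tanh expressed through the sigmoid: 2 * sigmoid(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

sample_inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]  # illustrative values
for z in sample_inputs:
    direct = math.tanh(z)
    rescaled = tanh_via_sigmoid(z)
    print(f"z={z:+.1f}  tanh={direct:+.6f}  2*sigmoid(2z)-1={rescaled:+.6f}")
```

Both columns should agree to within floating-point rounding, and every value should stay strictly inside (-1, 1).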
The Tanh Derivative
The derivative of the tanh function can be written efficiently in terms of its output value:
$$\tanh'(z) = 1 - \tanh^2(z)$$
The peak derivative of tanh is exactly 1.0 (at $z=0$). This is four times the sigmoid's maximum derivative of 0.25, so gradients shrink less as they pass through each layer near the origin, which helps learning in shallow networks.
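To make the comparison concrete, here is a small sketch that evaluates both derivatives from their output values. The function names `tanh_grad` and `sigmoid_grad` and the chosen inputs are assumptions made for illustration.

```python
import math

def tanh_grad(z: float) -> float:
    """Derivative of tanh, computed from its output: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid, computed from its output: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

for z in [0.0, 1.0, 2.0]:  # illustrative inputs
    print(f"z={z:.1f}  tanh'={tanh_grad(z):.4f}  sigmoid'={sigmoid_grad(z):.4f}")
```

The factor of four holds at the origin. Away from it, the tanh derivative decays faster than the sigmoid's, so the advantage applies mainly to pre-activations near zero.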
The Zero-Centered Advantage
Having zero-centered activations improves optimization dynamics in neural network training.
Centering the Activations
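In sigmoid layers, all activations are positive. The gradient for each incoming weight of a next-layer neuron is that neuron's error signal multiplied by the corresponding activation, so all of the neuron's weight gradients share one sign: they are either all positive or all negative. This restricts each update to a limited set of directions, causing slow, zig-zagging convergence.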
The tanh function outputs both negative and positive activations. This keeps the mean of the activations closer to zero, so the weight gradients of a neuron can take mixed signs, which typically speeds up convergence.
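The sign argument can be illustrated with a short sketch. It assumes a symmetric grid of pre-activations and a single hypothetical error signal `upstream_delta` for one downstream neuron; both are illustrative choices rather than values from a real network.

```python
import math

# Illustrative pre-activations, symmetric around zero (an assumption of this sketch).
pre_activations = [i / 10.0 for i in range(-30, 31)]

sigmoid_out = [1.0 / (1.0 + math.exp(-z)) for z in pre_activations]
tanh_out = [math.tanh(z) for z in pre_activations]

mean_sigmoid = sum(sigmoid_out) / len(sigmoid_out)
mean_tanh = sum(tanh_out) / len(tanh_out)
print(f"mean sigmoid activation: {mean_sigmoid:.4f}")
print(f"mean tanh activation:    {mean_tanh:.4f}")

# Gradient for weight i of a downstream neuron is (error signal) * (activation i).
upstream_delta = -0.7  # hypothetical error signal for one downstream neuron
grads_from_sigmoid = [upstream_delta * a for a in sigmoid_out]
grads_from_tanh = [upstream_delta * a for a in tanh_out]

print("sigmoid-fed gradients share one sign:",
      all(g < 0 for g in grads_from_sigmoid) or all(g > 0 for g in grads_from_sigmoid))
print("tanh-fed gradients have mixed signs: ",
      any(g < 0 for g in grads_from_tanh) and any(g > 0 for g in grads_from_tanh))
```

With sigmoid inputs, the sign of every weight gradient follows the sign of `upstream_delta`; with tanh inputs, individual weights can move in opposite directions within the same update.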
Vanishing Gradient in Deep Networks
While tanh has a larger peak derivative than sigmoid, it still saturates for large inputs. In the saturated regions (roughly $|z| > 3$, where the derivative is already below 0.01), the derivative approaches zero, blocking gradient flow.
In deep networks, backpropagation multiplies one tanh derivative per layer, each at most 1 and often much smaller, so the product still shrinks toward zero. Tanh therefore tends to outperform sigmoid in hidden layers but remains a poor fit for very deep networks.
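A rough sketch of this effect follows. It assumes every layer sees the same pre-activation and ignores the weight matrices entirely (treating them as 1), which is a deliberate simplification; the helper `chained_gradient` and the depth of 20 are hypothetical choices for illustration.

```python
import math

def tanh_grad(z: float) -> float:
    """Derivative of tanh at z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def chained_gradient(pre_activation: float, depth: int) -> float:
    """Product of `depth` tanh derivatives, all evaluated at the same input.

    Weight matrices are ignored (treated as 1), which is a simplification.
    """
    return tanh_grad(pre_activation) ** depth

for z in [0.0, 0.5, 1.5, 3.0]:  # illustrative pre-activation magnitudes
    print(f"z={z:.1f}  per-layer={tanh_grad(z):.4f}  "
          f"after 20 layers={chained_gradient(z, 20):.3e}")
```

Only pre-activations very close to zero keep the product near 1; even moderate magnitudes drive it toward zero within a couple of dozen layers.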
PyTorch Implementation
Let's implement the tanh activation in PyTorch and verify its zero-centered properties.
Using nn.Tanh
We can apply the tanh activation in PyTorch through the built-in nn.Tanh module or the functional torch.tanh:
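A minimal sketch of both forms, assuming a small hand-picked input tensor (the values are illustrative), might look like this:

```python
import torch
import torch.nn as nn

# Illustrative inputs, symmetric around zero (chosen for this sketch).
x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

tanh_layer = nn.Tanh()           # module form, usable inside nn.Sequential
out_module = tanh_layer(x)
out_functional = torch.tanh(x)   # functional form, same computation

print("input: ", x)
print("output:", out_module)
print("module and functional agree:", torch.allclose(out_module, out_functional))
print("mean activation:", out_module.mean().item())
```
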
This model evaluates tanh activations, showing that negative inputs map to negative outputs and positive inputs map to positive outputs, maintaining symmetry.
Verifying Tanh Gradients
We can use PyTorch's autograd to check the gradient of tanh at a given point, confirming that the derivative reaches its maximum of 1.0 at the origin:
<pre><code class="language-python">import torch

# Track gradients with respect to the input
x_grad = torch.tensor([0.0], requires_grad=True)
y_grad = torch.tanh(x_grad)
y_grad.backward()  # the output has a single element, so no explicit gradient is needed
print("Tanh gradient at x=0:", x_grad.grad.item())  # 1.0
</code></pre>
This code shows that at $x=0$, the gradient is exactly 1.0. Compared with the sigmoid's peak of 0.25, this larger gradient slows gradient decay in the early layers, although saturation remains a risk.