Shannon Entropy: Measuring Surprise

Shannon Entropy is a fundamental concept in information theory that quantifies the average uncertainty, or surprise, in the outcome of a random variable. In machine learning, entropy underlies several standard training objectives. It provides the basis for Cross-Entropy Loss, the standard loss function for training classification networks, and for Kullback-Leibler (KL) Divergence, which measures how one probability distribution differs from another. Understanding entropy is key to building predictive classifiers, generative models, and reinforcement learning agents.

We first introduce entropy through the foundational axioms defined by Shannon and then extend it to continuous variables using differential entropy. We then derive the non-negativity of KL divergence and the gradients of the Cross-Entropy loss. Finally, we look at entropy as an inductive bias: the maximum entropy principle, which selects the distribution that makes the fewest assumptions beyond the known constraints, and entropy regularization in policy search.

Mathematical Formulation and Axioms

Shannon Entropy measures the average uncertainty of a probability distribution. Shannon proved that this measure can be uniquely derived from a set of intuitive mathematical axioms.

The axioms are continuity, symmetry, maximality, and additivity. Together they determine entropy as a measure of statistical uncertainty, up to the choice of unit.

Shannon's Axioms of Entropy

Shannon proved that the only functions $H(p_1, \dots, p_n)$ satisfying the following properties are of the form $H = -K \sum_i p_i \log p_i$ with a constant $K > 0$. Choosing $K$ amounts to choosing the base of the logarithm: base 2 gives bits, base $e$ gives nats.

1. Continuity: $H$ must be continuous in the probabilities $p_i$.

2. Symmetry: $H$ is unchanged if the outcomes are reordered.

3. Maximality: $H$ is maximized when all outcomes are equally likely ($p_i = 1/n$).

4. Additivity: If a choice is broken down into successive choices, the total entropy is the entropy of the first choice plus the entropies of the later choices, each weighted by the probability of reaching it. For example, merging the first two outcomes and then splitting them again gives $H(p_1, \dots, p_n) = H(p_1 + p_2, p_3, \dots, p_n) + (p_1 + p_2)H(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2})$.

Together, these axioms guarantee that entropy is a mathematically consistent representation of uncertainty.
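The axioms can be checked numerically on a small example. The sketch below is illustrative only: the function name `entropy` and the distribution `p` are assumptions chosen for the demonstration, and base 2 is used so the result is in bits.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy -sum(p * log p); outcomes with p == 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical three-outcome distribution used only for illustration.
p = [0.5, 0.3, 0.2]

# Symmetry: reordering the outcomes leaves H unchanged.
h_p = entropy(p)
h_reordered = entropy([0.2, 0.5, 0.3])

# Maximality: the uniform distribution over n outcomes attains log2(n).
h_uniform = entropy([1 / 3] * 3)

# Additivity: merge the first two outcomes, then split them again.
s = p[0] + p[1]
h_grouped = entropy([s, p[2]]) + s * entropy([p[0] / s, p[1] / s])

print(h_p, h_reordered, h_uniform, h_grouped)
```

The direct and grouped computations agree up to floating-point rounding, and the uniform distribution has the largest entropy of the three.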

Differential Entropy and the Maximum Entropy of Gaussians

For a continuous random variable $X$ with PDF $f(x)$, we define Differential Entropy $h(X)$ as:

$$h(X) = -\int_{-\infty}^{\infty} f(x) \ln f(x) \, dx$$

Unlike discrete entropy, differential entropy can be negative and is not invariant under rescaling: $h(aX) = h(X) + \ln|a|$. For example, the uniform distribution on $[0, 1/2]$ has $h(X) = -\ln 2$.

Let us show that for a fixed mean $\mu$ and variance $\sigma^2$, the distribution that maximizes differential entropy is the Normal (Gaussian) distribution. We set up the calculus of variations with Lagrange multipliers $\lambda_0$, $\lambda_1$, and $\lambda_2$ to maximize $-\int f(x) \ln f(x) dx$ subject to the constraints $\int f(x) dx = 1$, $\int x f(x) dx = \mu$, and $\int (x-\mu)^2 f(x) dx = \sigma^2$. The objective functional is:

$$\mathcal{J}(f) = -\int f(x) \ln f(x) dx - \lambda_0 \left( \int f(x) dx - 1 \right) - \lambda_1 \left( \int x f(x) dx - \mu \right) - \lambda_2 \left( \int (x-\mu)^2 f(x) dx - \sigma^2 \right)$$

Taking the functional derivative with respect to $f(x)$ and setting it to zero:

$$\frac{\delta \mathcal{J}}{\delta f(x)} = -\ln f(x) - 1 - \lambda_0 - \lambda_1 x - \lambda_2 (x-\mu)^2 = 0$$

Solving for $f(x)$:

$$f(x) = \exp\left( -(1 + \lambda_0) - \lambda_1 x - \lambda_2 (x-\mu)^2 \right)$$

Solving for the constants using the constraints gives $\lambda_1 = 0$, $\lambda_2 = \frac{1}{2\sigma^2}$, and $\exp(-(1+\lambda_0)) = \frac{1}{\sqrt{2\pi\sigma^2}}$, which is exactly the probability density function of the univariate Gaussian distribution: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Because the entropy functional is concave in $f$ and the constraints are linear in $f$, this stationary point is the maximum. This is one justification for using the Gaussian as the default model in physics and statistics when only the mean and variance are known.
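As a sanity check on this result, the sketch below compares the closed-form differential entropies of three distributions with the same variance. The Uniform and Laplace formulas are standard results not derived in this text, and all function names are illustrative. A trapezoid-rule integral is included to confirm the Gaussian closed form $\frac{1}{2}\ln(2\pi e \sigma^2)$.

```python
import math

def gaussian_entropy(sigma):
    """Closed form: 0.5 * ln(2 * pi * e * sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def uniform_entropy(sigma):
    """Uniform with variance sigma^2 has width sigma * sqrt(12); h = ln(width)."""
    return math.log(sigma * math.sqrt(12.0))

def laplace_entropy(sigma):
    """Laplace with variance sigma^2 has scale b = sigma / sqrt(2); h = 1 + ln(2b)."""
    return 1.0 + math.log(2.0 * sigma / math.sqrt(2.0))

def numeric_gaussian_entropy(sigma, n=100001, span=12.0):
    """Trapezoid-rule estimate of -integral f ln f dx for a zero-mean Gaussian."""
    a, b = -span * sigma, span * sigma
    dx = (b - a) / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * dx
        log_f = -0.5 * math.log(2 * math.pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * (-math.exp(log_f) * log_f)
    return total * dx

h_gauss = gaussian_entropy(1.0)
h_laplace = laplace_entropy(1.0)
h_uniform = uniform_entropy(1.0)
h_numeric = numeric_gaussian_entropy(1.0)
h_narrow = gaussian_entropy(0.1)  # small variance gives negative entropy

print(h_gauss, h_laplace, h_uniform, h_numeric, h_narrow)
```

At equal variance the Gaussian has the largest entropy of the three, and the narrow Gaussian shows that differential entropy can be negative.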

Cross-Entropy and KL Divergence

When we use an approximating probability distribution $Q$ to model the true, unknown distribution $P$, we introduce excess uncertainty. We quantify this using relative entropy.

In coding terms, the divergence measures the overhead incurred by encoding samples from $P$ with a code optimized for $Q$. This is why it serves as a standard loss for aligning a model distribution with data in machine learning.

Kullback-Leibler (KL) Divergence and Jensen's Proof

KL Divergence (or relative entropy) measures the statistical divergence of distribution $Q$ from $P$:

$$D_{KL}(P \parallel Q) = \sum_{x \in X} P(x) \log_2 \left( \frac{P(x)}{Q(x)} \right)$$

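KL divergence is non-negative: $D_{KL}(P \parallel Q) \ge 0$, with equality if and only if $P = Q$. This result is known as Gibbs' Inequality. We prove it by applying Jensen's Inequality to the concave function $\ln(u)$. The proof uses natural logarithms; switching to base 2 only rescales the divergence by the positive constant $1/\ln 2$, so the sign is unaffected: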

$$-D_{KL}(P \parallel Q) = \sum_{x} P(x) \ln\left(\frac{Q(x)}{P(x)}\right) = E_P\left[ \ln\left(\frac{Q(X)}{P(X)}\right) \right]$$

By Jensen's Inequality, since $\ln(u)$ is concave, $E[\ln(U)] \le \ln(E[U])$:

$$E_P\left[ \ln\left(\frac{Q(X)}{P(X)}\right) \right] \le \ln\left( E_P\left[ \frac{Q(X)}{P(X)} \right] \right) = \ln\left( \sum_{x} P(x) \frac{Q(x)}{P(x)} \right) = \ln\left( \sum_{x} Q(x) \right) = \ln(1) = 0$$

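Multiplying by -1 reverses the inequality: $D_{KL}(P \parallel Q) \ge 0$. Because $\ln(u)$ is strictly concave, equality in Jensen's Inequality requires $Q(x)/P(x)$ to be constant wherever $P(x) > 0$, which happens only when $P = Q$. KL divergence is asymmetric and does not satisfy the triangle inequality, meaning it is a divergence rather than a metric.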
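The following sketch illustrates non-negativity and asymmetry on a small discrete example. The function name `kl_divergence` and the distributions `p` and `q` are assumptions made for illustration; the result is in nats because the natural logarithm is used.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

# Hypothetical distributions used only for illustration.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print(kl_pq, kl_qp, kl_pp)
```

Both directions are positive but differ from each other, and the divergence of a distribution from itself is zero.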

Multivariate Gaussian KL and Cross-Entropy

In deep learning, particularly in VAEs, we often compute the KL divergence between two multivariate Gaussian distributions, $\mathcal{N}(\boldsymbol{\mu}_0, \Sigma_0)$ and $\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$, which has a closed-form analytical solution:

$$D_{KL}(p_0 \parallel p_1) = \frac{1}{2} \left[ \text{tr}(\Sigma_1^{-1} \Sigma_0) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^T \Sigma_1^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) - D + \ln\left(\frac{|\Sigma_1|}{|\Sigma_0|}\right) \right]$$

where $D$ is the dimensionality of the Gaussian.

Cross-Entropy $H(P, Q)$ is the average code length, in bits, needed to encode samples from $P$ using a code optimized for the approximating distribution $Q$: $H(P, Q) = -\sum_x P(x) \log_2 Q(x)$. It is related to entropy and KL divergence by $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$. Since $H(P)$ does not depend on the model parameters that define $Q$, minimizing Cross-Entropy with respect to those parameters is equivalent to minimizing the KL divergence.
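Both formulas can be exercised in a few lines. The sketch below makes two simplifying assumptions that are not part of the text: the covariances are diagonal (a common choice in VAEs), so the trace, quadratic form, and log-determinant reduce to sums over dimensions, and natural logarithms are used throughout so every quantity is in nats. All function names and numbers are illustrative.

```python
import math

def gaussian_kl_diag(mu0, var0, mu1, var1):
    """KL(N(mu0, diag(var0)) || N(mu1, diag(var1))) in nats, diagonal case only."""
    total = 0.0
    for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1):
        # Per-dimension trace term, quadratic term, -1, and log-determinant ratio.
        total += v0 / v1 + (m1 - m0) ** 2 / v1 - 1.0 + math.log(v1 / v0)
    return 0.5 * total

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# KL from a hypothetical encoder posterior to a standard normal prior.
kl_to_prior = gaussian_kl_diag([0.5, -1.0], [0.25, 2.0], [0.0, 0.0], [1.0, 1.0])

# Decomposition H(P, Q) = H(P) + D_KL(P || Q) on a discrete example.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)

print(kl_to_prior, lhs, rhs)
```

For full covariance matrices a linear-algebra library would be needed to compute the inverse and determinant; the diagonal case is shown here only because it needs no dependencies.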

Cross-Entropy Loss in Supervised Learning

In classification tasks, Cross-Entropy is the standard loss function used to optimize neural networks, derived from the principle of Maximum Likelihood Estimation.

Its gradient with respect to the logits is simply the difference between predictions and targets, so the error signal at the output layer does not vanish when the model is confidently wrong.

Categorical Cross-Entropy and Gradient Derivation

For a multi-class classification problem, let $y$ be the true one-hot target vector ($y_c = 1$ for the true class, and $0$ otherwise), and $\hat{y}$ be the model's predicted probability distribution output by a Softmax layer: $\hat{y}_i = e^{z_i}/\sum_{j=1}^{C} e^{z_j}$. The Categorical Cross-Entropy Loss for a single sample is:

$$L = -\sum_{c=1}^{C} y_c \ln \hat{y}_c$$

We derive the gradient of $L$ with respect to the input logit $z_i$. First, note that the derivative of the Softmax output is $\frac{\partial \hat{y}_c}{\partial z_i} = \hat{y}_i(1 - \hat{y}_i)$ if $c=i$, and $-\hat{y}_c\hat{y}_i$ if $c \neq i$. Using the multivariate chain rule:

$$\frac{\partial L}{\partial z_i} = \sum_{c=1}^{C} \frac{\partial L}{\partial \hat{y}_c} \frac{\partial \hat{y}_c}{\partial z_i} = \sum_{c=1}^{C} \left( -\frac{y_c}{\hat{y}_c} \right) \frac{\partial \hat{y}_c}{\partial z_i}$$

$$= -\frac{y_i}{\hat{y}_i} \hat{y}_i(1 - \hat{y}_i) - \sum_{c \neq i} \frac{y_c}{\hat{y}_c} (-\hat{y}_c\hat{y}_i) = -y_i(1 - \hat{y}_i) + \sum_{c \neq i} y_c \hat{y}_i$$

$$= -y_i + y_i\hat{y}_i + \hat{y}_i \sum_{c \neq i} y_c = -y_i + \hat{y}_i \sum_{c=1}^{C} y_c$$

Since $y$ is a one-hot vector, $\sum_{c=1}^{C} y_c = 1$. Thus, the gradient simplifies to:

$$\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i$$

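This compact form shows that the error backpropagated to the logits is simply the difference between the predicted probability and the true target. Because the Softmax derivative cancels, the gradient does not shrink when the model is confidently wrong, which avoids saturation at the output layer. Separately, because $y$ is one-hot, the loss reduces to $L = -\ln \hat{y}_k$ for the true class $k$, which is the negative log-likelihood of the observed label. Minimizing cross-entropy is therefore equivalent to maximizing the log-likelihood of the parameters.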
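The result $\partial L / \partial z_i = \hat{y}_i - y_i$ can be verified against a central finite-difference approximation. This is a minimal sketch: the function names, the logits `z`, and the step size `eps` are illustrative assumptions, and the Softmax subtracts the maximum logit for numerical stability, which does not change its output.

```python
import math

def softmax(z):
    m = max(z)  # shift by the max logit for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(z, y):
    """Categorical cross-entropy of softmax(z) against a one-hot target y."""
    y_hat = softmax(z)
    return -sum(yc * math.log(yh) for yc, yh in zip(y, y_hat) if yc > 0)

def analytic_grad(z, y):
    """Gradient from the derivation: y_hat - y."""
    return [yh - yc for yh, yc in zip(softmax(z), y)]

def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        diff = cross_entropy_loss(z_plus, y) - cross_entropy_loss(z_minus, y)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical logits and one-hot target (true class is index 1).
z = [2.0, -1.0, 0.5]
y = [0.0, 1.0, 0.0]

g_analytic = analytic_grad(z, y)
g_numeric = numeric_grad(z, y)
print(g_analytic)
print(g_numeric)
```

The two gradients agree to within the finite-difference error, and the components sum to zero because both $\hat{y}$ and $y$ sum to one.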

Binary Cross-Entropy (BCE) Gradient Derivation

For binary classification, where the label is $y \in \{0, 1\}$ and the prediction is $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$ output by a Sigmoid function, the loss is:

$$L = - [y \ln \hat{y} + (1 - y) \ln (1 - \hat{y})]$$

Let us derive the gradient of $L$ with respect to the pre-sigmoid logit $z$. Note that $\frac{d\hat{y}}{dz} = \hat{y}(1-\hat{y})$. By the chain rule:

$$\frac{dL}{dz} = \frac{dL}{d\hat{y}} \frac{d\hat{y}}{dz} = - \left[ \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \right] \hat{y}(1-\hat{y})$$

$$= - \left[ \frac{y(1-\hat{y}) - \hat{y}(1-y)}{\hat{y}(1-\hat{y})} \right] \hat{y}(1-\hat{y}) = - [y - y\hat{y} - \hat{y} + y\hat{y}] = \hat{y} - y$$

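Like the multi-class case, the gradient simplifies to the difference $\hat{y} - y$. The factor $\hat{y}(1-\hat{y})$ from the Sigmoid derivative cancels, so the gradient with respect to the logit does not vanish when the Sigmoid saturates on the wrong side of the label.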
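A quick numerical check makes the cancellation concrete. The sketch below verifies $dL/dz = \hat{y} - y$ by finite differences at a confidently wrong prediction. For contrast it also evaluates the gradient of a squared-error loss $\frac{1}{2}(\hat{y} - y)^2$, which is not discussed in the text and is included here only as a comparison. All names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(z, y):
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_grad(z, y):
    """Gradient from the derivation: y_hat - y."""
    return sigmoid(z) - y

def mse_grad(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)^2, shown only for comparison."""
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1 - y_hat)

# Confidently wrong prediction: the label is 1 but the logit is very negative.
z, y = -8.0, 1.0
eps = 1e-6
numeric = (bce_loss(z + eps, y) - bce_loss(z - eps, y)) / (2 * eps)

g_bce = bce_grad(z, y)
g_mse = mse_grad(z, y)
print(numeric, g_bce, g_mse)
```

The cross-entropy gradient stays close to -1 in this regime, while the squared-error gradient is several orders of magnitude smaller because it keeps the $\hat{y}(1-\hat{y})$ factor.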

The Principle of Maximum Entropy and RL Regularization

Entropy is a powerful regularization tool, used both to construct unbiased distributions and to maintain exploration in reinforcement learning.

By maximizing the entropy of the policy's action distribution alongside the reward, agents are discouraged from converging prematurely to sub-optimal actions and keep exploring.

Principle of Maximum Entropy

Formulated by Edwin Jaynes, this principle states that when representing a probability distribution subject to constraints, one should choose the distribution that maximizes entropy. This ensures that the model makes the fewest assumptions possible about the unknown data. For example, if we only know that a continuous distribution is bounded in $[a, b]$, maximizing entropy yields the Uniform distribution. If we only know the mean and variance of a continuous distribution, maximizing entropy yields the Gaussian distribution. It serves as a foundational rule for creating non-informative priors in Bayesian inference.
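A discrete example, not taken from the text, shows the principle at work: a six-sided die whose only known property is its mean. The same Lagrange-multiplier argument used for the Gaussian gives a solution of the form $p_i \propto e^{\lambda i}$, and the sketch below finds $\lambda$ by bisection. The function name `maxent_die`, the target mean of 4.5, and the comparison distribution are all illustrative assumptions.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), iters=200):
    """Maximum entropy distribution over faces with a given mean: p_i ~ exp(lam * i)."""
    def dist(lam):
        weights = [math.exp(lam * f) for f in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(f * p for f, p in zip(faces, dist(lam)))

    # The mean increases monotonically with lam, so bisection applies.
    lo, hi = -10.0, 10.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))

p_maxent = maxent_die(4.5)
mean_maxent = sum(f * p for f, p in zip(range(1, 7), p_maxent))

# Another distribution with the same mean but a stronger implicit assumption.
p_narrow = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]

print(p_maxent, entropy(p_maxent), entropy(p_narrow))
```

With a target mean of 3.5 the constraint is uninformative and the routine returns the uniform distribution, matching the bounded-support case described above.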

Entropy Regularization in Reinforcement Learning

In Reinforcement Learning, policy gradient methods can suffer from premature policy collapse, where the agent converges to a single action and stops exploring. To prevent this, we add the entropy of the policy $\pi(a \mid s)$ to the optimization objective:

$$J(\pi) = E_{\pi}\left[ \sum_{t} \left( r_t + \beta H(\pi(\cdot \mid s_t)) \right) \right]$$

where $\beta \ge 0$ is the temperature parameter that weights the entropy bonus against the reward, and the entropy is evaluated at each visited state $s_t$. This encourages the policy to remain stochastic and explore alternative actions. Algorithms like Soft Actor-Critic (SAC) use this entropy regularization to learn robust and adaptive policies, maximizing both expected cumulative reward and policy entropy.
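The effect of the entropy bonus is easiest to see in a single-state problem with one decision (a bandit), which is a simplification assumed here and not the full sequential setting. In that case the regularized objective $\sum_a \pi(a) r(a) + \beta H(\pi)$ is maximized by $\pi(a) \propto e^{r(a)/\beta}$, with optimal value $\beta \ln \sum_a e^{r(a)/\beta}$. The rewards, temperature, and function names below are illustrative.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def regularized_objective(policy, rewards, beta):
    """Expected reward plus beta times policy entropy, for a single state."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward + beta * entropy(policy)

def soft_optimal_policy(rewards, beta):
    """Policy proportional to exp(r / beta), computed stably."""
    m = max(rewards)
    weights = [math.exp((r - m) / beta) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical action rewards and temperature.
rewards = [1.0, 0.8, 0.1]
beta = 0.5

greedy = [1.0, 0.0, 0.0]
uniform = [1 / 3] * 3
soft = soft_optimal_policy(rewards, beta)

j_greedy = regularized_objective(greedy, rewards, beta)
j_uniform = regularized_objective(uniform, rewards, beta)
j_soft = regularized_objective(soft, rewards, beta)

print(soft, j_greedy, j_uniform, j_soft)
```

The soft policy scores higher than both the greedy and the uniform policy on the regularized objective, and it keeps non-zero probability on every action. As $\beta$ shrinks it approaches the greedy policy.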