Loss Functions: Binary Cross-Entropy

Binary Cross-Entropy (BCE) is the standard loss function for binary classification. It measures the dissimilarity between target binary labels and predicted probabilities, providing robust gradients for training classifiers.


Information Theory and Probabilistic Foundations

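BCE quantifies how far the predicted distribution is from the true label distribution. It is often described as a distance, but cross-entropy is asymmetric and therefore not a true metric. Minimizing it nonetheless pulls the predictions toward the observed labels.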

The BCE Formula

For a batch of $N$ samples, where $y_i \in \{0, 1\}$ is the target label and $\hat{y}_i \in (0, 1)$ is the predicted probability, Binary Cross-Entropy is defined as:

$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i=1}^N [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$$

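If the target $y_i = 1$, the second term becomes zero, and we minimize $-\log(\hat{y}_i)$. If the target $y_i = 0$, the first term becomes zero, and we minimize $-\log(1 - \hat{y}_i)$. Because of the logarithm, the loss is near zero for confident correct predictions and grows without bound as the predicted probability of the true class approaches zero, so confident incorrect predictions are penalized heavily.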
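To make the shape of this penalty concrete, the following plain-Python sketch evaluates the per-sample loss for a positive target at a few predicted probabilities, then averages over a small batch. The helper name `bce_per_sample` and the example values are illustrative assumptions, not part of any library.

```python
import math

def bce_per_sample(y, y_hat):
    """Per-sample binary cross-entropy for a label y in {0, 1} and a probability y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Positive target: the loss shrinks as y_hat approaches 1
# and grows without bound as y_hat approaches 0.
for y_hat in (0.9, 0.5, 0.1, 0.01):
    print(f"y=1, y_hat={y_hat}: loss={bce_per_sample(1, y_hat):.4f}")

# The batch loss is the mean of the per-sample losses.
labels = [1, 0, 1]
preds = [0.9, 0.2, 0.6]
batch_loss = sum(bce_per_sample(y, p) for y, p in zip(labels, preds)) / len(labels)
print(f"batch loss: {batch_loss:.4f}")
```

Moving the prediction from 0.1 to 0.01 for a positive target doubles the loss, even though the probability changed by less than 0.1.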

Kullback-Leibler Divergence Connection

In information theory, cross-entropy measures the average number of bits (or nats, when the natural logarithm is used, as in the BCE formula) needed to identify an event from a set if we use a coding scheme optimized for the probability distribution $\hat{p}$ instead of the true distribution $p$. It is related to Kullback-Leibler (KL) divergence by:

$$H(p, \hat{p}) = H(p) + D_{KL}(p || \hat{p})$$

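For hard labels, each target is a degenerate distribution that places all its mass on one class, so its entropy $H(p)$ is zero and BCE equals the KL divergence exactly. For soft labels, $H(p)$ is nonzero but does not depend on the model parameters. In both cases, minimizing BCE is equivalent to minimizing the KL divergence, which aligns the model's predictions with the true class distribution.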
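The decomposition can be checked numerically for Bernoulli distributions. The sketch below is a minimal illustration in plain Python. The function names and the values of `p` and `q` are assumptions chosen for the example, and the convention $0 \log 0 = 0$ is applied explicitly.

```python
import math

def entropy(p):
    """Entropy of Bernoulli(p) in nats, using the convention 0 * log(0) = 0."""
    return sum(-t * math.log(t) for t in (p, 1 - p) if t > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) between Bernoulli(p) and Bernoulli(q), with q in (0, 1)."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) between Bernoulli(p) and Bernoulli(q)."""
    terms = ((p, q), (1 - p, 1 - q))
    return sum(a * math.log(a / b) for a, b in terms if a > 0)

q = 0.7  # hypothetical model prediction
for p in (1.0, 0.8):  # a hard label and a soft label
    lhs = cross_entropy(p, q)
    rhs = entropy(p) + kl_divergence(p, q)
    print(f"p={p}: H(p,q)={lhs:.6f}  H(p)={entropy(p):.6f}  H(p)+KL={rhs:.6f}")
```

For the hard label the entropy term vanishes, so the cross-entropy and the KL divergence coincide. For the soft label they differ by the constant $H(p)$.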

Optimization and Gradient Flow

Combining BCE with the sigmoid activation yields a simple, stable gradient that avoids saturation bottlenecks.

Gradient with Respect to Logits

When using a sigmoid output layer, the predicted probability is $\hat{y} = \sigma(z) = 1 / (1 + e^{-z})$. Substituting this into the BCE formula for a single sample ($N = 1$) and differentiating with respect to the input logit $z$, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, yields a compact result:

$$\frac{\partial \mathcal{L}_{BCE}}{\partial z} = \hat{y} - y$$

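This derivative is simply the signed error between prediction and target. The factor $\sigma'(z) = \hat{y}(1 - \hat{y})$, which shrinks toward zero when the sigmoid saturates, is cancelled by the logarithm in BCE, so a confidently wrong prediction still receives a gradient of magnitude close to 1. By contrast, pairing a sigmoid output with squared error leaves that factor in place, and the gradient vanishes exactly where the error is largest.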
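A finite-difference check is one way to confirm the gradient formula. The sketch below, written in plain Python with illustrative helper names, compares the analytic value $\hat{y} - y$ against a numerical estimate. It also prints the gradient that a squared-error loss $\tfrac{1}{2}(\hat{y} - y)^2$ would give through the same sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Per-sample BCE as a function of the logit z."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

y = 1.0
for z in (-6.0, -2.0, 0.0, 2.0):
    y_hat = sigmoid(z)
    bce_grad = y_hat - y                            # analytic BCE gradient
    mse_grad = (y_hat - y) * y_hat * (1.0 - y_hat)  # gradient of 0.5 * (y_hat - y)**2
    check = numeric_grad(lambda t: bce_from_logit(t, y), z)
    print(f"z={z:+.1f}  bce_grad={bce_grad:+.5f}  finite_diff={check:+.5f}  mse_grad={mse_grad:+.5f}")
```

At $z = -6$ with a positive target, the BCE gradient stays close to $-1$ while the squared-error gradient is smaller by more than two orders of magnitude.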

Numerical Instability of Raw Logs

Evaluating BCE directly from probabilities is fragile because of floating-point precision limits. For a sufficiently negative logit, $\hat{y}$ rounds to exactly 0 and $\log(\hat{y})$ evaluates to negative infinity. For a sufficiently positive logit, $\hat{y}$ rounds to exactly 1, so $1 - \hat{y}$ is 0 and $\log(1 - \hat{y})$ is negative infinity as well.

These infinities propagate into the loss and its gradients, and products such as $0 \cdot (-\infty)$ produce NaN. To avoid this, deep learning libraries fuse the sigmoid and BCE computations into a single function that operates on the logits and is algebraically rearranged for stability.
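The saturation problem can be reproduced without any deep learning library. The sketch below uses Python floats (float64) and an illustrative `naive_bce` helper. Note one difference from array libraries: Python's `math.log` raises `ValueError` at zero, whereas tensor libraries typically return negative infinity.

```python
import math

def naive_bce(z, y):
    """BCE computed the textbook way: sigmoid first, then logarithms."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

z = 40.0
y_hat = 1.0 / (1.0 + math.exp(-z))
# exp(-40) is far below float64 resolution near 1, so the sigmoid rounds to exactly 1.0.
saturated = (y_hat == 1.0)
print("sigmoid(40) == 1.0:", saturated)

# A confidently wrong prediction: target 0, logit 40. The true loss is about 40.
try:
    print(naive_bce(z, 0.0))
    failed = False
except ValueError as err:
    # log(1 - y_hat) is log(0), which math.log rejects.
    failed = True
    print("naive BCE failed:", err)
```

The information needed to compute the loss is still present in the logit. It is lost only when the sigmoid is rounded to a probability.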

PyTorch Implementation

Let's contrast PyTorch's BCELoss and BCEWithLogitsLoss to show the stability advantage of log-sum-exp scaling.

BCELoss vs. BCEWithLogitsLoss

We can compute the binary classification loss both ways in PyTorch and compare the results:

<pre><code class="language-python">import torch
import torch.nn as nn

# Raw logits and targets (batch of 2)
logits = torch.tensor([12.0, -15.0], requires_grad=True)
targets = torch.tensor([1.0, 0.0])

# Method 1: BCELoss (requires manual sigmoid first)
sigmoid = nn.Sigmoid()
probs = sigmoid(logits)
bce_loss = nn.BCELoss()
loss1 = bce_loss(probs, targets)

# Method 2: BCEWithLogitsLoss (processes logits directly)
bce_logits_loss = nn.BCEWithLogitsLoss()
loss2 = bce_logits_loss(logits, targets)

print("BCELoss result:", loss1.item())
print("BCEWithLogitsLoss result:", loss2.item())</code></pre>

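In this code, we evaluate both losses. BCEWithLogitsLoss is mathematically equivalent to applying a sigmoid followed by BCELoss, but it works from the logits directly, so it never takes the logarithm of a probability that has rounded to exactly 0 or 1. For moderate logits such as $12.0$ and $-15.0$, the two results agree closely. The difference matters for logits of larger magnitude, where the float32 sigmoid saturates. PyTorch's BCELoss clamps its log outputs at $-100$ to keep the loss finite in that regime, but the resulting value and gradient are no longer accurate.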

Coding Stable BCE Manually

We can implement a stable BCE loss function ourselves using the log-sum-exp formulation, which never evaluates an exponential of a large positive number:

<pre><code class="language-python">import torch

def stable_bce_loss(logits, targets):
    # Stable formulation: max(x, 0) - x * y + log(1 + exp(-|x|))
    max_val = torch.clamp(logits, min=0)
    # log1p keeps precision when exp(-|x|) is tiny compared with 1
    softplus_tail = torch.log1p(torch.exp(-torch.abs(logits)))
    loss = max_val - logits * targets + softplus_tail
    return torch.mean(loss)

# Reuses logits, targets, and loss2 from the previous example
manual_bce = stable_bce_loss(logits, targets)
print("Stable manual loss matches?:", torch.allclose(loss2, manual_bce))</code></pre>

This implementation relies on the identity $\log(1 + e^{z}) = \max(z, 0) + \log(1 + e^{-|z|})$. The exponent $-|z|$ is never positive, so the exponential stays in $(0, 1]$ and cannot overflow. Up to floating-point tolerance, the result should agree with PyTorch's stable loss.
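The same rearrangement can be exercised in plain Python to see how it behaves at extreme logits. This is a minimal sketch under the assumption of scalar float64 inputs. The function names are illustrative, and `math.log1p` is used for the final term to keep precision when the exponential is tiny.

```python
import math

def stable_bce_from_logit(z, y):
    """Per-sample BCE from a logit: max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def naive_bce_from_logit(z, y):
    """Reference: sigmoid followed by the textbook BCE formula."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Moderate logits: both formulations agree.
for z, y in ((2.0, 1.0), (-3.0, 0.0), (0.5, 0.0)):
    assert math.isclose(stable_bce_from_logit(z, y), naive_bce_from_logit(z, y), rel_tol=1e-9)

# Extreme logits: the stable form stays finite, whereas exp(1000) would overflow a float64.
wrong = stable_bce_from_logit(1000.0, 0.0)   # confidently wrong prediction
right = stable_bce_from_logit(-1000.0, 0.0)  # confidently right prediction
print(wrong, right)
```

For a confidently wrong prediction the loss grows linearly with the logit, which is consistent with a gradient magnitude close to 1.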