Loss Functions: Mean Squared Error in Neural Networks
Mean Squared Error (MSE) is the most common loss function for regression tasks. It measures the average squared difference between predicted values and actual targets, and it is differentiable everywhere, which makes it well suited to gradient-based optimization.
Mathematical Mechanics
MSE averages the squared residuals, so larger prediction errors are penalized quadratically.
The MSE Formula
For a batch of $N$ samples, where $y_i$ is the target value and $\hat{y}_i$ is the predicted output, the Mean Squared Error is defined as:
$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2$$
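For multi-dimensional targets, the squared differences are accumulated across all output dimensions as well. Whether they are summed or averaged depends on the convention; nn.MSELoss with its default reduction averages over every element. Squaring ensures that the loss is never negative (it is zero only when every prediction equals its target) and penalizes larger errors more severely than smaller ones.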
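As a minimal sketch of the formula in plain Python, assuming scalar targets (the helper name mse is illustrative and not part of any library):

```python
def mse(targets, predictions):
    """Average squared difference between paired targets and predictions."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / n


# Doubling a residual quadruples its contribution to the loss.
small = mse([0.0], [1.0])  # residual of 1
large = mse([0.0], [2.0])  # residual of 2
print(small, large)
```

The two printed values differ by a factor of four, which is the quadratic penalty in its simplest form.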
Gradient Derivation
The derivative of the MSE loss with respect to a single prediction $\hat{y}_i$ follows from the power rule and the chain rule:
$$\frac{\partial \mathcal{L}_{MSE}}{\partial \hat{y}_i} = -\frac{2}{N}(y_i - \hat{y}_i)$$
This derivative is linear in the error $(y_i - \hat{y}_i)$. As the prediction approaches the target, the gradient magnitude shrinks toward zero, leading to stable updates near the minimum. Conversely, large errors yield proportionally large gradients, which can destabilize training.
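One way to check the derivative is to compare it with central finite differences. The sketch below uses plain Python; the helper names mse, mse_grad, and numeric_grad and the sample values are assumptions made for illustration.

```python
def mse(targets, predictions):
    n = len(targets)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n


def mse_grad(targets, predictions):
    """Analytic gradient: dL/dy_hat_i = -(2/N) * (y_i - y_hat_i)."""
    n = len(targets)
    return [-(2.0 / n) * (y - p) for y, p in zip(targets, predictions)]


def numeric_grad(targets, predictions, eps=1e-6):
    """Central finite differences, perturbing one prediction at a time."""
    grads = []
    for i in range(len(predictions)):
        up = list(predictions)
        down = list(predictions)
        up[i] += eps
        down[i] -= eps
        grads.append((mse(targets, up) - mse(targets, down)) / (2 * eps))
    return grads


targets = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 4.0]
analytic = mse_grad(targets, predictions)
numeric = numeric_grad(targets, predictions)
print(analytic)
print(numeric)
```

The two lists should agree to within floating-point error, since central differences are exact for a quadratic function apart from rounding.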
Statistical and Geometrical Insights
MSE has a clear statistical interpretation: it corresponds to assuming Gaussian noise on the target variables.
Maximum Likelihood Estimation
Minimizing MSE is mathematically equivalent to maximizing the likelihood of the data under the assumption that the targets are corrupted by zero-mean Gaussian noise: $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
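Under Maximum Likelihood Estimation (MLE), the negative log-likelihood reduces, up to an additive constant and a positive scale factor, to the sum of squared errors. When the noise assumption holds, the parameters that minimize MSE are therefore the maximum likelihood estimate. For linear models, this estimator is also the minimum-variance unbiased one; for neural networks, no such finite-sample guarantee applies.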
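The step from likelihood to squared error can be written out explicitly. This sketch assumes $N$ independent samples and a fixed, known noise variance $\sigma^2$, with $\hat{y}_i = f(x_i)$:

$$\log p(y_1, \dots, y_N \mid x_1, \dots, x_N) = \sum_{i=1}^N \log \mathcal{N}(y_i; \hat{y}_i, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - \hat{y}_i)^2$$

The first term does not depend on the model parameters, and $\frac{1}{2\sigma^2}$ is a positive constant, so maximizing the log-likelihood amounts to minimizing $\sum_{i=1}^N (y_i - \hat{y}_i)^2 = N \cdot \mathcal{L}_{MSE}$.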
Sensitivity to Outliers
Because errors are squared, a single large outlier can dominate the loss. For instance, one error of $10$ contributes $100$ to the sum of squared errors, whereas ten errors of $1$ contribute only $10$ in total.
This quadratic penalty pushes the network to prioritize reducing outlier errors, which can pull the fitted function away from the majority of normal samples. In datasets with heavy-tailed noise or corrupted outliers, alternative losses such as Mean Absolute Error (MAE) or Huber loss are often preferred.
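The effect can be made concrete by measuring how much of each loss comes from the outlier alone. The sketch below reuses the numbers from the example above (ten residuals of 1 and one residual of 10); the helper names mse and mae are illustrative.

```python
def mse(residuals):
    return sum(r ** 2 for r in residuals) / len(residuals)


def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)


# Ten residuals of 1 plus a single outlier residual of 10.
residuals = [1.0] * 10 + [10.0]

# Fraction of each loss that comes from the outlier alone.
mse_share = 10.0 ** 2 / sum(r ** 2 for r in residuals)
mae_share = abs(10.0) / sum(abs(r) for r in residuals)

print(f"MSE = {mse(residuals):.2f}, outlier share = {mse_share:.0%}")
print(f"MAE = {mae(residuals):.2f}, outlier share = {mae_share:.0%}")
```

The outlier accounts for $100/110$ (about 91%) of the squared error but only $10/20$ (50%) of the absolute error, which is why MAE is less sensitive to it.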
PyTorch Implementation
Let's implement MSE loss in PyTorch and compare the built-in module with a manual implementation.
Using nn.MSELoss
PyTorch provides MSE loss as nn.MSELoss. Its reduction argument accepts 'mean' (the default), 'sum', or 'none', which returns the unreduced per-element squared errors.
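A minimal sketch of this usage follows. The tensor values are assumed, chosen to match the worked numbers in this section, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

# Assumed example values, chosen to match the worked numbers in this section.
predictions = torch.tensor([1.5, 2.0, 4.0], requires_grad=True)
targets = torch.tensor([1.0, 2.0, 3.0])

# reduction="mean" is the default: average over all elements.
criterion = nn.MSELoss()
loss = criterion(predictions, targets)
loss.backward()  # populates predictions.grad

# The other reductions return the sum or the per-element squared errors.
loss_sum = nn.MSELoss(reduction="sum")(predictions, targets)
loss_none = nn.MSELoss(reduction="none")(predictions, targets)

print("mean:", loss.item())
print("sum: ", loss_sum.item())
print("none:", loss_none.detach().tolist())
print("grad:", predictions.grad.tolist())
```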
For predictions $[1.5, 2.0, 4.0]$ and targets $[1.0, 2.0, 3.0]$, the differences $\hat{y}_i - y_i$ are $[0.5, 0.0, 1.0]$. The squared residuals are $[0.25, 0.0, 1.0]$, so the mean-reduced loss is $1.25 / 3 \approx 0.4167$.
Manual MSE Implementation
We can write a manual MSE loss function and verify its gradients match PyTorch's native implementation:
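<pre><code class="language-python">import torch

# Example values from the nn.MSELoss section
predictions = torch.tensor([1.5, 2.0, 4.0], requires_grad=True)
targets = torch.tensor([1.0, 2.0, 3.0])

# Built-in loss; its backward pass populates predictions.grad
loss = torch.nn.MSELoss()(predictions, targets)
loss.backward()

# Reset gradients so the manual backward pass does not accumulate into them
predictions.grad.zero_()

# Manual calculation: mean of (pred - target)^2
manual_loss = torch.mean((predictions - targets) ** 2)
manual_loss.backward()

# Analytical gradient: 2/N * (pred - target)
manual_grad = (2.0 / len(targets)) * (predictions - targets).detach()

print("Manual Loss matches?:", torch.allclose(loss, manual_loss))
print("Manual Gradient matches?:", torch.allclose(predictions.grad, manual_grad))
</code></pre>
Both checks print True, which confirms the derivative derived earlier: the gradient that autograd computes for the manual loss equals the analytical expression $\frac{2}{N}(\hat{y}_i - y_i)$. Comparing the two with torch.allclose is a quick way to verify that an analytical derivation agrees with PyTorch's autograd engine.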