Bayes' Theorem: Updating Beliefs with Data
Bayes' Theorem is one of the most celebrated equations in mathematics, providing a formal mechanism for updating our beliefs when faced with new evidence. In machine learning, Bayes' Theorem allows us to flip conditional probabilities, calculating the probability of a hidden cause (like a disease or model parameters) given an observed effect (like a medical test result or training data). It forms the foundation of Bayesian inference, a major paradigm in statistical AI. We will cover the anatomy of the theorem, conjugate priors, parameter estimation paradigms, and Bayesian Neural Networks.
The Anatomy of Bayes' Theorem
Bayes' Theorem mathematically connects prior knowledge with fresh observations. It is derived directly from the product rule of probability.
The Formula
Bayes' Theorem is expressed as:
$$P(H | E) = \frac{P(E | H) \cdot P(H)}{P(E)}$$
Where:
- $P(H | E)$ is the Posterior (probability of hypothesis $H$ given evidence $E$).
- $P(E | H)$ is the Likelihood (probability of evidence $E$ given hypothesis $H$).
- $P(H)$ is the Prior (initial probability of hypothesis $H$ before seeing evidence).
- $P(E)$ is the Evidence (total probability of observing evidence $E$ across all hypotheses).
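The derivation from the product rule is short. Writing the joint probability of $H$ and $E$ in both orders gives:

$$P(H, E) = P(H | E) \cdot P(E) = P(E | H) \cdot P(H)$$

Dividing both sides by $P(E)$, assuming $P(E) > 0$, yields the formula for the posterior $P(H | E)$.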
Flipping Conditionals & Total Probability
Bayes' Theorem is useful because the likelihood $P(E | H)$ is often easy to measure, whereas the posterior $P(H | E)$ is what we actually want to know. The denominator $P(E)$ acts as a normalizing constant, calculated using the Law of Total Probability:
$$P(E) = \sum_{i} P(E | H_i) P(H_i)$$
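As a rough illustration of flipping a conditional, the sketch below applies the theorem to a hypothetical screening test. The function name `posterior` and all numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are assumptions chosen for the example, not figures from any real test.

```python
def posterior(prior, likelihoods):
    """Return P(H_i | E) for every hypothesis H_i.

    prior[i]       = P(H_i), expected to sum to 1
    likelihoods[i] = P(E | H_i)
    """
    # Law of Total Probability: P(E) = sum_i P(E | H_i) P(H_i)
    evidence = sum(l * p for l, p in zip(likelihoods, prior))
    # Bayes' Theorem applied to each hypothesis in turn
    return [l * p / evidence for l, p in zip(likelihoods, prior)]


# Hypothetical screening test (illustrative numbers only)
prior = [0.01, 0.99]        # P(disease), P(no disease)
likelihoods = [0.95, 0.05]  # P(positive | disease), P(positive | no disease)

p_disease, p_healthy = posterior(prior, likelihoods)
print(f"P(disease | positive) = {p_disease:.3f}")
```

Under these assumed numbers the posterior comes out at roughly 16%, far below the 95% likelihood, because the low prior means most positive results come from the much larger healthy group.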
Conjugate Priors & Analytical Tractability
When the hypothesis is a continuous parameter, the sum in the evidence $P(E)$ becomes an integral over all parameter values, and that integral often has no closed form. Conjugate priors provide an algebraic shortcut.
Definition of Conjugacy
If, for a given likelihood, the posterior distribution belongs to the same probability family as the prior distribution, the prior is said to be conjugate to that likelihood. The posterior parameters can then be computed in closed form, without evaluating the evidence integral explicitly.
Beta-Binomial Updates
For a binary process with success probability $\theta$, if we choose a Beta distribution $Beta(\alpha, \beta)$ as our prior, and observe $k$ successes in $n$ trials, the posterior is also a Beta distribution: $Beta(\alpha + k, \beta + n - k)$. This allows sequential, closed-form updates as new data arrives.
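The update rule can be sketched in a few lines. The helper names `beta_update` and `beta_mean`, the uniform $Beta(1, 1)$ starting prior, and the batches of data are all assumptions made for illustration.

```python
def beta_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing `successes` in `trials`."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed starting point: a uniform Beta(1, 1) prior
a, b = 1.0, 1.0

# Hypothetical data arriving in batches of (successes, trials)
batches = [(3, 10), (7, 10), (6, 10)]
for k, n in batches:
    # Each posterior becomes the prior for the next batch
    a, b = beta_update(a, b, k, n)

# Updating once with all the data pooled gives the same parameters
a_all, b_all = beta_update(1.0, 1.0, 16, 30)

print(f"sequential: Beta({a}, {b}), mean = {beta_mean(a, b):.3f}")
print(f"one-shot:   Beta({a_all}, {b_all})")
```

The sequential and pooled updates agree because the update only adds counts, so the order and grouping of the observations does not matter.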
Parameter Estimation: MLE vs. MAP
In machine learning, we use Bayes' Theorem to estimate a model's parameters $\theta$ based on training data $D$.
Maximum Likelihood Estimation (MLE)
MLE ignores the prior and finds the parameters that maximize the probability of the observed data:
$$\theta_{MLE} = \arg\max_{\theta} P(D | \theta)$$
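For a concrete case, consider estimating a coin's success probability from hypothetical data. The sketch below maximizes the Bernoulli log-likelihood over a grid and compares the result with the known closed form $k / n$; the data values and the grid resolution are assumptions chosen for illustration.

```python
import math


def log_likelihood(theta, k, n):
    """Log of P(D | theta) for k successes in n independent Bernoulli trials.

    The binomial coefficient is omitted because it does not depend on theta.
    """
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


k, n = 7, 10  # hypothetical data: 7 successes in 10 trials

# Candidate values of theta strictly inside (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, k, n))

# Closed-form MLE for a Bernoulli parameter
theta_closed = k / n

print(theta_grid, theta_closed)
```

Maximizing the log-likelihood instead of the likelihood itself gives the same $\theta$, since the logarithm is monotonic, and avoids numerical underflow on larger datasets.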
Maximum A Posteriori (MAP) Estimation
MAP incorporates prior beliefs, maximizing the posterior:
$$\theta_{MAP} = \arg\max_{\theta} P(D | \theta) P(\theta)$$
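The evidence $P(D)$ is omitted because it does not depend on $\theta$. If we assume a zero-mean, isotropic Gaussian prior on the parameters $\theta$, the negative log-prior is proportional to $\|\theta\|^2$, so MAP estimation is equivalent to minimizing the negative log-likelihood plus an L2 penalty, which is commonly implemented as weight decay in neural networks.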
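The link between a Gaussian prior and L2 regularization can be checked numerically on a one-parameter model. The sketch below assumes a linear model $y = \theta x$ with Gaussian noise of variance `sigma2` and a prior $\theta \sim \mathcal{N}(0, \tau^2)$; the function names, the data, and both variances are hypothetical. Multiplying the negative log-posterior by $2\sigma^2$ gives a ridge objective with penalty strength $\lambda = \sigma^2 / \tau^2$.

```python
def neg_log_posterior(theta, xs, ys, sigma2, tau2):
    """Negative log of P(D | theta) P(theta), dropping theta-independent constants.

    Assumed model: y = theta * x + Gaussian noise with variance sigma2.
    Assumed prior: theta ~ Normal(0, tau2).
    """
    nll = sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
    l2_penalty = theta ** 2 / (2 * tau2)  # contribution of the Gaussian prior
    return nll + l2_penalty


def ridge_solution(xs, ys, lam):
    """Closed-form minimizer of sum((y - theta * x)^2) + lam * theta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data and variances
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]
sigma2, tau2 = 0.25, 1.0

theta_ridge = ridge_solution(xs, ys, lam=sigma2 / tau2)  # L2-regularized fit
theta_mle = ridge_solution(xs, ys, lam=0.0)              # no prior, plain least squares

# Brute-force MAP estimate: minimize the negative log-posterior on a grid
grid = [i / 10000 for i in range(0, 20001)]
theta_map_grid = min(grid, key=lambda t: neg_log_posterior(t, xs, ys, sigma2, tau2))

print(theta_map_grid, theta_ridge, theta_mle)
```

The grid-based MAP estimate matches the ridge solution up to the grid spacing, and both are pulled toward zero relative to the MLE, which is the shrinkage effect of the prior.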
Bayesian Neural Networks (BNNs)
Standard neural networks learn a single point estimate for each weight. BNNs instead learn a probability distribution over every weight in the network.
Quantifying Parameter Uncertainty
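By representing weights as a posterior distribution $P(W | D)$, BNNs can quantify epistemic uncertainty, the uncertainty that comes from limited training data. For an out-of-distribution input, networks sampled from the weight posterior tend to disagree, so the predictive variance is high. A downstream system can treat that variance as a signal to defer or abstain rather than act on an overconfident prediction.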
Variational Inference & MC Dropout
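Calculating the true posterior over millions of weights is intractable, so BNNs rely on approximations. Variational Inference picks a tractable family of distributions, often a factorized Gaussian, and finds the member closest to the true posterior by minimizing the KL divergence (equivalently, maximizing the evidence lower bound). Monte Carlo (MC) Dropout keeps dropout active at inference time, runs several stochastic forward passes, and uses the spread of the resulting predictions as an estimate of uncertainty.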
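The inference side of MC Dropout can be sketched without a deep learning framework. The tiny network below uses hand-picked, hypothetical weights (`W1`, `B1`, `W2`, `B2`) in place of a trained model, and the function names, dropout rate, and sample count are likewise assumptions. In practice the weights would come from training with dropout enabled.

```python
import random
import statistics

# Hypothetical weights of a 1-input, 4-hidden-unit, 1-output network
W1 = [0.8, -0.5, 1.2, 0.3]
B1 = [0.0, 0.1, -0.2, 0.05]
W2 = [0.6, -0.4, 0.9, 0.2]
B2 = 0.1


def forward(x, rng, drop_prob):
    """One stochastic forward pass with dropout left ON at inference time."""
    keep_prob = 1.0 - drop_prob
    out = B2
    for w1, b1, w2 in zip(W1, B1, W2):
        h = max(0.0, w1 * x + b1)        # ReLU hidden unit
        if rng.random() < keep_prob:     # keep this unit with probability keep_prob
            out += w2 * (h / keep_prob)  # inverted-dropout scaling
        # dropped units contribute nothing to the output
    return out


def mc_dropout_predict(x, n_samples=200, drop_prob=0.5, seed=0):
    """Return the mean and variance of n_samples stochastic forward passes."""
    rng = random.Random(seed)
    samples = [forward(x, rng, drop_prob) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pvariance(samples)


mean, var = mc_dropout_predict(1.5)
print(f"predictive mean = {mean:.3f}, predictive variance = {var:.3f}")
```

Each forward pass samples a different sub-network, so the variance across passes serves as an approximate measure of the model's uncertainty for that input. With `drop_prob=0.0` every pass is identical and the variance collapses to zero, which recovers an ordinary deterministic prediction.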