Bayes' Theorem: Updating Beliefs with Data

Bayes' Theorem is one of the most celebrated equations in mathematics, providing a formal mechanism for updating our beliefs when faced with new evidence. In machine learning, Bayes' Theorem allows us to flip conditional probabilities, calculating the probability of a hidden cause (like a disease or model parameters) given an observed effect (like a medical test result or training data). It forms the foundation of Bayesian inference, a major paradigm in statistical AI. We will cover the anatomy of the theorem, conjugate priors, parameter estimation paradigms, and Bayesian Neural Networks.

The Anatomy of Bayes' Theorem

Bayes' Theorem mathematically connects prior knowledge with fresh observations. It is derived directly from the product rule of probability.

The Formula

Bayes' Theorem is expressed as:

$$P(H | E) = \frac{P(E | H) \cdot P(H)}{P(E)}$$

Where:
- $P(H | E)$ is the Posterior (probability of hypothesis $H$ given evidence $E$).
- $P(E | H)$ is the Likelihood (probability of evidence $E$ given hypothesis $H$).
- $P(H)$ is the Prior (initial probability of hypothesis $H$ before seeing evidence).
- $P(E)$ is the Evidence (total probability of observing evidence $E$ across all hypotheses).
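
The product-rule derivation can be written out in two steps. As a short sketch, assuming $P(E) > 0$, the joint probability of $H$ and $E$ factors in two ways:

$$P(H, E) = P(H | E) \cdot P(E) = P(E | H) \cdot P(H)$$

Dividing both sides of the second equality by $P(E)$ recovers the formula above.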

Flipping Conditionals & Total Probability

Bayes' Theorem is useful because the likelihood $P(E | H)$ is often easy to measure, whereas the posterior $P(H | E)$ is what we actually want to know. The denominator $P(E)$ acts as a normalizing constant that makes the posterior probabilities sum to one. When the hypotheses $H_i$ are mutually exclusive and exhaustive, it is calculated using the Law of Total Probability:

$$P(E) = \sum_{i} P(E | H_i) P(H_i)$$
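
As an illustration, the following minimal sketch applies the theorem and the Law of Total Probability to the medical-test scenario mentioned in the introduction. The prevalence, sensitivity, and false-positive rate are made-up numbers, and the `posterior` helper is a hypothetical name chosen for this example.

```python
def posterior(prior, likelihoods):
    """Return P(H_i | E) for each hypothesis, given P(H_i) and P(E | H_i)."""
    # Numerators of Bayes' Theorem: P(E | H_i) * P(H_i)
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    # Denominator: Law of Total Probability, summing over all hypotheses
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate
prior = [0.01, 0.99]          # P(disease), P(healthy)
likelihoods = [0.95, 0.05]    # P(positive | disease), P(positive | healthy)

post = posterior(prior, likelihoods)
print(f"P(disease | positive) = {post[0]:.3f}")  # prints 0.161
```

Even with a fairly accurate test, the posterior stays near 16% under these assumed numbers because the low prior (1% prevalence) means most positive results come from the much larger healthy group.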

Conjugate Priors & Analytical Tractability

In practice, when the hypotheses are continuous parameters, the sum in $P(E)$ becomes an integral that often has no closed-form solution. Conjugate priors provide an algebraic shortcut.

Definition of Conjugacy

If the posterior distribution is in the same probability family as the prior distribution, the prior is conjugate to the likelihood. This allows us to calculate the posterior parameters analytically without integrating.

Beta-Binomial Updates

For a binary process with success probability $\theta$, the number of successes in $n$ trials follows a Binomial likelihood. If we choose a Beta distribution $\mathrm{Beta}(\alpha, \beta)$ as our prior and observe $k$ successes in $n$ trials, the posterior is also a Beta distribution: $\mathrm{Beta}(\alpha + k, \beta + n - k)$. This allows sequential, closed-form updates as new data arrives, because each posterior can serve as the prior for the next batch.
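
The update rule can be sketched in a few lines. The example below assumes a uniform $\mathrm{Beta}(1, 1)$ prior and two made-up batches of observations; `beta_update` and `beta_mean` are hypothetical helper names. It also checks that updating batch by batch gives the same posterior as a single update on the pooled data.

```python
def beta_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing `successes` in `trials`."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed uniform prior Beta(1, 1); hypothetical (successes, trials) batches
alpha, beta = 1.0, 1.0
batches = [(7, 10), (12, 20)]
for k, n in batches:
    # The posterior from one batch becomes the prior for the next
    alpha, beta = beta_update(alpha, beta, k, n)

# One-shot update on the pooled data (19 successes in 30 trials)
alpha_all, beta_all = beta_update(1.0, 1.0, 19, 30)

print(alpha, beta, beta_mean(alpha, beta))  # prints 20.0 12.0 0.625
```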

Parameter Estimation: MLE vs. MAP

In machine learning, we use Bayes' Theorem to estimate a model's parameters $\theta$ based on training data $D$.

Maximum Likelihood Estimation (MLE)

MLE ignores the prior and finds the parameters that maximize the probability of the observed data:

$$\theta_{MLE} = \arg\max_{\theta} P(D | \theta)$$

Maximum A Posteriori (MAP) Estimation

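MAP incorporates prior beliefs, maximizing the posterior $P(\theta | D)$. Because the evidence $P(D)$ does not depend on $\theta$, it can be dropped from the maximization: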

$$\theta_{MAP} = \arg\max_{\theta} P(D | \theta) P(\theta)$$

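If we assume a zero-mean isotropic Gaussian prior on the parameters $\theta$, the log-prior becomes a penalty proportional to $\|\theta\|_2^2$. MAP estimation is then equivalent to maximum likelihood training with L2 regularization (commonly implemented as weight decay) in neural networks, with a narrower prior corresponding to a stronger penalty.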
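
To make the link between a Gaussian prior and an L2 penalty concrete, here is a minimal sketch for a one-parameter linear model rather than a neural network. It assumes Gaussian noise with variance `sigma2` and a zero-mean Gaussian prior with variance `tau2` on the slope. These names, the data, and the helper functions are illustrative assumptions, not part of the text above.

```python
def fit_slope(xs, ys, lam=0.0):
    """Minimize sum((y - w*x)^2) + lam * w^2 over a scalar w (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def neg_log_posterior(w, xs, ys, sigma2, tau2):
    """Negative log posterior of w, up to an additive constant."""
    neg_log_lik = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
    neg_log_prior = w ** 2 / (2 * tau2)   # from the N(0, tau2) prior
    return neg_log_lik + neg_log_prior


# Hypothetical data, roughly y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

sigma2 = 1.0            # assumed noise variance
tau2 = 0.5              # assumed prior variance on w
lam = sigma2 / tau2     # L2 penalty strength implied by the prior

w_mle = fit_slope(xs, ys)        # no prior, so no penalty
w_map = fit_slope(xs, ys, lam)   # Gaussian prior acts as an L2 penalty
print(f"MLE slope: {w_mle:.4f}, MAP slope: {w_map:.4f}")
```

Under these assumptions the MAP slope is pulled toward zero relative to the MLE slope, and shrinking `tau2` (a more confident prior) increases `lam` and strengthens that pull.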

Bayesian Neural Networks (BNNs)

Standard neural networks learn a single point estimate for each weight. BNNs instead learn a probability distribution over every weight in the network.

Quantifying Parameter Uncertainty

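By maintaining a posterior distribution over the weights, $P(W | D)$, BNNs can quantify epistemic uncertainty, the uncertainty that comes from limited training data. Predictions are made by averaging over weights drawn from this distribution. For an out-of-distribution input, different weight samples tend to disagree, and the resulting spread in predictions flags the output as uncertain rather than confidently wrong. This behavior depends on the quality of the posterior approximation and is not guaranteed.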

Variational Inference & MC Dropout

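Computing the exact posterior over millions of weights is intractable, so BNNs rely on approximations. Variational Inference fits a simpler distribution, often a factorized Gaussian, to the posterior by minimizing the KL divergence from the approximation to the true posterior (equivalently, maximizing the evidence lower bound). Monte Carlo (MC) Dropout keeps dropout active at inference time, runs several stochastic forward passes, and uses the spread of the resulting predictions as an uncertainty estimate.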
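
The MC Dropout procedure can be sketched with a toy network in plain Python. This is only an illustration of the sampling loop: the weights are hand-picked stand-ins for a trained model, and `forward`, `mc_dropout_predict`, `drop_p`, and `n_samples` are hypothetical names. A real implementation would use a deep learning framework.

```python
import random
import statistics


def forward(x, w1, w2, drop_p, rng):
    """One stochastic forward pass of a tiny one-hidden-layer ReLU network."""
    keep = 1.0 - drop_p
    hidden = []
    for w in w1:
        h = max(0.0, w * x)                        # ReLU activation
        # Dropout stays ON at inference: keep each unit with probability `keep`
        mask = 1.0 if rng.random() < keep else 0.0
        hidden.append(h * mask / keep)             # inverted-dropout scaling
    return sum(v * h for v, h in zip(w2, hidden))


def mc_dropout_predict(x, w1, w2, drop_p=0.5, n_samples=200, seed=0):
    """Return the mean and variance of repeated stochastic forward passes."""
    rng = random.Random(seed)
    samples = [forward(x, w1, w2, drop_p, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pvariance(samples)


# Hypothetical fixed weights standing in for a trained network
w1 = [0.5, -0.3, 0.8, 1.2]
w2 = [1.0, 0.7, -0.4, 0.6]

mean, var = mc_dropout_predict(1.5, w1, w2)
print(f"predictive mean = {mean:.3f}, variance = {var:.3f}")

# With dropout disabled, every pass is identical and the variance vanishes
mean0, var0 = mc_dropout_predict(1.5, w1, w2, drop_p=0.0)
```

The variance across passes is the uncertainty signal. Setting `drop_p=0.0` recovers the ordinary deterministic prediction, which is why dropout must remain enabled for this estimate to carry any information.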