Marginal and Joint Probability

In real-world data, we are rarely interested in just a single variable. Instead, we want to know how multiple variables interact. Joint probability gives the likelihood of multiple events occurring simultaneously, while marginal probability isolates the behavior of a single variable within a larger, multi-variable distribution. Together with the product rule, these concepts underpin most probabilistic reasoning in machine learning. This section covers joint distributions, marginalization via the sum rule, latent variable models, and the product and chain rules.

Joint Probability Distributions

Joint probability measures the likelihood of two or more random variables taking specific values simultaneously. The mathematical representation differs for discrete and continuous variables.

Joint PMFs and PDFs

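For discrete variables, the joint PMF is $P(X=x, Y=y)$, the probability that $X$ takes the value $x$ and $Y$ takes the value $y$ at the same time. These probabilities sum to 1 over all pairs $(x, y)$. For continuous variables, the joint PDF $f(x, y)$ is a density surface over the $(x, y)$ plane. Its height at a point is a density, not a probability; probabilities come from integrating it over a region. The total volume under the surface must equal 1: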

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) dx dy = 1$$
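For the discrete case, the same normalization requirement can be checked directly on a table. The following is a minimal sketch using a hypothetical joint PMF over weather and commute outcomes; the variable names and all probabilities are made up for illustration.

```python
# Hypothetical joint PMF over weather (X) and commute (Y); the numbers are made up.
joint = {
    ("sunny", "on_time"): 0.45,
    ("sunny", "late"): 0.15,
    ("rainy", "on_time"): 0.20,
    ("rainy", "late"): 0.20,
}

# Every entry is a probability, and together the entries must sum to 1.
total = sum(joint.values())
assert all(0.0 <= p <= 1.0 for p in joint.values())
assert abs(total - 1.0) < 1e-12

# P(X = "rainy", Y = "late") is a single table lookup.
p_rainy_late = joint[("rainy", "late")]
print(p_rainy_late)
```

A dictionary keyed by outcome pairs keeps the example readable; for larger tables a 2D array indexed by outcome would be the more common choice.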

High-Dimensional Joint Distributions

In deep learning, an image can be viewed as a single sample from a high-dimensional joint distribution over its $D$ pixel variables, where $D$ is typically in the thousands or more: $P(X_1, X_2, \dots, X_D)$. Generative modeling aims to learn this complex joint distribution so that new, realistic images can be sampled from it.

Marginal Probability & The Sum Rule

Marginal probability is the probability of an event involving one variable (or a subset of variables), irrespective of the values taken by the other variables in the system. We compute it through a process called marginalization.

Marginalization (The Sum Rule)

To isolate the marginal probability of $X$, we sum (or integrate) the joint probability over all possible values of $Y$. For discrete variables, this is written as:

$$P(X = x) = \sum_{y} P(X = x, Y = y)$$

For continuous variables, we replace the sum with an integral:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) dy$$
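The discrete sum rule translates almost line for line into code. The sketch below assumes a hypothetical two-variable joint PMF stored as a dictionary (the outcomes and probabilities are invented) and a helper named `marginalize`, which is an illustrative name rather than a library function.

```python
from collections import defaultdict

# Hypothetical joint PMF over weather (X) and commute (Y); the numbers are made up.
joint = {
    ("sunny", "on_time"): 0.45,
    ("sunny", "late"): 0.15,
    ("rainy", "on_time"): 0.20,
    ("rainy", "late"): 0.20,
}

def marginalize(joint_pmf, keep):
    """Sum a two-variable joint PMF down to one variable.

    keep=0 keeps the first variable (X); keep=1 keeps the second (Y).
    """
    marginal = defaultdict(float)
    for outcome, p in joint_pmf.items():
        # Accumulate every joint entry that shares the kept variable's value.
        marginal[outcome[keep]] += p
    return dict(marginal)

p_x = marginalize(joint, keep=0)  # P(X = x) = sum over y of P(X = x, Y = y)
p_y = marginalize(joint, keep=1)  # P(Y = y) = sum over x of P(X = x, Y = y)
print(p_x)
print(p_y)
```

Each marginal is itself a valid distribution: its entries sum to 1 because the joint table does.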

Geometric Projection

Geometrically, marginalization collapses a multi-dimensional probability distribution onto a lower-dimensional subspace by accumulating all the probability mass along the removed axes. For example, integrating a 2D joint density along the $Y$ direction leaves a 1D curve over the X-axis, which is the marginal distribution of $X$.
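The continuous case can be illustrated numerically. The sketch below assumes a bivariate Gaussian with zero means, unit variances, and a correlation of 0.6 (all chosen purely for illustration), integrates out $y$ with a midpoint Riemann sum, and compares the result with the standard normal density, which is the known marginal of $X$ for this family.

```python
import math

RHO = 0.6  # assumed correlation between X and Y, chosen for illustration

def joint_pdf(x, y, rho=RHO):
    """Bivariate Gaussian density with zero means, unit variances, correlation rho."""
    norm = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - rho ** 2))
    quad = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho ** 2)
    return norm * math.exp(-0.5 * quad)

def marginal_x(x, y_min=-10.0, y_max=10.0, steps=4000):
    """Approximate f_X(x) = integral of f(x, y) dy with a midpoint Riemann sum."""
    dy = (y_max - y_min) / steps
    return sum(joint_pdf(x, y_min + (i + 0.5) * dy) for i in range(steps)) * dy

def standard_normal_pdf(x):
    """Closed-form marginal of X for the joint density above."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

for x in (-1.0, 0.0, 2.0):
    print(x, marginal_x(x), standard_normal_pdf(x))
```

The integration limits and step count are arbitrary but generous for this density; a heavier-tailed joint distribution would need a wider range.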

Marginalization in Latent Variable Models

Latent variable models introduce unobserved hidden variables to explain complex structure in the observed data.

Gaussian Mixture Models (GMMs)

GMMs model data as a mixture of $K$ Gaussian components. The density of a data point $x$ is obtained by marginalizing over the discrete latent variable $z$, the index of the component that generated $x$. Here $P(z=k)$ is the mixing weight of component $k$ and $P(x | z=k)$ is that component's Gaussian density:

$$P(x) = \sum_{k=1}^{K} P(x | z=k) P(z=k)$$
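As a minimal sketch of this sum, consider a hypothetical one-dimensional mixture with three components. The weights, means, and standard deviations below are invented for illustration, and `gmm_density` is an illustrative name.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical 1D mixture with K = 3 components (all values are made up).
weights = [0.5, 0.3, 0.2]   # P(z = k); must sum to 1
means = [-2.0, 0.0, 3.0]    # mean of P(x | z = k)
stds = [0.5, 1.0, 0.8]      # standard deviation of P(x | z = k)

def gmm_density(x):
    """P(x) = sum over k of P(x | z = k) * P(z = k)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Sanity check: the marginal density should integrate to approximately 1.
dx = 0.01
area = sum(gmm_density(-12.0 + (i + 0.5) * dx) for i in range(2400)) * dx
print(gmm_density(0.0), area)
```

Because the latent variable is discrete with only $K$ values, the marginal is a finite sum and can be evaluated exactly, which is what makes GMMs tractable.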

Variational Autoencoders (VAEs)

VAEs learn a continuous latent variable $z$. To calculate the likelihood of data $x$, the model must marginalize over $z$ using an integral:

$$P(x) = \int P(x | z) P(z) dz$$

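When $P(x | z)$ is parameterized by a deep neural network, this integral has no closed form, and numerical integration over a high-dimensional $z$ is computationally intractable. VAEs therefore maximize the evidence lower bound (ELBO), a tractable lower bound on $\log P(x)$, instead of evaluating the integral directly.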
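To make the integral itself concrete, the sketch below uses a toy linear-Gaussian model in which the marginal happens to be known in closed form: $z \sim \mathcal{N}(0, 1)$ and $x | z \sim \mathcal{N}(z, \sigma^2)$, so that $x \sim \mathcal{N}(0, 1 + \sigma^2)$. It estimates $P(x)$ by naive Monte Carlo sampling from the prior. This is not the ELBO procedure and not how VAEs are trained; the model, the value of $\sigma$, and the function names are all assumptions made for illustration.

```python
import math
import random

SIGMA = 0.5  # assumed observation noise of the toy model

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian with the given mean and standard deviation."""
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def marginal_likelihood_mc(x, num_samples=100_000, seed=0):
    """Estimate P(x) = integral of P(x | z) P(z) dz by averaging P(x | z) over z ~ P(z)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)           # draw z from the prior P(z)
        total += normal_pdf(x, z, SIGMA)  # evaluate the likelihood P(x | z)
    return total / num_samples

def marginal_likelihood_exact(x):
    """Closed-form marginal, available only because the toy model is linear-Gaussian."""
    return normal_pdf(x, 0.0, math.sqrt(1.0 + SIGMA ** 2))

x_obs = 1.0
print(marginal_likelihood_mc(x_obs), marginal_likelihood_exact(x_obs))
```

In one dimension, sampling from the prior works well. With a high-dimensional latent space and a sharply peaked $P(x | z)$, almost all prior samples contribute negligibly, which is one reason a different strategy is needed in practice.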

The Product Rule & Chain Rule

The product rule allows us to decompose joint probabilities into conditional relationships, providing the algebraic foundation for generative models.

The Product Rule

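The product rule states that the joint probability of $X$ and $Y$ is the product of the conditional probability of $X$ given $Y$ and the marginal probability of $Y$. Because the joint probability is symmetric in its arguments, the roles of $X$ and $Y$ can be swapped: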

$$P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)$$
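Both factorizations can be checked on a small table. In the sketch below the joint PMF is hypothetical (invented numbers), and the conditionals are defined as joint divided by marginal, so the reconstruction holds by construction. The point is to see that either ordering recovers the same joint table.

```python
import math

# Hypothetical joint PMF over weather (X) and commute (Y); the numbers are made up.
joint = {
    ("sunny", "on_time"): 0.45,
    ("sunny", "late"): 0.15,
    ("rainy", "on_time"): 0.20,
    ("rainy", "late"): 0.20,
}

# Marginals via the sum rule.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Conditionals, each keyed by the same (x, y) pair as the joint table.
p_x_given_y = {(x, y): p / p_y[y] for (x, y), p in joint.items()}
p_y_given_x = {(x, y): p / p_x[x] for (x, y), p in joint.items()}

# Rebuild the joint table in both orders: P(X | Y) P(Y) and P(Y | X) P(X).
via_y = {k: p_x_given_y[k] * p_y[k[1]] for k in joint}
via_x = {k: p_y_given_x[k] * p_x[k[0]] for k in joint}

for k, p in joint.items():
    assert math.isclose(via_y[k], p)
    assert math.isclose(via_x[k], p)

print(p_x_given_y[("rainy", "late")], p_y_given_x[("rainy", "late")])
```

Note that the two conditionals for the same pair differ in general, even though both products give back the same joint probability.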

The Chain Rule of Probability

By applying the product rule recursively, we get the Chain Rule of Probability for $n$ variables:

$$P(X_1, X_2, \dots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i | X_1, \dots, X_{i-1})$$

This factorization is the mathematical core of autoregressive sequence models such as GPT, which model each token conditioned on all preceding tokens and therefore generate text one token at a time.
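The chain rule can be sketched with a toy model over a tiny vocabulary. Everything below is hypothetical: the probability tables are invented, and `cond_prob` looks only at the last token of the prefix to keep the example small, whereas the chain rule (and a model like GPT) conditions on the entire prefix.

```python
import math

# Hypothetical toy model; all probabilities are made up for illustration.
FIRST = {"the": 0.6, "a": 0.4}  # P(x_1)
NEXT = {                        # P(x_i | previous token)
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.8, "ran": 0.2},
    "dog": {"sat": 0.1, "ran": 0.9},
}

def cond_prob(token, prefix):
    """P(token | prefix). This toy model uses only the last token of the prefix."""
    if not prefix:
        return FIRST.get(token, 0.0)
    return NEXT.get(prefix[-1], {}).get(token, 0.0)

def sequence_log_prob(tokens):
    """log P(x_1, ..., x_n) = sum over i of log P(x_i | x_1, ..., x_{i-1})."""
    log_p = 0.0
    for i, token in enumerate(tokens):
        p = cond_prob(token, tokens[:i])
        if p == 0.0:
            return float("-inf")  # the model assigns this sequence zero probability
        log_p += math.log(p)
    return log_p

seq = ["the", "cat", "sat"]
seq_prob = math.exp(sequence_log_prob(seq))  # 0.6 * 0.5 * 0.8
print(seq_prob)
```

Summing log-probabilities rather than multiplying probabilities is a common design choice: products of many small conditionals underflow quickly for long sequences.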