Probability Density Functions (PDF)

For continuous variables such as height, weight, or sensor readings, we cannot usefully ask 'what is the probability that an observation is exactly 1.7500000... meters tall?', because that probability is zero. Instead, we use a Probability Density Function (PDF) to describe how probability is spread over the variable's range, so that probabilities are assigned to intervals rather than to single points. PDFs underpin the analysis of continuous data and the construction of generative models. We will cover the properties of probability density, multivariable densities, generative AI applications, and maximum likelihood estimation.

Properties of Probability Density

A PDF, denoted $f(x)$, is a curve whose height at $x$ is a probability density, not a probability. Density values can exceed 1; only the area under the curve over an interval is a probability, so to find actual probabilities we must compute that area.

Non-negativity and Total Area

A valid PDF must satisfy two conditions: the density must be non-negative everywhere ($f(x) \ge 0$), and the total area under the curve must equal exactly 1:

$$\int_{-\infty}^{\infty} f(x) dx = 1$$
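As a quick numerical check of both conditions, the sketch below uses a narrow Gaussian (standard deviation 0.1, chosen only for illustration) whose peak density is well above 1 while its total area is still 1. The helper names `normal_pdf` and `trapezoid` are made up for this example, and the integral is truncated to $[-2, 2]$ on the assumption that the tails beyond 20 standard deviations are negligible.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def trapezoid(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

def narrow(x):
    """Narrow Gaussian: sigma = 0.1 (illustrative choice)."""
    return normal_pdf(x, mu=0.0, sigma=0.1)

peak = narrow(0.0)                    # density at the mode; larger than 1
area = trapezoid(narrow, -2.0, 2.0)   # truncated integral; should be close to 1
print(peak, area)
```

The peak is a density, not a probability, so a value above 1 does not violate any rule as long as the area stays at 1.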

Calculating Interval Probability

The probability that a continuous random variable $X$ falls within the interval $[a, b]$ is calculated by integrating the PDF over that range:

$$P(a \le X \le b) = \int_{a}^{b} f(x) dx$$

Visually, this corresponds to the area under the curve between vertical lines at $x=a$ and $x=b$.
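The interval formula can be tried out numerically. The sketch below assumes a standard normal density, integrates it over $[-1, 1]$ with a simple trapezoidal rule, and compares the result with the closed form $\operatorname{erf}(1/\sqrt{2})$. It also shows that an interval of zero width has zero probability. The function names are illustrative only.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(f, a, b, n=10_000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

p_numeric = integrate(normal_pdf, -1.0, 1.0)       # area between x = -1 and x = 1
p_exact = math.erf(1.0 / math.sqrt(2.0))           # closed form for the same interval
p_point = integrate(normal_pdf, 1.75, 1.75)        # zero-width interval
print(p_numeric, p_exact, p_point)
```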

Multivariable PDFs & Transformations

In multi-dimensional spaces, we work with joint density functions and must track how transformations of the variables change the density.

Joint PDFs

For two variables, the joint PDF $f(x, y)$ defines a surface over the $(x, y)$ plane, and the total volume under that surface equals 1. A marginal density is obtained by integrating out the other variable, for example $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$.
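To make marginalization concrete, here is a minimal sketch with a hypothetical joint density $f(x, y) = x + y$ on the unit square (zero elsewhere), picked because its marginal $f_X(x) = x + \tfrac{1}{2}$ is easy to verify by hand. The midpoint-rule helper is an assumption of this example, not a prescribed method.

```python
def joint_pdf(x, y):
    """Hypothetical joint density: x + y on the unit square, 0 elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return x + y
    return 0.0

def midpoint(f, a, b, n=400):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

def marginal_x(x):
    """Integrate y out of the joint density to get the marginal density of x."""
    return midpoint(lambda y: joint_pdf(x, y), 0.0, 1.0)

total_volume = midpoint(marginal_x, 0.0, 1.0)   # volume under the whole surface
print(marginal_x(0.25), total_volume)           # compare with 0.25 + 0.5 and with 1
```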

Change of Variables & The Jacobian

If we transform a variable $X \to Y = g(X)$, where $g$ is invertible and differentiable, the PDF of $Y$ must account for how $g$ stretches or compresses the coordinate space. This is achieved by multiplying by the absolute value of the derivative of the inverse map (in higher dimensions, the absolute value of its Jacobian determinant):

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d g^{-1}(y)}{d y} \right|$$
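As an illustration of the formula, the sketch below assumes $X$ is standard normal and $g(x) = e^x$, so $g^{-1}(y) = \ln y$ and the derivative factor is $1/y$. It then checks the transformed density by computing $P(Y \le 2)$ two ways: by integrating $f_Y$ numerically, and from $P(X \le \ln 2)$ using the normal CDF. The choice of $g$ and all function names are assumptions made for the example.

```python
import math

def normal_pdf(x):
    """Standard normal density f_X."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def pdf_y(y):
    """Density of Y = exp(X): f_X(ln y) * |d(ln y)/dy|."""
    if y <= 0.0:
        return 0.0                       # exp(X) is always positive
    return normal_pdf(math.log(y)) * abs(1.0 / y)

def midpoint(f, a, b, n=20_000):
    """Midpoint rule; never evaluates f exactly at the endpoints."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# P(Y <= 2) from the transformed density, and from P(X <= ln 2).
p_from_density = midpoint(pdf_y, 0.0, 2.0)
p_from_cdf = 0.5 * (1.0 + math.erf(math.log(2.0) / math.sqrt(2.0)))
print(p_from_density, p_from_cdf)
```

Dropping the $1/y$ factor would make the two numbers disagree, which is a simple way to see why the derivative term is needed.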

Density Modeling in Generative AI

Generative models attempt to learn the high-dimensional PDF of real-world data (such as human faces or audio signals) to generate new samples.

Normalizing Flows

Normalizing Flows learn complex probability densities by applying a sequence of invertible, differentiable transformations to a simple base distribution (like a standard Gaussian). The Jacobian determinant is used at each step to track the change in probability density.
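The bookkeeping can be sketched in one dimension. The toy flow below stacks two fixed affine layers (hypothetical values, not learned) on a standard Gaussian base and evaluates $\log p(x)$ by inverting the layers in reverse order while accumulating the log of each inverse derivative. Real flows use richer, trainable layers; affine layers are used here only because the result can be checked against a closed-form Gaussian.

```python
import math

def std_normal_logpdf(z):
    """Log-density of the standard Gaussian base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

# Each hypothetical layer is an invertible affine map z -> scale * z + shift.
layers = [(2.0, 1.0), (1.5, -3.0)]

def flow_logpdf(x, layers):
    """log p(x) where x is the output of all layers applied to z ~ N(0, 1)."""
    log_det = 0.0
    for scale, shift in reversed(layers):
        x = (x - shift) / scale            # invert this layer
        log_det -= math.log(abs(scale))    # log |d(inverse)/dx| for this layer
    return std_normal_logpdf(x) + log_det

def closed_form_logpdf(x):
    """The two layers compose to x = 3z - 1.5, so x ~ N(-1.5, 3^2)."""
    z = (x + 1.5) / 3.0
    return -0.5 * z * z - math.log(3.0) - 0.5 * math.log(2.0 * math.pi)

print(flow_logpdf(0.7, layers), closed_form_logpdf(0.7))
```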

Score Matching in Diffusion Models

Instead of estimating the PDF $p(x)$ directly, diffusion models learn the score function, defined as the gradient of the log-probability density with respect to the input: $\nabla_x \log p(x)$, typically for noise-perturbed versions of the data distribution. The score vector points in the direction in which the log-density increases fastest, so following it moves noisy inputs toward higher-density regions and, step by step, back toward clean samples.
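For intuition about the score, the sketch below uses a one-dimensional Gaussian as a stand-in for the data density (an assumption; a diffusion model would use a learned network instead). For this density the score is $-(x - \mu)/\sigma^2$, which the code checks against a finite difference of $\log p(x)$ before taking a few small steps along the score. Practical samplers also inject noise at each step, which is omitted here.

```python
import math

MU, SIGMA = 1.0, 0.5   # hypothetical 1-D Gaussian standing in for the data density

def log_pdf(x):
    """Log-density of N(MU, SIGMA^2)."""
    z = (x - MU) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2.0 * math.pi))

def score(x):
    """Analytic score: d/dx log p(x) = -(x - MU) / SIGMA^2."""
    return -(x - MU) / SIGMA**2

def numeric_score(x, eps=1e-5):
    """Central finite difference of log p, used only as a sanity check."""
    return (log_pdf(x + eps) - log_pdf(x - eps)) / (2.0 * eps)

start = 2.0
print(score(start), numeric_score(start))

x = start
for _ in range(20):
    x += 0.05 * score(x)   # small deterministic step in the direction of the score
print(x)                   # ends up much closer to MU than the starting point
```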

Maximum Likelihood Estimation (MLE)

MLE is a standard method for fitting a parametric continuous distribution to an observed dataset: it selects the parameters under which the observed data have the highest density.

Log-Likelihood Optimization

Given independent and identically distributed data points $x_1, \dots, x_N$, we maximize the log-likelihood function:

$$\log L(\theta) = \sum_{i=1}^{N} \log f(x_i | \theta)$$

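Because the logarithm is monotonically increasing, maximizing the log-likelihood yields the same $\theta$ as maximizing the likelihood itself. Taking the natural logarithm converts the product of densities into a sum, which simplifies differentiation and avoids the numerical underflow that occurs when many small values are multiplied.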
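The underflow problem is easy to reproduce. In the sketch below, 2000 synthetic draws from a standard normal (generated with a fixed seed, purely for illustration) are scored under a standard normal model. The raw product of densities collapses to 0.0 in double precision, while the sum of log-densities remains an ordinary finite number.

```python
import math
import random

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_logpdf(x):
    """Log of the standard normal density, computed without calling exp."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]   # synthetic data

likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x)    # each factor is below 0.4, so this shrinks fast

log_likelihood = sum(normal_logpdf(x) for x in data)

print(likelihood)       # underflows to 0.0
print(log_likelihood)   # finite negative number
```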

Gradient Updates

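To train a model, we compute the gradient of the log-likelihood with respect to the parameters $\theta$ and perform gradient ascent, $\theta \leftarrow \theta + \eta \nabla_\theta \log L(\theta)$ with learning rate $\eta$, to maximize the likelihood of the training data. In practice this is usually implemented as gradient descent on the negative log-likelihood.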
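A minimal sketch of this procedure, assuming a Gaussian model $N(\mu, \sigma^2)$ fitted to synthetic data: the parameters are $\mu$ and $\log \sigma$ (the log keeps $\sigma$ positive), and the gradients of the average log-likelihood are written out by hand. The learning rate, step count, and data are illustrative choices. For a Gaussian the MLE has a closed form (sample mean and the $1/N$ standard deviation), which the code uses as a reference.

```python
import math
import random

random.seed(0)
data = [random.gauss(3.0, 2.0) for _ in range(200)]   # synthetic data, illustration only
n = len(data)

mu, log_sigma = 0.0, 0.0   # initial parameters; sigma = exp(log_sigma)
lr = 0.1                   # learning rate (illustrative value)

for _ in range(3000):
    var = math.exp(2.0 * log_sigma)
    # Gradients of the average log-likelihood of N(mu, sigma^2)
    grad_mu = sum(x - mu for x in data) / (n * var)
    grad_log_sigma = sum((x - mu) ** 2 for x in data) / (n * var) - 1.0
    # Gradient ascent: step in the direction that increases the log-likelihood
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

sigma = math.exp(log_sigma)

# Closed-form Gaussian MLE for comparison
mean_closed = sum(data) / n
std_closed = math.sqrt(sum((x - mean_closed) ** 2 for x in data) / n)
print(mu, mean_closed)
print(sigma, std_closed)
```

For models without a closed-form solution the same loop applies, with the hand-written gradients typically replaced by automatic differentiation.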