Cumulative Distribution Functions (CDF)
While a Probability Density Function (PDF) gives the height of the density curve at a specific point, a Cumulative Distribution Function (CDF) tracks the running total of probability. It answers the fundamental question: 'What is the probability that a random variable $X$ is less than or equal to a specific value $x$?' CDFs provide an elegant way to compute interval probabilities, perform inverse transform sampling, and set decision thresholds in AI applications. We will explore cumulative probability properties, inverse CDFs, anomaly detection, and joint CDFs.
The Accumulation of Probability
The CDF, denoted as $F(x)$, acts as the running sum (for discrete variables) or running integral (for continuous variables) of the probability distribution.
Mathematical Definition
For any random variable $X$, the CDF is defined as:
$$F(x) = P(X \le x)$$
For a continuous variable with PDF $f(t)$, the CDF accumulates probability from negative infinity up to $x$:
$$F(x) = \int_{-\infty}^{x} f(t) dt$$
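As a minimal sketch of the integral definition, the snippet below approximates $F(x)$ by numerically integrating a PDF and compares it with a closed-form CDF. The exponential distribution, the trapezoid rule, and the helper names (`exp_pdf`, `exp_cdf`, `cdf_by_integration`) are illustrative choices, not part of the definition itself.

```python
import math

def exp_pdf(t, rate=1.0):
    """PDF of an exponential distribution (zero for t < 0)."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0

def exp_cdf(x, rate=1.0):
    """Closed-form CDF: F(x) = 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def cdf_by_integration(pdf, x, lower=0.0, steps=10_000):
    """Approximate F(x) with the trapezoid rule on [lower, x].

    `lower` stands in for negative infinity; that is safe here only
    because the exponential PDF is zero below 0 (an assumption of
    this example, not a general rule).
    """
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(x))
    for i in range(1, steps):
        total += pdf(lower + i * h)
    return total * h

x = 2.0
numeric = cdf_by_integration(exp_pdf, x)
exact = exp_cdf(x)
print(f"numeric={numeric:.6f}  exact={exact:.6f}")
```

The two values should agree to several decimal places; for a PDF with support on the whole real line, the lower limit would have to be pushed far enough into the left tail.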
Properties of the CDF
Every CDF exhibits three key mathematical properties:
- It is non-decreasing: as $x$ increases, $F(x)$ cannot decrease.
- It starts at 0: $\lim_{x \to -\infty} F(x) = 0$.
- It ends at 1: $\lim_{x \to \infty} F(x) = 1$.
For continuous variables, the derivative of the CDF is the PDF wherever $F$ is differentiable: $F'(x) = f(x)$.
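These properties can be checked numerically. The sketch below assumes a standard normal distribution (via Python's `statistics.NormalDist`) purely as an example. It tests monotonicity on a grid, evaluates the CDF far into each tail, compares a finite-difference slope of $F$ with the PDF, and computes an interval probability as $P(a < X \le b) = F(b) - F(a)$, which follows directly from the definition.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # example distribution (assumption)

# Non-decreasing: evaluate F on an increasing grid and compare neighbours.
grid = [i / 10 for i in range(-80, 81)]
values = [dist.cdf(x) for x in grid]
is_non_decreasing = all(a <= b for a, b in zip(values, values[1:]))

# Limits: far in the tails, F is numerically 0 and 1.
left_tail = dist.cdf(-40.0)
right_tail = dist.cdf(40.0)

# Derivative: a central difference of F should match the PDF.
x, h = 0.7, 1e-5
slope = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)
density = dist.pdf(x)

# Interval probability: P(-1 < X <= 1) = F(1) - F(-1).
interval_prob = dist.cdf(1.0) - dist.cdf(-1.0)

print(is_non_decreasing, left_tail, right_tail)
print(f"slope={slope:.6f}  pdf={density:.6f}  P(-1<X<=1)={interval_prob:.4f}")
```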
Inverse CDFs & Quantile Functions
By inverting the CDF, we can translate probabilities back into threshold values, which is extremely useful for simulation and data generation.
Quantiles and Percentiles
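The inverse CDF, $F^{-1}(p)$, is also known as the quantile function. Given a probability $p \in (0, 1)$, it returns the value $x$ such that $P(X \le x) = p$. When $F$ is continuous and strictly increasing, this value is unique; otherwise the quantile is conventionally taken to be the smallest $x$ with $F(x) \ge p$. For example, the median of a distribution is the value $x = F^{-1}(0.5)$.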
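A short illustration of quantiles, assuming a normal distribution with mean 100 and standard deviation 15 (arbitrary example parameters). Python's `statistics.NormalDist` exposes the quantile function as `inv_cdf`.

```python
from statistics import NormalDist

dist = NormalDist(mu=100.0, sigma=15.0)  # example parameters (assumption)

median = dist.inv_cdf(0.5)    # 50th percentile
p95 = dist.inv_cdf(0.95)      # 95th percentile
round_trip = dist.cdf(p95)    # applying F to F^{-1}(p) should recover p

print(f"median={median:.2f}  p95={p95:.2f}  F(p95)={round_trip:.4f}")
```

Note that `inv_cdf` accepts only probabilities strictly between 0 and 1, since the normal quantiles at the endpoints are infinite.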
Inverse Transform Sampling
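Inverse transform sampling is a fundamental technique for generating random samples. To generate a sample from a distribution with CDF $F(x)$, we first draw a uniform random number $U \sim \text{Uniform}(0, 1)$ and then compute $X = F^{-1}(U)$. The result has the desired distribution because $P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$. This method lets a computer simulate any distribution whose quantile function can be evaluated, either in closed form or numerically.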
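The procedure can be sketched for an exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form to $F^{-1}(u) = -\ln(1 - u)/\lambda$. The distribution, the rate, the seed, and the helper name `sample_exponential` are assumptions made for illustration.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one exponential sample via X = F^{-1}(U)."""
    u = rng.random()                    # U in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / rate    # inverse of F(x) = 1 - exp(-rate * x)

rng = random.Random(0)                  # fixed seed for reproducibility
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(100_000)]

# Sanity checks: the mean should be near 1/rate, and about half of the
# samples should fall below the theoretical median ln(2)/rate.
sample_mean = sum(samples) / len(samples)
theoretical_median = math.log(2.0) / rate
frac_below_median = sum(s <= theoretical_median for s in samples) / len(samples)

print(f"mean={sample_mean:.4f}  fraction below median={frac_below_median:.4f}")
```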
Anomaly Detection & Statistical Significance
In machine learning pipelines, CDFs are widely used to identify out-of-distribution inputs and compute statistical significance.
p-Values in Hypothesis Testing
The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. It is a tail probability of the test statistic's null distribution and is computed from that distribution's CDF $F$: for an observed statistic $t$, the upper-tail p-value is $1 - F(t)$.
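As a hedged example, suppose the test statistic is a z-score that follows a standard normal distribution under the null hypothesis (an assumption of this sketch; other tests use other null distributions). The one-sided and two-sided p-values then come straight from the normal CDF. The function names and the observed value are illustrative.

```python
from statistics import NormalDist

null_dist = NormalDist(mu=0.0, sigma=1.0)  # assumed null distribution

def upper_tail_p(z):
    """One-sided p-value: P(Z >= z) = 1 - F(z)."""
    return 1.0 - null_dist.cdf(z)

def two_sided_p(z):
    """Two-sided p-value: P(|Z| >= |z|) = 2 * F(-|z|) by symmetry."""
    return 2.0 * null_dist.cdf(-abs(z))

z_observed = 1.96                          # illustrative observed statistic
p_one = upper_tail_p(z_observed)
p_two = two_sided_p(z_observed)

print(f"one-sided p={p_one:.4f}  two-sided p={p_two:.4f}")
```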
CDF Thresholding for Anomaly Detection
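In anomaly detection, we compute the cumulative probability of a test input under a distribution fitted to normal (in-distribution) data. If, for example, $F(x) > 0.999$ or $F(x) < 0.001$, the input lies in the extreme tails of the distribution and is flagged as an outlier or anomaly. With these two thresholds, about 0.2% of genuinely in-distribution inputs are flagged as well, so the cutoffs set the false-alarm rate.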
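A minimal sketch of CDF thresholding, assuming the reference data are well described by a normal distribution. The sample values, the thresholds, and the helper name `is_anomaly` are hypothetical; a real pipeline would choose the model and cutoffs to suit its data.

```python
from statistics import NormalDist

# Hypothetical reference measurements collected under normal operation.
reference = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]

# Fit a normal model to the reference data (a modelling assumption).
model = NormalDist.from_samples(reference)

LOWER, UPPER = 0.001, 0.999  # tail thresholds on the cumulative probability

def is_anomaly(x, dist=model, lower=LOWER, upper=UPPER):
    """Flag x when its cumulative probability falls in either extreme tail."""
    p = dist.cdf(x)
    return p < lower or p > upper

for value in (10.05, 11.0, 9.0):
    print(value, is_anomaly(value))
```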
Joint CDFs & Copulas
To model relationships between multiple random variables, we extend the CDF to higher dimensions.
Multivariate CDFs
The joint CDF of two variables $X$ and $Y$ is defined as $F(x, y) = P(X \le x, Y \le y)$. It describes the probability that both inequalities are satisfied simultaneously.
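The joint CDF can be estimated from data as the fraction of observed pairs that satisfy both inequalities. The sketch below assumes two independent standard normal variables, simulated with a fixed seed, so the estimate of $F(0, 0)$ should land near $0.5 \times 0.5 = 0.25$. The helper name `empirical_joint_cdf` is illustrative.

```python
import random

def empirical_joint_cdf(pairs, x, y):
    """Fraction of (xi, yi) pairs with xi <= x and yi <= y."""
    count = sum(1 for xi, yi in pairs if xi <= x and yi <= y)
    return count / len(pairs)

rng = random.Random(1)  # fixed seed for reproducibility
pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50_000)]

f_joint = empirical_joint_cdf(pairs, 0.0, 0.0)
f_x = sum(1 for xi, _ in pairs if xi <= 0.0) / len(pairs)  # marginal F_X(0)
f_y = sum(1 for _, yi in pairs if yi <= 0.0) / len(pairs)  # marginal F_Y(0)

# For independent variables the joint CDF factorises: F(x, y) = F_X(x) * F_Y(y).
print(f"F(0,0)={f_joint:.4f}  F_X(0)*F_Y(0)={f_x * f_y:.4f}")
```

For dependent variables the product of the marginals no longer matches the joint CDF, which is the gap that copulas are designed to describe.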
Copulas & Dependency Modeling
Sklar's Theorem states that any multivariate joint CDF can be expressed in terms of its marginal CDFs and a copula $C$, which is a joint CDF with uniform marginals: $F(x, y) = C(F_X(x), F_Y(y))$. Copulas allow data scientists to model the dependency structure of multiple variables separately from their marginal distributions.
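One common concrete choice is the Gaussian copula, sketched below as an illustration rather than as the only option. Correlated standard normal pairs are pushed through the standard normal CDF, giving pairs $(U_1, U_2)$ whose marginals are uniform on $(0, 1)$ but which remain dependent. The correlation `rho`, the seed, and the helper name `gaussian_copula_sample` are assumptions of this example.

```python
import math
import random
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, used only for its CDF

def gaussian_copula_sample(rho, rng):
    """Return one pair (u1, u2) drawn from a Gaussian copula."""
    z1 = rng.gauss(0.0, 1.0)
    # Build z2 so that corr(z1, z2) = rho.
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    # The probability integral transform makes each marginal uniform.
    return std_normal.cdf(z1), std_normal.cdf(z2)

rng = random.Random(42)  # fixed seed for reproducibility
rho = 0.8
us = [gaussian_copula_sample(rho, rng) for _ in range(20_000)]

mean_u1 = sum(u1 for u1, _ in us) / len(us)
mean_u2 = sum(u2 for _, u2 in us) / len(us)
# Dependence check: how often both coordinates fall on the same side of 0.5.
same_side = sum((u1 > 0.5) == (u2 > 0.5) for u1, u2 in us) / len(us)

print(f"mean u1={mean_u1:.3f}  mean u2={mean_u2:.3f}  same side={same_side:.3f}")
```

Passing each coordinate through a quantile function $F_X^{-1}$ or $F_Y^{-1}$ would then impose any desired marginals while keeping this dependency structure.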