Conditional Probability Concept
Probability is not static; it changes as we acquire new information. Conditional probability is the likelihood of an event occurring given that another event has already occurred. This concept is fundamental to machine learning because training a model is essentially the process of updating our predictions based on the evidence presented by the training data. We will cover the formal definition of conditional probability, supervised learning as conditional estimation, sequence generation in LLMs, and common probabilistic paradoxes.
Restricting the Sample Space
Conditional probability restricts the sample space to only those outcomes where the given condition is true. It measures the relative likelihood of the target event within this narrowed window.
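To make the idea of a narrowed sample space concrete, here is a minimal sketch that counts outcomes for two fair dice. The events chosen (first die is even, sum equals 8) are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))

# Condition B (assumed for illustration): the first die shows an even number.
restricted = [o for o in omega if o[0] % 2 == 0]

# Target event A (assumed for illustration): the two dice sum to 8.
favourable = [o for o in restricted if sum(o) == 8]

# Relative likelihood of A inside the restricted sample space.
p_a_given_b = Fraction(len(favourable), len(restricted))

# Unconditional probability of A over the full sample space, for comparison.
p_a = Fraction(sum(1 for o in omega if sum(o) == 8), len(omega))

print(p_a_given_b)  # 1/6
print(p_a)          # 5/36
```

Only the 18 outcomes that satisfy the condition are counted in the denominator, which is why the conditional value differs from the unconditional one.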
The Conditional Probability Formula
The conditional probability of event $A$ given event $B$ (written as $P(A | B)$) is defined as the ratio of their joint probability to the probability of the condition $B$:
$$P(A | B) = \frac{P(A \cap B)}{P(B)}$$
This formula is valid only when the probability of the condition is positive ($P(B) > 0$).
Geometric Interpretation
In a Venn diagram, conditioning on $B$ shrinks the entire universe $\Omega$ to the set $B$. We then measure what fraction of $B$ is occupied by the intersection $A \cap B$. If $A$ and $B$ are independent, $A$ occupies the same fraction of $B$ as it does of $\Omega$, so conditioning changes nothing: $P(A|B) = P(A)$.
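The ratio and the independence check can be sketched with plain sets. The card-deck events below and the helper names `prob` and `cond_prob` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# The universe: a standard 52-card deck, each card equally likely.
deck = set(product(RANKS, SUITS))


def prob(event):
    """Probability of an event (a subset of the deck) under a uniform draw."""
    return Fraction(len(event), len(deck))


def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    if not b:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)


aces = {card for card in deck if card[0] == "A"}
hearts = {card for card in deck if card[1] == "hearts"}
red = {card for card in deck if card[1] in ("hearts", "diamonds")}

# Independent pair: knowing the suit says nothing about the rank.
print(cond_prob(aces, hearts), prob(aces))  # 1/13 1/13

# Dependent pair: knowing the card is red doubles the chance of a heart.
print(cond_prob(hearts, red), prob(hearts))  # 1/2 1/4
```

Set intersection plays the role of $A \cap B$, and dividing by `prob(b)` is the shrinking of the universe to $B$.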
Supervised Learning as Conditional Estimation
AI classifiers are conditional probability estimators. When we feed an input into a network, we are asking it to compute a conditional probability distribution over the outputs.
Discriminative Modeling
In supervised classification, the model's goal is to learn the conditional probability $P(Y | X)$: the probability of target label $Y$ given input features $X$. For example, an image classifier estimates $P(\text{cat} | \text{image pixels})$ by applying a softmax function in its output layer, which converts the network's raw scores into a probability distribution over the labels.
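As a rough illustration of how an output layer can turn raw scores into an estimate of $P(Y | X)$, the sketch below implements softmax in plain Python. The label set and the logit values are made up for the example; a real classifier would produce the logits from the image.

```python
import math


def softmax(logits):
    """Map raw scores to a probability distribution that sums to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and raw scores for a single input image.
labels = ["cat", "dog", "bird"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
p_y_given_x = dict(zip(labels, probs))

for label, p in p_y_given_x.items():
    print(f"P({label} | image) = {p:.3f}")
```

The ordering of the logits is preserved, so the label with the highest raw score also receives the highest conditional probability.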
Deriving Cross-Entropy Loss
Minimizing cross-entropy loss against one-hot labels is mathematically equivalent to maximizing the conditional log-likelihood of the training data. Given data points $(x_i, y_i)$, the loss function is:
$$\mathcal{L} = -\sum_{i} \log P(y_i | x_i)$$
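Because this loss is differentiable with respect to the model's parameters, gradient descent can minimize it directly, which turns a probabilistic objective into an optimization problem.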
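The loss above can be checked numerically. This sketch assumes a tiny hypothetical dataset in which the model's predicted distributions are already given; it compares the summed negative log-probabilities with the negative log of the joint likelihood.

```python
import math


def negative_log_likelihood(predicted, targets):
    """Sum of -log P(y_i | x_i) over the dataset."""
    return -sum(math.log(dist[y]) for dist, y in zip(predicted, targets))


# Hypothetical model outputs P(. | x_i) for three inputs, with their true labels.
predicted = [
    {"cat": 0.7, "dog": 0.3},
    {"cat": 0.2, "dog": 0.8},
    {"cat": 0.9, "dog": 0.1},
]
targets = ["cat", "dog", "cat"]

loss = negative_log_likelihood(predicted, targets)

# Likelihood of the labels, treating the examples as independent.
likelihood = math.prod(dist[y] for dist, y in zip(predicted, targets))

# The loss equals the negative log of that likelihood (up to rounding),
# so lowering the loss is the same as raising the likelihood.
print(loss, -math.log(likelihood))
```

A model that assigns lower probability to the true labels produces a larger loss, which is the signal gradient descent follows.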
Autoregressive Sequence Generation
Autoregressive Large Language Models (LLMs) generate text one token at a time, drawing each token from a conditional probability distribution.
Next-Token Prediction
An LLM predicts the next token $w_t$ in a sequence by evaluating its conditional probability given all preceding tokens:
$$P(w_t | w_1, w_2, \dots, w_{t-1})$$
During generation, the model samples a token from this conditional distribution, appends it to the context, and repeats the process.
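The sample, append, repeat loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: the lookup table `NEXT_TOKEN_PROBS` replaces a real network and, as a simplification, conditions only on the most recent token rather than on the full history.

```python
import random

# Hypothetical toy "model": a next-token distribution keyed by the last token.
# A real LLM would compute this distribution from the whole context.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(context):
    """Stand-in for P(w_t | w_1, ..., w_{t-1}); uses only the last token."""
    return NEXT_TOKEN_PROBS[context[-1]]


def generate(prompt, max_new_tokens=10, seed=0):
    """Sample one token at a time, appending each to the running context."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        tokens.append(token)
        if token == "</s>":  # stop once the end-of-sequence token is drawn
            break
    return tokens


print(" ".join(generate(["<s>"])))
```

Changing the seed changes the sampled sequence, but every step is still a draw from the conditional distribution for the current context.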
Context Window Limits
The attention mechanism of a transformer relates each token to the other tokens in the context window. The size of this window bounds how many preceding tokens can appear in the conditioning term $w_1, \dots, w_{t-1}$; tokens that fall outside the window no longer influence the prediction.
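One simple way to picture the limit is a sliding window that keeps only the most recent tokens. The helper below is a hypothetical sketch; real systems may truncate, summarize, or otherwise manage long inputs differently.

```python
def conditioning_context(tokens, window_size):
    """Return the tokens that still fit in a window of the given size."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Keep only the most recent window_size tokens; older ones are dropped
    # and therefore cannot affect the next-token distribution.
    return tokens[-window_size:]


history = ["w1", "w2", "w3", "w4", "w5", "w6"]

print(conditioning_context(history, 4))   # ['w3', 'w4', 'w5', 'w6']
print(conditioning_context(history, 10))  # the whole history fits
```

With a window of 4, the prediction for the next token is conditioned on `w3` through `w6` only.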
Probabilistic Paradoxes in AI Analytics
Human intuition often struggles with conditional probability, leading to statistical errors that data scientists must guard against.
The Base Rate Fallacy
The Base Rate Fallacy occurs when people ignore the prior probability of an event when estimating its conditional probability. For example, suppose a test for a rare disease has a 99% true positive rate and a 1% false positive rate, and the disease affects 1 in 10,000 people. The probability of having the disease given a positive test is then only about 1%, because false positives from the large healthy population far outnumber true positives. Data scientists use Bayes' Theorem to avoid this pitfall.
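A short calculation with Bayes' Theorem shows how strongly the base rate matters. The prevalence (1 in 10,000) and the 1% false positive rate are assumed values for illustration; only the 99% true positive rate comes from the example in the text.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(disease | positive test) via Bayes' Theorem."""
    # Total probability of a positive result: true positives plus false positives.
    p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_positive


# Assumed numbers: 1 in 10,000 prevalence, 99% sensitivity, 1% false positives.
rare = posterior(prior=0.0001, true_positive_rate=0.99, false_positive_rate=0.01)

# Same test, but applied where the condition is common (assumed 10% prevalence).
common = posterior(prior=0.10, true_positive_rate=0.99, false_positive_rate=0.01)

print(f"rare disease:   {rare:.4f}")    # 0.0098
print(f"common disease: {common:.4f}")
```

The test itself is identical in both calls; only the prior changes, and with it the meaning of a positive result.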
Simpson's Paradox in Machine Learning
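Simpson's Paradox describes a phenomenon where a trend appears in several different groups of data but disappears or reverses when these groups are combined. It arises when a confounding variable is unevenly distributed across the groups, so the aggregate comparison mixes the effect of interest with differences in group composition. This highlights the importance of conditioning on the right features before drawing causal conclusions.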
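The reversal is easy to reproduce with made-up numbers. In this hypothetical evaluation, two models are scored on easy and hard examples, and the counts are invented purely to exhibit the effect.

```python
from fractions import Fraction

# Hypothetical evaluation counts: (correct, total) per difficulty group.
results = {
    "model_a": {"easy": (90, 100), "hard": (300, 500)},
    "model_b": {"easy": (425, 500), "hard": (55, 100)},
}


def accuracy(correct, total):
    """Exact accuracy as a fraction, to avoid rounding in comparisons."""
    return Fraction(correct, total)


def pooled_accuracy(groups):
    """Accuracy after merging all groups, ignoring difficulty."""
    correct = sum(c for c, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return Fraction(correct, total)


for model, groups in results.items():
    per_group = {g: float(accuracy(*counts)) for g, counts in groups.items()}
    print(model, per_group, "pooled:", float(pooled_accuracy(groups)))
```

Here `model_a` is more accurate within each group (90% vs 85% on easy, 60% vs 55% on hard), yet its pooled accuracy is 65% against 80% for `model_b`, because `model_a` was evaluated mostly on hard examples. Difficulty is the confounder, and conditioning on it recovers the within-group comparison.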