Random Variables: Discrete vs. Continuous

A random variable is a function that assigns a numerical value to each outcome of a random process. Rather than working with qualitative events directly, random variables let us express uncertainty as numbers that can be computed with. Random variables are classified as discrete (taking distinct, countable values) or continuous (taking any real value within an interval), and each type requires a different mathematical treatment. This section covers the formal mapping, probability mass functions, probability density functions, and how both types are represented in machine learning.

Formalizing Random Outcomes

A random variable $X$ is not a variable in the traditional algebraic sense. Instead, it is a function that maps outcomes from an underlying sample space to real numbers.

The Sample Space Mapping

Mathematically, we define a random variable as a function $X: \Omega \to \mathbb{R}$, where $\Omega$ is the sample space containing all possible outcomes of a random process, and $\mathbb{R}$ is the set of real numbers. For example, if we flip a coin, the sample space is $\Omega = \{\text{heads}, \text{tails}\}$. We can define a random variable $X$ such that $X(\text{heads}) = 1$ and $X(\text{tails}) = 0$. This maps each qualitative outcome to a numeric value.
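The coin-flip mapping can be sketched in a few lines of Python. This is only an illustration of the idea that $X$ is a function on the sample space; the names `SAMPLE_SPACE` and `X` are chosen here for the example.

```python
# Sample space for a single coin flip: the outcomes are qualitative labels.
SAMPLE_SPACE = ("heads", "tails")


def X(outcome: str) -> int:
    """Random variable X: maps each outcome in the sample space to a number."""
    mapping = {"heads": 1, "tails": 0}
    return mapping[outcome]


# Applying X to every outcome gives the set of values X can take.
values = [X(outcome) for outcome in SAMPLE_SPACE]
print(values)  # [1, 0]
```

Note that nothing random happens inside `X` itself. The randomness lies in which outcome occurs; the random variable only translates that outcome into a number.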

Notation Conventions

By convention, uppercase letters (such as $X, Y, Z$) represent the random variable as an abstract function. Lowercase letters (such as $x, y, z$) represent specific numerical values that the random variable can take. For instance, $P(X = x)$ denotes the probability that the random variable $X$ takes the specific value $x$.

Discrete Random Variables & PMFs

A discrete random variable is one that can take on only a finite or countably infinite set of distinct values. The probability distribution of a discrete random variable is described by a Probability Mass Function (PMF).

Probability Mass Function (PMF)

For a discrete random variable $X$, the PMF, written as $P(X = x)$ or $p(x)$, gives the probability that the variable takes the exact value $x$. A valid PMF must satisfy two conditions: the probability of each outcome must be non-negative, and the sum of all probabilities must equal exactly 1:

$$\sum_{i} P(X = x_i) = 1$$
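As a minimal sketch, the two PMF conditions can be checked directly in Python. The fair six-sided die used here is an assumed example, and `is_valid_pmf` is a hypothetical helper name. Exact fractions are used so the sum-to-one check is not affected by floating-point rounding.

```python
from fractions import Fraction

# Assumed example: PMF of a fair six-sided die, P(X = x) = 1/6 for x = 1..6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def is_valid_pmf(p: dict) -> bool:
    """Check both PMF conditions: non-negative entries that sum to exactly 1."""
    non_negative = all(prob >= 0 for prob in p.values())
    sums_to_one = sum(p.values()) == 1
    return non_negative and sums_to_one


print(is_valid_pmf(pmf))  # True
```

With floating-point probabilities, an exact comparison to 1 is usually replaced by a tolerance check such as `math.isclose`.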

Expectation and Variance

The Expected Value (or mean) of a discrete random variable represents its probability-weighted average:

$$E[X] = \sum_{i} x_i P(X = x_i)$$

The variance measures the spread of the variable around its expected value, defined as:

$$\operatorname{Var}(X) = E[(X - E[X])^2] = \sum_{i} (x_i - E[X])^2 P(X = x_i)$$
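The two sums above translate almost line for line into code. The sketch below again assumes a fair six-sided die as the example distribution; `expectation` and `variance` are illustrative helper names.

```python
from fractions import Fraction

# Assumed example: PMF of a fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(p: dict) -> Fraction:
    """E[X] = sum over x of x * P(X = x)."""
    return sum(x * prob for x, prob in p.items())


def variance(p: dict) -> Fraction:
    """Var(X) = sum over x of (x - E[X])^2 * P(X = x)."""
    mu = expectation(p)
    return sum((x - mu) ** 2 * prob for x, prob in p.items())


print(expectation(pmf))  # 7/2
print(variance(pmf))     # 35/12
```

The expected value 7/2 is not a value the die can actually show, which is a useful reminder that the expectation is a weighted average rather than a typical outcome.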

Continuous Random Variables & PDFs

A continuous random variable can take any real value within an interval. Its distribution spreads probability smoothly over uncountably many values, so the probability of the variable taking any single exact value is zero. Instead, we measure the probability that the variable falls within a range.

Probability Density Function (PDF)

For continuous random variables, we use a Probability Density Function (PDF), denoted $f(x)$. The height of $f(x)$ is a density that indicates relative likelihood, not a probability, and it can exceed 1. The probability that $X$ falls in a range $[a, b]$ is computed by integrating the PDF over that range:

$$P(a \le X \le b) = \int_{a}^{b} f(x) dx$$

A valid PDF must satisfy $f(x) \ge 0$ for all $x$, and the total area under the curve must equal exactly 1:

$$\int_{-\infty}^{\infty} f(x) dx = 1$$
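To make the integrals concrete, here is a small numerical sketch. It assumes a normal distribution as the example PDF and uses a simple midpoint rule in place of an exact integral; `normal_pdf` and `integrate` are names introduced for this illustration, and the interval $[-10, 10]$ stands in for the whole real line because the density is negligible outside it.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a: float, b: float, n: int = 10_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# P(-1 <= X <= 1) for a standard normal: approximately 0.6827.
p_within_one_sigma = integrate(normal_pdf, -1.0, 1.0)

# Total area under the curve: approximately 1.
total_area = integrate(normal_pdf, -10.0, 10.0)

# A density value is not a probability: with sigma = 0.1 the peak is about 3.99.
peak_density = normal_pdf(0.0, sigma=0.1)

print(round(p_within_one_sigma, 4), round(total_area, 4), round(peak_density, 2))
```

The last value shows why $f(x)$ should not be read as a probability: a narrow distribution has a tall peak, yet the area under the whole curve is still 1.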

Continuous Expectation and Variance

For continuous variables, expectation and variance are computed by replacing the discrete sums with integrals:

$$E[X] = \int_{-\infty}^{\infty} x f(x) dx$$

$$\operatorname{Var}(X) = \int_{-\infty}^{\infty} (x - E[X])^2 f(x) dx$$
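The same numerical approach gives a quick check of these formulas. The sketch below assumes a uniform distribution on $[2, 6]$, for which the known closed forms are $E[X] = (a + b)/2 = 4$ and $\operatorname{Var}(X) = (b - a)^2/12 = 4/3$. The integrals run over $[a, b]$ only, since the density is zero elsewhere; the helper names are illustrative.

```python
A, B = 2.0, 6.0  # assumed endpoints of the uniform distribution


def uniform_pdf(x: float) -> float:
    """Density of Uniform(A, B): constant on [A, B], zero elsewhere."""
    return 1.0 / (B - A) if A <= x <= B else 0.0


def integrate(f, a: float, b: float, n: int = 100_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# E[X] = integral of x * f(x) dx
mean = integrate(lambda x: x * uniform_pdf(x), A, B)

# Var(X) = integral of (x - E[X])^2 * f(x) dx
var = integrate(lambda x: (x - mean) ** 2 * uniform_pdf(x), A, B)

print(round(mean, 4), round(var, 4))  # approximately 4.0 and 1.3333
```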

Representation in Machine Learning

Machine learning models must handle both discrete and continuous variables, requiring different representations in the data pipeline.

Discrete Features: Categorical & Token Representations

Discrete variables are common in classification tasks (e.g., class labels) and in natural language processing (NLP), where words are represented as discrete tokens from a vocabulary. Neural networks operate on real-valued vectors, so each token is mapped to a dense, continuous vector representation called an embedding (a word embedding when the tokens are words).
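A minimal sketch of this token-to-vector mapping follows. The three-word vocabulary, the embedding size of 4, and the random initial values are all assumptions made for illustration; in a trained model the table entries are learned parameters, and a framework layer would normally perform the lookup.

```python
import random

# Hypothetical toy vocabulary: each discrete token gets an integer id.
vocab = {"the": 0, "cat": 1, "sat": 2}
EMBED_DIM = 4  # assumed embedding size for this example

# Embedding table: one row of EMBED_DIM floats per token id.
# Values are random here; in a real model they would be learned.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]


def embed(tokens: list) -> list:
    """Map each discrete token to its continuous vector by table lookup."""
    return [embedding_table[vocab[token]] for token in tokens]


vectors = embed(["the", "cat", "sat"])
print(len(vectors), len(vectors[0]))  # 3 4
```

The lookup is equivalent to multiplying a one-hot vector by the embedding table, which is why the discrete id can be treated as an index rather than as a numeric magnitude.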

Continuous Features: Regression & Input Scaling

Continuous variables often represent measured quantities, such as temperatures, prices, or pixel intensities. Regression models predict continuous outputs, such as the steering angle of an autonomous vehicle. Continuous inputs are typically normalized or standardized (e.g., scaled to $[0,1]$ or z-scored), which helps keep gradient updates stable in neural networks.
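Both scaling schemes can be sketched with the standard library alone. The temperature values are made-up sample data, and the helpers assume the input is not constant (otherwise the denominators would be zero). In practice the statistics are computed on the training set and reused for validation and test data.

```python
import statistics


def min_max_scale(values: list) -> list:
    """Rescale values linearly to [0, 1]. Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values: list) -> list:
    """Standardize to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumes sigma > 0
    return [(v - mu) / sigma for v in values]


temperatures = [12.0, 18.0, 24.0, 30.0]  # made-up sample data
scaled = min_max_scale(temperatures)
standardized = z_score(temperatures)

print(scaled[0], scaled[-1])  # 0.0 1.0
```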