Measures of Central Tendency (Mean, Median, Mode)

Measures of central tendency are statistical metrics used to identify a single representative value that summarizes the center or typical value of a distribution. While the mean, median, and mode are often taught as simple averages, they represent different mathematical optimizations and respond uniquely to skewness and outliers. In machine learning, these measures are foundational for data preprocessing, input normalization, handling missing values, and loss function design.

Depending on the geometry of the data and the presence of extreme values, one measure may be strongly preferred over another. For instance, the mean minimizes squared distances, which makes it well suited to symmetric, Gaussian-like data but highly vulnerable to outliers. The median minimizes absolute distances, which makes it robust to extreme values. The mode marks the point of maximum density. The sections below examine these differences and how each measure is used in machine learning.

The Averages: Arithmetic, Geometric, and Harmonic Means

The mean represents the mathematical 'center of gravity' of a distribution. However, there are multiple formulations of the mean, each optimized for different data properties and mathematical objectives in machine learning.

Using the wrong type of mean can lead to distorted models and incorrect evaluations, especially when dealing with exponential growth, percentages, or highly skewed rates like precision and recall.

Arithmetic Mean and MSE Optimization

The arithmetic mean of a sample is the sum of all values divided by $n$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

In machine learning, the arithmetic mean is the optimal constant prediction under Mean Squared Error (MSE) loss. Let us derive this property. Suppose we seek a constant value $c$ that minimizes the sum of squared deviations:

$$J(c) = \sum_{i=1}^{n} (x_i - c)^2$$

To find the value of $c$ that minimizes $J(c)$, we take the derivative of $J(c)$ with respect to $c$ and set it to zero:

$$\frac{dJ}{dc} = \frac{d}{dc} \left( \sum_{i=1}^{n} (x_i - c)^2 \right) = \sum_{i=1}^{n} 2(x_i - c)(-1) = -2 \sum_{i=1}^{n} (x_i - c) = 0$$

Simplifying this expression:

$$\sum_{i=1}^{n} (x_i - c) = 0 \implies \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} c = 0 \implies \sum_{i=1}^{n} x_i - nc = 0 \implies c = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}$$

Applied conditionally on the inputs, the same argument explains why regression models trained with MSE loss predict the conditional expectation (mean) of the target variable, $E[Y|X]$. Because each residual is squared, a point far from the bulk of the data contributes disproportionately to the loss, which is why MSE-trained regression models are pulled heavily toward outliers.
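The result of the derivation can be checked numerically. The following is a minimal sketch, assuming a small hypothetical sample and a brute-force grid of candidate constants; the names `sum_squared_error`, `sample`, and `candidates` are illustrative and not part of any library.

```python
from statistics import mean


def sum_squared_error(values, c):
    """J(c): sum of squared deviations of the values from a constant c."""
    return sum((x - c) ** 2 for x in values)


# Hypothetical sample, chosen only for illustration.
sample = [2.0, 3.0, 5.0, 10.0]
x_bar = mean(sample)  # (2 + 3 + 5 + 10) / 4 = 5.0

# Evaluate J(c) on a grid of candidates from 0.00 to 12.00 in steps of 0.01.
candidates = [i / 100 for i in range(0, 1201)]
best_c = min(candidates, key=lambda c: sum_squared_error(sample, c))

print(x_bar, best_c)  # both are 5.0
```

The grid search is only a sanity check on the closed-form result; in practice the mean is computed directly.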

Geometric and Harmonic Means in AI

The geometric mean is defined as the $n$-th root of the product of $n$ positive numbers:

$$\text{GM} = \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln x_i \right)$$

It suits variables that grow multiplicatively or represent ratios, because averaging in log space makes it less sensitive to very large values than the arithmetic mean. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals:

$$\text{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

It is used when averaging rates. A vital application in ML is the F1-Score, which is the harmonic mean of precision ($P$) and recall ($R$):

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2}{\frac{R + P}{PR}} = \frac{2PR}{P+R}$$

Unlike the arithmetic mean, the F1-Score drops precipitously if either precision or recall is close to 0. For example, if precision is 1.0 and recall is 0.01, the arithmetic mean is $0.505$, while the harmonic mean is $F_1 = \frac{2 \times 1.0 \times 0.01}{1.0 + 0.01} \approx 0.0198$. This prevents a model from masking poor performance on one metric with high performance on the other. For positive numbers, the AM-GM-HM inequality gives $\text{AM} \ge \text{GM} \ge \text{HM}$, with equality only when all values are equal.
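The precision and recall example can be reproduced with Python's standard `statistics` module. This is a small sketch using the values from the text (precision 1.0, recall 0.01); the variable names are illustrative.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Values taken from the worked example in the text.
precision, recall = 1.0, 0.01

am = mean([precision, recall])            # arithmetic mean, 0.505
gm = geometric_mean([precision, recall])  # geometric mean, approximately 0.1
hm = harmonic_mean([precision, recall])   # harmonic mean, approximately 0.0198

# F1 written out explicitly as 2PR / (P + R); it should match the harmonic mean.
f1 = 2 * precision * recall / (precision + recall)

print(f"AM={am:.4f}  GM={gm:.4f}  HM={hm:.4f}  F1={f1:.4f}")
```

The three values come out in the order AM, GM, HM from largest to smallest, consistent with the inequality for positive inputs.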

The Median and Robust Statistics

While the mean is optimal under MSE, it is highly sensitive to outliers. The median represents the middle value of a sorted dataset, serving as the foundation of robust statistics.

Because the median focuses on ordinal positioning rather than numerical sums, it resists the pull of extremely large or small errors. This makes it an ideal central estimator in hostile or noisy environments.

Median Formulation and MAE Optimization

For a sorted sample $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, the median is the value that splits the distribution into two equal halves. Mathematically, the median minimizes the sum of absolute deviations:

$$J(c) = \sum_{i=1}^{n} |x_i - c|$$

Let us derive this. The derivative of $|u|$ is $\text{sign}(u)$ (defined as 1 if $u>0$, -1 if $u<0$, and undefined at 0). The subgradient condition with respect to $c$ is:

$$\frac{\partial J}{\partial c} = \frac{\partial}{\partial c} \sum_{i=1}^{n} |x_i - c| = -\sum_{i=1}^{n} \text{sign}(x_i - c) = 0$$

The signum sum is zero when the number of terms with $x_i > c$ (each contributing $+1$) equals the number of terms with $x_i < c$ (each contributing $-1$). This occurs precisely when $c$ splits the sorted data into two equal parts, which is the median. For odd $n$ the minimizer is the single middle value; for even $n$ every point between the two middle values minimizes $J(c)$, and the midpoint is the conventional choice. Thus, models trained with Mean Absolute Error (MAE) loss predict the conditional median of the target distribution, making them robust to outliers. In contrast, MSE regression models predict the conditional mean, which extreme values can shift substantially.
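A brute-force check makes the MAE property concrete. The sketch below assumes a hypothetical five-point sample with one large outlier and searches a grid of candidate constants; `sum_abs_error` and the sample values are illustrative only.

```python
from statistics import mean, median


def sum_abs_error(values, c):
    """J(c): sum of absolute deviations of the values from a constant c."""
    return sum(abs(x - c) for x in values)


# Hypothetical sample with an odd number of points and one large outlier.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
med = median(sample)  # middle value of the sorted sample, 3.0
avg = mean(sample)    # pulled toward the outlier, 22.0

# Evaluate J(c) on a grid from 0.0 to 100.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 1001)]
best_c = min(candidates, key=lambda c: sum_abs_error(sample, c))

print(med, best_c, avg)  # the grid minimizer coincides with the median
```

With an odd sample size the minimizer is unique, which is why the grid search lands exactly on the median here.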

Robustness and the Breakdown Point

In robust statistics, the breakdown point of an estimator is the smallest proportion of contaminated observations (e.g., arbitrarily large outliers) that can drive the estimate arbitrarily far from its uncontaminated value. The arithmetic mean has a finite-sample breakdown point of $1/n$, which tends to 0 as $n$ grows: a single sufficiently large outlier can drag the mean as far as desired. In contrast, the median has a breakdown point of $0.5$ (or $50\%$), meaning roughly half the data must be corrupted before the median breaks down. In AI pipelines, the median is commonly used for missing-data imputation (filling in NaNs) when the feature distribution is highly skewed, so that outliers do not bias the imputed values.
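The contrast between the two breakdown points, and the use of the median for imputation, can be illustrated with a few lines of standard-library Python. The data below are hypothetical and chosen only to make the effect visible.

```python
import math
from statistics import mean, median

# Hypothetical clean sample, then the same sample with one corrupted value.
clean = [10.0, 11.0, 12.0, 13.0, 14.0]
corrupted = clean[:-1] + [1e9]  # replace the last observation with an outlier

mean_clean, mean_corrupted = mean(clean), mean(corrupted)
median_clean, median_corrupted = median(clean), median(corrupted)

print(mean_clean, mean_corrupted)      # the mean moves by orders of magnitude
print(median_clean, median_corrupted)  # the median is unchanged

# Median imputation on a skewed feature with one missing value (NaN).
feature = [1.0, 2.0, float("nan"), 3.0, 1000.0]
observed = [v for v in feature if not math.isnan(v)]
fill_value = median(observed)  # median of [1, 2, 3, 1000] is 2.5
imputed = [fill_value if math.isnan(v) else v for v in feature]

print(imputed)
```

Imputing with the mean of the observed values instead would fill the gap with 251.5, a value far from the bulk of the feature.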

The Mode and Modal Distributions

The mode is the most frequently occurring value in a discrete dataset, or the point of maximum probability density in a continuous distribution.

Unlike the mean and median, the mode can be applied to categorical data, and distributions can have multiple modes, indicating distinct underlying subpopulations.

Discrete and Continuous Modes

For a discrete random variable, the mode is the value $x$ that maximizes the probability mass function $P(X=x)$. For a continuous random variable with probability density function $f(x)$, the mode is defined as:

$$\text{mode} = \arg\max_x f(x)$$

A distribution with a single peak is unimodal (such as the normal distribution). If a distribution has two distinct peaks, it is bimodal (often indicating a mixture of two distinct populations), and if it has more, it is multimodal. In ML, multimodal data distributions are common, and modeling them typically calls for architectures such as Mixture Density Networks (MDNs) or generative models such as Gaussian Mixture Models (GMMs), VAEs, and GANs. Standard regression trained with MSE loss fails on multimodal targets because it predicts the conditional mean, which may lie in a low-probability region between two modes.
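A tiny discrete example shows how the average of a bimodal target can land where no data exist. This is a sketch on hypothetical values; `targets` stands in for the target variable of a regression problem with two subpopulations.

```python
from collections import Counter
from statistics import mean, multimode

# Hypothetical bimodal targets: one group near 1, another near 9.
targets = [1, 1, 1, 2, 8, 9, 9, 9]

avg = mean(targets)         # 40 / 8 = 5
modes = multimode(targets)  # most frequent values, in order of first appearance

# How many observations actually sit at the average?
counts = Counter(targets)
count_at_mean = counts[avg]

print(avg, modes, count_at_mean)  # the average lies between the modes, at a value never observed
```

A constant predictor trained with squared error would output the average here, even though no target value ever occurs at it.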

Mode in AI: Majority Voting, Text Generation, and Mode Collapse

The mode is the only measure of central tendency applicable to nominal or categorical data (e.g., class labels). In ensemble learning, such as a Random Forest or Voting Classifier, the final prediction is the mode (majority vote) of the predictions of the individual models. Furthermore, in autoregressive language models (like GPT), decoding strategies often leverage the mode. The greedy decoding strategy selects the token with the highest predicted probability (the mode of the output softmax distribution) at each step:

$$\hat{w}_t = \arg\max_w P(w \mid w_{<t})$$

To avoid repetitive and deterministic outputs, strategies such as temperature scaling and nucleus sampling (Top-$p$) sample from the distribution rather than always selecting the mode, introducing diversity into generated text. In generative modeling, mode collapse is a common failure in which a Generative Adversarial Network (GAN) learns to produce samples from only a few modes of the target distribution and ignores other valid categories (e.g., generating only one digit class in MNIST). It arises when the generator exploits a local minimum in the minimax optimization game.
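The following sketch illustrates majority voting, greedy decoding, and temperature sampling on toy data. The class labels, three-word vocabulary, and logits are hypothetical, and `softmax` is a hand-written helper rather than a library function.

```python
import math
import random
from statistics import mode

# Majority vote: the ensemble prediction is the mode of the individual votes.
votes = ["cat", "dog", "cat", "bird", "cat"]
majority = mode(votes)  # "cat"


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits over a toy vocabulary.
vocab = ["the", "a", "cat"]
logits = [2.0, 1.0, 0.5]

# Greedy decoding: pick the mode of the softmax distribution.
probs = softmax(logits)
greedy_token = vocab[max(range(len(vocab)), key=lambda i: probs[i])]

# Temperature sampling: draw from a flattened distribution instead.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
warm_probs = softmax(logits, temperature=1.5)
sampled_token = rng.choices(vocab, weights=warm_probs, k=1)[0]

print(majority, greedy_token, sampled_token)
```

Greedy decoding always returns the same token for the same logits, while the sampled token varies with the seed and the temperature.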

Feature Scaling, Normalization, and Mode-Finding

Measures of central tendency are critical for standardizing data inputs to ensure stable gradient updates, and they form the algorithmic engine of certain clustering methods.

Centering inputs around zero changes the optimization landscape of neural networks, while locating modes allows us to discover cluster structures without specifying their count in advance.

Standardization and Mean-Centering

Standardization (Z-Score normalization) scales a feature to have a mean of 0 and a standard deviation of 1:

$$z = \frac{x - \mu}{\sigma}$$

In deep learning, mean-centering the inputs is important. When all input features are positive, the gradients for the weights feeding into any single first-layer neuron share the same sign during backpropagation (all positive or all negative), because each weight gradient is the input $x_j > 0$ multiplied by the same upstream error term. The weights of that neuron can then only increase together or decrease together in a given step, which forces updates to follow a zig-zag path and slows convergence. Centering the data around a mean of 0 removes this constraint and makes optimization smoother.
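Z-score standardization can be sketched with the standard library alone. The example assumes the population standard deviation (dividing by $n$) for $\sigma$; the function name `standardize` and the sample feature are illustrative.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores (x - mu) / sigma using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mu) / sigma for x in values]


# Hypothetical all-positive feature with mean 5 and population std 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(feature)

print(z)                    # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(fmean(z), pstdev(z))  # mean 0 and standard deviation 1, up to rounding
```

In a real pipeline, $\mu$ and $\sigma$ would be estimated on the training split only and then reused to transform validation and test data.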

Mean Shift Clustering and Mode-Finding

The Mean Shift algorithm is a non-parametric clustering technique that finds clusters by locating modes in the data density. It works by defining a kernel window around each data point and calculating the mean of the points within that window. The window is then shifted to this new mean. Mathematically, given a set of points, we estimate the density using a kernel $K(x)$. We seek the mode where the gradient of the density estimator is zero: $\nabla f(x) = 0$. Taking the derivative of the kernel density estimator leads to the shift vector $m(x)$:

$$m(x) = \frac{\sum_{i=1}^{n} x_i K\left( \frac{x - x_i}{h} \right)}{\sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right)} - x$$

where $K$ is a kernel function (such as a Gaussian kernel) and $h$ is the bandwidth. Each point is moved by $m(x)$, and the process is repeated until the points converge to the local density modes (peaks). Points that converge to the same mode are grouped into the same cluster, so the algorithm handles arbitrarily shaped clusters without a predefined cluster count, although the bandwidth $h$ implicitly controls how many modes are found.
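The update rule can be sketched in one dimension with a Gaussian kernel. This is a minimal illustration rather than a production implementation: the data, the bandwidth `h = 0.5`, the tolerance, and the rule of merging modes closer than `h / 2` are all assumptions made for the example.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel K(u) = exp(-u^2 / 2)."""
    return math.exp(-0.5 * u * u)


def shifted_mean(x, points, h):
    """Kernel-weighted mean of the points around x, i.e. x + m(x)."""
    weights = [gaussian_kernel((x - p) / h) for p in points]
    return sum(w * p for w, p in zip(weights, points)) / sum(weights)


def find_mode(x, points, h, tol=1e-6, max_iter=500):
    """Repeatedly shift x to the weighted mean until it stops moving."""
    for _ in range(max_iter):
        new_x = shifted_mean(x, points, h)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


# Hypothetical 1-D data with two well-separated groups.
points = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
h = 0.5  # bandwidth (assumed)

modes = [find_mode(p, points, h) for p in points]

# Group points whose modes lie within h / 2 of an existing cluster center.
centers, labels = [], []
for m in modes:
    for k, c in enumerate(centers):
        if abs(m - c) < h / 2:
            labels.append(k)
            break
    else:
        centers.append(m)
        labels.append(len(centers) - 1)

print(labels)  # first five points share one label, last five share another
```

Changing `h` changes the result: a much larger bandwidth would smooth the two groups into a single mode, while a much smaller one could split each group further.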