Measures of Dispersion (Variance, Standard Deviation)

Measures of central tendency only tell us part of the story. Two datasets can have identical means but very different spreads. Measures of dispersion quantify the spread, variability, or risk associated with a distribution. Understanding variance, standard deviation, and quantile ranges is essential for initializing deep learning weights, stabilizing network training, standardizing features, and performing outlier detection.

In probability theory, dispersion is related to the second moment of a distribution. We must distinguish between population parameters and sample estimators, especially when calculating variance. We will explore Bessel's correction, probability bounds, robust statistics, and their applications in deep learning optimization and normalization.


Variance and Bessel's Correction

Variance measures the average squared deviation of data points from their mean, representing the dispersion of the distribution. However, estimating variance from a sample requires a mathematical correction to prevent bias.

If we use the population formula directly on a sample, we will systematically underestimate the variance because the sample data points are closer to their own sample mean than to the population mean.

Population vs. Sample Variance

The population variance $\sigma^2$ is defined as the expectation of the squared deviation from the population mean $\mu$:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

If we use the same formula for a sample of size $n$ by dividing by $n$, the estimator systematically underestimates the true population variance. To resolve this, sample variance $s^2$ is calculated using Bessel's correction (dividing by $n-1$):

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Proof of Bessel's Correction

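To prove that $s^2$ is an unbiased estimator ($E[s^2] = \sigma^2$), assume the observations $x_1, \dots, x_n$ are independent draws from a distribution with mean $\mu$ and variance $\sigma^2$, and expand the sum of squared deviations: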

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} ((x_i - \mu) - (\bar{x} - \mu))^2 = \sum_{i=1}^{n} \left( (x_i - \mu)^2 - 2(x_i - \mu)(\bar{x} - \mu) + (\bar{x} - \mu)^2 \right)$$

$$= \sum_{i=1}^{n} (x_i - \mu)^2 - 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \mu) + n(\bar{x} - \mu)^2$$

Since $\sum_{i=1}^n (x_i - \mu) = n(\bar{x} - \mu)$, the sum simplifies to:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \mu)^2 - 2n(\bar{x} - \mu)^2 + n(\bar{x} - \mu)^2 = \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2$$

Taking expectations on both sides:

$$E\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right] = \sum_{i=1}^{n} E[(x_i - \mu)^2] - n E[(\bar{x} - \mu)^2]$$

Since $E[(x_i - \mu)^2] = \sigma^2$ and, for independent observations, $E[(\bar{x} - \mu)^2] = \text{Var}(\bar{x}) = \sigma^2/n$, this becomes:

$$E\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right] = n\sigma^2 - n\left(\frac{\sigma^2}{n}\right) = n\sigma^2 - \sigma^2 = (n-1)\sigma^2$$

Dividing by $n-1$ yields the unbiased estimator: $E\left[ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \right] = \sigma^2$. This demonstrates that Bessel's correction removes the downward bias of the sample variance estimator.
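The result can be checked numerically. The following is a minimal simulation sketch, assuming normally distributed data; the population variance, sample size, and number of trials are illustrative choices that do not come from the text. It averages both estimators over many small samples, using NumPy's `ddof` argument to switch between dividing by $n$ and by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(0)

true_var = 4.0      # population variance (sigma = 2), chosen for illustration
n = 5               # small sample size, where the bias is most visible
trials = 200_000    # number of independent samples to average over

# Each row is one sample of size n drawn from N(0, true_var).
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

# ddof=0 divides by n (population formula); ddof=1 divides by n - 1 (Bessel).
biased = samples.var(axis=1, ddof=0).mean()
unbiased = samples.var(axis=1, ddof=1).mean()

print(f"divide by n:     {biased:.3f}  (theory: {(n - 1) / n * true_var:.3f})")
print(f"divide by n - 1: {unbiased:.3f}  (theory: {true_var:.3f})")
```

The average of the divide-by-$n$ estimator should settle near $\frac{n-1}{n}\sigma^2$, while the corrected estimator should settle near $\sigma^2$.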

Standard Deviation and Chebyshev's Inequality

Standard deviation is the square root of variance, returning the measure of dispersion to the same physical unit as the original data. It forms the foundation of probability bounds.

While normal distributions follow the empirical 68-95-99.7 rule, we can establish universal mathematical bounds for arbitrary distributions using standard deviation.

Standard Deviation ($\sigma$) and the Normal Distribution

Sample standard deviation is $s = \sqrt{s^2}$. In a normal distribution $\mathcal{N}(\mu, \sigma^2)$, the standard deviation determines the coverage probabilities: approximately $68.27\%$ of the probability mass falls within $\mu \pm 1\sigma$, $95.45\%$ within $\mu \pm 2\sigma$, and $99.73\%$ within $\mu \pm 3\sigma$. This is known as the Empirical Rule, or 68-95-99.7 rule. In anomaly detection, data points lying beyond 3 standard deviations are frequently flagged as outliers.
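These percentages follow from the normal CDF: for a standard normal $Z$, $P(|Z| < k) = \text{erf}(k/\sqrt{2})$. A short sketch using only the Python standard library (the helper name `normal_coverage` is introduced here for illustration):

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.2%}")
```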

Markov's and Chebyshev's Inequalities

First, let us prove Markov's Inequality, which states that for any non-negative random variable $Y$ and $a > 0$: $P(Y \ge a) \le E[Y]/a$. For simplicity, assume $Y$ is continuous with PDF $f(y)$; the discrete case is analogous, with sums in place of integrals. Then:

$$E[Y] = \int_{0}^{\infty} y f(y) \, dy = \int_{0}^{a} y f(y) \, dy + \int_{a}^{\infty} y f(y) \, dy$$

Since $y \ge 0$, both terms are non-negative. Discarding the first term gives:

$$E[Y] \ge \int_{a}^{\infty} y f(y) \, dy \ge \int_{a}^{\infty} a f(y) \, dy = a \int_{a}^{\infty} f(y) \, dy = a P(Y \ge a)$$

Dividing by $a$ proves $P(Y \ge a) \le E[Y]/a$.

Now, we use this to prove Chebyshev's Inequality, which states that for a random variable $X$ with mean $\mu$ and finite variance $\sigma^2 > 0$, the probability of deviating from the mean by at least $k$ standard deviations is bounded for any $k > 0$: $P(|X - \mu| \ge k\sigma) \le 1/k^2$. Set $Y = (X - \mu)^2$ and $a = k^2\sigma^2$. Since $Y$ is non-negative and $a > 0$, we apply Markov's Inequality:

$$P((X - \mu)^2 \ge k^2\sigma^2) \le \frac{E[(X - \mu)^2]}{k^2\sigma^2}$$

Because $(X - \mu)^2 \ge k^2\sigma^2$ is equivalent to $|X - \mu| \ge k\sigma$, and $E[(X - \mu)^2] = \sigma^2$ by definition of variance, this simplifies to:

$$P(|X - \mu| \ge k\sigma) \le \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}$$

For example, for $k=2$, at least $75\%$ ($1 - 1/4$) of all observations must lie within 2 standard deviations of the mean, regardless of the distribution's shape. This inequality is used in machine learning to establish robust outlier detection rules and probability bounds without making strict assumptions about the data distribution.
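To see how conservative the bound can be, the sketch below compares the observed tail fraction with $1/k^2$ on a skewed, non-normal sample. The exponential distribution, the sample size, and the helper name `chebyshev_check` are assumptions made for illustration.

```python
import numpy as np

def chebyshev_check(data, k):
    """Return (observed fraction with |x - mean| >= k * std, Chebyshev bound 1 / k^2)."""
    mu, sigma = data.mean(), data.std()   # ddof=0: moments of the empirical distribution
    observed = float(np.mean(np.abs(data - mu) >= k * sigma))
    return observed, 1.0 / k**2

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)   # skewed, clearly non-normal

for k in (1.5, 2.0, 3.0):
    observed, bound = chebyshev_check(x, k)
    print(f"k={k}: observed tail {observed:.4f} <= bound {bound:.4f}")
```

The observed tail fractions sit well below the bound. This is the price of a guarantee that holds for every distribution with finite variance.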

Robust Dispersion: Range, IQR, and Percentiles

Variance and standard deviation are sensitive to outliers because they square the deviations. Robust dispersion metrics provide alternative measures of spread.

By focusing on positional rankings rather than squared magnitudes, robust measures allow us to describe data dispersion accurately even in the presence of noise and heavy tails.

Range, Percentiles, and IQR

The range is the difference between the maximum and minimum values in a dataset: $\text{Range} = x_{\max} - x_{\min}$. A single extreme outlier will drastically increase the range, making it an unstable measure of typical spread. Percentiles partition a sorted dataset into 100 equal parts. The $p$-th percentile is the value below which $p$ percent of the data falls, offering a localized view of the distribution shape. The interquartile range (IQR) is the distance between the 75th percentile (third quartile, $Q_3$) and the 25th percentile (first quartile, $Q_1$): $\text{IQR} = Q_3 - Q_1$. Because it depends only on the middle $50\%$ of the data, it is largely insensitive to extreme values: the IQR stays bounded as long as fewer than $25\%$ of the observations are contaminated.
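The contrast between these measures is easy to see on a small example. The sketch below uses made-up data and a hypothetical helper, `spread_summary`, to compute the range, IQR, and sample standard deviation before and after appending a single extreme value. It relies on `np.percentile` with its default linear interpolation; other quantile conventions give slightly different quartiles.

```python
import numpy as np

def spread_summary(x):
    """Range, interquartile range, and sample standard deviation of a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    return {"range": x.max() - x.min(), "iqr": q3 - q1, "std": x.std(ddof=1)}

data = np.array([2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
with_outlier = np.append(data, 1000.0)   # one extreme value added

print("clean:       ", spread_summary(data))
print("with outlier:", spread_summary(with_outlier))
```

With the outlier added, the range jumps from 8 to 998 and the standard deviation grows by orders of magnitude, while the IQR barely moves (from 4.0 to 4.5).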

Tukey Outliers and Median Absolute Deviation (MAD)

In exploratory data analysis, the Tukey fence method uses IQR to identify outliers. Any data point $x$ is flagged as an outlier if $x < Q_1 - 1.5 \times \text{IQR}$ or $x > Q_3 + 1.5 \times \text{IQR}$. In machine learning preprocessing pipelines, removing or clipping features using IQR boundaries is standard practice to prevent extreme values from distorting model training.
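A minimal sketch of the fence rule follows, covering both options mentioned above (dropping flagged points or clipping them to the fences). The function name `tukey_fences` and the sample values are illustrative assumptions, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def tukey_fences(x, k=1.5):
    """Return the (lower, upper) Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0, 10.0, 12.0, -40.0])
lower, upper = tukey_fences(values)

mask = (values < lower) | (values > upper)   # True where a point is flagged
filtered = values[~mask]                     # option 1: drop flagged points
clipped = np.clip(values, lower, upper)      # option 2: clip to the fences

print("fences:  ", lower, upper)
print("outliers:", values[mask])
```

Whether to drop or clip is a modelling decision. Clipping keeps the row count unchanged, which matters when other features in the same row are still informative.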

Another robust measure is the Median Absolute Deviation (MAD), defined as:

$$\text{MAD} = \text{median}(|x_i - \text{median}(X)|)$$

To use MAD as a consistent estimator of the standard deviation $\sigma$, we scale it by a constant. By definition, half of the data lies within one MAD of the median. For a normal distribution, the standard normal cumulative distribution function satisfies $\Phi(0.6745) \approx 0.75$, so $P(|X - \mu| < 0.6745\sigma) \approx 0.5$. Thus, $\text{MAD} \approx 0.6745 \sigma$, which means $\sigma \approx 1.4826 \times \text{MAD}$. This scale factor lets us estimate the standard deviation robustly in the presence of outliers.
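The sketch below applies the scaled MAD to synthetic normal data with true $\sigma = 2$, then overwrites 1% of the points with a gross outlier value. The contamination level, the outlier magnitude, and the helper name `scaled_mad` are illustrative assumptions.

```python
import numpy as np

def scaled_mad(x, scale=1.4826):
    """Median absolute deviation, scaled to estimate sigma under normality."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    return scale * np.median(np.abs(x - med))

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=2.0, size=100_000)   # true sigma = 2

contaminated = clean.copy()
contaminated[:1000] = 1e4    # replace 1% of the points with gross outliers

print("clean:        std =", clean.std(ddof=1), " scaled MAD =", scaled_mad(clean))
print("contaminated: std =", contaminated.std(ddof=1), " scaled MAD =", scaled_mad(contaminated))
```

On the contaminated sample the standard deviation is dominated by the outliers, whereas the scaled MAD stays close to 2.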

Dispersion in Deep Learning: Initialization and Normalization

Controlling the dispersion of weights and activations is a primary engineering challenge when building and training deep neural networks.

If dispersion is unchecked, signals will either explode to infinity or vanish to zero as they propagate through successive network layers.

Weight Initialization: Xavier and He Scaling

When initializing weights in deep networks, if the variance of weights is too large, the activations in later layers will explode. If the variance is too small, activations will vanish. To keep the variance of activations constant across layers, we analyze a linear layer $y = Wx$ with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs. Assuming independent zero-mean weights and inputs, the variance of each output is $\text{Var}(y_i) = n_{\text{in}} \text{Var}(w) \text{Var}(x)$. To maintain $\text{Var}(y_i) = \text{Var}(x)$, we require $\text{Var}(w) = 1/n_{\text{in}}$. The same argument applied to gradients flowing backward requires $\text{Var}(w) = 1/n_{\text{out}}$. Xavier (Glorot) initialization compromises between the two conditions by averaging the input and output dimensions:

$$\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

For networks using ReLU activation, He (Kaiming) initialization accounts for the fact that ReLU zeroes out all negative inputs, which halves the second moment of the signal propagating forward. Mathematically, let $x = \max(0, z)$. If $z \sim \mathcal{N}(0, \sigma^2)$, then by symmetry $E[x^2] = \frac{1}{2} E[z^2] = \frac{1}{2} \sigma^2$. Because the next layer's pre-activation variance is $n_{\text{in}} \text{Var}(w) E[x^2]$, He initialization compensates for this $50\%$ loss at each layer by doubling the weight variance:

$$\text{Var}(W) = \frac{2}{n_{\text{in}}}$$

This tuning of weight dispersion keeps the signal scale roughly constant with depth, which is one of the ingredients that makes very deep networks trainable.
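The effect of the two scalings can be observed by pushing random inputs through a stack of ReLU layers and measuring the mean squared activation at the end. This is a rough sketch under simplifying assumptions (square layers, Gaussian weights, no biases, rows of `x` as batch items so the layer is written `x @ w`); the depth, width, and function names are illustrative.

```python
import numpy as np

def xavier_var(fan_in, fan_out):
    return 2.0 / (fan_in + fan_out)

def he_var(fan_in, fan_out):
    return 2.0 / fan_in

def final_second_moment(weight_var_fn, depth=20, width=512, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers of size `width`."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))          # unit-variance inputs
    for _ in range(depth):
        std = np.sqrt(weight_var_fn(width, width))
        w = rng.normal(scale=std, size=(width, width))
        x = np.maximum(0.0, x @ w)               # linear layer followed by ReLU
    return float(np.mean(x**2))

xavier_result = final_second_moment(xavier_var)
he_result = final_second_moment(he_var)

print("Xavier scaling with ReLU:", xavier_result)
print("He scaling with ReLU:    ", he_result)
```

For square layers Xavier scaling reduces to $1/n_{\text{in}}$, so under ReLU the signal shrinks by roughly a factor of two per layer and is tiny after 20 layers. With He scaling the mean squared activation stays on the order of 1, up to finite-width fluctuations.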

Batch Normalization

Batch Normalization stabilizes network training by explicitly controlling the dispersion of activations. For a mini-batch $B$, it computes the mini-batch mean $\mu_B$ and variance $\sigma_B^2$, and normalizes each activation $x_i$:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $\epsilon$ is a small constant to prevent division by zero. By forcing the activations in each mini-batch to have a mean of 0 and a variance of 1, Batch Normalization was designed to keep the distribution of inputs to deeper layers from shifting during training (internal covariate shift), and in practice it allows higher learning rates and faster convergence. It then scales and shifts the normalized activations with learnable parameters, $y_i = \gamma \hat{x}_i + \beta$, allowing the model to recover the optimal representation scale.
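A training-mode forward pass can be sketched in a few lines of NumPy. This is a minimal illustration, not a full layer: it omits the running statistics used at inference time and the backward pass. It assumes inputs of shape `(batch, features)` and the biased mini-batch variance (dividing by the batch size); the function name and default `eps` are illustrative choices.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                      # mini-batch mean per feature
    var = x.var(axis=0)                      # mini-batch variance per feature (ddof=0)
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta, x_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # activations far from N(0, 1)

gamma = np.full(4, 2.0)   # learnable scale (fixed here for illustration)
beta = np.full(4, 1.0)    # learnable shift (fixed here for illustration)

out, x_hat = batch_norm_forward(x, gamma, beta)
print("x_hat mean:", x_hat.mean(axis=0), " x_hat std:", x_hat.std(axis=0))
print("out mean:  ", out.mean(axis=0), " out std:  ", out.std(axis=0))
```

After normalization each feature has mean close to 0 and standard deviation close to 1. The output then has mean close to $\beta$ and standard deviation close to $\gamma$, showing how the learnable parameters reintroduce a chosen scale.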