Covariance: How Variables Change Together
When dealing with multi-dimensional datasets in machine learning, we rarely study variables in isolation. Covariance is the mathematical measure of how two random variables change together. Its sign gives the direction of the linear relationship between them, while its magnitude depends on the units of the variables and so does not by itself measure the strength of that relationship. In high dimensions, covariance generalizes to the covariance matrix, a central object in dimensionality reduction, multivariate Gaussian distributions, and feature representation.
Understanding covariance is crucial because it describes redundant information in datasets. If two features covary strongly, they provide similar signals to a model, which can lead to instability in regression coefficients and increased computational overhead. We will explore joint expected values, symmetric matrices, transformations, and variance maximization.
Bivariate Covariance and Expected Value
Covariance is the measure of the linear association between two random variables. It describes whether the variables change together and in what direction.
Unlike variance, which measures the spread of a single variable, covariance tracks how two variables coordinate their deviations from their respective means.
Bivariate Covariance Formulation
The population covariance between two random variables $X$ and $Y$ is defined as the expectation of their joint deviations from their respective means:
$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$
By expanding the product and using the linearity of expectation, we derive the computational formula:
$$\text{Cov}(X, Y) = E[XY - X E[Y] - Y E[X] + E[X]E[Y]]$$
$$= E[XY] - E[X]E[Y] - E[Y]E[X] + E[X]E[Y] = E[XY] - E[X]E[Y]$$
For a sample, we estimate the covariance as:
$$s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
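The sample estimator above can be checked numerically. The following is a minimal sketch in Python with NumPy; the helper name `sample_covariance` and the toy data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance with the 1/(n-1) normalization."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations")
    # Sum of products of deviations from the sample means, divided by n - 1.
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Hypothetical toy data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

s_xy = sample_covariance(x, y)

# np.cov also divides by n - 1 by default; the off-diagonal entry is s_XY.
s_xy_numpy = np.cov(x, y)[0, 1]
```

For this toy data the deviation products sum to 6, so with $n - 1 = 4$ both computations give $s_{XY} = 1.5$.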
Direction, Interpretation, and Independence
A positive covariance indicates that $X$ and $Y$ tend to move in the same direction. A negative covariance indicates that they tend to move in opposite directions. A covariance of zero indicates no linear relationship. However, covariance is not a measure of independence. If two variables are independent, their covariance is 0, but the converse is not true.
A classic counterexample shows this. Let $X \sim U[-1, 1]$ (a uniform distribution), so that $E[X] = 0$, and let $Y = X^2$. Then $Y$ is a deterministic function of $X$, so the two variables are clearly dependent. Their covariance is:
$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[X^3] - 0 \cdot E[X^2]$$
Because $X$ is symmetric around zero, the expected value of any odd power of $X$ is zero. Thus, $E[X^3] = 0$, yielding $\text{Cov}(X, Y) = 0$. This demonstrates that covariance only captures linear associations and can be zero even when variables share a perfect non-linear relationship.
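The counterexample can also be illustrated by simulation. This is a rough sketch under assumed settings (the seed and sample size are arbitrary); the estimate of $\text{Cov}(X, Y)$ is only approximately zero because it comes from a finite sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x ** 2  # y is fully determined by x

# The linear covariance between x and y is close to zero.
cov_xy = np.cov(x, y)[0, 1]

# The dependence shows up once a nonlinear function of x is used:
# Cov(X^2, Y) = Var(X^2) = E[X^4] - E[X^2]^2 = 1/5 - 1/9 = 4/45.
cov_x2_y = np.cov(x ** 2, y)[0, 1]
```

A near-zero `cov_xy` next to a clearly positive `cov_x2_y` is consistent with the argument above: the dependence is real, but it is not linear.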
The Covariance Matrix
When working with multi-dimensional datasets, we represent the pairwise covariances between all variables in a structured matrix.
The covariance matrix is a fundamental object in multivariate statistics, summarizing the multi-dimensional shape and alignment of the data distribution.
Matrix Structure and Symmetry
For a random vector $\mathbf{X} = [X_1, X_2, \dots, X_D]^T$ representing $D$ features, the covariance matrix $\Sigma$ is a $D \times D$ matrix defined as:
$$\Sigma = E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T]$$
The entry in the $i$-th row and $j$-th column is $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. The covariance matrix $\Sigma$ is always symmetric because $\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)$. The diagonal elements represent the variances of the individual features: $\Sigma_{ii} = \text{Var}(X_i) = \sigma_i^2$.
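As a small illustration of this structure, the sketch below builds a sample covariance matrix from centered data and compares it with `np.cov`. The data layout (samples in rows, features in columns) and the synthetic dataset are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 500

# Hypothetical data: n samples in rows, D = 3 features in columns.
data = rng.normal(size=(n, 3))
# Make the third feature covary with the first one.
data[:, 2] = 0.8 * data[:, 0] + 0.2 * data[:, 2]

# Sample analogue of E[(X - E[X])(X - E[X])^T] with 1/(n-1) normalization.
centered = data - data.mean(axis=0)
sigma = centered.T @ centered / (n - 1)

# rowvar=False tells np.cov that columns, not rows, are the variables.
sigma_numpy = np.cov(data, rowvar=False)

is_symmetric = np.allclose(sigma, sigma.T)
diagonal_is_variance = np.allclose(np.diag(sigma), data.var(axis=0, ddof=1))
```

Here `sigma[0, 2]` is clearly positive because of how the third feature was constructed, while the diagonal reproduces the per-feature sample variances.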
Positive Semi-Definiteness and Multivariate Gaussian Geometry
The covariance matrix $\Sigma$ is positive semi-definite (PSD), meaning that for any real vector $\mathbf{v} \in \mathbb{R}^D$: $\mathbf{v}^T \Sigma \mathbf{v} \ge 0$. To prove this, consider the variance of a linear combination of the random variables, $Y = \mathbf{v}^T \mathbf{X}$:
$$\text{Var}(Y) = \text{Var}(\mathbf{v}^T \mathbf{X}) = E[(\mathbf{v}^T \mathbf{X} - E[\mathbf{v}^T \mathbf{X}])^2] = E[(\mathbf{v}^T(\mathbf{X} - E[\mathbf{X}]))^2]$$
$$= E[\mathbf{v}^T(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\mathbf{v}] = \mathbf{v}^T E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T] \mathbf{v} = \mathbf{v}^T \Sigma \mathbf{v}$$
Since the variance of any random variable must be non-negative, $\mathbf{v}^T \Sigma \mathbf{v} \ge 0$ follows. This property guarantees that all eigenvalues of $\Sigma$ are non-negative: for an eigenvector $\mathbf{v}$ with $\Sigma \mathbf{v} = \lambda \mathbf{v}$, we have $\mathbf{v}^T \Sigma \mathbf{v} = \lambda \mathbf{v}^T \mathbf{v} \ge 0$. In a multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and positive definite (hence invertible) covariance $\Sigma$, the covariance matrix defines the geometry of the probability density function:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
The quadratic form inside the exponent is the squared Mahalanobis distance, where $D_M(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}$. Unlike Euclidean distance, the Mahalanobis distance accounts for correlations between features and differences in scale, serving as the mathematical basis for multivariate anomaly detection and classification.
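To make the contrast with Euclidean distance concrete, here is a minimal sketch. The function name `mahalanobis` and the example covariance are assumptions for illustration, and the covariance is taken to be positive definite so that the linear solve is well defined.

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    """Mahalanobis distance of x from mu; sigma is assumed positive definite."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve sigma @ z = diff rather than forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))

mu = np.array([0.0, 0.0])
# Hypothetical covariance with positively correlated features.
sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# Eigenvalues of a symmetric matrix; both are positive here (1 and 3).
eigenvalues = np.linalg.eigvalsh(sigma)

a = np.array([1.0, 1.0])    # lies along the direction of positive correlation
b = np.array([1.0, -1.0])   # lies against that direction

euclid_a = float(np.linalg.norm(a - mu))
euclid_b = float(np.linalg.norm(b - mu))
d_a = mahalanobis(a, mu, sigma)  # sqrt(2/3)
d_b = mahalanobis(b, mu, sigma)  # sqrt(2)
```

Both points are at the same Euclidean distance from the mean, but `b` is farther in Mahalanobis terms because it sits in a direction where the distribution has less variance.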
Linear Transformations and Covariance
Understanding how covariance propagates under linear transformations is a key requirement for multivariate modeling and data preprocessing.
By multiplying a random vector by a matrix, we can rotate, scale, and align distributions, which is the basis of data whitening and of sampling in generative modeling.
Transforming Covariance Matrices
If we apply a linear transformation to a random vector $\mathbf{X} \in \mathbb{R}^D$ using a matrix $\mathbf{A} \in \mathbb{R}^{M \times D}$ and add a constant vector $\mathbf{b} \in \mathbb{R}^M$, we get a new random vector $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$. The covariance matrix of $\mathbf{Y}$ is given by:
$$\Sigma_Y = E[(\mathbf{Y} - E[\mathbf{Y}])(\mathbf{Y} - E[\mathbf{Y}])^T] = E[(\mathbf{A}(\mathbf{X} - E[\mathbf{X}]))(\mathbf{A}(\mathbf{X} - E[\mathbf{X}]))^T] = \mathbf{A} \Sigma_X \mathbf{A}^T$$
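The identity $\Sigma_Y = \mathbf{A} \Sigma_X \mathbf{A}^T$ also holds exactly for sample covariance matrices, which gives a quick numerical check. The matrix `A`, the offset `b`, and the synthetic data below are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# Rows are samples of X with D = 3 features.
X = rng.normal(size=(1000, 3))

# Hypothetical transformation with M = 2 outputs.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
b = np.array([10.0, -5.0])

# Each row of Y is A @ x + b for the corresponding row x of X.
Y = X @ A.T + b

sigma_x = np.cov(X, rowvar=False)
sigma_y = np.cov(Y, rowvar=False)

# The constant offset b drops out; only A shapes the covariance.
predicted = A @ sigma_x @ A.T
matches = np.allclose(sigma_y, predicted)
```

The agreement is up to floating-point error rather than sampling error, because centering removes `b` and the remaining algebra is the same as in the derivation.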
This property is fundamental in generative modeling. For example, samples from a standard normal distribution can be transformed into samples from a correlated Gaussian using the Cholesky decomposition of the target covariance matrix, $\Sigma = \mathbf{L}\mathbf{L}^T$, where $\mathbf{L}$ is a lower triangular matrix (the factorization exists when $\Sigma$ is positive definite). To sample from $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$, we sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$ and compute $\mathbf{y} = \mathbf{L}\mathbf{z} + \boldsymbol{\mu}$. The covariance of $\mathbf{y}$ is then $\mathbf{L}I\mathbf{L}^T = \Sigma$.
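The Cholesky sampling recipe can be sketched as follows. The target mean and covariance are assumed values for illustration, and the empirical statistics match the targets only approximately because the sample is finite.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# Hypothetical target distribution N(mu, sigma).
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# np.linalg.cholesky returns the lower triangular factor L with L @ L.T == sigma.
L = np.linalg.cholesky(sigma)

# Draw standard normal vectors (one per row) and map each through y = L z + mu.
z = rng.standard_normal(size=(100_000, 2))
samples = z @ L.T + mu

empirical_mean = samples.mean(axis=0)
empirical_cov = np.cov(samples, rowvar=False)
```

With this many samples, `empirical_mean` and `empirical_cov` should sit close to `mu` and `sigma`, which is what the $\mathbf{L}I\mathbf{L}^T = \Sigma$ argument predicts.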
PCA Whitening vs. ZCA Whitening
Data whitening is a preprocessing transformation that decorrelates features and standardizes their variances, so that the transformed data has a covariance matrix equal to the identity matrix $I$. Given centered data with covariance $\Sigma$, we compute its eigendecomposition $\Sigma = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$, where $\mathbf{U}$ is an orthogonal matrix of eigenvectors and $\mathbf{\Lambda}$ is a diagonal matrix of eigenvalues, assumed strictly positive so that $\mathbf{\Lambda}^{-\frac{1}{2}}$ exists. There are two primary whitening methods:
1. PCA Whitening: Scales the data along the principal component axes. The transformation matrix is $\mathbf{W}_{\text{PCA}} = \mathbf{\Lambda}^{-\frac{1}{2}} \mathbf{U}^T$. Applying this decorrelates the features and sets their variances to 1, but it rotates the data.
2. ZCA Whitening: Rotates the PCA-whitened data back to the original orientation of the features using the eigenvector matrix: $\mathbf{W}_{\text{ZCA}} = \mathbf{U} \mathbf{\Lambda}^{-\frac{1}{2}} \mathbf{U}^T$. Among all whitening transformations, this choice keeps the whitened features as close as possible to the original features in terms of mean squared distance, which makes it useful for preprocessing images before feeding them to convolutional neural networks.
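Both whitening transforms can be sketched in a few lines. The synthetic data, the mixing matrix, and the small `eps` floor on the eigenvalues are assumptions of this example; `eps` is a common practical safeguard rather than part of the definitions above.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed

# Hypothetical correlated 2-D data, centered before whitening.
mix = np.array([[2.0, 0.0],
                [1.2, 0.5]])
X = rng.normal(size=(2000, 2)) @ mix.T
Xc = X - X.mean(axis=0)

sigma = np.cov(Xc, rowvar=False)
# eigh is for symmetric matrices; the columns of U are the eigenvectors.
eigvals, U = np.linalg.eigh(sigma)

eps = 1e-12  # assumed floor to guard against near-zero eigenvalues
inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + eps))

W_pca = inv_sqrt @ U.T
W_zca = U @ inv_sqrt @ U.T

# Samples are rows, so applying W to each sample is a right-multiply by W.T.
X_pca = Xc @ W_pca.T
X_zca = Xc @ W_zca.T

cov_pca = np.cov(X_pca, rowvar=False)
cov_zca = np.cov(X_zca, rowvar=False)

# Mean squared distance between whitened and original (centered) points.
dist_pca = np.mean(np.sum((X_pca - Xc) ** 2, axis=1))
dist_zca = np.mean(np.sum((X_zca - Xc) ** 2, axis=1))
```

Both `cov_pca` and `cov_zca` come out as the identity up to floating-point error, and `dist_zca` is no larger than `dist_pca`, in line with the closeness property of ZCA.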
Principal Component Analysis (PCA)
PCA is a widely used unsupervised technique for projecting high-dimensional data onto a lower-dimensional space while retaining as much variance as possible.
By projecting the data onto the leading eigenvectors of its covariance matrix, PCA preserves the directions of greatest variance and discards the low-variance directions, which are often dominated by noise.
Maximizing Projected Variance
PCA seeks to find a direction vector $\mathbf{v}$ (where $\mathbf{v}^T \mathbf{v} = 1$) such that projecting the centered data onto $\mathbf{v}$ maximizes the variance of the projected data. The variance of the projection is $\text{Var}(\mathbf{v}^T \mathbf{X}) = \mathbf{v}^T \Sigma \mathbf{v}$. To maximize this subject to $\mathbf{v}^T \mathbf{v} = 1$, we set up the Lagrangian:
$$\mathcal{L}(\mathbf{v}, \lambda) = \mathbf{v}^T \Sigma \mathbf{v} - \lambda(\mathbf{v}^T \mathbf{v} - 1)$$
Taking the partial derivative with respect to $\mathbf{v}$ and setting it to zero:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{v}} = 2\Sigma\mathbf{v} - 2\lambda\mathbf{v} = 0 \implies \Sigma\mathbf{v} = \lambda\mathbf{v}$$
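This is the classic eigenvalue equation, so every stationary point of the Lagrangian is an eigenvector of $\Sigma$. Left-multiplying by $\mathbf{v}^T$ and using $\mathbf{v}^T \mathbf{v} = 1$ gives $\mathbf{v}^T \Sigma \mathbf{v} = \lambda$, so the projected variance at a stationary point equals its eigenvalue. The direction of maximum variance is therefore the eigenvector $\mathbf{v}$ corresponding to the largest eigenvalue $\lambda$ of the covariance matrix $\Sigma$, and the variance of the data projected onto this component is exactly $\lambda$. This shows that the eigendecomposition of the covariance matrix yields the directions that capture the maximum variance.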
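The result can be checked numerically by comparing the top eigenvector with randomly drawn unit directions. This is a sketch on synthetic data with an arbitrary seed; the random comparison illustrates the maximum but does not prove it.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed

# Hypothetical correlated 2-D data, centered.
mix = np.array([[3.0, 0.0],
                [1.0, 0.5]])
X = rng.normal(size=(1000, 2)) @ mix.T
Xc = X - X.mean(axis=0)

sigma = np.cov(Xc, rowvar=False)
# eigh returns eigenvalues in ascending order, so the last one is the largest.
eigvals, eigvecs = np.linalg.eigh(sigma)
lam_max = eigvals[-1]
v = eigvecs[:, -1]

# Variance of the data projected onto v; this equals lam_max.
projected_var = np.var(Xc @ v, ddof=1)

# Projected variance d^T sigma d for 500 random unit directions d.
random_dirs = rng.normal(size=(500, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", random_dirs, sigma, random_dirs)
```

None of the values in `random_vars` exceeds `lam_max`, and `projected_var` agrees with `lam_max` up to floating-point error.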
Dimensionality Reduction and Reconstruction Error
By calculating the eigenvectors of the sample covariance matrix, sorting them by their corresponding eigenvalues in descending order, and selecting the top $k$ eigenvectors as columns, we construct a $D \times k$ projection matrix $\mathbf{V}_k$. Projecting the centered $D$-dimensional data onto $\mathbf{V}_k$ yields a $k$-dimensional representation that retains the maximum possible variance among all projections onto $k$ orthonormal directions. It can be shown that maximizing the projected variance is equivalent to minimizing the mean squared reconstruction error between the original data points and their low-dimensional projections. Because the eigenvectors are orthogonal, the resulting principal components are uncorrelated, which removes multicollinearity among the transformed features passed to downstream models.
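The full procedure, including the link between retained variance and reconstruction error, might be sketched as follows. The dimensions `n`, `D`, `k` and the synthetic data are assumptions for illustration; with the $1/(n-1)$ normalization used here, the reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed
n, D, k = 800, 5, 2

# Hypothetical data with correlated features, centered before PCA.
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)

sigma = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(sigma)

# Sort eigenpairs by eigenvalue in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

V_k = eigvecs[:, :k]   # D x k projection matrix
Z = Xc @ V_k           # n x k low-dimensional representation
X_hat = Z @ V_k.T      # reconstruction in the original D dimensions

# Total squared reconstruction error, normalized by n - 1 like the covariance.
recon_error = np.sum((Xc - X_hat) ** 2) / (n - 1)
discarded_variance = eigvals[k:].sum()

retained_fraction = eigvals[:k].sum() / eigvals.sum()

# The components are uncorrelated: their covariance is diagonal.
cov_Z = np.cov(Z, rowvar=False)
```

Here `recon_error` matches `discarded_variance`, and `cov_Z` is diagonal with the top $k$ eigenvalues on its diagonal, which is the numerical counterpart of the equivalence and decorrelation statements above.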