Correlation Coefficients (Pearson vs. Spearman)

While covariance tells us the direction of a relationship, its magnitude depends on the units of measurement, making it hard to compare across features. Correlation coefficients resolve this by normalizing covariance into a unitless metric ranging from -1 to 1. In machine learning, selecting the appropriate correlation coefficient (Pearson for linear patterns, Spearman for monotonic patterns that need not be linear) is key to feature selection, collinearity diagnosis, and understanding complex relationships.

Understanding the bounds and assumptions of each coefficient helps guard data pipelines against spurious correlations. Additionally, distinguishing between correlation and causation is the dividing line between purely predictive models and causal models capable of simulating active interventions. We will explore Pearson bounds, Spearman ranking, Variance Inflation Factors, and causal graphs.

Pearson Correlation Coefficient ($r$)

Pearson's correlation coefficient is a normalized measure of the linear relationship between two continuous variables, scaling covariance to a unitless range.

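Computing $r$ requires no distributional assumption, but it summarizes only the linear component of a relationship, and the standard significance tests for $r$ assume approximately bivariate normal data. It is therefore most informative for roughly Gaussian, linearly related variables, and it is sensitive to outliers.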

Mathematical Formulation

The population Pearson correlation coefficient $\rho_{X,Y}$ is defined as the covariance of $X$ and $Y$ divided by the product of their standard deviations:

$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

For a sample, it is calculated as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
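The sample formula above can be sketched directly in NumPy. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the toy `hours`/`score` data are hypothetical, and the result is cross-checked against `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation via the centered-sum formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Hypothetical toy data, for illustration only
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 70.0]

r = pearson_r(hours, score)
r_numpy = float(np.corrcoef(hours, score)[0, 1])  # NumPy's built-in equivalent
print(r, r_numpy)
```

Both values should agree up to floating-point error, since `np.corrcoef` computes the same normalized covariance.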

Proof of Bounds Using Cauchy-Schwarz

To prove that $-1 \le \rho_{X,Y} \le 1$ (or $|\rho_{X,Y}| \le 1$), we use the Cauchy-Schwarz Inequality for expectations. For any two random variables $U$ and $V$ with finite second moments, the inequality states that $(E[UV])^2 \le E[U^2] E[V^2]$. Let $U = X - E[X]$ and $V = Y - E[Y]$. Substituting these into the inequality:

$$(E[(X - E[X])(Y - E[Y])])^2 \le E[(X - E[X])^2] E[(Y - E[Y])^2]$$

By definition, the left-hand term is the squared covariance, and the right-hand terms are the variances:

$$(\text{Cov}(X, Y))^2 \le \text{Var}(X) \text{Var}(Y) = \sigma_X^2 \sigma_Y^2$$

Taking the square root on both sides and dividing by $\sigma_X \sigma_Y$ (assumed to be nonzero):

$$|\text{Cov}(X, Y)| \le \sigma_X \sigma_Y \implies \frac{|\text{Cov}(X, Y)|}{\sigma_X \sigma_Y} \le 1 \implies -1 \le \rho_{X,Y} \le 1$$

A value of 1 represents a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship (a non-linear dependence may still exist). This unitless scale allows us to compare relationships between features measured in different units.
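To make the unit-invariance point concrete, the sketch below rescales two hypothetical features (height in metres vs. centimetres, weight in kilograms vs. grams). The data-generating numbers are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: height in metres, weight in kilograms
height_m = rng.normal(1.70, 0.10, size=200)
weight_kg = 60.0 * height_m - 30.0 + rng.normal(0.0, 5.0, size=200)

# The same measurements expressed in different units
height_cm = height_m * 100.0
weight_g = weight_kg * 1000.0

cov_si = np.cov(height_m, weight_kg)[0, 1]
cov_scaled = np.cov(height_cm, weight_g)[0, 1]   # scales by 100 * 1000

r_si = np.corrcoef(height_m, weight_kg)[0, 1]
r_scaled = np.corrcoef(height_cm, weight_g)[0, 1]  # unchanged by the rescaling

print(cov_si, cov_scaled)
print(r_si, r_scaled)
```

The covariance grows by the product of the two scale factors, while the correlation stays the same.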

Spearman Rank Correlation ($\rho$)

Spearman's rank correlation is a non-parametric coefficient that measures the monotonic relationship between two variables, based on their ranks rather than raw values.

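A monotonic relationship is one in which, as one variable increases, the other consistently increases (or consistently decreases), but not necessarily at a constant rate. Because only the ordering of values matters, working with ranks removes the assumption of linearity.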

Rank Formulation and Simplified Formula Derivation

To compute Spearman's correlation, we rank the raw values of $X$ and $Y$ separately. Let $rg(x_i)$ and $rg(y_i)$ be the ranks. Spearman's $\rho$ is the Pearson correlation coefficient calculated on these ranks. If there are no tied ranks, we can derive a simplified formula. Each set of ranks is then a permutation of the integers $1, 2, \dots, n$, so both have mean $\bar{rg} = (n+1)/2$ and (population) variance $\text{Var}(rg) = (n^2 - 1)/12$. Let $d_i = rg(x_i) - rg(y_i)$ be the difference in ranks. Because the two means are equal, $\sum d_i = 0$ and $\frac{1}{n} \sum d_i^2 = 2\,\text{Var}(rg) - 2\,\text{Cov}(rg(X), rg(Y))$, so $\text{Cov}(rg(X), rg(Y)) = \text{Var}(rg) - \frac{1}{2n} \sum d_i^2$. Dividing by $\text{Var}(rg)$, as Pearson's formula requires, yields:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

This formula is convenient to compute once the ranks are known and shows that, when there are no ties, Spearman's $\rho$ is purely a function of the squared differences in ranks. Another rank correlation metric is Kendall's Tau ($\tau$), defined as:

$$\tau = \frac{C - D}{\frac{1}{2} n(n-1)}$$

where $C$ is the number of concordant pairs (pairs of observations whose relative ordering is the same in both rankings), $D$ is the number of discordant pairs, and the denominator is the total number of pairs (this form assumes no ties). Kendall's Tau is often preferred over Spearman when sample sizes are small, because its sampling distribution approaches normality more quickly.
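Both rank coefficients can be sketched in a few lines. The example below assumes there are no ties, as the formulas above do; the helper names (`ranks_no_ties`, `spearman_rho`, `kendall_tau`) and the toy data are hypothetical.

```python
import numpy as np

def ranks_no_ties(values):
    """Ranks 1..n, assuming all values are distinct."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def spearman_rho(x, y):
    """Simplified Spearman formula based on squared rank differences."""
    d = ranks_no_ties(x) - ranks_no_ties(y)
    n = len(d)
    return float(1.0 - 6.0 * np.sum(d**2) / (n * (n**2 - 1)))

def kendall_tau(x, y):
    """Kendall's tau by counting concordant and discordant pairs (O(n^2))."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (0.5 * n * (n - 1))

# Hypothetical toy data: y swaps two adjacent pairs relative to x
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

rho = spearman_rho(x, y)   # sum of d^2 is 4, so rho = 1 - 24/120 = 0.8
tau = kendall_tau(x, y)    # 8 concordant, 2 discordant pairs, so tau = 0.6
rho_via_pearson = float(np.corrcoef(ranks_no_ties(x), ranks_no_ties(y))[0, 1])
print(rho, tau, rho_via_pearson)
```

The last line checks that the simplified formula matches Pearson's $r$ computed on the ranks. With tied values, neither shortcut applies as written and a tie-aware implementation would be needed.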

Linear vs. Monotonic Relationships

Pearson correlation measures linear relationships, whereas Spearman correlation measures monotonic (rank-order) relationships. For example, if $Y = e^X$, Pearson's $r$ will be less than 1 (roughly $0.88$, though the exact value depends on the range and spacing of the data) because the relationship is non-linear. However, Spearman's $\rho$ will be exactly 1 because the ranks match perfectly. Furthermore, because it operates on ranks, Spearman is far more robust to extreme outliers, whereas a single outlier can severely distort Pearson's $r$. If we add an outlier at $(100, 100)$ to an uncorrelated dataset of 20 points whose values are on the order of 1, Pearson's $r$ will jump close to 1.0, while Spearman's $\rho$ will shift only modestly, since the outlier contributes just one more rank.
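Both claims can be checked numerically. The sketch below is illustrative only: the grid for $X$, the random seed, and the standard-normal base data are assumptions, and Spearman is computed as Pearson on ranks (valid here because there are no ties).

```python
import numpy as np

def ranks_no_ties(values):
    """Ranks 1..n, assuming all values are distinct."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    return pearson(ranks_no_ties(a), ranks_no_ties(b))

# 1) Monotonic but non-linear: Y = exp(X) on an assumed grid over [0, 5]
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)
r_exp = pearson(x, y)       # noticeably below 1
rho_exp = spearman(x, y)    # 1, because the rank orderings are identical

# 2) A single extreme outlier added to 20 uncorrelated points
rng = np.random.default_rng(42)
a = rng.normal(size=20)
b = rng.normal(size=20)
a_out = np.append(a, 100.0)
b_out = np.append(b, 100.0)

r_base, r_out = pearson(a, b), pearson(a_out, b_out)
rho_base, rho_out = spearman(a, b), spearman(a_out, b_out)

print(r_exp, rho_exp)
print(r_base, r_out)       # Pearson jumps toward 1
print(rho_base, rho_out)   # Spearman moves by a bounded, modest amount
```

The outlier dominates every sum in Pearson's formula, but in rank space it is simply the largest of 21 observations.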

Collinearity and Feature Selection

Analyzing correlations between features is a vital diagnostic step when preparing datasets for training predictive models.

When input features are highly correlated with each other, they carry redundant information, which destabilizes the learned weights in linear and other regression models.

Multicollinearity and Variance Inflation Factor (VIF)

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, meaning they provide redundant information. This inflates the variance of the regression coefficients, making them highly sensitive to small changes in the data. Mathematically, under the standard assumption of uncorrelated errors with constant variance $\sigma^2$, the covariance matrix of the ordinary least-squares estimate is given by:

$$\text{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$$

When features are highly correlated, the Gram matrix $X^T X$ (where $X$ is the design matrix) becomes near-singular (ill-conditioned), so the entries of its inverse $(X^T X)^{-1}$ become very large, inflating the variance of the parameter estimates. We quantify multicollinearity using the Variance Inflation Factor (VIF) for feature $X_j$:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

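where $R_j^2$ is the coefficient of determination obtained by regressing $X_j$ on all other features. A VIF of 1 means $X_j$ is linearly uncorrelated with the other features; as a common rule of thumb, values above 5 (or, more leniently, 10) are treated as signs of serious multicollinearity and prompt feature pruning.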
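The VIF definition translates into one auxiliary regression per feature. The following is a minimal NumPy sketch under assumed synthetic data; the helper names `r_squared` and `vif` are hypothetical, and a production pipeline would more likely rely on an existing statistics library.

```python
import numpy as np

def r_squared(target, predictors):
    """R^2 of an ordinary least-squares fit of target on predictors (with intercept)."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((target - target.mean())**2)
    return 1.0 - ss_res / ss_tot

def vif(features):
    """VIF_j = 1 / (1 - R_j^2), regressing each column on all the others."""
    features = np.asarray(features, dtype=float)
    values = []
    for j in range(features.shape[1]):
        others = np.delete(features, j, axis=1)
        values.append(1.0 / (1.0 - r_squared(features[:, j], others)))
    return np.array(values)

# Assumed synthetic data: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(vifs)  # large for x1 and x2, close to 1 for x3
```

As a sanity check, these values coincide with the diagonal of the inverse of the feature correlation matrix, which is an equivalent way to compute VIFs.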

Feature Selection Pipelines

To reduce overfitting and speed up training, feature selection pipelines often compute a pairwise correlation matrix. If two features have a correlation above a chosen threshold (for example $|r| > 0.85$), they are treated as highly collinear, and one is discarded. In linear models, this helps keep the matrix $X^T X$ away from singularity, improving numerical stability during inversion. Note that pairwise checks cannot detect collinearity that involves three or more features jointly, which is the case VIF is designed for. Pruning collinear variables also improves model explainability (e.g. SHAP values are more stable when features are independent).
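One simple way to implement such a pruning step is a greedy pass over the correlation matrix. The sketch below is an assumption about how this might look, not a standard API: the function `prune_correlated`, the feature names, and the synthetic data are all hypothetical, and which member of a correlated pair gets dropped is decided here purely by column order.

```python
import numpy as np

def prune_correlated(features, names, threshold=0.85):
    """Greedy pruning: keep a column only if |r| with every kept column is <= threshold."""
    corr = np.abs(np.corrcoef(features, rowvar=False))
    kept = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Assumed synthetic data: two near-duplicate area features and one unrelated feature
rng = np.random.default_rng(1)
n = 300
area_m2 = rng.normal(100.0, 20.0, size=n)
area_sqft = area_m2 * 10.764 + rng.normal(0.0, 5.0, size=n)
age_years = rng.normal(30.0, 10.0, size=n)

X = np.column_stack([area_m2, area_sqft, age_years])
kept = prune_correlated(X, ["area_m2", "area_sqft", "age_years"])
print(kept)  # the second area column is dropped as redundant
```

In practice the choice of which feature to discard is often guided by domain knowledge or by each feature's correlation with the target rather than by column order.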

The Boundary of Machine Learning: Correlation vs. Causation

Establishing a correlation between two variables does not imply that one causes the other. Predictive ML models capture correlations, while causal AI models attempt to capture mechanisms.

Relying purely on correlation leads to models that collapse when the environment changes because they fail to distinguish between confounding features and actual causal drivers.

Confounders and Spurious Correlations

A correlation can be spurious when two variables $X$ and $Y$ are correlated not because of a direct link, but because they are both causally influenced by a shared confounding variable $Z$. In graph form, the data-generating process is $X \leftarrow Z \rightarrow Y$. For example, ice cream sales ($X$) and sunscreen sales ($Y$) are highly correlated, but both are driven by the outdoor temperature ($Z$). If a machine learning model is trained to predict sunscreen sales using ice cream sales as a feature, it will capture this correlation and may predict well on observational data. However, if we intervene and ban ice cream, sunscreen sales will not change. This illustrates the failure of purely predictive models under interventions, where correlation is mistaken for causation.
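A small simulation makes the confounding structure $X \leftarrow Z \rightarrow Y$ tangible. All coefficients, noise levels, and variable names below are assumptions chosen for illustration; the "intervention" is modelled by assigning ice cream sales independently of temperature.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Z: outdoor temperature drives both X and Y (assumed linear effects)
temperature = rng.normal(25.0, 5.0, size=n)
ice_cream = 2.0 * temperature + rng.normal(0.0, 3.0, size=n)
sunscreen = 1.5 * temperature + rng.normal(0.0, 3.0, size=n)

# Observational data: strong correlation despite no direct causal link
r_observed = float(np.corrcoef(ice_cream, sunscreen)[0, 1])

# Intervention: ice cream sales are set by fiat, independent of temperature.
# Sunscreen sales still follow the same mechanism (they depend only on Z).
ice_cream_do = rng.uniform(20.0, 80.0, size=n)
sunscreen_do = 1.5 * temperature + rng.normal(0.0, 3.0, size=n)
r_intervened = float(np.corrcoef(ice_cream_do, sunscreen_do)[0, 1])

print(r_observed)    # high
print(r_intervened)  # close to zero
```

Once the link from $Z$ to $X$ is cut, the correlation disappears, which is exactly what a model trained on the observational data would fail to anticipate.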

Introduction to Causal Inference and do-Calculus

To model interventions, modern AI leverages Causal Inference frameworks, such as Judea Pearl's Structural Causal Models (SCMs) and the do-calculus. Causal models distinguish between passive observation $P(Y | X = x)$ ('what is the probability of $Y$ given we observe $X=x$') and active intervention $P(Y | \text{do}(X = x))$ ('what is the probability of $Y$ if we force $X$ to equal $x$'). By constructing causal graphs, we can identify which variables must be controlled for to estimate true causal effects. For example, if a set of observed variables $Z$ satisfies the Backdoor Criterion (it blocks every backdoor path from $X$ to $Y$ and contains no descendant of $X$), the interventional distribution is given by the adjustment formula:

$$P(Y \mid \text{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z) P(Z = z)$$
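The adjustment formula can be checked on simulated data where the true effect is known. The sketch below assumes binary $X$, $Y$, and $Z$, with $Z$ as the only confounder; the probabilities and the helper names `p_y_given` and `p_y_do` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Assumed data-generating process: Z confounds X and Y
z = rng.random(n) < 0.5                          # confounder
x = rng.random(n) < np.where(z, 0.8, 0.2)        # Z raises the chance of X = 1
y = rng.random(n) < (0.1 + 0.2 * x + 0.5 * z)    # true effect of X on P(Y=1) is +0.2

def p_y_given(x_val, z_val=None):
    """Empirical P(Y=1 | X=x_val) or P(Y=1 | X=x_val, Z=z_val)."""
    mask = (x == x_val) if z_val is None else (x == x_val) & (z == z_val)
    return y[mask].mean()

def p_y_do(x_val):
    """Backdoor adjustment: sum over z of P(Y=1 | X=x_val, Z=z) * P(Z=z)."""
    return sum(p_y_given(x_val, z_val) * np.mean(z == z_val) for z_val in (False, True))

naive = p_y_given(True) - p_y_given(False)   # biased by confounding
adjusted = p_y_do(True) - p_y_do(False)      # estimate of the causal effect

print(naive)     # around 0.5 under these assumed parameters
print(adjusted)  # around 0.2, the effect built into the simulation
```

The naive difference mixes the effect of $X$ with the effect of $Z$, while the adjusted estimate recovers the effect that was built into the simulation.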

If the confounders $Z$ cannot be observed, we can sometimes use the Frontdoor Criterion to estimate the causal effect through a mediator variable $M$ that intercepts the causal path from $X$ to $Y$ ($X \to M \to Y$). These criteria allow us to compute causal effects from purely observational data, provided the assumed causal graph is correct, which helps build models that are more robust to distribution shifts.
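For reference, the frontdoor adjustment is usually stated as follows. This form assumes discrete variables and that $M$ satisfies the Frontdoor Criterion: $M$ intercepts every directed path from $X$ to $Y$, there is no unblocked backdoor path from $X$ to $M$, and every backdoor path from $M$ to $Y$ is blocked by $X$. The summation variable $x'$ is introduced here only for notation.

$$P(Y \mid \text{do}(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x') P(X = x')$$

The outer sum captures the effect of $X$ on $M$, and the inner sum is a backdoor adjustment for the effect of $M$ on $Y$, using $X$ itself as the adjustment variable.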