Calculating Covariance Matrices in Pandas
In real AI projects, data arrives as tables — CSV files, SQL exports, spreadsheets. Pandas is the standard tool for working with tabular data. Its built-in covariance and correlation methods let you understand the relationships between all features at once, which guides feature selection and data cleaning before model training.
Covariance: Are These Features Moving Together?
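The sample covariance between two features $X$ and $Y$ is $\text{Cov}(X, Y) = \frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})$, where $n$ is the number of observations and $\bar{x}$, $\bar{y}$ are the feature means. A positive value means the features tend to move together; a negative value means they tend to move in opposite directions; a value near zero means there is no linear relationship.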
The full covariance matrix computes this for every pair of features simultaneously — an $F \times F$ matrix for $F$ features.
Computing with Pandas
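As a minimal sketch, `DataFrame.cov()` computes the full sample covariance matrix in one call. The toy dataset and its column names are hypothetical and chosen only for illustration. One entry is also recomputed by hand with the $n-1$ formula above to show that the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 58.0, 69.0, 80.0, 93.0],
    "shoe_size": [36.0, 38.0, 41.0, 43.0, 46.0],
})

# DataFrame.cov() returns the F x F sample covariance matrix (divides by n - 1).
cov_matrix = df.cov()
print(cov_matrix)

# Recompute a single entry directly from the covariance formula.
x = df["height_cm"]
y = df["weight_kg"]
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(df) - 1)
print(manual_cov)
```

The diagonal of the matrix holds each feature's variance, and the matrix is symmetric because $\text{Cov}(X, Y) = \text{Cov}(Y, X)$.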
Correlation: The Normalised Version
Covariance is hard to interpret because its magnitude depends on the units and scale of the data. Correlation normalises it to a value between -1 and 1: $r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$, where $\sigma_X$ and $\sigma_Y$ are the standard deviations of the two features. A correlation of 0.9 indicates a strong positive linear relationship; -0.8 indicates a strong negative one; 0.0 indicates no linear relationship.
Pearson Correlation Matrix
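A short sketch of the Pearson correlation matrix using `DataFrame.corr()`. The dataset is again hypothetical. The second half rebuilds the same matrix from the covariance matrix and the standard deviations, following $r = \text{Cov}(X, Y) / (\sigma_X \sigma_Y)$.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 58.0, 69.0, 80.0, 93.0],
    "tv_hours": [30.0, 25.0, 22.0, 15.0, 10.0],
})

# Pearson is the default method; it is spelled out here for clarity.
corr_matrix = df.corr(method="pearson")
print(corr_matrix)

# Rebuild the correlation matrix from covariance and standard deviations.
# DataFrame.std() uses ddof=1 by default, matching the n - 1 in DataFrame.cov().
std = df.std()
manual_corr = df.cov() / np.outer(std, std)
print(manual_corr)
```

Every diagonal entry is 1 because each feature is perfectly correlated with itself, and rescaling a column (for example, centimetres to metres) leaves the matrix unchanged.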
Using Correlation for Feature Selection
Before training a model, consider removing highly correlated features to reduce redundancy (multicollinearity). The algorithm: compute the correlation matrix, find feature pairs whose absolute correlation exceeds a threshold, and drop one feature from each pair.
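One possible way to sketch this procedure is shown below. The helper name `drop_correlated_features`, the default threshold of 0.9, and the toy columns are all assumptions made for illustration, not a prescribed method. The sketch compares absolute correlations so that strongly negative pairs are also caught, and it always drops the later column of each pair, which is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df, threshold=0.9):
    """Hypothetical helper: drop one feature from each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is too correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical data: "b" is roughly 2 * "a", while "c" is unrelated to both.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],
    "c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

reduced, dropped = drop_correlated_features(df, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

In practice, which feature of a pair to keep is a judgment call. A common alternative is to keep the one more strongly related to the target or the one that is cheaper to collect.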