Calculating Covariance Matrices in Pandas

In real AI projects, data arrives as tables — CSV files, SQL exports, spreadsheets. Pandas is the standard tool for working with tabular data. Its built-in covariance and correlation methods let you understand the relationships between all features at once, which guides feature selection and data cleaning before model training.


Covariance: Are These Features Moving Together?

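The sample covariance between two features $X$ and $Y$ is $\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the feature means and $n$ is the number of rows. A positive value means the features tend to move together; a negative value means they tend to move in opposite directions; a value near zero means there is no linear relationship.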
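To make the formula concrete, the sketch below evaluates it by hand on a tiny sample and compares the result with `Series.cov`, which also divides by $n-1$ by default. The values in `x` and `y` are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Illustrative values only (hypothetical data)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)

# Sum of products of deviations from the mean, divided by n - 1
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# pandas computes the same sample covariance
pandas_cov = x.cov(y)

print(manual_cov)  # 1.5
print(pandas_cov)  # 1.5
```

The deviations of `x` are (-2, -1, 0, 1, 2) and those of `y` are (-2, 0, 1, 0, 1), so the products sum to 6 and dividing by $n - 1 = 4$ gives 1.5.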

The full covariance matrix computes this for every pair of features simultaneously — an $F \times F$ matrix for $F$ features.

Computing with Pandas

<pre><code class="language-python">import pandas as pd
import numpy as np

# Create a small dataset
df = pd.DataFrame({
    'height': np.random.normal(170, 10, 100),
    'weight': np.random.normal(70, 15, 100),
    'age': np.random.normal(35, 8, 100)
})

cov_matrix = df.cov()
print(cov_matrix)
# A 3x3 matrix showing covariance between all feature pairs
</code></pre>
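Two properties of the result are worth checking when reading a covariance matrix: it is symmetric, and its diagonal holds each feature's variance. The following is a minimal sketch of that check. The `demo` frame and the seeded generator are assumptions added here so the snippet runs on its own and gives repeatable numbers.

```python
import numpy as np
import pandas as pd

# Seeded generator so the numbers are repeatable (an addition for this sketch)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'height': rng.normal(170, 10, 100),
    'weight': rng.normal(70, 15, 100),
    'age': rng.normal(35, 8, 100),
})

cov = demo.cov()
print(cov.shape)  # (3, 3): one row and one column per feature

# Cov(X, Y) equals Cov(Y, X), so the matrix equals its transpose
is_symmetric = np.allclose(cov.values, cov.values.T)

# Cov(X, X) is the variance of X, so the diagonal matches demo.var()
diag_is_variance = np.allclose(np.diag(cov.values), demo.var().values)

print(is_symmetric, diag_is_variance)
```

Both `DataFrame.cov` and `DataFrame.var` use the $n-1$ denominator by default, which is why the diagonal comparison holds.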

Correlation: The Normalised Version

Covariance is hard to interpret because its magnitude depends on the scale of the data. Correlation normalises this to a value between -1 and 1: $r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$. A correlation of 0.9 means strongly positive; -0.8 means strongly negative; 0.0 means no linear relationship.
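The scale dependence is easy to see in code. The sketch below computes $r$ directly from the formula, compares it with `Series.corr`, and then rescales one feature by 100 to show that the covariance changes while the correlation does not. The columns `a` and `b` are synthetic and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(0, 1, 200)
demo = pd.DataFrame({
    'a': a,
    'b': 2.0 * a + rng.normal(0, 0.5, 200),  # b depends linearly on a, plus noise
})

# r = Cov(X, Y) / (sigma_X * sigma_Y), using sample statistics throughout
cov_ab = demo['a'].cov(demo['b'])
r_manual = cov_ab / (demo['a'].std() * demo['b'].std())
r_pandas = demo['a'].corr(demo['b'])

# Change the unit of 'a' (for example metres to centimetres)
scaled = demo.assign(a=demo['a'] * 100)
cov_scaled = scaled['a'].cov(scaled['b'])
r_scaled = scaled['a'].corr(scaled['b'])

print(np.isclose(r_manual, r_pandas))          # formula agrees with pandas
print(np.isclose(cov_scaled, 100 * cov_ab))    # covariance scales with the unit
print(np.isclose(r_scaled, r_pandas))          # correlation is unchanged
```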

Pearson Correlation Matrix

<pre><code class="language-python">corr_matrix = df.corr()  # Default: Pearson
print(corr_matrix)

# High correlation (>0.8 or <-0.8) between two features
# often means they are redundant, so you can drop one.
# Note: the diagonal is always True, because every feature
# correlates perfectly with itself.
mask = corr_matrix.abs() > 0.8
print(mask)
</code></pre>
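A boolean mask over the full matrix flags the diagonal and reports every pair twice. One way to get a readable list instead is to keep only the strict upper triangle and stack it into a Series of pairs. This is a sketch, not the only approach; the `demo` frame, with a height column duplicated in inches, is a hypothetical example built to contain one redundant pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
height = rng.normal(170, 10, 100)
demo = pd.DataFrame({
    'height_cm': height,
    'height_in': height / 2.54,        # same quantity in a different unit
    'age': rng.normal(35, 8, 100),
})

corr = demo.corr()

# Strict upper triangle (k=1 excludes the diagonal); everything else becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# One entry per (row feature, column feature) pair.
# Any NaN entries that remain fail the comparison below, so they are filtered out.
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.8]

print(high_pairs)
```

Here only the `('height_cm', 'height_in')` pair survives the filter, since the two columns are exact linear rescalings of each other.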

Using Correlation for Feature Selection

Before training a model, remove highly correlated features to reduce redundancy (multicollinearity). The algorithm: compute the correlation matrix, find feature pairs above a threshold, and drop one from each pair.

Dropping Correlated Features

<pre><code class="language-python">def drop_correlated(df, threshold=0.85):
    corr = df.corr().abs()

    # Upper triangle only (avoid double-counting)
    upper = corr.where(
        np.triu(np.ones(corr.shape), k=1).astype(bool)
    )

    # Find features to drop
    to_drop = [col for col in upper.columns
               if any(upper[col] > threshold)]
    return df.drop(columns=to_drop)

df_reduced = drop_correlated(df, threshold=0.85)
print("Remaining features:", list(df_reduced.columns))
</code></pre>
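Because the sample dataset draws each feature independently, the function above will most likely drop nothing on it. To see it act, the sketch below adds a deliberately redundant column. The function body is repeated so the snippet runs on its own, and `height_noisy` is a hypothetical feature invented for this demonstration.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.85):
    # Same logic as the function above, restated so this snippet is self-contained
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(3)
height = rng.normal(170, 10, 100)
demo = pd.DataFrame({
    'height': height,
    'weight': rng.normal(70, 15, 100),
    'height_noisy': height + rng.normal(0, 1, 100),  # near-copy of height
})

reduced = drop_correlated(demo, threshold=0.85)
print("Remaining features:", list(reduced.columns))
```

Since the check runs over columns of the upper triangle, the later column of each correlated pair is the one removed; here that is `height_noisy`, leaving `height` and `weight`. If a particular feature should be kept, place it earlier in the column order.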