Correlation Matrices and Heatmaps

A correlation matrix holds the pairwise Pearson correlation coefficients between all numeric features, and a heatmap renders it visually. Together they make it quick to spot strongly correlated (potentially redundant) feature pairs and features that correlate with the target.

Computing and Visualizing Correlations

The Pearson correlation coefficient ranges from −1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship. Values above 0.8 or below −0.8 between two features often indicate redundancy.
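To make the range concrete, here is a minimal sketch that computes the coefficient by hand as the covariance divided by the product of the standard deviations. The helper name `pearson_r` and the toy arrays are illustrative assumptions, not part of any library.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of standard deviations.

    Hypothetical helper for illustration; np.corrcoef gives the same value.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both series
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos = pearson_r(x, 2 * x + 1)        # exact increasing line
r_neg = pearson_r(x, -3 * x + 10)      # exact decreasing line
r_zero = pearson_r(x, (x - 3) ** 2)    # symmetric curve with no linear trend

print(r_pos, r_neg, r_zero)
```

The third case is worth noting: the two series are fully dependent, yet the coefficient is zero because the relationship has no linear component.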

Seaborn Heatmap

<pre><code class="language-python">import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv")
corr = df.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    center=0,
    square=True,
    linewidths=0.5
)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()</code></pre>

Using Correlations for Feature Selection

When two features have a very high mutual correlation (for example, |r| > 0.9), they carry largely the same linear information, so one of the pair is a candidate for removal. Features with a high absolute correlation to the target are strong candidates for inclusion.

Identifying Highly Correlated Pairs

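<pre><code class="language-python">import numpy as np

# Mask the upper triangle and the diagonal so each pair appears only once
mask = np.triu(np.ones_like(corr, dtype=bool))
corr_unstacked = corr.mask(mask).stack()  # long format: (feature_a, feature_b) -> r

# Keep pairs whose absolute correlation exceeds the chosen threshold
high_corr = corr_unstacked[corr_unstacked.abs() > 0.85]
print("Highly correlated pairs:")
print(high_corr.sort_values(ascending=False))</code></pre>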
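Once the pairs are identified, one simple option is to drop one feature from each pair. The sketch below uses a greedy rule, assuming it is acceptable to always drop the later column; the helper `drop_correlated`, the 0.9 threshold, and the synthetic columns are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair whose |r| exceeds `threshold`.

    Hypothetical helper: a greedy rule that ignores how useful each
    feature is for predicting the target.
    """
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic data: "a_scaled" is almost a linear copy of "a"; "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_correlated(demo, threshold=0.9)
print("Dropped:", dropped)
print("Remaining:", list(reduced.columns))
```

In practice it may be better to keep whichever member of a pair correlates more strongly with the target, rather than relying on column order.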

Target Correlation

<pre><code class="language-python"># Correlations with the target variable
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)
print(target_corr)

# High values → likely predictive features
# Near zero → possibly uninformative (but check non-linear relationships too)</code></pre>
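The caveat about non-linear relationships can be illustrated with a small sketch. It assumes synthetic data with two made-up targets, and uses Spearman rank correlation (available through `DataFrame.corr(method="spearman")`) as one possible complementary check, not the only one.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)
demo = pd.DataFrame({
    "x": x,
    "target_sym": x ** 2,           # symmetric, non-monotonic dependence on x
    "target_mono": np.exp(5 * x),   # monotonic but strongly curved
})

pearson = demo.corr(method="pearson")
spearman = demo.corr(method="spearman")

# Pearson is close to zero even though target_sym is fully determined by x
r_sym = pearson.loc["x", "target_sym"]

# Pearson understates a curved monotonic relationship; Spearman captures it
r_mono_pearson = pearson.loc["x", "target_mono"]
r_mono_spearman = spearman.loc["x", "target_mono"]

print(f"x vs target_sym  (Pearson):  {r_sym:.3f}")
print(f"x vs target_mono (Pearson):  {r_mono_pearson:.3f}")
print(f"x vs target_mono (Spearman): {r_mono_spearman:.3f}")
```

Rank correlation only helps with monotonic relationships; the symmetric case still needs a scatter plot or a transformed feature (here, `x ** 2`) to show up.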