Detecting Multicollinearity
Multicollinearity occurs when two or more features are highly linearly correlated with each other. It inflates the variance of the regression coefficients, making them unstable and hard to interpret. It generally does not hurt predictive accuracy on data that resembles the training set, but it severely hampers feature importance analysis.
Detecting Multicollinearity with VIF
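The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated due to its correlation with other features. For feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing feature j on all the other features, so a VIF of 1 means no inflation. As common rules of thumb, a VIF above 10 indicates severe multicollinearity and a VIF above 5 warrants investigation.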
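To make the definition concrete, here is a minimal sketch that computes VIF by hand: regress each column on the others (with an intercept) and apply 1 / (1 - R^2). The helper name `vif_from_r2` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def vif_from_r2(X: np.ndarray, j: int) -> float:
    """VIF of column j, from regressing it on the remaining columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Include an intercept so R^2 is the usual centered R^2.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)

# Hypothetical data: x2 is nearly a copy of x0, x1 is independent of both.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.1 * rng.normal(size=500)
X = np.column_stack([x0, x1, x2])

vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

In this setup the two near-duplicate columns should show VIFs far above 10, while the independent column should stay close to 1.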
Computing VIF with statsmodels
<pre><code class="language-python">import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")
X = df[["age", "income", "spend", "tenure"]].dropna()

# Center the features first: variance_inflation_factor does not add an
# intercept, so uncentered columns would give misleading VIFs.
# (Rescaling alone does not change VIF; the centering is what matters.)
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X), columns=X.columns
)

# One VIF per feature: each column is regressed on all the others.
vif_df = pd.DataFrame({
    "Feature": X_scaled.columns,
    "VIF": [variance_inflation_factor(X_scaled.values, i)
            for i in range(X_scaled.shape[1])]
})
print(vif_df.sort_values("VIF", ascending=False))</code></pre>
Resolving Multicollinearity
The main strategies for handling multicollinearity are: removing one of the correlated features, combining them through PCA, or using regularized regression (Ridge) which handles it implicitly.
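As a rough illustration of how Ridge stabilizes coefficients, the sketch below refits ordinary least squares and Ridge on many simulated datasets with two nearly collinear features and compares how much the fitted coefficients vary. The data-generating process, the helper `simulate_coefs`, and `alpha=1.0` are assumptions chosen for the demonstration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

def simulate_coefs(make_model, n_repeats=200, n=100, seed=0):
    """Fit a fresh model on repeated synthetic samples; return all coefficients."""
    rng = np.random.default_rng(seed)
    coefs = []
    for _ in range(n_repeats):
        x0 = rng.normal(size=n)
        x1 = x0 + 0.05 * rng.normal(size=n)  # nearly collinear with x0
        y = x0 + x1 + rng.normal(size=n)     # true coefficients are (1, 1)
        X = np.column_stack([x0, x1])
        coefs.append(make_model().fit(X, y).coef_)
    return np.array(coefs)

# Same seed, so both models see identical datasets.
ols_coefs = simulate_coefs(LinearRegression)
ridge_coefs = simulate_coefs(lambda: Ridge(alpha=1.0))

ols_std = ols_coefs.std(axis=0)
ridge_std = ridge_coefs.std(axis=0)
print("OLS   coefficient std:", ols_std)
print("Ridge coefficient std:", ridge_std)
```

The Ridge coefficients should vary much less from sample to sample than the OLS ones. The price is some bias toward zero, so `alpha` is usually tuned by cross-validation.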
Remediation Strategies
- Drop one feature: Remove the feature with the higher VIF from each correlated pair, then recompute the VIFs, since dropping a feature changes the values for the remaining ones.
- PCA: Collapse correlated features into uncorrelated principal components.
- Ridge Regression: L2 regularization shrinks correlated coefficients toward each other, stabilizing them without removal.
- Domain knowledge: Create a ratio or difference of the correlated features that has a cleaner meaning.