Detecting Multicollinearity

Multicollinearity occurs when two or more features are strongly linearly related to each other. It inflates the variance of the regression coefficients, making them unstable and hard to interpret. It generally does not hurt predictive accuracy on data that resembles the training set, but it severely hampers coefficient-based feature importance analysis.
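The contrast between unstable coefficients and stable predictions can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not a general proof: the true coefficients (2.0 and 1.0), the noise levels, and the evaluation point `x_new` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Evaluation point (intercept, x1, x2) lying on the x1 ~ x2 trend of the data.
x_new = np.array([1.0, 1.0, 1.0])

coefs, preds = [], []
for _ in range(300):
    # Draw a fresh sample in which x2 is almost a copy of x1.
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.05, size=200)
    y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=200)

    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(200), x1, x2])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)

    coefs.append(beta[1:])        # slopes for x1 and x2
    preds.append(x_new @ beta)    # prediction at the same point each time

coef_sd = np.std(coefs, axis=0)   # spread of each slope across samples
pred_sd = np.std(preds)           # spread of the prediction across samples
print("coefficient std:", coef_sd.round(2))
print("prediction std: ", round(float(pred_sd), 2))
```

Across resamples the individual slopes vary widely while the prediction at `x_new` barely moves, because the fitted slopes trade off against each other and their sum stays well determined.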


Detecting Multicollinearity with VIF

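The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated by that feature's correlation with the other features. For feature j it is defined as VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on all the other features. As a common rule of thumb, a VIF above 10 indicates severe multicollinearity, and a VIF above 5 warrants investigation.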
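To make the definition concrete, the sketch below computes VIF by hand as 1 / (1 - R^2), where R^2 comes from regressing each feature on the remaining ones with an intercept. The helper name `vif_from_r2` and the synthetic columns `a`, `b`, `c` are hypothetical and exist only for this illustration.

```python
import numpy as np

def vif_from_r2(X):
    """Return 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on the other columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b

vifs = vif_from_r2(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this setup `a` and `b` should land far above the threshold of 10, while `c` should stay close to the minimum possible value of 1.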

Computing VIF with statsmodels

<pre><code class="language-python">import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("data.csv")
X = df[["age", "income", "spend", "tenure"]].dropna()

# Standardize first. VIF itself does not depend on feature scale, but
# variance_inflation_factor fits its auxiliary regressions without an
# intercept, so the columns must be centered (or a constant column added).
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X),
    columns=X.columns,
)

# One VIF per feature: regress column i on all the other columns.
vif_df = pd.DataFrame({
    "Feature": X_scaled.columns,
    "VIF": [
        variance_inflation_factor(X_scaled.values, i)
        for i in range(X_scaled.shape[1])
    ],
})
print(vif_df.sort_values("VIF", ascending=False))</code></pre>

Resolving Multicollinearity

The main strategies for handling multicollinearity are removing one of the correlated features, combining them through PCA, or using regularized regression (Ridge), which stabilizes the coefficients without removing any feature.
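As a rough illustration of the Ridge option, the sketch below compares how much the fitted coefficients vary across repeated samples with and without an L2 penalty. It uses a closed-form solution in NumPy rather than any particular library's estimator; the helper `ridge_fit`, the penalty `alpha=10.0`, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centered data; alpha=0 reduces to OLS."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(1)
ols_coefs, ridge_coefs = [], []
for _ in range(200):
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.05, size=100)  # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=100)
    ols_coefs.append(ridge_fit(X, y, alpha=0.0))
    ridge_coefs.append(ridge_fit(X, y, alpha=10.0))

ols_sd = np.std(ols_coefs, axis=0)      # coefficient spread without penalty
ridge_sd = np.std(ridge_coefs, axis=0)  # coefficient spread with L2 penalty
print("OLS coefficient std:  ", ols_sd.round(2))
print("Ridge coefficient std:", ridge_sd.round(2))
```

The penalty trades a small amount of bias for a large reduction in coefficient variance; in practice `alpha` would be tuned, for example by cross-validation.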

Remediation Strategies

  • Drop one feature: Remove the feature with the higher VIF from each correlated pair, then recompute VIF, since dropping a feature changes the values for those that remain.
  • PCA: Collapse correlated features into uncorrelated principal components.
  • Ridge Regression: L2 regularization shrinks correlated coefficients toward each other, stabilizing them without removal.
  • Domain knowledge: Create a ratio or difference of the correlated features that has a cleaner meaning.
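
The first strategy in the list can be sketched as a loop that repeatedly drops the feature with the highest VIF until every remaining value is at or below a threshold. This is one possible implementation under stated assumptions: the helper names `vif` and `drop_high_vif` are hypothetical, the data is synthetic (only the column names are borrowed from the example above), and VIF is computed from the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 - R^2) from auxiliary regressions with an intercept.

```python
import numpy as np

def vif(X):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=10.0):
    """Drop the highest-VIF column until all VIFs are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Synthetic data: spend is almost a linear function of income.
rng = np.random.default_rng(0)
n = 400
age = rng.normal(size=n)
income = rng.normal(size=n)
spend = 0.5 * income + rng.normal(scale=0.05, size=n)
tenure = rng.normal(size=n)

X = np.column_stack([age, income, spend, tenure])
X_reduced, kept = drop_high_vif(X, ["age", "income", "spend", "tenure"])
print(kept)
```

Which member of a correlated pair gets dropped here depends only on which VIF happens to be larger, so when the two values are close it is usually better to let domain knowledge decide.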