Feature Importance Estimation during EDA
Estimating feature importance during EDA helps you decide which features are worth cleaning, engineering, and including in the model before you commit to expensive full training runs. Quick estimators such as mutual information and shallow tree ensembles produce a rough ranking in seconds on small to medium datasets. Treat that ranking as a screening signal rather than a final verdict.
Mutual Information for Feature Ranking
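Mutual information (MI) measures how much knowing a feature reduces uncertainty about the target, so it captures both linear and non-linear relationships. It is model-agnostic and works for classification (mutual_info_classif) and regression (mutual_info_regression) in scikit-learn. Scores are non-negative, and each feature is scored on its own, so MI does not reveal interactions between features.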
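To illustrate the point about non-linear relationships, here is a minimal sketch on synthetic data. The quadratic relationship, sample size, noise level, and variable names are illustrative assumptions rather than part of any real dataset. It compares the Pearson correlation with scikit-learn's mutual_info_regression for a feature whose effect on the target is purely non-linear.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic example (assumption): y depends on x only through x squared
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.05, size=2000)

# Pearson correlation only captures the linear part of the relationship
pearson_r = np.corrcoef(x, y)[0, 1]

# MI expects a 2D feature matrix, hence the reshape
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r:          {pearson_r:.3f}")
print(f"Mutual information: {mi:.3f}")
```

On data like this the correlation should sit close to zero while the MI score is clearly positive. Exact values depend on the seed and on the nearest-neighbor estimator's n_neighbors setting.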
Computing Mutual Information
<pre><code class="language-python">from sklearn.feature_selection import mutual_info_classif
import pandas as pd
import matplotlib.pyplot as plt

# X_train is assumed to be a numeric DataFrame and y_train a class-label target
mi = mutual_info_classif(X_train, y_train, random_state=42)
mi_df = pd.DataFrame({
    "feature": X_train.columns,
    "mutual_info": mi,
}).sort_values("mutual_info", ascending=False)

ax = mi_df.plot(kind="barh", x="feature", y="mutual_info",
                figsize=(8, 6), legend=False,
                title="Mutual Information with Target")
ax.invert_yaxis()  # put the highest-scoring feature at the top
ax.set_xlabel("Mutual Information Score")
plt.show()</code></pre>
Tree-Based Feature Importance
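A shallow RandomForestClassifier or ExtraTreesClassifier fits quickly during EDA and yields impurity-based feature importance scores. These often reflect non-linear relationships better than correlation coefficients do, but they are computed from the training data and tend to favor high-cardinality features.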
Quick RF Feature Importance
<pre><code class="language-python">from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Use a shallow, fast model for EDA
rf = RandomForestClassifier(n_estimators=50, max_depth=5,
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

fi = pd.Series(
    rf.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(fi.head(10))</code></pre>
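The same approach works with ExtraTreesClassifier. The sketch below is a hedged illustration on synthetic data: make_classification stands in for X_train and y_train, and the f0 to f7 feature names are made up. It also reports the spread of each feature's importance across the individual trees, which gives a rough sense of how stable the ranking is.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for X_train / y_train (assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

# Shallow, fast ensemble, mirroring the Random Forest settings
et = ExtraTreesClassifier(n_estimators=50, max_depth=5, random_state=0)
et.fit(X_demo, y_demo)

# Importance of each feature in every individual tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in et.estimators_])

summary = pd.DataFrame(
    {"importance": et.feature_importances_, "std": per_tree.std(axis=0)},
    index=feature_names,
).sort_values("importance", ascending=False)
print(summary)
```

A feature whose standard deviation is large relative to its importance is ranked inconsistently from tree to tree, so its position in the list deserves less trust.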
Permutation Importance for Reliability
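<pre><code class="language-python">import pandas as pd
from sklearn.inspection import permutation_importance

# Shuffle each feature on held-out data and measure the drop in the model's score.
# Reuses the fitted rf from above; X_val / y_val are assumed to be a validation split.
result = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10, random_state=42, n_jobs=-1
)
pi_df = pd.Series(
    result.importances_mean, index=X_val.columns
).sort_values(ascending=False)
print(pi_df.head(10))
# Generally more reliable than impurity-based importance because it is
# measured on held-out data rather than derived from training-time splits</code></pre>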
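To make the difference between the two measures concrete, the following sketch adds a pure-noise column to synthetic data and compares both importances side by side. Everything here is an assumption for illustration: the make_classification data, the column names, and the choice of an unrestricted-depth forest, which is used so that the trees have room to split on noise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data plus one pure-noise column (all names are illustrative)
X_arr, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3,
    n_redundant=0, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
X_demo["noise"] = np.random.default_rng(0).normal(size=len(X_demo))

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Unrestricted depth lets the trees split on noise while fitting the training set
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_tr, y_tr)

# Impurity importance comes from training-time splits;
# permutation importance is measured on the held-out split
perm = permutation_importance(rf_demo, X_va, y_va, n_repeats=10, random_state=0)
comparison = pd.DataFrame({
    "impurity": pd.Series(rf_demo.feature_importances_, index=X_tr.columns),
    "permutation": pd.Series(perm.importances_mean, index=X_va.columns),
})
print(comparison.sort_values("permutation", ascending=False))
```

In runs like this the noise column typically receives a non-zero impurity score, because the trees do split on it, while its permutation importance on the validation split stays close to zero.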