Mean Decrease in Impurity (MDI)
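In scikit-learn, a feature's MDI importance is the sum of the impurity reductions (Gini or entropy) from every split on that feature, each weighted by the fraction of samples reaching the node, normalized within each tree and then averaged over all trees. Features chosen for splits near the root, where many samples are affected, tend to receive the highest scores.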
Accessing MDI Importances
<pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Load the data and fit a forest on all samples
data = load_breast_cancer()
X, y = data.data, data.target
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X, y)

# MDI importances are available after fitting and sum to 1
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # top 10, most important first

# Reverse the order so the most important feature is drawn at the top
plt.barh([data.feature_names[i] for i in indices[::-1]],
         importances[indices[::-1]])
plt.xlabel('MDI Importance')
plt.title('Top 10 Feature Importances')
plt.tight_layout()
plt.show()</code></pre>
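The averaging over trees described above can be checked directly. The following is a minimal sketch, assuming a smaller forest (`n_estimators=50`) for speed; the names `per_tree`, `mdi_mean`, and `mdi_std` are illustrative, not part of the scikit-learn API. It recomputes the forest-level importances from the individual trees in `rf.estimators_` and also yields a per-feature spread across trees, which gives a rough sense of how stable each score is.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Smaller forest than in the main example, purely to keep the sketch fast
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)

# One row of normalized MDI importances per fitted tree
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

mdi_mean = per_tree.mean(axis=0)  # average over trees
mdi_std = per_tree.std(axis=0)    # spread across trees (illustrative)

# Should print True if the forest-level score is the per-tree average
print(np.allclose(mdi_mean, rf.feature_importances_))

# Show the three highest-scoring features with their spread
for i in np.argsort(mdi_mean)[::-1][:3]:
    print(f"{data.feature_names[i]}: {mdi_mean[i]:.4f} +/- {mdi_std[i]:.4f}")
```

A large spread relative to the mean suggests the ranking of that feature depends heavily on which trees happened to be grown.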
Permutation Importance
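Permutation importance randomly shuffles one feature's values at a time and measures the resulting drop in model performance. Because it depends only on the model's predictions, not on impurity statistics gathered during training, it avoids the bias of MDI toward high-cardinality features, and it can be computed on held-out data.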
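To make the mechanism concrete, here is a minimal hand-rolled sketch of the shuffle-and-score loop. The function `manual_permutation_importance` and its parameters are hypothetical names introduced for illustration; it assumes accuracy (via `model.score`) as the metric and is not how scikit-learn implements its own routine.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Illustrative helper: mean and std of the score drop per feature."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # accuracy on unshuffled data
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)


mean_drop, std_drop = manual_permutation_importance(model, X_test, y_test)
top = int(np.argmax(mean_drop))
print(f"Largest drop: {data.feature_names[top]} ({mean_drop[top]:.4f})")
```

A drop near zero, or slightly negative, means the model's held-out accuracy barely depends on that feature.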
Computing Permutation Importance
<pre><code class="language-python">from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hold out a test set so importances reflect generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature 10 times and record the drop in test score
result = permutation_importance(rf, X_test, y_test,
                                n_repeats=10, random_state=42, n_jobs=-1)

# Report the five features with the largest mean drop
perm_sorted = np.argsort(result.importances_mean)[::-1]
for i in perm_sorted[:5]:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")</code></pre>
MDI vs. Permutation: Which to Trust?
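MDI is fast, since it is a by-product of training, but it is computed from training data and is biased toward continuous and high-cardinality features. Permutation importance is slower but model-agnostic, and it can be evaluated on held-out data. It has its own pitfall: when features are strongly correlated, shuffling one leaves its information available through the others, so each can look less important than it is. Prefer permutation importance for feature ranking, especially when features differ in type or cardinality.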
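One way to see the difference is to append a column of pure random noise and compare how each method scores it. The sketch below is an illustrative experiment, not part of the original workflow; `noise`, `X_aug`, and `rf_probe` are hypothetical names. One would expect the noise column to pick up a small nonzero MDI score, since deep splits can fit noise in the training data, while its permutation importance on the test set should stay close to zero.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
rng = np.random.default_rng(0)

# Hypothetical probe: one continuous column unrelated to the target
noise = rng.normal(size=(data.data.shape[0], 1))
X_aug = np.hstack([data.data, noise])
noise_idx = X_aug.shape[1] - 1

X_train, X_test, y_train, y_test = train_test_split(
    X_aug, data.target, test_size=0.2, random_state=42)

rf_probe = RandomForestClassifier(n_estimators=100, random_state=42)
rf_probe.fit(X_train, y_train)

# MDI is derived from training-time splits
mdi_noise = rf_probe.feature_importances_[noise_idx]

# Permutation importance is measured on the held-out test set
perm = permutation_importance(rf_probe, X_test, y_test,
                              n_repeats=10, random_state=42)
perm_noise = perm.importances_mean[noise_idx]

print(f"Noise feature MDI:         {mdi_noise:.4f}")
print(f"Noise feature permutation: {perm_noise:.4f}")
```

The exact numbers depend on the random seed and the scikit-learn version, so treat them as indicative only.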
Feature Selection with Importances
Feature importance scores can drive dimensionality reduction by removing low-importance features, improving model speed and sometimes accuracy.
SelectFromModel
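<pre><code class="language-python">from sklearn.feature_selection import SelectFromModel

# threshold='median' keeps features whose importance is at or above the median
selector = SelectFromModel(rf, threshold='median')

# fit() trains a fresh clone of rf on the training data and uses its MDI importances
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
print(f"Features selected: {X_train_sel.shape[1]} of {X_train.shape[1]}")</code></pre>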
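Whether selection actually helps is an empirical question, so it is worth comparing the reduced model with the full one on held-out data. The following is a minimal self-contained sketch, assuming the same dataset and split as above; `full`, `reduced`, and `kept` are illustrative names. It applies the same selector to the test set, retrains on the kept columns, and reports both accuracies.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Fit the selector on training data only to avoid leaking test information
base = RandomForestClassifier(n_estimators=200, random_state=42)
selector = SelectFromModel(base, threshold='median').fit(X_train, y_train)

# The same column mask must be applied to both splits
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

full = RandomForestClassifier(n_estimators=200, random_state=42)
full.fit(X_train, y_train)
reduced = RandomForestClassifier(n_estimators=200, random_state=42)
reduced.fit(X_train_sel, y_train)

acc_full = full.score(X_test, y_test)
acc_reduced = reduced.score(X_test_sel, y_test)

# Names of the features that survived selection
kept = data.feature_names[selector.get_support()]
print(f"Kept {len(kept)} of {X_train.shape[1]} features")
print(f"Accuracy, all features:      {acc_full:.4f}")
print(f"Accuracy, selected features: {acc_reduced:.4f}")
```

On a single small test split the two accuracies can differ by chance, so cross-validation gives a more trustworthy comparison.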