Extracting Feature Importance from Forests

Random Forests provide built-in feature importance scores that rank which features most influence predictions, enabling feature selection and model interpretation.

Mean Decrease in Impurity (MDI)

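MDI importance for a feature is the total reduction in impurity (Gini or entropy) across all splits on that feature, with each split weighted by the share of samples reaching its node, averaged over all trees. scikit-learn normalizes the scores so they sum to 1. Splits near the root act on the most samples, so features used there tend to receive the highest scores.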
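To make the definition concrete, the sketch below recomputes MDI by hand from the fitted trees' internal arrays and compares it with `feature_importances_`. The helper `tree_mdi` and the variable names are illustrative, not part of scikit-learn, and the forest is kept small so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier


def tree_mdi(fitted_tree, n_features):
    """Normalized impurity decrease per feature for one fitted tree (illustrative helper)."""
    t = fitted_tree.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, nothing to credit
            continue
        # Impurity removed by this split, weighted by the samples on each side.
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += decrease
    total = importances.sum()
    return importances / total if total > 0 else importances


X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree scores, then renormalize so they sum to 1.
per_tree = np.array([tree_mdi(est, X.shape[1]) for est in forest.estimators_])
manual_mdi = per_tree.mean(axis=0)
manual_mdi /= manual_mdi.sum()

print("Matches feature_importances_:",
      np.allclose(manual_mdi, forest.feature_importances_))
```

In practice there is no reason to do this by hand; the point is only to show where the numbers come from.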

Accessing MDI Importances

<pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X, y)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # top 10

# Reverse the order so the most important feature is drawn at the top.
plt.barh([data.feature_names[i] for i in indices[::-1]],
         importances[indices[::-1]])
plt.xlabel('MDI Importance')
plt.title('Top 10 Feature Importances')
plt.show()</code></pre>

Permutation Importance

Permutation importance randomly shuffles one feature's values at a time and measures the resulting drop in model performance. Because it can be computed on held-out data, it avoids the bias of MDI toward high-cardinality features.
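The procedure is simple enough to write out directly. The following is a minimal sketch, assuming accuracy (the classifier's default `score`) as the performance metric; the function `manual_permutation_importance` and its parameters are hypothetical names chosen for illustration, not a scikit-learn API.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean and std of the score drop when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, breaking its link to the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

perm_mean, perm_std = manual_permutation_importance(model, X_test, y_test)
top = np.argsort(perm_mean)[::-1][:5]
for j in top:
    print(f"feature {j}: {perm_mean[j]:.4f} +/- {perm_std[j]:.4f}")
```

A score near zero, or slightly negative, means that shuffling the feature made no measurable difference on this test set.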

Computing Permutation Importance

<pre><code class="language-python">from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Refit on the training split only, so the test split stays unseen.
rf.fit(X_train, y_train)

result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)

# Report the five features whose shuffling hurts test accuracy the most.
perm_sorted = np.argsort(result.importances_mean)[::-1]
for i in perm_sorted[:5]:
    print(f"{data.feature_names[i]}: "
          f"{result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")</code></pre>

MDI vs. Permutation: Which to Trust?

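MDI is fast, since it comes for free with training, but it is biased toward continuous and high-cardinality features. It is also computed from training data, so it can credit features the forest has merely overfit to. Permutation importance is slower but model-agnostic and, when measured on held-out data, much less prone to that bias. It has a weakness of its own: when features are strongly correlated, shuffling one of them may barely change performance, which understates the importance of both. Prefer permutation importance for feature ranking, especially when features differ in type or cardinality.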
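One way to see the difference is to append a column of pure noise and check how each method scores it. This is a rough sketch under stated assumptions: the noise column, the seeds, and the split size are arbitrary choices for illustration, and the size of the effect will vary by dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Append a continuous random column that carries no information about y.
rng = np.random.default_rng(0)
X_noisy = np.column_stack([X, rng.normal(size=X.shape[0])])
noise_idx = X_noisy.shape[1] - 1

X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# MDI is derived from the training data the trees were grown on.
mdi_noise = forest.feature_importances_[noise_idx]

# Permutation importance is measured here on the held-out split.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
perm_noise = perm.importances_mean[noise_idx]

print(f"MDI of noise column:         {mdi_noise:.4f}")
print(f"Permutation (test) of noise: {perm_noise:.4f}")
```

Fully grown trees will usually find some split on a continuous noise column, so its MDI is typically above zero, while its permutation importance on test data should hover around zero.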

Feature Selection with Importances

Feature importance scores can drive dimensionality reduction by removing low-importance features, improving model speed and sometimes accuracy.

SelectFromModel

<pre><code class="language-python">from sklearn.feature_selection import SelectFromModel

# Keep features whose importance is at or above the median importance.
selector = SelectFromModel(rf, threshold='median')

# fit() clones rf and refits the clone on the training split.
selector.fit(X_train, y_train)

X_train_sel = selector.transform(X_train)
print(f"Features selected: {X_train_sel.shape[1]} of {X_train.shape[1]}")</code></pre>
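Whether dropping features helps is an empirical question. The sketch below is one way to check it: it wraps the selector and a downstream forest in a pipeline, so selection is redone inside each cross-validation fold, and compares the result with a forest trained on all features. The forest sizes and the 5-fold setup are arbitrary assumptions for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Baseline: a forest trained on every feature.
full_model = RandomForestClassifier(n_estimators=50, random_state=0)

# Selection and the final model live in one pipeline, so the selector
# only ever sees the training portion of each fold.
selected_model = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median",
    ),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

full_scores = cross_val_score(full_model, X, y, cv=5)
selected_scores = cross_val_score(selected_model, X, y, cv=5)

print(f"All features:      {full_scores.mean():.3f}")
print(f"Selected features: {selected_scores.mean():.3f}")
```

Keeping the selector inside the pipeline matters: selecting features on the full dataset before cross-validating would leak information from the validation folds into the selection step.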