Random Forests: Out-of-Bag (OOB) Error

Because each tree is trained on a bootstrap sample drawn with replacement, roughly 37% of the training samples (about 1/e) are left out of any given tree. These out-of-bag (OOB) samples serve as a built-in validation set at almost no extra cost.
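The 37% figure follows from the chance that a given sample is never picked in n draws with replacement, (1 - 1/n)^n, which approaches 1/e as n grows. The sketch below checks this numerically; the helper names, the dataset size of 1000, and the number of trials are illustrative choices, not part of any library.

```python
import numpy as np

def oob_fraction_theory(n):
    """Probability that a given sample is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n

def oob_fraction_simulated(n, n_trials=200, seed=0):
    """Average share of samples missing from a bootstrap draw, estimated by simulation."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trials):
        drawn = rng.integers(0, n, size=n)  # n indices sampled with replacement
        fractions.append(1.0 - np.unique(drawn).size / n)
    return float(np.mean(fractions))

n = 1000  # illustrative dataset size
theory = oob_fraction_theory(n)
simulated = oob_fraction_simulated(n)
print(f"theory: {theory:.4f}, simulated: {simulated:.4f}, 1/e: {np.exp(-1):.4f}")
```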

How OOB Estimation Works

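For each training sample, predictions are collected only from the trees that did not see that sample during training. The OOB error is the error of these aggregated predictions over all training samples. For large forests it typically tracks the cross-validation error closely, although each OOB prediction draws on only about 37% of the trees, so it can be slightly pessimistic.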
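To make the procedure concrete, here is a minimal hand-rolled sketch: it bags plain decision trees, tracks which samples each tree did not see, and aggregates class probabilities from those trees only. This is an illustration of the idea under simple assumptions (50 trees, integer labels 0..9 as in the digits data), not scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
n_samples = X.shape[0]
n_classes = np.unique(y).size   # labels are assumed to be 0..n_classes-1
n_trees = 50                    # illustrative forest size
rng = np.random.default_rng(0)

# Accumulated class probabilities from trees for which each sample was out-of-bag.
votes = np.zeros((n_samples, n_classes))

for i in range(n_trees):
    boot = rng.integers(0, n_samples, size=n_samples)  # bootstrap indices
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[boot] = False                             # True where the tree never saw the sample
    oob_idx = np.flatnonzero(oob_mask)

    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[boot], y[boot])

    # Add this tree's probabilities only for its out-of-bag samples.
    votes[oob_idx[:, None], tree.classes_] += tree.predict_proba(X[oob_idx])

# Score only samples that were out-of-bag for at least one tree.
has_vote = votes.sum(axis=1) > 0
oob_pred = votes[has_vote].argmax(axis=1)
oob_accuracy = float(np.mean(oob_pred == y[has_vote]))
print(f"Manual OOB accuracy: {oob_accuracy:.4f}")
```

Each tree contributes to a sample's vote only when that sample was absent from the tree's bootstrap draw, which is what keeps the estimate honest.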

Enabling OOB Scoring

<pre><code class="language-python">from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42, n_jobs=-1)
rf.fit(X, y)

print(f"OOB Accuracy: {rf.oob_score_:.4f}")
# rf.oob_decision_function_ has per-sample probabilities</code></pre>

OOB vs. Cross-Validation

OOB error is computed during training and needs no additional model fits, making it ideal for quick model assessment. Cross-validation is more reliable for small datasets but requires one full training run per fold.
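A quick way to see the trade-off is to compute both estimates on the same data. The sketch below fits one forest for the OOB score and five more for shuffled 5-fold CV; the forest size, fold count, and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_digits(return_X_y=True)

# OOB estimate: a single fit.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42, n_jobs=-1)
rf.fit(X, y)
oob_acc = rf.oob_score_

# 5-fold CV estimate: five separate fits of the same configuration.
cv_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(cv_model, X, y, cv=folds)
cv_acc = cv_scores.mean()

print(f"OOB accuracy:       {oob_acc:.4f}")
print(f"5-fold CV accuracy: {cv_acc:.4f} (+/- {cv_scores.std():.4f})")
```

On a dataset of this size the two numbers would be expected to land close together; a large gap is a hint to look more carefully at the data or the split.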

When to Prefer OOB

Use OOB when training data is large (making CV expensive) or when you want a fast, preliminary estimate of generalization. For final model selection, confirm OOB results with k-fold CV.

Convergence with n_estimators

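<pre><code class="language-python">import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# Refit a forest at each size and record its OOB error.
# With very few trees, some samples may never be out-of-bag and scikit-learn may warn.
tree_counts = range(10, 301, 10)
oob_errors = []
for n in tree_counts:
    rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=42)
    rf.fit(X, y)
    oob_errors.append(1 - rf.oob_score_)

plt.plot(tree_counts, oob_errors)
plt.xlabel('Number of Trees')
plt.ylabel('OOB Error')
plt.title('OOB Error vs. n_estimators')
plt.show()</code></pre>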

Interpreting OOB Decision Function

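rf.oob_decision_function_ gives per-sample class probabilities estimated from OOB predictions, as an array of shape (n_samples, n_classes). It is useful for calibration, threshold analysis, and identifying hard-to-classify samples. With very few trees, a sample may never be out-of-bag, in which case its row contains NaN.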
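As a brief sketch of how the array can be inspected, the example below recovers OOB class predictions and a per-sample confidence from it. The variable names and the forest size are illustrative; the NaN filter is a precaution for small forests.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

oob_proba = rf.oob_decision_function_          # shape (n_samples, n_classes)

# Drop rows for samples that were never out-of-bag (possible with few trees).
valid = ~np.isnan(oob_proba).any(axis=1)

# Predicted class and the probability assigned to it, per sample.
oob_pred = rf.classes_[np.argmax(oob_proba[valid], axis=1)]
confidence = oob_proba[valid].max(axis=1)
oob_acc = float(np.mean(oob_pred == y[valid]))

print(f"Shape: {oob_proba.shape}")
print(f"Accuracy from OOB probabilities: {oob_acc:.4f}")
print(f"Mean confidence: {confidence.mean():.4f}")
```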

Spotting Difficult Samples

In the binary case, samples whose OOB probability is close to 0.5 lie near the decision boundary. They are good candidates for additional data collection, feature engineering, or a check of labeling consistency.
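One possible way to surface such samples is to rank them by distance from 0.5. The sketch below uses the binary breast cancer dataset as a stand-in; the dataset choice and the 0.1 band are assumptions made for illustration and should be tuned to the problem at hand.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42)
rf.fit(X, y)

# OOB probability of the positive class for every training sample.
p_pos = rf.oob_decision_function_[:, 1]

# Distance from the 0.5 decision boundary; small values mean ambiguous samples.
margin = np.abs(p_pos - 0.5)

band = 0.1  # hypothetical width of the "ambiguous" zone around 0.5
ambiguous = np.flatnonzero(margin < band)
hardest = np.argsort(margin)[:10]  # the ten samples closest to the boundary

print(f"{ambiguous.size} of {X.shape[0]} samples fall within {band} of 0.5")
for i in hardest:
    print(f"sample {i}: label={y[i]}, OOB P(class 1)={p_pos[i]:.3f}")
```

The resulting indices can then be used to pull the corresponding rows for manual review.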