Pruning Decision Trees to Prevent Overfitting

Unpruned decision trees memorize training data perfectly but generalize poorly; pruning trims the tree to balance bias and variance.

Pre-Pruning (Early Stopping)

Pre-pruning stops tree growth early by enforcing constraints like maximum depth, minimum samples per split, or minimum samples per leaf before overfitting occurs.

Key Hyperparameters

<pre><code class="language-python">from sklearn.tree import DecisionTreeClassifier from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split X, y = load_breast_cancer(return_X_y=True) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) clf = DecisionTreeClassifier( max_depth=4, # limit tree depth min_samples_split=10, # need at least 10 samples to split min_samples_leaf=5, # leaf must have at least 5 samples random_state=42 ) clf.fit(X_train, y_train) print(f"Test Accuracy: {clf.score(X_test, y_test):.3f}")</pre>

Post-Pruning: Cost-Complexity Pruning

Cost-complexity pruning (also called minimal cost-complexity pruning or CCP) grows a full tree then prunes branches by minimizing R_alpha(T) = R(T) + alpha * |T|, where |T| is the number of leaves.

Finding the Best Alpha

<pre><code class="language-python">import matplotlib.pyplot as plt # Get alpha values and corresponding impurities path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train) ccp_alphas = path.ccp_alphas clfs = [] for alpha in ccp_alphas: clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42) clf.fit(X_train, y_train) clfs.append(clf) train_scores = [c.score(X_train, y_train) for c in clfs] test_scores = [c.score(X_test, y_test) for c in clfs] plt.plot(ccp_alphas, train_scores, label='Train') plt.plot(ccp_alphas, test_scores, label='Test') plt.xlabel('alpha'); plt.ylabel('Accuracy') plt.legend(); plt.title('Accuracy vs Alpha'); plt.show()</pre>

Selecting the Optimal Alpha

Choose the ccp_alpha where the test accuracy peaks. This can be wrapped inside GridSearchCV for robust selection across cross-validation folds.

Bias-Variance Trade-off in Pruning

Aggressive pruning increases bias (underfitting) while no pruning increases variance (overfitting). The goal is the sweet spot where validation error is minimized.

Practical Guidance

Start with max_depth between 3 and 10 and tune using cross-validation. Cost-complexity pruning is more principled but requires more computation; it is the recommended post-pruning method in scikit-learn.