Bagging (Bootstrap Aggregating) Concept

Bagging trains many copies of the same base learner on different bootstrap samples of the training set and aggregates their predictions to reduce variance.

Bootstrap Sampling

A bootstrap sample is created by drawing N observations with replacement from a dataset of size N. For large N, each bootstrap sample contains on average about 63.2% (1 - 1/e) of the distinct original observations. The remaining ~36.8% are never drawn and form that sample's out-of-bag (OOB) observations.
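The 63.2% figure can be checked numerically. The following is a minimal sketch using NumPy; the dataset size `N` and the seed are arbitrary assumptions chosen for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
N = 10_000                      # hypothetical dataset size

# Draw N row indices with replacement: this is one bootstrap sample.
boot_idx = rng.integers(0, N, size=N)

# Rows drawn at least once (in-bag) and rows never drawn (out-of-bag).
in_bag = np.unique(boot_idx)
oob = np.setdiff1d(np.arange(N), in_bag)

unique_fraction = in_bag.size / N
expected = 1 - (1 - 1 / N) ** N  # approaches 1 - 1/e as N grows

print(f"unique fraction:   {unique_fraction:.3f}")
print(f"expected fraction: {expected:.3f}")
print(f"OOB fraction:      {oob.size / N:.3f}")
```

The probability that a given row is never drawn in N draws is (1 - 1/N)^N, which is where the `expected` expression comes from.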

Why Sampling with Replacement?

Sampling with replacement generates diverse training sets without requiring additional data. Each base learner sees a slightly different view of the data, which introduces the diversity among learners that makes averaging effective. Because a learner never trains on its own OOB observations, those observations can serve as a built-in validation set.

Aggregation: Voting and Averaging

For classification, bagging aggregates predictions by majority vote, or by averaging predicted class probabilities when the base learner provides them. For regression, it takes the mean of all predictions.
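Both aggregation rules are short enough to write by hand. The sketch below uses made-up prediction arrays and a hypothetical helper named `majority_vote`. It illustrates the idea and is not how any particular library implements it.

```python
import numpy as np

# Hypothetical class predictions from 5 base classifiers on 4 samples
# (rows = estimators, columns = samples).
class_preds = np.array([
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 2, 1],
    [0, 1, 1, 2],
])

def majority_vote(preds):
    """Return the most frequent label per column (ties go to the smaller label)."""
    n_classes = preds.max() + 1
    # counts[c, j] = number of estimators that predicted class c for sample j
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

voted = majority_vote(class_preds)

# Hypothetical regression predictions from 3 base regressors on 2 samples.
reg_preds = np.array([
    [2.0, 10.0],
    [4.0, 12.0],
    [3.0, 11.0],
])
averaged = reg_preds.mean(axis=0)

print(voted)
print(averaged)
```

With these inputs, `voted` is `[0, 1, 2, 2]` and `averaged` is `[3.0, 11.0]`.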

BaggingClassifier in sklearn

<pre><code class="language-python">from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    max_samples=0.8,   # each bootstrap sample has 80% as many rows as the training set
    bootstrap=True,    # draw rows with replacement
    oob_score=True,    # score each row using only estimators that did not train on it
    random_state=42,
    n_jobs=-1,         # use all available CPU cores
)
bag.fit(X_train, y_train)

print(f"OOB Score: {bag.oob_score_:.3f}")
print(f"Test Score: {bag.score(X_test, y_test):.3f}")</code></pre>

Effect on Bias and Variance

Bagging primarily reduces variance while leaving bias roughly unchanged, because averaging many fits of the same learner does not change its expected prediction. It works best with high-variance, low-bias base learners such as fully grown decision trees. Stable, low-variance models (like linear regression) gain little from bagging.
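The variance reduction can be illustrated with a small simulation. This sketch rests on an idealization: each of B predictors has zero-mean error with variance sigma^2 and a common pairwise correlation rho, in which case the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B. The function names and all parameter values are assumptions made for illustration.

```python
import numpy as np

def simulate_ensemble_variance(rho, B=25, sigma=1.0, n_trials=20_000, seed=0):
    """Empirical variance of the mean of B predictor errors with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_trials, 1))   # error component common to all predictors
    private = rng.normal(size=(n_trials, B))  # error component unique to each predictor
    errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)
    return errors.mean(axis=1).var()

def theoretical_variance(rho, B=25, sigma=1.0):
    """Variance of the average under the equal-correlation idealization."""
    return rho * sigma**2 + (1 - rho) * sigma**2 / B

for rho in (0.0, 0.5):
    empirical = simulate_ensemble_variance(rho)
    print(f"rho={rho}: empirical={empirical:.3f}, theory={theoretical_variance(rho):.3f}")
```

With uncorrelated errors the variance shrinks by a factor of B. With correlated errors it cannot fall below rho * sigma^2 however many predictors are averaged, which is why diversity among the base learners matters.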

Bagging vs. Random Forests

Random Forests extend bagging by also considering only a random subset of features at each split, which further decorrelates the trees and often improves generalization.

Key Difference

In plain bagging, each tree may consider all features at every split. Random Forests restrict each split to a random subset of features (commonly sqrt(n_features) for classification), which reduces correlation between trees and typically yields better generalization than plain bagging.
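The two approaches can be compared side by side in scikit-learn. The sketch below reuses the breast cancer dataset and split from the earlier example. The variable names and hyperparameter values are illustrative assumptions, and a single train/test split is not enough to conclude that one method is better.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Plain bagging: the default base learner is a decision tree, and every
# split may consider all features.
bagging = BaggingClassifier(n_estimators=100, random_state=42)

# Random forest: each split considers a random subset of sqrt(n_features) features.
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    random_state=42,
)

scores = {}
for name, model in [("bagging", bagging), ("random forest", forest)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

For a more reliable comparison, the same two models could be scored with cross-validation over several seeds.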