Bagging (Bootstrap Aggregating) Concept
Bagging trains many copies of the same base learner on different bootstrap samples of the training set and aggregates their predictions to reduce variance.
Bootstrap Sampling
A bootstrap sample is created by drawing N observations with replacement from a dataset of size N. For large N, each bootstrap sample contains on average about 63.2% (1 - 1/e) of the distinct original observations; the observations that were never drawn are that sample's out-of-bag (OOB) observations.
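The 63.2% figure follows from the chance that a given observation is never drawn, (1 - 1/N)^N, which approaches 1/e (about 0.368) as N grows. A minimal NumPy sketch that checks this empirically; the helper name `bootstrap_indices`, the seed, and the dataset size are illustrative choices, not part of any library:

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return rng.integers(0, n, size=n)

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 10_000                       # illustrative dataset size
idx = bootstrap_indices(n, rng)

in_bag = np.unique(idx)                     # distinct rows that were drawn
oob = np.setdiff1d(np.arange(n), in_bag)    # rows never drawn (out-of-bag)

unique_fraction = in_bag.size / n
print(f"unique fraction: {unique_fraction:.3f}")
print(f"limit 1 - 1/e:   {1 - np.exp(-1):.3f}")
```

The `idx` array would be used to select the training rows for one base learner, and `oob` the rows left over for evaluating it.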
Why Sampling with Replacement?
Sampling with replacement generates diverse training sets without requiring additional data. Each base learner sees a slightly different view of the data, so the learners make partly different errors, and it is this diversity that lets averaging cancel some of the error out. Because each learner never saw its own OOB observations during training, they serve as a built-in validation set.
Aggregation: Voting and Averaging
For classification, bagging aggregates predictions by majority vote. For regression, it takes the mean of all predictions.
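Both aggregation rules can be sketched in a few lines of NumPy. The function names `majority_vote` and `mean_prediction` are hypothetical helpers, and the sketch assumes class labels are encoded as non-negative integers and that predictions are stacked as one row per model:

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) array of integer class labels >= 0."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # counts[c, j] = number of models that voted class c for sample j
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)   # ties resolve to the lowest class label

def mean_prediction(predictions):
    """predictions: (n_models, n_samples) array of regression outputs."""
    return np.asarray(predictions, dtype=float).mean(axis=0)

clf_preds = np.array([[0, 1, 2, 1],    # model 1
                      [0, 1, 1, 1],    # model 2
                      [1, 0, 2, 1]])   # model 3
reg_preds = np.array([[2.0, 3.0],
                      [4.0, 5.0],
                      [6.0, 1.0]])

votes = majority_vote(clf_preds)     # -> [0, 1, 2, 1]
means = mean_prediction(reg_preds)   # -> [4.0, 3.0]
```

Implementations may differ in detail: scikit-learn's BaggingClassifier, for instance, averages predicted class probabilities when the base estimator provides them and falls back to voting otherwise.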
BaggingClassifier in sklearn
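A minimal sketch of bagging decision trees with scikit-learn, using a synthetic dataset as a stand-in for real data. The dataset sizes, number of estimators, and seeds are arbitrary assumptions. The base estimator is passed positionally because its keyword name has changed between scikit-learn releases:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # fully grown trees: high variance, low bias
    n_estimators=50,           # number of bootstrap samples / base learners
    bootstrap=True,            # sample rows with replacement
    oob_score=True,            # score each row using only trees that did not see it
    random_state=0,
)
bag.fit(X_train, y_train)

oob_accuracy = bag.oob_score_             # accuracy estimated from OOB rows
test_accuracy = bag.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The OOB accuracy is computed without touching the held-out test set, so comparing the two numbers gives a rough check on how well the OOB estimate tracks held-out performance.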
Effect on Bias and Variance
Bagging primarily reduces variance without significantly increasing bias. It works best with high-variance, low-bias base learners such as fully grown decision trees. Stable, low-variance models such as linear regression gain little from bagging, because their predictions barely change from one bootstrap sample to the next.
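How much variance averaging removes depends on how correlated the base learners are. For B predictions with common variance s^2 and pairwise correlation rho, a standard result gives Var(mean) = rho * s^2 + (1 - rho) * s^2 / B, so adding learners shrinks only the second term. The sketch below checks this by simulation under toy assumptions (unit-variance Gaussian "predictions"; the function name and trial count are illustrative):

```python
import numpy as np

def variance_of_average(rho, n_models, n_trials=50_000, seed=0):
    """Empirical variance of the mean of n_models unit-variance predictions
    with pairwise correlation rho (toy Gaussian model)."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_trials, 1))          # error common to all models
    private = rng.standard_normal((n_trials, n_models))  # error specific to each model
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private
    return preds.mean(axis=1).var()

n_models = 25
for rho in (0.0, 0.3, 0.9):
    theory = rho + (1 - rho) / n_models
    simulated = variance_of_average(rho, n_models)
    print(f"rho={rho:.1f}  theory={theory:.3f}  simulated={simulated:.3f}")
```

With uncorrelated learners the variance falls as 1/B, while with highly correlated learners it stays close to rho * s^2 no matter how many are averaged.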
Bagging vs. Random Forests
Random Forests extend bagging by also randomly subsampling the candidate features at each split, which further decorrelates the trees and usually improves accuracy.
Key Difference
In plain bagging of decision trees, each tree considers all features at every split. Random Forests restrict each split to a random subset of the features (commonly sqrt(n_features) for classification), which reduces the correlation between trees and typically generalizes better than plain bagging.
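The two approaches can be compared side by side in scikit-learn. This is a sketch on synthetic data with arbitrary settings; which model scores higher depends on the dataset, so it shows how to run the comparison rather than a guaranteed outcome:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=400, n_features=25,
                           n_informative=5, random_state=0)

models = {
    # Plain bagging: every split may use any of the 25 features
    "bagged trees": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Random forest: each split considers a random subset of sqrt(25) = 5 features
    "random forest": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```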