Boosting Concepts: Sequential Learning

Boosting combines a sequence of weak learners (models only slightly better than random guessing) into a strong learner by training each model to focus on the mistakes of its predecessors.

The Weak Learner Principle

A weak learner only needs to perform slightly better than random chance; a decision stump (a depth-1 tree) is the usual example. Boosting theory (Schapire, 1990) shows that a learner which beats chance on every reweighting of the data can be boosted to arbitrarily high accuracy on the training set.
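
To make "slightly better than random" concrete, the training-error bound usually quoted for AdaBoost (a later algorithm, not the 1990 construction itself) is sketched below. The symbols are assumptions introduced here: H is the combined classifier after T rounds, \epsilon_t is the weighted error of the learner at round t, and \gamma_t = 1/2 - \epsilon_t is its edge over chance.

```latex
\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[H(x_i) \neq y_i\right]
  \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t\,(1-\epsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
  \;\le\; \exp\!\left(-2\sum_{t=1}^{T}\gamma_t^2\right)
```

If every round keeps an edge of at least some fixed \gamma > 0, the right-hand side is at most exp(-2 \gamma^2 T), so the training error shrinks exponentially in the number of rounds.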

Decision Stumps as Base Learners

Decision stumps make a single binary split on one feature. They are very fast to train, highly interpretable, and naturally weak — exactly what boosting needs. Each successive stump focuses on the samples the ensemble currently misclassifies.
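
A minimal pure-Python sketch of a weighted decision stump may help here. The function names (fit_stump, stump_predict), the {-1, +1} label encoding, and the midpoint thresholds are illustrative assumptions, not a reference implementation.

```python
def stump_predict(row, feature, threshold, polarity):
    # Predict `polarity` on the "<= threshold" side and `-polarity` on the other.
    return polarity if row[feature] <= threshold else -polarity


def fit_stump(X, y, weights):
    """Return (weighted_error, feature, threshold, polarity) of the best stump.

    X: list of feature rows; y: labels in {-1, +1}; weights: sample weights
    summing to 1. Returns None if no feature has two distinct values.
    """
    best = None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                # Weighted error = total weight of the misclassified samples.
                error = sum(
                    w
                    for row, label, w in zip(X, y, weights)
                    if stump_predict(row, feature, threshold, polarity) != label
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


# Toy data (made up for illustration): feature 0 separates the classes.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [1, 1, -1, -1]
error, feature, threshold, polarity = fit_stump(X, y, [0.25] * 4)
```

Because the error is weighted, passing a different weights list changes which split wins; that is the hook boosting uses to steer each new stump toward the currently misclassified samples.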

Sequential Reweighting

At each round, boosting increases the importance (weight) of misclassified samples so the next learner is forced to concentrate on hard examples. The final prediction is a weighted vote of all learners.
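
One round of this reweighting, in the AdaBoost style, can be sketched as follows. The text above does not commit to a specific update rule, so the exponential update, the vote weight alpha, and the function names are assumptions chosen for illustration; labels and predictions are assumed to be in {-1, +1}.

```python
import math


def adaboost_reweight(weights, y, predictions):
    """One AdaBoost-style round: return (alpha, new_weights).

    weights must sum to 1, and the weighted error is assumed to lie
    strictly between 0 and 1 (otherwise the logarithm is undefined).
    """
    error = sum(w for w, label, p in zip(weights, y, predictions) if p != label)
    alpha = 0.5 * math.log((1 - error) / error)  # this learner's vote weight
    # label * p is -1 for a mistake, so mistakes are scaled up by exp(alpha)
    # and correct samples are scaled down by exp(-alpha).
    scaled = [
        w * math.exp(-alpha * label * p)
        for w, label, p in zip(weights, y, predictions)
    ]
    total = sum(scaled)
    return alpha, [s / total for s in scaled]  # renormalize to sum to 1


def weighted_vote(alphas, predictions_per_learner):
    """Sign of the alpha-weighted sum of each learner's predictions."""
    n_samples = len(predictions_per_learner[0])
    scores = [
        sum(a * preds[i] for a, preds in zip(alphas, predictions_per_learner))
        for i in range(n_samples)
    ]
    return [1 if s >= 0 else -1 for s in scores]


y = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]  # only the last sample is misclassified
alpha, new_weights = adaboost_reweight([0.2] * 5, y, predictions)
```

In this toy round the single misclassified sample ends up holding half of the total weight, so the next learner cannot do well without getting it right.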

Boosting vs. Bagging

Bagging: Parallel; each model trained independently; reduces variance.
Boosting: Sequential; each model trained on reweighted data; primarily reduces bias.
Boosting can overfit if too many rounds are used without regularization.

Bias-Variance Perspective

Boosting mainly reduces bias (the systematic error of a model too simple to fit the data), which makes it well suited to high-bias base learners such as stumps. However, too many rounds can overfit (increase variance), so early stopping or shrinkage (a learning rate below 1) is used to control complexity.
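
Early stopping can be sketched as a simple rule over per-round validation errors. The function name best_round, the patience parameter, and the error values below are hypothetical; a real run would measure the errors on a held-out set after each boosting round.

```python
def best_round(validation_errors, patience):
    """Return the 1-based round to keep.

    Scans the per-round validation errors and stops once the best error
    seen so far has not improved for `patience` consecutive rounds
    (patience is assumed to be at least 1).
    """
    best_idx = 0
    for i, err in enumerate(validation_errors):
        if err < validation_errors[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop scanning
    return best_idx + 1


# Made-up validation errors: improvement stalls after round 4.
errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.20, 0.21, 0.15]
keep = best_round(errors, patience=3)
```

A small patience stops early and may miss a later improvement (round 8 in this made-up sequence); a larger patience trains longer before giving up.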

Learning Rate (Shrinkage)

The learning rate η scales the contribution of each new learner, trading off the number of iterations for robustness. Smaller η requires more trees but often generalizes better.
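
Written as an update rule, shrinkage looks roughly like the following. The notation is an assumption introduced here: F_m is the ensemble after m rounds, h_m is the learner fit at round m, and \eta is the learning rate.

```latex
F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \qquad 0 < \eta \le 1
```

With \eta = 1 each learner is added at full strength; with a smaller \eta each step moves the ensemble only part of the way, which is why more rounds are needed to reach the same fit.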

Typical Settings

Common practice is to set learning_rate between 0.01 and 0.3 and then tune n_estimators accordingly: a lower learning rate pairs with more estimators. Shrinkage is one of the most effective tools for keeping boosting from overfitting.
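
The interaction between the two settings can be seen in a small pure-Python sketch of gradient boosting for squared error with one-dimensional regression stumps. This is an illustration under stated assumptions, not any library's implementation: the function names and the toy data are made up, and only the parameter names learning_rate and n_estimators come from the text above.

```python
def fit_stump_regressor(x, residuals):
    """Best single-threshold split on 1-D inputs under squared error.

    Returns (threshold, left_mean, right_mean). Assumes x has at least
    two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left)
        sse += sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best[1], best[2], best[3]


def boost(x, y, learning_rate, n_estimators):
    """Return training predictions after n_estimators shrunken stump updates."""
    pred = [sum(y) / len(y)] * len(y)  # round 0: predict the mean
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, lmean, rmean = fit_stump_regressor(x, residuals)
        # Shrinkage: add only a learning_rate fraction of the new stump.
        pred = [
            pi + learning_rate * (lmean if xi <= thr else rmean)
            for xi, pi in zip(x, pred)
        ]
    return pred


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


# Toy data, made up for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.1, 0.9, 4.2, 8.8, 16.1, 24.9, 36.3, 48.7]
mse_few = mse(y, boost(x, y, learning_rate=0.1, n_estimators=10))
mse_many = mse(y, boost(x, y, learning_rate=0.1, n_estimators=200))
```

On the training data, mse_many comes out below mse_few because each shrunken step removes only part of the remaining residual. Whether the extra rounds also help on unseen data is what tuning n_estimators against a validation set is meant to decide.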