Binning and Discretization of Continuous Variables
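Binning converts a continuous feature into a set of discrete intervals (bins). When the bins are one-hot encoded, a linear model learns a separate coefficient for each interval, so it can fit a step-wise approximation of a non-linear relationship without requiring a more complex model. Binning also dampens small measurement errors, since nearby values usually fall into the same bin, and it can limit the influence of outliers, because an extreme value contributes only its bin membership, not its magnitude.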
Equal-Width vs Equal-Frequency Binning
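Equal-width binning divides the feature's range into intervals of equal size (e.g., 0–20, 20–40, 40–60). Equal-frequency (quantile) binning places the bin edges at quantiles so that each bin holds roughly the same number of observations. This makes it the better choice for skewed data, where equal-width bins tend to put most observations into one or two bins and leave the rest nearly empty.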
Binning with Pandas
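The following is a minimal sketch of both approaches, assuming pandas.cut for equal-width bins and pandas.qcut for equal-frequency bins. The income series is synthetic log-normal data made up for illustration, and the choice of four bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data (an assumption made for illustration only).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="income")

# Equal-width: four intervals of identical width across the observed range.
equal_width = pd.cut(income, bins=4)

# Equal-frequency: edges placed at the quartiles of the data.
equal_freq = pd.qcut(income, q=4)

# Count observations per bin, keeping the bins in interval order.
width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print(width_counts)
print(freq_counts)
```

On skewed data like this, the equal-width counts are typically dominated by the first bin, while the quantile bins are close to evenly filled.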
Discretization with scikit-learn
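KBinsDiscretizer supports three binning strategies through its strategy parameter ("uniform" for equal-width, "quantile" for equal-frequency, and "kmeans" for cluster-based bins) and works as a transformer inside sklearn pipelines. Its encode parameter controls the output: "ordinal" returns integer bin indices, while "onehot" (sparse) and "onehot-dense" return one-hot encoded bins. One-hot output generally suits linear models, which then fit one coefficient per bin; ordinal output is more compact and is enough when the downstream model can work with the bin order.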
Using KBinsDiscretizer
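A short sketch of the two output encodings is given here, assuming a single synthetic feature and five quantile bins; both choices are illustrative rather than recommended defaults. Recent scikit-learn releases may emit a FutureWarning about changing defaults for this class, which does not affect the result here.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic single-column feature (an assumption made for illustration only).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# Ordinal output: each value is replaced by its bin index (0 .. n_bins - 1).
ordinal = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_ord = ordinal.fit_transform(X)

# Dense one-hot output: one indicator column per bin.
onehot = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
X_oh = onehot.fit_transform(X)

print(ordinal.bin_edges_[0])  # learned bin edges for the single feature
print(X_ord[:5].ravel())      # bin indices of the first five rows
print(X_oh[:5])               # matching one-hot rows
```

The fitted bin_edges_ attribute stores the edges learned from the training data, so the same intervals are applied when transform is later called on new data.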
Trade-offs of Binning
Binning loses information because every value within a bin is treated as identical. Too few bins over-smooth the relationship; too many leave few observations per bin and tend to overfit. Treat the number of bins as a hyperparameter and select it with cross-validation. Binning is most valuable for linear models; tree-based models choose their own split points during training, so they rarely benefit from pre-binning.
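One way to select the number of bins by cross-validation is sketched below. It assumes a pipeline of KBinsDiscretizer followed by LinearRegression, tuned with GridSearchCV. The sine-shaped data, the candidate grid, and the R² scoring are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic non-linear relationship (an assumption made for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

# One-hot encoded bins let the linear model fit one level per interval.
pipe = make_pipeline(
    KBinsDiscretizer(encode="onehot-dense", strategy="quantile"),
    LinearRegression(),
)

# Candidate bin counts; the step name is generated by make_pipeline.
param_grid = {"kbinsdiscretizer__n_bins": [2, 4, 8, 16, 32]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_n_bins = search.best_params_["kbinsdiscretizer__n_bins"]
print(best_n_bins, search.best_score_)
```

Because the discretizer sits inside the pipeline, its bin edges are refit on each training fold, which keeps the validation folds from influencing the binning.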