Data Imputation: Mean, Median, and Mode
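Simple imputation replaces missing values with a representative statistic from the column: the mean for symmetric numeric data, the median for skewed data, and the mode for categorical columns. It preserves dataset size, but it also shrinks the column's variance and can weaken correlations with other columns, so it works best when only a small fraction of values is missing.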
Choosing the Right Statistic
The choice of imputation statistic depends on the column's type and distribution. The mean of a right-skewed column is pulled toward the long tail, so imputing with it overstates the typical value; the median is more robust to skew and outliers.
Mean vs Median vs Mode
- Mean: Best for normally distributed numeric columns with few outliers.
- Median: Best for skewed numeric columns or those with extreme outliers.
- Mode: Best for categorical or ordinal columns.
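The guidelines above can be illustrated with a small sketch. The `income` and `color` columns and their values are made up for illustration; the point is only to show how far the mean and median diverge on skewed data, and how the mode fills a categorical gap.

```python
import pandas as pd

# Hypothetical right-skewed numeric column: one extreme value, one missing entry.
income = pd.Series([30_000, 32_000, 35_000, 38_000, 500_000, None])

mean_fill = income.mean()      # pulled upward by the 500_000 outlier
median_fill = income.median()  # stays near the bulk of the data

# Hypothetical categorical column: the mode is the most frequent label.
color = pd.Series(["red", "blue", "red", None, "green"])
mode_fill = color.mode().iloc[0]

# Fill the gaps with the statistic suited to each column.
income_imputed = income.fillna(median_fill)
color_imputed = color.fillna(mode_fill)

print(mean_fill, median_fill, mode_fill)
```

Here the mean (127000.0) is more than three times the median (35000.0), so a mean-filled row would look like a high earner even though most observed rows sit between 30,000 and 38,000.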
Imputing with scikit-learn's SimpleImputer
SimpleImputer is the standard scikit-learn tool for basic imputation and integrates cleanly into a Pipeline. Fit it on the training data only, then use the fitted imputer to transform both the train and test sets; this keeps test-set statistics from leaking into the fill values.
Using SimpleImputer
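A minimal sketch of the fit-on-train, transform-both pattern, assuming small made-up arrays (`X_train`, `X_test`, `y_train`) in which `np.nan` marks missing entries. The `LinearRegression` step is only a placeholder estimator to show the imputer inside a pipeline; it is not part of the original text.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data; np.nan marks missing entries.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [100.0, 20.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = np.array([[np.nan, np.nan], [5.0, 40.0]])

# Standalone use: medians are learned from the training data only.
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # reuses the training medians
print(imputer.statistics_)  # per-column fill values learned in fit

# Pipeline use: fit() fits the imputer and the model on training data,
# and predict() applies the already-fitted imputer to new data.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

For categorical columns, `strategy="most_frequent"` fills with the mode instead; `strategy="mean"` is the default.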
Adding a Missingness Indicator
When missingness itself may be informative, add a binary indicator column alongside the imputed value. SimpleImputer supports this with add_indicator=True.
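As a rough sketch of what `add_indicator=True` produces, assume a made-up two-column array `X` in which only the first column has a missing value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: only column 0 contains a missing value.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

# Output columns: imputed column 0, column 1, then a 0/1 indicator
# that marks the rows where column 0 was originally missing.
print(X_out)
```

With the default settings, indicator columns are appended only for features that had missing values during `fit`, so the output here has three columns rather than four.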