Data Imputation: Mean, Median, and Mode

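Simple imputation replaces missing values with a representative statistic from the column: the mean for symmetric numeric data, the median for skewed data, and the mode for categorical columns. It preserves dataset size, but it shrinks the column's variance and can weaken correlations with other features, so the distortion stays small only when few values are missing.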


Choosing the Right Statistic

The choice of imputation statistic depends on the column's type and distribution. In a right-skewed column, the long tail pulls the mean above the typical value, so mean imputation fills gaps with values that are too high; the median is more robust to such skew.
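A small numeric illustration of the skew effect, using made-up income values (the numbers are assumptions chosen only to make the tail obvious):

```python
import numpy as np

# Hypothetical right-skewed incomes: most values are modest, one is extreme.
incomes = np.array([30_000, 32_000, 35_000, 38_000, 40_000, 1_000_000, np.nan])

# nanmean / nanmedian ignore the missing entry when computing the statistic.
mean_fill = np.nanmean(incomes)      # pulled upward by the single large value
median_fill = np.nanmedian(incomes)  # stays near the bulk of the data

print(f"mean fill:   {mean_fill:,.0f}")
print(f"median fill: {median_fill:,.0f}")
```

With these values the mean fill is roughly 195,833 while the median fill is 36,500, which is far closer to what most rows actually look like.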

Mean vs Median vs Mode

  • Mean: Best for normally distributed numeric columns with few outliers.
  • Median: Best for skewed numeric columns or those with extreme outliers.
  • Mode: Best for categorical or ordinal columns.
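The rules of thumb above could be turned into a rough per-column heuristic. The sketch below is only illustrative: `choose_strategy` is a hypothetical helper, and the skewness cutoff of 1.0 is an assumed convention rather than a fixed rule, so it is worth checking against a histogram before relying on it.

```python
import pandas as pd

def choose_strategy(series: pd.Series, skew_threshold: float = 1.0) -> str:
    """Suggest a SimpleImputer strategy name for one column (illustrative heuristic)."""
    # Non-numeric columns (strings, categoricals) get the mode.
    if not pd.api.types.is_numeric_dtype(series):
        return "most_frequent"
    # Series.skew() skips NaN by default; a large absolute value suggests a long tail.
    if abs(series.skew()) > skew_threshold:
        return "median"
    return "mean"

# Toy data: a symmetric column, a right-skewed column, and a categorical column.
df = pd.DataFrame({
    "height_cm": [160.0, 165.0, 170.0, 175.0, 180.0, None],
    "income": [30_000.0, 32_000.0, 35_000.0, 38_000.0, 1_000_000.0, None],
    "city": ["Paris", "Paris", "Lyon", None, "Paris", "Nice"],
})

strategies = {col: choose_strategy(df[col]) for col in df.columns}
print(strategies)
```

The returned strings match the `strategy` values that `SimpleImputer` accepts, so the result can be passed straight to its constructor.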

Imputing with scikit-learn's SimpleImputer

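SimpleImputer is scikit-learn's standard tool for basic imputation and integrates cleanly into a Pipeline. Fit it on the training data only, then apply the same learned statistics to both the train and test sets; this keeps the two consistent and prevents test-set information from leaking into preprocessing.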

Using SimpleImputer

<pre><code class="language-python">from sklearn.impute import SimpleImputer

# X_train and X_test are assumed to be pandas DataFrames with these columns.

# Median imputation for numeric columns: fit on train, reuse on test
num_imputer = SimpleImputer(strategy="median")
X_train_num = num_imputer.fit_transform(X_train[["age", "income"]])
X_test_num = num_imputer.transform(X_test[["age", "income"]])

# Mode imputation for categorical columns
cat_imputer = SimpleImputer(strategy="most_frequent")
X_train_cat = cat_imputer.fit_transform(X_train[["gender", "city"]])
X_test_cat = cat_imputer.transform(X_test[["gender", "city"]])</code></pre>
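To check what a fitted imputer has actually learned, its `statistics_` attribute can be inspected. The following is a minimal self-contained sketch with made-up data; note that the unusual value in the test set has no effect on the fill values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames; 99.0 in the test set is deliberately unusual.
X_train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0],
    "income": [40_000.0, np.nan, 60_000.0, 90_000.0],
})
X_test = pd.DataFrame({"age": [np.nan, 99.0], "income": [np.nan, np.nan]})

num_imputer = SimpleImputer(strategy="median")
num_imputer.fit(X_train)

# Per-column medians computed from X_train only.
learned = num_imputer.statistics_
print(dict(zip(X_train.columns, learned)))

# Test-set gaps are filled with the training medians.
X_test_filled = num_imputer.transform(X_test)
print(X_test_filled)
```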

Adding a Missingness Indicator

When missingness itself may be informative, add a binary indicator column alongside the imputed value. SimpleImputer supports this with add_indicator=True.

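<pre><code class="language-python">from sklearn.impute import SimpleImputer

# Median needs numeric input, so select the numeric columns explicitly
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_train[["age", "income"]])
# One binary indicator column is appended for each column that had NaNs during fit</code></pre>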
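The shape of the output is easiest to see on a tiny array. The sketch below uses made-up values; the third column has no missing entries, so under the default behaviour it should not receive an indicator column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix: columns 0 and 1 each have one NaN, column 2 is complete.
X = np.array([
    [1.0, 10.0, 100.0],
    [np.nan, 20.0, 200.0],
    [3.0, np.nan, 300.0],
])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)

# Indices of the input columns that received an indicator.
flagged = imputer.indicator_.features_

print(X_imputed.shape)  # 3 original columns plus one indicator per flagged column
print(X_imputed)
print(flagged)
```

The first three output columns hold the imputed data and the trailing columns are the 0/1 missingness flags, in the order given by `indicator_.features_`.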