Handling Imbalanced Classes: Undersampling
In classification problems like fraud detection or disease diagnosis, the class of interest may represent fewer than 1% of samples. A model trained on such data can reach high accuracy simply by predicting the majority class every time, while being useless at the task that matters: finding the rare cases.
Random Undersampling
Random undersampling removes randomly chosen majority-class samples until the classes are balanced or reach a chosen ratio. It is fast and simple, but it discards potentially useful training data, which matters most when the dataset is small.
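To make the mechanism concrete, here is a minimal pure-Python sketch of random undersampling. The function name random_undersample and the toy data are hypothetical, chosen only for illustration, and the sketch assumes the goal is equal class counts.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class has as many rows as the smallest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    # Size of the smallest class: every class is cut down to this count.
    n_min = min(len(indices) for indices in by_class.values())

    # Sample n_min indices per class without replacement.
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()  # preserve the original row order

    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): 95 majority samples (label 0), 5 minority samples (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # 5 samples of each class remain
```

Note that 90 of the 95 majority samples are thrown away here, which is the cost the method pays for its simplicity.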
Using RandomUnderSampler
Informed Undersampling with Tomek Links
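A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. Such pairs sit on the class boundary or are noise. Tomek-link undersampling removes the majority-class member of each pair, cleaning the decision boundary instead of discarding data at random. It usually removes only a small fraction of the majority class, so it does not balance the classes by itself and is often combined with another resampling method.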
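As an illustration of the idea, the sketch below finds Tomek links by brute force, treating a Tomek link as a pair of samples with different labels that are each other's nearest neighbors, and then drops the majority-class member of each link. The helper names and the toy points are hypothetical, and the quadratic search is only suitable for small examples.

```python
import math


def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    others = (j for j in range(len(X)) if j != i)
    return min(others, key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class sample from every Tomek link."""
    drop = {
        k
        for pair in tomek_links(X, y)
        for k in pair
        if y[k] == majority_label
    }
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy data (made up): one minority point (label 1) right next to a majority point.
X = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.1, 0.0], [3.0, 0.0], [3.1, 0.0]]
y = [0, 0, 0, 1, 0, 0]

links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)
print(links)    # the only link is between points 2 and 3
print(y_clean)  # point 2 (majority) is gone; the minority point stays
```

Only one majority sample is removed in this example, which shows why the method cleans the boundary without balancing the classes.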
Using TomekLinks
Evaluation Metrics for Imbalanced Data
Accuracy is misleading on imbalanced data. Use F1-score, Precision-Recall AUC, or Matthews Correlation Coefficient (MCC) to evaluate minority-class performance. Whatever resampling strategy you use, report these metrics on a test set that has not been resampled, so the scores reflect the real class distribution.
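The gap between accuracy and the other metrics can be seen on a small hand-made example using scikit-learn's metric functions. The labels and scores below are invented for illustration, and average precision is used here as one common estimate of Precision-Recall AUC.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    matthews_corrcoef,
)

# Made-up test set: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Made-up model scores: one negative is scored high (a false positive),
# one positive is scored high (a true positive), four positives are missed.
y_score = [0.1] * 94 + [0.9] + [0.8, 0.05, 0.05, 0.05, 0.05]

# Hard predictions at a 0.5 threshold.
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
pr_auc = average_precision_score(y_true, y_score)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}")
print(f"F1:       {f1:.3f}")
print(f"PR AUC:   {pr_auc:.3f}")
print(f"MCC:      {mcc:.3f}")
```

The model finds only one of the five positives, yet accuracy still looks strong, while F1, PR AUC, and MCC all stay low.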