Handling Imbalanced Classes: Oversampling (SMOTE)
SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic minority-class examples by interpolating between existing ones, rather than simply duplicating them. Because the new points are not exact copies, the minority class covers more of the feature space, which often reduces the overfitting that random oversampling (plain duplication) can cause.
How SMOTE Works
For each minority sample $x_i$, SMOTE finds its k nearest minority-class neighbors, picks one of them, $x_j$, at random, and creates a synthetic point on the line segment between the two: $x_{\text{new}} = x_i + \lambda \cdot (x_j - x_i)$, where $\lambda \in [0, 1]$ is drawn uniformly at random.
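The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy points are assumptions made for this example, and base samples are picked at random rather than visited in order.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a base minority sample x_i at random.
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # Find its k nearest minority neighbors (excluding itself), Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
        # Choose one neighbor x_j and a random lambda in [0, 1).
        x_j = rng.choice(neighbors)
        lam = rng.random()
        # Interpolate: x_new = x_i + lambda * (x_j - x_i).
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_j)])
    return synthetic


# Hypothetical toy minority class with two features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=3, k=2)
print(new_points)
```

Every synthetic point lies between two real minority points, so it always stays inside the region the minority class already occupies.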
Applying SMOTE with imbalanced-learn
SMOTE Variants and Pipelines
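Standard SMOTE treats every minority sample the same way, so it can generate noisy synthetic points where the classes overlap while spending much of its budget on regions that are already easy to classify. Variants such as BorderlineSMOTE and SVMSMOTE concentrate synthetic generation on minority samples near the decision boundary (SVMSMOTE uses the support vectors of an SVM to locate it), which can improve model performance on hard-to-classify cases.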
Using SMOTE in an imblearn Pipeline
Critical Rule: SMOTE Only on Training Data
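Never apply SMOTE before splitting. If synthetic samples are generated from the full dataset, points interpolated from test-set examples end up in the training data, so information leaks from the test set and evaluation scores become optimistically biased. Always place SMOTE inside the pipeline or apply it only to X_train (with y_train) after splitting; the same applies to every fold in cross-validation. The test set must remain original and unmodified.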