Handling Imbalanced Classes: Oversampling (SMOTE)

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority-class examples by interpolating between existing ones rather than duplicating them. The added variety often reduces the overfitting that random oversampling causes by repeating exact copies, although the benefit depends on the dataset and the model.

How SMOTE Works

For each minority sample x_i, SMOTE finds its k nearest minority-class neighbors, picks one of them at random (x_j), and creates a synthetic point on the line segment connecting the two: x_new = x_i + \lambda \cdot (x_j − x_i), where \lambda \in [0, 1] is drawn uniformly at random.
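The interpolation step can be sketched in a few lines of NumPy. This is an illustrative sketch only, not imbalanced-learn's implementation: the function name `smote_sketch`, its parameters, and the brute-force neighbor search are assumptions made for the example.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Illustrative SMOTE-style interpolation (hypothetical helper).

    X_min holds minority-class samples only and needs at least two rows.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot have more neighbors than other samples

    # Brute-force pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest

    base = rng.integers(0, n, size=n_new)  # index of x_i for each new point
    pick = rng.integers(0, k, size=n_new)  # which of the k neighbors becomes x_j
    x_i = X_min[base]
    x_j = X_min[neighbors[base, pick]]
    lam = rng.random((n_new, 1))  # lambda drawn uniformly from [0, 1)
    return x_i + lam * (x_j - x_i)


# Four minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)  # (6, 2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always stays inside the region those samples already span.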

Applying SMOTE with imbalanced-learn

<pre><code class="language-python">import pandas as pd
from imblearn.over_sampling import SMOTE

# Oversample only the minority class until it matches the majority count
smote = SMOTE(sampling_strategy="minority", k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print(pd.Series(y_res).value_counts())
# Minority class count now matches majority class
</code></pre>

SMOTE Variants and Pipelines

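Standard SMOTE interpolates from every minority sample, including outliers that sit inside majority regions, so it can produce noisy synthetic points. Variants such as BorderlineSMOTE and SVMSMOTE instead concentrate synthetic generation on minority samples near the decision boundary, which can improve model performance on some datasets.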

Using SMOTE in an imblearn Pipeline

<pre><code class="language-python">from imblearn.pipeline import Pipeline  # use imblearn's Pipeline!
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),  # resampling runs during fit only
    ("clf", RandomForestClassifier(random_state=42)),
])

pipe.fit(X_train, y_train)
# score() reports accuracy, which can look high on imbalanced data
print(pipe.score(X_test, y_test))
</code></pre>

Critical Rule: SMOTE Only on Training Data

Never apply SMOTE before splitting. Synthetic samples interpolated from the full dataset are built partly from points that later land in the test set, so test-set information leaks into training and evaluation scores become optimistic. Always place SMOTE inside the pipeline or apply it only to X_train and y_train after splitting. The test set must remain original and unmodified.