Handling Imbalanced Classes: Undersampling

In classification problems such as fraud detection or disease diagnosis, the class of interest may make up fewer than 1% of samples. A model trained on such data can reach high accuracy by predicting the majority class almost every time, while failing at the task that matters: finding the minority class.
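The effect is easy to reproduce with plain arithmetic. The sketch below assumes a hypothetical dataset of 1,000 samples with 10 positives and a "model" that always predicts the majority class. The numbers are illustrative, not taken from any real dataset.

```python
# Hypothetical dataset: 1,000 samples, 10 of them positive (1%).
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class (0).
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Minority recall: fraction of true positives the model actually found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")       # accuracy = 0.99
print(f"minority recall = {recall:.2f}")  # minority recall = 0.00
```

The model scores 99% accuracy without detecting a single positive case.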


Random Undersampling

Random undersampling removes randomly chosen majority-class samples until the classes reach a target ratio, often 1:1. It is fast and simple, but it discards potentially useful training data.
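To make the mechanics concrete, here is a minimal pure-Python sketch of random undersampling for a binary problem. The function `random_undersample` and its `ratio` parameter are illustrative names invented for this example, not part of any library, and the toy data is made up.

```python
import random
from collections import Counter

def random_undersample(X, y, majority_label, ratio=1.0, seed=42):
    """Keep every minority sample and a random subset of the majority class.

    ratio is the desired minority/majority count after resampling
    (1.0 = fully balanced, 0.5 = one minority sample per two majority samples).
    Assumes a binary problem: every label other than majority_label is minority.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]

    # Number of majority samples to keep, capped at what is available.
    n_keep = min(len(majority_idx), int(len(minority_idx) / ratio))
    kept = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy data: 95 majority (0) and 5 minority (1) samples.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_bal, y_bal = random_undersample(X, y, majority_label=0, ratio=1.0)
X_half, y_half = random_undersample(X, y, majority_label=0, ratio=0.5)

print(Counter(y_bal))   # 5 majority, 5 minority
print(Counter(y_half))  # 10 majority, 5 minority
```

Note that all minority samples survive in both cases; only the majority class shrinks, which is where the information loss comes from.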

Using RandomUnderSampler

<pre><code class="language-python">import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# sampling_strategy=0.5 means minority:majority = 1:2 after resampling
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

# Inspect the class counts after resampling
print(pd.Series(y_res).value_counts())
</code></pre>

Informed Undersampling with Tomek Links

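A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. Tomek-link undersampling removes the majority-class member of each such pair, which cleans up the decision boundary instead of discarding data at random. It typically removes far fewer samples than random undersampling, so on its own it rarely balances the classes.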
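A Tomek link is usually defined as a pair of opposite-class samples that are each other's nearest neighbors. The brute-force sketch below finds such pairs and drops their majority-class members. The helper names (`nearest_neighbor`, `tomek_links`, `remove_majority_in_links`) and the one-dimensional toy data are assumptions made for illustration; this is not how imbalanced-learn implements it internally.

```python
import math

def nearest_neighbor(i, X):
    """Index of the sample closest to X[i] (Euclidean distance), excluding i."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: the majority sample at 4.0 sits right next to the minority sample at 4.3.
X = [[0.0], [1.5], [2.5], [4.0], [4.3], [7.0]]
y = [0, 0, 0, 0, 1, 1]

links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)

print(links)    # [(3, 4)]
print(y_clean)  # [0, 0, 0, 1, 1]
```

Samples 1 and 2 are also mutual nearest neighbors, but they share a label, so they do not form a Tomek link. Only one majority sample is removed, which illustrates why this method cleans the boundary without balancing the classes.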

Using TomekLinks

<pre><code class="language-python">import pandas as pd
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_res, y_res = tl.fit_resample(X_train, y_train)

# TomekLinks only removes borderline majority samples,
# so it may not fully balance the dataset
print(pd.Series(y_res).value_counts())
</code></pre>

Evaluation Metrics for Imbalanced Data

Accuracy is misleading on imbalanced data. Use F1-score, Precision-Recall AUC, or the Matthews Correlation Coefficient (MCC) to evaluate how well a model performs on the minority class. Report these metrics for every resampling strategy you compare, measured on a held-out test set that keeps the original class distribution.
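To show how these metrics separate models that accuracy cannot, the sketch below computes F1 and MCC from confusion-matrix counts for two hypothetical models on a made-up test set of 1,000 samples with 10 positives. The helper names (`confusion_counts`, `f1_from_counts`, `mcc_from_counts`) are illustrative, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the minority label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc_from_counts(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical test set: 10 positives followed by 990 negatives.
y_true = [1] * 10 + [0] * 990

# Model A always predicts the majority class.
pred_a = [0] * 1000
# Model B finds 8 of the 10 positives at the cost of 12 false alarms.
pred_b = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

for name, pred in [("A", pred_a), ("B", pred_b)]:
    tp, fp, fn, tn = confusion_counts(y_true, pred)
    acc = (tp + tn) / len(y_true)
    f1 = f1_from_counts(tp, fp, fn)
    mcc = mcc_from_counts(tp, fp, fn, tn)
    print(f"model {name}: accuracy={acc:.3f}  F1={f1:.3f}  MCC={mcc:.3f}")
```

Model A has the higher accuracy, yet its F1 and MCC are both zero, while Model B scores clearly above zero on both. In practice, scikit-learn's `sklearn.metrics` module provides `f1_score`, `matthews_corrcoef`, and `average_precision_score` for the same purpose.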