The Problem with Standard K-Fold on Imbalanced Data
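If your dataset is 95% class 0 and 5% class 1, a plain random split does not control how many positives land in each fold. The positive rate drifts from fold to fold, and on small datasets a fold can end up with only one positive example or none at all, which makes metrics like recall and F1 unreliable.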
Demonstrating Class Drift
<pre><code class="language-python">from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.datasets import make_classification
import numpy as np

# Synthetic binary dataset with roughly 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

print("--- Standard KFold ---")
for i, (_, val_idx) in enumerate(KFold(5, shuffle=True, random_state=42).split(X, y)):
    pct = y[val_idx].mean() * 100  # share of positives in this validation fold
    print(f"Fold {i+1}: {pct:.1f}% positive")

print("\n--- StratifiedKFold ---")
for i, (_, val_idx) in enumerate(StratifiedKFold(5, shuffle=True, random_state=42).split(X, y)):
    pct = y[val_idx].mean() * 100
    print(f"Fold {i+1}: {pct:.1f}% positive")</code></pre>
Using StratifiedKFold in Practice
StratifiedKFold is a drop-in replacement for KFold in classification tasks: pass it as the cv argument wherever cross-validation is performed.
With cross_val_score
<pre><code class="language-python">from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# X, y are the imbalanced dataset created in the previous example
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# With an integer cv, cross_val_score already uses (unshuffled) StratifiedKFold
# for classifiers; passing the splitter explicitly makes the intent clear
# and lets you control shuffling and the random seed
scores = cross_val_score(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    X, y,
    cv=skf,
    scoring="f1",
)
print(f"Stratified CV F1: {scores.mean():.4f} +/- {scores.std():.4f}")</code></pre>
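The same splitter object can be handed to other scikit-learn utilities that accept a cv argument. The sketch below shows this with GridSearchCV; the dataset is regenerated so the block runs on its own, and the parameter grid and n_estimators value are hypothetical choices made only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (assumed setup, mirroring the earlier example)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hypothetical, deliberately tiny grid chosen only for illustration
param_grid = {"max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42),
    param_grid,
    cv=skf,        # every candidate is scored on the same stratified folds
    scoring="f1",
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best mean CV F1: {search.best_score_:.4f}")
```

Because the splitter has a fixed random_state, every parameter combination is compared on identical folds, so differences in score come from the parameters rather than from the split.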
Pairing with Imbalance Handling Techniques
Stratified splitting keeps the class proportions consistent across folds, which makes evaluation more reliable, but you may also want to handle imbalance during training. Common approaches include:
- class_weight="balanced", supported by many scikit-learn classifiers
- SMOTE oversampling via the imbalanced-learn library
- Threshold tuning on predicted probabilities
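Always apply oversampling to the training portion of each fold, never before splitting. Otherwise copies or synthetic neighbours of validation samples leak into training and inflate the scores.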
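A minimal sketch of in-fold oversampling follows. To avoid an extra dependency it uses plain random oversampling with sklearn.utils.resample as a stand-in for SMOTE, and LogisticRegression is an arbitrary model picked for illustration; neither choice comes from the text above. imbalanced-learn ships its own pipeline class intended to handle this step for you, which is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data (assumed setup, mirroring the earlier example)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    # Oversample the minority class using training rows only
    X_maj = X_train[y_train == 0]
    X_min = X_train[y_train == 1]
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=fold)

    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([
        np.zeros(len(X_maj), dtype=int),
        np.ones(len(X_min_up), dtype=int),
    ])

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

    # The validation fold is left untouched, so it keeps the original imbalance
    fold_scores.append(f1_score(y_val, model.predict(X_val)))

print(f"In-fold oversampling F1: {np.mean(fold_scores):.4f} +/- {np.std(fold_scores):.4f}")
```

The key point is the order of operations: split first, resample the training indices second, and score on validation data that was never resampled.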
Stratified Splits for Regression
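scikit-learn's stratified splitters (StratifiedKFold, StratifiedShuffleSplit) expect discrete class labels, so they cannot be applied directly to a continuous target. For regression tasks you can bin the target and stratify on the bins to approximate stratification.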
Binning Continuous Targets for Stratification
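<pre><code class="language-python">import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import StratifiedKFold

# Synthetic regression data for illustration; substitute your own X and continuous target
X, y_continuous = make_regression(n_samples=1000, n_features=20, random_state=42)

# Bin the continuous target into quartiles (equal-frequency bins) for stratification;
# pd.cut(..., bins=4) would give equal-width bins instead
y_binned = pd.qcut(y_continuous, q=4, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y_binned):
    # Split on the bins, but fit and score on y_continuous[train_idx] / y_continuous[val_idx]
    pass  # each fold has representative coverage of the target range</code></pre>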
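One way to check that stratifying on bins gives each fold representative coverage of the target range is to count how many validation samples fall in each bin, with and without stratification. The sketch below assumes synthetic data from make_regression and quartile bins from pd.qcut; the sample size, seeds, and the bin_counts dictionary are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, StratifiedKFold

# Small synthetic regression problem (assumed setup for illustration)
X, y_continuous = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Quartile labels 0..3 used only for splitting, never as the training target
y_binned = pd.qcut(y_continuous, q=4, labels=False)

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedKFold on bins": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}

bin_counts = {}
for name, splitter in splitters.items():
    print(f"--- {name} ---")
    bin_counts[name] = []
    # KFold ignores the labels passed to split(); StratifiedKFold balances on them
    for i, (_, val_idx) in enumerate(splitter.split(X, y_binned)):
        counts = np.bincount(y_binned[val_idx], minlength=4)
        bin_counts[name].append(counts)
        mean_target = y_continuous[val_idx].mean()
        print(f"Fold {i+1}: samples per quartile = {counts.tolist()}, mean target = {mean_target:.1f}")
```

With stratification the per-quartile counts should be nearly identical in every fold, while plain KFold lets them vary. More bins give finer control over the target distribution but need enough samples per bin to fill every fold.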