Splitting Data: Train, Validation, and Test Sets
Splitting data into distinct training, validation, and test sets is fundamental to honest model evaluation. If test data influences training or hyperparameter tuning in any way, the resulting performance estimate is optimistically biased and no longer reflects how the model behaves on unseen data. Each split plays a distinct role.
The Role of Each Split
The training set is the data the model fits its weights or parameters on. The validation set is used to tune hyperparameters, select features, and compare architectures. The test set stays locked away until the final evaluation; ideally you touch it only once.
Splitting with train_test_split
<pre><code class="language-python">from sklearn.model_selection import train_test_split

# X, y: feature matrix and class labels, assumed to be loaded already.

# First split: 80% train+val, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 75% train, 25% val of the remaining 80% (60/20/20 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25,
    random_state=42, stratify=y_trainval
)

# Sanity check: sizes should be roughly 60% / 20% / 20% of the data
print(len(X_train), len(X_val), len(X_test))</code></pre>
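To make the role of each split concrete, here is a minimal sketch of how a 60/20/20 split like the one above might be used end to end. The synthetic data from `make_classification`, the `candidate_C` grid, and the choice of accuracy as the metric are all assumptions for illustration. Refitting on train+val before the final test evaluation is a common option, not something the split itself requires.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute your own X, y).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Same 60/20/20 split as in the section above.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25,
    random_state=42, stratify=y_trainval
)

# Tuning: fit each candidate on the training set, score it on the validation set.
candidate_C = [0.01, 0.1, 1.0, 10.0]  # hypothetical hyperparameter grid
val_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    val_scores[C] = accuracy_score(y_val, model.predict(X_val))

best_C = max(val_scores, key=val_scores.get)

# Final evaluation: refit on train+val with the chosen setting,
# then use the test set exactly once.
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_trainval, y_trainval)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))

print(f"best C={best_C}, test accuracy={test_accuracy:.3f}")
```

The test set appears only in the last three lines. Every decision before that point, including the choice of `C`, relies on the training and validation sets alone.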
Stratified Splitting and K-Fold Cross-Validation
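For classification with imbalanced classes, stratify=y ensures each split preserves the class proportions of the data being split, as closely as the sample counts allow. For small datasets, K-Fold cross-validation can replace a fixed validation set: the training data is divided into K folds, and each fold serves once as the validation set while the model trains on the remaining K-1 folds. Every training sample is therefore used for both fitting and validation, while the test set remains held out.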
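The effect of stratification is easiest to see on deliberately imbalanced labels. The sketch below assumes a synthetic 90/10 label vector and placeholder features, both invented for illustration, and compares the minority-class share with and without `stratify=y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (an assumption): 90% class 0, 10% class 1.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Without stratification, the minority share in the test set depends on the shuffle.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# With stratify=y, the 10% minority share is preserved in both splits.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("plain test minority share:      ", y_test_plain.mean())
print("stratified train minority share:", y_train_strat.mean())
print("stratified test minority share: ", y_test_strat.mean())
```

With rarer classes or smaller datasets, an unstratified split can leave a split with very few minority samples, or none at all, which makes metrics such as macro F1 unstable.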
K-Fold Cross-Validation
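<pre><code class="language-python">from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# X_train, y_train: the training portion from the split above;
# the test set stays out of cross-validation entirely.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# max_iter raised from the default of 100 to reduce convergence warnings
model = LogisticRegression(max_iter=1000)

# One macro-averaged F1 score per fold
scores = cross_val_score(model, X_train, y_train,
                         cv=kf, scoring="f1_macro")
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")</code></pre>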
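To see what `cross_val_score` iterates over, it can help to walk through the folds by hand. This is a minimal sketch on toy labels (20 samples, two balanced classes, an assumption for illustration). It checks that the training and validation indices never overlap within a fold and counts how often each sample lands in a validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 20 samples, two balanced classes.
y = np.array([0] * 10 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values are not used to build the folds

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

val_counts = np.zeros(len(y), dtype=int)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Training and validation indices never overlap within a fold.
    assert set(train_idx).isdisjoint(val_idx)
    val_counts[val_idx] += 1
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"val class counts={np.bincount(y[val_idx])}")

# Number of times each sample was used for validation across all folds.
print(val_counts)
```

Each sample is validated exactly once and used for training in the other four folds, which is what lets a small dataset do double duty without a fixed validation set.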