The K-Fold Algorithm
The dataset is partitioned into k non-overlapping subsets (folds) of roughly equal size. In each iteration, one fold is held out for validation and the remaining k-1 folds form the training set. This repeats k times, so every sample is used for validation exactly once and for training k-1 times.
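To make the partitioning concrete, here is a minimal NumPy-only sketch. It is not scikit-learn's implementation; the helper name `kfold_indices` and its parameters are assumptions introduced purely for illustration.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k non-overlapping folds.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle once up front
    folds = np.array_split(indices, k)     # k folds of near-equal size
    for i in range(k):
        val_idx = folds[i]
        # Training set is every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# Count how often each sample lands in a validation fold.
val_counts = np.zeros(10, dtype=int)
for train_idx, val_idx in kfold_indices(n_samples=10, k=3):
    val_counts[val_idx] += 1
    print(len(train_idx), len(val_idx))
print(val_counts)
```

Every entry of `val_counts` ends up equal to 1, which is the "validated exactly once" property described above.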
Manual KFold Loop
<pre><code class="language-python">from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Fit a fresh model on each fold so no state leaks between folds
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    scores.append(acc)
    print(f"Fold {fold+1}: {acc:.4f}")
print(f"\nMean: {np.mean(scores):.4f} Std: {np.std(scores):.4f}")</code></pre>
Using cross_val_score for Brevity
<pre><code class="language-python">from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    LogisticRegression(max_iter=200),
    X, y,
    cv=5,
    scoring="accuracy"
)
print(scores) # per-fold scores
print(scores.mean()) # overall estimate</pre>
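One detail worth knowing: when `cv` is an integer and the estimator is a classifier, scikit-learn uses `StratifiedKFold` without shuffling, so `cv=5` does not reproduce the shuffled `KFold` splits from the manual loop. The sketch below passes the splitter object explicitly, assuming the same iris data and model settings as above; the variable names `scores_kfold` and `scores_default` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Same splitter configuration as the manual loop.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores_kfold = cross_val_score(
    LogisticRegression(max_iter=200), X, y, cv=kf, scoring="accuracy"
)
# cv=5 on a classifier is shorthand for an unshuffled StratifiedKFold(5).
scores_default = cross_val_score(
    LogisticRegression(max_iter=200), X, y, cv=5, scoring="accuracy"
)

print("KFold(shuffle=True):", scores_kfold.round(4))
print("cv=5 (stratified):  ", scores_default.round(4))
```

Passing a splitter object is also the way to control shuffling and the random seed, which the integer shorthand does not expose.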
Choosing k
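The most common choices are k=5 and k=10. Larger k trains each model on more data (a fraction (k-1)/k of the dataset), which reduces the pessimistic bias of the estimate, but it requires k model fits and can increase the variance of the estimate because the training sets overlap heavily. Smaller k is cheaper but trains each model on less data, so the estimate tends to understate the performance of a model trained on the full dataset.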
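The trade-off can be explored empirically. The sketch below compares a few values of k on the iris data. The specific values (3, 5, 10) and the use of `StratifiedKFold` with a fixed seed are choices made for this illustration, and results on a 150-sample dataset should not be generalized.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

summary = {}
for k in (3, 5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=cv)
    # Each model trains on (k-1)/k of the data; k models are fit in total.
    summary[k] = (scores.mean(), scores.std())
    print(f"k={k:2d}  train fraction={(k - 1) / k:.2f}  "
          f"mean={scores.mean():.4f}  std={scores.std():.4f}")
```

The spread across folds is only a rough proxy for the variance of the estimate, since the folds share most of their training data.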
Leave-One-Out Cross-Validation (LOOCV)
<pre><code class="language-python">from sklearn.model_selection import LeaveOneOut, cross_val_score
# LOOCV: k = n (one sample held out each time)
# Nearly unbiased, but the estimate can have high variance and it needs n model fits
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=loo)
print(f"LOOCV Accuracy: {scores.mean():.4f}")</code></pre>
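LOOCV is k-fold with k equal to the number of samples. A quick check of that equivalence follows, using a small toy array (an assumption for illustration) so the splits are easy to inspect:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(12).reshape(6, 2)  # 6 samples, 2 features (toy data)

loo_splits = [
    (tr.tolist(), va.tolist()) for tr, va in LeaveOneOut().split(X_toy)
]
kf_splits = [
    (tr.tolist(), va.tolist())
    for tr, va in KFold(n_splits=len(X_toy)).split(X_toy)
]

print(len(loo_splits))          # number of model fits LOOCV would require
print(loo_splits == kf_splits)  # compares the two sets of splits
```

The number of splits equals the number of samples, which is why the cost grows linearly with dataset size.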
Cross-Validation with Multiple Metrics
Use cross_validate when you need to track more than one metric simultaneously or want training scores for diagnosing bias/variance.
cross_validate Example
<pre><code class="language-python">from sklearn.model_selection import cross_validate
results = cross_validate(
    LogisticRegression(max_iter=200), X, y,
    cv=5,
    scoring=["accuracy", "roc_auc_ovr"],
    return_train_score=True
)
print("Val Accuracy:", results["test_accuracy"].mean())
print("Train Accuracy:", results["train_accuracy"].mean())</pre>
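With `return_train_score=True`, the per-fold gap between training and validation accuracy can be inspected directly. A minimal sketch follows, repeating the setup so it runs on its own; the variable name `gap` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

results = cross_validate(
    LogisticRegression(max_iter=200), X, y,
    cv=5,
    scoring=["accuracy", "roc_auc_ovr"],
    return_train_score=True,
)

# Result keys follow the pattern "test_<metric>" and "train_<metric>".
gap = results["train_accuracy"] - results["test_accuracy"]
for fold, g in enumerate(gap, start=1):
    print(f"Fold {fold}: train - val accuracy = {g:+.4f}")
print("Val ROC AUC (OvR):", results["test_roc_auc_ovr"].mean())
```

A consistently large positive gap suggests overfitting, while low scores on both training and validation folds suggest underfitting.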