K-Fold Cross-Validation Implementation

K-Fold cross-validation splits the data into k folds of roughly equal size, trains on k-1 of them, and validates on the remaining fold. Cycling through all folds produces k performance estimates, and their mean gives a more reliable generalisation score than a single train/validation split.


The K-Fold Algorithm

The dataset is partitioned into k non-overlapping subsets. For each iteration one fold is held out for validation; the other k-1 folds form the training set. This repeats k times so every sample is used for validation exactly once.
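The partitioning step can be sketched without scikit-learn. The following is a minimal illustration, assuming only NumPy is available. The helper name `kfold_indices` and its parameters are hypothetical and not part of any library.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=None):
    """Yield (train_idx, val_idx) pairs for k non-overlapping folds.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    indices = np.arange(n_samples)
    if seed is not None:
        # Shuffle once up front so folds are not tied to the row order.
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    # array_split tolerates n_samples not divisible by k:
    # fold sizes then differ by at most one.
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Ten samples, three folds: every sample lands in exactly one validation fold.
splits = list(kfold_indices(10, 3, seed=0))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

When the sample count is not divisible by k, the folds cannot be exactly equal, which is why the sketch uses `np.array_split` rather than a fixed fold size.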

Manual KFold Loop

<pre><code class="language-python">from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    acc = accuracy_score(y_val, model.predict(X_val))
    scores.append(acc)
    print(f"Fold {fold+1}: {acc:.4f}")

print(f"\nMean: {np.mean(scores):.4f}  Std: {np.std(scores):.4f}")</code></pre>

Using cross_val_score for Brevity

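<pre><code class="language-python">from sklearn.model_selection import cross_val_score

# Reuses X, y and LogisticRegression from the manual loop above.
# Note: for a classifier, an integer cv uses stratified folds without
# shuffling, so the scores can differ from the shuffled KFold loop.
scores = cross_val_score(
    LogisticRegression(max_iter=200),
    X, y,
    cv=5,
    scoring="accuracy"
)
print(scores)         # per-fold scores
print(scores.mean())  # overall estimate</code></pre>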

Choosing k

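The most common choices are k=5 and k=10. Larger k reduces bias, because each model trains on a larger share of the data, but it raises compute cost (k model fits) and can increase the variance of the estimate. Smaller k is faster, but each model sees less training data, so the estimate tends to be pessimistically biased.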
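One way to get a feel for this trade-off is to run the same model under several values of k and compare the spread of the fold scores. The sketch below assumes the Iris data and logistic regression used elsewhere in this section. The candidate values 3, 5 and 10 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

summary = {}
for k in (3, 5, 10):
    # Same seed for every k so only the number of folds changes.
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=cv)
    summary[k] = (scores.mean(), scores.std())
    # Each value of k costs k model fits.
    print(f"k={k:2d}  mean={scores.mean():.4f}  std={scores.std():.4f}")
```

On a dataset this small the differences may be slight. The comparison is more informative on larger or noisier data.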

Leave-One-Out Cross-Validation (LOOCV)

<pre><code class="language-python">from sklearn.model_selection import LeaveOneOut, cross_val_score

# Reuses X, y and LogisticRegression from the manual loop above.
# LOOCV: k = n (one sample held out each time)
# Very low bias but high variance, and expensive for large n (n model fits)
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=loo)
print(f"LOOCV Accuracy: {scores.mean():.4f}")</code></pre>
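To make the "k = n" relationship concrete, the following sketch checks that `LeaveOneOut` yields the same splits as an unshuffled `KFold` with one fold per sample. It assumes the Iris data used above. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)
n = len(X)

loo_splits = list(LeaveOneOut().split(X))
kf_splits = list(KFold(n_splits=n).split(X))  # no shuffling

# Compare train and validation indices split by split.
same = all(
    np.array_equal(a_train, b_train) and np.array_equal(a_val, b_val)
    for (a_train, a_val), (b_train, b_val) in zip(loo_splits, kf_splits)
)
print(len(loo_splits), same)
```

Because each validation set holds a single sample, every per-fold accuracy under LOOCV is either 0 or 1, so only the mean across all n folds is meaningful.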

Cross-Validation with Multiple Metrics

Use cross_validate when you need to track more than one metric simultaneously or want training scores for diagnosing bias/variance.

cross_validate Example

<pre><code class="language-python">from sklearn.model_selection import cross_validate

# Reuses X, y and LogisticRegression from the manual loop above.
results = cross_validate(
    LogisticRegression(max_iter=200),
    X, y,
    cv=5,
    scoring=["accuracy", "roc_auc_ovr"],
    return_train_score=True
)
print("Val Accuracy:", results["test_accuracy"].mean())
print("Train Accuracy:", results["train_accuracy"].mean())</code></pre>
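The returned dictionary holds more than the two entries printed above. As a rough sketch, assuming the same Iris data and scorers, the loop below reads the train and validation score for each metric and reports the gap between them, alongside the recorded fit time.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

results = cross_validate(
    LogisticRegression(max_iter=200),
    X, y,
    cv=5,
    scoring=["accuracy", "roc_auc_ovr"],
    return_train_score=True,
)

# With a list of scorers, keys follow the pattern test_<name> / train_<name>.
for name in ("accuracy", "roc_auc_ovr"):
    train = results[f"train_{name}"].mean()
    val = results[f"test_{name}"].mean()
    # A large positive gap suggests overfitting; low scores on both
    # sides suggest underfitting.
    print(f"{name}: train={train:.4f}  val={val:.4f}  gap={train - val:.4f}")

print("Mean fit time (s):", results["fit_time"].mean())
```

What counts as a "large" gap depends on the problem and the metric, so treat it as a diagnostic hint rather than a fixed threshold.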