Choosing the Right K in KNN

The single most important hyperparameter in KNN is K — too small and the model overfits to noise; too large and it smooths away the decision boundary entirely.


K and the Bias-Variance Tradeoff

A small K (e.g., K=1) creates a very flexible, jagged decision boundary that fits noise — high variance. A large K averages over many neighbours, producing a smoother boundary — high bias. Optimal K sits between these extremes.
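The tradeoff can be seen by comparing training and test accuracy at a few values of K. The sketch below is illustrative only: the synthetic `make_moons` data, the noise level, the split, and the three K values are assumptions chosen for the demonstration, not part of any prescribed recipe.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with deliberate noise (an assumption for illustration).
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Map each K to a (training accuracy, test accuracy) pair.
results = {}
for k in (1, 15, 301):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"K={k:>3}: train={results[k][0]:.3f}, test={results[k][1]:.3f}")
```

A large gap between training and test accuracy points to high variance, while low accuracy on both points to high bias. The exact numbers depend on the data and the split.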

K=1: Perfect Training Accuracy, Poor Generalisation

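With K=1 the model always predicts the label of the single nearest training point. Because each training point is its own nearest neighbour, training accuracy is 100% (barring identical points with conflicting labels). But the model memorises every noise artefact in the data, leading to poor performance on unseen examples.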
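A tiny hand-made example makes the effect concrete. The 1-D data below is hypothetical: ten cleanly separated points plus one deliberately mislabelled point at x=2.5. It is a sketch of the failure mode, not a benchmark.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: class 0 on the left, class 1 on the right,
# plus one mislabelled point at x=2.5 sitting inside class-0 territory.
X = np.array([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 2.5]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X, y)

train_acc_k1 = knn1.score(X, y)        # every point is its own nearest neighbour
query = np.array([[2.4]])              # unseen point next to the noisy sample
pred_k1 = int(knn1.predict(query)[0])  # follows the mislabelled neighbour
pred_k5 = int(knn5.predict(query)[0])  # outvoted by four class-0 neighbours

print(f"K=1 training accuracy: {train_acc_k1:.2f}")
print(f"Prediction at x=2.4 -> K=1: {pred_k1}, K=5: {pred_k5}")
```

With K=1 the query inherits the noisy label, whereas K=5 lets the surrounding class-0 points outvote it.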

Selecting K via Cross-Validation

Systematically evaluate a range of K values using cross-validation and select the K with the best validation score.
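One way to do this by hand is a plain loop around `cross_val_score`. The sketch below assumes the Iris dataset as a stand-in, 5-fold cross-validation, and a candidate range of 1 to 30; all three are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

k_values = range(1, 31)
mean_scores = []
for k in k_values:
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores.append(scores.mean())

# np.argmax returns the first maximum, so ties resolve to the smallest K.
best_k = k_values[int(np.argmax(mean_scores))]
print(f"Best K: {best_k}, mean CV accuracy: {max(mean_scores):.4f}")
```

Keeping the scaler inside the pipeline matters for KNN, since distances are sensitive to feature scale and fitting the scaler on all the data would leak information from the validation folds.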

Grid Search for K

<pre><code class="language-python">from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features inside the pipeline so scaling is refit on each CV fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31))}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(f"Best K: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
</code></pre>

Practical Guidelines

A common starting heuristic is K = √n, where n is the training set size. For binary classification, prefer an odd K to avoid tie votes. Cross-validation, run on the training data with the test set held back, should be the final arbiter.
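The square-root heuristic and the odd-K preference can be combined in a small helper. The function name `sqrt_k_heuristic` and the choice to round to the nearest integer and then bump even values up by one are assumptions made for this sketch. The result is only a starting point for a cross-validated search.

```python
import math


def sqrt_k_heuristic(n_samples: int) -> int:
    """Return an odd K close to sqrt(n_samples) as a starting guess."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    k = max(1, round(math.sqrt(n_samples)))
    if k % 2 == 0:
        k += 1  # bump even values to the next odd number
    return k


for n in (9, 100, 150):
    print(f"n={n}: starting K = {sqrt_k_heuristic(n)}")
```

A reasonable use is to centre the cross-validation grid on this value rather than to adopt it directly.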