K-Nearest Neighbors (KNN) for Classification

K-Nearest Neighbors (KNN) is a simple, non-parametric classifier that makes predictions purely based on the labels of the most similar training examples.


How KNN Works

To classify a new point, KNN computes its distance to every training point, identifies the K nearest ones, and assigns the majority class among them. There is no explicit training step: the algorithm stores the entire training set and defers all computation to prediction time.
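The procedure above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict` and the toy data are assumptions made for the example, Euclidean distance is assumed, and ties in the vote are broken arbitrarily.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, [1.1, 1.0], k=3)
pred_b = knn_predict(X_train, y_train, [5.1, 5.0], k=3)
print(pred_a, pred_b)
```

Note that all the work happens inside the prediction call, which is why prediction cost grows with the size of the training set.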

KNN in scikit-learn

<pre><code class="language-python">from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the data and hold out 30% for testing
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features, then classify with the 5 nearest neighbours
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)

print(f"KNN Accuracy: {knn.score(X_te, y_te):.4f}")</code></pre>

Distance Metrics and Feature Scaling

KNN is inherently distance-based, so both the choice of distance metric and the scaling of features are critical to its performance.

Why Scaling Matters

Features with large numeric ranges dominate distance calculations. For example, if one feature ranges over 0–1000 and another over 0–1, the first feature almost entirely determines the nearest neighbours. Scale features (e.g., with StandardScaler) before using KNN, and fit the scaler on the training data only so that test-set statistics do not leak into the model.
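The effect can be seen on a small made-up dataset. The sketch below is only an illustration: the four rows and the helper `nearest_other` are hypothetical, and the scaler is fitted on all rows purely to keep the example short.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 spans roughly 0-1000, column 1 spans 0-1
X = np.array([
    [100.0, 0.10],
    [110.0, 0.90],
    [400.0, 0.12],
    [900.0, 0.50],
])


def nearest_other(X, idx=0):
    """Index of the row closest (Euclidean) to row idx, excluding idx itself."""
    d = np.linalg.norm(X - X[idx], axis=1)
    d[idx] = np.inf  # never pick the query row itself
    return int(np.argmin(d))


raw_nn = nearest_other(X)
scaled_nn = nearest_other(StandardScaler().fit_transform(X))
print(raw_nn, scaled_nn)
```

On the raw data the nearest neighbour of row 0 is row 1, even though the two differ across almost the whole range of the second feature. After standardisation both features contribute on a comparable scale, and row 2 becomes the nearest neighbour instead.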

Common Distance Metrics

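Euclidean distance (L2) is the default and works well for continuous features. Manhattan distance (L1) is less sensitive to outliers because it sums absolute differences rather than squared ones. Minkowski distance generalises both through its order parameter p: p = 1 gives Manhattan and p = 2 gives Euclidean. For categorical or binary features, Hamming distance, which measures how many positions differ between two vectors, is more appropriate.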
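A short sketch makes the relationship between these metrics concrete. The helper functions `minkowski` and `hamming` are written here for illustration only (the Hamming variant returns the fraction of differing positions, which is one common convention); the last two lines show how a metric might be selected in scikit-learn through the `metric` and `p` parameters of `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def minkowski(a, b, p):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def hamming(a, b):
    """Fraction of positions at which two equal-length vectors differ."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))


a, b = [0.0, 0.0], [3.0, 4.0]
manhattan_ab = minkowski(a, b, p=1)  # |3| + |4| = 7
euclidean_ab = minkowski(a, b, p=2)  # sqrt(9 + 16) = 5
hamming_ab = hamming([1, 0, 1, 1], [1, 1, 0, 1])  # 2 of 4 positions differ

# Selecting the metric in scikit-learn via constructor parameters
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
```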