K-Means Clustering Algorithm

K-Means partitions N samples into K clusters by iteratively assigning each sample to its nearest centroid and updating centroids to minimize total within-cluster squared distance.


The K-Means Algorithm

K-Means minimizes the inertia (within-cluster sum of squares): J = \u03a3_k \u03a3_{x \u2208 C_k} ||x - \u03bc_k||\u00b2, where \u03bc_k is the centroid of cluster k.

Algorithm Steps

  1. Initialize: Place K centroids (randomly or via K-Means++).
  2. Assign: Label each point with the nearest centroid.
  3. Update: Recompute each centroid as the mean of its assigned points.
  4. Repeat steps 2–3 until assignments stop changing or a max iteration limit is reached.

K-Means in scikit-learn

scikit-learn's KMeans runs the algorithm multiple times with different initializations (n_init) and returns the best result by inertia.

Fitting and Inspecting Clusters

<pre><code class="language-python">from sklearn.cluster import KMeans from sklearn.datasets import make_blobs import matplotlib.pyplot as plt X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42) km = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42) km.fit(X) print(f"Inertia: {km.inertia_:.2f}") print(f"Cluster Centers:\n{km.cluster_centers_}") plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap='tab10', alpha=0.6) plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker='X', s=200, c='red', label='Centroids') plt.legend(); plt.title('K-Means Clustering'); plt.show()</pre>

Key Parameters

  • n_clusters: Number of clusters (must be specified).
  • init: 'k-means++' (default) for smart initialization; 'random' for random.
  • n_init: Number of restarts; higher is more reliable but slower.
  • max_iter: Maximum iterations per run (default 300).

Limitations of K-Means

K-Means assumes spherical, equally-sized clusters and is sensitive to outliers and the initial centroid placement. It also requires K to be specified in advance.

When K-Means Fails

K-Means struggles with non-convex cluster shapes (use DBSCAN), clusters of very different sizes (use GMM), and high-dimensional sparse data. Feature scaling and dimensionality reduction (PCA) often help before applying K-Means.