K-Means Clustering Algorithm
K-Means partitions N samples into K clusters by iteratively assigning each sample to its nearest centroid and updating centroids to minimize total within-cluster squared distance.
The K-Means Algorithm
K-Means minimizes the inertia (within-cluster sum of squares): J = \u03a3_k \u03a3_{x \u2208 C_k} ||x - \u03bc_k||\u00b2, where \u03bc_k is the centroid of cluster k.
Algorithm Steps
- Initialize: Place K centroids (randomly or via K-Means++).
- Assign: Label each point with the nearest centroid.
- Update: Recompute each centroid as the mean of its assigned points.
- Repeat steps 2–3 until assignments stop changing or a max iteration limit is reached.
K-Means in scikit-learn
scikit-learn's KMeans runs the algorithm multiple times with different initializations (n_init) and returns the best result by inertia.
Fitting and Inspecting Clusters
Key Parameters
n_clusters: Number of clusters (must be specified).init:'k-means++'(default) for smart initialization;'random'for random.n_init: Number of restarts; higher is more reliable but slower.max_iter: Maximum iterations per run (default 300).
Limitations of K-Means
K-Means assumes spherical, equally-sized clusters and is sensitive to outliers and the initial centroid placement. It also requires K to be specified in advance.
When K-Means Fails
K-Means struggles with non-convex cluster shapes (use DBSCAN), clusters of very different sizes (use GMM), and high-dimensional sparse data. Feature scaling and dimensionality reduction (PCA) often help before applying K-Means.