Centroid Initialization (K-Means++)
K-Means++ is a smart initialization strategy that spreads initial centroids far apart, dramatically reducing convergence time and avoiding poor local minima.
The Problem with Random Initialization
Random centroid initialization can place multiple centroids in the same cluster region, causing K-Means to converge to poor local optima with high inertia. This makes results inconsistent across runs.
Sensitivity to Initialization
K-Means++ Algorithm
K-Means++ selects the first centroid randomly, then selects each subsequent centroid with probability proportional to its squared distance from the nearest already-chosen centroid. This ensures centroids are spread across different regions of the data.
Step-by-Step Initialization
- Choose the first centroid uniformly at random from the data points.
- For each point
x, computeD(x)\u00b2= squared distance to the nearest chosen centroid. - Select the next centroid with probability proportional to
D(x)\u00b2. - Repeat until all K centroids are chosen.
- Proceed with standard K-Means iterations.
Theoretical Guarantee
K-Means++ is proven to give an expected inertia within O(log K) of the optimal solution before any iterations begin. In practice, it converges in fewer iterations and to significantly better solutions than random initialization.
Practical Impact
K-Means++ is the default in scikit-learn (init='k-means++') because it consistently outperforms random initialization with negligible extra computation.
n_init and Robustness
Even with K-Means++, running multiple restarts (n_init=10 or higher) and keeping the best result by inertia is recommended. sklearn defaults to n_init=10; for large datasets, n_init='auto' (sklearn >=1.2) adapts this automatically.