Agglomerative vs. Divisive Clustering
Hierarchical clustering can proceed bottom-up (agglomerative) by merging individual points, or top-down (divisive) by recursively splitting the full dataset into smaller clusters.
Agglomerative Clustering (Bottom-Up)
Starting from N singleton clusters, agglomerative clustering greedily merges the two most similar clusters at each step until a single cluster remains. The merge sequence defines the dendrogram.
Advantages and Disadvantages
- Pros: Computationally tractable (O(N\u00b3) naive, better with optimizations), widely supported in libraries, effective for small-to-medium datasets.
- Cons: Early merges are irreversible; cannot undo poor decisions made at the start. Sensitive to noise and outliers depending on linkage.
Divisive Clustering (Top-Down)
Starting from one cluster containing all points, divisive clustering recursively splits clusters into two subclusters. DIANA (Divisive Analysis) is the classical algorithm.
Advantages and Disadvantages
- Pros: Considers the global structure at each split; can produce better top-level clusters. Early splits can be more meaningful if the global structure is important.
- Cons: O(2^N) naive complexity; much more computationally expensive. Less commonly implemented in mainstream ML libraries.
Bisecting K-Means: A Practical Divisive Approach
Practical Choice
In practice, agglomerative clustering (especially Ward linkage) is preferred for most tasks due to library support and reasonable performance. Divisive methods are used when top-level cluster structure is the primary interest.
When to Use Each
Use agglomerative when you want a full hierarchy and have < 10,000 samples. Use divisive (Bisecting K-Means) when you have a large dataset and want scalable hierarchical clustering that avoids the O(N\u00b2) distance matrix. For very large datasets, mini-batch K-Means or BIRCH may be more appropriate than either.