Hierarchical Clustering Fundamentals

Hierarchical clustering builds a multi-level tree of clusters (dendrogram) that reveals data structure at multiple granularities without requiring a predetermined number of clusters.


How Hierarchical Clustering Works

At each step, the two most similar clusters (or points) are merged (agglomerative) or the least-cohesive cluster is split (divisive), building a full hierarchy from N individual points to one cluster.

Linkage Criteria

  • Single linkage: Distance between nearest points in two clusters. Sensitive to outliers.
  • Complete linkage: Distance between farthest points. Produces compact clusters.
  • Average linkage: Average pairwise distance. Balanced trade-off.
  • Ward linkage: Minimizes total within-cluster variance increase. Usually best for compact, equal-size clusters.

AgglomerativeClustering in sklearn

scikit-learn implements bottom-up agglomerative clustering efficiently, supporting different linkage criteria and affinity metrics.

Fitting and Predicting

<pre><code class="language-python">from sklearn.cluster import AgglomerativeClustering from sklearn.datasets import make_blobs from sklearn.metrics import silhouette_score X, _ = make_blobs(n_samples=300, centers=4, random_state=42) agg = AgglomerativeClustering( n_clusters=4, linkage='ward', metric='euclidean' ) labels = agg.fit_predict(X) print(f"Silhouette Score: {silhouette_score(X, labels):.3f}") print(f"Cluster sizes: {[sum(labels==c) for c in range(4)]}")</pre>

Choosing Without Specifying n_clusters

Set n_clusters=None and use distance_threshold to cut the dendrogram at a specific height. This is useful when you want to explore the hierarchy without committing to a fixed number of clusters.

Time and Space Complexity

Naive agglomerative clustering has O(N\u00b3) time complexity and O(N\u00b2) space due to the distance matrix. Ward linkage with scikit-learn's connectivity constraints can scale to larger datasets.

Connectivity Constraints

<pre><code class="language-python">from sklearn.neighbors import kneighbors_graph # Build a sparse connectivity graph for scalability connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False) agg_conn = AgglomerativeClustering(n_clusters=4, connectivity=connectivity, linkage='ward') labels_conn = agg_conn.fit_predict(X) print(f"Connectivity Silhouette: {silhouette_score(X, labels_conn):.3f}")</pre>