Interpreting Dendrograms
A dendrogram is a tree diagram where the height of each merge reflects the distance (dissimilarity) between the merging clusters, enabling visual selection of the number of clusters.
Reading a Dendrogram
Each leaf represents a data point. Moving up the tree, points merge into clusters. The y-axis (linkage height) represents the distance at which clusters were merged — higher merges indicate greater dissimilarity.
Plotting a Dendrogram with scipy
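A minimal sketch of plotting with scipy's `linkage` and `dendrogram` functions. The synthetic three-blob data, the Ward linkage method, and the variable names are assumptions chosen for illustration, not requirements.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; remove for interactive use
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic data (assumed for illustration): three 2-D blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])

# Ward linkage on Euclidean distances; Z has one row per merge (n - 1 rows),
# and column 2 of Z holds the merge height.
Z = linkage(X, method="ward")

fig, ax = plt.subplots(figsize=(8, 4))
tree = dendrogram(Z, ax=ax)  # draws on ax and returns the layout as a dict
ax.set_xlabel("Sample index")
ax.set_ylabel("Linkage height (merge distance)")
fig.tight_layout()
# Call fig.savefig("dendrogram.png") or plt.show() to view the figure.
```

Other linkage methods such as "single", "complete", or "average" can be passed to `method` and will generally produce different merge heights for the same data.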
Cutting the Dendrogram
Draw a horizontal line across the dendrogram at a chosen height; the number of vertical lines it crosses equals the number of clusters. The longest vertical segments that are not interrupted by a merge mark the most natural places to cut, because the clusters beneath them stay unmerged over a wide range of distances.
Automatic Cut with fcluster
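The sketch below shows two common ways to cut with scipy's `fcluster`: by height (`criterion="distance"`) and by a target number of clusters (`criterion="maxclust"`). The data, the Ward linkage, and the threshold of 5.0 are assumptions that suit this synthetic example only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data (assumed for illustration): three 2-D blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])
Z = linkage(X, method="ward")

# Cut by height: equivalent to drawing a horizontal line at t and keeping
# every cluster formed below it. t=5.0 is an assumed value for this data.
labels_by_height = fcluster(Z, t=5.0, criterion="distance")

# Cut by count: ask for at most 3 flat clusters and let scipy find the height.
labels_by_count = fcluster(Z, t=3, criterion="maxclust")

print(len(set(labels_by_height)), len(set(labels_by_count)))
```

Both calls return one integer label per data point, with labels starting at 1.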
Identifying the Best Cut
Look for the largest gap in linkage heights, that is, the longest vertical distance between consecutive merges. A large gap means the merging distance jumped sharply, which suggests that the clusters existing just below the gap are well separated from one another relative to the spread within each of them. Cutting anywhere inside that gap yields the same set of clusters.
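The largest-gap heuristic can be sketched directly from the linkage matrix, whose third column holds the merge heights. The data and the midpoint cut are assumptions for illustration; the heuristic is only a starting point, since the biggest jump often sits at the very top of the tree and would then suggest just two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data (assumed for illustration): three 2-D blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])
Z = linkage(X, method="ward")

heights = Z[:, 2]             # merge heights, non-decreasing for Ward linkage
gaps = np.diff(heights)       # jump between each pair of consecutive merges
i = int(np.argmax(gaps))      # largest jump lies between merge i and merge i + 1

# Cut halfway through the largest gap (an assumed choice; any height inside
# the gap gives the same flat clustering).
cut_height = (heights[i] + heights[i + 1]) / 2

# After merges 0..i there are n - (i + 1) clusters, and n - 1 == len(heights).
n_clusters = len(heights) - i

labels = fcluster(Z, t=cut_height, criterion="distance")
```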
Dendrogram Limitations
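For large datasets (more than roughly 1,000 points), full dendrograms become unreadable. In scipy's dendrogram function, pass truncate_mode='lastp' together with p to show only the last p merged clusters, with everything below them collapsed into single leaves. Alternatively, apply hierarchical clustering to cluster representatives.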
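A short sketch of truncation with scipy's `dendrogram`. The random data and `p=12` are assumptions for illustration. `no_plot=True` is used here so that only the layout is computed; omitting it (and optionally passing `ax=`) draws the truncated tree with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic data (assumed for illustration): 2,000 random 2-D points.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2))
Z = linkage(X, method="ward")

# Keep only the top of the tree: the final merges stay visible and all
# earlier merges are collapsed into p leaves.
tree = dendrogram(Z, truncate_mode="lastp", p=12, no_plot=True)

# Collapsed leaves are labelled with the number of points they contain,
# in parentheses; uncollapsed leaves keep their original index.
print(tree["ivl"])
```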
Scalable Dendrogram Interpretation
For large datasets, first reduce to cluster representatives using K-Means (e.g., 50–100 centroids), then apply hierarchical clustering to the centroids and visualize the resulting dendrogram. This keeps the dendrogram readable while scaling to millions of points; each leaf then stands for a centroid rather than an individual data point.
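The two-stage approach can be sketched as follows. This version uses scipy's `kmeans2` for the K-Means step to stay within one library; any K-Means implementation would serve. The data, `k = 50`, the Ward linkage, and the final cut into three clusters are all assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.cluster.vq import kmeans2

# Synthetic data (assumed for illustration): three 2-D blobs of 5,000 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X = np.vstack([c + rng.standard_normal((5000, 2)) for c in centers])

# Stage 1: compress the data to k centroids with K-Means.
k = 50
centroids, point_to_centroid = kmeans2(X, k, minit="++", seed=0)

# Stage 2: hierarchical clustering on the centroids only (k leaves, not 15,000).
Z = linkage(centroids, method="ward")
tree = dendrogram(Z, no_plot=True)  # drop no_plot to draw with matplotlib

# Cut the centroid tree, then propagate each centroid's label to its points.
centroid_labels = fcluster(Z, t=3, criterion="maxclust")
point_labels = centroid_labels[point_to_centroid]
```

One design caveat: the linkage step here treats every centroid equally, regardless of how many points it represents, so merge heights are not identical to those of a full hierarchical clustering on the raw data.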