t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is a nonlinear dimensionality reduction technique that preserves local neighborhood structure in low-dimensional visualizations, often revealing clusters and patterns that a linear projection such as PCA does not show.
How t-SNE Works
t-SNE models pairwise similarities in high dimensions as Gaussian probabilities and in low dimensions as Student-t probabilities, then minimizes the KL divergence between the two distributions via gradient descent.
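The objective described above can be sketched in a few lines of NumPy. This is an illustration only, not scikit-learn's implementation: it assumes a single shared Gaussian bandwidth (the hypothetical sigma argument), whereas t-SNE chooses a separate bandwidth per point to match the requested perplexity, and it only evaluates the loss for a random layout instead of optimizing it. All function names are made up for this example.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def gaussian_affinities(X, sigma=1.0):
    """Joint probabilities p_ij in the high-dimensional space.

    Simplification: one shared sigma. Real t-SNE tunes sigma per point
    so that each conditional distribution has the requested perplexity.
    """
    W = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)                      # a point is not its own neighbor
    cond = W / W.sum(axis=1, keepdims=True)       # conditional p_{j|i}
    return (cond + cond.T) / (2.0 * X.shape[0])   # symmetrized p_ij

def student_t_affinities(Y):
    """Joint probabilities q_ij in the low-dimensional map (Student-t, 1 dof)."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity t-SNE minimizes over the map coordinates."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # toy high-dimensional data
Y = rng.normal(size=(50, 2))    # random, unoptimized 2-D layout

P = gaussian_affinities(X)
Q = student_t_affinities(Y)
loss = kl_divergence(P, Q)
print(f"KL divergence of the random layout: {loss:.3f}")
```

Gradient descent would then move the rows of Y to reduce this loss while P stays fixed.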
The t-Distribution in Low Dimensions
The heavy tails of the Student-t distribution in the low-dimensional space address the "crowding problem": a low-dimensional map does not have enough room for all the points that sit at moderate distances from one another in high dimensions, so dissimilar points get squeezed together. The t-distribution lets these dissimilar points sit further apart, creating the characteristic cluster separation seen in t-SNE plots.
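To make the heavy-tail argument concrete, the short sketch below compares the unnormalized Gaussian kernel with the unnormalized Student-t kernel (one degree of freedom) at a few map distances. The distance values are arbitrary choices for illustration.

```python
import numpy as np

# Arbitrary low-dimensional distances between two map points
distances = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

gaussian = np.exp(-distances**2)           # Gaussian kernel, as in the original SNE
student_t = 1.0 / (1.0 + distances**2)     # Student-t kernel used by t-SNE

# How much more similarity the Student-t kernel assigns at each distance
ratio = student_t / gaussian

for d, g, t, r in zip(distances, gaussian, student_t, ratio):
    print(f"d={d:4.1f}  gaussian={g:.3e}  student_t={t:.3e}  ratio={r:.3e}")
```

The ratio grows with distance: to match a given small similarity, the Student-t kernel needs a much larger separation in the map than the Gaussian kernel would, which is what gives dissimilar points room to spread out.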
t-SNE with scikit-learn
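scikit-learn's TSNE is easy to use but computationally expensive. The exact algorithm (method='exact') scales as O(N²); the default, method='barnes_hut', uses an approximation that scales as O(N log N). For large datasets, keep the Barnes-Hut default or use the openTSNE library.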
Basic Usage
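A minimal sketch of a typical call, assuming scikit-learn is installed. The digits dataset, the 500-sample subset, and the parameter values are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small subset of the 64-dimensional digits data keeps the run short
digits = load_digits()
X = digits.data[:500]
y = digits.target[:500]

# Embed into 2-D; random_state fixes the stochastic parts of the fit
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(X)

print(embedding.shape)        # (500, 2)
print(tsne.kl_divergence_)    # final value of the KL objective
```

The two columns of embedding can then be passed to any scatter-plot routine, with y used to color the points by digit class.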
Perplexity Tuning
perplexity controls the effective number of neighbors (typically 5–50). Low perplexity emphasizes local structure; high perplexity reveals more global structure. Always try multiple perplexity values and run to convergence (n_iter ≥ 1000; the parameter was renamed max_iter in scikit-learn 1.5). t-SNE results are stochastic, so set random_state for reproducibility.
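A simple way to follow this advice is to fit one embedding per perplexity value and compare the plots side by side. The sketch below assumes the digits dataset and three arbitrary perplexity values. It leaves the iteration count at its default so that the code does not depend on which name (n_iter or max_iter) the installed scikit-learn version expects.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:300]   # perplexity must stay below the sample count

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(f"perplexity={perplexity:2d}  final KL={tsne.kl_divergence_:.3f}")
```

The KL values are not directly comparable across perplexities, because changing the perplexity changes the high-dimensional distribution being matched; inspect the resulting layouts visually instead.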
Interpreting t-SNE Correctly
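t-SNE is best treated as a visualization tool rather than a general-purpose dimensionality reduction step. The embedding is fit to one dataset and scikit-learn's TSNE offers no transform for new points, so its output is a poor choice of features for downstream ML tasks, and distances between clusters are not reliably meaningful.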
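One practical reason the embedding is awkward for downstream use can be checked directly: in scikit-learn, PCA can project unseen points with transform, while TSNE only provides fit_transform on the data it was fit to. A small check, assuming a current scikit-learn release:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a reusable linear map; TSNE only embeds the data it was fit on
pca_can_project_new_points = hasattr(PCA(), "transform")
tsne_can_project_new_points = hasattr(TSNE(), "transform")

print("PCA has transform: ", pca_can_project_new_points)
print("TSNE has transform:", tsne_can_project_new_points)
```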
Common Misinterpretations
- Cluster sizes: The apparent size of a cluster in the plot does not reflect its spread or density in the original high-dimensional space.
- Distances between clusters: Not reliably meaningful; t-SNE preserves local neighborhoods, so the gaps between clusters carry little information.
- Global structure: t-SNE prioritizes local structure; UMAP often preserves more of the global structure.
- Reproducibility: Different random seeds produce different layouts; always set random_state.
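The reproducibility point can be seen by embedding the same data with two different seeds. This sketch assumes the digits dataset and uses init='random' so that the dependence on the seed is easy to observe; it also reruns the first seed to report how closely that layout is reproduced.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:200]

def embed(seed):
    """Fit a 2-D t-SNE embedding of X with the given random seed."""
    return TSNE(n_components=2, init="random", random_state=seed).fit_transform(X)

run_a = embed(0)
run_b = embed(1)         # same data, different seed
run_a_again = embed(0)   # same data, same seed

different_seeds_match = np.allclose(run_a, run_b)
same_seed_max_diff = float(np.max(np.abs(run_a - run_a_again)))

print("Layouts from different seeds match:", different_seeds_match)
print("Max coordinate difference with the same seed:", same_seed_max_diff)
```

Both layouts from different seeds can be equally valid views of the data, which is another reason not to read meaning into absolute positions in the plot.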