UMAP (Uniform Manifold Approximation and Projection)
UMAP (Uniform Manifold Approximation and Projection) is a fast, scalable nonlinear dimensionality reduction algorithm. It preserves local structure and, in many cases, retains more global structure than t-SNE, and it supports supervised and semi-supervised learning.
How UMAP Works
UMAP constructs a high-dimensional fuzzy topological representation of the data (a weighted k-NN graph) and then optimizes a low-dimensional embedding to match this topology, using cross-entropy as the objective.
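The objective described above can be illustrated with a small NumPy sketch. It is not the umap-learn implementation: it uses dense matrices and a hand-written toy graph V, and it fixes the curve parameters a and b to 1.0 as an assumption (umap-learn fits them from min_dist and spread, and optimizes with sampled stochastic gradient descent over a sparse k-NN graph). The function names are hypothetical.

```python
import numpy as np

def low_dim_weights(Y, a=1.0, b=1.0):
    """Edge weights in the embedding: w_ij = 1 / (1 + a * d_ij^(2b)).

    a and b are illustrative defaults, not values fitted by umap-learn.
    """
    diff = Y[:, None, :] - Y[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)  # squared pairwise distances
    return 1.0 / (1.0 + a * d2 ** b)

def fuzzy_cross_entropy(V, W, eps=1e-12):
    """Cross-entropy between two fuzzy graphs given as weight matrices in [0, 1]."""
    V = np.clip(V, eps, 1.0 - eps)
    W = np.clip(W, eps, 1.0 - eps)
    attract = V * np.log(V / W)                        # pulls connected points together
    repel = (1.0 - V) * np.log((1.0 - V) / (1.0 - W))  # pushes unconnected points apart
    mask = ~np.eye(V.shape[0], dtype=bool)             # ignore self-edges
    return float(np.sum((attract + repel)[mask]))

# Toy high-dimensional graph: points 0-1 and 2-3 are strongly connected.
V = np.full((4, 4), 0.05)
V[0, 1] = V[1, 0] = 0.9
V[2, 3] = V[3, 2] = 0.9

# One embedding that respects the graph and one that swaps points 1 and 2.
Y_good = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
Y_bad = np.array([[0.0, 0.0], [5.0, 0.0], [0.1, 0.0], [5.1, 0.0]])

loss_good = fuzzy_cross_entropy(V, low_dim_weights(Y_good))
loss_bad = fuzzy_cross_entropy(V, low_dim_weights(Y_bad))
print(f"matching layout: {loss_good:.3f}, mismatched layout: {loss_bad:.3f}")
```

The layout that keeps connected points close scores a lower loss, which is the quantity the embedding optimization tries to reduce.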
UMAP vs. t-SNE
- Speed: UMAP is typically much faster than t-SNE on large datasets and scales to millions of points.
- Global structure: UMAP better preserves inter-cluster relationships.
- Hyperparameters: n_neighbors (local vs. global balance) and min_dist (cluster tightness) replace t-SNE's perplexity.
- Generalization: UMAP can embed new data with transform(); standard t-SNE implementations, such as scikit-learn's TSNE, provide no equivalent method.
Using UMAP
UMAP is available as the umap-learn package with a scikit-learn-compatible API.
Basic UMAP Embedding
<pre><code class="language-python">import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# scikit-learn's 8x8 digits dataset (not the full MNIST set)
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

reducer = umap.UMAP(
    n_neighbors=15,      # local neighborhood size
    min_dist=0.1,        # minimum distance between points in the embedding
    n_components=2,      # output dimensionality
    metric='euclidean',
    random_state=42,     # fixed seed for reproducible output
)
X_umap = reducer.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
plt.colorbar(label='Digit')
plt.title('UMAP of the scikit-learn Digits Dataset')
plt.show()</code></pre>
Supervised UMAP
<pre><code class="language-python"># Supervised UMAP uses labels to guide the embedding
reducer_sup = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap_sup = reducer_sup.fit_transform(X_scaled, y=y)  # pass labels here
# Typically yields better-separated clusters when labels are available</code></pre>
Tuning UMAP
The two main hyperparameters are n_neighbors, which sets the balance between local and global structure, and min_dist, which sets how tightly points are packed in the embedding.
Hyperparameter Guidance
- n_neighbors: Small (5) = more local detail; large (50) = more global structure. Typical: 10–50.
- min_dist: Small (0.0) = tightly packed clusters; large (1.0) = uniform distribution. Typical: 0.1.
- metric: Try 'cosine' for text/sparse data, 'euclidean' for most other cases.