
UMAP (Uniform Manifold Approximation and Projection)

UMAP (Uniform Manifold Approximation and Projection) is a fast, scalable nonlinear dimensionality reduction algorithm. It preserves local structure and often retains more of the global structure than t-SNE, and it supports supervised and semi-supervised embedding.


How UMAP Works

UMAP constructs a high-dimensional fuzzy topological representation of the data (a weighted k-NN graph) and then optimizes a low-dimensional embedding to match this topology, using cross-entropy as the objective.
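The two stages above can be sketched in plain NumPy. This is a simplified illustration, not the umap-learn implementation: the function names are hypothetical, the per-point bandwidth sigma is set by a crude heuristic (umap-learn finds it by binary search), the low-dimensional kernel uses a = b = 1 (umap-learn fits these from min_dist), and no optimizer is included. It only shows how the graph weights and the cross-entropy objective fit together.

```python
import numpy as np

def fuzzy_knn_weights(X, k=5):
    """Symmetric fuzzy k-NN graph weights (simplified sketch)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dist[i])
        nbrs = order[order != i][:k]          # k nearest neighbours, excluding i
        rho = dist[i, nbrs[0]]                # distance to the nearest neighbour
        # Assumption: heuristic bandwidth instead of umap-learn's binary search
        sigma = max(dist[i, nbrs].mean() - rho, 1e-12)
        W[i, nbrs] = np.exp(-(dist[i, nbrs] - rho) / sigma)
    # Fuzzy union makes the directed graph symmetric
    return W + W.T - W * W.T

def fuzzy_cross_entropy(W, Y, a=1.0, b=1.0, eps=1e-12):
    """Cross-entropy between graph weights W and similarities of embedding Y."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + a * d2 ** b)             # low-dimensional similarity
    mask = ~np.eye(len(Y), dtype=bool)        # ignore self-pairs
    w = np.clip(W[mask], eps, 1 - eps)
    q = np.clip(Q[mask], eps, 1 - eps)
    return np.sum(w * np.log(w / q) + (1 - w) * np.log((1 - w) / (1 - q)))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                 # toy high-dimensional data
Y = rng.normal(size=(30, 2))                  # random (unoptimized) 2-D embedding
W = fuzzy_knn_weights(X, k=5)
loss = fuzzy_cross_entropy(W, Y)
print(f"cross-entropy of a random embedding: {loss:.2f}")
```

An actual fit would move the points in Y by stochastic gradient descent to reduce this loss; the attractive term pulls graph neighbours together and the repulsive term pushes non-neighbours apart.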

UMAP vs. t-SNE

  • Speed: UMAP is typically much faster than t-SNE on large datasets and scales to millions of points.
  • Global structure: UMAP better preserves inter-cluster relationships.
  • Hyperparameters: n_neighbors (local vs. global balance) and min_dist (cluster tightness) replace t-SNE's perplexity.
  • Generalization: A fitted UMAP model can embed new data with transform(); scikit-learn's TSNE has no transform() method.

Using UMAP

UMAP is available as the umap-learn package with a scikit-learn-compatible API.

Basic UMAP Embedding

<pre><code class="language-python">import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# scikit-learn's 8x8 digits dataset (1,797 samples, 64 features)
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

reducer = umap.UMAP(
    n_neighbors=15,      # local neighborhood size
    min_dist=0.1,        # minimum distance in embedding
    n_components=2,
    metric='euclidean',
    random_state=42
)
X_umap = reducer.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=5, alpha=0.7)
plt.colorbar(label='Digit')
plt.title('UMAP of the scikit-learn Digits Dataset')
plt.show()</code></pre>

Supervised UMAP

<pre><code class="language-python"># Supervised UMAP uses labels to guide the embedding
reducer_sup = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap_sup = reducer_sup.fit_transform(X_scaled, y=y)  # pass labels here
# Results in better-separated clusters when labels are available</code></pre>

Tuning UMAP

The two main hyperparameters are n_neighbors and min_dist: n_neighbors sets the balance between local and global structure, and min_dist sets how tightly points are packed in the embedding.

Hyperparameter Guidance

  • n_neighbors: Small (5) = more local detail; Large (50) = more global structure. Typical: 10–50.
  • min_dist: Small (0.0) = tightly packed clusters; Large (1.0) = points spread more evenly, with looser clusters. Typical: 0.1.
  • metric: Try 'cosine' for text/sparse data, 'euclidean' for most other cases.