The Curse of Dimensionality

The curse of dimensionality describes how the properties of data change dramatically as the number of features grows — distances become meaningless, data becomes sparse, and models require exponentially more samples.

Distance Concentration

In high dimensions, the ratio of max to min pairwise distances approaches 1, making all points approximately equidistant from each other and rendering distance-based algorithms (K-NN, K-Means, SVM-RBF) ineffective.

Demonstrating Distance Concentration

<pre><code class="language-python">import numpy as np for d in [2, 10, 100, 1000]: X = np.random.randn(1000, d) dists = np.linalg.norm(X - X[0], axis=1) ratio = (dists.max() - dists.min()) / dists.mean() print(f"d={d:5d}: max/min ratio = {ratio:.4f}") # As d increases, the relative contrast between distances shrinks # making it harder to find meaningful nearest neighbors</pre>

Data Sparsity in High Dimensions

A unit hypercube in d dimensions has volume 1, but the volume of a ball of radius 0.5 vanishes to 0 exponentially as d increases. To sample the same density of points, exponentially more data is required per added dimension.

Volume of a Hypersphere

The volume of a d-dimensional unit ball is V_d = \u03c0^{d/2} / \u0393(d/2 + 1), which approaches 0 as d \u2192 \u221e. In practice, this means that most of the volume of a high-dimensional hypercube is in its corners, not near the center — where data would naturally cluster in lower dimensions.

Impact on Sample Requirements

<pre><code class="language-python"># To maintain 10 samples per unit length across all dimensions: dims = [1, 2, 5, 10, 20] for d in dims: samples_needed = 10 ** d print(f"d={d}: samples needed \u2248 {samples_needed:,}") # d=1: 10 # d=5: 100,000 # d=10: 10,000,000,000 -- completely impractical</pre>

Practical Implications and Remedies

High-dimensional data requires dimensionality reduction, feature selection, or regularization to remain tractable for ML algorithms.

Mitigation Strategies

Dimensionality Reduction: PCA, t-SNE, UMAP reduce to informative lower-dimensional representations.
Feature Selection: Remove irrelevant features to reduce effective dimensionality.
Regularization: L1 regularization (Lasso) implicitly performs feature selection.
Domain Knowledge: Engineer meaningful composite features rather than using raw high-dimensional inputs.