The Curse of Dimensionality
The curse of dimensionality describes how the properties of data change dramatically as the number of features grows — distances become meaningless, data becomes sparse, and models require exponentially more samples.
Distance Concentration
In high dimensions, the ratio of max to min pairwise distances approaches 1, making all points approximately equidistant from each other and rendering distance-based algorithms (K-NN, K-Means, SVM-RBF) ineffective.
Demonstrating Distance Concentration
Data Sparsity in High Dimensions
A unit hypercube in d dimensions has volume 1, but the volume of a ball of radius 0.5 vanishes to 0 exponentially as d increases. To sample the same density of points, exponentially more data is required per added dimension.
Volume of a Hypersphere
The volume of a d-dimensional unit ball is V_d = \u03c0^{d/2} / \u0393(d/2 + 1), which approaches 0 as d \u2192 \u221e. In practice, this means that most of the volume of a high-dimensional hypercube is in its corners, not near the center — where data would naturally cluster in lower dimensions.
Impact on Sample Requirements
Practical Implications and Remedies
High-dimensional data requires dimensionality reduction, feature selection, or regularization to remain tractable for ML algorithms.
Mitigation Strategies
- Dimensionality Reduction: PCA, t-SNE, UMAP reduce to informative lower-dimensional representations.
- Feature Selection: Remove irrelevant features to reduce effective dimensionality.
- Regularization: L1 regularization (Lasso) implicitly performs feature selection.
- Domain Knowledge: Engineer meaningful composite features rather than using raw high-dimensional inputs.