Unsupervised Learning Paradigms
Unsupervised learning discovers hidden structure in unlabeled data. Because no target variable guides the algorithm, it is both powerful and challenging to evaluate.
Core Paradigms
Unsupervised methods can be grouped into four broad categories, each addressing a different type of structure discovery.
Overview of Paradigms
- Clustering: Groups similar samples together (K-Means, DBSCAN, Hierarchical).
- Dimensionality Reduction: Projects high-dimensional data to fewer dimensions while preserving structure (PCA, t-SNE, UMAP).
- Density Estimation: Models the probability distribution of data (GMM, KDE).
- Anomaly Detection: Identifies samples that don't conform to learned patterns (Isolation Forest, One-Class SVM).
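To make the four categories concrete, here is a minimal sketch that runs one representative of each on the same synthetic data. It assumes scikit-learn is available; the dataset, the number of clusters and components, and the `contamination` value are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption for illustration): 300 points, 5 features, 3 blobs.
# The generated labels are discarded to mimic an unlabeled setting.
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# Clustering: assign each sample to one of 3 groups.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: project 5 features onto 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Density estimation: fit a Gaussian mixture and get per-sample log-density.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
log_density = gmm.score_samples(X)

# Anomaly detection: fit_predict returns +1 for inliers and -1 for outliers.
outlier_flags = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)

print("cluster labels (first 10):", cluster_labels[:10])
print("reduced shape:", X_2d.shape)
print("log-density (first 3):", log_density[:3])
print("flagged outliers:", int((outlier_flags == -1).sum()))
```

All four estimators consume the same feature matrix and differ only in what they return: group assignments, a lower-dimensional embedding, density values, or inlier/outlier flags.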
Evaluation Without Labels
Without ground-truth labels, there is no single objective measure of quality for an unsupervised model. Internal metrics, which score a result using only the data and the model's own output, together with domain knowledge guide model selection.
Internal Evaluation Metrics
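- Silhouette Score: Measures how close each sample is to its own cluster relative to the nearest other cluster (range: -1 to 1; higher is better).
- Davies-Bouldin Index: Average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid distance; lower is better (minimum 0).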
- Calinski-Harabasz Score: Ratio of between-cluster to within-cluster dispersion; higher is better.
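The three metrics above can be computed from a feature matrix and a set of cluster labels alone. The following is a small sketch using scikit-learn's metric functions; the synthetic data and the choice of K-Means with four clusters are assumptions made only for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

# Illustrative data: 300 two-dimensional points drawn from 4 blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Each metric takes only the data and the predicted cluster labels.
sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
dbi = davies_bouldin_score(X, labels)    # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)  # > 0, higher is better

print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}  calinski_harabasz={ch:.1f}")
```

Because the metrics point in different directions (higher versus lower is better) and use different scales, they are best compared across candidate models rather than read as absolute quality scores.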
External Evaluation (When Labels Exist)
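When reference labels happen to be available, a clustering can be compared against them with scores that ignore how the clusters are numbered. The sketch below assumes two commonly used scikit-learn functions, `adjusted_rand_score` and `normalized_mutual_info_score`; the choice of these two metrics and the toy label vectors are assumptions for illustration.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy reference labels (hypothetical) for nine samples in three classes.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same partition as y_true, but with the cluster ids renamed.
y_pred_renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# A partition that disagrees with y_true on some samples.
y_pred_noisy = [0, 0, 1, 1, 1, 1, 2, 2, 0]

# Identical partitions score 1.0 regardless of how cluster ids are named.
ari_renamed = adjusted_rand_score(y_true, y_pred_renamed)
nmi_renamed = normalized_mutual_info_score(y_true, y_pred_renamed)

# Disagreement lowers both scores below 1.0.
ari_noisy = adjusted_rand_score(y_true, y_pred_noisy)
nmi_noisy = normalized_mutual_info_score(y_true, y_pred_noisy)

print(f"renamed: ARI={ari_renamed:.3f}  NMI={nmi_renamed:.3f}")
print(f"noisy:   ARI={ari_noisy:.3f}  NMI={nmi_noisy:.3f}")
```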
Unsupervised Learning Workflow
A typical unsupervised workflow involves preprocessing, algorithm selection, hyperparameter tuning via internal metrics, and result interpretation with domain expertise.
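The workflow can be sketched end to end as follows. This is one possible arrangement, assuming K-Means as the algorithm, the number of clusters as the only hyperparameter, and the silhouette score as the selection criterion; the candidate range 2 to 6 is arbitrary.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative unlabeled data.
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.8, random_state=1)

# Step 1: preprocessing (zero mean, unit variance per feature).
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-3: fit the chosen algorithm for each candidate k and
# record an internal metric for every fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("selected k:", best_k)
```

The selected value is only a starting point for the final step: whether the resulting clusters are meaningful still has to be judged with domain expertise.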
Preprocessing Matters
Most clustering and dimensionality reduction algorithms are sensitive to feature scale. Standardize features (for example with StandardScaler) before applying distance-based or variance-based methods such as K-Means or PCA; otherwise features with large numeric ranges dominate the result. On very high-dimensional data, consider reducing dimensionality with PCA before clustering.
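One way to chain these preprocessing steps is a scikit-learn pipeline, so that scaling, PCA, and clustering are always applied in the same order. The sketch below is a hedged example: the 50-feature synthetic data, the deliberately inflated first feature, and the choice of 10 components and 4 clusters are all assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Illustrative high-dimensional data: 500 samples, 50 features.
X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=7)

# Inflate one feature's scale to mimic mixed units (e.g. metres vs. millimetres).
X[:, 0] *= 1000.0

# Scale -> reduce to 10 components -> cluster, applied in that order.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=7),
    KMeans(n_clusters=4, n_init=10, random_state=7),
)
labels = pipeline.fit_predict(X)

# The fitted preprocessing steps can be reused without the final clusterer.
X_reduced = pipeline[:-1].transform(X)

print("labels shape:", labels.shape)
print("reduced shape:", X_reduced.shape)
```

Keeping the steps in one pipeline object also means new data passes through the same fitted scaler and PCA projection before being assigned to a cluster.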