DBSCAN: Density-Based Spatial Clustering
DBSCAN groups together densely packed points and marks isolated points as noise, enabling discovery of clusters with arbitrary shapes without specifying the number of clusters.
Core Concepts
DBSCAN classifies each point as a core point (has ≥ MinPts neighbors within radius ε), a border point (within ε of a core point but fewer neighbors), or a noise point (neither core nor border).
Point Types Explained
- Core point: Has at least min_samples points (including itself) within distance eps. These are the dense regions that form cluster cores.
- Border point: Within eps of a core point but doesn't have enough neighbors to be core itself. Assigned to the cluster of a core point within eps; if core points from more than one cluster qualify, the result depends on the order in which points are processed.
- Noise point: Not within eps of any core point. Labeled as -1.
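The three definitions above can be sketched as a brute-force classifier. This is an illustration only, not how scikit-learn implements it: the function name classify_points and the toy data are made up for this example, and the O(n^2) distance matrix is only reasonable for small inputs.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each row of X as 'core', 'border', or 'noise' (brute force)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within_eps = dists <= eps                      # each point counts itself
    is_core = within_eps.sum(axis=1) >= min_samples
    # A point is "near a core point" if any core point lies within eps of it.
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    kinds = np.full(len(X), "noise", dtype=object)
    kinds[near_core & ~is_core] = "border"
    kinds[is_core] = "core"
    return kinds

# Hypothetical toy data: four points spaced 0.5 apart, plus one far away.
X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.5, 0.0], [5.0, 5.0]])
kinds = classify_points(X, eps=0.6, min_samples=3)
print(kinds.tolist())  # ['border', 'core', 'core', 'border', 'noise']
```

The two end points of the line have only two points within eps (themselves and one neighbor), so they are border points attached to the core points in the middle.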
DBSCAN in scikit-learn
scikit-learn's DBSCAN uses spatial indexing for efficient neighbor lookups and supports custom distance metrics.
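As a small sketch of those two options, the metric and algorithm parameters of DBSCAN select the distance function and the neighbor-search structure. The data and parameter values below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a 2x2 square, an L-shaped triple, and a stray point.
X = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [30, 30]],
    dtype=float,
)

# Manhattan (L1) distance, with neighbor queries served by a ball tree.
db = DBSCAN(eps=1.5, min_samples=3, metric="manhattan", algorithm="ball_tree").fit(X)
print(db.labels_)  # [ 0  0  0  0  1  1  1 -1]
```

Leaving algorithm at its default of "auto" lets scikit-learn pick the search structure itself.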
Basic Usage
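A minimal sketch of fitting DBSCAN and reading the result. The synthetic data (two tight blobs and one isolated point) and the eps and min_samples values are assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic data: two tight blobs plus one isolated point.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2))
outlier = np.array([[2.5, 10.0]])
X = np.vstack([blob_a, blob_b, outlier])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # one cluster index per point; -1 marks noise

# Count clusters, ignoring the noise label if present.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

In practice eps depends on the scale of the features, so standardizing the data before clustering is a common first step.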
Advantages Over K-Means
Unlike K-Means, DBSCAN: (1) does not require specifying K; (2) handles arbitrary cluster shapes; (3) naturally identifies outliers as noise. Its main weakness is clusters with very different densities, because a single (ε, MinPts) pair cannot capture all densities simultaneously.
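One way to see the difference on non-convex clusters is to run both algorithms on the two-moons toy dataset and compare each result to the true labels. This is a sketch under assumed settings: the dataset parameters and the eps and min_samples values are choices made for this example.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are not convex blobs.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-Means must be told K; DBSCAN infers the number of clusters from density.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true labels.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

K-Means splits the plane with a straight boundary between two centroids, so it cuts across both moons, while DBSCAN follows each chain of dense points.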
Accessing Cluster Components
DBSCAN exposes core sample indices via db.core_sample_indices_, enabling further analysis of dense cluster regions.
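A short sketch of turning core_sample_indices_ into boolean masks for core, border, and noise points. The toy data and the variable names (core_mask, border_mask, noise_mask) are made up for this example; components_ and labels_ are the fitted estimator's other attributes.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: four points on a line plus one far-away point.
X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.5, 0.0], [5.0, 5.0]])
db = DBSCAN(eps=0.6, min_samples=3).fit(X)

# Boolean mask of core samples, built from their indices.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Border points belong to a cluster but are not core; noise has label -1.
border_mask = (db.labels_ != -1) & ~core_mask
noise_mask = db.labels_ == -1

print("core indices:", db.core_sample_indices_)   # [1 2]
print("border indices:", np.flatnonzero(border_mask))  # [0 3]
print("noise indices:", np.flatnonzero(noise_mask))    # [4]
# components_ holds the coordinates of the core samples themselves.
print(db.components_)
```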