DBSCAN: Density-Based Spatial Clustering

DBSCAN groups together densely packed points and marks isolated points as noise, enabling discovery of clusters with arbitrary shapes without specifying the number of clusters.


Core Concepts

DBSCAN classifies each point as a core point (has ≥ MinPts neighbors within radius ε), a border point (within ε of a core point but with fewer than MinPts neighbors of its own), or a noise point (neither core nor border). In scikit-learn, ε is the eps parameter and MinPts is min_samples.

Point Types Explained

  • Core point: Has at least min_samples points (including itself) within distance eps. These are the dense regions that form cluster cores.
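  • Border point: Within eps of a core point but doesn't have enough neighbors to be core itself. It joins the cluster of a core point that reaches it; if core points from more than one cluster lie within eps, the assignment depends on the order in which points are processed, not on which core point is nearest.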
  • Noise point: Not within eps of any core point. Labeled as -1.
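The three definitions above can be checked by hand on a tiny dataset. The following is a minimal sketch in plain NumPy, not scikit-learn's implementation; the helper name classify_points and the six demo points are made up for illustration, and it uses a brute-force distance matrix that only suits small inputs.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each row of X as 'core', 'border', or 'noise' (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Brute-force pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within_eps = dists <= eps  # the diagonal is True, so each point counts itself
    is_core = within_eps.sum(axis=1) >= min_samples
    # A point is "near a core" if at least one core point lies within eps of it.
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Illustrative data: a tight square of four points, one point on its fringe,
# and one point far away from everything else.
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [0.35, 0.0],
    [3.0, 3.0],
])
kinds = classify_points(X_demo, eps=0.3, min_samples=4)
print(kinds)
```

Here the four points of the square each have at least four points (including themselves) within 0.3, the fringe point has only three but sits within 0.3 of a core point, and the distant point has no neighbors at all.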

DBSCAN in scikit-learn

scikit-learn's DBSCAN uses spatial indexing for efficient neighbor lookups and supports custom distance metrics.
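As a small illustration of the metric option, the sketch below runs DBSCAN on the same points with a Euclidean metric, a Manhattan metric, and a precomputed Manhattan distance matrix. The toy points and the eps value are assumptions chosen so that the metric visibly changes the outcome; metric, metric='precomputed', and pairwise_distances are existing scikit-learn features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Two tight groups along the diagonal plus one isolated point (illustrative data).
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.1], [0.2, 0.2],
    [5.0, 5.0], [5.1, 5.1], [5.2, 5.2],
    [10.0, 10.0],
])

# Neighboring points are about 0.141 apart in Euclidean distance
# but 0.2 apart in Manhattan distance, so eps=0.15 separates the two metrics.
labels_euclidean = DBSCAN(eps=0.15, min_samples=2, metric="euclidean").fit_predict(X_toy)
labels_manhattan = DBSCAN(eps=0.15, min_samples=2, metric="manhattan").fit_predict(X_toy)

# Any distance can also be supplied as a precomputed square matrix.
D = pairwise_distances(X_toy, metric="manhattan")
labels_precomputed = DBSCAN(eps=0.15, min_samples=2, metric="precomputed").fit_predict(D)

print("euclidean:  ", labels_euclidean)
print("manhattan:  ", labels_manhattan)
print("precomputed:", labels_precomputed)
```

With the Euclidean metric the two groups form clusters and the isolated point is noise; with the Manhattan metric no point has a neighbor within eps, so every point is labeled -1. The point is that eps is only meaningful relative to the chosen metric.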

Basic Usage

<pre><code class="language-python">from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Two interleaved half-moons; scale features so eps applies evenly to both axes
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5, metric='euclidean')
labels = db.fit_predict(X)

# Noise is labeled -1 and is not counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set1', s=20)
plt.title('DBSCAN on Moons Dataset')
plt.show()</code></pre>

Advantages Over K-Means

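Unlike K-Means, DBSCAN: (1) does not require specifying K; (2) handles arbitrary cluster shapes; (3) naturally identifies outliers as noise. Its main weakness is data whose clusters have very different densities: a single (ε, MinPts) pair cannot capture all densities simultaneously, so an ε small enough to separate the dense clusters tends to label sparse clusters as noise, while an ε large enough for the sparse clusters tends to merge the dense ones.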
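The shape advantage can be made concrete by scoring both algorithms against the known moon labels. This is a rough sketch rather than a benchmark: it reuses the dataset settings from the Basic Usage example, and the choice of the adjusted Rand index (adjusted_rand_score) as the comparison measure is an assumption made here, not something the comparison requires.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Same data settings as the Basic Usage example, but keep the true labels.
X_cmp, y_true = make_moons(n_samples=300, noise=0.08, random_state=42)
X_cmp = StandardScaler().fit_transform(X_cmp)

# K-Means must be told K and partitions the plane into convex regions.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_cmp)

# DBSCAN only needs a density threshold and follows the curved shapes.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_cmp)

# Adjusted Rand index: 1.0 means perfect agreement with the true labels,
# values near 0 mean agreement no better than chance.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-Means cuts each moon with a straight boundary, so part of each moon ends up in the wrong cluster, whereas DBSCAN can trace each moon as one connected dense region. Any points DBSCAN labels as noise count as their own group in the score.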

Accessing Cluster Components

DBSCAN exposes core sample indices via db.core_sample_indices_, enabling further analysis of dense cluster regions.

Core vs. Border vs. Noise

<pre><code class="language-python">import numpy as np

# Continues from Basic Usage: X, db, and labels are defined there
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Border points belong to a cluster but are not core; noise is labeled -1
border_mask = (~core_mask) & (labels != -1)
noise_mask = labels == -1

print(f"Core: {core_mask.sum()}, Border: {border_mask.sum()}, Noise: {noise_mask.sum()}")</code></pre>