DBSCAN Epsilon and MinPts Parameters

The two critical DBSCAN parameters, eps (the neighborhood radius) and min_samples (the minimum number of points a neighborhood needs for its center to count as a core point), must be tuned together to match the data's density structure.


Role of Each Parameter

eps defines the radius of the neighborhood used for density estimation. min_samples sets the density threshold: a point is a core point if at least min_samples points lie within eps of it. In scikit-learn, that count includes the point itself.
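The core-point rule can be checked directly by counting neighbors. The following is a minimal sketch, assuming scikit-learn and the same synthetic two-moons data used elsewhere in this section; eps=0.3 and min_samples=5 are illustrative values, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Illustrative data and parameter values (assumptions, not recommendations).
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)
eps, min_samples = 0.3, 5

# Count the points inside each eps-neighborhood. Because X is passed
# explicitly as the query, every point is counted in its own neighborhood.
nn = NearestNeighbors(radius=eps, metric="euclidean").fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)
counts = np.array([len(idx) for idx in neighborhoods])
is_core_manual = counts >= min_samples

# Compare with the core samples reported by DBSCAN itself.
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
is_core_dbscan = np.zeros(len(X), dtype=bool)
is_core_dbscan[db.core_sample_indices_] = True

print("core points:", int(is_core_manual.sum()))
print("matches DBSCAN:", np.array_equal(is_core_manual, is_core_dbscan))
```

Points that are not core but fall within eps of a core point become border points; everything else is labeled noise (-1).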

Effect of eps

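If eps is too small, most points are labeled noise; if it is too large, distinct clusters merge into one. The goal is an eps that captures the local density of genuine clusters. Because eps is a distance, it depends on feature scale, so standardize the features before tuning it. A common rule of thumb is to start with min_samples = 2 * n_features, then tune eps using the k-distance plot.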

Effect of min_samples

Higher min_samples requires denser regions to form clusters, suppressing small or weak clusters. Lower values create more (potentially noisy) clusters. For 2D data, min_samples=5 (scikit-learn's default) is a reasonable start and close to the 2 * n_features rule of thumb; increase it roughly in proportion to dimensionality.
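To see this effect in isolation, one option is to hold eps fixed and vary only min_samples. The sketch below assumes scikit-learn and synthetic two-moons data; the fixed eps=0.3 and the list of min_samples values are arbitrary choices for illustration.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Illustrative data; eps is held at an assumed value so only min_samples varies.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)
eps = 0.3

results = []  # tuples of (min_samples, n_clusters, n_core, n_noise)
for min_samples in [3, 5, 10, 20]:
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    labels = db.labels_
    # Label -1 marks noise, so it is not counted as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_core = len(db.core_sample_indices_)
    n_noise = int((labels == -1).sum())
    results.append((min_samples, n_clusters, n_core, n_noise))
    print(f"min_samples={min_samples}: clusters={n_clusters}, "
          f"core={n_core}, noise={n_noise}")
```

At a fixed eps, raising min_samples can only shrink the set of core points and grow the set of noise points. The number of clusters can move in either direction, since a cluster may vanish or split as its core points disappear.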

K-Distance Plot for Tuning eps

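The k-distance plot shows every point's distance to its k-th nearest neighbor, with the points sorted by that distance. A sharp knee in the curve suggests a good eps value: points whose k-distance lies above the knee value are likely noise, while points below it sit in regions dense enough to form clusters.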
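Reading the knee by eye is subjective, so a rough programmatic estimate can serve as a starting point. The sketch below uses one simple heuristic, chosen here for illustration rather than taken from any standard: it picks the point on the sorted curve farthest from the straight line joining the curve's endpoints. The helper name knee_index is hypothetical, and the result is only a candidate to check against the plot.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Illustrative data (an assumption for this sketch).
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

k = 5
# Querying the fitted data counts each point as its own first neighbor,
# which matches scikit-learn's convention that min_samples includes the point.
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kdist = np.sort(distances[:, -1])[::-1]  # sorted in descending order


def knee_index(y):
    """Hypothetical helper: index of the curve point farthest from the
    chord joining the first and last points, after scaling both axes to [0, 1]."""
    x = np.linspace(0.0, 1.0, len(y))
    y = (y - y.min()) / (y.max() - y.min())
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance from each (x, y) to the chord.
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))


i = knee_index(kdist)
eps_candidate = float(kdist[i])
print(f"knee at sorted index {i}, candidate eps = {eps_candidate:.3f}")
```

The candidate is sensitive to outliers and to the choice of k, so it is best treated as a first guess to refine with a parameter sweep.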

Generating the K-Distance Plot

<pre><code class="language-python">from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons
import numpy as np
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

min_samples = 5
neigh = NearestNeighbors(n_neighbors=min_samples)
neigh.fit(X)
# Each point is returned as its own first neighbor (distance 0)
distances, _ = neigh.kneighbors(X)

# Sort distances to the k-th neighbor in descending order
kdist = np.sort(distances[:, -1])[::-1]

plt.plot(kdist)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'{min_samples}-NN Distance')
plt.title('K-Distance Plot')
plt.axhline(y=0.3, color='r', linestyle='--', label='eps=0.3')
plt.legend()
plt.show()</code></pre>

Validating Parameter Choices

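After setting parameters, validate with the silhouette score (excluding noise points) and by inspecting the ratio of noise to clustered points. Treat the silhouette score as a rough guide only: it favors compact, convex clusters and can score correct but non-convex DBSCAN clusters poorly.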

Parameter Sweep

<pre><code class="language-python">from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Assumes X is the standardized array from the k-distance example
for eps in [0.2, 0.3, 0.5]:
    db = DBSCAN(eps=eps, min_samples=5)
    labels = db.fit_predict(X)

    # Label -1 marks noise and is not a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_ratio = list(labels).count(-1) / len(labels)

    # Silhouette needs at least two clusters; report NaN otherwise
    sil = float('nan')
    if n_clusters > 1:
        mask = labels != -1
        sil = silhouette_score(X[mask], labels[mask])

    print(f"eps={eps}: clusters={n_clusters}, noise={noise_ratio:.1%}, sil={sil:.3f}")</code></pre>