What?

The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
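A minimal sketch of that neighborhood test in plain Python, to make the idea concrete. The helper names `neighbors` and `is_core_point` are made up for this illustration and are not part of any library; points that pass the test are what DBSCAN calls core points.

```python
from math import dist

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def is_core_point(points, i, eps, min_samples):
    """True if the eps-neighborhood of points[i] holds at least min_samples points."""
    return len(neighbors(points, i, eps)) >= min_samples

# Three points close together and one far away (toy data for illustration).
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (10.0, 10.0)]

core_flags = [is_core_point(points, i, eps=1.0, min_samples=3)
              for i in range(len(points))]
print(core_flags)  # [True, True, True, False]
```

The isolated point has only itself in its neighborhood, so it fails the density test.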

DBSCAN

Density-based spatial clustering of applications with noise.

HDBSCAN

Hierarchical DBSCAN (hierarchical density-based spatial clustering of applications with noise).

When?

  • The number of clusters is not known in advance (KMeans, by contrast, needs it as input).
  • The data contain outliers or noise.
  • Clusters may have arbitrary shapes.

In Code

DBSCAN with Scikit-learn

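from sklearn.cluster import DBSCAN
clr = DBSCAN(eps=3, min_samples=2)
clr.fit(X)            # X: array of shape (n_samples, n_features)
labels = clr.labels_  # DBSCAN has no predict(); read the labels after fit
# or, in one step
labels = clr.fit_predict(X)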

Parameters (others):

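  • min_samples: min number of samples in a point's eps-neighborhood (the point itself included) for that point to count as a core point, i.e. to be called “dense”.
  • eps: max distance between 2 samples for one to be in the neighborhood of the other. It is not a bound on the distances between points within a cluster. Its value is expressed in the units of the data.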
  • Higher min_samples + lower eps indicates higher density necessary to form a cluster.
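A small sketch of how these two parameters interact, on hand-made toy data (the points and the three settings are chosen only for illustration). Raising min_samples or lowering eps makes the density requirement stricter, so fewer points end up in clusters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0, 1], [1, 0],   # tight group of three
              [5, 5], [5, 6],           # pair
              [20, 20]])                # isolated point

# Loose setting: both the group of three and the pair form clusters.
loose = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

# Higher min_samples: the pair is no longer dense enough and becomes noise.
strict_min = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)

# Lower eps: no point has a neighbor within reach, so everything is noise.
strict_eps = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

print(loose)       # [ 0  0  0  1  1 -1]
print(strict_min)  # [ 0  0  0 -1 -1 -1]
print(strict_eps)  # [-1 -1 -1 -1 -1 -1]
```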

Components:

  • clr.labels_: cluster label of each sample; noisy samples are labeled -1.
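A short sketch of reading the fitted attributes, using toy data made up for this example. Besides `labels_`, scikit-learn's DBSCAN also exposes `core_sample_indices_`; the cluster and noise counts below are derived by hand from the labels.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8],
              [25, 80]])

clr = DBSCAN(eps=3, min_samples=2).fit(X)

labels = clr.labels_                      # one label per sample, -1 means noise
core_idx = clr.core_sample_indices_       # indices of the core samples

# Count clusters without counting the noise label.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))

print(labels)      # [ 0  0  0  1  1 -1]
print(n_clusters)  # 2
print(n_noise)     # 1
```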

HDBSCAN

from hdbscan import HDBSCAN