The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
- "DBSCAN" = Density-Based Spatial Clustering of Applications with Noise.
- Separate clusters of high density from ones of low density.
- Can sort data into clusters of varying shapes.
- Input: a set of points, a neighborhood N (defined by a radius), and minpts (the density threshold).
- Output: clusters of dense points (plus noise points).
- Each point is one of:
  - Core point: has at least minpts points in its neighborhood.
  - Border point: not a core point, but has at least one core point in its neighborhood.
  - Noise point: neither a core point nor a border point.
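The three point types above can be sketched in a few lines of plain Python. This is only an illustration, not a library implementation: the helper name `classify_points` and the sample points are made up, distances are Euclidean, and a point is assumed to count as a member of its own neighborhood (the convention scikit-learn follows for `min_samples`).

```python
from math import dist

def classify_points(points, eps, minpts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Neighborhood of a point: indices of all points within eps (itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nb) >= minpts for nb in neighbors]

    kinds = []
    for i, nb in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nb):
            kinds.append("border")   # not core, but close to a core point
        else:
            kinds.append("noise")
    return kinds

# Hypothetical toy data: a small chain of points plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (2, 0), (10, 10)]
kinds = classify_points(points, eps=1.0, minpts=3)
print(kinds)  # ['core', 'border', 'core', 'border', 'noise']
```

The pairwise scan is quadratic in the number of points, which is fine for a sketch; real implementations typically use a spatial index to find neighbors.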
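- Choose an unvisited point → is it a core point?
  - If yes → start a cluster and expand it: add every point in its neighborhood, then keep expanding from the neighbors that are core points themselves (the others become border points).
  - If no → mark it as noise for now; it becomes a border point if a later expansion reaches it.
- Repeat with the remaining unvisited points to form other clusters.
- Eliminate the points that are still marked as noise.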
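The steps above can be sketched as a small, unoptimized function. It is a minimal illustration under stated assumptions (Euclidean distance, a point counts toward its own neighborhood, `-1` marks noise); the function name `dbscan` and the toy points are hypothetical, and this is not how scikit-learn implements it internally.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, minpts):
    """Return one cluster label per point; -1 marks noise."""
    def region(i):
        # Indices of all points within eps of point i (i itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)    # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < minpts:
            labels[i] = NOISE        # may be relabeled as a border point later
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue             # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= minpts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Hypothetical toy data: two small groups and one isolated point.
points = [(0, 1), (0, 0), (1, 0), (2, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.0, minpts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Note that the first point, `(0, 1)`, is visited before any core point, so it is first marked as noise and only later picked up as a border point of cluster 0.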
- Discovers any number of clusters (unlike K-Means and K-Medoids clustering, which need the number of clusters as input).
- Handles clusters of varying sizes and shapes.
- Detects and ignores outliers.
- Sensitive to the choice of neighborhood parameters (e.g., if minpts is too small, noise is identified incorrectly).
- Produces noise points, and it is unclear how to compute clustering metrics when noise is present.
- HDBSCAN = Hierarchical DBSCAN.
- Difference between DBSCAN and HDBSCAN:
  - HDBSCAN: focuses on regions of high density.
  - DBSCAN: finds the right clusters but also creates clusters with a very low density of examples (Figure 1).
- Check more in this note.
- Reduce the speed of clustering in comparison with other methods (Figure 2).
- HDBSCAN has the parameter minimum cluster size (`min_cluster_size`), which is how big a cluster needs to be in order to form.
- We don't know the number of clusters in advance (which KMeans requires).
- There are outliers or noise in the data.
- Clusters have arbitrary shapes.
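```python
from sklearn.cluster import DBSCAN

clr = DBSCAN(eps=3, min_samples=2)
clr.fit(X)  # X: array of shape (n_samples, n_features)
# or
clr.fit_predict(X)  # fits and returns the cluster labels
```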
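`min_samples`: min number of samples in a neighborhood for it to be called "dense".
`eps`: max distance between 2 samples for one to be considered in the neighborhood of the other (it is not a bound on distances within a cluster). Its value is expressed in the same unit as the data.
Higher `min_samples` or lower `eps` indicates higher density necessary to form a cluster.
`clr.labels_`: clusters' labels.
For a reference of the parameters, check the API.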
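A small end-to-end run may make these parameters and `labels_` more concrete. This sketch assumes `numpy` and `scikit-learn` are installed; the six points are illustrative toy data, and the variable names `n_clusters` and `n_noise` are only for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups and one far-away point.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

clr = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clr.labels_
print(labels)  # [ 0  0  0  1  1 -1]

# Noise is labeled -1, so it must not be counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)  # 2 1
```

With `eps=3`, the last point has no other sample in its neighborhood, so it cannot reach `min_samples=2` and ends up as noise.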
```python
from hdbscan import HDBSCAN

# HDBSCAN has no `eps` parameter (unlike DBSCAN), so it is omitted here
clr = HDBSCAN(min_cluster_size=3, metric='euclidean')
```
`min_cluster_size`: (ref) the smallest size grouping that you wish to consider a cluster.
`min_samples`: (ref) the number of samples in a neighbourhood for a point to be considered a core point. The larger the value, the more points will be declared as noise, and clusters will be restricted to progressively more dense areas.
- Working with a precomputed distance matrix (more):
```python
from dtaidistance import dtw
from hdbscan import HDBSCAN

# series: your collection of time series
matrix = dtw.distance_matrix_fast(series)  # something like that
model = HDBSCAN(metric='precomputed')
clusters = model.fit_predict(matrix)
```
`-1` means that this sample is not assigned to any cluster, i.e. it is noise!
`clr.labels_`: labels of clusters (including `-1`).
`clr.probabilities_`: scores (between 0 and 1).
`0` means the sample is not in a cluster at all (noise),
`1` means the heart of the cluster.
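As a rough illustration of how these two attributes can be used together, the sketch below separates noise from confident cluster members. The two lists are made-up values standing in for `labels_` and `probabilities_` of a fitted model, and the `THRESHOLD` cut-off is a hypothetical choice, not something the library prescribes.

```python
# Made-up values standing in for the labels_ and probabilities_ of a fitted model.
labels = [0, 0, 1, -1, 1, 0, -1]
probabilities = [1.0, 0.8, 0.9, 0.0, 0.4, 0.3, 0.0]

THRESHOLD = 0.5  # hypothetical cut-off for "confident" membership

# Indices of the samples that were left out of every cluster.
noise_idx = [i for i, lab in enumerate(labels) if lab == -1]

# For each cluster, keep only the samples with a high membership score.
confident = {}
for i, (lab, p) in enumerate(zip(labels, probabilities)):
    if lab != -1 and p >= THRESHOLD:
        confident.setdefault(lab, []).append(i)

print(noise_idx)   # [3, 6]
print(confident)   # {0: [0, 1], 1: [2]}
```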
Note that HDBSCAN is built on scikit-learn, but it doesn't have a `.predict()` method as some other clustering methods in scikit-learn do. The code below gives you a new version of HDBSCAN (`WrapperHDBSCAN`) which has an additional `.predict()` method.
```python
from hdbscan import HDBSCAN

class WrapperHDBSCAN(HDBSCAN):
    def predict(self, X):
        # Refits the model on X and returns the labels found for X
        self.fit(X)
        return self.labels_
```
- Official doc -- How HDBSCAN works?