Metrics for clustering

10-01-2021

In order to group points into clusters, we need a way to measure the distance between them.

Intercluster & intracluster #

  • Intracluster distance: distance between two points in the same cluster.
  • Intercluster distance: distance between two points in different clusters.

Intercluster & intracluster distance
An illustration of intercluster and intracluster distance.

Best clustering ⇒ min intracluster distance & max intercluster distance.

Distances used #

Intracluster types #

Intracluster -- Measuring distance between points in a cluster.

🔅 Complete Diameter Distance: the farthest distance between two points in a cluster.

\delta (S) = \max_{x, y\in S} d(x,y)

Complete Diameter Distance

🔅 Average Diameter Distance: the average distance between ALL pairs of points in a cluster.

\delta (S) = \dfrac{1}{\vert S\vert (\vert S\vert-1)} \sum_{x, y\in S, x\ne y} d(x,y)

where |S| is the number of points in S.

Average Diameter Distance

🔅 Centroid Diameter Distance: twice the average distance between the points of a cluster and its center.

\delta (S) = 2 \left( \dfrac{\sum_{x\in S}d(x, c_S)}{\vert S\vert} \right)

where c_S is the center of S (it can be calculated as \frac{\sum_{x\in S}x}{\vert S\vert}) and |S| is the number of points in S.

Centroid Diameter Distance
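As a quick sanity check, the three intracluster distances above can be computed directly with numpy/scipy. This is a minimal sketch on random 2D points; the variable names are mine, not a standard API.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
S = rng.normal(size=(20, 2))              # one cluster of 20 points in 2D

D = squareform(pdist(S))                  # |S| x |S| pairwise Euclidean distances

# Complete diameter: farthest pair of points in the cluster
complete_diameter = D.max()

# Average diameter: mean over the |S|(|S|-1) ordered pairs with x != y
average_diameter = D.sum() / (len(S) * (len(S) - 1))

# Centroid diameter: twice the mean distance to the cluster center c_S
c_S = S.mean(axis=0)
centroid_diameter = 2 * np.linalg.norm(S - c_S, axis=1).mean()
```

Note that the average diameter here equals the plain mean of the condensed distance vector `pdist(S)`, since each unordered pair is counted twice in both the numerator and the denominator.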

Intercluster types #

Intercluster -- Measuring the distance between 2 clusters. These distances can be used as linkages in Agglomerative clustering.

🔅 Single Linkage Distance: the closest distance between two points, one in each of the 2 clusters.

\delta (S, T) = \min_{x\in S, y\in T} d(x,y)

Single linkage distance

🔅 Complete (maximum) Linkage Distance: the farthest distance between two points, one in each of the 2 clusters.

\delta (S, T) = \max_{x\in S, y\in T} d(x,y)

Complete linkage distance

🔅 Centroid Linkage Distance: the distance between the centers of the 2 clusters.

\delta (S, T) = d\left( c_S, c_T \right)

where c_S, c_T are the centers of S and T. They can be calculated as \frac{\sum_{x\in S} x}{\vert S\vert} and \frac{\sum_{y\in T} y}{\vert T\vert}, where |S|, |T| are the numbers of elements in S and T.

More about centroid and center

Don't confuse these two. The center is the point in the interior of a circle that is equidistant from all points on the circumference, whereas the centroid is the point at the centre of any shape (for a cluster, the mean of its points).

Centroid linkage distance

🔅 Average Linkage Distance: the average distance between ALL pairs of points taken from the 2 clusters.

\delta (S, T) = \dfrac{1}{\vert S\vert \vert T\vert} \sum_{x\in S, y\in T} d(x,y)

Average linkage distance

where |S|, |T| are the numbers of elements in S and T.
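The four linkage distances above can be sketched the same way with scipy's pairwise distances. The two toy clusters and their locations are arbitrary choices for illustration, and d is assumed Euclidean.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
S = rng.normal(loc=0.0, size=(15, 2))      # cluster S around the origin
T = rng.normal(loc=5.0, size=(10, 2))      # cluster T, shifted away

D = cdist(S, T)                            # |S| x |T| cross-cluster distances

single_linkage = D.min()                   # closest cross-cluster pair
complete_linkage = D.max()                 # farthest cross-cluster pair
average_linkage = D.mean()                 # 1/(|S||T|) * sum over all pairs
centroid_linkage = np.linalg.norm(S.mean(axis=0) - T.mean(axis=0))
```

By construction, single ≤ average ≤ complete; the centroid linkage also never exceeds the average linkage, since the distance between means is at most the mean of the pairwise distances.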

🔅 Ward's method (Minimum variance method): the increase in total within-cluster squared deviation when the 2 considered clusters are merged into a single cluster.

\begin{aligned} \delta (S, T) &= \sum_{x\in S\cup T} \Vert x - c_{S\cup T} \Vert^2 - \left( \sum_{x\in S} \Vert x - c_S \Vert^2 + \sum_{x\in T} \Vert x - c_T \Vert^2 \right) \\ &= \dfrac{\vert S\vert \vert T\vert}{\vert S\vert + \vert T\vert}\Vert c_S - c_T \Vert^2 \end{aligned}

where c_S, c_T are the centers of S and T, and |S|, |T| are the numbers of elements in S and T.

Ward's method
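The two lines of Ward's formula are equal, and the identity is easy to check numerically. A sketch with arbitrary toy clusters (the data and helper name are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(8, 3))
T = rng.normal(loc=2.0, size=(5, 3))

def sse(X):
    # total squared distance from each point to the cluster centroid
    return ((X - X.mean(axis=0)) ** 2).sum()

# First form: increase in within-cluster SSE caused by merging S and T
delta_merge = sse(np.vstack([S, T])) - (sse(S) + sse(T))

# Second form: closed expression in terms of the two centroids
n_S, n_T = len(S), len(T)
c_S, c_T = S.mean(axis=0), T.mean(axis=0)
delta_closed = n_S * n_T / (n_S + n_T) * np.sum((c_S - c_T) ** 2)
```

Both quantities agree up to floating-point error, which is why agglomerative implementations can use the cheap centroid form instead of recomputing the merged SSE.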

Difference #

👉 Different linkage types: Ward, complete, average, and single linkage


Different clustering results using different linkages on some special datasets. Source of image.

Code #

Linkages can be chosen via the linkage parameter of sklearn's AgglomerativeClustering:

from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(linkage="ward").fit(X)
# There are others: "ward" (default), "complete", "average", "single"

Silhouette analysis #

Silhouette analysis (SA) is used to determine the degree of separation between clusters. It measures how close each point in one cluster is to points in the neighboring clusters and thus gives a visual idea of the number of clusters.

SA_i = \dfrac{b_i-a_i}{\max (a_i, b_i)}

Illustration of mean intra-cluster distance a (the average distance between the considered sample and all other points in the same cluster) and nearest-cluster distance b (the average distance between the considered sample and all samples in the cluster closest to its own).

  • SA = +1 : the sample is far away from its neighboring clusters. (For a clustering result) Clusters are dense & well separated.
  • SA = 0 : the sample is near the decision boundary between two clusters. (For a clustering result) Clusters overlap.
  • SA = -1 : the sample has probably been assigned to the wrong cluster.

Used for? #

  • Better than the Elbow method for finding the number of clusters. [ref]
  • Check whether a clustering algorithm performs well.
  • Can be used to find outliers (scores near -1).
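For the first use case, a common recipe is to run KMeans for several values of k and keep the one with the highest mean silhouette score. A sketch on synthetic blobs; the 4-blob layout is an assumption for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: 4 well-separated blobs at known locations
centers = [[0, 0], [5, 5], [0, 5], [5, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                  random_state=42)

# Mean silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # k with the highest mean SA
```

On this data the score peaks at k = 4, matching the number of generated blobs.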

Silhouette plot #

👉 Selecting the number of clusters with silhouette analysis on KMeans clustering

Silhouette plot with n_clusters=2. Oy: all samples in the dataset, sorted by their cluster. Ox: the Silhouette scores of these samples. The red dotted line is the mean SA.

What do we want to see for a good number of clusters?

  1. The red dotted line approaches 1.
  2. The plot of each cluster lies above the red dotted line as much as possible.
  3. The widths of the cluster plots are as uniform as possible.

from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer

model = KMeans(5, random_state=42)
visualizer = SilhouetteVisualizer(model, colors='yellowbrick')

visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure

For the original scikit-learn functions, check this example.

Code #

👉 Ref to silhouette_score, silhouette_samples.

# MEAN Silhouette Coefficient over all samples
from sklearn.metrics import silhouette_score
silhouette_score(X, labels)
# Silhouette Coefficient of EACH SAMPLE
from sklearn.metrics import silhouette_samples
scores = silhouette_samples(X, cluster_labels)
for i in range(n_clusters):
    ith_cluster_silhouette_values = scores[cluster_labels == i]