To group points into clusters, we need to know the distances between them.
- Intracluster distance: distance between two points in the same cluster.
- Intercluster distance: distance between two points in different clusters.
Best clustering → min intracluster & max intercluster.
Intracluster -- Measuring distance between points in a cluster.
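🔅 Complete Diameter Distance: the farthest distance between two points in a cluster $S$, where $d(x, y)$ is the distance between points $x$ and $y$.
$$\delta(S) = \max_{x, y \in S} d(x, y)$$
🔅 Average Diameter Distance: the average distance between ALL pairs of points in a cluster.
$$\delta(S) = \frac{1}{|S|(|S|-1)} \sum_{x, y \in S,\, x \neq y} d(x, y)$$
where $|S|$ is the number of points in $S$.
🔅 Centroid Diameter Distance: twice the average distance between the points and the center of a cluster.
$$\delta(S) = \frac{2}{|S|} \sum_{x \in S} d(x, c_S)$$
where $c_S$ (can be calculated as $\frac{1}{|S|} \sum_{x \in S} x$) and $|S|$ are the center and the number of points in $S$.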
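The three intracluster measures can be sketched in a few lines. This is only an illustration: it assumes Euclidean distance, and the helper names (`complete_diameter`, `average_diameter`, `centroid_diameter`) are made up for this note, not part of any library.

```python
import numpy as np
from scipy.spatial.distance import pdist


def complete_diameter(S):
    """Largest pairwise distance inside the cluster."""
    return pdist(S).max()


def average_diameter(S):
    """Mean distance over all distinct pairs of points in the cluster."""
    return pdist(S).mean()


def centroid_diameter(S):
    """Twice the mean distance from each point to the cluster center."""
    center = S.mean(axis=0)
    return 2 * np.linalg.norm(S - center, axis=1).mean()


# Toy cluster: the corners of a 3-4-5 right triangle
S = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
print(complete_diameter(S))  # 5.0 (the hypotenuse)
print(average_diameter(S))   # 4.0 = (3 + 4 + 5) / 3
print(centroid_diameter(S))
```

`pdist` returns each unordered pair once, which gives the same mean as averaging over all ordered pairs with $x \neq y$.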
Intercluster -- Measuring distance between 2 clusters. These measures can be used as linkage criteria in agglomerative clustering.
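🔅 Single Linkage Distance: the closest distance between two objects in 2 clusters $S$ and $T$.
$$\delta(S, T) = \min_{x \in S,\, y \in T} d(x, y)$$
🔅 Complete (maximum) Linkage Distance: the farthest distance between two objects in 2 clusters.
$$\delta(S, T) = \max_{x \in S,\, y \in T} d(x, y)$$
🔅 Centroid Linkage Distance: the distance between the 2 centers of 2 clusters.
$$\delta(S, T) = d(c_S, c_T)$$
where $c_S, c_T$ are the centers of $S, T$. They can be calculated as $c_S = \frac{1}{|S|} \sum_{x \in S} x$ and $c_T = \frac{1}{|T|} \sum_{y \in T} y$, where $|S|, |T|$ are the numbers of elements in $S, T$.
🔅 Average Linkage Distance: the average distance between ALL pairs of objects taken from the 2 clusters.
$$\delta(S, T) = \frac{1}{|S||T|} \sum_{x \in S} \sum_{y \in T} d(x, y)$$
🔅 Ward's method (Minimum variance method): the increase in within-cluster deviation (sum of squared distances to the center) when the 2 considered clusters are replaced by the single cluster that joins them.
$$\delta(S, T) = \frac{|S||T|}{|S| + |T|} \lVert c_S - c_T \rVert^2$$
where $c_S, c_T$ are the centers of $S, T$ and $|S|, |T|$ are the numbers of elements in $S, T$.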
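As a rough sketch, the linkage distances above can be computed directly with NumPy and SciPy. It assumes Euclidean distance. The function names are hypothetical helpers for this note. For Ward's method it uses the common "increase in within-cluster sum of squares" form, $\frac{|S||T|}{|S|+|T|} \lVert c_S - c_T \rVert^2$, which may differ by a constant factor or a square root from what a given library reports.

```python
import numpy as np
from scipy.spatial.distance import cdist


def single_linkage(S, T):
    """Closest distance between a point of S and a point of T."""
    return cdist(S, T).min()


def complete_linkage(S, T):
    """Farthest distance between a point of S and a point of T."""
    return cdist(S, T).max()


def average_linkage(S, T):
    """Mean distance over all (point of S, point of T) pairs."""
    return cdist(S, T).mean()


def centroid_linkage(S, T):
    """Distance between the two cluster centers."""
    return np.linalg.norm(S.mean(axis=0) - T.mean(axis=0))


def ward_distance(S, T):
    """Increase in within-cluster sum of squares caused by merging S and T."""
    n_s, n_t = len(S), len(T)
    gap = S.mean(axis=0) - T.mean(axis=0)
    return n_s * n_t / (n_s + n_t) * (gap @ gap)


# Two toy clusters on a line
S = np.array([[0.0, 0.0], [1.0, 0.0]])
T = np.array([[3.0, 0.0], [5.0, 0.0]])
print(single_linkage(S, T))    # 2.0
print(complete_linkage(S, T))  # 5.0
print(average_linkage(S, T))   # 3.5
print(centroid_linkage(S, T))  # 3.5
print(ward_distance(S, T))     # 12.25
```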
Linkages can be selected via the `linkage` parameter of sklearn's `AgglomerativeClustering`:
```python
from sklearn.cluster import AgglomerativeClustering

# X: array-like of shape (n_samples, n_features)
clustering = AgglomerativeClustering(linkage="ward").fit(X)
# Available linkages: "ward" (default), "complete", "average", "single"
```
Silhouette analysis (SA) is used to determine the degree of separation between clusters. It measures how close each point in one cluster is to the points in the neighboring clusters, and thus gives a visual idea of a suitable number of clusters.
- SA = +1 : a sample is far away from its neighboring clusters. (For clustering algorithm) Clusters are dense & well-separated.
- SA = 0 : a sample is near decision boundary. (For clustering algorithm) There are overlapped clusters.
- SA = -1 : a sample is assigned to a wrong cluster.
- Check whether a clustering algorithm performs well.
- Can be used to find outliers (-1 scores)
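To make the score concrete: for a sample $i$, the silhouette coefficient is $s(i) = \frac{b - a}{\max(a, b)}$, where $a$ is the mean distance from $i$ to the other points of its own cluster and $b$ is the smallest mean distance from $i$ to the points of any other cluster. The sketch below computes it by hand and compares the result with scikit-learn's `silhouette_samples`. It assumes Euclidean distance, and `silhouette_of_sample` is a hypothetical helper written for this note.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def silhouette_of_sample(X, labels, i):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    if own.sum() == 1:
        return 0.0  # scikit-learn's convention for single-point clusters
    # a: mean distance to the OTHER points of the same cluster
    # (the distance of the sample to itself is 0, so only the count changes)
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the points of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)


# Two tight, well-separated toy clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

manual = [silhouette_of_sample(X, labels, i) for i in range(len(X))]
print(manual)
print(np.allclose(manual, silhouette_samples(X, labels)))
```

When $a \ll b$ the score approaches +1, when $a \approx b$ it is near 0, and when $a > b$ (the sample is closer to another cluster than to its own) it becomes negative.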
What do we want to see for a good number of clusters?
- The red dotted line (the mean silhouette score) approaches 1.
- The plot of each cluster should extend past the red dotted line as much as possible.
- The width of the plot of each cluster should be as uniform as possible.
```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer

model = KMeans(5, random_state=42)
visualizer = SilhouetteVisualizer(model, colors='yellowbrick')

visualizer.fit(X)   # Fit the data to the visualizer
visualizer.show()   # Finalize and render the figure
```
For scikit-learn's own functions, check this example.
```python
# MEAN Silhouette Coefficient over all samples
from sklearn.metrics import silhouette_score
silhouette_score(X, labels)
```
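```python
# Silhouette Coefficient of EACH SAMPLE
from sklearn.metrics import silhouette_samples
# cluster_labels: label of each sample; n_clusters: number of clusters
scores = silhouette_samples(X, cluster_labels)
for i in range(n_clusters):
    # Scores of the samples that belong to cluster i
    ith_cluster_silhouette_values = scores[cluster_labels == i]
```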
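Putting the two functions together, one possible way to pick the number of clusters is to compare the mean silhouette score across several candidate values and then inspect the per-sample scores for suspicious points. The sketch below uses synthetic data from `make_blobs` with three well-separated centers. The data, the candidate range 2 to 6, and the choice of KMeans are assumptions made only for the demonstration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data: 3 well-separated blobs (demo assumption)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Mean silhouette score for each candidate number of clusters
mean_scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    mean_scores[k] = silhouette_score(X, labels)

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)

# Samples with a negative score are closer to another cluster than to their own
best_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
sample_scores = silhouette_samples(X, best_labels)
print("samples with negative score:", (sample_scores < 0).sum())
```

The highest mean score is only a heuristic. The per-cluster plots described above are still worth checking, since a high average can hide one poorly formed cluster.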