Are you solving ML Clustering problems using K-Means?

One-liner to plot Elbow curve, Silhouette curve, Intercluster distances, and learn about Scikit-Learn tips that can improve your model.

Source: Image by Author

K-Means is the most used clustering algorithm in unsupervised Machine Learning problems, and it is really useful for finding similar data points and determining the structure of the data. In this article, I assume that you have a basic understanding of K-Means and will focus more on how you can-

  • Find the value of K (the number of clusters) using different methods like the Elbow curve, Silhouette curve, and intercluster distances.
  • Use YellowBrick, a great visualization library that can plot these curves with just 1 line of code.
  • Apply different Scikit-Learn tips to improve your K-Means model.

If you are a newbie, there are many great articles on the internet that can help you understand K-Means. I would recommend going through Imad Dabbura’s “K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks” and Azika Amelia’s “K-Means Clustering: From A to Z”, which cover most of it. You can refer to the Scikit-Learn documentation for clustering as well, which lays it out pretty nicely.

Let’s start and see how it can be done.

Finding K in K-Means

To use K-Means, we need to specify the value of K, that is, the number of clusters we want to group our data into. Most of the time, we are not aware of the number of groups present in our data, and it becomes challenging to find the optimal value for K. Thankfully, there are different ways that can help us find the right value.

I will go through 4 different ways here.

  1. Elbow Curve.
  2. Silhouette Curve.
  3. Intercluster Distance Maps.
  4. Using other clustering algorithms.

Dataset Details

To best demonstrate, I will create a dataset using make_blobs API from Scikit-Learn which is used to create multiclass datasets by allocating each class to one or more normally-distributed clusters of points.

Have a look at the notebook I created, which has more details; feel free to download it, import it into your environment, and play around-

Here, we created a dataset with 10 centers using make_blobs.

from sklearn.datasets import make_blobs

# Generate synthetic dataset with 10 random clusters in 2 dimensional space
X, y = make_blobs(n_samples=1000, n_features=2, centers=10, random_state=42)

Although we created 10 random clusters, the plot below shows there is an overlap between some of them, and we will see how the Elbow method can tell us the number of clusters for which we get the maximum gain.

Source: Image by Author
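For reference, a plot like the one above can be produced with a couple of lines of matplotlib; this sketch is mine, not from the original notebook.

import matplotlib.pyplot as plt

# Scatter plot of the two generated features, coloured by the true blob label
plt.scatter(X[:, 0], X[:, 1], c=y, s=10, cmap='tab10')
plt.xlabel('feature1')
plt.ylabel('feature2')
plt.show()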

Elbow Curve

As stated in Wikipedia-

The elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.

The intuition behind the Elbow curve is that the explained variation changes rapidly up to the number of groups actually present in the data and then slows down, leading to an elbow shape in the graph, as shown below. The elbow point is the number of clusters you should use for your K-Means algorithm.

Recently I discovered a library named Yellowbrick which can help us plot the Elbow curve with just 1 line of code. It is a wrapper around Scikit-Learn and hence integrates well with it.

# Import KMeans and the KElbowVisualizer
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
# k is the range of numbers of clusters to try.
visualizer = KElbowVisualizer(model, k=(4,12), timings=False)
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

The above code will generate this nice graph with all the details. By default, it uses the Distortion Score as the metric, which computes the sum of squared distances from each point to its assigned center. You can try the other metrics mentioned here as well.

Source: Image by Author

Some clustering problems might not result in an elbow formation and can instead produce a continuously decreasing curve, which makes it difficult to select the value of K. In that case, we can use the other methods described in the next sub-sections.
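If you prefer not to pull in Yellowbrick, here is a minimal sketch of what the visualizer automates: fitting K-Means over a range of K values and plotting the inertia (the sum of squared distances to the closest center, closely related to the distortion score) yourself.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(4, 12)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(km.inertia_)   # sum of squared distances to the closest center

plt.plot(ks, inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()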

Silhouette Curve

Generally, we don’t have ground truth (labels) in clustering problems, so evaluation needs to be done using the model itself. The silhouette coefficient measures the density of the clusters by generating a score for each sample based on the difference between the average intra-cluster distance and the mean nearest-cluster distance for that sample, normalized by the maximum of the two. We can find the optimal value of K by generating plots for different values of K and selecting the one with the best score, depending on the cluster assignments.

Below, I plotted Silhouette plots for K = 6, 7, 8, and 9, and you can see that we get the highest score for K = 7, the same value the Elbow method gave us. This also helps us identify class imbalance, as the width of each cluster’s silhouette shows the number of samples in that cluster; if you look at the graph for K = 9, we have many small clusters.

Source: Image By Author

I used Yellowbrick to generate the above plot. The code below will generate a single Silhouette graph for K = 7; you can refer to my notebook to see how to loop through it and generate multiple plots.

from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer

model = KMeans(7, random_state=42)
visualizer = SilhouetteVisualizer(model, colors='yellowbrick')
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
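If you only need the numbers rather than the full plots, a minimal sketch that loops over a few candidate values of K and prints the average silhouette score (the range shown is just illustrative) could look like this:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(6, 10):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    # Mean silhouette coefficient over all samples; higher is better
    print(k, silhouette_score(X, labels))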

Intercluster Distance Maps

Although this might not directly help to find the number of clusters, it helps in evaluating the K-Means algorithm as it gives a sense of the relative importance of the clusters. By default, Yellowbrick uses MDS (multidimensional scaling) as the embedding algorithm to embed the clusters into 2-dimensional space. You can read more about it here.

from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance
model = KMeans(7)
visualizer = InterclusterDistance(model, random_state=0)
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

The above code generates the plot below-

Source: Image by Author

Use Other Algorithms

Another thing we can try is to use other clustering algorithms like Affinity Propagation, which doesn’t require you to supply the value of K and instead makes it part of the learning process. Such algorithms might not work for large datasets, so in some cases we need to try them on a subset of the data and then use the resulting value in K-Means. The code below found 10 clusters, which matches the number of centers we used.

from sklearn.cluster import AffinityPropagation
# Creating an Affinity Propagation model instance and fitting it on the data
affinity_propagation = AffinityPropagation(random_state=None, damping=0.90)
affinity_propagation.fit(X)
# Number of clusters found by Affinity Propagation
len(affinity_propagation.cluster_centers_indices_)

Final Results

Now that we have evaluated with different methods, the optimal value we got for K is 7. Let’s apply the K-Means algorithm with K = 7 and see how it clusters our data points.

import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
data = pd.DataFrame(X, columns=['feature1', 'feature2'])  # features generated by make_blobs
model = KMeans(n_clusters=7)
model.fit(X)                          # fit X
data['y_pred'] = model.predict(X)     # predict labels
sns.scatterplot(x='feature1', y='feature2', hue='y_pred', data=data)   # plot results
Source: Image by Author

Scikit-Learn Tips

Scikit-learn provides different configurations for K-Means which we can utilize. You can find the complete list here. I will go through a few of them which can help in improving our model.

init=’k-means++’

The K-Means algorithm depends largely on how you initialize the centroids (the centers of the clusters). Scikit-Learn provides different values of init, but in general k-means++ stands out from the others as it tries to initialize the centroids to be (generally) distant from each other, leading to provably better results.
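k-means++ is already the default in Scikit-Learn, but you can set it explicitly and compare it against purely random seeding; the snippet below is just an illustration of the parameter, not something from the original notebook.

from sklearn.cluster import KMeans

# k-means++ seeding: spreads the initial centroids out before the first iteration
model_pp = KMeans(n_clusters=7, init='k-means++', n_init=10, random_state=42).fit(X)

# Purely random seeding; usually needs more restarts to reach a similar inertia
model_rand = KMeans(n_clusters=7, init='random', n_init=10, random_state=42).fit(X)

print(model_pp.inertia_, model_rand.inertia_)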

Use Dimensionality Reduction Algorithms

K-Means uses Euclidean distance to calculate the distance between points. It is observed that in very high-dimensional spaces, Euclidean distances tend to blow up and don’t work well. So, if you have a large number of features, using a dimensionality reduction algorithm like PCA before K-Means can overcome this issue and speed up the computation.
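A minimal sketch of chaining PCA and K-Means with a Scikit-Learn pipeline; the number of components here is only an example and should be chosen based on the explained variance of your data.

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Reduce the feature space first, then cluster in the reduced space
pipeline = make_pipeline(PCA(n_components=2), KMeans(n_clusters=7, random_state=42))
labels = pipeline.fit_predict(X)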

Mini Batch K-Means

For large datasets, K-Means can take a lot of time to converge. As stated in Scikit-Learn documentation-

The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

So, if you have a large dataset, do check out MiniBatchKMeans.
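Swapping it in is essentially a one-line change; the batch size below is just an illustrative value.

from sklearn.cluster import MiniBatchKMeans

# Uses random mini-batches instead of the full dataset in each iteration
mbk = MiniBatchKMeans(n_clusters=7, batch_size=256, random_state=42)
labels = mbk.fit_predict(X)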

Clustering Metrics

Most people who start Machine Learning know about metrics like accuracy, precision, and recall, and we tend to think that without labels there is no way to measure the results. Scikit-Learn provides different metrics for clustering like homogeneity, completeness, v-measure, and the silhouette coefficient, which we discussed here. These metrics can help us evaluate the model. While some of them follow a semi-supervised approach and require you to have target labels for at least a few data points, others work without any labels. Do check out all the metrics available in Scikit-Learn for clustering here.
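As an example, the first three metrics below need the ground-truth labels (here the y returned by make_blobs), while the silhouette coefficient works from the cluster assignments alone.

from sklearn import metrics
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=7, random_state=42).fit_predict(X)

# These require true labels
print(metrics.homogeneity_score(y, labels))
print(metrics.completeness_score(y, labels))
print(metrics.v_measure_score(y, labels))

# This one needs only the data and the predicted clusters
print(metrics.silhouette_score(X, labels))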

Scale the dataset

You must scale the dataset before applying K-Means, as features on larger scales would otherwise dominate the distance calculations.
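A minimal sketch with StandardScaler; our synthetic blobs are already on similar scales, so this matters more for real datasets.

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
labels = KMeans(n_clusters=7, random_state=42).fit_predict(X_scaled)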

Uneven Shapes (limitation)

Due to the way K-Means works, it is normally suitable only for clusters with even, roughly spherical shapes. If you know your data follows uneven shapes, it is better to use another clustering algorithm, like DBSCAN, that works for unevenly shaped clusters.
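A minimal sketch of DBSCAN; eps and min_samples are illustrative values and usually need tuning for your data.

from sklearn.cluster import DBSCAN

# No value of K needed; points in low-density regions are labelled -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)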

