
DBSCAN Clustering Best Practices

source link: https://towardsdatascience.com/dbscan-clustering-best-practices-38de9cf57610?gi=61d196efdc29

A practical guide to DBSCAN clustering, with pseudocode and a discussion of its advantages and disadvantages

May 8 · 6 min read

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a well-known density-based clustering algorithm. Intuitively, DBSCAN finds all the dense regions of the sample points and treats each dense region as a cluster.

The DBSCAN algorithm has the following characteristics:

  • Density-based and robust to noise points that lie far from any dense core
  • No need to know the number of clusters
  • Can create clusters of arbitrary shape

DBSCAN is usually suitable for cluster analysis of lower-dimensional data.

Basic concept

The basic concepts of DBSCAN can be summarized by 1, 2, 3, and 4.

1. Core idea: based on density.

Intuitively, the DBSCAN algorithm can find all the dense regions of the sample points and treat these dense regions as clusters one by one.


2. Algorithm parameters: the neighbourhood radius (eps/ɛ) and the minimum number of points (min_points).

Together, these two parameters define what counts as "dense" within a cluster:

a point's neighbourhood is dense when the number of points within its neighbourhood radius (eps) is at least the minimum number of points (min_points).
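A common heuristic for choosing eps in practice is the k-distance plot: sort every point's distance to its k-th nearest neighbour (with k = min_points) and look for the "elbow". The following is a minimal sketch of that heuristic, not taken from the original article; it generates the same moons data set that is used later on.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors

# Sample data (the same moons data set used later in this article)
X, _ = datasets.make_moons(500, noise=0.1, random_state=1)
min_points = 20   # assumed value, matching min_samples in the example below

# Distance of every point to its min_points-th nearest neighbour, sorted ascending
nn = NearestNeighbors(n_neighbors=min_points).fit(X)
distances, _ = nn.kneighbors(X)          # shape (n_samples, min_points)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel('points sorted by k-distance')
plt.ylabel('distance to %d-th nearest neighbour' % min_points)
plt.title('k-distance plot: the elbow suggests a reasonable eps')
plt.show()

Where the curve bends sharply is a reasonable starting value for eps; points to the right of the elbow are likely noise.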


3. Types of points: core points, boundary points and noise points.

  • A point whose eps-neighbourhood contains at least min_points sample points is called a core point.
  • A point that is not a core point but lies within the eps-neighbourhood of some core point is called a boundary point.
  • Noise points are neither core points nor boundary points (see the sketch after this list).
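To make the three point types concrete, here is a small self-contained sketch (my own illustration, not code from the original article) that labels each sample as core, boundary, or noise for assumed values of eps and min_points:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors

X, _ = datasets.make_moons(500, noise=0.1, random_state=1)
eps, min_points = 0.2, 20   # assumed values, matching the example later in the article

# Indices of all neighbours within eps of each point (the point itself is included)
neighbours = NearestNeighbors(radius=eps).fit(X).radius_neighbors(X, return_distance=False)
is_core = np.array([len(nb) >= min_points for nb in neighbours])

labels = np.full(len(X), 'noise', dtype=object)
labels[is_core] = 'core'
for i in np.where(~is_core)[0]:
    # a non-core point is a boundary point if some core point lies within its eps-neighbourhood
    if is_core[neighbours[i]].any():
        labels[i] = 'boundary'

print({kind: int((labels == kind).sum()) for kind in ('core', 'boundary', 'noise')})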


4. Relationships between points: directly density-reachable, density-reachable, density-connected, and not density-connected.

If P is a core point and Q lies in the eps-neighbourhood of P, then Q is directly density-reachable from P. This relation is not symmetric: Q being directly density-reachable from P does not imply that P is directly density-reachable from Q, because Q may not be a core point.

A point P is density-reachable from a core point S if there is a chain of points from S to P in which each point is directly density-reachable from the previous one. If there is a core point S from which both P and Q are density-reachable, then P and Q are density-connected.

Density-connectedness is symmetric: if P and Q are density-connected, then so are Q and P. Two density-connected points belong to the same cluster.

If two points are not density-connected, they are not placed in the same cluster: they either belong to different clusters, or one of them is a noise point.


DBSCAN algorithm steps


The DBSCAN algorithm proceeds in two steps.

1. Find the core points and form temporary clusters.

Scan all sample points. If a sample point has at least min_points points within its eps-neighbourhood, add it to the list of core points; the points directly density-reachable from it form a corresponding temporary cluster.

2. Combine temporary clusters to obtain clusters.

For each temporary cluster, check whether its points are core points. If a point is a core point, merge the temporary cluster seeded by that point into the current temporary cluster to obtain a new, larger temporary cluster.

Repeat this operation until every point in the current temporary cluster is either not on the core point list, or all of its directly density-reachable points are already in the temporary cluster; the temporary cluster is then promoted to a final cluster.

Continue the same merging procedure on the remaining temporary clusters until all of them have been processed.
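Since the subtitle promises pseudocode, here is a compact, unoptimised Python sketch of these two steps. It is meant only to illustrate the logic (it uses a brute-force O(n²) neighbourhood search) and is not the implementation used by scikit-learn.

import numpy as np

def dbscan_sketch(X, eps, min_points):
    n = len(X)
    labels = np.full(n, -1)    # -1 means unassigned / noise

    # Step 1: find the core points and their eps-neighbourhoods
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.where(dist[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_points for nb in neighbours])

    # Step 2: grow a cluster from each unassigned core point,
    # repeatedly merging the temporary clusters of the core points it reaches
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        seeds = list(neighbours[i])               # current temporary cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:                   # unassigned so far
                labels[j] = cluster_id
                if is_core[j]:                    # core point: fold its neighbourhood in as well
                    seeds.extend(neighbours[j])
        cluster_id += 1
    return labels

Calling dbscan_sketch(X, eps=0.2, min_points=20) on the moons data generated below should produce labels very close to the scikit-learn result.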


DBSCAN usage examples

Generate sample points

import numpy as np
import pandas as pd
from sklearn import datasets
%matplotlib inline

X, _ = datasets.make_moons(500, noise=0.1, random_state=1)
df = pd.DataFrame(X, columns=['feature1', 'feature2'])

# Scatter plot of the raw two-moons data set
df.plot.scatter('feature1', 'feature2', s=100, alpha=0.6,
                title='dataset by make_moon')


Call the DBSCAN interface to complete the clustering

from sklearn.cluster import dbscan

# Run DBSCAN; a cluster_id of -1 marks a noise point
core_samples, cluster_ids = dbscan(X, eps=0.2, min_samples=20)

df = pd.DataFrame(np.c_[X, cluster_ids], columns=['feature1', 'feature2', 'cluster_id'])
df['cluster_id'] = df['cluster_id'].astype('i2')

df.plot.scatter('feature1', 'feature2', s=100,
                c=list(df['cluster_id']), cmap='rainbow', colorbar=False,
                alpha=0.6, title='DBSCAN cluster result')
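The lowercase dbscan function used above works, but the estimator-style sklearn.cluster.DBSCAN class is the more common interface; a roughly equivalent call, using the same parameters, would look like this sketch:

from sklearn.cluster import DBSCAN

# Same clustering with the estimator API; labels of -1 mark noise points
model = DBSCAN(eps=0.2, min_samples=20)
cluster_ids = model.fit_predict(X)
core_indices = model.core_sample_indices_   # indices of the core points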


Does the DBSCAN algorithm generate overlapping clusters?

No. DBSCAN produces mutually exclusive clusters. If the minimum cluster size is 1, these clusters are also exhaustive: every point is assigned to some cluster.

To generate overlapping (hierarchical) clusters, a variant of DBSCAN called HDBSCAN can be used.
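As a minimal sketch, assuming the hdbscan package is installed (recent versions of scikit-learn also ship an sklearn.cluster.HDBSCAN estimator), clustering the same moons data X would look roughly like this:

import hdbscan

# HDBSCAN builds a hierarchy of density-based clusters and extracts a flat labelling;
# -1 still marks noise points, and eps does not need to be chosen by hand
clusterer = hdbscan.HDBSCAN(min_cluster_size=20)
labels = clusterer.fit_predict(X)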

Summary of DBSCAN

Compared with the traditional K-Means algorithm, the biggest difference of DBSCAN is that the number of clusters k does not need to be specified in advance. Its biggest advantage is that it can find clusters of arbitrary shape, whereas K-Means is generally only suitable for convex sample sets. At the same time, DBSCAN can detect outliers while clustering, which is similar to the BIRCH algorithm.

Advantages

  • It is possible to cluster dense data sets of any shape. In contrast, clustering algorithms such as K-Means are generally only applicable to convex data sets.


We can clearly see the difference between DBSCAN and K-Means: K-Means tries to create spherical clusters (a sketch reproducing this comparison follows the list).
  • Anomalies can be found while clustering, and the algorithm is not sensitive to anomalies in the data set.
  • The clustering results are not biased by initialization. In contrast, the initial values of clustering algorithms such as K-Means have a great influence on the clustering results.
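Here is a small self-contained sketch (my own illustration, not from the original article) that reproduces this comparison on the moons data, fitting both K-Means and DBSCAN and plotting the two labelings side by side:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans, DBSCAN

X, _ = datasets.make_moons(500, noise=0.1, random_state=1)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=20).fit_predict(X)

# K-Means splits each moon roughly in half; DBSCAN recovers the two moon shapes
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=km_labels, cmap='rainbow', alpha=0.6)
axes[0].set_title('K-Means (k = 2)')
axes[1].scatter(X[:, 0], X[:, 1], c=db_labels, cmap='rainbow', alpha=0.6)
axes[1].set_title('DBSCAN (eps = 0.2, min_samples = 20)')
plt.show()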

Disadvantages

  • If the density of the sample set is not uniform and the spacing between clusters varies greatly, the clustering quality is poor; in such cases DBSCAN is generally not suitable.
  • If the sample set is large, clustering takes a long time to converge; this can be mitigated by tuning the size of the KD-tree or ball tree built for the nearest-neighbour search (see the sketch below).
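In scikit-learn, that tuning corresponds to the algorithm and leaf_size parameters of DBSCAN; a minimal sketch, assuming X is the data set from the example above:

from sklearn.cluster import DBSCAN

# Choose the neighbour-search backend explicitly and adjust its leaf size for large data sets
model = DBSCAN(eps=0.2, min_samples=20,
               algorithm='ball_tree',   # or 'kd_tree'; the default 'auto' picks one heuristically
               leaf_size=40)            # larger leaves trade tree depth for more work per leaf
labels = model.fit_predict(X)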

Conclusion

So when do we need to use DBSCAN to cluster?

Generally speaking, if the data set is dense and not convex, DBSCAN will perform much better than K-Means clustering.

If the data set is not dense, DBSCAN is not recommended for clustering.

With that, we have covered our third clustering algorithm, DBSCAN. For the related code, visit my GitHub repository.

Stay Tuned!

