
Density-Based Spatial Clustering (DBSCAN)


DBSCAN, which stands for density-based spatial clustering of applications with noise, is a popular clustering algorithm in machine learning and data mining. It is particularly well-suited for discovering clusters of varying shapes and sizes in data that contains noise and outliers. Unlike partitioning algorithms like K-means, DBSCAN does not require specifying the number of clusters in advance. Instead, it relies on two parameters: epsilon (ϵ), which defines the radius of the neighborhood around each point, and the minimum number of points (minPts) required to form a dense region or cluster.

DBSCAN offers several advantages over traditional clustering methods:

  • No need for pre-defined clusters: Unlike K-means, DBSCAN does not require specifying the number of clusters beforehand.
  • Robust to noise: DBSCAN can identify and handle noise points (outliers), making it robust for real-world data that often contains irregularities.
  • Clusters of arbitrary shape: The algorithm can find clusters of varying shapes and sizes, as long as they are dense enough. This flexibility makes it applicable to a wide range of clustering tasks.
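As a quick illustration of these advantages, the following sketch runs scikit-learn's DBSCAN on a two-moons toy dataset; the dataset and the eps/min_samples values are illustrative assumptions, not prescriptions. DBSCAN discovers the two non-convex clusters without being told how many to find:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters that K-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps and min_samples chosen for this toy data (illustrative assumptions).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Noise points are labeled -1; cluster labels start at 0.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # expected: 2 (the two half-moons), with no k specified
```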

Overview

DBSCAN

  • Supervised/Unsupervised: Unsupervised
  • Input: A dataset where each data point is represented by a set of features or attributes. Features must be numerical; categorical features need to be encoded numerically first.
  • Output: Cluster assignments in the form of a cluster label for each data point, indicating which cluster it belongs to (or that it is noise).
  • Use Cases: Customer Segmentation, Image Compression, Anomaly Detection, Document Clustering

Algorithm

The DBSCAN algorithm relies on two key parameters.

Epsilon (ϵ): This parameter defines the radius of the neighborhood around each point. In other words, it determines the maximum distance between two points for them to be considered neighbors. A smaller ϵ will result in smaller, more tightly packed clusters, while a larger ϵ may merge clusters that are close to each other.

MinPts: This parameter specifies the minimum number of points required to form a dense region (i.e., a cluster). A point is considered a core point if it has at least minPts points (including itself) within its ϵ-radius neighborhood. A higher minPts value will lead to fewer, larger clusters, while a lower value may result in more clusters, including smaller ones.

To estimate the best values for the two parameters in DBSCAN:

  1. MinPts: A common rule of thumb suggests setting minPts to twice the number of dimensions (minPts = 2·D), providing a starting point for exploration. However, larger values might be necessary for datasets with high noise levels or a large number of data points. It's important to note that minPts should be set to at least 3, as having minPts = 1 would designate every point as a core point, essentially undermining the clustering process. Similarly, minPts = 2 would result in a form of clustering that's akin to single-link clustering, which may not be appropriate for many datasets.
  2. Epsilon (ϵ): Choose ϵ using a k-distance graph, plotting each point's distance to its k = minPts − 1 nearest neighbor, sorted in descending order. Good values for ϵ are where the plot shows an "elbow," indicating a significant change in distance. Small ϵ values are generally preferred (but not so small that many points are left unclustered), ensuring only a small fraction of points are within this distance of each other.
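The k-distance heuristic above can be sketched with scikit-learn's NearestNeighbors; the blob dataset and the minPts value are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Toy 2-D dataset (illustrative assumption).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

min_pts = 4          # e.g. 2 * D for D = 2 dimensions
k = min_pts - 1      # plot the distance to the (minPts - 1)-th neighbor

# n_neighbors = k + 1 because each point's nearest neighbor is itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th nearest neighbor, sorted descending for the elbow plot.
k_distances = np.sort(distances[:, k])[::-1]

# Plot k_distances (e.g. with matplotlib) and read epsilon off the "elbow".
```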

The DBSCAN algorithm classifies all points into three categories:

  1. Core point: A point with at least minPts points (including itself) within its ϵ-radius neighborhood.
  2. Border point: A point that has fewer than minPts points within its own ϵ-radius neighborhood but is reachable from (i.e., lies within the neighborhood of) a core point.
  3. Outlier (noise): A point that is neither a core point nor a border point.

DBSCAN types of points

In the example above, there's a representation of each type of point, considering a minimum number of neighboring points (minPts = 3). The dotted circle around each point illustrates the radius (ϵ) within which neighboring points are considered.
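With scikit-learn's DBSCAN, the three categories can be recovered from a fitted model via its `core_sample_indices_` and `labels_` attributes; the dataset and parameter values below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Toy dataset and parameters (illustrative assumptions).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Core points: indices listed in core_sample_indices_.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

noise_mask = db.labels_ == -1            # outliers are labeled -1
border_mask = ~core_mask & ~noise_mask   # clustered but not core

print(core_mask.sum(), border_mask.sum(), noise_mask.sum())
```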

The DBSCAN algorithm operates by examining the local density of data points to form clusters. The process can be summarized in the following steps:

  1. Select a point: Start with an arbitrary point in the dataset that has not been visited.
  2. Identify neighborhood: Find all points within the ϵ-radius neighborhood of the selected point.
  3. Classify point:
    • If the selected point has at least minPts points within its neighborhood, it is classified as a core point and forms the basis of a new cluster.
    • If the selected point has fewer than minPts points within its neighborhood, it is temporarily classified as a border point and may later be classified as noise if it does not fall within the neighborhood of any core point.
  4. Expand cluster:
    • For a core point, consider all points in its neighborhood. These points are added to the cluster.
    • For each point in the cluster, repeat steps 2 and 3 to find their neighborhoods and add any new core points to the cluster.
    • Continue expanding the cluster by recursively applying the neighborhood density check to each core point until no new points can be added.
  5. Mark as Noise: Any point that is not reachable from any core point and is not part of any cluster is classified as noise.
  6. Repeat for All Points: Move to the next unvisited point in the dataset and repeat steps 1 to 5 until all points have been visited and classified.

For a closer look at how DBSCAN works, interactive visualization tools that let you vary ϵ and minPts on sample datasets are worth exploring.

Pseudocode

DBSCAN(Dataset, epsilon, minPts):
    Cluster_ID = 0
    for each point P in Dataset:
        if P is visited:
            continue
        mark P as visited
        NeighborPts = all points within epsilon distance of P
        if size(NeighborPts) < minPts:
            mark P as Noise
        else:
            Cluster_ID = Cluster_ID + 1
            add P to cluster Cluster_ID
            while NeighborPts is not empty:
                Q = first point in NeighborPts
                if Q is not visited:
                    mark Q as visited
                    NeighborPts_Q = all points within epsilon distance of Q
                    if size(NeighborPts_Q) >= minPts:
                        NeighborPts = NeighborPts union NeighborPts_Q
                if Q is not yet a member of any cluster:
                    add Q to cluster Cluster_ID
                remove Q from NeighborPts
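As a sketch, the pseudocode can be translated into a straightforward brute-force Python implementation (O(n²) neighborhood queries, fine for small datasets but not optimized):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Brute-force DBSCAN. Returns a label per point: -1 for noise, >= 0 for clusters."""
    n = len(X)
    labels = [None] * n       # None = unassigned, -1 = noise, >= 0 = cluster id
    visited = [False] * n

    def region_query(i):
        # All indices within eps of point i (includes i itself).
        d = np.linalg.norm(X - X[i], axis=1)
        return list(np.where(d <= eps)[0])

    cluster_id = -1
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = -1    # provisionally noise; may later become a border point
        else:
            cluster_id += 1
            labels[p] = cluster_id
            seeds = list(neighbors)
            while seeds:      # expand the cluster
                q = seeds.pop(0)
                if not visited[q]:
                    visited[q] = True
                    q_neighbors = region_query(q)
                    if len(q_neighbors) >= min_pts:   # q is a core point
                        seeds.extend(q_neighbors)
                if labels[q] is None or labels[q] == -1:
                    labels[q] = cluster_id
    return labels
```

On a toy set of two tight groups plus one isolated point, this yields two clusters and one noise label.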

Evaluation

Since DBSCAN is an unsupervised learning algorithm, there are no ground-truth labels with which to compute typical evaluation metrics like accuracy or a confusion matrix. As such, evaluating the results of DBSCAN requires us to look at the clusters and assess how "well-formed" they are.

Silhouette analysis: Silhouette analysis measures how similar each point is to its own cluster (cohesion) compared to other clusters (separation). It yields a silhouette score that ranges from -1 to 1, where a higher silhouette score indicates better-defined clusters.

The silhouette score s(i) for a point i is calculated as follows:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

Where

  • s(i): silhouette value for point i
  • a(i): average distance from i to all other points in the same cluster
  • b(i): average distance from i to all points in the closest cluster that i is not a part of

A higher silhouette score indicates that the data points are well-matched to their own cluster and poorly matched to neighboring clusters, thus implying well-defined clusters.
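In practice, silhouette analysis of a DBSCAN result can be sketched with scikit-learn's silhouette_score. Noise points (label -1) are excluded first, since the silhouette is only defined for clustered points; the dataset and parameter values are illustrative assumptions:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs (illustrative assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.4,
    random_state=42,
)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1  # silhouette is undefined for noise points
score = silhouette_score(X[mask], labels[mask])
print(round(score, 3))  # closer to 1 indicates better-defined clusters
```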

Limitations

  1. DBSCAN struggles with clusters of significantly different densities, because a single choice of the two parameters, ϵ and minPts, cannot accommodate all clusters at once.
  2. DBSCAN's effectiveness relies heavily on the chosen distance metric, commonly Euclidean distance. However, in high-dimensional data, the "Curse of Dimensionality" can make Euclidean distance less reliable, complicating the selection of an appropriate ϵ value. This challenge is not unique to DBSCAN but affects any algorithm utilizing Euclidean distance.
  3. DBSCAN is not deterministic regarding border points, as their assignment to clusters can vary based on the sequence of data processing. However, this typically has minimal impact on the clustering result, as both core points and border points are deterministic.
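The first limitation can be seen on a toy dataset with two clusters of very different densities (everything below is an illustrative assumption): an ϵ small enough to resolve the dense cluster leaves most of the sparse cluster labeled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal([0, 0], 0.2, size=(100, 2))    # tight cluster
sparse = rng.normal([10, 0], 2.0, size=(100, 2))  # diffuse cluster
X = np.vstack([dense, sparse])

# eps tuned to the dense cluster's scale (illustrative assumption).
labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)

noise_frac_dense = np.mean(labels[:100] == -1)
noise_frac_sparse = np.mean(labels[100:] == -1)
print(noise_frac_dense, noise_frac_sparse)  # sparse cluster is mostly lost to noise
```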

The curse of dimensionality refers to the problems that arise when working with data in high-dimensional spaces, that is, data described by many features. When the amount of data is small relative to the number of features, reliable analysis becomes difficult.

Here's why: as you add more dimensions, the volume of the space where your data lives grows exponentially, so the data becomes sparsely spread out. The amount of data needed to get reliable results therefore grows very quickly with the number of dimensions.

Moreover, organizing or searching data often relies on finding groups of similar items. In high-dimensional spaces, distances between points become increasingly alike, so everything looks roughly equally far apart, making common strategies for grouping and searching far less effective.

Employing DBSCAN

We suggest utilizing DBSCAN from scikit-learn since it's widely recognized as a prominent machine learning library in Python.

Python
from sklearn.cluster import DBSCAN

# X is your data: an array of shape (n_samples, n_features)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
labels = dbscan.labels_  # one cluster label per point; -1 marks noise