
K-Means Clustering


K-means clustering is a machine learning algorithm that groups data points into clusters based on their similarities. It works by repeatedly adjusting cluster centers until they stabilize, forming distinct clusters where data points are more alike within a cluster than they are to those in other clusters. K-means is used in many areas, from organizing customer data for targeted marketing to simplifying images for faster processing.

By the end of this lesson, you'll have a better idea of how to answer the following commonly asked questions:

  1. Does K-means clustering always produce the same results?
  2. How can we mitigate the variability in K-means clustering results?

We'll provide the answers at the end of this lesson.

Overview

K-means
  • Supervised/Unsupervised: Unsupervised
  • Input: A dataset where each data point is represented by a set of features or attributes. Features may be continuous or categorical, but categorical features must be encoded numerically before clustering.
  • Output: Cluster assignments in the form of a cluster label for each data point, indicating which cluster it belongs to.
  • Use Cases: Customer Segmentation, Image Compression, Anomaly Detection, Document Clustering

To effectively apply K-means clustering, the data is typically reduced to lower dimensions using dimensionality reduction techniques before the clustering process. This pre-processing step enhances the algorithm's performance, particularly when dealing with high-dimensional datasets, by reducing computational complexity and improving clustering accuracy.

Algorithm

  1. Initialize the parameter $k$ and $k$ cluster seeds $m_1^{(1)}, m_2^{(1)}, \ldots, m_k^{(1)}$
  2. Assignment step: assign each sample to the nearest cluster
  3. Update step: calculate the new mean of each cluster
  4. Repeat steps 2 and 3 until the centroids (or cluster memberships) are stable

If a point is equidistant between two centroids, it can be assigned to either of the clusters depending on the tie-breaking mechanism used. One common approach is to randomly assign the point to one of the clusters.

Equation for the assignment step:

S_i^{(t)} = \{ x_p : \| x_p - m_i^{(t)} \|^2 \le \| x_p - m_j^{(t)} \|^2, \ \forall j,\ 1 \le j \le k \}

Where

  • $S_i^{(t)}$: the set of points belonging to cluster $i$ at step $t$
  • $x_p$: the position of the point of interest
  • $m_i^{(t)}$: the centroid of cluster $i$ at step $t$

This equation assigns each data point $x_p$ to the cluster $i$ whose centroid $m_i^{(t)}$ is closest to it, based on the squared Euclidean distance, among all centroids $m_j^{(t)}$ for $j$ in the range 1 to $k$.
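The assignment step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name `assign_clusters` and the toy arrays are made up for this example. Note that `argmin` resolves the equidistant tie-break mentioned earlier deterministically, by picking the lowest cluster index.

```python
import numpy as np

def assign_clusters(points, centroids):
    # Squared Euclidean distance from every point to every centroid:
    # distances[p, i] = ||x_p - m_i||^2
    distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # argmin breaks ties deterministically by choosing the lowest cluster index
    return distances.argmin(axis=1)

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_clusters(points, centroids)
```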

While the Euclidean distance squared is commonly used in K-means clustering due to its computational efficiency and mathematical properties, other distance metrics can be employed based on the specific requirements of the application or the nature of the data.

For instance, if the data has different scales or contains categorical features, other distance metrics such as Manhattan distance, cosine similarity, or Mahalanobis distance may be more appropriate. The choice of distance metric depends on the characteristics of the data and the objectives of the clustering task.
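If a different metric fits the data better, only the distance computation changes. A hypothetical Manhattan-distance variant of the assignment step might look like the sketch below; note that with L1 distance the update step should strictly use the per-dimension median rather than the mean, which gives the related k-medians algorithm.

```python
import numpy as np

def assign_manhattan(points, centroids):
    # Manhattan (L1) distance: sum of absolute coordinate differences
    distances = np.abs(points[:, None, :] - centroids[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)

points = np.array([[0.0, 1.0], [8.0, 9.0]])
centroids = np.array([[0.0, 0.0], [9.0, 9.0]])
l1_labels = assign_manhattan(points, centroids)
```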

Equation for the update step:

m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j

Where

  • $S_i^{(t)}$: the set of points belonging to cluster $i$ at step $t$
  • $x_j$: sample $j$
  • $m_i^{(t+1)}$: the centroid of cluster $i$ at step $t+1$

This update step recalculates the centroid $m_i^{(t+1)}$ of cluster $i$ for the next iteration $t+1$ by averaging the positions of all data points $x_j$ belonging to cluster $i$ at the current iteration $t$.
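Putting the assignment and update steps together gives the full iteration loop. The following is a bare-bones sketch under our own naming and stopping rule; it does not handle the empty-cluster edge case, which a production implementation must.

```python
import numpy as np

def kmeans_sketch(points, centroids, max_iter=100):
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # cluster membership is stable -> converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points
        # (note: a cluster left with no points would produce NaN here)
        centroids = np.array([points[labels == i].mean(axis=0)
                              for i in range(len(centroids))])
    return centroids, labels

pts = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
cents, labs = kmeans_sketch(pts, np.array([[0.0, 0.0], [1.0, 1.0]]))
```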

Example of the K-means algorithm in action:

The circles depict the data points, each colored according to its assigned cluster, while the triangles represent the centroids of those clusters. In this example, Euclidean distance is used for membership determination, and the algorithm converges after only three assignment and update steps, stabilizing the membership of each cluster.

K-means process

Pseudocode

From the outlined algorithm, we can construct a pseudocode for K-means clustering as follows:

Pseudocode
1. Initialize cluster centroids randomly
2. Repeat until convergence:
   a. Assign each data point to the nearest cluster centroid
   b. Update each cluster centroid based on its assigned data points
3. Convergence check: stop when the centroids have not changed significantly
4. Output final cluster centroids and assignments

Evaluation

Since K-means is an unsupervised learning algorithm, there are no ground-truth labels from which to compute typical evaluation metrics such as accuracy or a confusion matrix. Instead, evaluating K-means means examining the clusters themselves and judging how "well-formed" they are. Two popular evaluation techniques are:

Elbow method. The elbow method is a common technique used to determine the optimal value of $k$ in K-means clustering, i.e. the ideal number of clusters to group the data effectively. It works like this:

  1. First, try clustering the data with different values of $k$, e.g. 1 to 10.
  2. For each value of $k$, calculate the 'cost' or 'distortion', which measures how spread out the points are within their clusters. Lower distortion generally means better-defined clusters.
  3. Now, plot these costs against the corresponding values of $k$: a graph where the x-axis shows the value of $k$ and the y-axis shows the cost.
  4. Here's the key: as $k$ increases, the cost will generally decrease, because with more clusters each point is closer to its cluster center. But at some point, adding more clusters doesn't improve things much; the cost decreases more and more slowly.
  5. The "elbow point" on the graph is where this slowdown happens. It's called the elbow because if you imagine the graph as an arm, this point is where the arm bends. Beyond this point, adding more clusters doesn't significantly decrease the cost.

Elbow method

So, the optimal value of $k$ is often chosen as the value corresponding to this elbow point. This is where you get the most 'bang for your buck' in terms of improving clustering quality.
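The steps above can be sketched with scikit-learn, whose `inertia_` attribute is the distortion (sum of squared distances to the nearest centroid). The synthetic blobs and parameter choices below are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in ([0, 0], [5, 5], [0, 5])])

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(km.inertia_)  # sum of squared distances to nearest centroid

# Plot range(1, 11) against inertias and look for the bend;
# for this data the elbow should appear near k = 3
```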

Silhouette analysis: Silhouette analysis measures how similar each point is to its own cluster (cohesion) compared to other clusters (separation). It yields a silhouette score that ranges from -1 to 1, where a higher silhouette score indicates better-defined clusters.

The silhouette score $s(i)$ for a point $i$ is calculated as follows:

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Where

  • $s(i)$: silhouette value for point $i$
  • $a(i)$: average distance from $i$ to all other points in the same cluster
  • $b(i)$: average distance from $i$ to all points in the nearest cluster that $i$ does not belong to

A higher silhouette score indicates that the data points are well-matched to their own cluster and poorly matched to neighboring clusters, thus implying well-defined clusters.
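A sketch of silhouette analysis using scikit-learn's `silhouette_score`, which averages $s(i)$ over all points; the two synthetic blobs here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.2, size=(40, 2))
                  for c in ([0, 0], [4, 4])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
score = silhouette_score(data, labels)  # mean s(i) over all points
# Tight, well-separated blobs give a score close to 1
```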

Limitations

Variability in results: K-means clustering may yield different outcomes depending on the initial seed used, leading to potential inconsistency in results.

Forced clustering: K-means rigidly enforces the specified number of clusters $k$, partitioning the data into $k$ regions even when fewer clusters would be more appropriate. This makes prior knowledge of a sensible value of $k$ important. In the example below, K-means forcefully splits a clearly two-cluster region into three clusters because $k$ was set to 3.

K-means forced split

When the true number of clusters is unknown, using K-means for clustering may lead to inaccurate results. You may use the elbow method (see the evaluation techniques above) to determine a suitable value of $k$, or consider employing density-based clustering methods such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which infer the number of clusters from the data.
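For comparison, here is a minimal DBSCAN sketch with scikit-learn. Unlike K-means, DBSCAN takes no cluster count as input; it infers the number of clusters from point density. The `eps` and `min_samples` values below are tuned only to this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2))
                  for c in ([0, 0], [3, 3])])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(data)
# DBSCAN labels noise points -1; the remaining labels are the clusters it found
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```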

Shape bias: K-means tends to favor identifying circular clusters, which restricts its ability to accurately detect clusters with non-circular shapes. In the example below, although there are three distinct clusters, K-means misclassifies them because it struggles with non-circular shapes.

K-means shape

Susceptibility to unbalanced datasets: Uneven sample distributions within the dataset can cause K-means to generate inaccurate clusters, particularly if clusters are expected to be of similar sizes.

K-means size

The last two limitations stem from the fact that K-means relies on the convexity assumption and the isotropy assumption.

The convexity assumption means that the clusters we want to identify are convex shapes. A set is convex if, for any two points within the set, the line segment connecting them lies entirely within the set. In the context of K-means, this means that the algorithm works best when the true underlying clusters in the data can be separated by straight lines. If the clusters have complex shapes with curved or irregular boundaries, K-means may not perform well because it tries to assign points to clusters based on straight-line distances to the nearest centroid.

The isotropy assumption means that the clusters are spherical and have similar sizes and densities. Isotropy implies that the distance from the centroid to any point within a cluster is roughly the same in all directions, resembling a sphere in a multidimensional space.
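The effect of violating these assumptions can be demonstrated with scikit-learn's `make_moons`, which generates two interleaved, non-convex clusters. The noise level and seed below are arbitrary; this is a sketch, not a benchmark.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters, so the
# convexity assumption does not hold
X, true_labels = make_moons(n_samples=200, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Best agreement with the true labels under either label permutation;
# the straight K-means boundary cuts through both moons, so this
# stays below 1.0
agreement = max((pred == true_labels).mean(), (pred != true_labels).mean())
```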

Employing K-means clustering

Here are a few prevalent methods and libraries for utilizing K-means clustering in code:

Using scikit-learn:

Python
from sklearn.cluster import KMeans
import numpy as np

# Example data: any (n_samples, n_features) array works here
data = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                 [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Create a KMeans object with the desired number of clusters
kmeans = KMeans(n_clusters=3)

# Fit the KMeans model to the data
kmeans.fit(data)

# Get cluster assignments for each data point
labels = kmeans.labels_

# Get cluster centroids
centroids = kmeans.cluster_centers_

Using SciPy:

Python
from scipy.cluster.vq import kmeans, vq
import numpy as np

# Example data; in general, scipy's kmeans expects features
# whitened to unit variance (see scipy.cluster.vq.whiten)
data = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                 [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
K = 3

# Perform K-means clustering: returns the centroids and the mean distortion
centroids, distortion = kmeans(data, K)

# Get cluster assignments for each data point
labels, _ = vq(data, centroids)

Common questions

Q: Does K-means clustering always produce the same results?

A: No. K-means clustering may not always produce the same results because the algorithm converges to a local optimum rather than the global optimum, and which local optimum it reaches depends on the random initialization of the centroids.

Q: How can we mitigate the variability in K-means clustering results?

A: To mitigate this variability, run the algorithm multiple times with different initializations and select the solution with the lowest distortion (inertia) as the final result; scikit-learn does this automatically via the n_init parameter. Additionally, smarter seeding techniques such as K-means++ initialization improve the consistency of the results by guiding the initial centroid selection process.
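In scikit-learn, both mitigations are built in: `init="k-means++"` selects the seeding strategy and `n_init` sets the number of independent restarts, with the lowest-inertia run kept automatically. The data below is a synthetic example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])

# 10 independent k-means++ seeded runs; scikit-learn returns the run
# with the lowest inertia (distortion)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(data)
best_inertia = km.inertia_
```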