K-Means Clustering
K-means clustering is an unsupervised machine learning algorithm that groups data points into clusters based on their similarity. It works by repeatedly adjusting cluster centers until they stabilize, forming distinct clusters in which data points are more similar to one another than to points in other clusters. K-means is used in many areas, from organizing customer data for targeted marketing to simplifying images for faster processing.
By the end of this lesson, you'll have a better idea of how to answer the following commonly asked questions:
- Does K-means clustering always produce the same results?
- How can we mitigate the variability in K-means clustering results?
We'll provide the answers at the end of this lesson.
Overview
To effectively apply K-means clustering, the data is typically reduced to lower dimensions using dimensionality reduction techniques before the clustering process. This pre-processing step enhances the algorithm's performance, particularly when dealing with high-dimensional datasets, by reducing computational complexity and improving clustering accuracy.
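As a concrete illustration of this workflow, here is a minimal sketch that reduces synthetic high-dimensional data with PCA before clustering with scikit-learn; the dataset, the choice of two components, and the parameter values are illustrative assumptions rather than requirements.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic high-dimensional data: 300 samples with 50 features
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=42)

# Pre-processing step: reduce to 2 dimensions before clustering
X_reduced = PCA(n_components=2).fit_transform(X)

# Cluster in the reduced space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_reduced)
```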
Algorithm
1. Initialize the parameter $K$ (the number of clusters) and the cluster seeds $\mu_1, \mu_2, \ldots, \mu_K$
2. Assignment step: Assign each sample to the nearest cluster
3. Update step: Calculate the new means for all new clusters
4. Repeat steps 2 and 3 until the centroids are stable / the membership of each cluster is stable
If a point is equidistant between two centroids, it can be assigned to either of the clusters depending on the tie-breaking mechanism used. One common approach is to randomly assign the point to one of the clusters.
Equation for the assignment step:

$$S_i^{(t)} = \left\{ x_p : \left\| x_p - \mu_i^{(t)} \right\|^2 \le \left\| x_p - \mu_j^{(t)} \right\|^2 \;\; \forall j,\; 1 \le j \le K \right\}$$

Where
- $S_i^{(t)}$: Set of points belonging to cluster $i$ at step $t$
- $x_p$: Position of the point of interest
- $\mu_i^{(t)}$: Centroid of cluster $i$ at step $t$

This equation assigns each data point to the cluster whose centroid is closest to it, based on the Euclidean distance squared, among all centroids $\mu_j^{(t)}$ for $j$ in the range from 1 to $K$.
While the Euclidean distance squared is commonly used in K-means clustering due to its computational efficiency and mathematical properties, other distance metrics can be employed based on the specific requirements of the application or the nature of the data.
For instance, if the data has different scales or contains categorical features, other distance metrics such as Manhattan distance, cosine similarity, or Mahalanobis distance may be more appropriate. The choice of distance metric depends on the characteristics of the data and the objectives of the clustering task.
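To make the assignment step concrete with the default squared Euclidean distance, here is a small NumPy sketch that computes the distance from every point to every centroid and assigns each point to the nearest one; the point and centroid values are made up for illustration.

```python
import numpy as np

# Illustrative data: 6 points in 2D and 2 centroids (assumed values)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# Squared Euclidean distance from each point to each centroid:
# result has shape (n_points, n_centroids)
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Assignment step: each point joins the cluster with the nearest centroid
assignments = sq_dists.argmin(axis=1)
print(assignments)  # [0 0 0 1 1 1]
```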
Equation for the update step:

$$\mu_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$

Where
- $S_i^{(t)}$: Set of points belonging to cluster $i$ at step $t$
- $x_j$: Sample in $S_i^{(t)}$
- $\mu_i^{(t+1)}$: Centroid of cluster $i$ at step $t+1$

This update step recalculates the centroid of cluster $i$ at the next iteration $t+1$ by averaging the positions of all data points belonging to cluster $i$ at the current iteration $t$.
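Continuing the sketch from the assignment step, the update step averages the points assigned to each cluster to produce the new centroids (the values below are the same illustrative ones as above):

```python
import numpy as np

# Reusing the illustrative points and assignments from the previous sketch
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
assignments = np.array([0, 0, 0, 1, 1, 1])

# Update step: new centroid i is the mean of all points assigned to cluster i
new_centroids = np.array([points[assignments == i].mean(axis=0)
                          for i in range(2)])
print(new_centroids)  # approximately [[0.1, 0.1], [5.0, 5.03]]
```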
Example of the K-means algorithm in action:
The circles depict the data points, each colored according to its assigned cluster, while the triangles represent the centroids of those clusters. In this example, Euclidean distance is used for membership determination, and the algorithm converges after only three assignment and update steps, stabilizing the membership of each cluster.

Pseudocode
From the outlined algorithm, we can construct a pseudocode for K-means clustering as follows:
```
1. Initialize cluster centroids randomly
2. Repeat until convergence:
   a. Assign each data point to the nearest cluster centroid
   b. Update cluster centroids based on assigned data points
3. Check for convergence: stop if centroids have not changed significantly
4. Output final cluster centroids and assignments
```
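Putting the pseudocode together, here is one possible from-scratch implementation in Python; the initialization scheme (sampling $K$ data points), the tolerance, and the iteration cap are assumptions rather than part of the canonical algorithm.

```python
import numpy as np

def kmeans_from_scratch(data, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2a: assign each point to the nearest centroid
        sq_dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 2b: update each centroid to the mean of its assigned points
        new_centroids = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 3: stop if the centroids have not changed significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Step 4: output final centroids and assignments
    return centroids, labels

# Example usage on small illustrative data
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans_from_scratch(data, k=2)
print(centroids)
```

Note that library implementations such as scikit-learn's (shown later in this lesson) add refinements like k-means++ seeding and multiple restarts.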
Evaluation
Since K-means is an unsupervised learning algorithm, there are no labeled datasets with which to calculate typical evaluation metrics like accuracy or a confusion matrix. As such, evaluating the results of K-means requires us to look at the clusters and evaluate how "well-formed" they are. Two popular evaluation techniques are:
Elbow method. The elbow method is a common technique used to determine the optimal value of $K$ in K-means clustering, which represents the ideal number of clusters to effectively group the data. It works like this:
- First, try clustering the data with different values of K, e.g. 1 to 10.
- For each value of K, calculate the 'cost' or 'distortion', which measures how spread out the points are within their clusters. Lower distortion generally means better-defined clusters.
- Now, plot these costs against the corresponding values of K. It's like making a graph where the x-axis shows different values of K, and the y-axis shows the costs.
- Here's the key: As K increases, the cost will generally decrease. That's because with more clusters, each point is closer to its cluster center. But at some point, adding more clusters doesn't improve things much. The improvement starts to slow down, and the cost decreases more slowly.
- The "elbow point" on the graph is where this slowdown happens. It's called the elbow because if you imagine the graph as an arm, this point is where the arm bends like an elbow. Beyond this point, adding more clusters doesn't significantly decrease the cost.

So, the optimal value of K is often chosen as the value corresponding to this elbow point. This is where you get the most 'bang for your buck' in terms of improving clustering quality.
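Here is a minimal sketch of the elbow method using scikit-learn, where the model's inertia_ attribute (the within-cluster sum of squared distances) serves as the cost; the synthetic dataset and the range of K values are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four underlying clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Compute the cost (inertia) for K = 1 to 10
ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
         for k in ks]

# Plot cost against K and look for the bend ("elbow") in the curve
plt.plot(ks, costs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Cost (inertia)")
plt.show()
```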
Silhouette analysis: Silhouette analysis measures how similar each point is to its own cluster (cohesion) compared to other clusters (separation). It yields a silhouette score that ranges from -1 to 1, where a higher silhouette score indicates better-defined clusters.
The silhouette score for a point $p$ is calculated as follows:

$$s(p) = \frac{b(p) - a(p)}{\max\left(a(p),\, b(p)\right)}$$

Where
- $s(p)$: silhouette value for point $p$
- $a(p)$: average distance from $p$ to all other points in the same cluster
- $b(p)$: average distance from $p$ to all points in the closest cluster that $p$ is not a part of
A higher silhouette score indicates that the data points are well-matched to their own cluster and poorly matched to neighboring clusters, thus implying well-defined clusters.
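In practice, libraries compute silhouette scores for you. Here is a minimal sketch using scikit-learn's silhouette_score on synthetic data; the dataset and the range of K values are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Compare the mean silhouette score across candidate values of K
for k in range(2, 7):  # silhouette analysis requires at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette score = {silhouette_score(X, labels):.3f}")
```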
Limitations
Variability in results: K-means clustering may yield different outcomes depending on the initial seed used, leading to potential inconsistency in results.
Forced clustering: It rigidly enforces the specified number of clusters (K), potentially partitioning data into K regions even if fewer clusters would be more appropriate, necessitating prior knowledge of the ideal value for K. In the example below, K-means forcefully splits a clearly two-cluster region into three clusters because the value of K was set to 3.

When the true number of clusters is unknown, using K-means for clustering may lead to inaccurate results. You may use the elbow method (see the evaluation techniques above) to determine the optimal value of $K$, or consider employing density-based clustering methods like DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
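For reference, here is a minimal DBSCAN sketch on crescent-shaped data that K-means would struggle with; the eps and min_samples values are illustrative assumptions and would need tuning for real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that K-means tends to split incorrectly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# DBSCAN infers the number of clusters from density; label -1 marks noise
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # e.g. {0, 1} -- two clusters found without specifying K
```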
Shape bias: K-means tends to favor identifying circular clusters, which restricts its ability to accurately detect clusters with non-circular shapes. In the example below, although there are three distinct clusters, K-means misclassifies them because it struggles with non-circular shapes.

Susceptibility to unbalanced datasets: Uneven sample distributions within the dataset can cause K-means to generate inaccurate clusters, particularly if clusters are expected to be of similar sizes.

The last two limitations arise because K-means relies on the convexity assumption and the isotropy assumption.
The convexity assumption means that the clusters we want to identify are convex shapes. A set is convex if, for any two points within the set, the line segment connecting them lies entirely within the set. In the context of K-means, this means that the algorithm works best when the true underlying clusters in the data can be separated by straight lines. If the clusters have complex shapes with curved or irregular boundaries, K-means may not perform well because it tries to assign points to clusters based on straight-line distances to the nearest centroid.
The isotropy assumption means that the clusters are spherical and have similar sizes and densities. Isotropy implies that the distance from the centroid to any point within a cluster is roughly the same in all directions, resembling a sphere in a multidimensional space.
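One way to see the isotropy assumption fail is to stretch well-separated blobs with a linear transform so the true clusters become elongated. In the sketch below (the data and transform matrix are illustrative), the adjusted Rand index compares K-means labels against the generating clusters, and a score noticeably below 1.0 signals misclassification.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Start from isotropic blobs, then stretch them with a linear transform
# so that the true clusters become elongated (violating isotropy)
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
X_stretched = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_stretched)

# A score noticeably below 1.0 indicates K-means mixed up the true clusters
print(adjusted_rand_score(y_true, labels))
```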
Employing K-means clustering
Here are two popular libraries for applying K-means clustering in code:
Using scikit-learn:
```python
from sklearn.cluster import KMeans
import numpy as np

# Sample data: each row is a data point
data = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                 [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Create a KMeans object with the desired number of clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Fit the KMeans model to the data
kmeans.fit(data)

# Get cluster assignments for each data point
labels = kmeans.labels_

# Get cluster centroids
centroids = kmeans.cluster_centers_
```
Using SciPy:
```python
from scipy.cluster.vq import kmeans, vq
import numpy as np

# Sample data: each row is a data point
data = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                 [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Perform K-means clustering; kmeans returns the centroids and the
# mean distortion (not the labels)
centroids, _ = kmeans(data, 3)

# Get cluster assignments for each data point
labels, _ = vq(data, centroids)
```
Common questions
Q: Does K-means clustering always produce the same results?
A: No, K-means clustering may not always produce the same results due to its tendency to converge to a local optimum rather than the global optimum. This variability arises primarily from the random initialization of the centroids.
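You can observe this variability directly by running scikit-learn's K-means with a single random initialization per run and different seeds; the dataset and parameters below are illustrative, and the reported inertia may differ from run to run.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Overlapping clusters make the optimization landscape bumpier
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=0)

# One random initialization per run, varying only the seed
for seed in range(3):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}: inertia = {km.inertia_:.1f}")  # may differ per run
```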
Q: How can we mitigate the variability in K-means clustering results?
A: To mitigate this variability, the algorithm can be run multiple times with different initializations, and the solution with the lowest distortion (inertia) selected as the final result. Additionally, smarter seeding techniques such as K-means++ initialization can improve the consistency of the results by spreading out the initial centroids.
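As a sketch of this mitigation in scikit-learn (the dataset and parameter values are illustrative): init="k-means++" spreads out the initial centroids, and n_init=10 runs the algorithm ten times, keeping the lowest-inertia solution automatically.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=0)

# k-means++ spreads out the initial centroids; n_init=10 repeats the
# whole algorithm ten times and keeps the lowest-inertia solution
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)
```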