Implement K-Means Clustering

Difficulty: Easy

Introduction

KMeans clustering is a fundamental unsupervised learning algorithm that partitions a dataset into K distinct, non-overlapping subsets (clusters). The goal is to group data points so that each point belongs to the cluster whose center (centroid) is closest to it. Similarity is measured with the Euclidean distance between points, so computing that distance is a key part of the algorithm.
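
As a small illustration of the distance measure mentioned above, the following sketch computes the Euclidean distance between two points with NumPy. The function name euclidean_distance is a hypothetical helper chosen for this example, not part of the required interface.

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two points of equal dimensionality."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Square root of the sum of squared coordinate differences.
    return float(np.sqrt(np.sum((x - y) ** 2)))

# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```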

Your task is to implement the KMeans clustering algorithm from scratch in Python. This includes creating classes for both the KMeans algorithm and the centroids involved in the clustering process.

Use NumPy for numerical operations to keep the implementation efficient.

KMeans Class

The KMeans class should encapsulate the entire clustering process.

  • Attributes:
    • n_features: The number of features in the dataset (dimensionality).
    • k: The number of clusters to form.
    • centroids: A list of Centroid objects representing the centers of the clusters.
  • Initialization: The __init__ method should accept n_features and k as parameters, initializing k centroids with random locations in the feature space. This has been implemented already.
  • Methods:
    • distance(self, x, y): Calculate and return the Euclidean distance between two points x and y (float).
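    • fit(self, X, n_iterations): Run the fitting process for n_iterations iterations. In each iteration, assign every data point in X to its nearest centroid, then move each centroid to the mean of the points assigned to it.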
    • predict(self, x): Given a new data point x, predict and return the index of the cluster it belongs to (integer).
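
The class described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the reference solution: the problem does not show the Centroid class or the provided __init__, so the location attribute, the optional seed parameter, the uniform initialization in [0, 1), and the choice to leave empty clusters unchanged are all assumptions made for illustration.

```python
import numpy as np

class Centroid:
    """Hypothetical centroid container; the problem does not specify its fields."""
    def __init__(self, location):
        self.location = np.asarray(location, dtype=float)

class KMeans:
    def __init__(self, n_features, k, seed=None):
        # Assumption: initial locations are drawn uniformly from [0, 1).
        rng = np.random.default_rng(seed)
        self.n_features = n_features
        self.k = k
        self.centroids = [Centroid(rng.random(n_features)) for _ in range(k)]

    def distance(self, x, y):
        """Euclidean distance between points x and y, returned as a float."""
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.linalg.norm(diff))

    def predict(self, x):
        """Index of the centroid nearest to x, returned as an int."""
        distances = [self.distance(x, c.location) for c in self.centroids]
        return int(np.argmin(distances))

    def fit(self, X, n_iterations):
        """Alternate assignment and centroid-update steps n_iterations times."""
        X = np.asarray(X, dtype=float)
        for _ in range(n_iterations):
            # Assignment step: label each point with its nearest centroid.
            labels = np.array([self.predict(x) for x in X])
            # Update step: move each centroid to the mean of its points.
            for i, centroid in enumerate(self.centroids):
                members = X[labels == i]
                if len(members) > 0:  # assumption: empty clusters stay put
                    centroid.location = members.mean(axis=0)

# Usage: two well-separated groups of points.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
model = KMeans(n_features=2, k=2, seed=0)
model.fit(X, n_iterations=10)
print(model.predict([0.5, 0.5]), model.predict([10.5, 10.5]))
```

Because the initial centroids are random, which cluster receives index 0 or 1 can differ between seeds, and K-means in general may settle in a local optimum rather than the best possible grouping.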

Evaluation Criteria

  • Correctness: The implementation should correctly perform clustering on a given dataset.
  • Efficiency: Code efficiency, especially in distance calculations and centroid updates, will be considered.
  • Code Quality: Clarity, readability, and organization of the code, including proper use of classes and methods.
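
On the efficiency criterion, one common approach is to compute all point-to-centroid distances at once with NumPy broadcasting instead of looping over points in Python. The sketch below illustrates that idea; the function names assign_clusters and update_centers are hypothetical, and the centers are held in a plain float array rather than in Centroid objects for brevity.

```python
import numpy as np

def assign_clusters(X, centers):
    """Return the index of the nearest center for every row of X."""
    # X has shape (n_samples, n_features); centers has shape (k, n_features).
    diffs = X[:, np.newaxis, :] - centers[np.newaxis, :, :]  # (n_samples, k, n_features)
    sq_dists = np.sum(diffs ** 2, axis=2)                    # (n_samples, k)
    # Squared distances give the same argmin as true distances,
    # because the square root is monotonic, so it can be skipped.
    return np.argmin(sq_dists, axis=1)

def update_centers(X, labels, centers):
    """Move each center to the mean of its points; empty clusters are left unchanged."""
    new_centers = centers.copy()
    for i in range(len(centers)):
        members = X[labels == i]
        if len(members) > 0:
            new_centers[i] = members.mean(axis=0)
    return new_centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])
labels = assign_clusters(X, centers)
print(labels)  # [0 0 1 1]
centers = update_centers(X, labels, centers)
```

The intermediate array of differences uses memory proportional to n_samples × k × n_features, so for very large datasets the assignment step may need to be processed in batches.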