Implement the KNN Algorithm
Implement the K-nearest neighbors (KNN) algorithm. The knn function will have four inputs:
- X_train: a NumPy array of points that are already on the map.
- y_train: a NumPy array of labels for each point on the map, either 1 or 0.
- X_new: a NumPy array for the new point you're adding to the map and want to classify.
- k: an integer that tells you how many of the closest points to X_new you should look at to decide its label.
The K-nearest neighbors (KNN) algorithm classifies a new data point based on the most common category among its k closest neighbors in the dataset. It calculates the distance between points to determine "closeness."
Visit our Model & Algorithm Fundamentals module to master everything you need to know about KNN for interviews.
Python
# Example
X_train = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([2, 2])  # The new point we want to classify
k = 3  # We'll look at the 3 closest points

knn(X_train, y_train, X_new, k)  # should return 0
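Here, the three closest points to [2, 2] are [1, 2], [2, 3], and [3, 4], all labeled 0, so the majority vote returns 0.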
In our implementation of the k-Nearest Neighbors (kNN) algorithm, we adopt a straightforward approach to classify a new data point based on the majority class among its k closest points from a given training dataset. Here's how we go about it:
- Calculate Euclidean distance: For the new data point, we calculate the Euclidean distance to each point in the training dataset. We use the euclidean_distance function for this purpose, which measures the straight-line distance between two points in an n-dimensional feature space.
- Identify nearest neighbors: Next, we rank all the training points by their distance from the new point and select the k nearest ones. This is accomplished by sorting the distances and selecting the indices of the k smallest distances.
- Majority voting for classification: To determine the class of the new point, we look at the labels of these k nearest neighbors and conduct a majority vote. By summing the labels of the nearest neighbors (assuming binary classification with labels 0 and 1) and comparing this sum to k / 2.0, we decide the new point's class. If the sum is greater, indicating that more than half of the neighbors are from class 1, we classify the new point as 1; otherwise, it's classified as 0.
Python
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt((x - y).T.dot(x - y))

def knn(X_train, y_train, X_new, k):
    distances = []
    for X_i in X_train:
        distance = euclidean_distance(X_i, X_new)
        distances.append(distance)
    top_k_indices = np.argsort(distances)[:k]
    top_k_labels = [y_train[idx].item() for idx in top_k_indices]
    # Ensure floating-point division by explicitly converting `k` to a float
    if sum(top_k_labels) > (k / 2.0):
        return 1
    else:
        return 0

In this interview, Angie asks Raj (Snapchat) to "implement the K-nearest neighbor algorithm."
Below is a supplemental written solution that utilizes our interview framework.
Step 1: Understand the problem
Having asked the appropriate clarifying questions, we’ll make the following assumptions:
- We are implementing a binary classifier.
- We will use the Euclidean distance as the metric to compute the distance between an input data point and data points in the training set.
- The input features, X, will be a NumPy array with an arbitrary number of examples N and features D (a small illustration of these shapes follows this list).
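As a quick illustration of these assumptions, the inputs might look like the following (the concrete values and sizes here are made up purely for illustration):

Python
import numpy as np

# N = 4 training examples, D = 2 features each (values chosen arbitrarily)
X_train = np.array([[0.0, 1.0], [1.5, 2.0], [3.0, 0.5], [2.0, 2.0]])  # shape (N, D)
y_train = np.array([0, 0, 1, 1])                                      # shape (N,), binary labels
X_new = np.array([1.0, 1.0])                                          # shape (D,), a single query point
k = 3                                                                  # number of neighbors to consider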
Step 2: Discuss the approach
KNN for classification involves predicting the label of an incoming data point based on the most common label amongst the training data points that are closest to it.
As we discussed previously, we will be using the Euclidean distance to determine “closeness.”
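Concretely, for two D-dimensional points x and y, the Euclidean distance is sqrt((x_1 - y_1)^2 + ... + (x_D - y_D)^2); for example, the distance between [0, 0] and [3, 4] is sqrt(9 + 16) = 5.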
We’ll define two functions: one to calculate the distance and one to run the KNN inference. We can call these “euclidean_distance” and “knn.”
Step 3: Implement the algorithm
Python
def euclidean_distance(x, y):
    return np.sqrt((x - y).T.dot(x - y))

def knn(X_train, y_train, X_new, k):
    distances = []
As clarified before, we will be using the Euclidean distance to determine closeness, and we’ll assume that we have the training input data and labels. In this case, we will also assume that the “knn” function takes in a single data point that represents input features. This could be extended to doing KNN for multiple data points.
We’ll start by implementing the basic logic for KNN, and then implement the logic for Euclidean distance afterwards. The algorithm involves calculating the distances between all the training data points and the incoming data point. For this, we’ll use a list to maintain the distances and populate the list using the distance function and a “for” loop:
Python
distances = []
for X_i in X_train:
    distance = euclidean_distance(X_i, X_new)
    distances.append(distance)
Now, we have to get the indices of the top k minimum distances. We can use NumPy's built-in sorting to do this.
Python
top_k_indices = np.argsort(distances)[:k]
top_k_labels = [y_train[idx].item() for idx in top_k_indices]
The first line gets us the indices of the distances in sorted order. Since np.argsort sorts in ascending order by default, the smallest distances come first, so we can take the k nearest neighbors simply by slicing the first k indices.
The second line will get the actual label at each of those indices.
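To make these two lines concrete, here is a small, self-contained check using made-up distances and labels:

Python
import numpy as np

distances = [2.2, 0.5, 3.1, 0.9]           # hypothetical distances to four training points
y_train = np.array([1, 0, 1, 0])            # hypothetical labels for those points
top_k_indices = np.argsort(distances)[:2]   # indices of the 2 smallest distances
top_k_labels = [y_train[idx].item() for idx in top_k_indices]
print(top_k_indices, top_k_labels)          # [1 3] [0, 0]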
Now that we have the labels of the closest data points, we just need to take the majority label and predict it.
Python
# Ensure floating-point division by explicitly converting `k` to a float
if sum(top_k_labels) > (k / 2.0):
    return 1
else:
    return 0
Since this is binary classification, we know whether to predict 1 or 0 based on the sum of the labels. If it exceeds half of the total, the majority label must be 1; otherwise, it is 0.
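For instance (using made-up labels), if the three nearest neighbors have labels [1, 1, 0], the sum is 2, which is greater than 3 / 2.0 = 1.5, so we predict 1; if the labels are [1, 0, 0], the sum is 1, which is not greater than 1.5, so we predict 0.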
Now that we’ve implemented the logic for inference, let’s implement our helper function for the Euclidean distance.
The Euclidean distance is calculated by taking the squared difference between each corresponding feature of the two data points, summing them up, and then taking the square root of the entire result. This can actually be written more cleanly in NumPy to vectorize that logic:
Python
return np.sqrt((x - y).T.dot(x - y))
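As a quick sanity check (assuming the euclidean_distance helper defined above), the distance between [0, 0] and [1, 1] should come out to sqrt(2):

Python
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([1.0, 1.0])
print(euclidean_distance(x, y))  # ~1.4142, i.e. sqrt(2)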
Step 4: Test code and discuss results
We can test our code by creating some dummy data to make sure that our algorithm correctly finds the K nearest neighbors. We’ll choose training data and labels for which we know which points the query will be closest to.
Python
X_train = np.array([
    [0, 0],
    [0, 1],
    [2, 0],
    [2, 2],
    [2, 1]
])
y_train = np.array([0, 0, 0, 1, 1])
Test case 1: X_new = [0, 0], k = 3
The expected result is 0, because the closest points are [0,0] (label of 0), [0, 1] (label of 0), and [2, 0] (label of 0).
Test case 2: X_new = [2,2], k = 3
The expected result is 1, because the closest points are [2, 2] (label of 1), [2, 1] (label of 1), and [2, 0] (label of 0).
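Putting this together (assuming the euclidean_distance and knn functions defined above), a quick check of both test cases might look like this:

Python
import numpy as np

X_train = np.array([[0, 0], [0, 1], [2, 0], [2, 2], [2, 1]])
y_train = np.array([0, 0, 0, 1, 1])

print(knn(X_train, y_train, np.array([0, 0]), k=3))  # expected 0
print(knn(X_train, y_train, np.array([2, 2]), k=3))  # expected 1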
Here is an even faster, fully vectorized version that uses np.linalg.norm to avoid the explicit loop and np.argpartition to select the lowest k distances. We don't need to sort the whole array; we only need to guarantee that the first k elements are smaller than the rest.
Python
import numpy as np

def knn(X_train, y_train, X_new, k):
    distances = np.linalg.norm(X_train - X_new, axis=1)
    k_indices = np.argpartition(distances, k)[:k]  # O(N) selection instead of O(N log N) sort
    return int(np.sum(y_train[k_indices]) > k / 2.0)
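Note that np.argpartition only guarantees that the k smallest distances occupy the first k positions (not necessarily in sorted order), which is all KNN needs here, and it requires k to be strictly smaller than the number of training points. The expression np.linalg.norm(X_train - X_new, axis=1) relies on NumPy broadcasting to compute all N distances in a single vectorized call.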