Top 5 Machine Learning Interview Problems


Implementing K-Nearest Neighbors (KNN) Classifier from Scratch

Introduction

The K-Nearest Neighbors (KNN) algorithm is one of the simplest machine learning algorithms for classification and regression tasks. It is non-parametric, meaning it makes no assumptions about the underlying distribution of the data. KNN classifies a data point based on the majority class of the K nearest points in the feature space, and it is widely used in applications such as image recognition and recommendation systems.

In this tutorial, we will implement the KNN algorithm from scratch using Python and NumPy. We will build the entire pipeline: preprocessing the data, calculating distances, finding the nearest neighbors, and making predictions. We will also explore how to choose the optimal value of K, handle ties, and evaluate the model.

By the end of this tutorial, you will have a deep understanding of the KNN algorithm and how to implement it effectively in Python.


1. Understanding the K-Nearest Neighbors Algorithm

The KNN algorithm operates on the premise that similar data points tend to be close to each other in the feature space. It is a lazy learner, meaning it does not explicitly learn a model during training but rather stores the training data. During prediction, the KNN algorithm compares the input data point to the stored training data and assigns the class label based on the majority class among its K nearest neighbors.

Key Steps in KNN Classification:

  1. Calculate the Distance: The distance between the test point and all other points in the training set is computed.
  2. Sort the Distances: The distances are sorted, and the K closest neighbors are selected.
  3. Assign a Label: The most common class label among the K neighbors is assigned to the test point.

Distance Metrics:

  • Euclidean Distance is the most common distance metric, defined as:

d(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}

Where:

  • x_1 and x_2 are two points in n-dimensional space, and x_{1i} and x_{2i} are their respective values for the i-th feature.

Other distance metrics include Manhattan Distance, Minkowski Distance, and Cosine Similarity.
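As a quick sanity check of the formula, here is a small NumPy snippet that computes the Euclidean distance between two made-up 2-D points; the point values are arbitrary and chosen only so the result works out to a round number:

import numpy as np

# Two illustrative points in 2-D space (arbitrary values)
x1 = np.array([1.0, 2.0])
x2 = np.array([4.0, 6.0])

# sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5.0
distance = np.sqrt(np.sum((x1 - x2) ** 2))
print(distance)  # 5.0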


2. Implementing the KNN Algorithm from Scratch

2.1 Data Preprocessing

We will start by preparing the dataset. For simplicity, let’s use the Iris dataset, a well-known dataset in machine learning that contains 150 samples of iris flowers, each described by four features (sepal length, sepal width, petal length, and petal width). The task is to classify each sample into one of three classes: Setosa, Versicolor, and Virginica.

We will begin by importing the necessary libraries and loading the dataset.

Code Sample:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalize the feature data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

  • We load the Iris dataset using load_iris() from scikit-learn.
  • The data is split into training and testing sets using train_test_split().
  • We apply standardization to scale the features so that each feature has a mean of 0 and a standard deviation of 1, which is crucial for distance-based algorithms like KNN.

2.2 Calculating Euclidean Distance

The next step is to calculate the distance between a test point and all the points in the training set. We will use Euclidean distance as the metric for this task.

Code Sample:

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

Explanation:

  • This function computes the Euclidean distance between two points, x1 and x2, using the formula mentioned earlier.

2.3 Finding the K Nearest Neighbors

Now, we need to implement a function that finds the K nearest neighbors for a given test point. This function will calculate the distance from the test point to all training points and then return the labels of the closest K neighbors.

Code Sample:

def knn_predict(X_train, y_train, X_test_point, k=3):
    # Calculate distances from the test point to all training points
    distances = [euclidean_distance(X_test_point, train_point) for train_point in X_train]

    # Get indices of the k smallest distances
    k_indices = np.argsort(distances)[:k]

    # Get the labels of the k nearest neighbors
    k_nearest_labels = [y_train[i] for i in k_indices]

    # Return the most common label among the k neighbors
    most_common = np.bincount(k_nearest_labels).argmax()
    return most_common

Explanation:

  • We calculate the distances between the test point (X_test_point) and all training points using euclidean_distance().
  • np.argsort() sorts the distances and returns the indices of the k smallest distances.
  • We then fetch the labels of the K nearest neighbors and use np.bincount() to determine the most common label among them.

2.4 Classifying the Test Set

Next, we use the knn_predict function to classify all the test points in the X_test set. For each test point, we will predict the class label based on the majority class among its K nearest neighbors.

Code Sample:

def knn_classifier(X_train, y_train, X_test, k=3):
    predictions = [knn_predict(X_train, y_train, test_point, k) for test_point in X_test]
    return np.array(predictions)

# Make predictions on the test set
predictions = knn_classifier(X_train, y_train, X_test, k=3)

Explanation:

  • The knn_classifier() function iterates through all test points and uses the knn_predict() function to classify each one.

2.5 Evaluating the Model

After making predictions on the test set, we can evaluate the performance of the KNN classifier using accuracy, which is the proportion of correctly classified instances.

Code Sample:

def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

# Evaluate the accuracy of the KNN classifier
acc = accuracy(y_test, predictions)
print(f"Accuracy: {acc * 100:.2f}%")

Explanation:

  • The accuracy() function calculates the proportion of correctly predicted labels by comparing the true labels (y_test) with the predicted labels (predictions).

3. Choosing the Optimal Value of K

Choosing the right value of K is crucial for the performance of the KNN algorithm. A small value of K can lead to overfitting, while a large value can cause underfitting.

To find the optimal value of K, we can test different values of K and plot the accuracy for each.

Code Sample:

import matplotlib.pyplot as plt

# Try different values of K
k_values = range(1, 21)
accuracies = []

for k in k_values:
    predictions = knn_classifier(X_train, y_train, X_test, k)
    acc = accuracy(y_test, predictions)
    accuracies.append(acc)

# Plot the results
plt.plot(k_values, accuracies)
plt.xlabel('K value')
plt.ylabel('Accuracy')
plt.title('KNN Classifier Accuracy for Different K Values')
plt.show()

Explanation:

  • We try different values of K and compute the accuracy for each.
  • We then plot the accuracy against the K values to visualize how the choice of K affects model performance; a cross-validated variant of this search is sketched below.
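A single train/test split can give a noisy picture of the best K. One possible refinement, not part of the original code above, is to score each K with k-fold cross-validation; the sketch below reuses our knn_classifier() and accuracy() functions together with scikit-learn's KFold, and the choice of 5 folds is arbitrary:

from sklearn.model_selection import KFold

def cross_val_accuracy(X, y, k, n_splits=5):
    # Average accuracy of the from-scratch KNN over n_splits folds
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    fold_scores = []
    for train_idx, val_idx in kf.split(X):
        preds = knn_classifier(X[train_idx], y[train_idx], X[val_idx], k)
        fold_scores.append(accuracy(y[val_idx], preds))
    return np.mean(fold_scores)

# Compare a few candidate K values using only the training data
for k in (1, 3, 5, 7):
    print(k, cross_val_accuracy(X_train, y_train, k))

Because the folds are drawn from the training set only, the test set remains untouched until the final evaluation.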

4. Advanced Considerations

4.1 Distance Metrics

While Euclidean distance is the most common distance metric, you can use other distance metrics depending on the problem:

  • Manhattan Distance: Sum of absolute differences.
  • Cosine Similarity: Measures the angle between two vectors.

You can modify the euclidean_distance() function to compute different distance metrics.
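As a starting point, here are possible drop-in alternatives to euclidean_distance(); the helper names below are our own, not from any library, and the cosine version returns 1 minus the similarity so that smaller values still mean "closer":

def manhattan_distance(x1, x2):
    # Sum of absolute coordinate differences (L1 norm)
    return np.sum(np.abs(x1 - x2))

def minkowski_distance(x1, x2, p=3):
    # Generalizes Manhattan (p=1) and Euclidean (p=2)
    return np.sum(np.abs(x1 - x2) ** p) ** (1 / p)

def cosine_distance(x1, x2):
    # 1 - cosine similarity, i.e. based on the angle between the vectors
    similarity = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    return 1 - similarity

To use one of these, you would also add a distance-function parameter to knn_predict() instead of calling euclidean_distance() directly.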

4.2 Handling Ties

If there is a tie among the nearest neighbors (for example, an equal number of neighbors from two different classes), you can:

  • Choose the class of the single nearest point.
  • Randomly pick one of the tied classes.
  • Use weighted voting, where closer points have more influence (see the sketch after this list).
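As one possible implementation of the weighted-voting idea, the sketch below adapts our knn_predict() so that each neighbor votes with weight 1/distance; this is our own variant rather than part of the original code, and the small eps value is just a guard against division by zero:

def knn_predict_weighted(X_train, y_train, X_test_point, k=3, eps=1e-9):
    distances = np.array([euclidean_distance(X_test_point, p) for p in X_train])
    k_indices = np.argsort(distances)[:k]

    # Each neighbor votes with weight 1 / distance, so closer points count more
    votes = {}
    for i in k_indices:
        label = y_train[i]
        votes[label] = votes.get(label, 0.0) + 1.0 / (distances[i] + eps)

    # Return the label with the largest total vote weight
    return max(votes, key=votes.get)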

4.3 Optimizations

To speed up KNN, especially for large datasets, you can:

  • Use KD-Trees or Ball Trees for efficient nearest neighbor search (a KD-Tree example follows below).
  • Implement approximate nearest neighbor search algorithms.
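For example, scikit-learn provides a KDTree structure that can answer nearest-neighbor queries without scanning every training point. A minimal sketch on the data prepared earlier (using k=3, the same value as in the tutorial):

from sklearn.neighbors import KDTree

# Build the tree once from the training data
tree = KDTree(X_train)

# Query the 3 nearest training points for every test point
dist, ind = tree.query(X_test, k=3)

# Majority vote over the returned neighbor indices
tree_predictions = np.array([np.bincount(y_train[i]).argmax() for i in ind])
print(accuracy(y_test, tree_predictions))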

5. Conclusion

In this tutorial, we implemented the K-Nearest Neighbors (KNN) classifier from scratch using Python and NumPy. We covered the following:

  • Data preprocessing: Scaling the features using StandardScaler.
  • Distance calculation: Using Euclidean distance to compute the proximity between data points.
  • Prediction: Finding the K nearest neighbors and predicting the class label using majority voting.
  • Model evaluation: Calculating accuracy to assess model performance.
  • Choosing the optimal K: Testing different values of K and plotting the results to select the best one.


Now that you have a foundational understanding of KNN, you can easily adapt and implement it for more complex tasks or datasets.


FAQs


1. What is the difference between supervised and unsupervised learning?

Answer: Supervised learning involves training a model on labeled data (input-output pairs), while unsupervised learning involves finding patterns or structures in data without labeled responses.

2. What is the purpose of cross-validation in machine learning?

Answer: Cross-validation is used to assess the model’s performance by training and testing it on different subsets of the data, helping to avoid overfitting and ensuring the model generalizes well to unseen data.

3. How does gradient descent work in machine learning?

Answer: Gradient descent is an optimization algorithm that iteratively adjusts the model’s parameters in the opposite direction of the gradient of the loss function, thereby minimizing the loss.
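As a concrete illustration (independent of the KNN tutorial above), here is a minimal gradient descent loop that fits a one-parameter linear model y ≈ w * x by minimizing mean squared error; the toy data and learning rate are arbitrary choices:

import numpy as np

# Toy data generated from y = 2x, so the optimal w is 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0    # initial parameter
lr = 0.01  # learning rate (arbitrary choice)

for _ in range(500):
    y_pred = w * x
    # Gradient of the mean squared error with respect to w
    grad = np.mean(2 * (y_pred - y) * x)
    # Step in the opposite direction of the gradient
    w -= lr * grad

print(w)  # approaches 2.0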

4. What is the "kernel trick" in SVM?

Answer: The kernel trick is a technique that allows SVMs to efficiently perform non-linear classification by mapping the input data into a higher-dimensional space where a linear hyperplane can be found.
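For instance, scikit-learn's SVC exposes this through its kernel parameter; a minimal sketch on the Iris data prepared earlier (the RBF kernel here is just one common choice):

from sklearn.svm import SVC

# The RBF kernel implicitly maps the features into a higher-dimensional space
svm_clf = SVC(kernel='rbf', gamma='scale')
svm_clf.fit(X_train, y_train)
print(svm_clf.score(X_test, y_test))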

5. How do decision trees handle overfitting?

Answer: Decision trees can overfit if they grow too deep, capturing noise in the data. This can be controlled by limiting the depth of the tree or by pruning the tree after it has been built.

6. What is the main advantage of using a Random Forest over a single Decision Tree?

Answer: A Random Forest aggregates the predictions of multiple decision trees, which reduces variance and overfitting compared to using a single decision tree.

7. What is the intuition behind KNN?

Answer: KNN classifies data points based on the majority class of their K nearest neighbors in the feature space, using a distance metric like Euclidean distance.

8. How do you select the value of K in KNN?

Answer: The value of K is selected through experimentation or by using cross-validation. A small K may lead to overfitting, while a large K may underfit the model.

9. What are the advantages of SVM for classification?

Answer: SVMs are effective in high-dimensional spaces, handle non-linear data well using the kernel trick, and are less prone to overfitting compared to other classifiers like decision trees.

10. What is the difference between classification and regression problems?

Answer: Classification problems involve predicting discrete labels (e.g., classifying images as cats or dogs), while regression problems involve predicting continuous values (e.g., predicting house prices).