K-Means Clustering Explained: A Practical Guide with Real-World Example


📗 Chapter 2: How K-Means Works – Step-by-Step Breakdown

🎯 Objective

This chapter breaks down the inner mechanics of the K-Means algorithm. You’ll learn what happens at each iteration — from initializing centroids to forming final clusters — and how convergence is achieved through mathematical optimization.


🧩 K-Means: The Core Idea

K-Means aims to partition data into K clusters by minimizing the squared distance between each point and its cluster centroid. The process is iterative and involves:

  • Initializing centroids
  • Assigning points to the nearest centroid
  • Recalculating centroids
  • Repeating until stabilization

This method is known as Lloyd’s algorithm.
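Before unpacking each step, here is a minimal sketch of the whole loop using scikit-learn's KMeans on made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D points forming two loose groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.2, 2.3],
              [6.0, 5.5], [5.8, 6.2], [6.3, 5.9]])

# n_clusters is K; fit() runs Lloyd's algorithm to convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)           # cluster index for each point
print(kmeans.cluster_centers_)  # final centroids
```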


⚙️ Step-by-Step Workflow

Let’s walk through what K-Means actually does under the hood, step by step.


🔹 Step 1: Choose the Number of Clusters (K)

Before starting, you must define K, the number of clusters. This can be based on:

  • Domain knowledge
  • Trial and error
  • Techniques like the Elbow Method or Silhouette Score
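As a sketch of how those techniques look in code, the snippet below computes WCSS (scikit-learn's inertia_) and the silhouette score for several candidate values of K on an illustrative dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative dataset with 3 underlying blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss = model.inertia_                     # within-cluster sum of squares
    sil = silhouette_score(X, model.labels_)  # higher is better
    print(f"K={k}: WCSS={wcss:.1f}, silhouette={sil:.3f}")
```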

🔹 Step 2: Initialize Centroids Randomly

  • K points are chosen randomly from the dataset as initial centroids
  • Alternatively, use K-Means++ to spread them out better
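In scikit-learn this is controlled by the init parameter; both values below are part of the library's public API:

```python
from sklearn.cluster import KMeans

# Plain random initialization: K points sampled from the data
km_random = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)

# K-Means++ (the library default): spreads initial centroids apart
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
```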

🔹 Step 3: Assign Each Point to the Nearest Centroid

  • Calculate the Euclidean distance from each data point to each centroid
  • Assign each data point to the cluster with the closest centroid

Mathematically:

$$c_i = \arg\min_{j \in \{1,\dots,K\}} \lVert x_i - \mu_j \rVert^2$$

where $x_i$ is a data point, $\mu_j$ is the centroid of cluster $j$, and $c_i$ is the cluster assigned to $x_i$.
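Here's a small NumPy sketch of this assignment step (the array values are illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [6.0, 5.5]])  # data points
centroids = np.array([[1.0, 2.0], [6.0, 6.0]])       # current centroids

# Squared Euclidean distance from every point to every centroid
# via broadcasting: result has shape (n_points, K)
dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

labels = dists.argmin(axis=1)  # index of the nearest centroid per point
print(labels)                  # [0 0 1]
```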


🔹 Step 4: Update Centroids

  • For each cluster, recalculate the centroid as the mean of all points assigned to it

$$\mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$$

where $S_j$ is the set of points currently assigned to cluster $j$.
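And a matching sketch of the update step (again with illustrative values; it assumes no cluster is empty):

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [6.0, 5.5]])  # data points
labels = np.array([0, 0, 1])                         # assignments from Step 3
K = 2

# Each new centroid is the mean of the points assigned to its cluster
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])
print(new_centroids)  # [[1.25 1.9 ] [6.   5.5 ]]
```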


🔹 Step 5: Repeat Until Convergence

  • The algorithm continues repeating steps 3 and 4
  • Convergence occurs when assignments no longer change, or the change in centroids is minimal

🔁 Pseudocode of K-Means

```text
1. Select K initial centroids randomly
2. While centroids do not stabilize:
    a. Assign each data point to its nearest centroid
    b. Recalculate each centroid as the mean of its assigned points
```

🧮 Key Formula – Within-Cluster Sum of Squares (WCSS)

$$\text{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2$$

K-Means attempts to minimize WCSS.
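In scikit-learn, the WCSS of a fitted model is exposed as the inertia_ attribute; it can also be computed by hand. A minimal sketch, reusing the illustrative arrays from the steps above:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [6.0, 5.5]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.25, 1.9], [6.0, 5.5]])

# Sum of squared distances from each point to its own cluster's centroid
wcss = ((X - centroids[labels]) ** 2).sum()
print(wcss)  # 0.145
```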


🧠 Example Table

| Iteration | Cluster 1 Centroid | Cluster 2 Centroid | WCSS  |
|-----------|--------------------|--------------------|-------|
| 1         | (2.1, 1.9)         | (6.0, 5.8)         | 112.4 |
| 2         | (2.3, 2.0)         | (5.9, 6.1)         | 97.2  |
| 3         | (2.3, 2.1)         | (6.0, 6.0)         | 93.0  |
| Final     | No change          | No change          | 93.0  |


🏁 How K-Means Converges

Convergence can be determined by:

  • Centroids stop moving
  • Assignments do not change
  • Max iterations reached

K-Means generally converges quickly (often in just a few iterations), but not necessarily to the global optimum.


📋 Choosing Initialization Strategy

| Method    | Description                              | When to Use                    |
|-----------|------------------------------------------|--------------------------------|
| Random    | Default, fast but unstable               | Small datasets                 |
| K-Means++ | Spreads initial centroids strategically  | Larger, high-dimensional data  |
| PCA-Based | Places centroids using principal components | Data that benefits from dimensionality reduction |


🧪 Example: Iterative Clustering

Let’s say we have three clusters. On each iteration:

  1. Assign data points to the closest centroid
  2. Move the centroid to the mean of its assigned points
  3. Check if the assignment has changed
  4. Repeat until no further changes occur

This loop creates progressively tighter, more cohesive clusters.


🚧 Problems with Random Initialization

  • Different runs can produce different results
  • You might land in a local minimum
  • Some clusters might end up empty

Solution: Use K-Means++ to initialize centroids.
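For intuition, here is an illustrative sketch of K-Means++ seeding (not scikit-learn's internal implementation): each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which pushes the starting points apart.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Illustrative K-Means++ seeding."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform pick
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = (((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2)
              .sum(axis=2).min(axis=1))
        # Sample the next centroid proportionally to those distances
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```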


🔍 Best Practices

  • Always scale features
  • Run K-Means multiple times with different seeds
  • Use Elbow Method for selecting K
  • Plot convergence of WCSS or inertia
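Putting several of these practices together in scikit-learn (a sketch; the dataset and parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scale features so no single dimension dominates the distances
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 reruns K-Means with different seeds and keeps the
# run with the lowest WCSS (inertia)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
print(model.inertia_)  # plot this across several K for the Elbow Method
```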

Summary Table


| Step   | Action                   | Output                     |
|--------|--------------------------|----------------------------|
| Step 1 | Select K                 | Number of clusters         |
| Step 2 | Random init of centroids | K starting points          |
| Step 3 | Assign points            | Clusters based on distance |
| Step 4 | Recompute centroids      | New center for each group  |
| Step 5 | Repeat                   | Final cluster labels       |


FAQs


1. What is K-Means Clustering?

K-Means Clustering is an unsupervised machine learning algorithm that groups data into K distinct clusters based on feature similarity. It minimizes the distance between data points and their assigned cluster centroid.

2. What does the 'K' in K-Means represent?

The 'K' in K-Means refers to the number of clusters you want the algorithm to form. This number is chosen before training begins.

3. How does the K-Means algorithm work?

It works by randomly initializing K centroids, assigning data points to the nearest centroid, recalculating the centroids based on the points assigned, and repeating this process until the centroids stabilize.

4. What is the Elbow Method in K-Means?

The Elbow Method helps determine the optimal number of clusters (K) by plotting the within-cluster sum of squares (WCSS) for various values of K and identifying the point where adding more clusters yields diminishing returns.

5. When should you not use K-Means?

K-Means is not well suited to datasets with non-spherical or overlapping clusters, categorical data, or cases where the number of clusters is unknown and hard to estimate.

6. What are the assumptions of K-Means?

K-Means assumes that clusters are spherical, equally sized, and non-overlapping. It also assumes all features contribute equally to the distance measurement.

7. What distance metric does K-Means use?

By default, K-Means uses Euclidean distance to measure the similarity between data points and centroids.

8. How does K-Means handle outliers?

K-Means is sensitive to outliers since they can significantly distort the placement of centroids, leading to poor clustering results.

9. What is K-Means++?

K-Means++ is an improved initialization technique that spreads out the initial centroids to reduce the chances of poor convergence and improve accuracy.

10. Can K-Means be used for image compression?

Yes, K-Means can cluster similar pixel colors together, which reduces the number of distinct colors in an image — effectively compressing it while maintaining visual quality.
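A minimal sketch of that idea with scikit-learn and Pillow (the file names and the 16-color palette size are illustrative choices):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and flatten it to one RGB row per pixel
img = np.asarray(Image.open("input.png").convert("RGB"))
pixels = img.reshape(-1, 3)

# Cluster pixel colors into a 16-color palette
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.astype(np.uint8)

# Replace every pixel with its cluster's centroid color
compressed = palette[kmeans.labels_].reshape(img.shape)
Image.fromarray(compressed).save("compressed.png")
```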