K-Means Clustering Explained: A Practical Guide with Real-World Example


📙 Chapter 3: Practical Implementation with Python

🎯 Objective

This chapter focuses on the hands-on application of K-Means clustering using Python. You’ll learn how to load data, implement clustering, visualize results, and evaluate cluster performance using tools like scikit-learn, matplotlib, and pandas.


🧰 Required Libraries

To run K-Means clustering in Python, you’ll need the following libraries:

  • pandas: for data handling
  • numpy: for numeric computations
  • scikit-learn: for the KMeans algorithm
  • matplotlib / seaborn: for data visualization

Install them using:

bash

pip install numpy pandas scikit-learn matplotlib seaborn


📥 Step 1: Import Libraries and Generate Data

Start by importing required packages and generating synthetic data:

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic dataset
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)


🔍 Step 2: Apply K-Means Clustering

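python

# Apply KMeans with K=4
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

# Predict the cluster index for each sample
y_kmeans = kmeans.predict(X)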


📌 Step 3: Visualize the Clusters

python

# Plot the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title("K-Means Clustering Results")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()


📈 Step 4: Evaluate with the Elbow Method

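To choose K, fit the model for a range of K values and plot the within-cluster sum of squares (WCSS), which scikit-learn exposes as inertia_. The "elbow" is the point where adding more clusters stops reducing WCSS substantially:

python

wcss = []
for k in range(1, 11):
    # Fix random_state so the curve is reproducible between runs
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)  # WCSS for this value of K

plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()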
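
The elbow is sometimes hard to read off the plot, so the Silhouette Score can serve as a second opinion. The following is a minimal sketch, assuming the same synthetic data as in Step 1; the choices `n_init=10` and `random_state=0` are only there for reproducibility, and the name `sil_scores` is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in Step 1
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# The silhouette score needs at least two clusters, so start at K=2
sil_scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)

# Higher is better; values close to 1 indicate compact, well-separated clusters
best_k = max(sil_scores, key=sil_scores.get)
for k, score in sil_scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("Highest silhouette at K =", best_k)
```

If the elbow and the silhouette peak disagree, it is usually worth inspecting the clusters for both candidate values of K before settling on one.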


🧮 Output Table: Cluster Centers and Assignments

Cluster   Center X   Center Y   Sample Count
0          1.95       4.31      75
1          7.77       2.71      70
2         -1.38       2.06      80
3          2.42      -1.29      75
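
A table like this can be built from the fitted model's attributes. The sketch below is one possible way, assuming the data and model settings from Steps 1 and 2; the variable name `summary` is illustrative. Cluster numbering is arbitrary, so the printed rows need not match the table above in order or in exact values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Count how many samples were assigned to each cluster label
counts = np.bincount(kmeans.labels_, minlength=kmeans.n_clusters)

summary = pd.DataFrame({
    "Cluster": np.arange(kmeans.n_clusters),
    "Center X": kmeans.cluster_centers_[:, 0].round(2),
    "Center Y": kmeans.cluster_centers_[:, 1].round(2),
    "Sample Count": counts,
})
print(summary.to_string(index=False))
```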


🧠 Advanced: Apply to Real Dataset

You can load real-world data from CSV:

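python

df = pd.read_csv('customers.csv')

# Select the two numeric features to cluster on
X = df[['AnnualIncome', 'SpendingScore']]

# Fit K-Means and attach each row's cluster label to the DataFrame
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)
df['Cluster'] = clusters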

Then, visualize using:

python

plt.scatter(df['AnnualIncome'], df['SpendingScore'], c=df['Cluster'], cmap='viridis')
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.title("Customer Segmentation")
plt.show()


Best Practices

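  • Scale features before clustering, especially when they are measured in different units or ranges
  • Use k-means++ initialization (the scikit-learn default, init='k-means++')
  • Run the model with several initializations and keep the lowest-inertia result (the n_init parameter)
  • Use the Silhouette Score for validation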
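
These practices can be combined in a single pipeline. The sketch below is only an illustration: the DataFrame is randomly generated as a hypothetical stand-in for customers.csv (the income and score ranges are assumptions, not real data), and K=5 is carried over from the example above rather than tuned.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for customers.csv (values are random, not real data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AnnualIncome": rng.normal(60_000, 20_000, size=200),
    "SpendingScore": rng.uniform(1, 100, size=200),
})
features = ["AnnualIncome", "SpendingScore"]

# Scale first so the large-valued income column does not dominate distances;
# n_init=10 runs ten k-means++ initializations and keeps the lowest-inertia fit
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0),
)
df["Cluster"] = model.fit_predict(df[features])

# Validate in the scaled feature space, which is where the clustering happened
X_scaled = model.named_steps["standardscaler"].transform(df[features])
print("Silhouette:", round(silhouette_score(X_scaled, df["Cluster"]), 3))
```

Without the scaling step, income (tens of thousands) would swamp the spending score (1 to 100) in every distance computation, and the clusters would be driven almost entirely by income.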

📋 Summary Table


Step                 Description
Import Data          Load or generate dataset
Apply KMeans         Fit and predict clusters
Visualize Clusters   Plot data and centroids
Evaluate             Use WCSS or Silhouette
Tune Parameters      Select best K using Elbow method


FAQs


1. What is K-Means Clustering?

K-Means Clustering is an unsupervised machine learning algorithm that groups data into K distinct clusters based on feature similarity. It seeks centroids that minimize the sum of squared distances between each data point and its assigned cluster centroid.
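
One common way to write this objective is shown below. The notation is introduced here for illustration only: $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is that cluster's centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

The quantity $J$ is the within-cluster sum of squares (WCSS) that scikit-learn reports as inertia_.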

2. What does the 'K' in K-Means represent?

The 'K' in K-Means refers to the number of clusters you want the algorithm to form. This number is chosen before training begins.

3. How does the K-Means algorithm work?

It initializes K centroids (randomly, in the basic version), assigns each data point to the nearest centroid, recalculates each centroid as the mean of the points assigned to it, and repeats the assignment and update steps until the centroids stabilize.
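
The loop can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_lloyd`, the tolerance, and the handling of empty clusters are assumptions made for this example.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, tol=1e-6, seed=0):
    """Basic K-Means (Lloyd's algorithm) with random initialization."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its points
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids have stabilized
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two well-separated groups of points, around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans_lloyd(X, k=2)
print(centroids.round(2))
```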

4. What is the Elbow Method in K-Means?

The Elbow Method helps determine the optimal number of clusters (K) by plotting the within-cluster sum of squares (WCSS) for various values of K and identifying the point where adding more clusters yields diminishing returns.

5. When should you not use K-Means?

K-Means is a poor fit for datasets with non-spherical or heavily overlapping clusters, for categorical data, and for cases where the number of clusters is unknown and difficult to estimate.
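
The non-spherical case is easy to demonstrate. The sketch below assumes scikit-learn's make_moons toy dataset as an example of two crescent-shaped groups and uses the Adjusted Rand Index to compare the K-Means labels with the true grouping; on data like this the score typically lands well below 1.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clearly two groups, but not spherical ones
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand Index: 1.0 means perfect agreement with the true grouping
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand Index: {ari:.2f}")
```

K-Means splits the plane with a straight boundary between the two centroids, so each crescent ends up divided between both clusters.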

6. What are the assumptions of K-Means?

K-Means assumes that clusters are spherical, equally sized, and non-overlapping. It also assumes all features contribute equally to the distance measurement.

7. What distance metric does K-Means use?

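K-Means uses Euclidean distance to measure how close data points are to centroids, and the objective it minimizes is based on squared Euclidean distance. scikit-learn's KMeans does not offer other distance metrics.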

8. How does K-Means handle outliers?

K-Means is sensitive to outliers since they can significantly distort the placement of centroids, leading to poor clustering results.
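
A small experiment makes this concrete. The sketch below uses made-up data (one tight group of points plus a single extreme point, both assumptions for illustration) and K=1, so the centroid is simply the mean of all points.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# One tight group of 50 points around the origin
X = rng.normal(0, 0.5, size=(50, 2))
center_clean = KMeans(n_clusters=1, n_init=1, random_state=0).fit(X).cluster_centers_[0]

# Add a single extreme point and refit
X_out = np.vstack([X, [[50.0, 50.0]]])
center_out = KMeans(n_clusters=1, n_init=1, random_state=0).fit(X_out).cluster_centers_[0]

print("Without outlier:", center_clean.round(2))
print("With outlier:   ", center_out.round(2))
```

Because a centroid is a mean, one point far from the rest pulls it noticeably away from where most of the data lies.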

9. What is K-Means++?

K-Means++ is an improved initialization technique that spreads out the initial centroids, which reduces the chance of converging to a poor local optimum. It is the default initialization in scikit-learn's KMeans.
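
The seeding idea can be sketched as follows. This is a simplified illustration under stated assumptions, not scikit-learn's implementation; the function name `kmeans_pp_init` and the sample data are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose k initial centroids from X using k-means++ style seeding."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so points far from the existing centroids are more likely picks
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Three groups of points around (0, 0), (5, 5) and (10, 10)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 5, 10)])
init = kmeans_pp_init(X, k=3)
print(init.round(2))
```

The chosen points are then used as starting centroids for the usual assign-and-update loop.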

10. Can K-Means be used for image compression?

Yes. K-Means can cluster similar pixel colors together and replace each pixel with its cluster's centroid color. This reduces the number of distinct colors in the image, which compresses it while largely preserving visual quality.
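
A minimal sketch of this color quantization is shown below. The image is a randomly generated array that stands in for a real photo, and the palette size of 8 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 32x32 RGB image with random colors, standing in for a real photo
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Treat every pixel as a 3-D point (R, G, B) and cluster into 8 colors
pixels = image.reshape(-1, 3).astype(float)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with the centroid color of its cluster
palette = kmeans.cluster_centers_.round().astype(np.uint8)
compressed = palette[kmeans.labels_].reshape(image.shape)

print("Distinct colors before:", len(np.unique(pixels, axis=0)))
print("Distinct colors after: ", len(np.unique(compressed.reshape(-1, 3), axis=0)))
```

Storing the small palette plus one cluster index per pixel takes less space than storing a full RGB triple for every pixel.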