🎯 Objective
This chapter focuses on the hands-on application of K-Means clustering using Python. You'll learn how to load data, implement clustering, visualize results, and evaluate cluster performance using tools like scikit-learn, matplotlib, and pandas.
🧰 Required Libraries
To run K-Means clustering in Python, you'll need the following libraries: numpy, pandas, scikit-learn, matplotlib, and seaborn.

Install them using:

```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```
📥 Step 1: Import Libraries and Generate Data

Start by importing the required packages and generating synthetic data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a synthetic dataset of 300 points in 4 blobs
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
```
🔍 Step 2: Apply K-Means Clustering

```python
# Apply KMeans with K=4 (n_init=10 runs the algorithm with 10 different
# centroid seeds and keeps the best result)
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
kmeans.fit(X)

# Predict the cluster label for each sample
y_kmeans = kmeans.predict(X)
```
📌 Step 3: Visualize the Clusters

```python
# Plot the points, colored by assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Mark the cluster centroids in red
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title("K-Means Clustering Results")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
```
📈 Step 4: Evaluate with the Elbow Method

To determine the best value of K, plot the within-cluster sum of squares (WCSS) for a range of K values:

```python
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=0, n_init=10)
    km.fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS of the fitted model

plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()
```
🧮 Output Table: Cluster Centers and Assignments

| Cluster | Center X | Center Y | Sample Count |
|---------|----------|----------|--------------|
| 0       | 1.95     | 4.31     | 75           |
| 1       | 7.77     | 2.71     | 70           |
| 2       | -1.38    | 2.06     | 80           |
| 3       | 2.42     | -1.29    | 75           |
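A table like this can be derived directly from the fitted model. Here is a minimal sketch using the `kmeans` and `y_kmeans` objects from Steps 2 and 3 (the exact values depend on the random data):

```python
# Build a summary table of cluster centers and per-cluster sample counts
centers = kmeans.cluster_centers_
counts = np.bincount(y_kmeans)

summary = pd.DataFrame({
    'Cluster': range(len(centers)),
    'Center X': centers[:, 0].round(2),
    'Center Y': centers[:, 1].round(2),
    'Sample Count': counts,
})
print(summary.to_string(index=False))
```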
🧠 Advanced: Apply to a Real Dataset

You can load real-world data from a CSV file:

```python
df = pd.read_csv('customers.csv')
X = df[['AnnualIncome', 'SpendingScore']]

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
clusters = kmeans.fit_predict(X)
df['Cluster'] = clusters
```
Then, visualize using:
```python
plt.scatter(df['AnnualIncome'], df['SpendingScore'], c=df['Cluster'], cmap='viridis')
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.title("Customer Segmentation")
plt.show()
```
✅ Best Practices

- Scale your features before clustering so that no single feature dominates the distance calculation (see the sketch below).
- Choose K with the Elbow Method or silhouette scores rather than guessing.
- Use k-means++ initialization (scikit-learn's default) to avoid poor convergence.
- Inspect and handle outliers first, since they can distort centroid placement.
- Set random_state for reproducible results.
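For example, the customer data above can be standardized before clustering. This is a minimal sketch using scikit-learn's StandardScaler, assuming the same (hypothetical) customers.csv columns as before:

```python
from sklearn.preprocessing import StandardScaler

# Standardize both features to zero mean and unit variance so they
# contribute equally to the Euclidean distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['AnnualIncome', 'SpendingScore']])

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_scaled)
```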
📋 Summary Table
| Step | Description |
|------|-------------|
| Import Data | Load or generate the dataset |
| Apply KMeans | Fit and predict clusters |
| Visualize Clusters | Plot data and centroids |
| Evaluate | Use WCSS or silhouette score |
| Tune Parameters | Select the best K using the Elbow Method |
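The silhouette score mentioned above is available in scikit-learn. A minimal sketch using the synthetic data and labels from Step 2:

```python
from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1; higher means better-separated clusters
score = silhouette_score(X, y_kmeans)
print(f"Silhouette score for K=4: {score:.3f}")
```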
💡 Key Concepts Recap

K-Means Clustering is an unsupervised machine learning algorithm that groups data into K distinct clusters based on feature similarity. It minimizes the distance between data points and their assigned cluster centroid.
The 'K' in K-Means refers to the number of clusters you want the algorithm to form. This number is chosen before training begins.
It works by randomly initializing K centroids, assigning data points to the nearest centroid, recalculating the centroids based on the points assigned, and repeating this process until the centroids stabilize.
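To make that iteration concrete, here is a minimal NumPy sketch of the loop. It illustrates the idea only, not scikit-learn's optimized implementation, and `kmeans_sketch` is a hypothetical helper name:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points
        # (assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids have stabilized
            break
        centroids = new_centroids
    return labels, centroids

labels, centers = kmeans_sketch(X, k=4)
```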
The Elbow Method helps determine the optimal number of clusters (K) by plotting the within-cluster sum of squares (WCSS) for various values of K and identifying the point where adding more clusters yields diminishing returns.
K-Means is not suitable for datasets with non-spherical or overlapping clusters, categorical data, or when the number of clusters is not known and difficult to estimate.
K-Means assumes that clusters are spherical, equally sized, and non-overlapping. It also assumes all features contribute equally to the distance measurement.
By default, K-Means uses Euclidean distance to measure the similarity between data points and centroids.
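For example, the Euclidean distance between a point and a centroid can be computed as:

```python
import numpy as np

point = np.array([1.0, 2.0])
centroid = np.array([4.0, 6.0])
distance = np.linalg.norm(point - centroid)  # sqrt(3^2 + 4^2) = 5.0
```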
K-Means is sensitive to outliers since they can significantly distort the placement of centroids, leading to poor clustering results.
K-Means++ is an improved initialization technique that spreads out the initial centroids to reduce the chances of poor convergence and improve accuracy.
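In scikit-learn, k-means++ is the default initialization, and it can also be requested explicitly:

```python
from sklearn.cluster import KMeans

# init='k-means++' spreads out the initial centroids;
# n_init repeats the run with different seeds and keeps the best result
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0)
```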
K-Means can also be used for image compression: clustering similar pixel colors together reduces the number of distinct colors in an image, effectively compressing it while maintaining visual quality, as sketched below.
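Here is a minimal color-quantization sketch along those lines, assuming a hypothetical local image file `photo.png`:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

img = plt.imread('photo.png')[:, :, :3]  # 'photo.png' is hypothetical; drop alpha if present
pixels = img.reshape(-1, 3)              # one row per pixel (R, G, B)

# Cluster the pixel colors into 16 representative colors
kmeans = KMeans(n_clusters=16, random_state=0, n_init=10).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)

plt.imshow(compressed)
plt.axis('off')
plt.show()
```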