Transform Text Labels into Numerical Features for Machine Learning
🧠 Introduction
Many machine learning models require numerical inputs,
yet most datasets contain categorical features like Gender, Country,
Department, or Product Type. To make these models work, we must convert
these categories into numbers without losing their meaning or introducing
bias. This process is called categorical encoding.
In this chapter, you'll learn the most common encoding techniques (Label, One-Hot, Ordinal, Binary, Frequency, and Target encoding), when to use each one, and how to combine them in a full Scikit-learn pipeline.
🔍 What is Categorical Encoding?
Categorical encoding is the process of converting
labels (strings or categories) into a numerical format that machine learning
models can understand.
Example:

| Gender (Original) | Gender (Encoded) |
| --- | --- |
| Male | 1 |
| Female | 0 |
⚙️ Types of Categorical Variables

| Type | Example Column | Suitable Encoding Method |
| --- | --- | --- |
| Nominal (no order) | Color: Red, Blue, Green | One-hot, Label, Binary |
| Ordinal (has order) | Size: Small, Medium, Large | Ordinal, Integer, Target |
🔢 Step 1: Label Encoding
Label Encoding assigns each category a unique integer.
▶ Code Example:

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})
le = LabelEncoder()
df['Gender_encoded'] = le.fit_transform(df['Gender'])
print(df)
```
Output:

```
   Gender  Gender_encoded
0    Male               1
1  Female               0
2  Female               0
3    Male               1
```
⚠️ Caution: Use Label Encoding only
for ordinal data or binary categories — otherwise, it may introduce false
relationships.
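For binary categories, you can also map the values explicitly instead of relying on LabelEncoder's alphabetical ordering. A minimal sketch:

```python
# An explicit mapping keeps the 0/1 assignment under your control
df['Gender_encoded'] = df['Gender'].map({'Female': 0, 'Male': 1})
```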
🟩 Step 2: One-Hot Encoding
One-Hot Encoding creates a separate column for each
category with binary values (0 or 1).
▶ Using pd.get_dummies():

```python
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'], drop_first=False)
print(df_encoded)
```
Output:

```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           0            0          1
```
▶ Drop one column to avoid multicollinearity (optional):

```python
pd.get_dummies(df, columns=['Color'], drop_first=True)
```
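With drop_first=True, the alphabetically first category (Blue here) becomes the implicit baseline. A quick check of the resulting columns:

```python
# Only Color_Green and Color_Red remain; a row of all zeros means 'Blue'
print(pd.get_dummies(df, columns=['Color'], drop_first=True).columns.tolist())
# Expected: ['Color_Green', 'Color_Red']
```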
🧮 Step 3: Ordinal Encoding

Use this when categories have an inherent order, e.g., Low < Medium < High.
```python
df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Small']})
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
df['Size_encoded'] = df['Size'].map(size_order)
```
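Scikit-learn's OrdinalEncoder does the same job and can be reused on new data. A minimal sketch; note the explicit category order so ranks are not assigned alphabetically:

```python
from sklearn.preprocessing import OrdinalEncoder

# Pass the categories in their intended order, one list per encoded column
enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_encoded_sklearn'] = enc.fit_transform(df[['Size']]).ravel()
```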
🧠 Step 4: Binary Encoding (for high-cardinality nominal features)
Binary encoding converts categories into binary digits and
splits them across multiple columns.
```python
# Requires the category_encoders package:
# pip install category_encoders
import category_encoders as ce

df = pd.DataFrame({'City': ['London', 'Paris', 'Berlin', 'Rome', 'Paris']})
encoder = ce.BinaryEncoder(cols=['City'])
df_encoded = encoder.fit_transform(df)
```
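For the five rows above (four unique cities), the encoder should emit three 0/1 columns instead of four one-hot columns, since the ordinal codes 1 to 4 fit in three binary digits. A quick check (column naming may vary by category_encoders version):

```python
print(df_encoded)  # expected columns along the lines of City_0, City_1, City_2
```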
🎯 Step 5: Frequency / Count Encoding
Replace categories with their frequency of occurrence.
```python
df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A']})
df['Product_encoded'] = df['Product'].map(df['Product'].value_counts())
```
Output:

```
  Product  Product_encoded
0       A                3
1       B                2
2       A                3
3       C                1
4       B                2
5       A                3
```
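To keep train and test consistent, compute the counts on the training data only and reuse that mapping. A minimal sketch with hypothetical train/test frames:

```python
# Hypothetical split; in practice use your real train/test DataFrames
train = pd.DataFrame({'Product': ['A', 'B', 'A', 'C']})
test = pd.DataFrame({'Product': ['B', 'D']})  # 'D' never seen in training

counts = train['Product'].value_counts()
train['Product_encoded'] = train['Product'].map(counts)
# Unseen categories map to NaN; fill with 0 (zero training occurrences)
test['Product_encoded'] = test['Product'].map(counts).fillna(0)
```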
🧠 Step 6: Target Encoding (Mean Encoding)
Replace a category with the average value of the target
variable for that category.
```python
df = pd.DataFrame({
    'City': ['London', 'Paris', 'London', 'Berlin'],
    'Sales': [100, 200, 120, 80]
})

city_avg = df.groupby('City')['Sales'].mean()
df['City_encoded'] = df['City'].map(city_avg)
```
⚠️ Be careful of data leakage. Use cross-validation or holdout sets, as in the out-of-fold sketch below.
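One common safeguard is out-of-fold target encoding: each row is encoded with means computed from the other folds only. A minimal sketch using KFold (the toy df above is far too small for this in practice):

```python
import numpy as np
from sklearn.model_selection import KFold

df['City_encoded_cv'] = np.nan
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    # Means come only from the training fold, never from the row being encoded
    fold_means = df.iloc[train_idx].groupby('City')['Sales'].mean()
    df.iloc[val_idx, df.columns.get_loc('City_encoded_cv')] = (
        df['City'].iloc[val_idx].map(fold_means).to_numpy()
    )
# Categories unseen in a fold stay NaN; fall back to the global mean
df['City_encoded_cv'] = df['City_encoded_cv'].fillna(df['Sales'].mean())
```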
🧪 Step 7: Handling Unknown Categories

LabelEncoder / OrdinalEncoder: these throw an error if an unknown category is encountered during inference. To avoid this, use handle_unknown='use_encoded_value' (together with an explicit unknown_value) in OrdinalEncoder, as in the sketch below.
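A minimal sketch of that option, with a hypothetical label ('XL') that never appears in training:

```python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

train = pd.DataFrame({'Size': ['Small', 'Medium', 'Large']})
test = pd.DataFrame({'Size': ['Medium', 'XL']})  # 'XL' is unseen

enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(train[['Size']])
print(enc.transform(test[['Size']]))  # 'XL' becomes -1 instead of an error
```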
One-Hot Encoding: use OneHotEncoder from Scikit-learn for consistency across training and test sets.
```python
from sklearn.preprocessing import OneHotEncoder

# df here is the Color DataFrame from Step 2.
# Note: in scikit-learn >= 1.2 the parameter is sparse_output
# (the old sparse parameter was removed in 1.4).
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = ohe.fit_transform(df[['Color']])
```
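Because handle_unknown='ignore' was set, an unseen category at transform time simply encodes as an all-zero row instead of raising an error:

```python
# 'Purple' was not present when the encoder was fitted
new_data = pd.DataFrame({'Color': ['Red', 'Purple']})
print(ohe.transform(new_data[['Color']]))  # the 'Purple' row is all zeros
```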
📊 Summary Table: Encoding Techniques
| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Label Encoding | Ordinal or binary categories | Simple | Implies order if not ordinal |
| One-Hot Encoding | Nominal, small cardinality | Model-friendly, widely used | Many columns for large categories |
| Ordinal Encoding | Ordered categories | Maintains hierarchy | Hard-coded mappings |
| Binary Encoding | High-cardinality nominal categories | Fewer columns than one-hot | Less interpretable |
| Frequency Encoding | Any categorical variable | Simple, fast | May cause overfitting |
| Target Encoding | High-cardinality with known target | Powerful in boosting models | Risk of leakage, overfitting |
🧠 Encoding in a Full ML Pipeline (Scikit-learn)
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

categorical_cols = ['Gender', 'City']
numeric_cols = ['Age']

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
], remainder='passthrough')

model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier())
])
```
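Fitting the pipeline then encodes and trains in one step. A minimal sketch with hypothetical toy data matching the columns above:

```python
import pandas as pd

# Hypothetical data; replace with your real DataFrame
X = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'City': ['London', 'Paris', 'Berlin', 'London'],
    'Age': [34, 28, 45, 39],
})
y = [1, 0, 1, 0]

model.fit(X, y)
print(model.predict(X))
```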
📦 Best Practices for Encoding
| Tip | Why It's Important |
| --- | --- |
| Standardize categories before encoding | Prevents redundant encodings |
| Drop one one-hot column if using linear models | Prevents multicollinearity |
| Use a consistent encoder across the train/test split | Avoids unseen-category errors |
| Don't apply target encoding without caution | Can cause data leakage if not cross-validated |
| Use sparse matrix formats for large encodings (see the sketch below) | Saves memory for high-cardinality features |
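A minimal sketch of the sparse option, which is the default output format of Scikit-learn's OneHotEncoder:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

ohe_sparse = OneHotEncoder()  # sparse output is the default
X_sparse = ohe_sparse.fit_transform(df[['Color']])
print(type(X_sparse))  # a SciPy sparse matrix: cheap to store for many categories
```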
🧪 Encoding Case Study: Product Category
| Category | One-Hot | Label | Frequency | Target Avg |
| --- | --- | --- | --- | --- |
| Electronics | 1 0 0 | 2 | 3 | 120 |
| Fashion | 0 1 0 | 1 | 1 | 95 |
| Groceries | 0 0 1 | 0 | 2 | 110 |
🏁 Conclusion
Categorical encoding bridges the gap between raw text labels and numeric machine learning models. Whether you're dealing with two categories or two hundred, choosing the right encoding strategy can make or break your model performance. With Python and Scikit-learn, you have full control over how you represent your data — just make sure you're encoding with purpose and without bias.