Turn Messy, Raw Data into Machine-Ready Gold
🧠 Introduction
You’ve loaded your dataset, explored the features, and understood its structure. Now it’s time for what many consider the most critical, and often most time-consuming, phase of any data science project: data cleaning and preprocessing.
According to multiple studies, up to 80% of a data scientist's time is spent preparing data: removing duplicates, handling missing values, fixing data types, encoding categories, scaling features, and more.
In this chapter, you’ll learn step-by-step how to clean and preprocess your dataset using Python (primarily with Pandas and Scikit-learn), so you can confidently move on to analysis or modeling.
🧹 1. Remove Duplicates
Duplicate rows can distort analysis and predictions.
▶ Code Example:
```python
# Check for duplicates
print(df.duplicated().sum())

# Drop duplicates
df = df.drop_duplicates()
```
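If duplicates are defined by a few key columns rather than entire rows, you can deduplicate on a subset. A minimal sketch, assuming hypothetical `Name` and `Email` columns:

```python
# Keep the first occurrence of each (Name, Email) pair
# ('Name' and 'Email' are hypothetical columns -- adjust to your dataset)
df = df.drop_duplicates(subset=['Name', 'Email'], keep='first')
```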
🔎 2. Handle Missing Values
First, understand how many values are missing per column:
```python
df.isnull().sum()
```
✅ Ways to handle missing data:
| Method | When to Use | Code Sample |
| --- | --- | --- |
| Drop rows | When only a few rows have missing values | `df.dropna(inplace=True)` |
| Fill with mean/median | For numerical columns | `df['Age'].fillna(df['Age'].mean())` |
| Fill with mode | For categorical columns | `df['Gender'].fillna(df['Gender'].mode()[0])` |
| Forward fill | For time series data | `df.ffill()` |
| Custom logic | When context is important | `df.loc[df['Age'].isnull(), 'Age'] = 30` |
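As a minimal, self-contained sketch of the fill strategies (using a small made-up DataFrame rather than the chapter's dataset):

```python
import pandas as pd
import numpy as np

# Small made-up example frame with gaps in both column types
df = pd.DataFrame({
    'Age': [25, np.nan, 40, np.nan],
    'Gender': ['F', 'M', None, 'M'],
})

# Numerical column: fill with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Categorical column: fill with the mode (most frequent value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

print(df)
```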
✂️ 3. Trim and Clean Text Data
Text columns often contain extra whitespace, inconsistent
capitalization, or special characters.
▶ Code Examples:
```python
# Strip leading/trailing whitespace and normalize capitalization
df['Name'] = df['Name'].str.strip().str.title()

# Remove special characters (keep only word characters and whitespace)
df['Comments'] = df['Comments'].str.replace(r'[^\w\s]', '', regex=True)
```
⏳ 4. Convert Data Types
Make sure each column has the correct data type:
```python
df['Date'] = pd.to_datetime(df['Date'])
df['Age'] = df['Age'].astype(int)
```
Use .dtypes to view current types.
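For example, to inspect types and guard against malformed date strings (`errors='coerce'` turns unparseable values into NaT rather than raising an error):

```python
# Inspect current column types
print(df.dtypes)

# If some date strings are malformed, coerce them to NaT instead of raising
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
```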
📦 5. Encode Categorical Variables
Most machine learning models only accept numeric input. Use
encoding to convert categories into numbers.
▶ Label Encoding (for binary or ordinal categories):
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
```
▶ One-Hot Encoding (for nominal categories):
```python
df = pd.get_dummies(df, columns=['City'], drop_first=True)
```
📏 6. Normalize or Standardize Numerical Features
Scaling helps improve model performance, especially for scale-sensitive algorithms like KNN or SVM.
▶ Min-Max Scaling:
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
```
▶ Standard Scaling:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
```
🧠 7. Outlier Detection and Handling
Outliers can distort your model’s learning process.
▶ Boxplot Visualization:
```python
import seaborn as sns

sns.boxplot(x=df['Income'])
```
▶ Remove or Cap Outliers:
```python
# Compute the IQR fences
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only rows inside the fences
df = df[(df['Income'] >= lower) & (df['Income'] <= upper)]
```
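The snippet above removes outlier rows. If you would rather cap them, keeping the rows but limiting extreme values, Pandas' `clip()` does this with the same IQR fences:

```python
# Cap instead of remove: values beyond the fences are set to the fence value
df['Income'] = df['Income'].clip(lower=lower, upper=upper)
```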
🧪 8. Feature Transformation
Sometimes a feature needs to be transformed to reduce skewness or bring its distribution closer to normal.
▶ Log Transform:
```python
import numpy as np

df['Log_Income'] = np.log1p(df['Income'])  # log(1 + x)
```
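To check whether the transform actually reduced skewness, you can compare `.skew()` before and after; values near 0 indicate a roughly symmetric distribution:

```python
# Compare skewness before and after the log transform
print(df['Income'].skew())      # often strongly positive for income-like data
print(df['Log_Income'].skew())  # should be closer to 0
```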
🧠 9. Feature Engineering
Create new features from existing ones to capture more
insight.
▶ Examples:
```python
# Create age groups
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 18, 35, 60, 100],
                         labels=['Teen', 'Young', 'Adult', 'Senior'])

# Combine features
df['Income_per_Age'] = df['Income'] / df['Age']
```
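Another common pattern, sketched here assuming the Date column was converted to datetime in step 4, is deriving calendar features:

```python
# Derive calendar features from the datetime column (assumes step 4 ran)
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day_of_Week'] = df['Date'].dt.dayofweek  # Monday=0 ... Sunday=6
```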
🧮 10. Final Preprocessing Checklist
| Task | Function/Tool Used |
| --- | --- |
| Remove duplicates | `df.drop_duplicates()` |
| Fix missing values | `fillna()`, `dropna()` |
| Standardize data types | `astype()`, `to_datetime()` |
| Clean string columns | `str.strip()`, `str.lower()` |
| Encode categorical variables | `LabelEncoder`, `get_dummies()` |
| Normalize or scale numerical data | `MinMaxScaler`, `StandardScaler` |
| Detect and handle outliers | IQR method, boxplot |
| Feature transformation | `np.log1p()`, `np.sqrt()` |
| Feature engineering | `pd.cut()`, arithmetic combinations |
✅ Sample Code: Full Preprocessing Pipeline
```python
# Step-by-step preprocessing
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Load data
df = pd.read_csv('titanic.csv')

# Drop duplicates
df.drop_duplicates(inplace=True)

# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Convert data types
df['Age'] = df['Age'].astype(int)

# Encode categorical features
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-hot encode 'Embarked'
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Normalize 'Age' and 'Fare'
scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
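After running the pipeline, a quick sanity check shows whether any gaps remain and whether the columns are model-ready:

```python
# Verify the result: missing values, column types, and scaled ranges
print(df.isnull().sum())
print(df.dtypes)
print(df[['Age', 'Fare']].describe())
```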
📊 Summary Table: Preprocessing Tasks
| Task | Description |
| --- | --- |
| Remove duplicates | Avoid repeated records |
| Fix missing values | Ensure no gaps in important data |
| Standardize data types | Ensure numeric, date, and categorical columns have the right types |
| Encode categories | Make data model-ready |
| Scale features | Put features on the same scale |
| Handle outliers | Reduce the impact of anomalies |
| Transform skewed features | Improve model learning |
| Engineer new features | Capture more insights from existing data |
❓ Frequently Asked Questions
Q: Do I need a strong statistics background to start a data science project?
Answer: Not at all. Basic knowledge of statistics is helpful, but you can start your first project with a beginner-friendly dataset and learn concepts like mean, median, correlation, and regression as you go.
Q: Which programming language should I use?
Answer: Python is the most popular and beginner-friendly choice, thanks to its simplicity and powerful libraries like Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.
Q: How big should my first project be?
Answer: Keep it small and manageable: one target variable, 3–6 features, and under 10,000 rows of data. Focus more on understanding the process than building a complex model.
Q: Should I use machine learning in my first project?
Answer: Yes, but keep it simple. Start with linear regression, logistic regression, or decision trees. Avoid deep learning or complex models until you're more confident.
Q: Can I showcase my first project in a portfolio?
Answer: Absolutely! A well-documented project with clear insights, code, and visualizations is a great way to show employers that you understand the end-to-end data science process.