Transforming Raw Data into Machine Learning Gold
🧠 Introduction
Once your data is collected, it’s rarely ready for immediate
use. It’s messy. It’s inconsistent. It has missing values, errors, and
misaligned types. That’s why data cleaning and preprocessing is a
cornerstone of every data science workflow.
It’s said that 70-80% of a data scientist's time goes into
cleaning and preparing the data. And for good reason — clean data leads to
accurate models.
This chapter walks you through:
- Inspecting the structure of a raw dataset
- Removing duplicates and handling missing values
- Fixing data types and cleaning text fields
- Encoding categorical data and scaling features
- Detecting outliers and engineering new features
- Assembling everything into a reusable preprocessing pipeline
By the end, you’ll understand how to turn raw, chaotic data
into something usable and valuable for machine learning and analytics.
📦 1. Why Clean Data Matters
| Without Cleaning | With Cleaning |
| --- | --- |
| Inaccurate models | Reliable predictions |
| Poor user experience | Trustworthy insights |
| Failures in production | Scalable pipelines |
| Misleading visualizations | Meaningful reports |
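A tiny illustration of the first row of that table. The numbers below are hypothetical, but they show how a single duplicated row plus one data-entry error can wreck a summary statistic that a model would otherwise learn from:

```python
import pandas as pd

# Hypothetical fares: the 9999.0 entry is a data-entry error,
# and passenger 3 appears twice.
fares = pd.DataFrame({
    'passenger': [1, 2, 3, 3, 4],
    'fare': [7.25, 71.28, 8.05, 8.05, 9999.0]
})

print(fares['fare'].mean())  # ~2018.7, badly skewed by the error and duplicate

clean = fares.drop_duplicates().query('fare < 500')
print(clean['fare'].mean())  # ~28.9, a plausible average fare
```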
📊 2. Checking the Data Structure
Let’s load a sample dataset and see what we’re working with:
```python
import pandas as pd

df = pd.read_csv('titanic.csv')

print(df.shape)   # (rows, columns)
df.info()         # column dtypes and non-null counts (prints directly, so no print() needed)
print(df.head())  # first five rows
```
This shows the dataset's dimensions, each column's dtype and non-null count, and the first five rows.
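Assuming the standard Kaggle Titanic training set, the output looks roughly like this (abridged; other versions of the file will differ):

```
(891, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
 5   Age       714 non-null    float64
 10  Cabin     204 non-null    object
 11  Embarked  889 non-null    object
...
```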
❌ 3. Removing Duplicates
```python
print(df.duplicated().sum())  # Count fully duplicated rows
df = df.drop_duplicates()     # Drop them
```
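By default, every column must match for a row to count as a duplicate. Both methods also accept `subset` and `keep` parameters when only certain columns define "the same record"; the column choice below is illustrative:

```python
# Treat rows as duplicates when Name and Ticket match, keeping the first occurrence
df = df.drop_duplicates(subset=['Name', 'Ticket'], keep='first')
```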
🔍 4. Handling Missing Values
Check how much is missing:
```python
print(df.isnull().sum())
```
Common Strategies:
| Column Type | Strategy | Example Code |
| --- | --- | --- |
| Numeric | Fill with mean or median | `df['Age'].fillna(df['Age'].mean())` |
| Categorical | Fill with mode | `df['Embarked'].fillna(df['Embarked'].mode()[0])` |
| Any type | Drop rows/columns | `df.dropna()` or `df.drop(columns=['Cabin'])` |
Visualize Missing Data:
```python
import seaborn as sns

sns.heatmap(df.isnull(), cbar=False)  # light cells mark missing values
```
🧠 5. Fixing Data Types
```python
# Fill missing ages first: astype(int) raises an error if NaNs remain
df['Age'] = df['Age'].fillna(df['Age'].median()).astype(int)
df['Date'] = pd.to_datetime(df['Date'])  # illustrative; the Titanic CSV has no Date column
```
Why it matters:
- Arithmetic on numbers stored as strings concatenates instead of adding
- Date columns must be datetime for sorting, filtering, and time-based features
- Correct dtypes use less memory and are expected by most ML libraries
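A quick sketch of the first two points, showing what actually goes wrong when types are off:

```python
import pandas as pd

ages = pd.Series(['30', '5'])
print(ages.sum())              # '305': string concatenation, not arithmetic
print(ages.astype(int).sum())  # 35: the intended sum

dates = pd.to_datetime(pd.Series(['2021-03-01', '2021-01-15']))
print(dates.max())             # 2021-03-01: correct chronological comparison
```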
✂️ 6. Cleaning Strings and Categories
Strip whitespace, fix case:
```python
df['Name'] = df['Name'].str.strip().str.title()
```
Replace or drop unwanted characters:
```python
df['City'] = df['City'].str.replace(r'[^a-zA-Z ]', '', regex=True)
```
Unify categories:
```python
df['Gender'] = df['Gender'].replace({'male': 'Male', 'M': 'Male'})
```
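Before writing a replacement map like the one above, it helps to list every variant that actually occurs (the counts in the comment are made up):

```python
print(df['Gender'].value_counts())
# Hypothetical output showing three spellings of one category:
# male    500
# Male     60
# M        17
```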
📚 7. Encoding Categorical Data
Label Encoding (for binary or ordinal data):
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])  # Male = 1, Female = 0
```
One-Hot Encoding (for nominal data):
```python
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
```
📏 8. Feature Scaling
Some ML models (like KNN, SVM) are sensitive to feature
scale.
Standardization:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
Normalization (0–1 range):
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
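One caveat: fitting a scaler on the full dataset leaks test-set information into training. A common pattern is to fit on the training split only and reuse those statistics on the test split. A minimal sketch, assuming the Kaggle Titanic `Survived` column as the target:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df[['Age', 'Fare']]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # apply those same statistics to the test set
```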
🚫 9. Outlier Detection and Treatment
Visual Detection:
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df['Fare'])
plt.show()
```
Remove using IQR:
```python
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only rows whose fare falls inside the IQR fences
df = df[(df['Fare'] >= lower) & (df['Fare'] <= upper)]
```
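Dropping rows loses information. An alternative sketch caps extreme values at the same IQR fences instead of removing them (often called winsorizing):

```python
# Cap extreme fares at the IQR fences rather than dropping the rows
df['Fare'] = df['Fare'].clip(lower=lower, upper=upper)
```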
🔧 10. Feature Engineering Basics
You might want to:
- Combine existing columns into new ones (e.g. a family-size count)
- Extract components from text or dates (titles from names, month from a timestamp)
- Bin continuous values into categories (age groups, fare bands)
- Create boolean flags (traveling alone or not), as shown in the sketch below
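A minimal sketch on the Titanic columns. `SibSp`, `Parch`, `Name`, and `Age` are real columns in the dataset; the new column names (`FamilySize`, `IsAlone`, `Title`, `AgeGroup`) are illustrative:

```python
# Family size from siblings/spouses + parents/children + the passenger themselves
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Boolean flag for solo travelers
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# Extract the title (Mr, Mrs, Miss, ...) from names like "Braund, Mr. Owen Harris"
df['Title'] = df['Name'].str.extract(r',\s*([^\.]+)\.', expand=False)

# Bin ages into coarse groups
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 60, 120],
                        labels=['Child', 'Teen', 'Adult', 'Senior'])
```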
📋 11. Preprocessing Checklist Before Modeling
| Task | Checked? ✅ |
| --- | --- |
| Missing values handled | |
| Data types corrected | |
| Categorical features encoded | |
| Text cleaned | |
| Outliers treated | |
| Features scaled | |
| Dataset split or ready | |
✅ Full Pipeline Sample
```python
# Cleaned Titanic dataset preprocessing
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('titanic.csv')

# Drop columns with little predictive value
df.drop(columns=['Cabin', 'Ticket'], inplace=True)

# Fill missing values (plain assignment avoids the deprecated inplace-on-column pattern)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode the binary Sex column
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-hot encode the nominal Embarked column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Scale the numeric features
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
🧪 Bonus: Automate with Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

df[['Age', 'Fare']] = pipeline.fit_transform(df[['Age', 'Fare']])
```
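The same idea extends to mixed column types. Below is a sketch, assuming the raw Titanic columns before any manual encoding, that uses scikit-learn's ColumnTransformer to impute, scale, and encode numeric and categorical columns in one object:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ['Age', 'Fare']
categorical_cols = ['Sex', 'Embarked']

preprocess = ColumnTransformer([
    # Median-impute then standardize the numeric columns
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    # Mode-impute then one-hot encode the categorical columns
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(drop='first'))]), categorical_cols),
])

X = preprocess.fit_transform(df[numeric_cols + categorical_cols])
```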
📚 Summary Table
| Step | Function/Tool |
| --- | --- |
| Remove duplicates | `df.drop_duplicates()` |
| Missing values | `fillna()`, `dropna()` |
| Type conversion | `astype()`, `to_datetime()` |
| Encode categories | `LabelEncoder`, `get_dummies()` |
| Clean text fields | `str.strip()`, `str.lower()`, `replace()` |
| Outlier removal | IQR method + `boxplot()` |
| Scaling | `StandardScaler`, `MinMaxScaler` |
❓ Frequently Asked Questions

Q: What is the data science workflow?
A: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.

Q: Do the stages have to happen in strict order?
A: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.

Q: How does data cleaning differ from EDA?
A: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.

Q: Can I build a model before finishing feature engineering?
A: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.

Q: Which tools are used for modeling and evaluation?
A: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.

Q: How do I choose an evaluation metric?
A: It depends on the problem: classification tasks typically look at accuracy, precision, recall, or F1-score, while regression tasks use error measures such as MAE, RMSE, or R².

Q: What's a simple way to deploy a first model?
A: Start with lightweight options like Flask, FastAPI, or Streamlit, which let you wrap a trained model in a small web app or API with minimal code.

Q: How do I monitor a deployed model?
A: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.
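A minimal sketch of the prediction-logging part of that answer, using only the Python standard library (the file name and record fields are illustrative):

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename='predictions.log', level=logging.INFO)

def log_prediction(features, prediction):
    """Append one prediction record so drift can be analyzed later."""
    logging.info(json.dumps({
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'features': features,
        'prediction': prediction,
    }))

log_prediction({'Age': 29, 'Fare': 72.5}, 1)
```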
Q: Is it okay to stop before deployment?
A: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.

Q: What's the best way to practice the full workflow?
A: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.