Data Science Workflow: From Problem to Solution – A Complete Step-by-Step Journey for Beginners


📗 Chapter 3: Data Cleaning and Preprocessing

Transforming Raw Data into Machine Learning Gold


🧠 Introduction

Once your data is collected, it’s rarely ready for immediate use. It’s messy. It’s inconsistent. It has missing values, errors, and misaligned types. That’s why data cleaning and preprocessing is a cornerstone of every data science workflow.

It’s often estimated that 70–80% of a data scientist’s time goes into cleaning and preparing data, and for good reason: clean data leads to accurate models.

This chapter walks you through:

  • Handling missing data
  • Fixing data types and formatting
  • Dealing with duplicates
  • Encoding categorical values
  • Scaling and normalizing features
  • Treating outliers
  • Preparing your dataset for modeling

By the end, you’ll understand how to turn raw, chaotic data into something usable and valuable for machine learning and analytics.


📦 1. Why Clean Data Matters

| Without Cleaning | With Cleaning |
| --- | --- |
| Inaccurate models | Reliable predictions |
| Poor user experience | Trustworthy insights |
| Failures in production | Scalable pipelines |
| Misleading visualizations | Meaningful reports |


📊 2. Checking the Data Structure

Let’s load a sample dataset and see what we’re working with:

```python
import pandas as pd

df = pd.read_csv('titanic.csv')

print(df.shape)   # number of rows and columns
df.info()         # dtypes and non-null counts (info() prints directly, no print() needed)
print(df.head())  # first five rows
```

This gives us:

  • Number of rows/columns
  • Data types
  • Null counts
  • Sample rows

🧹 3. Removing Duplicates

```python
print(df.duplicated().sum())  # count fully duplicated rows
df = df.drop_duplicates()     # drop them
```
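
By default, `duplicated()` compares entire rows. If duplicates should be judged on only a few columns, both methods accept `subset` and `keep` arguments. A minimal sketch (the column choice here is purely illustrative):

```python
# Treat rows with the same Name and Pclass as duplicates (illustrative columns)
print(df.duplicated(subset=['Name', 'Pclass']).sum())

# Keep the first occurrence of each such pair, drop the rest
df = df.drop_duplicates(subset=['Name', 'Pclass'], keep='first')
```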


🔍 4. Handling Missing Values

Check how much is missing:

```python
print(df.isnull().sum())
```

Common Strategies:

| Column Type | Strategy | Example Code |
| --- | --- | --- |
| Numeric | Fill with mean or median | `df['Age'].fillna(df['Age'].mean())` |
| Categorical | Fill with mode | `df['Embarked'].fillna(df['Embarked'].mode()[0])` |
| Any type | Drop rows/columns | `df.dropna()` or `df.drop(columns=['Cabin'])` |
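
Note that `fillna()` returns a new Series rather than modifying the column, so assign the result back. A minimal sketch applying the strategies above:

```python
# Assign the filled Series back to its column
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop a column that is mostly empty
df = df.drop(columns=['Cabin'])
```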

Visualize Missing Data:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)  # missing cells appear as bright bands
plt.show()
```


🧠 5. Fixing Data Types

```python
# Handle missing values first: astype(int) raises on NaN
df['Age'] = df['Age'].astype(int)
df['Date'] = pd.to_datetime(df['Date'])
```

Why it matters:

  • Integer and float columns behave differently in arithmetic, grouping, and some models
  • Dates must be parsed into proper datetime objects for time-based analysis

A defensive variant is sketched below.
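
Both conversions can fail on messy input: `astype(int)` raises on missing values, and `to_datetime` raises on unparseable strings by default. A defensive sketch using pandas' nullable `Int64` dtype and `errors='coerce'`:

```python
# Round fractional ages, then cast to nullable Int64, which tolerates missing values
df['Age'] = df['Age'].round().astype('Int64')

# Unparseable dates become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
```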

🧼 6. Cleaning Strings and Categories

Strip whitespace, fix case:

```python
df['Name'] = df['Name'].str.strip().str.title()
```

Replace or drop unwanted characters:

```python
df['City'] = df['City'].str.replace(r'[^a-zA-Z ]', '', regex=True)
```

Unify categories:

```python
df['Gender'] = df['Gender'].replace({'male': 'Male', 'M': 'Male'})
```


📚 7. Encoding Categorical Data

Label Encoding (for binary or ordinal data):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically: Female = 0, Male = 1
df['Gender'] = le.fit_transform(df['Gender'])
```

One-Hot Encoding (for nominal data):

```python
# drop_first=True avoids creating a redundant (collinear) column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
```


📏 8. Feature Scaling

Some ML models (like KNN, SVM) are sensitive to feature scale.

Standardization:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```

Normalization (0–1 range):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
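
One caveat: in a real project, fit the scaler on the training split only and reuse its statistics on the test split, so no test-set information leaks into training. A minimal sketch of that pattern:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train, test = train_test_split(df[['Age', 'Fare']], test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training data only
test_scaled = scaler.transform(test)        # reuse those statistics on the test data
```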


🚫 9. Outlier Detection and Treatment

Visual Detection:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df['Fare'])
plt.show()
```

Remove using IQR:

```python
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df = df[(df['Fare'] >= lower) & (df['Fare'] <= upper)]
```
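
Dropping rows is not the only treatment. If you would rather keep every row, the same bounds can cap extreme values instead; a short sketch using `clip` with the `lower`/`upper` computed above:

```python
# Cap extreme fares at the IQR whiskers instead of removing the rows
df['Fare'] = df['Fare'].clip(lower=lower, upper=upper)
```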


🔧 10. Feature Engineering Basics

You might want to:

  • Create age groups: `df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 40, 60, 100], labels=['Teen', 'Adult', 'Senior', 'Old'])`
  • Extract date parts: `df['Year'] = df['Date'].dt.year`

📋 11. Preprocessing Checklist Before Modeling

| Task | Checked? |
| --- | --- |
| Missing values handled | ☐ |
| Data types corrected | ☐ |
| Categorical features encoded | ☐ |
| Text cleaned | ☐ |
| Outliers treated | ☐ |
| Features scaled | ☐ |
| Dataset split or ready (see the split sketch below) | ☐ |
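
The last item deserves its own snippet: split features from the target and hold out a test set before training. A minimal sketch, assuming the Titanic `Survived` column is the target:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Survived'])  # features
y = df['Survived']                 # target

# Hold out 20% for evaluation; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```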



⚙️ Full Pipeline Sample

```python
# Cleaned Titanic dataset preprocessing
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('titanic.csv')

# Drop columns with too many gaps or little predictive value
df.drop(columns=['Cabin', 'Ticket'], inplace=True)

# Fill missing values (assigning back avoids the deprecated chained-inplace pattern)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode the binary Sex column
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-hot encode the nominal Embarked column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Scale numeric features
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```


🧪 Bonus: Automate with Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

df[['Age', 'Fare']] = pipeline.fit_transform(df[['Age', 'Fare']])
```
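
The pipeline above only covers numeric columns. Scikit-learn's `ColumnTransformer` extends the same idea to mixed column types; a sketch (the column lists are illustrative for the Titanic data):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric = ['Age', 'Fare']
categorical = ['Sex', 'Embarked']

preprocess = ColumnTransformer([
    # Impute then scale numeric columns
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric),
    # Impute then one-hot encode categorical columns
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical),
])

X_ready = preprocess.fit_transform(df[numeric + categorical])
```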


📚 Summary Table

| Step | Function/Tool |
| --- | --- |
| Remove duplicates | `df.drop_duplicates()` |
| Missing values | `fillna()`, `dropna()` |
| Type conversion | `astype()`, `to_datetime()` |
| Encode categories | `LabelEncoder`, `get_dummies()` |
| Clean text fields | `str.strip()`, `str.lower()`, `replace()` |
| Outlier removal | IQR method + `boxplot()` |
| Scaling | `StandardScaler`, `MinMaxScaler` |




FAQs


1. What is the data science workflow, and why is it important?

Answer: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.

2. Do I need to follow the workflow in a strict order?

Answer: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.

3. What’s the difference between EDA and data cleaning?

Answer: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.

4. Is it okay to start modeling before completing feature engineering?

Answer: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.

5. What tools are best for building and evaluating models?

Answer: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.

6. How do I choose the right evaluation metric?

Answer: It depends on the problem:

  • For classification: accuracy, precision, recall, F1-score
  • For regression: MAE, RMSE, R²

Use domain knowledge to choose the metric that aligns with business goals; a short sketch of computing the classification metrics follows.
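
As a quick illustration, here is how those classification metrics might be computed with `sklearn.metrics`, assuming true labels `y_test` and model predictions `y_pred` already exist (hypothetical names):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compare held-out labels (y_test) against model predictions (y_pred)
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1-score :', f1_score(y_test, y_pred))
```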

7. What are some good deployment options for beginners?

Answer: Start with lightweight options like:

  • Streamlit or Gradio for dashboards
  • Flask or FastAPI for web APIs
  • Heroku or Render for hosting; both are simple to set up, and small projects are cheap or free to run

8. How do I monitor a deployed model in production?

Answer: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.

9. Can I skip deployment if my goal is just learning?

Answer: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.

10. What’s the best way to practice the entire workflow?

Answer: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.