Data Science Workflow: From Problem to Solution – A Complete Step-by-Step Journey for Beginners


📗 Chapter 3: Data Cleaning and Preprocessing

Transforming Raw Data into Machine Learning Gold


🧠 Introduction

Once your data is collected, it’s rarely ready for immediate use. It’s messy. It’s inconsistent. It has missing values, errors, and misaligned types. That’s why data cleaning and preprocessing is a cornerstone of every data science workflow.

It’s often estimated that 70–80% of a data scientist’s time goes into cleaning and preparing data, and for good reason: clean data leads to accurate models.

This chapter walks you through:

  • Handling missing data
  • Fixing data types and formatting
  • Dealing with duplicates
  • Encoding categorical values
  • Scaling and normalizing features
  • Treating outliers
  • Preparing your dataset for modeling

By the end, you’ll understand how to turn raw, chaotic data into something usable and valuable for machine learning and analytics.


📦 1. Why Clean Data Matters

| Without Cleaning | With Cleaning |
| --- | --- |
| Inaccurate models | Reliable predictions |
| Poor user experience | Trustworthy insights |
| Failures in production | Scalable pipelines |
| Misleading visualizations | Meaningful reports |


📊 2. Checking the Data Structure

Let’s load a sample dataset and see what we’re working with:

```python
import pandas as pd

df = pd.read_csv('titanic.csv')

print(df.shape)   # number of rows and columns
df.info()         # dtypes and non-null counts (info() prints directly, no print() needed)
print(df.head())  # first five rows
```

This gives us:

  • Number of rows/columns
  • Data types
  • Null counts
  • Sample rows

🧹 3. Removing Duplicates

```python
print(df.duplicated().sum())  # count fully duplicated rows
df = df.drop_duplicates()     # drop them
```
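
By default, `duplicated()` compares entire rows. If duplicates should be judged on only a few columns, both methods accept `subset` and `keep` arguments. A minimal sketch (the column choice here is purely illustrative):

```python
# Treat rows with the same Name and Pclass as duplicates (illustrative columns)
print(df.duplicated(subset=['Name', 'Pclass']).sum())

# Keep the first occurrence of each such pair, drop the rest
df = df.drop_duplicates(subset=['Name', 'Pclass'], keep='first')
```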


🔍 4. Handling Missing Values

Check how much is missing:

```python
print(df.isnull().sum())
```

Common Strategies:

| Column Type | Strategy | Example Code |
| --- | --- | --- |
| Numeric | Fill with mean or median | `df['Age'].fillna(df['Age'].mean())` |
| Categorical | Fill with mode | `df['Embarked'].fillna(df['Embarked'].mode()[0])` |
| Any type | Drop rows/columns | `df.dropna()` or `df.drop(columns=['Cabin'])` |
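
Note that `fillna()` returns a new Series rather than modifying the column, so assign the result back. A minimal sketch applying the strategies above:

```python
# Assign the filled Series back to its column
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop a column that is mostly empty
df = df.drop(columns=['Cabin'])
```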

Visualize Missing Data:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)  # missing cells appear as bright bands
plt.show()
```


🧠 5. Fixing Data Types

```python
# Handle missing values first: astype(int) raises on NaN
df['Age'] = df['Age'].astype(int)
df['Date'] = pd.to_datetime(df['Date'])
```

Why it matters:

  • Integer and float columns behave differently in arithmetic, grouping, and some models
  • Dates must be parsed into proper datetime objects for time-based analysis

A defensive variant is sketched below.
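
Both conversions can fail on messy input: `astype(int)` raises on missing values, and `to_datetime` raises on unparseable strings by default. A defensive sketch using pandas' nullable `Int64` dtype and `errors='coerce'`:

```python
# Round fractional ages, then cast to nullable Int64, which tolerates missing values
df['Age'] = df['Age'].round().astype('Int64')

# Unparseable dates become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
```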

🧼 6. Cleaning Strings and Categories

Strip whitespace, fix case:

```python
df['Name'] = df['Name'].str.strip().str.title()
```

Replace or drop unwanted characters:

```python
df['City'] = df['City'].str.replace(r'[^a-zA-Z ]', '', regex=True)
```

Unify categories:

```python
df['Gender'] = df['Gender'].replace({'male': 'Male', 'M': 'Male'})
```


📚 7. Encoding Categorical Data

Label Encoding (for binary or ordinal data):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically: Female = 0, Male = 1
df['Gender'] = le.fit_transform(df['Gender'])
```

One-Hot Encoding (for nominal data):

```python
# drop_first=True avoids creating a redundant (collinear) column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
```


📏 8. Feature Scaling

Some ML models (like KNN, SVM) are sensitive to feature scale.

Standardization:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```

Normalization (0–1 range):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
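
One caveat: in a real project, fit the scaler on the training split only and reuse its statistics on the test split, so no test-set information leaks into training. A minimal sketch of that pattern:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train, test = train_test_split(df[['Age', 'Fare']], test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training data only
test_scaled = scaler.transform(test)        # reuse those statistics on the test data
```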


🚫 9. Outlier Detection and Treatment

Visual Detection:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df['Fare'])
plt.show()
```

Remove using IQR:

```python
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df = df[(df['Fare'] >= lower) & (df['Fare'] <= upper)]
```
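
Dropping rows is not the only treatment. If you would rather keep every row, the same bounds can cap extreme values instead; a short sketch using `clip` with the `lower`/`upper` computed above:

```python
# Cap extreme fares at the IQR whiskers instead of removing the rows
df['Fare'] = df['Fare'].clip(lower=lower, upper=upper)
```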


🔧 10. Feature Engineering Basics

You might want to:

  • Create age groups: `df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 40, 60, 100], labels=['Teen', 'Adult', 'Senior', 'Old'])`
  • Extract date parts: `df['Year'] = df['Date'].dt.year`

📋 11. Preprocessing Checklist Before Modeling

| Task | Checked? |
| --- | --- |
| Missing values handled | ☐ |
| Data types corrected | ☐ |
| Categorical features encoded | ☐ |
| Text cleaned | ☐ |
| Outliers treated | ☐ |
| Features scaled | ☐ |
| Dataset split or ready (see the split sketch below) | ☐ |
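
The last item deserves its own snippet: split features from the target and hold out a test set before training. A minimal sketch, assuming the Titanic `Survived` column is the target:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Survived'])  # features
y = df['Survived']                 # target

# Hold out 20% for evaluation; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```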



⚙️ Full Pipeline Sample

```python
# Cleaned Titanic dataset preprocessing
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('titanic.csv')

# Drop columns with too many gaps or little predictive value
df.drop(columns=['Cabin', 'Ticket'], inplace=True)

# Fill missing values (assigning back avoids the deprecated chained-inplace pattern)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode the binary Sex column
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-hot encode the nominal Embarked column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Scale numeric features
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```


🧪 Bonus: Automate with Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

df[['Age', 'Fare']] = pipeline.fit_transform(df[['Age', 'Fare']])
```
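
The pipeline above only covers numeric columns. Scikit-learn's `ColumnTransformer` extends the same idea to mixed column types; a sketch (the column lists are illustrative for the Titanic data):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric = ['Age', 'Fare']
categorical = ['Sex', 'Embarked']

preprocess = ColumnTransformer([
    # Impute then scale numeric columns
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric),
    # Impute then one-hot encode categorical columns
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical),
])

X_ready = preprocess.fit_transform(df[numeric + categorical])
```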


📚 Summary Table

| Step | Function/Tool |
| --- | --- |
| Remove duplicates | `df.drop_duplicates()` |
| Missing values | `fillna()`, `dropna()` |
| Type conversion | `astype()`, `to_datetime()` |
| Encode categories | `LabelEncoder`, `get_dummies()` |
| Clean text fields | `str.strip()`, `str.lower()`, `replace()` |
| Outlier removal | IQR method + `boxplot()` |
| Scaling | `StandardScaler`, `MinMaxScaler` |




FAQs


1. What is the data science workflow, and why is it important?

Answer: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.

2. Do I need to follow the workflow in a strict order?

Answer: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.

3. What’s the difference between EDA and data cleaning?

Answer: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.

4. Is it okay to start modeling before completing feature engineering?

Answer: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.

5. What tools are best for building and evaluating models?

Answer: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.

6. How do I choose the right evaluation metric?

Answer: It depends on the problem:

  • For classification: accuracy, precision, recall, F1-score
  • For regression: MAE, RMSE, R²

Use domain knowledge to choose the metric that aligns with business goals; a short sketch of computing the classification metrics follows.
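
As a quick illustration, here is how those classification metrics might be computed with `sklearn.metrics`, assuming true labels `y_test` and model predictions `y_pred` already exist (hypothetical names):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compare held-out labels (y_test) against model predictions (y_pred)
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1-score :', f1_score(y_test, y_pred))
```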

7. What are some good deployment options for beginners?

Answer: Start with lightweight options like:

  • Streamlit or Gradio for dashboards
  • Flask or FastAPI for web APIs
  • Heroku or Render for hosting; both are simple to set up, and small projects are cheap or free to run

8. How do I monitor a deployed model in production?

Answer: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.

9. Can I skip deployment if my goal is just learning?

Answer: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.

10. What’s the best way to practice the entire workflow?

Answer: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.