Data Science Workflow: From Problem to Solution – A Complete Step-by-Step Journey for Beginners


📗 Chapter 4: Exploratory Data Analysis (EDA)

Uncovering Patterns, Trends, and Insights Before Modeling


🧠 Introduction

Before jumping into model building, a data scientist must first explore the data — not just technically, but intellectually and visually.

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, detect anomalies, and discover patterns using statistics and visualization.

EDA helps answer questions like:

  • What are the distributions of key features?
  • Are there any outliers?
  • How do features relate to one another?
  • Do we need feature transformation or binning?

This chapter will teach you how to:

  • Perform univariate, bivariate, and multivariate analysis
  • Visualize relationships using Python
  • Understand feature correlations
  • Prepare insights for stakeholders and modeling

📂 1. What is EDA?

Exploratory Data Analysis is the process of:

  • Profiling your data: Types, distributions, missingness
  • Visualizing relationships: Target variable vs. features
  • Generating hypotheses: What matters for prediction?
  • Identifying issues: Imbalances, collinearity, anomalies

It’s non-linear and iterative — you may return to data cleaning or feature engineering as you uncover insights.


🔍 2. Overview of the Dataset

We’ll use the Titanic dataset as an example.

```python
import pandas as pd

df = pd.read_csv('titanic.csv')
df.head()
```

Basic Info

```python
df.info()
df.describe()
df.isnull().sum()
```


📊 3. Univariate Analysis

Analysis of individual features.

Numerical Features

Histogram + KDE plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['Age'], kde=True)
plt.title('Age Distribution')
```

Boxplot for outliers:

```python
sns.boxplot(x=df['Age'])
```

Summary stats:

```python
df['Fare'].describe()
```


Categorical Features

Bar chart for counts:

```python
sns.countplot(x='Sex', data=df)
```

Value counts:

```python
df['Embarked'].value_counts(normalize=True)
```


🔗 4. Bivariate Analysis

Explore how one feature affects another, especially with respect to the target variable.

Categorical vs Target

```python
sns.barplot(x='Sex', y='Survived', data=df)
```

Numerical vs Target

```python
sns.boxplot(x='Survived', y='Fare', data=df)
```

Groupby statistics:

```python
df.groupby('Pclass')['Survived'].mean()
```


🔄 5. Multivariate Analysis

Analyzing interactions between multiple variables.

Pairplot

```python
sns.pairplot(df[['Age', 'Fare', 'Survived']], hue='Survived')
```

Heatmap of Correlation Matrix

```python
# Restrict to numeric columns; in recent pandas, df.corr() raises
# an error on mixed-type frames like Titanic's.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
```

What to look for:

  • Strong linear relationships
  • Redundant features (correlation > 0.85)
  • Potential predictors of target variable
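As a rough screen for the redundancy noted above, you can list feature pairs whose absolute correlation exceeds a threshold. The `high_corr_pairs` helper below is a hypothetical sketch, demonstrated on a tiny made-up frame rather than the Titanic data:

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.85):
    """Return (feature, feature, correlation) for highly correlated pairs."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    return [
        (cols[i], cols[j], round(corr.iloc[i, j], 2))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > threshold
    ]

# Demo frame: 'a' and 'b' are perfectly correlated, 'c' is not.
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 0, 1, 0]})
print(high_corr_pairs(demo))  # [('a', 'b', 1.0)]
```

In practice you would keep one feature from each flagged pair, or combine them before modeling.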

📈 6. Feature vs. Feature

Sometimes you need to explore relationships between features.

```python
sns.scatterplot(x='Age', y='Fare', hue='Survived', data=df)
```

Grouped boxplot:

```python
sns.boxplot(x='Pclass', y='Age', hue='Survived', data=df)
```


🧪 7. Target Variable Analysis

Understanding your target variable is crucial.

```python
df['Survived'].value_counts(normalize=True).plot(kind='bar')
```

If imbalanced, consider:

  • Using stratified splits
  • Adjusting evaluation metrics
  • Using SMOTE or under/oversampling later
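A minimal sketch of the stratified-split idea, using `sklearn.model_selection.train_test_split` on a small made-up frame (the 80/20 class ratio below is illustrative, not the actual Titanic distribution):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 8 negatives, 2 positives.
df = pd.DataFrame({'feature': range(10),
                   'Survived': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# stratify=... makes each split preserve the 80/20 class ratio.
train, test = train_test_split(
    df, test_size=0.5, stratify=df['Survived'], random_state=42
)
print(train['Survived'].value_counts(normalize=True))
```

Without `stratify`, a random split on a small imbalanced dataset can easily leave one split with no positive examples at all.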

📊 8. Dealing with Skewed Features

```python
# pandas computes skewness directly; no scipy import is needed
skewness = df['Fare'].skew()
print("Skewness:", skewness)
```

Apply transformation:

```python
import numpy as np

df['Fare_log'] = np.log1p(df['Fare'])  # log(1 + x) handles zero fares
sns.histplot(df['Fare_log'], kde=True)
```


🧠 9. Questions to Ask During EDA

| Question | Why it Matters |
| --- | --- |
| Which variables are strongly correlated? | Helps with feature selection and reduction |
| Are any features heavily skewed? | May require transformation |
| Is the target variable imbalanced? | Affects model selection and evaluation |
| Do any variables have many missing values? | May be excluded or filled |
| Are there any obvious outliers? | May distort model training |


10. Summary Table Example

Here’s a snapshot you might prepare from EDA:


| Feature | Type | Missing % | Skewness | Correlation with Target |
| --- | --- | --- | --- | --- |
| Age | Numeric | 20% | 0.4 | -0.08 |
| Sex | Category | 0% | N/A | 0.54 |
| Fare | Numeric | 0% | 4.8 | 0.26 |
| Pclass | Ordinal | 0% | 0.8 | -0.31 |


🧪 EDA Summary Example for Titanic:


  • Age is right-skewed, with some outliers.
  • Sex has a strong relationship with survival.
  • Pclass is inversely related to survival.
  • Fare varies widely and benefits from log transformation.
  • Embarked has minor missingness; mode imputation is reasonable.
  • Class imbalance exists in the Survived variable.
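For the mode-imputation point above, here is a hedged sketch on a tiny stand-in `Embarked` column (not the full dataset):

```python
import pandas as pd

# Small illustrative column with one missing value.
df = pd.DataFrame({'Embarked': ['S', 'C', 'S', None, 'Q', 'S']})

# .mode() returns a Series (there can be ties); take the first value.
mode_value = df['Embarked'].mode()[0]          # 'S', the most frequent port
df['Embarked'] = df['Embarked'].fillna(mode_value)
print(df['Embarked'].isnull().sum())  # 0
```

Mode imputation is reasonable here only because the missingness is minor; with heavy missingness, a separate "Unknown" category is often safer.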


FAQs


1. What is the data science workflow, and why is it important?

Answer: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.

2. Do I need to follow the workflow in a strict order?

Answer: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.

3. What’s the difference between EDA and data cleaning?

Answer: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.

4. Is it okay to start modeling before completing feature engineering?

Answer: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.

5. What tools are best for building and evaluating models?

Answer: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.

6. How do I choose the right evaluation metric?

Answer: It depends on the problem:

  • For classification: accuracy, precision, recall, F1-score
  • For regression: MAE, RMSE, R²
  • Use domain knowledge to choose the metric that aligns with business goals.
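As a quick illustration of the classification metrics listed above, here is a sketch using `sklearn.metrics` on made-up labels:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Illustrative labels only, to show how the metric functions are called.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]  # one false positive, one false negative

print(accuracy_score(y_true, y_pred))   # 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```

On imbalanced data these four numbers diverge, which is exactly why accuracy alone can be misleading.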

7. What are some good deployment options for beginners?

Answer: Start with lightweight options like:

  • Streamlit or Gradio for dashboards
  • Flask or FastAPI for web APIs
  • Hosting on platforms like Render is straightforward for small projects (note that Heroku discontinued its free tier in 2022).

8. How do I monitor a deployed model in production?

Answer: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.

9. Can I skip deployment if my goal is just learning?

Answer: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.

10. What’s the best way to practice the entire workflow?

Answer: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.