Uncover Hidden Stories, Trends, and Patterns in Your Data
🧠 Introduction
After loading and cleaning your dataset, it's time to explore it deeply. This is where Exploratory Data Analysis (EDA) comes in.
EDA is the detective work in data science. It helps you:
In short, EDA builds your intuition about the data, which is essential before modeling.
In this chapter, you’ll learn:
🔍 1. Define the Goal of EDA
Before diving in, ask:
📦 2. Understand Variable Types
Variable Type | Description | Examples
Numerical | Quantitative values | Age, Salary, Score
Categorical | Discrete categories | Gender, City, Occupation
Ordinal | Ordered categories | Low, Medium, High
Datetime | Time-based | Date, TimeStamp
Check variable types:
python
df.dtypes
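As a rough sketch of how the four variable types above map onto pandas dtypes, the snippet below builds a small made-up DataFrame and converts each column to a fitting dtype. The column names and values are illustrative assumptions, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the text columns start out as plain object/string columns.
toy = pd.DataFrame({
    "Age": [23, 35, 41],                                  # numerical
    "City": ["Pune", "Delhi", "Pune"],                    # categorical (no order)
    "Risk": ["Low", "High", "Medium"],                    # ordinal (ordered)
    "Date": ["2024-01-05", "2024-02-10", "2024-03-15"],   # datetime stored as text
})

# Unordered categories
toy["City"] = toy["City"].astype("category")

# Ordered categories: the order is stated explicitly
toy["Risk"] = pd.Categorical(
    toy["Risk"], categories=["Low", "Medium", "High"], ordered=True
)

# Text to real datetimes
toy["Date"] = pd.to_datetime(toy["Date"])

print(toy.dtypes)

# Ordered categories support comparisons that plain strings would get wrong.
high_or_medium = toy[toy["Risk"] >= "Medium"]
print(high_or_medium)
```

Declaring the order up front is a design choice: sorting and filtering then follow Low < Medium < High rather than alphabetical order.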
📈 3. Univariate Analysis (One Variable at a Time)
🔹 Numerical Variables
▶ Histogram & Density Plot
python
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['Age'], kde=True)
plt.title('Age Distribution')
plt.show()
▶ Boxplot
python
sns.boxplot(x=df['Age'])
Use .describe() for a quick summary:
python
df['Age'].describe()
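One way to read the .describe() output is to compare the mean with the median (the 50% row): a large gap hints at skew or outliers. The sketch below uses a made-up list of ages with one extreme value; the numbers are assumptions chosen only to show the effect.

```python
import pandas as pd

# Hypothetical ages with one unusually large value (80).
ages = pd.Series([22, 25, 27, 30, 31, 35, 80], name="Age")

summary = ages.describe()
mean_age = summary["mean"]     # pulled upward by the extreme value
median_age = summary["50%"]    # robust to the extreme value
skewness = ages.skew()         # positive means a longer right tail

print(summary)
print(f"mean={mean_age:.2f}, median={median_age:.2f}, skew={skewness:.2f}")
```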
🔹 Categorical Variables
python
df['Gender'].value_counts().plot(kind='bar')
plt.title('Gender Count')
plt.xticks(rotation=0)
plt.show()
Use .value_counts(normalize=True) to see each category's share instead of raw counts:
python
df['Gender'].value_counts(normalize=True)
🔗 4. Bivariate Analysis (Two Variables)
Explore relationships between pairs of variables. The right plot or table depends on the types involved:
🔹 Numeric vs Numeric: Scatter Plot
python
sns.scatterplot(x='Age', y='Income', data=df)
🔹 Categorical vs Numeric: Boxplot
python
sns.boxplot(x='Gender', y='Income', data=df)
🔹 Categorical vs Categorical: Cross Tab
python
pd.crosstab(df['Gender'], df['Survived'], normalize='index')
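To make the effect of normalize='index' concrete, here is a minimal sketch on a made-up seven-row table. The data is an assumption for illustration only. With normalize='index', each row is divided by its own total, so every row sums to 1.

```python
import pandas as pd

# Hypothetical records: 3 women (2 survived) and 4 men (1 survived).
toy = pd.DataFrame({
    "Gender":   ["F", "F", "F", "M", "M", "M", "M"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Raw counts per (Gender, Survived) cell
counts = pd.crosstab(toy["Gender"], toy["Survived"])

# Row-wise proportions: share of each outcome within a gender
row_share = pd.crosstab(toy["Gender"], toy["Survived"], normalize="index")

print(counts)
print(row_share.round(2))
```

Row-wise shares are usually the easier view when the groups differ in size, since raw counts mix group size with the outcome rate.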
🔁 5. Multivariate Analysis (Three or More Variables)
▶ Pairplot
python
sns.pairplot(df[['Age', 'Fare', 'Survived']], hue='Survived')
▶ Grouped Summary
python
df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()
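The grouped summary above can be sketched on a tiny made-up table to show what .unstack() produces. The six rows below are invented for illustration. The sketch also shows pivot_table as an equivalent route, and adds group counts, since a mean computed from one or two rows is weak evidence.

```python
import pandas as pd

# Hypothetical passenger records, invented for illustration.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Sex":      ["female", "male", "male", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Mean survival rate per (Pclass, Sex), with Sex spread across columns
rate_table = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()

# The same table built with pivot_table
pivot = toy.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
)

# Rates next to group sizes
summary = toy.groupby(["Pclass", "Sex"])["Survived"].agg(["mean", "count"])

print(rate_table)
print(summary)
```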
📊 6. Correlation Analysis
Use .corr() to analyze numeric relationships:
python
# numeric_only=True skips text columns, which raise an error in pandas 2.0+
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
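By default .corr() computes Pearson correlation, which measures linear association only. The sketch below, on made-up data, contrasts it with Spearman rank correlation for a relationship that is perfectly monotonic but not linear. The column names and values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 8, 27, 64, 125, 216],   # y = x ** 3: monotonic but not linear
    "label": list("abcabc"),          # non-numeric column, ignored below
})

# Pearson (default): strength of the linear relationship
pearson = toy.corr(numeric_only=True)

# Spearman: strength of the monotonic relationship, based on ranks
spearman = toy.corr(method="spearman", numeric_only=True)

print(pearson.round(3))
print(spearman.round(3))
```

Comparing the two matrices is a cheap check: a pair that scores noticeably higher under Spearman than Pearson may have a curved relationship worth plotting.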
⚠️ Note:
🔍 7. Outlier Detection
Use boxplots and the IQR (interquartile range) method.
python
sns.boxplot(x=df['Fare'])
python
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_outliers = df[(df['Fare'] < lower_bound) | (df['Fare'] > upper_bound)]
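The IQR steps above can be wrapped in a small reusable helper. This is a minimal sketch: the function name iqr_outliers and the sample fares are hypothetical, and the multiplier k=1.5 is the conventional default rather than a fixed rule.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Hypothetical fares with one extreme value
fares = pd.Series([7, 8, 8, 9, 10, 11, 12, 13, 14, 250], name="Fare")

outliers = iqr_outliers(fares)
print(outliers)
```

Here Q1 is 8.25 and Q3 is 12.75, so the bounds are 1.5 and 19.5, and only the value 250 is flagged. Flagged values are candidates for inspection, not automatic deletions.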
📅 8. Time Series Analysis (if applicable)
If you have a datetime column:
python
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# 'M' groups by month end; pandas 2.2+ deprecates this alias in favor of 'ME'
df['Sales'].resample('M').sum().plot()
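As a small sketch of what resampling does, the snippet below builds 60 days of made-up sales (one sale per day, starting 2024-01-01) and aggregates them by month. The data is an assumption for illustration. It uses the 'MS' (month start) alias, which labels each bucket by the first day of the month, and adds a 7-day rolling mean as one common way to smooth daily noise.

```python
import pandas as pd

# Hypothetical daily data: dates stored as text, one sale per day.
dates = pd.date_range("2024-01-01", periods=60, freq="D")
toy = pd.DataFrame({"Date": dates.strftime("%Y-%m-%d"), "Sales": 1})

# Parse the text dates and use them as the index
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.set_index("Date")

# Monthly totals, labelled by the first day of each month
monthly = toy["Sales"].resample("MS").sum()

# 7-day rolling mean; the first 6 days have no full window and are NaN
rolling = toy["Sales"].rolling(7).mean()

print(monthly)
```

With this data the monthly totals are 31 for January and 29 for February 2024.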
🧠 9. Ask Analytical Questions
Question Type | Example
Descriptive | What's the average salary?
Comparative | Do men earn more than women?
Associative | Does education level affect survival?
Temporal | How do sales vary by month?
Grouped | What is the churn rate per product category?
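Each question type in the table usually maps to a short pandas expression. The sketch below answers three of them on a made-up four-row table. The column names (Gender, Salary, Category, Churned) and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical customer records, invented for illustration.
toy = pd.DataFrame({
    "Gender":   ["M", "F", "M", "F"],
    "Salary":   [50, 60, 70, 80],
    "Category": ["A", "A", "B", "B"],
    "Churned":  [1, 0, 0, 0],
})

# Descriptive: what's the average salary?
avg_salary = toy["Salary"].mean()

# Comparative: how do salaries differ between groups?
median_by_gender = toy.groupby("Gender")["Salary"].median()

# Grouped: what is the churn rate per product category?
churn_by_category = toy.groupby("Category")["Churned"].mean()

print(avg_salary)
print(median_by_gender)
print(churn_by_category)
```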
🧪 10. Tools Summary Table
Tool/Method | Use Case
df.describe() | Numeric summary
value_counts() | Categorical count
histplot() | Distribution plot
boxplot() | Outlier detection & group comparison
scatterplot() | Relationship between numeric variables
pairplot() | Multi-variable scatter matrix
heatmap(corr()) | Feature correlation
groupby().mean() | Summary statistics for grouped values
resample() | Time-series aggregation
✅ Full Example: Titanic Dataset EDA
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('titanic.csv')

# Overview
print(df.shape)
df.info()  # info() prints its report directly and returns None

# Univariate: Age
sns.histplot(df['Age'].dropna(), kde=True)
plt.title('Age Distribution')
plt.show()

# Bivariate: Gender vs Survival
sns.barplot(x='Sex', y='Survived', data=df)
plt.show()

# Multivariate: Pclass & Gender vs Survival
df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack().plot(kind='bar')
plt.show()

# Correlation (numeric columns only; text columns raise an error in pandas 2.0+)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='YlGnBu')
plt.show()

# Outliers
sns.boxplot(x='Fare', data=df)
plt.show()
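The overview step in the example relies on shape and info(). As a complementary sketch, the helper below collects dtype, missing-value count, and number of distinct values per column into one table. The function name overview and the sample rows are hypothetical; they are not part of the Titanic file.

```python
import pandas as pd

def overview(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, missing %, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),   # NaN is not counted as a distinct value
    })

# Hypothetical rows with a few gaps
toy = pd.DataFrame({
    "Age":  [22.0, None, 35.0, 58.0],
    "Sex":  ["male", "female", "female", None],
    "Fare": [7.25, 71.28, 8.05, 26.55],
})

report = overview(toy)
print(report)
```

A table like this makes it easier to decide where a .dropna() or an imputation step is needed before plotting.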
Answer: Not at all. Basic knowledge of statistics is helpful, but you can start your first project with a beginner-friendly dataset and learn concepts like mean, median, correlation, and regression as you go.
Answer: Python is the most popular and beginner-friendly choice, thanks to its simplicity and powerful libraries like Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.
Answer: Great sources include:
Answer:
Answer: Keep it small and manageable — one target variable, 3–6 features, and under 10,000 rows of data. Focus more on understanding the process than building a complex model.
Answer: Yes, but keep it simple. Start with linear regression, logistic regression, or decision trees. Avoid deep learning or complex models until you're more confident.
Answer: Use:
Answer: Use:
Answer: It depends on your task:
Answer: Absolutely! A well-documented project with clear insights, code, and visualizations is a great way to show employers that you understand the end-to-end data science process.