The Foundation for Accurate Imputation and Reliable Data Science
🧠 Introduction
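Handling missing data is one of the most overlooked, yet most impactful, skills in data science.
Before you decide how to handle missing data, you need to understand why it's missing.
This chapter explores what missing data is and what causes it, the three types of missingness (MCAR, MAR, and MNAR), how to detect and quantify missing values in pandas, when missingness is itself informative, and how to choose a handling strategy.
This foundational knowledge will guide every decision you make in future data cleaning and modeling tasks.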
📘 1. What is Missing Data?
Missing data refers to the absence of a value in a dataset. It's commonly represented by NaN, None, NULL, an empty string, or a placeholder such as "?", "Unknown", or -1.
✅ Real-World Causes of Missing Data
Cause | Example
Human error | Data entry skipped accidentally
Privacy concerns | Users choose not to disclose income
System failures | Sensor went offline or API failed
Unlinked datasets | Join operation failed to match keys
Survey structure | Skipped optional fields in feedback forms
Domain logic | "Not applicable" fields (e.g., pregnancy question for men)
🔎 2. Types of Missing Data
Understanding the type of missingness helps choose the right
handling strategy.
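📂 MCAR: Missing Completely At Random
The probability that a value is missing is unrelated to any variable in the dataset, observed or unobserved.
python
# Example check: compare stats of missing vs. non-missing groups.
# Similar group means are consistent with MCAR but do not prove it.
df['Age_missing'] = df['Age'].isnull()
df.groupby('Age_missing')['Fare'].mean()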
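Checking a single column can hide patterns elsewhere in the data. The sketch below extends the same comparison to every numeric column at once. It assumes a small synthetic DataFrame, and the helper name `compare_by_missingness` is hypothetical. A large gap between the two groups hints that the data are not MCAR; a small gap is not proof of MCAR.

```python
import numpy as np
import pandas as pd


def compare_by_missingness(df, target):
    """Compare numeric column means between rows where `target` is missing vs. present.

    Assumes `target` has at least one missing and one present value.
    """
    flag = df[target].isnull()
    numeric = df.select_dtypes(include="number").drop(columns=[target], errors="ignore")
    means = numeric.groupby(flag).mean().T
    means = means.rename(columns={True: "missing", False: "present"})
    # Standardized difference: gap in means scaled by the column's overall std.
    means["std_diff"] = (means["missing"] - means["present"]) / numeric.std()
    return means


# Synthetic data (illustrative only): Age is missing only for the cheapest fares.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 58.0, 41.0],
    "Fare": [70.0, 8.0, 55.0, 7.0, 80.0, 60.0],
})
report = compare_by_missingness(demo, "Age")
print(report)
```

Here the rows with a missing Age have a much lower average Fare, which is the kind of pattern that argues against MCAR.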
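📂 MAR: Missing At Random
The probability that a value is missing depends only on other observed variables (here, Education), not on the missing value itself.
python
# Impute based on a related feature: fill each Income gap with the median of its Education group.
# Using fillna on the original column leaves observed values untouched, even where Education is missing.
df['Income'] = df['Income'].fillna(df.groupby('Education')['Income'].transform('median'))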
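Group-wise imputation leaves a gap when an entire group has no observed values, because that group's median is itself NaN. A minimal sketch of one way to handle this, assuming synthetic data and a hypothetical `Income_imputed` column, falls back to the overall median:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Education": ["HS", "HS", "BSc", "BSc", "PhD", "PhD"],
    "Income": [30.0, np.nan, 50.0, 54.0, np.nan, np.nan],
})

# Step 1: fill with the median of the row's Education group.
group_median = demo.groupby("Education")["Income"].transform("median")
demo["Income_imputed"] = demo["Income"].fillna(group_median)

# Step 2: groups with no observed values (PhD here) are still NaN,
# so fall back to the overall median of the observed incomes.
overall_median = demo["Income"].median()
demo["Income_imputed"] = demo["Income_imputed"].fillna(overall_median)
print(demo)
```

Writing to a separate column keeps the original values available for comparison. Whether an overall median is an acceptable fallback depends on the dataset.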
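📂 MNAR: Missing Not At Random
The probability that a value is missing depends on the unobserved value itself, for example when users decline to disclose their income because of what that income is.
⚠️ You can't detect MNAR from the data alone; external context is needed.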
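Because MNAR cannot be confirmed from the data, one common response is a sensitivity analysis: impute under several assumptions about the unobserved values and see how much the result moves. The sketch below is illustrative only. The data are synthetic and the `delta` shifts are hypothetical scenarios, not estimates.

```python
import numpy as np
import pandas as pd

income = pd.Series([40.0, 55.0, np.nan, 48.0, np.nan, 61.0], name="Income")
observed_median = income.median()

# Hypothetical scenarios: assume the unobserved incomes sit `delta` above
# (or below) the observed median, then recompute the mean each time.
results = {}
for delta in (-10.0, 0.0, 10.0, 20.0):
    filled = income.fillna(observed_median + delta)
    results[delta] = filled.mean()

sensitivity = pd.Series(results, name="mean_income")
print(sensitivity)
```

If the conclusion changes substantially across plausible scenarios, the analysis is sensitive to the MNAR assumption and the imputation should be reported with that caveat.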
🛠️ 3. How to Detect Missing Data in Pandas
python
import pandas as pd

df = pd.read_csv("data.csv")

# Basic counts of missing values per column
df.isnull().sum()

# Percent missing per column
df.isnull().mean() * 100
Visualizing Missing Data
python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the boolean mask: missing cells appear in a contrasting color
sns.heatmap(df.isnull(), cbar=False)
plt.show()
Use missingno for an even better visual:
python
import missingno as msno

msno.matrix(df)   # row-by-row view of where values are missing
msno.heatmap(df)  # correlation of missingness between columns
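The idea behind a missingness heatmap can also be reproduced with pandas alone, which is useful when a plotting library is not available. The sketch below, using synthetic data, computes the correlation between the missing-value indicators of each column. It approximates the concept and is not missingno's implementation.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Income": [50.0, np.nan, 62.0, np.nan, 48.0],
    "Savings": [10.0, np.nan, 12.0, np.nan, 9.0],
    "Age": [25.0, 31.0, np.nan, 40.0, 52.0],
})

# 1 where a value is missing, 0 where it is present.
nullity = demo.isnull().astype(int)
# Keep only columns with both missing and present values;
# correlation is undefined for constant columns.
nullity = nullity.loc[:, nullity.nunique() > 1]
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing together (Income and Savings here), which often points to a shared cause such as a skipped form section.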
🧪 4. Quantifying the Impact of Missing Data
➤ Count of missing values per column
python
missing = df.isnull().sum()
percent_missing = df.isnull().mean() * 100
summary = pd.DataFrame({'Missing Count': missing, 'Percent': percent_missing})
# Sort after building the table so the ordering is not lost during index alignment
summary.sort_values('Missing Count', ascending=False)
➤ Drop if over threshold
python
threshold = 0.5  # drop columns with 50% or more missing values
df = df.loc[:, df.isnull().mean() < threshold]
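Dropping columns silently makes the decision hard to audit later. A minimal sketch, assuming synthetic data and a hypothetical helper named `drop_sparse_columns`, applies the same threshold rule but also returns the names of the columns it removed:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.5):
    """Drop columns whose missing fraction is >= threshold.

    Returns the reduced DataFrame and the list of dropped column names.
    """
    missing_fraction = df.isnull().mean()
    dropped = missing_fraction[missing_fraction >= threshold].index.tolist()
    return df.drop(columns=dropped), dropped


demo = pd.DataFrame({
    "Age": [34.0, np.nan, 51.0, 29.0],
    "Zip Code": [np.nan, np.nan, "02118", np.nan],
    "Gender": ["F", "M", "F", "M"],
})
cleaned, dropped = drop_sparse_columns(demo, threshold=0.5)
print("Dropped:", dropped)
print(cleaned.columns.tolist())
```

Logging the `dropped` list is one simple way to document the decision.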
Example Table: Missing Summary
Column | Data Type | Missing % | Likely Type | Suggested Action
Age | float | 12.5% | MAR | Group-wise imputation
Gender | object | 0% | N/A | Use directly
Income | float | 28.7% | MNAR | Add missing flag, predictive model
Zip Code | object | 53.1% | MCAR | Drop or ignore
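The first three columns of a summary like this can be generated directly from a DataFrame. The likely type and suggested action still require human judgment. The sketch below uses synthetic data and a hypothetical helper named `missing_summary`:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return dtype, missing count, and missing percent per column, worst first."""
    summary = pd.DataFrame({
        "Data Type": df.dtypes.astype(str),
        "Missing Count": df.isnull().sum(),
        "Missing %": (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("Missing %", ascending=False)


demo = pd.DataFrame({
    "Age": [34.0, np.nan, 51.0, 29.0],
    "Gender": ["F", "M", "F", "M"],
    "Income": [np.nan, np.nan, 72000.0, 58000.0],
})
summary = missing_summary(demo)
print(summary)
```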
📈 5. When is Missingness Informative?
Sometimes missingness is itself a feature. For example, users who choose not to disclose their income may differ systematically from those who do.
➤ Solution: Create missing indicators
python
# 1 if Income was missing, 0 otherwise (create the flag before imputing)
df['Income_missing'] = df['Income'].isnull().astype(int)
✅ Use these flags as additional input features for your model.
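When several columns have gaps, the same flag can be created for each of them in one pass. This sketch assumes synthetic data. The helper name `add_missing_indicators` and the `_missing` suffix are illustrative choices.

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df, suffix="_missing"):
    """Return a copy of df with a 0/1 flag column for every column that has missing values."""
    out = df.copy()
    has_missing = df.isnull().any().to_numpy()
    for col in df.columns[has_missing]:
        out[col + suffix] = df[col].isnull().astype(int)
    return out


demo = pd.DataFrame({
    "Age": [34.0, np.nan, 51.0],
    "Gender": ["F", "M", "F"],
    "Income": [np.nan, 58000.0, np.nan],
})
flagged = add_missing_indicators(demo)
print(flagged)
```

The flags must be computed before any imputation step, otherwise there is nothing left to flag.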
🧠 6. Decision Framework: What to Do Next?
Missing Type | % Missing | Action
MCAR | < 10% | Drop or fill with mean/median
MAR | 10–30% | Group-wise imputation or model-based
MNAR | Any | Add indicator + impute conservatively
Any type | > 50% | Consider dropping the feature
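The framework above can be encoded as a small lookup function so the rule is applied consistently. This is a sketch with two assumptions the table does not state: the "> 50%" rule takes precedence over the others, and combinations the table does not cover return a manual-review message. The function name `suggest_action` is hypothetical.

```python
def suggest_action(missing_type, pct_missing):
    """Map a missingness type and percentage to the action in the decision table.

    Assumptions (not stated in the table): the > 50% rule wins over the
    type-specific rules, and uncovered combinations are sent to manual review.
    """
    if pct_missing > 50:
        return "Consider dropping the feature"
    if missing_type == "MNAR":
        return "Add indicator + impute conservatively"
    if missing_type == "MAR" and 10 <= pct_missing <= 30:
        return "Group-wise imputation or model-based"
    if missing_type == "MCAR" and pct_missing < 10:
        return "Drop or fill with mean/median"
    return "No rule in the table; review manually"


for mtype, pct in [("MCAR", 5), ("MAR", 20), ("MNAR", 28.7), ("MCAR", 53.1), ("MCAR", 25)]:
    print(mtype, pct, "->", suggest_action(mtype, pct))
```

The thresholds are rules of thumb, so a function like this is a starting point for review and not a replacement for it.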
🔄 7. Handling Edge Cases
✅ Non-Standard Nulls
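python
import numpy as np

# Replace placeholder strings with NaN.
# '-1' matches only the string; add the number -1 to the list for numeric columns.
df = df.replace(['?', 'Unknown', '-1'], np.nan)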
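Placeholders can also be converted at load time, so numeric columns are parsed as numbers from the start. The sketch below uses the `na_values` parameter of `pd.read_csv` on an in-memory CSV. The file contents are synthetic.

```python
import io

import pandas as pd

raw = io.StringIO(
    "Age,City,Income\n"
    "34,Boston,52000\n"
    "?,Unknown,-1\n"
    "41,Denver,61000\n"
)

# Treat these placeholder strings as missing while parsing.
demo = pd.read_csv(raw, na_values=["?", "Unknown", "-1"])
print(demo.isnull().sum())
print(demo.dtypes)
```

If -1 is a legitimate value in some columns, `na_values` also accepts a dict that maps column names to their own placeholder lists.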
✅ Categorical with Missingness
python
# Fill with a new category (assign back; inplace on a selected column is unreliable in recent pandas)
df['State'] = df['State'].fillna('Missing')
✅ Timestamp columns
python
# Use forward fill for time series (fillna(method='ffill') is deprecated in recent pandas)
df['Order Date'] = df['Order Date'].ffill()
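Forward fill depends on row order, and it can paper over long outages if left unbounded. A minimal sketch, assuming synthetic sensor readings with hypothetical column names, sorts by time first and uses the `limit` argument of `ffill` to cap how far a value is carried:

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-03", "2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]
    ),
    "temperature": [np.nan, 20.0, np.nan, np.nan, 23.0],
})

# Forward fill only makes sense in chronological order.
readings = readings.sort_values("timestamp").reset_index(drop=True)
# limit=1 carries a value forward at most one step, so longer gaps stay visible.
readings["temperature_ffill"] = readings["temperature"].ffill(limit=1)
print(readings)
```

The right limit depends on how quickly the measured quantity changes, so treat the value here as a placeholder.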
💡 8. Real-World Example: Healthcare Dataset
python
# Share of missing BMI values in the medical dataset (about 16% in this example)
df['BMI'].isnull().mean()

# Impute using the median BMI of each Gender and Age-bin group
age_bins = pd.cut(df['Age'], bins=5)
group_median = df.groupby(['Gender', age_bins], observed=True)['BMI'].transform('median')
df['BMI'] = df['BMI'].fillna(group_median)
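Median imputation tends to shrink a column's spread, so it is worth comparing summary statistics before and after. The sketch below does this on synthetic patient data. It groups by Gender only to stay short, and the `BMI_imputed` column name is hypothetical.

```python
import numpy as np
import pandas as pd

patients = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "BMI": [22.0, np.nan, 26.0, 28.0, 30.0, np.nan],
})

# Fill each gap with the median BMI of the patient's Gender group.
group_median = patients.groupby("Gender")["BMI"].transform("median")
patients["BMI_imputed"] = patients["BMI"].fillna(group_median)

# Compare count, mean, and standard deviation before and after imputation.
comparison = pd.DataFrame({
    "before": patients["BMI"].agg(["count", "mean", "std"]),
    "after": patients["BMI_imputed"].agg(["count", "mean", "std"]),
})
print(comparison)
```

A noticeably smaller standard deviation after imputation is expected with this method and is one reason to keep a missing-value flag alongside the imputed column.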
✅ Best Practices
Tip | Description
Always analyze before cleaning | Context defines correctness
Imputation ≠ guessing | Use logic or evidence
Test models with and without imputation | Check impact
Use pipelines for production | Automate the process
Document every decision | Transparency is key
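To illustrate the pipeline tip, the sketch below wraps imputation and a model in a scikit-learn `Pipeline`, so the medians learned during `fit` are reused at prediction time. The data, the choice of median imputation, and the choice of logistic regression are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features (age, income) with gaps, and a binary target.
X = np.array([
    [25.0, 50000.0],
    [np.nan, 62000.0],
    [47.0, np.nan],
    [52.0, 81000.0],
    [33.0, 40000.0],
    [np.nan, 75000.0],
])
y = np.array([0, 1, 0, 1, 0, 1])

pipeline = Pipeline([
    # add_indicator=True appends a 0/1 flag column per feature that had missing values.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

transformed = pipeline.named_steps["impute"].transform(X)
predictions = pipeline.predict(X)
print(transformed.shape)
print(predictions)
```

Keeping imputation inside the pipeline also prevents statistics from the test set leaking into training during cross-validation.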
Answer: Missing data can result from system errors, human omission, privacy constraints, sensor failures, or survey respondents skipping questions. It can also be intentional (e.g., optional fields).
Answer: Use Pandas functions like df.isnull().sum() or visualize missingness using the missingno or seaborn heatmap to understand the extent and pattern of missing data.
Answer: No. Dropping rows is acceptable only when the number of missing entries is minimal. Otherwise, it can lead to data loss and bias. Consider imputation or flagging instead.
Answer: If the distribution is normal, use the mean. If it's skewed, use the median. For more advanced tasks, consider KNN imputation or iterative modeling.
Answer: You can fill them using the mode, group-based mode, or assign a new category like "Unknown" or "Missing" — especially if missingness is meaningful.
Answer: Yes. Imputers such as KNNImputer or IterativeImputer (inspired by MICE), optionally with a random-forest estimator, can predict missing values from other columns. This works best when missingness depends on other observed variables (MAR).
Answer: Data drift refers to changes in the data distribution over time. If drift occurs, previously rare missing values may increase, or your imputation logic may become outdated — requiring updates.
Answer: Absolutely. Creating a binary feature like column_missing = df['column'].isnull() can help the model learn if missingness correlates with the target variable.
Answer: Yes — unhandled missing values can cause models to crash, reduce accuracy, or introduce bias. Proper handling improves both robustness and generalizability.
Answer: Libraries like scikit-learn (for imputation pipelines), fancyimpute, Evidently, DVC, and YData Profiling are great for automating detection, imputation, and documentation.