Filling the Gaps in Your Data with Confidence and Simplicity
🧠 Introduction
Data is never perfect. Missing values can show up in almost
every dataset — but that doesn’t mean we have to lose that data.
Simple imputation is your first line of defense when facing
missing values. It allows you to fill in the blanks without
overcomplicating things.
In this chapter, we'll explore what imputation is, the main simple imputation methods (mean, median, mode, constant, and group-based), how to apply them in pandas and scikit-learn, and their pitfalls and best practices.
🔍 1. What Is Imputation?
Imputation is the process of replacing missing data with substituted values so the dataset remains usable for analysis and modeling.

❓ Why Not Just Drop Missing Values?

If you drop rows or columns with missing entries, you might lose valuable records, shrink your sample size, and introduce bias.

💡 Imputation keeps your data structure intact while minimizing information loss.
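A quick sketch of the trade-off, using a made-up five-row DataFrame (the column names and values are illustrative only):

```python
import pandas as pd
import numpy as np

# Toy dataset (hypothetical values) with gaps in both columns
df = pd.DataFrame({
    'Age':  [25.0, np.nan, 30.0, np.nan, 40.0],
    'City': ['NY', 'LA', np.nan, 'NY', 'LA'],
})

print(len(df.dropna()))  # only 2 of 5 rows survive dropping

# Imputing instead keeps every row and leaves no gaps
filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})
print(len(filled), int(filled.isnull().sum().sum()))
```

Here dropping discards most of the data, while a per-column fill preserves all five rows.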
📊 2. Types of Simple Imputation

| Method | Description | Best Used For |
| --- | --- | --- |
| Mean | Replaces missing values with the column average | Numeric, symmetric distributions |
| Median | Replaces with the middle value | Numeric, skewed data |
| Mode | Most frequent value | Categorical/ordinal variables |
| Constant | Fixed value like 0, -1, "Unknown" | Flags or missing-not-at-random |
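To see why the table recommends the median for skewed data, consider a made-up, right-skewed income sample (all figures are illustrative):

```python
import pandas as pd
import numpy as np

# Made-up, right-skewed incomes: one outlier dominates the mean
income = pd.Series([30_000.0, 32_000.0, 35_000.0, np.nan, 500_000.0])

print(income.mean())    # 149250.0, pulled far above typical values
print(income.median())  # 33500.0, close to what most rows look like
```

Filling the gap with the mean would insert a value far larger than any typical observation; the median stays representative.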
📦 3. Mean Imputation

```python
df['Age'] = df['Age'].fillna(df['Age'].mean())
```
| Pros | Cons |
| --- | --- |
| Easy to apply | Sensitive to outliers |
| Preserves dataset size | Can distort skewed distributions |
Example:

```python
print("Before:", df['Age'].isnull().sum())
df['Age'] = df['Age'].fillna(df['Age'].mean())
print("After:", df['Age'].isnull().sum())
```
📈 4. Median Imputation

```python
df['Income'] = df['Income'].fillna(df['Income'].median())
```

✅ Ideal for income, price, age — skewed or heavy-tailed distributions.
🧠 5. Mode Imputation (For Categorical Data)

```python
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
```

| Use case | Examples |
| --- | --- |
| Binary variables | Gender, Yes/No |
| Ordinal labels | Education, Ratings |
| Repetitive text | Cities, Categories |
🎯 6. Constant Value Imputation

```python
df['Marital_Status'] = df['Marital_Status'].fillna('Unknown')
df['Score'] = df['Score'].fillna(0)
```

🔐 Best for features where missingness has meaning (e.g., not applicable, no response).
📌 7. Using Scikit-learn's SimpleImputer

Import and Apply:

```python
from sklearn.impute import SimpleImputer

# For numeric
imputer = SimpleImputer(strategy='mean')
df[['Age']] = imputer.fit_transform(df[['Age']])

# For categorical
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['City']] = cat_imputer.fit_transform(df[['City']])
```
Table: Imputer Strategies in Scikit-learn

| Strategy | Use With | Code Parameter |
| --- | --- | --- |
| Mean | Numerical | strategy='mean' |
| Median | Numerical | strategy='median' |
| Most Frequent | Categorical | strategy='most_frequent' |
| Constant | Any type | strategy='constant', fill_value='Unknown' |
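The constant strategy from the last table row is the one not shown in code above; a minimal sketch, with toy column names (`City`, `Score`) assumed for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame; column names and values are illustrative
df = pd.DataFrame({'City': ['NY', np.nan, 'LA'], 'Score': [1.0, np.nan, 3.0]})

# Constant fill for a categorical column
cat_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
df[['City']] = cat_imputer.fit_transform(df[['City']])

# Constant fill for a numeric column
num_imputer = SimpleImputer(strategy='constant', fill_value=0)
df[['Score']] = num_imputer.fit_transform(df[['Score']])

print(df['City'].tolist(), df['Score'].tolist())
```

Unlike the other strategies, `fill_value` must be supplied explicitly, and its type should match the column.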
🧪 8. Column-by-Column Imputation Strategy

Example:

```python
# Numeric column - median
df['Income'] = df['Income'].fillna(df['Income'].median())

# Categorical column - mode
df['Education'] = df['Education'].fillna(df['Education'].mode()[0])

# Flag column - 0
df['Has_Loan'] = df['Has_Loan'].fillna(0)
```
👩‍🔬 9. Group-Based Simple Imputation

Sometimes, imputing by group mean/median makes more sense.

```python
df['Age'] = df.groupby('Gender')['Age'].transform(lambda x: x.fillna(x.mean()))
```

Why it's better: it respects differences between subgroups, so each missing value is filled with a value typical for its own group rather than a single global average.
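The same pattern works with the group median for skewed columns; a self-contained sketch on made-up data:

```python
import pandas as pd
import numpy as np

# Toy data: ages differ systematically between groups
df = pd.DataFrame({
    'Gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'Age':    [22.0, 24.0, np.nan, 35.0, np.nan, 45.0],
})

# Each gap gets its own group's median, not a global value
df['Age'] = df.groupby('Gender')['Age'].transform(lambda x: x.fillna(x.median()))
print(df['Age'].tolist())  # [22.0, 24.0, 23.0, 35.0, 40.0, 45.0]
```

The missing 'F' age becomes 23.0 and the missing 'M' age becomes 40.0, whereas a global median would have filled both with the same number.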
🧮 10. Impact on Distribution

Compare before/after:

```python
import seaborn as sns

# df_original is a copy of the data saved before imputation
sns.kdeplot(df_original['Age'], label='Original')
sns.kdeplot(df['Age'], label='After Imputation')
```

✅ Check if your imputation skewed the data.
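The same check can be done numerically instead of visually; a small sketch on made-up values showing the typical effect of mean imputation:

```python
import pandas as pd
import numpy as np

# Toy column with two gaps
age = pd.Series([20.0, 25.0, 30.0, np.nan, np.nan, 60.0])
filled = age.fillna(age.mean())

# Mean imputation leaves the mean untouched but shrinks the spread
print(round(age.std(), 2), round(filled.std(), 2))
print(age.mean() == filled.mean())
```

The standard deviation drops after filling, which is exactly the kind of distributional shift this step is meant to catch.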
💡 11. Pitfalls of Simple Imputation

| Pitfall | Explanation |
| --- | --- |
| Masking true variability | Adds false certainty |
| Inflated correlations | Mean-based fills can create fake patterns |
| Model bias | Imputing the target variable can bias scores |
| Non-representative values | One-size-fits-all doesn't fit rare cases |
📋 12. Summary Table of
Methods
Method |
Suitable For |
Function |
Good For |
Mean |
Numeric |
fillna(df.mean()) |
Balanced data |
Median |
Numeric |
fillna(df.median()) |
Skewed data |
Mode |
Categorical |
fillna(df.mode()[0]) |
Repetitive labels |
Constant |
Any type |
fillna('Unknown')
or 0 |
Flags or
unknowns |
Groupwise |
Numeric, Categorical |
groupby().transform() |
Context-aware filling |
✅ 13. Best Practices for Simple Imputation

| Tip | Reason |
| --- | --- |
| Always impute the target variable last, if at all | Avoids leakage |
| Fit imputers on the training set only, then apply them to the test set | Prevents data leakage in modeling |
| Validate results visually | Catch unintended bias or shifts |
| Document choices | Ensures reproducibility |
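The train/test tip above is easy to get wrong; a minimal sketch of the safe pattern, using a toy `Age` column and a fixed 4/2 split for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, split into train and test
df = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0, np.nan, 50.0]})
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

imputer = SimpleImputer(strategy='median')
train[['Age']] = imputer.fit_transform(train[['Age']])  # statistics learned on train only
test[['Age']] = imputer.transform(test[['Age']])        # train median reused on test

print(train['Age'].tolist(), test['Age'].tolist())
```

Calling `fit_transform` on the test set instead would let test-set statistics leak into preprocessing; here the test gap is filled with the training median (30.0), not the test set's own values.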
🧠 14. Use Case Example: Customer Churn Dataset

Scenario:

| Feature | Missing % | Chosen Method | Reason |
| --- | --- | --- | --- |
| Age | 9% | Group median by Gender | Skewed + structured |
| City | 3% | Mode | Few repeated categories |
| Income | 15% | Median | Skewed numeric |
| Is_Churned | 0% | Leave as is | Target variable |
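The plan in the table might be carried out like this; the DataFrame is an entirely made-up four-row slice of such a dataset:

```python
import pandas as pd
import numpy as np

# Made-up slice of a churn dataset matching the plan above
df = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M'],
    'Age':    [25.0, np.nan, 35.0, 40.0],
    'City':   ['NY', 'NY', np.nan, 'LA'],
    'Income': [30_000.0, 45_000.0, np.nan, 50_000.0],
})

# Age: group median by Gender
df['Age'] = df.groupby('Gender')['Age'].transform(lambda x: x.fillna(x.median()))
# City: mode
df['City'] = df['City'].fillna(df['City'].mode()[0])
# Income: overall median
df['Income'] = df['Income'].fillna(df['Income'].median())

print(int(df.isnull().sum().sum()))  # 0: every gap filled per the plan
```

The target column (Is_Churned) is deliberately absent from the fills, matching the "leave as is" row.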
❓ 15. Frequently Asked Questions

Q: What causes missing data?
Answer: Missing data can result from system errors, human omission, privacy constraints, sensor failures, or survey respondents skipping questions. It can also be intentional (e.g., optional fields).

Q: How do you detect missing values in a dataset?
Answer: Use Pandas functions like df.isnull().sum() or visualize missingness using missingno or a seaborn heatmap to understand the extent and pattern of missing data.

Q: Is it okay to simply drop rows with missing values?
Answer: No. Dropping rows is acceptable only when the number of missing entries is minimal. Otherwise, it can lead to data loss and bias. Consider imputation or flagging instead.

Q: Should you impute with the mean or the median?
Answer: If the distribution is normal, use the mean. If it's skewed, use the median. For more advanced tasks, consider KNN imputation or iterative modeling.

Q: How do you handle missing values in categorical columns?
Answer: You can fill them using the mode, group-based mode, or assign a new category like "Unknown" or "Missing" — especially if missingness is meaningful.

Q: Can machine learning models impute missing values?
Answer: Yes! Models like KNNImputer, Random Forests, or IterativeImputer (based on MICE) can predict missing values based on other columns, especially when missingness is not random.

Q: How does data drift affect imputation?
Answer: Data drift refers to changes in the data distribution over time. If drift occurs, previously rare missing values may increase, or your imputation logic may become outdated — requiring updates.

Q: Should you add a flag for missing values?
Answer: Absolutely. Creating a binary feature like column_missing = df['column'].isnull() can help the model learn if missingness correlates with the target variable.

Q: Do missing values affect model performance?
Answer: Yes — unhandled missing values can cause models to crash, reduce accuracy, or introduce bias. Proper handling improves both robustness and generalizability.

Q: Which tools help automate missing-data handling?
Answer: Libraries like scikit-learn (for imputation pipelines), fancyimpute, Evidently, DVC, and YData Profiling are great for automating detection, imputation, and documentation.
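The missingness-flag idea from the answers above can be sketched in a few lines; the `Income` column and its values are hypothetical:

```python
import pandas as pd
import numpy as np

# Toy column with two gaps
df = pd.DataFrame({'Income': [30_000.0, np.nan, 50_000.0, np.nan]})

# Capture the missingness pattern before imputation erases it
df['Income_missing'] = df['Income'].isnull().astype(int)
df['Income'] = df['Income'].fillna(df['Income'].median())

print(df['Income_missing'].tolist())  # [0, 1, 0, 1]
```

The order matters: the flag must be created before `fillna`, or every row would be marked as non-missing.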