How to Know If Your Missing Data Strategy Actually Worked
🧠 Introduction
Imputation fills the gaps, but how do you know if you filled them correctly? Post-imputation evaluation is crucial to ensure you haven't introduced bias, noise, or distorted patterns.
This chapter teaches you:
- Why evaluation matters and what can go wrong without it
- Common evaluation approaches and the metrics behind them
- How to simulate missingness when no ground truth exists
- Visual and model-based validation techniques
- Red flags to watch for and best practices to follow
📊 1. Why Evaluate Imputation?
Imputation should:
- Preserve each feature's original distribution
- Maintain relationships between features
- Support, rather than degrade, downstream model performance
If not evaluated:
- Bias and noise can slip in unnoticed
- Patterns in the data may be distorted
- Model accuracy can quietly degrade
🔄 2. Common Evaluation Approaches

| Method | When to Use | Description |
| --- | --- | --- |
| Compare with true values | When you can mask known data | Measure imputed vs. actual |
| Model performance analysis | In predictive modeling pipelines | Evaluate metrics (accuracy, F1, etc.) |
| Distribution comparison | For numerical features | Use histograms, KDE, boxplots |
| Drift and correlation checks | After batch imputation | Check data stability |
| Cross-validation | On entire pipeline | Holistic model testing |
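For the drift check above, a minimal sketch using a two-sample Kolmogorov-Smirnov test (one of several reasonable drift statistics; it assumes `df` is the raw data, `df_imputed` the filled version, and a numeric column):

```python
import pandas as pd
from scipy.stats import ks_2samp

def drift_check(original: pd.Series, imputed: pd.Series) -> float:
    """Return the KS statistic between the observed and imputed distributions.

    Values close to 0 suggest the imputed column preserves the original shape.
    """
    statistic, p_value = ks_2samp(original.dropna(), imputed)
    return statistic

# Hypothetical usage:
# print(drift_check(df['Age'], df_imputed['Age']))
```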
🔬 3. Simulating Missingness for Validation
When true missing values aren't known, simulate them.
Example:

```python
import numpy as np

# Make a copy of the original
df_eval = df.copy()

# Mask 20% of values from a column
mask = np.random.rand(len(df_eval)) < 0.2
df_eval['Age_masked'] = df_eval['Age']
df_eval.loc[mask, 'Age_masked'] = np.nan
```
Apply your imputation method, then compare:

```python
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error

# Any imputer works here; a simple mean imputer is used as an example
imputer = SimpleImputer(strategy='mean')
imputed = imputer.fit_transform(df_eval[['Age_masked']])

# RMSE between the true values and their imputed replacements
rmse = mean_squared_error(df_eval['Age'][mask], imputed[mask]) ** 0.5
print(f"RMSE: {rmse:.2f}")
```
📏 4. Useful Metrics for Numerical Imputation

| Metric | Description | Best For |
| --- | --- | --- |
| RMSE | Root mean squared error | Distance from true values |
| MAE | Mean absolute error | Robustness to outliers |
| R² Score | Variance explained | Overall prediction fit |
| KL Divergence | Distributional difference | Distributional preservation |
RMSE Code Example:

```python
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(true_values, imputed_values) ** 0.5
```
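The table also lists KL divergence; a minimal sketch estimates it from histograms (it assumes scipy is available and that NaNs have already been dropped from both samples; the bin count and smoothing constant are arbitrary choices):

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(original, imputed, bins=20):
    """Approximate KL divergence between two numeric samples via histograms."""
    # Shared bin edges keep the two histograms comparable
    edges = np.histogram_bin_edges(np.concatenate([original, imputed]), bins=bins)
    p, _ = np.histogram(original, bins=edges, density=True)
    q, _ = np.histogram(imputed, bins=edges, density=True)
    # A small constant avoids division by zero in empty bins
    return entropy(p + 1e-10, q + 1e-10)
```

A value near 0 means the imputed distribution closely matches the original.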
📏 5. Useful Metrics for Categorical Imputation

| Metric | Description | Best For |
| --- | --- | --- |
| Accuracy | % of correct predictions | High-cardinality columns |
| F1 Score | Precision-recall balance | Class imbalance |
| Mode Match | Matches to most frequent category | Ordinal features |
Accuracy Example:

```python
from sklearn.metrics import accuracy_score

accuracy_score(true_values, imputed_values)
```
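For imbalanced categories, F1 can be computed the same way; a sketch, where `average='macro'` is one reasonable choice when there are more than two classes:

```python
from sklearn.metrics import f1_score

# Macro averaging weights every category equally, which exposes
# poor imputation of rare classes
f1 = f1_score(true_values, imputed_values, average='macro')
print(f"Macro F1: {f1:.2f}")
```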
📈 6. Visual Techniques for Evaluation
KDE Plot Comparison

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(df['Income'], label='Original')
sns.kdeplot(df_imputed['Income'], label='Imputed')
plt.legend()
```
Boxplots:

```python
sns.boxplot(data=[df['Income'].dropna(), df_imputed['Income']], orient='h')
```

Shows spread, outliers, and symmetry.
🔁 7. Before vs. After Comparison

| Feature | Mean (Before) | Mean (After) | Std Dev (Before) | Std Dev (After) |
| --- | --- | --- | --- | --- |
| Age | 32.4 | 32.3 | 4.8 | 4.7 |
| Income | 45,600 | 45,580 | 12,000 | 11,970 |
```python
import pandas as pd

comparison = pd.DataFrame({
    'Before Mean': df['Age'].mean(),
    'After Mean': df_imputed['Age'].mean(),
    'Before Std': df['Age'].std(),
    'After Std': df_imputed['Age'].std()
}, index=['Age'])
```
🧠 8. Use Modeling to Validate
Build a prediction model on the imputed dataset and on a baseline (for example, complete cases only), then compare results.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(), X_imputed, y, cv=5)
print("CV Accuracy:", scores.mean())
```
A good imputation strategy shouldn’t degrade performance.
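To make that comparison concrete, a sketch that scores a complete-case baseline against the imputed data (the 'target' column name is a placeholder, and `X_imputed` and `y` come from the block above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Complete-case baseline: drop rows with any missing value
complete = df.dropna()
X_base, y_base = complete.drop(columns='target'), complete['target']

baseline = cross_val_score(RandomForestClassifier(), X_base, y_base, cv=5).mean()
with_imputation = cross_val_score(RandomForestClassifier(), X_imputed, y, cv=5).mean()

print(f"Complete-case baseline: {baseline:.3f}")
print(f"With imputation:        {with_imputation:.3f}")
```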
🧰 9. Tools & Libraries

| Tool | Use |
| --- | --- |
| Scikit-learn | RMSE, MAE, pipelines |
| Seaborn / Matplotlib | Visual evaluation |
| YData Profiling | Full-feature comparison dashboards |
| Missingno | Visualizes null structure |
| Dython | Correlation after imputation |
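As a quick illustration of the Missingno entry (a sketch; it assumes the library is installed, e.g. via `pip install missingno`):

```python
import missingno as msno

# Matrix view of remaining nulls; after imputation it should appear solid
msno.matrix(df_imputed)
```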
📉 10. Common Red Flags in Post-Imputation

| Symptom | Likely Cause |
| --- | --- |
| Overly smooth distribution | Mean/linear interpolation overuse |
| Peaks or dips at fill values | Overuse of constant/default fills |
| High variance drop | Over-flattening of data |
| Degraded model accuracy | Bad imputation logic or leakage |
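One way to spot the "peaks at fill values" symptom is to compare value frequencies before and after imputation (a sketch; the 5-percentage-point threshold is an arbitrary heuristic):

```python
# Share of each value before and after imputation
before = df['Age'].value_counts(normalize=True)
after = df_imputed['Age'].value_counts(normalize=True)

# Values whose share grew by more than 5 percentage points are suspects
growth = after - before.reindex(after.index).fillna(0)
print(growth[growth > 0.05])
```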
✅ 11. Best Practice Tips

| Tip | Why It Matters |
| --- | --- |
| Always validate on unseen data | Ensures generalization |
| Simulate missing data when no ground truth exists | Allows real metric testing |
| Visualize both distributions and model output | Captures subtle distortions |
| Use different imputation methods for comparison (see the sketch below) | Select the best for your data context |
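A minimal sketch of that last tip, comparing SimpleImputer strategies by downstream cross-validation score (it assumes a numeric feature matrix `X` with missing values and labels `y`):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Score each imputation strategy with the same downstream model
for strategy in ['mean', 'median', 'most_frequent']:
    pipeline = make_pipeline(
        SimpleImputer(strategy=strategy),
        RandomForestClassifier(random_state=0),
    )
    score = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"{strategy:>13}: {score:.3f}")
```

Imputing inside the pipeline keeps fill values learned only from each training fold, which avoids leakage.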
📋 Summary Table: Evaluation Strategies

| Method | Data Type | Metric or Tool |
| --- | --- | --- |
| RMSE / MAE | Numeric | sklearn.metrics |
| Accuracy / F1 Score | Categorical | sklearn.metrics |
| KDE / Boxplot | Any | seaborn, matplotlib |
| Modeling with cross-val | Any | cross_val_score |
| Distributional shift | Any | KL divergence, stats |
❓ FAQ

Q: Why does data go missing in the first place?
Answer: Missing data can result from system errors, human omission, privacy constraints, sensor failures, or survey respondents skipping questions. It can also be intentional (e.g., optional fields).

Q: How can I detect missing data in a dataset?
Answer: Use Pandas functions like df.isnull().sum() or visualize missingness using missingno or a seaborn heatmap to understand the extent and pattern of missing data.

Q: Is dropping rows with missing values always acceptable?
Answer: No. Dropping rows is acceptable only when the number of missing entries is minimal. Otherwise, it can lead to data loss and bias. Consider imputation or flagging instead.

Q: Should I impute numerical features with the mean or the median?
Answer: If the distribution is normal, use the mean. If it's skewed, use the median. For more advanced tasks, consider KNN imputation or iterative modeling.

Q: How should I fill missing values in categorical features?
Answer: You can fill them using the mode, a group-based mode, or a new category like "Unknown" or "Missing", especially if missingness is meaningful.

Q: Can machine learning models impute missing values?
Answer: Yes! Models like KNNImputer, Random Forests, or IterativeImputer (based on MICE) can predict missing values from other columns, especially when missingness is not random.

Q: What is data drift, and how does it affect imputation?
Answer: Data drift refers to changes in the data distribution over time. If drift occurs, previously rare missing values may increase, or your imputation logic may become outdated, requiring updates.

Q: Is it worth adding a flag for missingness?
Answer: Absolutely. Creating a binary feature like column_missing = df['column'].isnull() can help the model learn whether missingness correlates with the target variable.

Q: Do missing values really affect model performance?
Answer: Yes. Unhandled missing values can cause models to crash, reduce accuracy, or introduce bias. Proper handling improves both robustness and generalizability.

Q: Which tools help automate missing data handling?
Answer: Libraries like scikit-learn (for imputation pipelines), fancyimpute, Evidently, DVC, and YData Profiling are great for automating detection, imputation, and documentation.