Quantify, Understand, and Strategize Before You Clean
🧠 Introduction
Before you impute, drop, or modify any missing data, ask yourself:
“What’s the cost of this missing data to my model, insights, or business decision?”
Treating all missing data equally is like treating all medicine the same: context, dosage, and timing matter.
In this chapter, we’ll explore:
📊 1. Why Assessing Missingness Matters
Missing data affects:
Ignoring the impact of missing data can lead to:
🧩 2. Quantifying Missingness Across Columns
Start by profiling overall impact:
python
import pandas as pd

# Count and percentage of missing values per column
missing = df.isnull().sum()
percent = df.isnull().mean() * 100
missing_df = pd.DataFrame({'Missing Count': missing, 'Percent': percent})
Visualize using a bar chart:
python
# Horizontal bar chart for columns with at least one missing value
missing_df[missing_df['Percent'] > 0].sort_values('Percent').plot.barh()
Table Example:
Feature | Missing % | Data Type | Model Role | Severity
Age | 12.5% | Numeric | High | Medium
Gender | 0% | Categorical | Medium | None
Income | 35% | Numeric | High (target var) | High
Zip Code | 55% | Text | Low | Low (drop)
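A table like the one above can be assembled programmatically. The following is a minimal sketch: the `profile_missingness` helper, the toy data, and the 5% / 30% severity cut-offs are assumptions made for illustration, not thresholds prescribed by this chapter. The Model Role column is left out because it depends on domain knowledge rather than on the data.

```python
import numpy as np
import pandas as pd


def profile_missingness(df, low=5.0, high=30.0):
    """Summarize missingness per column.

    `low` and `high` are hypothetical percentage cut-offs for the severity label.
    """
    percent = df.isnull().mean() * 100
    profile = pd.DataFrame({
        "Missing Count": df.isnull().sum(),
        "Percent": percent.round(1),
        "Data Type": df.dtypes.astype(str),
    })
    # Bucket each column: None (0%), Low, Medium, otherwise High.
    profile["Severity"] = np.select(
        [percent == 0, percent < low, percent < high],
        ["None", "Low", "Medium"],
        default="High",
    )
    return profile.sort_values("Percent", ascending=False)


# Toy data, invented for this example.
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31, 58, 47, 36, 29],
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "Income": [50, np.nan, np.nan, 72, np.nan, 64, 55, 48],
})
profile = profile_missingness(toy)
print(profile)
```

Tune the cut-offs to the dataset and the model: a 10% gap in a key predictor may matter more than a 50% gap in a column that is never used.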
🔍 3. Assessing Statistical Bias in Missingness
Compare the distribution of other variables between rows where a feature is missing and rows where it is present.
Example: Compare Income between rows with and without Age:
python
# Flag rows where Age is missing, then summarize Income within each group
df['age_missing'] = df['Age'].isnull()
df.groupby('age_missing')['Income'].describe()
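If large differences exist, the data are unlikely to be missing completely at random (MCAR). The missingness may be missing at random (MAR), meaning it depends on observed variables such as Income, or missing not at random (MNAR), meaning it depends on the unobserved values themselves. Observed data alone cannot distinguish MAR from MNAR.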
Chi-Square for Categorical Variables:
python
from scipy.stats import chi2_contingency

# Gender vs. Age missingness
pd.crosstab(df['Gender'], df['Age'].isnull()).pipe(chi2_contingency)
T-Test for Numerical Variables:
python
from scipy.stats import ttest_ind

# Split rows by whether Age is missing, then compare mean Income
missing_age = df[df['Age'].isnull()]
non_missing_age = df[df['Age'].notnull()]
ttest_ind(missing_age['Income'], non_missing_age['Income'], nan_policy='omit')
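A significant p-value (for example, below 0.05) suggests that missingness in Age is related to the compared variable, so the data are probably not missing completely at random.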
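The same comparison can be repeated for every numeric column rather than for Income alone. The sketch below is one possible way to do that: the `missingness_ttests` helper and the synthetic data are assumptions for illustration, and it uses Welch's variant of the t-test (`equal_var=False`) because the two groups often differ in size and variance.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def missingness_ttests(df, target_col):
    """Welch t-test of each other numeric column, split by whether `target_col` is missing."""
    is_missing = df[target_col].isnull()
    numeric_cols = df.select_dtypes(include="number").columns.drop(target_col, errors="ignore")
    rows = []
    for col in numeric_cols:
        with_gap = df.loc[is_missing, col].dropna()
        without_gap = df.loc[~is_missing, col].dropna()
        if len(with_gap) < 2 or len(without_gap) < 2:
            continue  # too few observations to test
        stat, p = ttest_ind(with_gap, without_gap, equal_var=False)
        rows.append({"column": col, "t": stat, "p_value": p})
    return pd.DataFrame(rows)


# Synthetic data: Age is more likely to be missing for high incomes (MAR by construction).
rng = np.random.default_rng(0)
n = 500
income = rng.normal(50, 10, n)
age = rng.normal(40, 12, n)
age[(income > 55) & (rng.random(n) < 0.7)] = np.nan
demo = pd.DataFrame({"Age": age, "Income": income, "Noise": rng.normal(0, 1, n)})

result = missingness_ttests(demo, "Age")
print(result)
```

When many columns are tested at once, some small p-values will appear by chance, so treat the output as a screening step rather than proof of a mechanism.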
🧪 4. Impact on Correlations and EDA
Missing values can distort correlations and other relationships, because each handling strategy keeps a different subset of rows.
Example: Compare pairwise correlations before and after dropping rows with missing values:
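python
# corr() uses pairwise-complete observations; dropna() keeps only fully complete rows.
# numeric_only=True avoids errors on text columns in recent pandas versions.
corr1 = df.corr(numeric_only=True)
corr2 = df.dropna().corr(numeric_only=True)
diff = corr1 - corr2
diff.abs().style.background_gradient(cmap='coolwarm')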
Impact on Visual Trends
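python
import matplotlib.pyplot as plt
import seaborn as sns

# Side-by-side panels so the two versions are not drawn on top of each other
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True)
sns.scatterplot(x='Age', y='Income', data=df, ax=ax1)  # Rows where Age and Income are both present
sns.scatterplot(x='Age', y='Income', data=df.dropna(), ax=ax2)  # Rows with no missing value in any column
Visual inspection reveals how missingness alters patterns.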
📉 5. Modeling the Impact: Train With and Without Missing Columns
Try training two models:
Example with Random Forest:
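python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Assumes all feature columns are numeric (median imputation requires it)
X = df.drop('target', axis=1)
y = df['target']

# Model 1: drop rows with missing data, then split so the score comes from held-out rows
X_drop = X.dropna()
y_drop = y.loc[X_drop.index]
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_drop, y_drop, test_size=0.2, random_state=42)
model1 = RandomForestClassifier(random_state=42)
model1.fit(Xd_train, yd_train)

# Model 2: split first, then fit the imputer on the training rows only to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
imputer = SimpleImputer(strategy='median')
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
model2 = RandomForestClassifier(random_state=42)
model2.fit(X_train_imp, y_train)

# Compare held-out accuracy (note that the two test sets contain different rows)
print("Drop Accuracy:", model1.score(Xd_test, yd_test))
print("Impute Accuracy:", model2.score(X_test_imp, y_test))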
Output Example:
Model Type | Accuracy | AUC | Bias Risk
Dropped Data | 0.77 | 0.81 | High
Median Imputation | 0.82 | 0.86 | Medium
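The table above reports AUC as well as accuracy. One way to obtain a comparable AUC for both strategies is cross-validation with the imputer placed inside a pipeline, so that it is refit on each training fold. This is a sketch on synthetic data (generated with `make_classification`, with about 15% of cells blanked at random), not a reproduction of the figures in the table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for df: 400 rows, 6 numeric features, ~15% of cells set to NaN.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.15] = np.nan

# Strategy A: keep only complete rows.
complete = ~np.isnan(X).any(axis=1)
auc_drop = cross_val_score(
    RandomForestClassifier(random_state=0), X[complete], y[complete], cv=5, scoring="roc_auc"
).mean()

# Strategy B: median imputation inside a pipeline, refit on every training fold.
pipe = make_pipeline(SimpleImputer(strategy="median"), RandomForestClassifier(random_state=0))
auc_impute = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"Rows kept after dropping: {complete.sum()} of {len(X)}")
print(f"AUC, complete rows only: {auc_drop:.3f}")
print(f"AUC, median imputation:  {auc_impute:.3f}")
```

Because the cells here are removed completely at random, this setup mainly shows the cost of losing rows. With patterned missingness the gap between the two strategies can look quite different.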
🧠 6. Visualizing Impact with Drift Analysis
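Use Evidently to visualize feature drift after handling missing values, by comparing the processed data against a reference dataset.
python
# Import paths for the older Evidently Report API; newer releases have reorganized
# the package, so check the documentation for your installed version
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference_df: baseline data (e.g. before missing-data handling); df: data after handling
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=df)
report.show()  # Renders inline in a notebook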
💾 7. Business-Level Impact Assessment
Missingness affects decisions in:
Solution: Assign a dollar/risk score to each
percentage of data loss. Use cost-benefit tables.
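As a rough illustration of such a score, the sketch below multiplies the number of missing records per feature by an assumed cost per record. The `cost_per_record` mapping, the dollar figures, and the toy data are all hypothetical; real values would have to come from the business side.

```python
import numpy as np
import pandas as pd

# Hypothetical cost (in dollars) of one record being unusable for decisions that rely on the feature.
cost_per_record = {"Age": 0.50, "Income": 4.00, "Zip Code": 0.05}

# Toy data, invented for this example.
toy = pd.DataFrame({
    "Age": [34, np.nan, 51, 29],
    "Income": [np.nan, np.nan, 61000, 48000],
    "Zip Code": ["10001", None, None, None],
})

missing_count = toy.isnull().sum()
cost_table = pd.DataFrame({
    "Missing Count": missing_count,
    "Percent": toy.isnull().mean() * 100,
    # Series are aligned on the feature name before multiplying.
    "Estimated Cost ($)": missing_count * pd.Series(cost_per_record),
}).sort_values("Estimated Cost ($)", ascending=False)
print(cost_table)
```

Ranking by estimated cost rather than by percentage can reorder priorities: here Zip Code has the most gaps but the lowest cost, while Income has fewer gaps and the highest cost.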
🧮 8. Cost-Benefit Matrix for Handling Strategies
Feature | Drop | Impute | Model Impact | Interpretation Loss | Recommendation
Age | ✗ | ✓ | High | Low | Impute
Zip Code | ✓ | ✗ | Low | None | Drop
Income | ✗ | ✓ | Very High | Moderate | Impute + Flag
📋 9. Final Impact Evaluation Checklist
Question | Check ✅
Have you identified how much data is missing per feature? |
Is missingness randomly distributed or patterned? |
Have you quantified the effect of dropping/imputing? |
Is your model sensitive to the missing features? |
Are your business decisions affected by the loss? |
Answer: Missing data can result from system errors, human omission, privacy constraints, sensor failures, or survey respondents skipping questions. It can also be intentional (e.g., optional fields).
Answer: Use pandas functions like df.isnull().sum(), or visualize missingness with the missingno library or a seaborn heatmap, to understand the extent and pattern of missing data.
Answer: No. Dropping rows is acceptable only when the number of missing entries is minimal. Otherwise, it can lead to data loss and bias. Consider imputation or flagging instead.
Answer: If the distribution is normal, use the mean. If it's skewed, use the median. For more advanced tasks, consider KNN imputation or iterative modeling.
Answer: You can fill them using the mode, group-based mode, or assign a new category like "Unknown" or "Missing" — especially if missingness is meaningful.
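The group-based mode and the "Unknown" fallback mentioned above can be combined. The sketch below assumes hypothetical `Region` and `Plan` columns and a small `first_mode` helper, none of which come from the original dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South", "South", "East"],
    "Plan": ["Basic", "Basic", np.nan, "Pro", np.nan, "Pro", np.nan],
})


def first_mode(series):
    """Return the most frequent non-null value, or NaN if the group has none."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else np.nan


# Fill with the mode of the same Region, then label whatever is still missing.
group_mode = toy.groupby("Region")["Plan"].transform(first_mode)
toy["Plan_filled"] = toy["Plan"].fillna(group_mode).fillna("Unknown")
print(toy)
```

The East group has no observed Plan at all, so it falls through to "Unknown" instead of borrowing a value from an unrelated group.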
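Answer: Yes. Imputers such as KNNImputer or IterativeImputer (inspired by MICE), which can also wrap models like Random Forests, predict missing values from other columns. They work best when missingness depends on those observed columns (MAR) rather than on the missing values themselves.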
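To see how KNNImputer and IterativeImputer behave, the sketch below removes values from synthetic data where the truth is known and measures how closely each imputer recovers them. The data, the 20% removal rate, and the error metric are assumptions for illustration. Note that scikit-learn marks IterativeImputer as experimental, so it requires the explicit `enable_iterative_imputer` import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer

# Synthetic data: the second column is roughly twice the first.
rng = np.random.default_rng(0)
x1 = rng.normal(50, 10, 200)
x2 = 2 * x1 + rng.normal(0, 1, 200)
X = np.column_stack([x1, x2])

# Remove about 20% of the second column.
X_missing = X.copy()
mask = rng.random(200) < 0.2
X_missing[mask, 1] = np.nan

knn_filled = KNNImputer(n_neighbors=5).fit_transform(X_missing)
iter_filled = IterativeImputer(random_state=0).fit_transform(X_missing)

# Mean absolute error on the removed cells (known only because the data is synthetic).
for name, filled in [("KNN", knn_filled), ("Iterative", iter_filled)]:
    mae = np.abs(filled[mask, 1] - X[mask, 1]).mean()
    print(f"{name} imputer MAE: {mae:.2f}")
```

On real data the true values are unavailable, so a common workaround is to hide a sample of observed cells and score the imputer on those.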
Answer: Data drift refers to changes in the data distribution over time. If drift occurs, previously rare missing values may increase, or your imputation logic may become outdated — requiring updates.
Answer: Absolutely. Creating a binary feature like column_missing = df['column'].isnull() can help the model learn if missingness correlates with the target variable.
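Such flags can be built by hand or produced during imputation. The sketch below shows both on invented toy data: a pandas one-liner, and scikit-learn's `SimpleImputer` with `add_indicator=True`, which appends one 0/1 column for each feature that had missing values when the imputer was fitted.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for this example.
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "Income": [50.0, 62.0, np.nan, 72.0, 58.0],
})

# Manual flags: one <column>_missing indicator per column.
flags = toy.isnull().astype(int).add_suffix("_missing")

# scikit-learn: impute with the median and append the indicator columns in one step.
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(toy)

print(flags)
print(filled)  # columns: Age, Income, Age indicator, Income indicator
```

Keeping the flag next to the imputed value lets a model treat "imputed 31" differently from "observed 31" when missingness itself carries signal.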
Answer: Yes — unhandled missing values can cause models to crash, reduce accuracy, or introduce bias. Proper handling improves both robustness and generalizability.
Answer: Libraries like scikit-learn (for imputation pipelines), fancyimpute, Evidently, DVC, and YData Profiling are great for automating detection, imputation, and documentation.