Detecting Data Gaps Before They Derail Your Project
🧠 Introduction
Before you can clean or impute missing data, you must first identify
and understand what’s missing, how much is missing, and where patterns
exist.
“You can’t fix what you don’t measure.” – That’s especially
true for missing data.
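In this chapter, you'll learn how to recognize the ways missing values are encoded, quantify and visualize them, find patterns in where they occur, and summarize the findings in a report.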
Let’s explore how to shine a light on what’s not in
your data.
🔍 1. What Does “Identifying Missing Data” Mean?
Identifying missing data isn’t just spotting NaN or None. It
means profiling your dataset to detect what is missing, how much is missing, and where patterns exist.
📦 2. Common Representations of Missing Values
Format | Description
np.nan, None | Native missing values in Python
"" | Empty strings
"Unknown" | Placeholder for unknown values
-999, 0 | Dummy values used in legacy systems
✅ Replace Custom Missing with np.nan
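python
import numpy as np

# Convert placeholder values to np.nan.
# Caution: include 0 only if zero is never a valid value in any column.
df.replace(['Unknown', 'N/A', '', -999, 0], np.nan, inplace=True)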
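A global replace treats every listed value as missing in every column, which can erase legitimate zeros. The following is a minimal sketch of a per-column alternative. The toy data, the column names (Age, Income, City), and the choice of sentinels for each column are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and sentinel choices are assumptions.
df = pd.DataFrame({
    "Age": [34, -999, 51, 0],
    "Income": [0, 42000, -999, 58000],
    "City": ["Pune", "Unknown", "", "Delhi"],
})

# Per-column sentinels: here 0 is a placeholder for Age but a valid Income.
sentinels = {
    "Age": [-999, 0],
    "Income": [-999],
    "City": ["Unknown", "N/A", ""],
}

cleaned = df.copy()
for column, values in sentinels.items():
    # mask() sets an entry to missing wherever the condition is True
    cleaned[column] = cleaned[column].mask(cleaned[column].isin(values))

print(cleaned.isnull().sum())
```

Keeping the sentinel lists in one dictionary also documents, in a single place, which values were treated as missing for each column.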
📊 3. Quantifying Missing Data
Count & Percentage:
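python
import pandas as pd

# Count and share of missing values per column
missing_count = df.isnull().sum()
missing_percent = df.isnull().mean() * 100

# Combine into one table, most affected columns first
missing_df = pd.DataFrame({
    'Missing Values': missing_count,
    'Percentage': missing_percent
}).sort_values(by='Percentage', ascending=False)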
Example Output:
Column | Missing Values | Percentage
Age | 120 | 15.0%
Income | 85 | 10.6%
Gender | 0 | 0.0%
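To try the count-and-percentage summary without a real dataset, here is a small self-contained sketch. The helper name missing_summary and the toy values are hypothetical and are not part of pandas.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    summary = pd.DataFrame({
        "Missing Values": frame.isnull().sum(),
        "Percentage": (frame.isnull().mean() * 100).round(1),
    })
    return summary.sort_values(by="Percentage", ascending=False)

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, np.nan],
    "Income": [50000, 62000, np.nan, 48000],
    "Gender": ["F", "M", "F", "M"],
})

report = missing_summary(toy)
print(report)
```

Wrapping the logic in a function makes it easy to rerun the same summary after each cleaning step and compare the results.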
📈 4. Visualizing Missing Data
Visualizations make patterns obvious.
🔹 Using missingno:
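python
import missingno as msno

msno.matrix(df)   # nullity matrix: where values are missing, row by row
msno.heatmap(df)  # correlation of missingness between columns
msno.bar(df)      # count of non-missing values per column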
🔹 Using Seaborn Heatmap:
python
import seaborn as sns

# Plot the boolean missingness mask (rows by columns)
sns.heatmap(df.isnull(), cbar=False)
🔄 5. Row-Level and Segment Analysis
Some rows may have many missing values:
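python
# Build the mask once, before adding helper columns, so the new
# columns are not counted when computing the per-row percentage
row_mask = df.isnull()
df['missing_per_row'] = row_mask.sum(axis=1)
df['missing_percent_row'] = row_mask.mean(axis=1) * 100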
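Once per-row missingness is available, a common next step is to list the rows that are mostly empty. This is a minimal sketch on toy data. The 50% cutoff (ROW_THRESHOLD) is an assumed value, not a recommendation from this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, np.nan],
    "Income": [50000, np.nan, np.nan, 48000],
    "Zip": ["10001", None, "94105", "60601"],
})

# Share of missing cells in each row
row_share = toy.isnull().mean(axis=1)

ROW_THRESHOLD = 0.5  # assumed cutoff; tune it for your own data
sparse_rows = toy[row_share > ROW_THRESHOLD]

print(sparse_rows.index.tolist())
```

Inspecting these rows before dropping them can reveal whether they come from a single source, date, or data-entry path.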
Segment Missing by Category:
python
# Share of missing Age values within each Gender group
df.groupby('Gender')['Age'].apply(lambda x: x.isnull().mean())
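This helps identify whether missingness depends on group membership. If the missing rate differs clearly between groups, the data may be missing at random (MAR) conditional on that group rather than missing completely at random.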
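A rate on its own can mislead when a group is small, so it helps to show the group size next to it. The sketch below uses made-up data and also keeps rows whose Gender is itself missing as a separate group. The column names follow the example above, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M", None],
    "Age": [31, np.nan, np.nan, 45, 52, np.nan, 29],
})

age_missing = toy["Age"].isnull()

# dropna=False keeps rows with a missing Gender as their own group
by_group = age_missing.groupby(toy["Gender"], dropna=False).agg(
    rows="count",    # group size (the mask itself has no missing values)
    missing="sum",   # number of missing Age values
    rate="mean",     # share of missing Age values
)

print(by_group)
```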
🧪 6. Finding Structured or Patterned Missingness
Look for relationships between missingness:
python
# Correlation between missing values
# (columns with no missing values have zero variance and appear as NaN)
df_missing = df.isnull().astype(int)
df_missing.corr()
Visualize as heatmap:
python
sns.heatmap(df_missing.corr(), annot=True)
Example: Missing Income highly correlates with missing
CreditScore.
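The Income and CreditScore case can be reproduced on toy data. This sketch drops indicator columns that never vary (fully observed or fully missing), since their correlation is undefined. All values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Income and CreditScore are missing in the same rows
toy = pd.DataFrame({
    "Income": [50000, np.nan, np.nan, 48000, np.nan, 61000],
    "CreditScore": [700, np.nan, np.nan, 655, np.nan, 720],
    "Age": [25, 31, np.nan, 40, 52, 38],
    "Gender": ["F", "M", "F", "M", "F", "M"],
})

indicators = toy.isnull().astype(int)

# Keep only indicator columns that take both values 0 and 1
indicators = indicators.loc[:, indicators.nunique() > 1]

corr = indicators.corr()
print(corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows, which often points to a shared upstream cause such as a skipped form section.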
🔁 7. Time Series & Index-Based Gaps
If your dataset has time-based or sequential data,
missingness could be episodic:
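python
# Parse dates so the index is a proper DatetimeIndex, then sort it
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').sort_index()
df['Sales'].plot()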
Look for stretches where values or whole timestamps are absent.
Fill in timestamps:
python
# Daily frequency; dates absent from the index become rows of NaN
df = df.asfreq('D')
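Before reindexing, it can be useful to list exactly which timestamps are absent. This is a minimal sketch that assumes a daily series. The dates and sales figures are made up.

```python
import pandas as pd

# Hypothetical toy data with two calendar days missing
sales = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"],
    "Sales": [120.0, 135.0, 110.0, 150.0],
})
sales["Date"] = pd.to_datetime(sales["Date"])
sales = sales.set_index("Date").sort_index()

# Every day between the first and last observation
expected = pd.date_range(sales.index.min(), sales.index.max(), freq="D")

# Days that should exist but are not in the index
missing_dates = expected.difference(sales.index)
print(missing_dates.strftime("%Y-%m-%d").tolist())

# asfreq inserts those days as rows of NaN
daily = sales.asfreq("D")
print(daily["Sales"].isnull().sum())
```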
🛠 8. Profiling Tools
🧰 Automated Profiling with pandas_profiling (now ydata-profiling):
python
from ydata_profiling import ProfileReport

profile = ProfileReport(df)
profile.to_notebook_iframe()
Includes missing value stats, charts, and correlations — all
in one place.
🧰 Great Expectations (For data validation)
bash
# Legacy CLI command (Great Expectations releases before 1.0)
great_expectations suite new
This helps enforce rules such as “No more than 5% missing in Age.”
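The idea behind such a rule can be illustrated in plain pandas. The sketch below is a stand-in for a validation tool, not the Great Expectations API. The function name check_missing_limits and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def check_missing_limits(frame, limits):
    """Return {column: observed_share} for columns that exceed their limit."""
    violations = {}
    for column, max_share in limits.items():
        share = frame[column].isnull().mean()
        if share > max_share:
            violations[column] = float(share)
    return violations

# Hypothetical toy data: 1 of 10 Age values is missing (10%)
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 33, 29, 51, 47, 38, 44, 36],
    "Income": [50000] * 10,
})

# Rule: no more than 5% missing in each listed column
violations = check_missing_limits(toy, {"Age": 0.05, "Income": 0.05})
print(violations)
```

Running a check like this at load time turns a silent data-quality problem into an explicit failure you can act on.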
📚 9. Sample Missingness Report Format
Feature | Missing % | Imputation Plan | Notes
Age | 14.8% | Median by Gender | MAR assumed
Income | 9.3% | KNN Imputer | Numeric skewed
Zip | 52.2% | Drop Column | High missingness
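The first two columns of a report like this can be generated from the data, leaving the plan and notes for the analyst to fill in. This is a sketch on toy data. The column layout mirrors the table above, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, np.nan],
    "Income": [np.nan, 52000, 61000, 48000],
    "Zip": [None, None, None, "60601"],
    "Gender": ["F", "M", "F", "M"],
})

report = (
    (toy.isnull().mean() * 100).round(1)
    .rename("Missing %")
    .rename_axis("Feature")
    .reset_index()
)

# Keep only features that have missing values, most affected first
report = report[report["Missing %"] > 0].sort_values("Missing %", ascending=False)

# Left blank on purpose: these are decisions for the analyst to record
report["Imputation Plan"] = ""
report["Notes"] = ""

print(report.to_string(index=False))
```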
Answer: Missing data can result from system errors, human omission, privacy constraints, sensor failures, or survey respondents skipping questions. It can also be intentional (e.g., optional fields).
Answer: Use Pandas functions like df.isnull().sum() or visualize missingness using the missingno or seaborn heatmap to understand the extent and pattern of missing data.
Answer: No. Dropping rows is acceptable only when the number of missing entries is minimal. Otherwise, it can lead to data loss and bias. Consider imputation or flagging instead.
Answer: If the distribution is normal, use the mean. If it's skewed, use the median. For more advanced tasks, consider KNN imputation or iterative modeling.
Answer: You can fill them using the mode, group-based mode, or assign a new category like "Unknown" or "Missing" — especially if missingness is meaningful.
Answer: Yes! Models like KNNImputer, Random Forests, or IterativeImputer (based on MICE) can predict missing values based on other columns, especially when missingness depends on those other observed columns.
Answer: Data drift refers to changes in the data distribution over time. If drift occurs, previously rare missing values may increase, or your imputation logic may become outdated — requiring updates.
Answer: Absolutely. Creating a binary feature like column_missing = df['column'].isnull() can help the model learn if missingness correlates with the target variable.
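As a minimal sketch of that indicator feature, the flag is created before imputation so the information is not lost once the gaps are filled. The column name Income, the toy values, and the use of the median as the fill value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Income": [50000, np.nan, 62000, np.nan, 48000]})

# Record missingness first; after fillna this information would be gone
toy["Income_missing"] = toy["Income"].isnull().astype(int)

# Then impute (median chosen here only as an example)
toy["Income"] = toy["Income"].fillna(toy["Income"].median())

print(toy)
```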
Answer: Yes — unhandled missing values can cause models to crash, reduce accuracy, or introduce bias. Proper handling improves both robustness and generalizability.
Answer: Libraries like scikit-learn (for imputation pipelines), fancyimpute, Evidently, DVC, and YData Profiling are great for automating detection, imputation, and documentation.