When Less is More — Smart Removal Techniques for Cleaner Data
🧠 Introduction
Not all data can (or should) be saved.
In some situations, dropping missing data — whether rows or
columns — is not only acceptable but also the best decision. This
chapter will walk you through when, why, and how to drop missing values
without compromising your dataset’s integrity.
You’ll learn:
- When dropping missing data is the right call
- How to drop columns and rows, globally or per group
- How to check the impact before and after dropping

“Sometimes subtraction is addition.” Dropping poor-quality data can improve clarity and model performance.
🔍 1. When Is It Okay to Drop Missing Data?
Dropping is ideal when:
- A column is mostly empty and carries little predictive value
- Only a small fraction of rows in a key feature are affected
- The missingness appears random rather than systematic
- The target variable itself is missing
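Before committing to any of these, it helps to quantify them. A minimal sketch, assuming an existing DataFrame df, that measures per-column missingness and estimates how many rows a blanket dropna() would cost:

python
# Fraction of missing values per column, sorted worst-first
col_missing = df.isnull().mean().sort_values(ascending=False)
print(col_missing)

# Fraction of rows that contain at least one missing value
row_loss = df.isnull().any(axis=1).mean()
print(f"dropna() would remove {row_loss:.1%} of rows")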
📦 2. Dropping Columns with High Missingness
Start by calculating the percentage of missing values per column:
python
missing_percent = df.isnull().mean() * 100
Drop columns above a threshold (note that missing_percent is on a 0–100 scale, so the threshold must be too):
python
threshold = 50  # 50%
df = df.loc[:, missing_percent < threshold]
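As a sanity check, you can list which columns the rule would remove before applying it — a small follow-up using the variables defined above:

python
# Columns at or above the 50% threshold (candidates for removal)
dropped = missing_percent[missing_percent >= threshold].index.tolist()
print("Dropping:", dropped)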
Example Table: Dropping Decision

Column | Missing % | Importance | Drop?
Zip Code | 58% | Low | ✅ Yes
Gender | 0% | High | ❌ No
Age | 12% | High | ❌ No
Email | 47% | Medium | ⚠️ Maybe
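To mirror this decision table in code, you can combine the measured missing percentage with a hand-curated importance rating. A sketch, assuming the column names above and a hypothetical importance dict supplied from domain knowledge:

python
# Hypothetical importance labels — in practice these come from domain experts
importance = {'Zip Code': 'Low', 'Gender': 'High', 'Age': 'High', 'Email': 'Medium'}

# Drop only low-importance columns that are more than half missing
to_drop = [col for col in importance
           if importance[col] == 'Low' and missing_percent.get(col, 0) > 50]
df = df.drop(columns=to_drop)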
🧑‍🤝‍🧑 3. Dropping Rows with Missing Values
If only a few rows have missing values in key columns, it may be safe to drop them:
python
df = df.dropna(subset=['Age', 'Income'])
Drop all rows with any missing values:
python
df = df.dropna()
Warning: df.dropna() removes every row that has at least one missing value, which can discard most of a wide dataset. Always compare row counts before and after (see Section 4).
➤ Selective Drop Example
python
# Drop rows where Age OR Gender is missing (keep rows where both are present)
df = df[df['Age'].notnull() & df['Gender'].notnull()]
📊 4. Visualization Before Dropping
Compare row counts:
python
print("Before:", df.shape)
df_cleaned = df.dropna()
print("After:", df_cleaned.shape)
Heatmap:
python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)  # contrasting cells mark missing values
plt.show()
This helps visualize where drops will make the most impact.
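If it's installed, the missingno library (mentioned again in the FAQ below) offers purpose-built plots as an alternative to the seaborn heatmap — a minimal sketch:

python
# pip install missingno
import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(df)  # one row per record; gaps mark missing values
plt.show()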
⚖️ 5. Pros and Cons of Dropping Data
Pros | Cons
Quick and easy | Potential loss of important info
Reduces noise | Can shrink dataset too much
Avoids biased or poor imputations | May bias analysis if missingness is systematic
Ideal for features with over 60–70% missing | Not suitable for time series or streaming data
🤝 6. Group-Based Dropping
Sometimes, only certain groups have poor data.
python
# Drop users from a location known to have mostly null values
df = df[~((df['Country'] == 'Unknown') & (df['Age'].isnull()))]
Or drop columns per group:
python
# Drop a column only for a subgroup (creates a separate frame for that group)
df_group = df[df['UserType'] == 'Guest']
df_group = df_group.drop(columns=['Email'])
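To decide which group deserves this treatment, measure missingness per group first. A sketch assuming the same 'Country' and 'Age' columns:

python
# Missing-Age rate within each country; high values flag groups worth dropping
null_rate = df['Age'].isnull().groupby(df['Country']).mean()
print(null_rate.sort_values(ascending=False))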
📈 7. Partial Column Dropping
Sometimes a feature is missing only within one segment; in that case, drop the affected rows for just that group rather than everywhere:
python
# If Age is missing only for Males, drop those rows just for that group
df = df[~((df['Gender'] == 'Male') & (df['Age'].isnull()))]
🧠 8. Combining Dropping with Other Cleaning
Drop → then Impute → then Encode:
python
# Step 1: Drop unimportant columns
df = df.drop(columns=['ZipCode', 'Address'])

# Step 2: Drop rows where the target is missing
df = df.dropna(subset=['Target'])

# Step 3: Impute Age with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
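The same drop-then-impute flow can also be written with scikit-learn's SimpleImputer, which keeps the imputation reusable at prediction time. A sketch assuming the same column names as above:

python
from sklearn.impute import SimpleImputer

df = df.drop(columns=['ZipCode', 'Address'])  # Step 1: drop unimportant columns
df = df.dropna(subset=['Target'])             # Step 2: drop rows missing the target
imputer = SimpleImputer(strategy='median')    # Step 3: median-impute Age
df[['Age']] = imputer.fit_transform(df[['Age']])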
💡 9. Best Practice Rules for Dropping
Situation | Recommended Action
Feature missing > 60% and low correlation | Drop the column
Less than 5% of rows missing in a key feature | Drop those rows
Time series with missing timestamps | Avoid dropping — use interpolation
Target variable missing | Drop those rows
Training rows missing too many columns | Consider a conditional or threshold-based drop
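These rules are easy to codify so they are applied the same way every time. A minimal sketch, assuming a 'Target' column and the thresholds from the table above:

python
import pandas as pd

def apply_drop_rules(df: pd.DataFrame, target: str = 'Target',
                     col_thresh: float = 0.60) -> pd.DataFrame:
    """Drop mostly-empty columns, then rows where the target is missing."""
    df = df.loc[:, df.isnull().mean() <= col_thresh]  # columns > 60% missing
    return df.dropna(subset=[target])                 # rows with a missing target

df = apply_drop_rules(df)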
✅ 10. Drop Strategy Summary Table
Drop Type | Code Example | Best When...
Drop column (global) | df.drop(columns=['ZipCode']) | Feature is unimportant & >50% missing
Drop row (global) | df.dropna() | Dataset is large; missingness is small
Drop based on column | df.dropna(subset=['Age']) | Specific features are mission-critical
Drop by group | df = df[~((df['Region'] == 'X') & (df['Score'].isnull()))] | Localized bad data
❓ Frequently Asked Questions

Q: What causes missing data in the first place?
Answer: Missing data can result from system errors, human omission, privacy constraints, sensor failures, or survey respondents skipping questions. It can also be intentional (e.g., optional fields).

Q: How do you detect missing data in a dataset?
Answer: Use Pandas functions like df.isnull().sum() or visualize missingness using the missingno or seaborn heatmap to understand the extent and pattern of missing data.

Q: Is it always acceptable to drop rows with missing values?
Answer: No. Dropping rows is acceptable only when the number of missing entries is minimal. Otherwise, it can lead to data loss and bias. Consider imputation or flagging instead.

Q: Should numeric values be imputed with the mean or the median?
Answer: If the distribution is normal, use the mean. If it's skewed, use the median. For more advanced tasks, consider KNN imputation or iterative modeling.

Q: How do you handle missing values in categorical columns?
Answer: You can fill them using the mode, group-based mode, or assign a new category like "Unknown" or "Missing" — especially if missingness is meaningful.

Q: Can machine learning models impute missing values?
Answer: Yes! Models like KNNImputer, Random Forests, or IterativeImputer (based on MICE) can predict missing values based on other columns, especially when missingness is not random.

Q: What is data drift, and how does it affect missing-value handling?
Answer: Data drift refers to changes in the data distribution over time. If drift occurs, previously rare missing values may increase, or your imputation logic may become outdated — requiring updates.

Q: Can missingness itself be used as a feature?
Answer: Absolutely. Creating a binary feature like column_missing = df['column'].isnull() can help the model learn if missingness correlates with the target variable (see the sketch after this list).

Q: Do unhandled missing values affect model performance?
Answer: Yes — unhandled missing values can cause models to crash, reduce accuracy, or introduce bias. Proper handling improves both robustness and generalizability.

Q: Which tools help automate missing-data workflows?
Answer: Libraries like scikit-learn (for imputation pipelines), fancyimpute, Evidently, DVC, and YData Profiling are great for automating detection, imputation, and documentation.
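The missingness-as-a-feature idea from the FAQ takes only two lines. A minimal sketch, assuming a hypothetical 'Income' column:

python
# Flag rows where Income was absent, then impute the original column
df['Income_missing'] = df['Income'].isnull().astype(int)
df['Income'] = df['Income'].fillna(df['Income'].median())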