Data-Driven Techniques to Predict and Replace Missing Values Intelligently
🧠 Introduction
Simple imputation techniques like filling with the mean or median work fine for basic use cases. But when your dataset is complex, multi-dimensional, or when missingness isn't random, you need machine learning-based imputation.
ML-based imputation learns patterns from your data to predict missing values with greater accuracy.
In this chapter, you'll learn:
- Why ML-based imputation outperforms simple techniques
- How to prepare your data for imputation
- How to apply KNN, Iterative, and Random Forest imputation
- How to evaluate imputed values, integrate imputation into a pipeline, and avoid common pitfalls
🔍 1. Why Use ML for Imputation?

| Feature | Simple Imputation | ML-Based Imputation |
| --- | --- | --- |
| Learns patterns | ✘ | ✅ |
| Handles nonlinear relationships | ✘ | ✅ |
| Can use multiple predictors | ✘ | ✅ |
| Works on mixed data types | ⚠ (partial) | ✅ |
| Accurate on non-random missingness | ✘ | ✅ |
📦 2. Key Machine Learning Imputation Methods

| Method | Description |
| --- | --- |
| KNN Imputer | Uses nearest neighbors to infer missing values |
| Iterative Imputer | Trains regressors for each feature iteratively |
| Random Forest | Predicts missing values using other columns as input |
| AutoML Approaches | Learns an optimal imputation strategy automatically |
🧰 3. Preparing Your Data
Separate predictors and target (optional):
python
from sklearn.model_selection import train_test_split  # useful if you also plan to hold out a test set

# Assume 'Income' has missing values
X = df.drop(columns=['Income'])
y = df['Income']
Encode categorical columns:
python
df = pd.get_dummies(df, drop_first=True)
Or use OrdinalEncoder/OneHotEncoder inside a pipeline.
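For instance, here is a minimal, hedged sketch of that idea, assuming a feature matrix X with a hypothetical categorical column 'City', numeric columns 'Age' and 'Income', and a fully observed target y:
python
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names -- replace with your own
categorical_cols = ['City']
numeric_cols = ['Age', 'Income']

preprocess = ColumnTransformer([
    # One-hot encode categoricals (this sketch assumes they have no missing values)
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    # Impute numeric columns from their nearest neighbors
    ('num', KNNImputer(n_neighbors=5), numeric_cols),
])

pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000)),
])

pipe.fit(X, y)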
🤖 4. Method 1: KNN Imputation
How it works:
Finds the K nearest rows (based on the other columns), then fills the missing value with their average.
python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
✅ Best for:
- Numeric features with correlated columns
- Small to medium datasets (the neighbor search gets expensive at scale)
- Cases where similar rows are expected to have similar values
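A minimal, self-contained sketch with a small hypothetical DataFrame shows the imputer in action:
python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with one missing Income value
df_demo = pd.DataFrame({
    'Age':    [25, 27, 26, 52, 54],
    'Income': [40_000, 42_000, np.nan, 90_000, 95_000],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df_demo), columns=df_demo.columns)
print(filled)  # the missing Income is averaged from the two nearest rows by Age

Because KNN imputation relies on distances, consider scaling features first so that large-magnitude columns do not dominate the neighbor search.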
🔁 5. Method 2: Iterative Imputer
Concept:
Each column with missing values is modeled as a function of the other columns, using regressors fit iteratively.
python
from sklearn.experimental import enable_iterative_imputer  # enables the experimental IterativeImputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
✅ Benefits:
- Uses all available features to predict each missing column
- Captures multivariate relationships (MICE-style)
- Lets you swap in a different regressor via the estimator parameter
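As a hedged sketch of the last point, the default BayesianRidge regressor can be replaced with another estimator, for example a Random Forest (assuming df is fully numeric):
python
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Use a Random Forest as the per-column regressor instead of the default BayesianRidge
imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
df_imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)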
🌲 6. Method 3: Random Forest Regressor/Classifier
You can manually build an imputation routine using a supervised model.
Steps:
python
from sklearn.ensemble import RandomForestRegressor

# 1. Split the data into rows where 'Income' is known vs. missing
train_data = df[df['Income'].notnull()]
test_data = df[df['Income'].isnull()]

X_train = train_data.drop(columns=['Income'])
y_train = train_data['Income']
X_test = test_data.drop(columns=['Income'])

# 2. Train on the known rows and predict for the missing ones
model = RandomForestRegressor()
model.fit(X_train, y_train)
imputed_values = model.predict(X_test)

# 3. Write the predictions back into the original DataFrame
df.loc[df['Income'].isnull(), 'Income'] = imputed_values
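The same pattern works for a categorical column by swapping in a classifier. A minimal sketch, assuming a hypothetical categorical 'City' column with missing values and otherwise numeric (or already encoded) predictors:
python
from sklearn.ensemble import RandomForestClassifier

# Rows where the hypothetical categorical 'City' column is known vs. missing
known = df[df['City'].notnull()]
missing = df[df['City'].isnull()]

X_train = known.drop(columns=['City'])
y_train = known['City']
X_test = missing.drop(columns=['City'])

# Train a classifier on the known rows and fill in the missing categories
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
df.loc[df['City'].isnull(), 'City'] = clf.predict(X_test)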
🧠 7. Choosing the Right Model

| Data Type | Best Imputer | Why |
| --- | --- | --- |
| Numeric only | IterativeImputer | Flexible; models linear relationships between columns |
| Categorical | Random Forest, XGBoost | Handles splits well |
| Mixed types | KNN or XGBoost | Versatile |
| Time series | Not ML; use trend-based methods (see the sketch below) | Requires sequential context |
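For the time-series case, a quick hedged sketch using pandas' built-in interpolation (with a small hypothetical time-indexed frame):
python
import numpy as np
import pandas as pd

# Hypothetical time-indexed series with gaps
ts = pd.DataFrame(
    {'value': [1.0, np.nan, np.nan, 4.0, 5.0]},
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Linear interpolation that respects the datetime index
ts['value'] = ts['value'].interpolate(method='time')
print(ts)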
🧮 8. Evaluation of Imputation
If true values are known (or simulated by masking known ones):
python
from sklearn.metrics import mean_squared_error

# Version-agnostic RMSE (the squared=False flag is deprecated in recent scikit-learn)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
print(f"RMSE: {rmse:.2f}")
Or compare model accuracy before and after imputation.
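A hedged sketch of the masking idea, assuming a numeric DataFrame df with a fully observed 'Income' column: hide some known values, impute them, and compare against the originals.
python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hide 10% of the known Income values to simulate missingness
df_masked = df.copy()
mask = rng.random(len(df_masked)) < 0.10
df_masked.loc[mask, 'Income'] = np.nan

# Impute, then compare the filled values against the original (true) ones
imputer = KNNImputer(n_neighbors=5)
df_filled = pd.DataFrame(
    imputer.fit_transform(df_masked),
    columns=df_masked.columns,
    index=df_masked.index,
)

rmse = mean_squared_error(df.loc[mask, 'Income'], df_filled.loc[mask, 'Income']) ** 0.5
print(f"Masked-value RMSE: {rmse:.2f}")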
📋 Evaluation Table (Example)

| Method | RMSE | Bias Risk | Complexity |
| --- | --- | --- | --- |
| Mean Impute | 9.83 | High | Low |
| KNN | 7.20 | Low | Medium |
| Iterative | 6.75 | Low | High |
| Random Forest | 6.30 | Very Low | High |
🔄 9. Integrating Into a Pipeline
python
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', IterativeImputer()),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
Pipelines allow:
- Fitting the imputer only on training data, preventing leakage
- Reusing the exact same preprocessing at prediction time
- Cross-validating the imputation and the model together, as shown below
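A hedged usage sketch, assuming a feature matrix X and a fully observed classification target y are already defined:
python
from sklearn.model_selection import cross_val_score

# Each fold fits the imputer, scaler, and model on the training split only,
# so imputation statistics never leak from the validation split
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"Mean CV accuracy: {scores.mean():.3f}")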
📦 10. AutoML Imputation Tools

| Tool | Feature |
| --- | --- |
| Datawig (Amazon) | Deep learning for imputation |
| H2O AutoML | Built-in imputation strategies |
| AutoSklearn | Handles missing values automatically |
| TPOT | Evolves pipelines with imputation |
📉 11. Pitfalls to Avoid
| Pitfall | Tip |
| --- | --- |
| Overfitting imputed values | Use regularization and cross-validation |
| Leakage from imputation | Fit the imputer on training data only (see the sketch below) |
| Mixing targets into predictors | Never use the target for imputing features |
| Ignoring categorical handling | Use appropriate encoders or models |
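To avoid the leakage pitfall, a minimal sketch (assuming X_train and X_test are already split) fits the imputer on the training data only:
python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)

# Learn neighbor statistics from the training split only...
X_train_imputed = imputer.fit_transform(X_train)

# ...then apply the already-fitted imputer to the test split
X_test_imputed = imputer.transform(X_test)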
❓ Frequently Asked Questions

Question: What causes missing data in the first place?
Answer: Missing data can result from system errors, human omission, privacy constraints, sensor failures, or survey respondents skipping questions. It can also be intentional (e.g., optional fields).

Question: How do I detect missing values in a dataset?
Answer: Use Pandas functions like df.isnull().sum() or visualize missingness using missingno or a seaborn heatmap to understand the extent and pattern of missing data.

Question: Is it okay to simply drop rows with missing values?
Answer: No. Dropping rows is acceptable only when the number of missing entries is minimal. Otherwise, it can lead to data loss and bias. Consider imputation or flagging instead.

Question: Should numeric gaps be filled with the mean or the median?
Answer: If the distribution is normal, use the mean. If it's skewed, use the median. For more advanced tasks, consider KNN imputation or iterative modeling.

Question: How should missing values in categorical columns be handled?
Answer: You can fill them using the mode, group-based mode, or assign a new category like "Unknown" or "Missing" — especially if missingness is meaningful.

Question: Can machine learning models predict missing values?
Answer: Yes! Models like KNNImputer, Random Forests, or IterativeImputer (based on MICE) can predict missing values based on other columns, especially when missingness is not random.

Question: How does data drift affect imputation?
Answer: Data drift refers to changes in the data distribution over time. If drift occurs, previously rare missing values may increase, or your imputation logic may become outdated — requiring updates.

Question: Is it useful to add a flag marking where values were missing?
Answer: Absolutely. Creating a binary feature like column_missing = df['column'].isnull() can help the model learn if missingness correlates with the target variable.

Question: Do unhandled missing values really hurt model performance?
Answer: Yes — unhandled missing values can cause models to crash, reduce accuracy, or introduce bias. Proper handling improves both robustness and generalizability.

Question: Which tools help automate missing-data handling?
Answer: Libraries like scikit-learn (for imputation pipelines), fancyimpute, Evidently, DVC, and YData Profiling are great for automating detection, imputation, and documentation.