In almost every real-world dataset, missing values are
unavoidable. Whether data is collected via sensors, user input, APIs, or
surveys, gaps often exist due to various reasons like corruption, human error,
or system failure. Missing values, if not treated properly, can lead to biased
results or outright errors in machine learning models.
In this chapter, we'll explore why missing values matter,
how to detect them, and various ways to handle them using Python's Pandas
and Scikit-learn libraries.
🧠 Why Are Missing Values a Problem?
Missing data can:
- Cause errors: many algorithms (for example, most Scikit-learn estimators) cannot accept NaN inputs at all.
- Bias results: if values are missing systematically, statistics computed on the remaining data are skewed.
- Shrink your sample: dropping incomplete rows reduces the data available for analysis or training.
Hence, identifying and dealing with missing data is a critical preprocessing step.
🔍 Step 1: Detecting Missing Values
Pandas treats missing data as NaN (Not a Number). You
can easily detect them using .isnull() or .isna().
▶ Code Example:
```python
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Age': [25, 30, None, 22, 29],
    'Email': ['alice@gmail.com', None, 'charlie@gmail.com', 'david@gmail.com', 'eve@gmail.com']
}
df = pd.DataFrame(data)

# Detect missing values
print(df.isnull())
```
This returns a DataFrame with True for missing entries and
False for filled ones.
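The boolean result is also useful as a mask. As a small sketch (using a hypothetical frame shaped like the one above), you can pull out only the incomplete rows:

```python
import pandas as pd

# Hypothetical sample frame mirroring the one above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Age': [25, 30, None, 22, 29],
})

# Rows where *any* column is missing
incomplete = df[df.isnull().any(axis=1)]
print(incomplete)

# Rows where one specific column is missing
missing_age = df[df['Age'].isnull()]
print(missing_age)
```

This is often the first thing to inspect: which records are incomplete, and in which columns.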
📊 Summary Table: Types of Missing Data

| Type | Description | Example |
| --- | --- | --- |
| MCAR (Missing Completely at Random) | No relationship between missingness and any data | A survey field left blank by chance |
| MAR (Missing at Random) | Missingness related to other observed variables | Income missing only for unemployed |
| MNAR (Missing Not at Random) | Missingness related to unobserved data | People with high income not disclosing it |
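To make the distinction concrete, here is a small illustrative simulation (entirely synthetic data, not from the chapter's example) contrasting MCAR and MAR missingness:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
employed = rng.integers(0, 2, n).astype(bool)
income = np.where(employed, rng.normal(50000, 10000, n), 0.0)
df = pd.DataFrame({'employed': employed, 'income': income})

# MCAR: each income value is dropped with the same 10% probability,
# independent of everything else in the data
mcar = df.copy()
mcar.loc[rng.random(n) < 0.1, 'income'] = np.nan

# MAR: income is missing only for the unemployed, i.e. missingness
# depends on another *observed* column
mar = df.copy()
mar.loc[~mar['employed'], 'income'] = np.nan

print(mcar['income'].isnull().mean(), mar['income'].isnull().mean())
```

MNAR cannot be simulated from observed columns alone, which is exactly what makes it the hardest case to handle.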
📌 Step 2: Count Missing Values by Column
You can summarize how many missing values each column contains:
```python
print(df.isnull().sum())
```
Or the percentage of missing values:
```python
print(df.isnull().mean() * 100)
```
🧹 Step 3: Removing Missing Values
🔸 Method 1: Drop rows with missing values
```python
df_dropped = df.dropna()
print(df_dropped)
```
🔸 Method 2: Drop columns with too many missing values
If a column is more than, say, 50% empty:
```python
# thresh is the minimum number of non-missing values a column needs to be kept
threshold = int(len(df) * 0.5)
df_dropped_col = df.dropna(thresh=threshold, axis=1)
```
Use case: drop features that offer little value due to large gaps.
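To see the threshold in action, here is a small sketch with a hypothetical frame in which one column is 80% empty:

```python
import pandas as pd

# Hypothetical frame: 'score' is 80% missing, 'age' only 20%
df = pd.DataFrame({
    'age':   [25, 30, None, 22, 29],
    'score': [None, None, None, None, 90],
})

# Keep only columns with at least 50% non-missing values
threshold = int(len(df) * 0.5)
df_kept = df.dropna(thresh=threshold, axis=1)
print(df_kept.columns.tolist())
```

Only `age` survives, since `score` has fewer than `threshold` filled entries.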
🧪 Step 4: Filling Missing Values (Imputation)
Instead of dropping data, you can impute missing values with:
🔹 Method 1: Fill with a constant (e.g., "Unknown" or 0)
```python
df_filled = df.fillna('Unknown')
```
🔹 Method 2: Fill with mean/median/mode
```python
df['Age'] = df['Age'].fillna(df['Age'].mean())
```
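The snippet above uses the mean; median and mode work the same way. As a sketch (the `City` column here is a hypothetical categorical example):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, None, 22, 29],
    'City': ['NY', 'NY', None, 'LA', 'NY'],  # hypothetical categorical column
})

# The median is more robust to outliers than the mean
df['Age'] = df['Age'].fillna(df['Age'].median())

# The mode (most frequent value) suits categorical columns;
# .mode() returns a Series, so take its first entry
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(df)
```

A common rule of thumb: mean or median for numeric columns, mode or a constant placeholder for categorical ones.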
🔹 Method 3: Forward Fill (use previous value)
```python
df_ffill = df.ffill()  # df.fillna(method='ffill') is deprecated in recent Pandas
```
🔹 Method 4: Backward Fill (use next value)
```python
df_bfill = df.bfill()  # df.fillna(method='bfill') is deprecated in recent Pandas
```
🧠 Advanced: Using Scikit-learn's SimpleImputer
For larger pipelines or ML preprocessing, use SimpleImputer.
```python
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array, so assign back to a 2-D selection
df[['Age']] = imp.fit_transform(df[['Age']])
```
Other strategies include 'median', 'most_frequent', or 'constant'.
🧰 Summary Table: Handling Techniques

| Method | Use Case | Code Snippet Example |
| --- | --- | --- |
| Drop rows | When few rows are affected | `df.dropna()` |
| Drop columns | When a column is mostly empty | `df.dropna(thresh=3, axis=1)` |
| Fill with constant | Categorical placeholders like 'Unknown' | `df.fillna('Unknown')` |
| Fill with mean/median | Numerical features | `df['col'].fillna(df['col'].mean())` |
| Forward/backward fill | Time series or logically ordered data | `df.ffill()` |
| SimpleImputer (ML) | Automated pipelines in machine learning | `SimpleImputer(strategy='mean')` |
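Several of these techniques are often combined. A minimal sketch, on a hypothetical frame, of a typical order of operations (drop hopeless columns first, then impute what remains):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Eve'],
    'Age': [25, None, 22, 29],
    'Notes': [None, None, None, 'ok'],  # mostly empty column
})

# 1. Drop columns that are more than 50% missing
df = df.dropna(thresh=int(len(df) * 0.5), axis=1)

# 2. Impute what remains: median for numbers, a placeholder for text
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Name'] = df['Name'].fillna('Unknown')

print(df)
```

The order matters: imputing before dropping would waste effort filling a column you were going to discard anyway.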
✅ Best Practices
- Investigate why values are missing before choosing a fix; the right treatment differs for MCAR, MAR, and MNAR data.
- Prefer imputation over dropping when missingness is widespread, so you don't discard useful rows.
- In machine learning, fit imputers on the training set only and reuse them on test data to avoid leakage.
🧪 Bonus: Visualizing Missing Data
Use seaborn or missingno for visual inspection:
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
```
Or:
```python
# pip install missingno
import missingno as msno

msno.matrix(df)
```
🏁 Conclusion
Missing values are a reality in every dataset. Your job as a
data professional is to treat them wisely — not just to clean the data,
but to do so in a way that preserves its integrity and analytical power.
By mastering these techniques in Python, you gain control over one of the most
error-prone phases of data analysis and machine learning.
❓ FAQs
Q: What is data cleaning in Python?
Answer: Data cleaning is the process of identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. In Python, it ensures that the data is structured, consistent, and ready for analysis or modeling. Clean data improves the reliability and performance of machine learning models and analytics.
Q: Which Python libraries are most commonly used for data cleaning?
Answer: The most popular libraries include Pandas and NumPy for data manipulation, and Scikit-learn for preprocessing utilities such as SimpleImputer.
Q: How do you handle missing values in Pandas?
Answer: Use df.isnull() to detect missing values. You can drop them using df.dropna() or fill them with appropriate values using df.fillna(). For advanced imputation, SimpleImputer from Scikit-learn can be used.
Q: How do you remove duplicate rows in Pandas?
Answer: Use df.drop_duplicates() to remove exact duplicate rows. To drop based on specific columns, you can use df.drop_duplicates(subset=['column_name']).
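A quick sketch of both forms on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Alice'],
    'city': ['NY', 'LA', 'NY', 'SF'],
})

# Exact duplicates only: rows 0 and 2 are identical, so one is dropped
deduped = df.drop_duplicates()

# Duplicates judged on one column: keeps only the first 'Alice'
by_name = df.drop_duplicates(subset=['name'])

print(len(deduped), len(by_name))
```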
Q: How do you detect and handle outliers?
Answer: You can use statistical methods like Z-score or IQR to detect outliers. Once detected, you can either remove them or cap/floor the values based on business needs using np.where() or conditional logic in Pandas.
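As a minimal sketch of the IQR approach (on made-up data), flagging points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and capping instead of dropping:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 300]})  # 300 is an outlier

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['value'] < lower) | (df['value'] > upper)]

# Cap (winsorize) instead of dropping
df['capped'] = df['value'].clip(lower, upper)
print(outliers)
```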
Q: How do you convert column data types in Pandas?
Answer: Use pd.to_datetime(df['column']) to convert strings to datetime. Similarly, use astype() for converting numerical or categorical types (e.g., df['age'].astype(int)).
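A short sketch of both conversions on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    'signup': ['2024-01-15', '2024-02-03'],
    'age': ['25', '31'],  # numbers stored as strings
})

# String -> datetime
df['signup'] = pd.to_datetime(df['signup'])

# String -> integer
df['age'] = df['age'].astype(int)

print(df.dtypes)
```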
Q: What are the common steps in a data cleaning workflow?
Answer: Common steps include handling missing values, removing duplicates, correcting data types, and treating outliers.
Q: Why is encoding categorical variables necessary?
Answer: Machine learning algorithms typically require numerical inputs. Encoding (like Label Encoding or One-Hot Encoding) converts categorical text into numbers so that algorithms can interpret and process them effectively.