Embark on a journey of knowledge! Take the quiz and earn valuable credits.
Take A QuizChallenge yourself and boost your learning! Start the quiz now to earn credits.
Take A QuizUnlock your potential! Begin the quiz, answer questions, and accumulate credits along the way.
Take A Quiz
Master DateTime Conversion, Feature Extraction &
Cleaning for Time-Based Data Analysis
🧠 Introduction
Dates and times are everywhere — from order timestamps and
transaction logs to sensor readings and customer records. However, when
imported into Python, these values often come in messy, inconsistent formats.
Before any kind of time-based analysis, forecasting, or filtering, you
must parse and standardize date values.
In this chapter, you'll learn how to:
🧩 Why Proper Date Parsing
Matters
If your date column is just a string (object type), you
cannot:
Fixing date parsing early enables powerful time-series
analysis, trend spotting, and predictive modeling.
🔍 Step 1: Detecting Date
Columns
First, identify which columns need conversion.
python
import
pandas as pd
df
= pd.DataFrame({
'JoinDate': ['2023-01-01', '01/02/2023',
'Feb 15, 2023', '2023.03.10', 'not a date']
})
print(df.dtypes)
Output:
vbnet
JoinDate object
dtype: object
This means JoinDate is a string column. We need to convert
it.
🛠 Step 2: Converting
Strings to datetime
Use pd.to_datetime() — the most powerful date parser in
Python.
▶ Basic Conversion
python
df['JoinDate']
= pd.to_datetime(df['JoinDate'], errors='coerce')
This handles:
And sets 'not a date' to NaT (Not a Time).
🧠 Step 3: Handling Custom
Date Formats
Some date strings follow unusual formats. You can specify
the exact format:
python
df
= pd.DataFrame({'date': ['31-01-2023', '15-02-2023']})
df['date']
= pd.to_datetime(df['date'], format='%d-%m-%Y')
✅ Common datetime Format Codes
Symbol |
Meaning |
Example |
%Y |
4-digit year |
2023 |
%y |
2-digit year |
23 |
%m |
Month (01–12) |
03 |
%d |
Day (01–31) |
15 |
%H |
Hour (00–23) |
17 |
%I |
Hour (01–12) |
05 |
%p |
AM/PM |
AM |
%M |
Minute
(00–59) |
45 |
%S |
Second (00–59) |
12 |
📉 Step 4: Handling Nulls
and Invalid Dates
Invalid strings like "N/A" or "not
available" will return NaT if errors='coerce' is set.
You can fill or drop them:
python
df['JoinDate'].fillna(method='ffill',
inplace=True)
Or:
python
df.dropna(subset=['JoinDate'],
inplace=True)
📆 Step 5: Extracting
Components from Dates
Once parsed, you can extract any part of a date:
python
df['Year']
= df['JoinDate'].dt.year
df['Month']
= df['JoinDate'].dt.month
df['Day']
= df['JoinDate'].dt.day
df['Weekday']
= df['JoinDate'].dt.day_name()
df['Week']
= df['JoinDate'].dt.isocalendar().week
🧮 Step 6: Calculating
Date Differences
You can compute the number of days, hours, or seconds
between two datetime columns.
python
df['Today']
= pd.to_datetime('today')
df['Days_Since_Join']
= (df['Today'] - df['JoinDate']).dt.days
🔍 Step 7: Filtering and
Sorting by Date
▶ Filter rows after a date:
python
df[df['JoinDate']
> '2023-02-01']
▶ Sort by date:
python
df.sort_values(by='JoinDate',
inplace=True)
🧠 Step 8: Creating Date
Ranges
Generate sequences of dates for analysis, simulations, or
time series:
python
dates
= pd.date_range(start='2023-01-01', periods=10, freq='D')
freq |
Meaning |
'D' |
Daily |
'M' |
Month end |
'MS' |
Month start |
'Y' |
Year end |
'H' |
Hourly |
📤 Step 9: Formatting
Dates for Export or Display
Convert datetime objects back to strings with a specific
format:
python
df['FormattedDate']
= df['JoinDate'].dt.strftime('%d-%b-%Y')
Output:
31-Jan-2023
🧠 Step 10: Time Zones and
Aware Datetimes
By default, datetime objects are naive (no timezone). You
can localize them:
python
df['JoinDate']
= df['JoinDate'].dt.tz_localize('UTC')
df['JoinDate']
= df['JoinDate'].dt.tz_convert('Asia/Kolkata')
Use this for global apps or analytics across time zones.
📊 Summary Table: Common
DateTime Operations
Task |
Code Example |
Convert to datetime |
pd.to_datetime(df['col']) |
Parse with custom format |
pd.to_datetime(df['col'],
format=...) |
Extract
year/month/day |
df['col'].dt.year |
Calculate difference |
df['A'] -
df['B'] |
Filter by date |
df[df['col'] >
'2023-01-01'] |
Format as string |
df['col'].dt.strftime('%Y-%m-%d') |
Localize and
convert timezone |
dt.tz_localize().tz_convert() |
💡 Pro Tips
python
pd.read_csv('file.csv',
parse_dates=['JoinDate'])
📉 Before vs After Parsing
Example
Input DataFrame:
JoinDate |
01/02/2023 |
2023.03.10 |
Feb 15, 2023 |
not a date |
After Parsing:
JoinDate |
Year |
Month |
Weekday |
2023-01-02 |
2023 |
1 |
Monday |
2023-03-10 |
2023 |
3 |
Friday |
2023-02-15 |
2023 |
2 |
Wednesday |
NaT |
NaN |
NaN |
NaN |
🏁 Conclusion
Date parsing and formatting in Python can be incredibly
powerful when done right. Whether you’re preparing data for time series
forecasting, cohort analysis, or simple filtering by month — a clean datetime
column unlocks tons of analytical potential.
With tools like pd.to_datetime(), .dt, and strftime, you
can:
The key is consistency — always ensure your dates are in
datetime64 format before proceeding with further logic.
Answer: Data cleaning is the process of identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. In Python, it ensures that the data is structured, consistent, and ready for analysis or modeling. Clean data improves the reliability and performance of machine learning models and analytics.
Answer: The most popular libraries include:
Answer: Use df.isnull() to detect missing values. You can drop them using df.dropna() or fill them with appropriate values using df.fillna(). For advanced imputation, SimpleImputer from Scikit-learn can be used.
Answer: Use df.drop_duplicates() to remove exact duplicate rows. To drop based on specific columns, you can use df.drop_duplicates(subset=['column_name']).
Answer: You can use statistical methods like Z-score or IQR to detect outliers. Once detected, you can either remove them or cap/floor the values based on business needs using np.where() or conditional logic in Pandas.
Answer:
Answer: Use pd.to_datetime(df['column']) to convert strings to datetime. Similarly, use astype() for converting numerical or categorical types (e.g., df['age'].astype(int)).
Answer: Common steps include:
Answer: Machine learning algorithms typically require numerical inputs. Encoding (like Label Encoding or One-Hot Encoding) converts categorical text into numbers so that algorithms can interpret and process them effectively.
Please log in to access this content. You will be redirected to the login page shortly.
LoginReady to take your education and career to the next level? Register today and join our growing community of learners and professionals.
Comments(0)