Master Pattern-Based Data Cleaning with Regular Expressions (Regex)
🧠 Introduction
When dealing with messy, unstructured, or semi-structured data, traditional cleaning methods can fall short. This is where Regular Expressions (Regex) come in — a powerful tool that lets you search, match, extract, and clean data based on patterns. Whether you need to extract phone numbers, clean email fields, validate formats, or strip out special characters, regex is your best friend.
In this chapter, you'll learn how to use regex with Python's re module and Pandas string methods to validate, extract, and clean real-world text data.
🔍 What is Regex?
Regex is a sequence of characters that forms a search pattern. You can use it to search, match, extract, replace, and validate text.
Python provides built-in regex support through the re module, and Pandas enhances it with .str.replace(), .str.extract(), and .str.contains().
🛠 Step 1: Importing Required Libraries

```python
import re
import pandas as pd
```
🔎 Step 2: Basic Regex Syntax and Rules

| Symbol | Meaning | Example Pattern | Matches |
|---|---|---|---|
| `.` | Any character except newline | `a.c` | "abc", "axc" |
| `\d` | Any digit | `\d+` | "123", "45" |
| `\D` | Non-digit | `\D+` | "abc", "XYZ" |
| `\w` | Word character (letter, digit, _) | `\w+` | "abc123", "_value" |
| `\W` | Non-word character | `\W+` | "@#$", " " |
| `\s` | Whitespace | `\s+` | " ", "\t", "\n" |
| `[]` | Character class | `[aeiou]` | "a", "e" |
| `^` | Start of string | `^abc` | "abc" at start |
| `$` | End of string | `xyz$` | "xyz" at end |
| `*` | 0 or more occurrences | `a*` | "", "aaa" |
| `+` | 1 or more occurrences | `a+` | "a", "aa" |
| `?` | 0 or 1 occurrence | `a?` | "", "a" |
| `\|` | OR | `cat\|dog` | "cat", "dog" |
| `()` | Grouping | `(abc)+` | "abc", "abcabc" |
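A quick way to internalize these rules is to try them interactively with Python's built-in `re` module (covered again in the Bonus section). The sample strings below are made up for illustration:

```python
import re

# \d+ matches runs of digits
digits = re.findall(r'\d+', 'Order 123, item 45')
assert digits == ['123', '45']

# ^ and $ anchor a pattern to the start and end of the string
assert re.search(r'^abc', 'abcdef') is not None
assert re.search(r'xyz$', 'wxyz') is not None

# [aeiou] is a character class; cat|dog matches either alternative
assert re.findall(r'[aeiou]', 'regex') == ['e', 'e']
assert re.findall(r'cat|dog', 'cat and dog') == ['cat', 'dog']
```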
📦 Step 3: Cleaning Text Fields with Regex in Pandas
▶ Sample Data

```python
df = pd.DataFrame({
    'Email': ['john.doe@gmail.com', 'jane@domain', 'invalid@.com'],
    'Phone': ['+1-202-555-0193', '2025550194', 'abc123'],
    'Bio': ['Hi! 😊 I work in <b>AI</b>.', 'Contact me at john@example.com.', 'N/A']
})
```
🔍 Step 4: Validate and Extract Email Addresses
▶ Regex to Match Valid Emails

```python
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract emails (expand=False keeps the result a Series)
df['Extracted_Email'] = df['Email'].str.extract(f'({email_pattern})', expand=False)

# Check if valid (str.match anchors only at the start of the string;
# use str.fullmatch if the whole string must conform)
df['Is_Valid_Email'] = df['Email'].str.match(email_pattern)
```
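Applied to the sample Email column, only the first address passes both checks. A self-contained sketch of the same two calls:

```python
import pandas as pd

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = pd.Series(['john.doe@gmail.com', 'jane@domain', 'invalid@.com'])

# extract() returns NaN where nothing matches
extracted = emails.str.extract(f'({email_pattern})', expand=False)
# match() tests the pattern from the start of each string
valid = emails.str.match(email_pattern)

assert extracted[0] == 'john.doe@gmail.com'
assert valid.tolist() == [True, False, False]
```

'jane@domain' fails because the pattern requires a dot plus a 2+ letter TLD after the domain, and 'invalid@.com' fails because the domain part needs at least one character before the final dot.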
☎️ Step 5: Clean and Extract Phone Numbers
▶ Regex for US Phone Numbers

```python
phone_pattern = r'(\+?\d{1,2}[-.\s]?)?(\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}'

# The pattern contains capturing groups, so extract() returns one column
# per group; column 0 holds the outer (full-match) group
df['Extracted_Phone'] = df['Phone'].str.extract(f'({phone_pattern})')[0]
```
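One gotcha worth seeing in isolation: because the pattern already contains two capturing groups, wrapping it in parentheses for `str.extract()` yields several columns, and only column 0 is the full match. A sketch on the sample Phone values:

```python
import pandas as pd

phone_pattern = r'(\+?\d{1,2}[-.\s]?)?(\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}'
phones = pd.Series(['+1-202-555-0193', '2025550194', 'abc123'])

# extract() returns one column per capturing group; column 0 is the
# outer wrapper group, i.e. the full matched text
extracted = phones.str.extract(f'({phone_pattern})')[0]

assert extracted[0] == '+1-202-555-0193'
assert pd.isna(extracted[2])  # 'abc123' has no phone-shaped digit run

# note: on bare digit strings the optional country/area groups can cause
# partial matches, so always inspect the extracted values before trusting them
```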
🧼 Step 6: Remove HTML Tags from Text

```python
df['Clean_Bio'] = df['Bio'].str.replace(r'<.*?>', '', regex=True)
```
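The `?` after `.*` matters: the non-greedy `<.*?>` stops at the first `>`, while a greedy `<.*>` swallows everything between the first `<` and the last `>` on the line:

```python
import pandas as pd

bio = pd.Series(['Hi! I work in <b>AI</b>.'])

non_greedy = bio.str.replace(r'<.*?>', '', regex=True)
greedy = bio.str.replace(r'<.*>', '', regex=True)

assert non_greedy[0] == 'Hi! I work in AI.'
assert greedy[0] == 'Hi! I work in .'  # '<b>AI</b>' removed wholesale
```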
✨ Step 7: Remove Emojis and Non-ASCII Characters

```python
df['Clean_Bio'] = df['Clean_Bio'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
```
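Note that `[^\x00-\x7F]+` strips all non-ASCII text, not just emojis: accented characters disappear too, which matters for multilingual data. A made-up example:

```python
import pandas as pd

s = pd.Series(['Hi! 😊 café'])
clean = s.str.replace(r'[^\x00-\x7F]+', '', regex=True)

# both the emoji and the accented 'é' are stripped, leaving a double space
assert clean[0] == 'Hi!  caf'
```

The leftover double space is exactly the kind of residue the next step cleans up.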
📊 Step 8: Replace Multiple Spaces with One

```python
df['Clean_Bio'] = df['Clean_Bio'].str.replace(r'\s+', ' ', regex=True).str.strip()
```
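`\s+` collapses any run of spaces, tabs, and newlines into a single space, and `str.strip()` trims the ends; a quick check on an invented messy string:

```python
import pandas as pd

s = pd.Series(['  too   many\t spaces \n'])
clean = s.str.replace(r'\s+', ' ', regex=True).str.strip()

assert clean[0] == 'too many spaces'
```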
🧠 Step 9: Extract Hashtags or Mentions

```python
text_series = pd.Series(['Follow @openai and tag #chatgpt!'])

# Extract mentions
text_series.str.extractall(r'(@\w+)')

# Extract hashtags
text_series.str.extractall(r'(#\w+)')
```
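Unlike `extract()`, `extractall()` keeps every hit per row: it returns a DataFrame with a MultiIndex (original row, match number) and one column per capturing group. A sketch with a second hashtag added to show multiple matches:

```python
import pandas as pd

s = pd.Series(['Follow @openai and tag #chatgpt! #regex'])

mentions = s.str.extractall(r'(@\w+)')
hashtags = s.str.extractall(r'(#\w+)')

assert mentions[0].tolist() == ['@openai']
assert hashtags[0].tolist() == ['#chatgpt', '#regex']
```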
🔢 Step 10: Extract Numbers, Dates, or Codes

```python
# Extract digits only
df['Digits'] = df['Bio'].str.extract(r'(\d+)', expand=False)

# Extract a formatted date (e.g., 2023-04-15)
date_series = pd.Series(['Posted on 2023-04-15.'])
date_series.str.extract(r'(\d{4}-\d{2}-\d{2})')
```
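Extracted date strings can then go straight into `pd.to_datetime` to get a proper datetime dtype; rows without a match become NaT. A small sketch:

```python
import pandas as pd

posts = pd.Series(['Posted on 2023-04-15.', 'no date here'])
dates = posts.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)

assert dates[0] == '2023-04-15'
assert pd.isna(dates[1])

# parse the matched strings into real datetimes (NaT where missing)
parsed = pd.to_datetime(dates)
assert parsed[0].year == 2023
```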
📊 Summary Table: Common Regex Cleaning Tasks

| Task | Regex Pattern | Purpose |
|---|---|---|
| Email extraction | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` | Find valid emails |
| Phone number | `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` | Normalize US phone numbers |
| HTML tag removal | `<.*?>` | Clean web-scraped content |
| Remove emojis/non-ASCII | `[^\x00-\x7F]+` | Clean multilingual/unicode text |
| Extra spaces | `\s+` | Replace with a single space |
| Hashtag or mention extraction | `(#\w+)`, `(@\w+)` | Extract social media tags |
| Date pattern | `\d{4}-\d{2}-\d{2}` | Find formatted dates |
| Only numbers | `\d+` | Extract digits |
🧪 Real-World Applications

| Use Case | Description |
|---|---|
| Resume cleaning | Extract emails, phone numbers, skills |
| Chatbot logs | Remove symbols and timestamps, extract commands |
| Survey data normalization | Validate zip codes, IDs, names |
| Web scraping cleanup | Remove HTML, extract URLs/emails |
| Social media monitoring | Extract mentions, hashtags |
🚫 Common Pitfalls

| Mistake | What to Do Instead |
|---|---|
| Forgetting to use `regex=True` | Always pass `regex=True` in Pandas string methods |
| Not escaping special characters | Use `\.` instead of `.` to match a literal dot |
| Not testing regex | Test with `re.findall()` or online tools (regex101) |
| Overmatching (greedy matching) | Use `.*?` for non-greedy HTML/JSON tag removal |
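The escaping pitfall in action: an unescaped `.` matches any character, so it happily matches strings you didn't intend:

```python
import re

# unescaped '.' matches any character, so 'abc' also matches a.c
assert re.findall(r'a.c', 'abc a.c') == ['abc', 'a.c']

# escaped '\.' matches only a literal dot
assert re.findall(r'a\.c', 'abc a.c') == ['a.c']
```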
🧠 Bonus: Use Python’s re Module for Advanced Matching

```python
text = 'Email: alice@example.com, Phone: 123-456-7890'

# Extract using re
emails = re.findall(email_pattern, text)

# phone_pattern contains capturing groups, so findall() would return
# tuples of groups; finditer() with group(0) yields the full matches
phones = [m.group(0) for m in re.finditer(phone_pattern, text)]

print(emails, phones)
```
🏁 Conclusion
Regex is a superpower for any data professional. It
allows you to search, extract, clean, and validate data with flexible patterns
— especially when working with messy text inputs, log files, or scraped
content. With Pandas and Python’s re module, regex becomes an essential tool in
your data cleaning workflow.
Whether you're building a model, cleaning user input, or
validating records, regex gives you surgical precision.