Top 10 Data Cleaning Techniques in Python: Master the Art of Preprocessing for Accurate Analysis

📘 Chapter 9: Regex-Based Cleaning in Python

Master Pattern-Based Data Cleaning with Regular Expressions (Regex)


🧠 Introduction

When dealing with messy, unstructured, or semi-structured data, traditional cleaning methods can fall short. This is where Regular Expressions (Regex) come in — a powerful tool that lets you search, match, extract, and clean data based on patterns. Whether you need to extract phone numbers, clean email fields, validate formats, or strip out special characters, regex is your best friend.

In this chapter, you’ll learn:

  • What regex is and how it works in Python
  • How to clean, extract, and validate data using regex
  • Regex patterns for emails, numbers, dates, and more
  • How to apply regex cleaning in Pandas
  • Real-world data wrangling examples

🔍 What is Regex?

Regex is a sequence of characters that forms a search pattern. You can use it to:

  • Search for specific sequences
  • Extract matched patterns
  • Replace or remove unwanted patterns

Python provides built-in regex support through the re module, and Pandas enhances it with .str.replace(), .str.extract(), and .str.contains().
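As a quick taste of both interfaces, here is a minimal sketch (the string and Series are invented for illustration):

```python
import re
import pandas as pd

# re module: operate on a single string
text = "Order #123 shipped on 2023-04-15"
digits = re.findall(r'\d+', text)          # every run of digits

# Pandas string methods: vectorized over a whole column
s = pd.Series(['Order #123', 'Order #456', 'no id'])
has_id = s.str.contains(r'#\d+', regex=True)
ids = s.str.extract(r'#(\d+)')             # one column per capture group
```

The `re` functions work string by string; the `.str` accessor applies the same pattern to every row at once, which is what makes regex practical for DataFrame cleaning.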


🛠 Step 1: Importing Required Libraries

```python
import re
import pandas as pd
```


🔎 Step 2: Basic Regex Syntax and Rules

| Symbol | Meaning | Example Pattern | Matches |
|--------|---------|-----------------|---------|
| `.` | Any character except newline | `a.c` | "abc", "axc" |
| `\d` | Any digit | `\d+` | "123", "45" |
| `\D` | Non-digit | `\D+` | "abc", "XYZ" |
| `\w` | Word character (letter, digit, `_`) | `\w+` | "abc123", "_value" |
| `\W` | Non-word character | `\W+` | "@#$", " " |
| `\s` | Whitespace | `\s+` | " ", "\t", "\n" |
| `[]` | Character class | `[aeiou]` | "a", "e" |
| `^` | Start of string | `^abc` | "abc" at start |
| `$` | End of string | `xyz$` | "xyz" at end |
| `*` | 0 or more occurrences | `a*` | "", "aaa" |
| `+` | 1 or more occurrences | `a+` | "a", "aa" |
| `?` | 0 or 1 occurrence | `a?` | "", "a" |
| `\|` | OR (alternation) | `cat\|dog` | "cat", "dog" |
| `()` | Grouping | `(abc)+` | "abc", "abcabc" |


📦 Step 3: Cleaning Text Fields with Regex in Pandas

Sample Data

```python
df = pd.DataFrame({
    'Email': ['john.doe@gmail.com', 'jane@domain', 'invalid@.com'],
    'Phone': ['+1-202-555-0193', '2025550194', 'abc123'],
    'Bio': ['Hi! 😊 I work in <b>AI</b>.', 'Contact me at john@example.com.', 'N/A']
})
```


🔍 Step 4: Validate and Extract Email Addresses

Regex to Match Valid Emails

```python
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract the first email-like substring
df['Extracted_Email'] = df['Email'].str.extract(f'({email_pattern})')

# Check validity of the entire string
# (.str.match() only anchors at the start; .str.fullmatch() checks the whole string)
df['Is_Valid_Email'] = df['Email'].str.fullmatch(email_pattern)
```


📞 Step 5: Clean and Extract Phone Numbers

Regex for US Phone Number

```python
# Non-capturing groups (?:...) keep .str.extract() to a single column;
# with capturing groups, extract would return one column per group
phone_pattern = r'(?:\+?\d{1,2}[-.\s]?)?(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}'

# Extract phone numbers
df['Extracted_Phone'] = df['Phone'].str.extract(f'({phone_pattern})')
```


🧼 Step 6: Remove HTML Tags from Text

```python
df['Clean_Bio'] = df['Bio'].str.replace(r'<.*?>', '', regex=True)
```


Step 7: Remove Emojis and Non-ASCII Characters

```python
df['Clean_Bio'] = df['Clean_Bio'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
```


📊 Step 8: Replace Multiple Spaces with One

```python
df['Clean_Bio'] = df['Clean_Bio'].str.replace(r'\s+', ' ', regex=True).str.strip()
```


🧠 Step 9: Extract Hashtags or Mentions

```python
text_series = pd.Series(['Follow @openai and tag #chatgpt!'])

# Extract mentions
text_series.str.extractall(r'(@\w+)')

# Extract hashtags
text_series.str.extractall(r'(#\w+)')
```


🔢 Step 10: Extract Numbers, Dates, or Codes

```python
# Extract digits only
df['Digits'] = df['Bio'].str.extract(r'(\d+)')

# Extract a formatted date (e.g., 2023-04-15)
date_series = pd.Series(['Posted on 2023-04-15.'])
date_series.str.extract(r'(\d{4}-\d{2}-\d{2})')
```


📊 Summary Table: Common Regex Cleaning Tasks

| Task | Regex Pattern | Purpose |
|------|---------------|---------|
| Email extraction | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` | Find valid emails |
| Phone number | `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` | Match US phone numbers |
| HTML tag removal | `<.*?>` | Clean web-scraped content |
| Remove emojis/non-ASCII | `[^\x00-\x7F]+` | Clean multilingual/unicode text |
| Extra spaces | `\s+` | Replace with a single space |
| Hashtag or mention extraction | `(#\w+)`, `(@\w+)` | Extract social media tags |
| Date pattern | `\d{4}-\d{2}-\d{2}` | Find formatted dates |
| Only numbers | `\d+` | Extract digits |


🧪 Real-World Applications

| Use Case | Description |
|----------|-------------|
| Resume cleaning | Extract emails, phone numbers, skills |
| Chatbot logs | Remove symbols and timestamps, extract commands |
| Survey data normalization | Validate zip codes, IDs, names |
| Web scraping cleanup | Remove HTML, extract URLs/emails |
| Social media monitoring | Extract mentions, hashtags |


🚫 Common Pitfalls

| Mistake | What to Do Instead |
|---------|--------------------|
| Forgetting to use `regex=True` | Pass `regex=True` to Pandas string methods whenever the pattern is a regex |
| Not escaping special characters | Use `\.` (in a raw string) to match a literal dot |
| Not testing regex | Test with `re.findall()` or online tools like regex101 |
| Overmatching (greedy matching) | Use `.*?` for non-greedy matching, e.g. when removing HTML tags |
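To see why greedy matching overmatches, compare `.*` with `.*?` on a small HTML snippet (an illustrative string, not from the dataset above):

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the LAST '>', swallowing the text in between
greedy = re.sub(r'<.*>', '', html)

# Non-greedy: .*? stops at the FIRST '>', removing only the tags
lazy = re.sub(r'<.*?>', '', html)

print(greedy)  # ''
print(lazy)    # 'bold and italic'
```

The greedy version deletes the entire string because one match spans from the first `<` to the final `>`; the lazy version matches each tag separately.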


🧠 Bonus: Use Python’s re Module for Advanced Matching

```python
text = 'Email: alice@example.com, Phone: 123-456-7890'

# re.findall() returns the whole match only when the pattern has no
# capture groups; use non-capturing (?:...) groups inside larger patterns
emails = re.findall(email_pattern, text)
phones = re.findall(phone_pattern, text)
print(emails, phones)
```


🏁 Conclusion

Regex is a superpower for any data professional. It allows you to search, extract, clean, and validate data with flexible patterns — especially when working with messy text inputs, log files, or scraped content. With Pandas and Python’s re module, regex becomes an essential tool in your data cleaning workflow.


Whether you're building a model, cleaning user input, or validating records, regex gives you surgical precision.

FAQs


1. What is data cleaning and why is it important in Python?

Answer: Data cleaning is the process of identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. In Python, it ensures that the data is structured, consistent, and ready for analysis or modeling. Clean data improves the reliability and performance of machine learning models and analytics.

2. Which Python libraries are most commonly used for data cleaning?

Answer: The most popular libraries include:

  • Pandas – for data manipulation
  • NumPy – for handling arrays and numerical operations
  • Scikit-learn – for preprocessing tasks like encoding and scaling
  • Regex (re) – for pattern matching and cleaning strings

3. How do I handle missing values in a DataFrame using Pandas?

Answer: Use df.isnull() to detect missing values. You can drop them using df.dropna() or fill them with appropriate values using df.fillna(). For advanced imputation, SimpleImputer from Scikit-learn can be used.
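A short sketch of those options on a toy DataFrame (column names invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 32], 'city': ['NY', 'LA', None]})

missing_counts = df.isnull().sum()                 # missing values per column
filled = df.fillna({'age': df['age'].mean(), 'city': 'Unknown'})
dropped = df.dropna()                              # keep fully populated rows only
```

Passing a dict to `fillna()` lets each column get its own fill strategy in one call.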

4. What is the best way to remove duplicate rows in Python?

Answer: Use df.drop_duplicates() to remove exact duplicate rows. To drop based on specific columns, you can use df.drop_duplicates(subset=['column_name']).
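For example, on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Ann', 'Bob'],
    'city': ['NY', 'NY', 'SF', 'LA'],
})

exact = df.drop_duplicates()                    # drops only the repeated (Ann, NY) row
by_name = df.drop_duplicates(subset=['name'])   # keeps the first row per name
```

Note the difference: full-row deduplication keeps the distinct (Ann, SF) row, while `subset=['name']` treats it as a duplicate of the first Ann.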

5. How can I detect and handle outliers in my dataset?

Answer: You can use statistical methods like Z-score or IQR to detect outliers. Once detected, you can either remove them or cap/floor the values based on business needs using np.where() or conditional logic in Pandas.
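A minimal IQR sketch on invented data (the 1.5x multiplier is the conventional Tukey fence):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])    # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]        # detect
capped = s.clip(lower=lower, upper=upper)      # or cap/floor instead of dropping
```

`clip()` is a convenient alternative to `np.where()` when you only need to bound values.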

6. What is the difference between normalization and standardization in data cleaning?

Answer:

  • Normalization scales data to a [0, 1] range (Min-Max Scaling).
  • Standardization (Z-score scaling) centers the data around mean 0 with standard deviation 1.
    Use MinMaxScaler or StandardScaler from Scikit-learn for these transformations.
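The same math can be sketched directly in Pandas (the Scikit-learn scalers wrap these formulas; note that `StandardScaler` uses the population standard deviation, i.e. `ddof=0`):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Min-max normalization to [0, 1]
normalized = (s - s.min()) / (s.max() - s.min())

# Z-score standardization: mean 0, standard deviation 1
standardized = (s - s.mean()) / s.std(ddof=0)
```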

7. How do I convert data types (like strings to datetime) in Python?

Answer: Use pd.to_datetime(df['column']) to convert strings to datetime. Similarly, use astype() for converting numerical or categorical types (e.g., df['age'].astype(int)).
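For instance, with invented columns:

```python
import pandas as pd

df = pd.DataFrame({'joined': ['2023-04-15', '2024-01-02'], 'age': ['25', '32']})

df['joined'] = pd.to_datetime(df['joined'])   # string -> datetime64
df['age'] = df['age'].astype(int)             # string -> integer
```

After conversion, the `.dt` accessor (e.g. `df['joined'].dt.year`) and numeric operations become available.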

8. How can I clean and standardize text data in Python?

Answer: Common steps include:

  • Lowercasing: df['col'] = df['col'].str.lower()
  • Removing punctuation/whitespace: using regex or .str.strip(), .str.replace()
  • Replacing inconsistent terms (e.g., "Male", "M", "male") using df.replace()
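Those steps chain naturally (a toy gender column invented for illustration):

```python
import pandas as pd

s = pd.Series(['  Male ', 'M', 'male', 'FEMALE  '])

cleaned = (
    s.str.strip()                             # trim surrounding whitespace
     .str.lower()                             # unify case
     .replace({'m': 'male', 'f': 'female'})   # map abbreviations to full terms
)
```

Stripping and lowercasing first shrinks the number of variants the final `replace()` mapping has to cover.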

9. Why is encoding categorical variables necessary in data cleaning?

Answer: Machine learning algorithms typically require numerical inputs. Encoding (like Label Encoding or One-Hot Encoding) converts categorical text into numbers so that algorithms can interpret and process them effectively.
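A minimal one-hot sketch using Pandas' `get_dummies` (Scikit-learn's `OneHotEncoder` produces the equivalent result):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=['color'])
```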