🔹 1. Introduction
Data cleaning and preprocessing are crucial steps in any
data analysis or machine learning workflow. Before you can analyze or visualize
data, it is essential to ensure the dataset is in the right shape and format.
Pandas, with its rich set of functions, allows you to load, clean, and
preprocess data easily.
In this chapter, we will explore loading data into Pandas, handling missing values, transforming data, removing duplicates, and filtering rows.
Mastering these preprocessing techniques is essential for
effectively preparing data for further analysis or machine learning.
🔹 2. Loading Data into Pandas
Pandas supports reading from a variety of file formats.
Let's go over the most common ways to load data into a Pandas DataFrame.
✅ Loading from a CSV File
import pandas as pd

# Load a CSV file
df = pd.read_csv('data.csv')

# Display the first 5 rows of the DataFrame
print(df.head())
This will load data from a CSV file into a Pandas
DataFrame. You can specify additional options, like column names, index
columns, or missing value markers.
Example of loading a CSV with custom parameters:
df = pd.read_csv('data.csv', header=0, names=['Column1', 'Column2', 'Column3'])
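As a runnable sketch of the other options mentioned above, the following uses an in-memory string in place of a real file (the file contents and column names are hypothetical), with index_col picking the row index and na_values marking custom missing-value tokens:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv; 'NA' marks missing values
csv_text = "id,name,score\n1,Alice,90\n2,Bob,NA\n3,Carol,85\n"

# index_col makes 'id' the row index; na_values treats 'NA' as missing
df = pd.read_csv(io.StringIO(csv_text), index_col='id', na_values=['NA'])

print(df)
print(df['score'].isna().sum())  # Bob's score is missing
```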
✅ Loading from an Excel File
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
You can specify the sheet name with the sheet_name
parameter. If you want to load all sheets, you can pass sheet_name=None to
return a dictionary of DataFrames.
✅ Loading from a SQL Database
Pandas can read data directly from SQL databases using the
read_sql() function:
import sqlite3

# Connect to SQLite database
conn = sqlite3.connect('database.db')

# Load data from a SQL query
df = pd.read_sql('SELECT * FROM table_name', conn)

# Display the first few rows
print(df.head())
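To make this self-contained, the sketch below builds a throwaway in-memory SQLite database (the table name and columns are hypothetical) and reads it back with read_sql():

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for database.db (illustrative only)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE table_name (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO table_name VALUES (?, ?)",
                 [('Alice', 30), ('Bob', 25)])
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame
df = pd.read_sql('SELECT * FROM table_name', conn)
print(df)
conn.close()
```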
🔹 3. Handling Missing Data
One of the most common challenges when working with
real-world data is missing or NaN (Not a Number) values. Pandas provides
robust functions for handling missing data.
✅ Identifying Missing Data
You can check for missing values using the isna() function,
which returns a DataFrame of boolean values (True for missing, False for
non-missing).
df.isna()
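In practice, the boolean frame is usually aggregated. A small sketch (with hypothetical data) showing the common isna().sum() pattern for counting missing values per column:

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with some missing values
df = pd.DataFrame({'Age': [25, np.nan, 32], 'City': ['Oslo', 'Lima', None]})

# Count missing values per column
print(df.isna().sum())

# Check whether any value at all is missing
print(df.isna().any().any())
```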
✅ Removing Missing Data
If you want to remove rows or columns with missing data, you
can use dropna():
# Drop rows with any missing data
df.dropna(axis=0, inplace=True)

# Drop columns with any missing data
df.dropna(axis=1, inplace=True)
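Beyond the all-or-nothing drops above, dropna() also accepts subset (only consider certain columns) and thresh (minimum number of non-missing values to keep a row). A sketch on a small hypothetical frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3],
                   'B': [np.nan, np.nan, 6],
                   'C': [7, 8, 9]})

# Drop rows only when the 'A' column is missing
print(df.dropna(subset=['A']))

# Keep rows that have at least 2 non-missing values
print(df.dropna(thresh=2))
```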
✅ Filling Missing Data
In many cases, it’s better to fill missing data
instead of dropping it. Pandas offers several strategies for this, including
filling with a specific value or with forward/backward filling.
# Fill missing values with a specific value (e.g., 0)
df.fillna(0, inplace=True)

# Forward fill (propagate the last valid value forward)
df.ffill(inplace=True)

# Backward fill (propagate the next valid value backward)
df.bfill(inplace=True)
Note that the older fillna(method='ffill') style is deprecated in recent Pandas versions in favor of the dedicated ffill() and bfill() methods shown here.
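Another common strategy is filling a numeric column with a summary statistic such as its mean. A sketch with hypothetical salary data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Salary': [50000.0, np.nan, 70000.0]})

# Fill the missing salary with the column mean (60000 here)
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df)
```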
🔹 4. Data Transformation
✅ Changing Data Types
Sometimes, the data types of columns are incorrect (e.g.,
numeric columns stored as strings). You can change data types using astype():
# Convert the 'Age' column to integer
df['Age'] = df['Age'].astype(int)

# Convert the 'Date' column to datetime type
df['Date'] = pd.to_datetime(df['Date'])
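Real-world columns often contain entries that cannot be parsed, and both astype(int) and pd.to_datetime() will raise on them. A sketch (with hypothetical bad entries) using errors='coerce' to turn unparseable values into NaN/NaT instead:

```python
import pandas as pd

df = pd.DataFrame({'Age': ['25', '31', 'unknown'],
                   'Date': ['2024-01-05', 'not a date', '2024-03-10']})

# astype(int) would raise on 'unknown'; coerce bad entries to NaN instead
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Likewise, unparseable dates become NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df)
```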
✅ Applying Functions to Columns
You can apply functions to columns or rows
using apply() or map().
Example with apply() (element-wise on a single column, or row-/column-wise on a whole DataFrame):
# Apply a function to each value in the 'Age' column
df['Age'] = df['Age'].apply(lambda x: x + 1)  # Increment each age by 1
Example with map() (element-wise transformations):
# Map each entry of a column to a new value
df['Category'] = df['Category'].map({'A': 1, 'B': 2, 'C': 3})
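apply() can also work row-wise on a whole DataFrame with axis=1, which is useful when a computation combines several columns. A sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({'Base': [1000, 2000], 'Bonus': [100, 50]})

# axis=1 passes each row to the function, so it can combine columns
df['Total'] = df.apply(lambda row: row['Base'] + row['Bonus'], axis=1)
print(df)
```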
🔹 5. Removing Duplicates
It is common to encounter duplicate rows in datasets. You
can remove them using drop_duplicates():
# Remove duplicate rows
df.drop_duplicates(inplace=True)
If you want to keep the first occurrence or last occurrence,
you can specify:
# Keep the first occurrence of duplicates
df.drop_duplicates(keep='first', inplace=True)

# Keep the last occurrence of duplicates
df.drop_duplicates(keep='last', inplace=True)
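drop_duplicates() also takes a subset parameter, so rows can be deduplicated on selected columns only. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'Email': ['a@x.com', 'a@x.com', 'b@x.com'],
                   'Visits': [1, 2, 3]})

# Deduplicate on 'Email' only, keeping each address's last row
deduped = df.drop_duplicates(subset=['Email'], keep='last')
print(deduped)
```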
🔹 6. Filtering Data
Pandas makes it easy to filter data based on conditions.
✅ Example: Filtering Rows Based on Conditions
# Filter rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]
print(df_filtered)
You can also apply multiple conditions using &
(and) and | (or):
# Filter rows where Age > 30 and Salary > 50000
df_filtered = df[(df['Age'] > 30) & (df['Salary'] > 50000)]
print(df_filtered)
Make sure to enclose each condition in parentheses when using & or |.
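Two related idioms worth knowing are isin() for matching against a list of values and ~ for negating a condition. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 35, 45], 'City': ['Oslo', 'Lima', 'Pune']})

# isin() matches against a list of values
print(df[df['City'].isin(['Oslo', 'Pune'])])

# ~ negates a condition: rows where Age is NOT greater than 30
print(df[~(df['Age'] > 30)])
```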
🔹 7. Summary Table

| Operation | Function/Method | Description |
| --- | --- | --- |
| Load CSV data | pd.read_csv('file.csv') | Load data from a CSV file |
| Load Excel data | pd.read_excel('file.xlsx') | Load data from an Excel sheet |
| Check for missing data | df.isna() | Check for missing (NaN) values |
| Drop missing data | df.dropna() | Remove rows or columns with NaN values |
| Fill missing data | df.fillna() | Fill missing values with a specific value |
| Change data types | df.astype() | Convert a column to a different data type |
| Remove duplicates | df.drop_duplicates() | Remove duplicate rows |
| Filter data | df[df['column'] > value] | Filter rows based on conditions |
| Apply functions to columns | df['col'].apply() | Apply a function to a column |
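The operations above are often chained into a single cleaning pipeline. A sketch combining them on a small hypothetical dataset (column names and values are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical raw data exhibiting the issues covered in this chapter
raw = pd.DataFrame({
    'Age': ['25', '31', None, '31'],
    'Salary': [50000.0, np.nan, 60000.0, np.nan],
    'Name': ['Ann', 'Bo', 'Cy', 'Bo'],
})

df = (
    raw.drop_duplicates()                               # remove duplicate rows
       .dropna(subset=['Age'])                          # drop rows missing Age
       .assign(Age=lambda d: d['Age'].astype(int),      # fix the dtype
               Salary=lambda d: d['Salary'].fillna(0))  # fill missing salaries
)
df = df[df['Age'] > 30]                                 # filter on a condition
print(df)
```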
🔹 8. Key Points Recap
- Pandas is a Python library for data manipulation and analysis, providing powerful data structures like DataFrames and Series.
- While NumPy is great for numerical operations, Pandas is designed for working with structured data, including heterogeneous data types (strings, dates, integers, etc.) in a tabular format.
- A DataFrame is a two-dimensional data structure in Pandas, similar to a table or spreadsheet, with rows and columns. It’s the core structure for working with data in Pandas.
- A Series is a one-dimensional data structure that can hold any data type (integers, strings, etc.), similar to a single column in a DataFrame.
- You can load data using functions like pd.read_csv() for CSV files, pd.read_excel() for Excel files, and pd.read_sql() for SQL databases.
- Pandas provides functions like fillna() to fill missing values, dropna() to remove rows/columns with missing data, and isna() to identify missing values.
- You can filter data using conditions; for example, df[df['Age'] > 30] filters rows where the 'Age' column is greater than 30.
- Use the groupby() function to group data by one or more columns and perform aggregations like mean(), sum(), or count().
- Pandas integrates well with Matplotlib and provides a plot() function to create basic visualizations like line charts, bar charts, and histograms.