🔹 1. Introduction
After you’ve loaded and cleaned your data, the next essential step in any data analysis process is data exploration and transformation. This phase lets you understand the structure and characteristics of the data and prepare it for further analysis or modeling.
In this chapter, we’ll explore Pandas features that allow you to:
- Compute descriptive statistics
- Group and aggregate data
- Reshape data with pivoting, melting, and stacking
- Apply functions to columns, rows, and entire DataFrames
By mastering these techniques, you will be able to manipulate and analyze data efficiently and extract meaningful insights for decision-making.
🔹 2. Descriptive Statistics in Pandas
Pandas makes it easy to compute descriptive statistics that summarize the central tendency, spread, and shape of data distributions. The describe() method provides summary statistics for numerical columns.
✅ describe() Method
import pandas as pd

# Sample data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [24, 30, 35, 40],
        'Salary': [55000, 60000, 65000, 70000]}
df = pd.DataFrame(data)

# Descriptive statistics
print(df.describe())
Output:
             Age        Salary
count   4.000000      4.000000
mean   32.250000  62500.000000
std     6.849574   6454.972244
min    24.000000  55000.000000
25%    28.500000  58750.000000
50%    32.500000  62500.000000
75%    36.250000  66250.000000
max    40.000000  70000.000000
Here, describe() calculates the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum for each numerical column.
✅ Individual Statistics
You can also compute specific statistics using dedicated functions:
# Mean
mean_age = df['Age'].mean()

# Median
median_salary = df['Salary'].median()

# Standard deviation
std_salary = df['Salary'].std()

print(f"Mean Age: {mean_age}, Median Salary: {median_salary}, Std Salary: {std_salary}")
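By default, describe() only summarizes numeric columns. As a small sketch (reusing the sample data above), passing include='all' also profiles non-numeric columns such as Name:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [24, 30, 35, 40],
        'Salary': [55000, 60000, 65000, 70000]}
df = pd.DataFrame(data)

# include='all' adds count/unique/top/freq rows for object columns;
# statistics that don't apply to a column are shown as NaN
summary = df.describe(include='all')
print(summary)
```

For the Name column this reports 4 unique values, while the numeric rows (mean, std, quartiles) remain NaN for it.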
🔹 3. Grouping and Aggregating Data
Grouping splits data into subsets based on some criteria and performs aggregation (summarizing) operations on each subset. This is useful when you want to analyze data by category, such as grouping sales by region or employees by department.
✅ Example: Grouping by Category
import pandas as pd

# Sample data
data = {'Department': ['HR', 'HR', 'IT', 'IT'],
        'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
        'Salary': [55000, 60000, 65000, 70000]}
df = pd.DataFrame(data)

# Group by 'Department' and calculate mean salary
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)
Output:
Department
HR    57500.0
IT    62500.0
Name: Salary, dtype: float64
Here, the groupby() function groups the data by the Department
column and then calculates the mean salary for each group.
✅ Multiple Aggregations
You can apply multiple aggregation functions like sum(),
mean(), count(), etc., on grouped data:
# Group by 'Department' and apply multiple aggregation functions
grouped = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
print(grouped)
Output:
               mean     sum  count
Department                        
HR          57500.0  115000      2
IT          62500.0  125000      2
🔹 4. Reshaping Data: Pivoting, Melting, and Stacking
Sometimes the way data is structured is not ideal for analysis. Pandas provides several methods to reshape data, making it more accessible for analysis.
✅ Pivoting Data
Pivoting reshapes data into a new table by turning the unique values of one column into new columns.
import pandas as pd

# Sample data
data = {'Department': ['HR', 'HR', 'IT', 'IT'],
        'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
        'Salary': [55000, 60000, 65000, 70000]}
df = pd.DataFrame(data)

# Pivot data (department as index, employee as columns)
pivot_df = df.pivot(index='Department', columns='Employee', values='Salary')
print(pivot_df)
Output (the salaries become floats because the missing combinations are NaN):
Employee      Alice      Bob  Charlie    David
Department                                    
HR          55000.0  60000.0      NaN      NaN
IT              NaN      NaN  65000.0  70000.0
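pivot() raises a ValueError when the same index/column pair appears more than once. pivot_table() handles duplicates by aggregating them. A hedged sketch with a hypothetical 'Level' column added to illustrate the duplicate case:

```python
import pandas as pd

# Two salary records share the same Department/Level pair (HR, Junior)
data = {'Department': ['HR', 'HR', 'IT', 'IT'],
        'Level': ['Junior', 'Junior', 'Junior', 'Senior'],
        'Salary': [50000, 54000, 60000, 80000]}
df = pd.DataFrame(data)

# pivot_table aggregates the duplicates (here: the two HR/Junior rows are averaged)
pivoted = df.pivot_table(index='Department', columns='Level',
                         values='Salary', aggfunc='mean')
print(pivoted)
```

The HR/Junior cell becomes 52000.0 (the mean of 50000 and 54000), while combinations with no data, such as HR/Senior, are NaN.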
✅ Melting Data
Melting is the reverse of pivoting: it unpivots data back into long format.
# Melt the pivoted DataFrame back into long format
melted_df = pivot_df.reset_index().melt(id_vars=['Department'],
                                        value_vars=['Alice', 'Bob', 'Charlie', 'David'])
print(melted_df)
Output (because the pivoted columns carry the name Employee, melt() uses it as the variable column's name):
  Department Employee    value
0         HR    Alice  55000.0
1         IT    Alice      NaN
2         HR      Bob  60000.0
3         IT      Bob      NaN
4         HR  Charlie      NaN
5         IT  Charlie  65000.0
6         HR    David      NaN
7         IT    David  70000.0
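The names of the two generated columns can be set explicitly with melt()'s var_name and value_name parameters. A small self-contained sketch (the wide table here is a made-up two-employee example):

```python
import pandas as pd

wide = pd.DataFrame({'Department': ['HR', 'IT'],
                     'Alice': [55000, None],
                     'Charlie': [None, 65000]})

# var_name/value_name rename the generated 'variable' and 'value' columns
long = wide.melt(id_vars='Department',
                 var_name='Employee', value_name='Salary')
print(long)
```

This produces a long-format frame with the columns Department, Employee, and Salary instead of the default variable/value labels.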
✅ Stacking and Unstacking Data
You can also stack and unstack the rows and columns of a DataFrame:
stacked_df = df.set_index(['Department', 'Employee']).stack()
print(stacked_df)
Output (stack() moves the remaining column label, Salary, into the innermost index level):
Department  Employee        
HR          Alice     Salary    55000
            Bob       Salary    60000
IT          Charlie   Salary    65000
            David     Salary    70000
dtype: int64
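unstack() goes the other way: it pivots the innermost index level back out into columns. A self-contained sketch using the same sample data, assuming we unstack the Employee level:

```python
import pandas as pd

data = {'Department': ['HR', 'HR', 'IT', 'IT'],
        'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
        'Salary': [55000, 60000, 65000, 70000]}
df = pd.DataFrame(data)

indexed = df.set_index(['Department', 'Employee'])

# Move the innermost index level ('Employee') into columns;
# Department/Employee pairs with no data become NaN
unstacked = indexed['Salary'].unstack()
print(unstacked)
```

The result is the same wide table that pivot() produced earlier: one row per department, one column per employee.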
🔹 5. Applying Functions to Columns and Rows
Pandas provides the apply(), applymap(), and map() methods to apply functions element-wise or along rows and columns of a Series or DataFrame.
✅ Applying Functions to Columns
You can apply a function across a column using apply():
df['Salary'] = df['Salary'].apply(lambda x: x * 1.1)  # Apply a 10% increase to salary
✅ Applying Functions to Rows
Use axis=1 to apply a function across rows:
df['Salary_after_tax'] = df.apply(lambda row: row['Salary'] * 0.9, axis=1)  # Apply a tax deduction
✅ Applying Functions to the Entire DataFrame
You can apply a function to every element of a DataFrame using applymap(). It is applied to all cells, including non-numeric ones, so select the numeric columns first if the function only makes sense for numbers (note that in pandas 2.1+, applymap() is deprecated in favour of DataFrame.map()):
df[['Salary']] = df[['Salary']].applymap(lambda x: x * 2)  # Double all salary values
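The map() method mentioned above works on a single Series and is especially handy for value lookups with a dict. A small sketch on the sample data, with the Dept_full column name and the expansion dict chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Department': ['HR', 'IT', 'IT', 'HR'],
                   'Salary': [55000, 60000, 65000, 70000]})

# Series.map with a dict: look up each value; values missing
# from the dict become NaN
df['Dept_full'] = df['Department'].map({'HR': 'Human Resources',
                                        'IT': 'Information Technology'})
print(df)
```

A dict lookup like this is often clearer (and faster) than an equivalent apply() with a lambda full of if/else branches.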
🔹 6. Summary Table
Operation | Function/Method | Description
Calculate mean, median, etc. | df.describe() | Summarize statistics for numerical columns
Group by category | df.groupby() | Group data and apply aggregation functions
Reshape with pivoting | df.pivot() | Reshape data based on columns and index
Unpivot with melting | df.melt() | Convert wide data into long format
Apply function across columns | df['column'].apply() | Apply a function to a column
Apply function across rows | df.apply() with axis=1 | Apply a function across rows
Apply function to entire DataFrame | df.applymap() | Apply a function to each element of a DataFrame
🔹 7. Frequently Asked Questions
Q: What is Pandas?
A: Pandas is a Python library for data manipulation and analysis, providing powerful data structures like DataFrames and Series.
Q: How is Pandas different from NumPy?
A: While NumPy is great for numerical operations, Pandas is designed for working with structured data, including heterogeneous data types (strings, dates, integers, etc.) in a tabular format.
Q: What is a DataFrame?
A: A DataFrame is a two-dimensional data structure in Pandas, similar to a table or spreadsheet, with rows and columns. It’s the core structure for working with data in Pandas.
Q: What is a Series?
A: A Series is a one-dimensional data structure that can hold any data type (integers, strings, etc.), similar to a single column in a DataFrame.
Q: How do you load data into Pandas?
A: You can load data using functions like pd.read_csv() for CSV files, pd.read_excel() for Excel files, and pd.read_sql() for SQL databases.
Q: Can Pandas handle missing data?
A: Yes. Pandas provides functions like fillna() to fill missing values, dropna() to remove rows/columns with missing data, and isna() to identify missing values.
Q: How do you filter rows?
A: You can filter data using conditions. For example, df[df['Age'] > 30] selects rows where the 'Age' column is greater than 30.
Q: Can you group and aggregate data?
A: Yes. Use the groupby() function to group data by one or more columns and perform aggregations like mean(), sum(), or count().
Q: Can Pandas create visualizations?
A: Pandas integrates well with Matplotlib and provides a plot() function to create basic visualizations like line charts, bar charts, and histograms.