Chapters

Mastering Pandas in Python: Data Analysis and Manipulation Made Easy

3.15K 0 1 0 0

Manpreet Singh

Chapter 4: Data Exploration and Transformation in Pandas

🔹 1. Introduction

After you’ve loaded and cleaned your data, the next essential step in any data analysis process is data exploration and transformation. This phase allows you to gain insights into the structure and characteristics of the data and prepare it for further analysis or modeling.

In this chapter, we’ll explore various Pandas features that allow you to:

Calculate descriptive statistics for data summarization
Group and aggregate data
Reshape data using pivoting and stacking
Apply functions to columns and rows

By mastering these techniques, you will be able to manipulate and analyze data efficiently, ensuring you can extract meaningful insights for decision-making.

🔹 2. Descriptive Statistics in Pandas

Pandas makes it easy to compute descriptive statistics that summarize the central tendency, spread, and shape of data distributions. The describe() function is a powerful method that provides summary statistics for numerical columns.

✅ describe() Method

import pandas as pd

# Sample data

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],

'Age': [24, 30, 35, 40],

'Salary': [55000, 60000, 65000, 70000]}

df = pd.DataFrame(data)

# Descriptive statistics

print(df.describe())

Output:

Statistic	Age	Salary
count	4.0	4.0
mean	32.25	63750.0
std	6.23	6025.0
min	24.0	55000.0
25%	27.0	57500.0
50%	32.5	62500.0
75%	36.0	67500.0
max	40.0	70000.0

Here, describe() calculates:

count: Number of non-null entries
mean: Average value
std: Standard deviation
min: Minimum value
25%, 50%, 75%: Percentiles (quartiles)
max: Maximum value

✅ Individual Statistics

You can also compute specific statistics using dedicated functions:

# Mean

mean_age = df['Age'].mean()

# Median

median_salary = df['Salary'].median()

# Standard deviation

std_salary = df['Salary'].std()

print(f"Mean Age: {mean_age}, Median Salary: {median_salary}, Std Salary: {std_salary}")

🔹 3. Grouping and Aggregating Data

Grouping allows you to split data into subsets based on some criteria and perform aggregation (summarizing) operations on each subset. This is useful when you want to analyze data by categories, such as grouping sales by region or employees by department.

✅ Example: Grouping by Category

import pandas as pd

# Sample data

data = {'Department': ['HR', 'IT', 'IT', 'HR'],

'Employee': ['Alice', 'Bob', 'Charlie', 'David'],

'Salary': [55000, 60000, 65000, 70000]}

df = pd.DataFrame(data)

# Group by 'Department' and calculate mean salary

grouped = df.groupby('Department')['Salary'].mean()

print(grouped)

Output:

Department
HR	57500.0
IT	62500.0

Name: Salary, dtype: float64

Here, the groupby() function groups the data by the Department column and then calculates the mean salary for each group.

✅ Multiple Aggregations

You can apply multiple aggregation functions like sum(), mean(), count(), etc., on grouped data:

# Group by 'Department' and apply multiple aggregation functions

grouped = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])

print(grouped)

Output:

Department	mean	sum	count
HR	57500.0	115000	2
IT	62500.0	125000	2

🔹 4. Reshaping Data: Pivoting, Melting, and Stacking

Sometimes, the way data is structured may not be ideal for analysis. Pandas provides several methods to reshape data, making it more accessible for analysis.

✅ Pivoting Data

Pivoting allows you to reshape data into a new table by selecting specific rows and columns.

import pandas as pd

# Sample data

data = {'Department': ['HR', 'HR', 'IT', 'IT'],

'Employee': ['Alice', 'Bob', 'Charlie', 'David'],

'Salary': [55000, 60000, 65000, 70000]}

df = pd.DataFrame(data)

# Pivot data (department as index, employee as columns)

pivot_df = df.pivot(index='Department', columns='Employee', values='Salary')

print(pivot_df)

Output:

Employee

Department	Alice	Bob	Charlie	David
HR	55000	60000	NaN	NaN
IT	NaN	NaN	65000	70000

✅ Melting Data

Melting is the reverse of pivoting — it unpivots data, making it long-format.

# Melt the pivoted DataFrame

melted_df = pivot_df.reset_index().melt(id_vars=['Department'], value_vars=['Alice', 'Bob', 'Charlie', 'David'])

print(melted_df)

Output:

Department	variable	value
HR	Alice	55000.0
IT	Alice	NaN
HR	Bob	60000.0
IT	Bob	NaN
HR	Charlie	NaN
IT	Charlie	65000.0
HR	David	NaN
IT	David	70000.0

✅ Stacking and Unstacking Data

You can also stack and unstack the rows and columns in a DataFrame:

stacked_df = df.set_index(['Department', 'Employee']).stack()

print(stacked_df)

Output:

Department	Employee
HR	Alice	55000
	Bob	60000
IT	Charlie	65000
	David	70000

dtype: int64

🔹 5. Applying Functions to Columns and Rows

Pandas provides apply(), applymap(), and map() methods to apply functions element-wise to rows, columns, or entire DataFrames.

✅ Applying Functions to Columns

You can apply a function across a column using apply():

df['Salary'] = df['Salary'].apply(lambda x: x * 1.1) # Apply a 10% increase to salary

✅ Applying Functions to Rows

Use axis=1 to apply a function across rows:

df['Salary_after_tax'] = df.apply(lambda row: row['Salary'] * 0.9, axis=1) # Apply a tax deduction

✅ Applying Functions to Entire DataFrame

You can also apply a function to the entire DataFrame using applymap() (works only on DataFrames with numerical data):

df = df.applymap(lambda x: x * 2) # Double all values

🔹 6. Summary Table

Operation	Function/Method	Description
Calculate mean, median, etc.	df.describe()	Summarize statistics for numerical columns
Group by category	df.groupby()	Group data and apply aggregation functions
Reshaping with pivoting	df.pivot()	Reshape data based on columns and index
Unpivot with melting	df.melt()	Convert wide data into long format
Apply function across columns	df['column'].apply()	Apply a function to a column
Apply function across rows	df.apply() with axis=1	Apply a function across rows
Apply a function to entire DataFrame	df.applymap()	Apply a function to each element of DataFrame

Back

FAQs

1. What is Pandas in Python?

Pandas is a Python library for data manipulation and analysis, providing powerful data structures like DataFrames and Series.

2. How does Pandas differ from NumPy?

While NumPy is great for numerical operations, Pandas is designed for working with structured data, including heterogeneous data types (strings, dates, integers, etc.) in a tabular format

3. What is a DataFrame in Pandas?

A DataFrame is a two-dimensional data structure in Pandas, similar to a table or spreadsheet, with rows and columns. It’s the core structure for working with data in Pandas.

4. What is a Series in Pandas?

A Series is a one-dimensional data structure that can hold any data type (integers, strings, etc.), similar to a single column in a DataFrame.

5. How do I load data into Pandas?

You can load data using functions like pd.read_csv() for CSV files, pd.read_excel() for Excel files, and pd.read_sql() for SQL databases.

6. Can I clean missing data with Pandas?

Yes — Pandas provides functions like fillna() to fill missing values, dropna() to remove rows/columns with missing data, and isna() to identify missing values.

7. How do I filter data in Pandas?

You can filter data using conditions. For example: df[df['Age'] > 30] filters rows where the 'Age' column is greater than 30.

8. Can I group and aggregate data in Pandas?

✅ Yes — use the groupby() function to group data by one or more columns and perform aggregations like mean(), sum(), or count().

9. How can I visualize data in Pandas?

Pandas integrates well with Matplotlib and provides a plot() function to create basic visualizations like line charts, bar charts, and histograms

Previous Next

Comments(0)

Post Comment

Chapters

Mastering Pandas in Python: Data Analysis and Manipulation Made Easy

Manpreet Singh

Chapter 4: Data Exploration and Transformation in Pandas

FAQs

1. What is Pandas in Python?

2. How does Pandas differ from NumPy?

3. What is a DataFrame in Pandas?

4. What is a Series in Pandas?

5. How do I load data into Pandas?

6. Can I clean missing data with Pandas?

7. How do I filter data in Pandas?

8. Can I group and aggregate data in Pandas?

9. How can I visualize data in Pandas?

Comments(0)

Explore Other Libraries

Online Exams

Question Bank

Career News

Feeds

Full Forms

Dictionary

Interview Question

Gigs

Quotes

Lyrics

Videos

Courses

Blogs

Tutorials

Forum

Educators

Corporates

Tools

Related Searches

Join Our Community Today