Mastering Pandas in Python: Data Analysis and Manipulation Made Easy

780 0 0 0 0

Chapter 4: Data Exploration and Transformation in Pandas

🔹 1. Introduction

After you’ve loaded and cleaned your data, the next essential step in any data analysis process is data exploration and transformation. This phase allows you to gain insights into the structure and characteristics of the data and prepare it for further analysis or modeling.

In this chapter, we’ll explore various Pandas features that allow you to:

  • Calculate descriptive statistics for data summarization
  • Group and aggregate data
  • Reshape data using pivoting and stacking
  • Apply functions to columns and rows

By mastering these techniques, you will be able to manipulate and analyze data efficiently, ensuring you can extract meaningful insights for decision-making.


🔹 2. Descriptive Statistics in Pandas

Pandas makes it easy to compute descriptive statistics that summarize the central tendency, spread, and shape of data distributions. The describe() function is a powerful method that provides summary statistics for numerical columns.

describe() Method

import pandas as pd

 

# Sample data

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],

        'Age': [24, 30, 35, 40],

        'Salary': [55000, 60000, 65000, 70000]}

 

df = pd.DataFrame(data)

 

# Descriptive statistics

print(df.describe())

Output:

Statistic

Age

Salary

count

4.0

4.0

mean

32.25

63750.0

std

6.23

6025.0

min

24.0

55000.0

25%

27.0

57500.0

50%

32.5

62500.0

75%

36.0

67500.0

max

40.0

70000.0

 

Here, describe() calculates:

  • count: Number of non-null entries
  • mean: Average value
  • std: Standard deviation
  • min: Minimum value
  • 25%, 50%, 75%: Percentiles (quartiles)
  • max: Maximum value

Individual Statistics

You can also compute specific statistics using dedicated functions:

# Mean

mean_age = df['Age'].mean()

 

# Median

median_salary = df['Salary'].median()

 

# Standard deviation

std_salary = df['Salary'].std()

 

print(f"Mean Age: {mean_age}, Median Salary: {median_salary}, Std Salary: {std_salary}")


🔹 3. Grouping and Aggregating Data

Grouping allows you to split data into subsets based on some criteria and perform aggregation (summarizing) operations on each subset. This is useful when you want to analyze data by categories, such as grouping sales by region or employees by department.

Example: Grouping by Category

import pandas as pd

 

# Sample data

data = {'Department': ['HR', 'IT', 'IT', 'HR'],

        'Employee': ['Alice', 'Bob', 'Charlie', 'David'],

        'Salary': [55000, 60000, 65000, 70000]}

 

df = pd.DataFrame(data)

 

# Group by 'Department' and calculate mean salary

grouped = df.groupby('Department')['Salary'].mean()

print(grouped)

Output:

Department


HR

57500.0

IT

62500.0

Name: Salary, dtype: float64

Here, the groupby() function groups the data by the Department column and then calculates the mean salary for each group.

Multiple Aggregations

You can apply multiple aggregation functions like sum(), mean(), count(), etc., on grouped data:

# Group by 'Department' and apply multiple aggregation functions

grouped = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])

print(grouped)

Output:

Department

mean

sum

count

HR

57500.0

115000

2

IT

62500.0

125000

2

 


🔹 4. Reshaping Data: Pivoting, Melting, and Stacking

Sometimes, the way data is structured may not be ideal for analysis. Pandas provides several methods to reshape data, making it more accessible for analysis.

Pivoting Data

Pivoting allows you to reshape data into a new table by selecting specific rows and columns.

import pandas as pd

 

# Sample data

data = {'Department': ['HR', 'HR', 'IT', 'IT'],

        'Employee': ['Alice', 'Bob', 'Charlie', 'David'],

        'Salary': [55000, 60000, 65000, 70000]}

 

df = pd.DataFrame(data)

 

# Pivot data (department as index, employee as columns)

pivot_df = df.pivot(index='Department', columns='Employee', values='Salary')

print(pivot_df)

Output:

Employee

Department

Alice

Bob

Charlie

David

HR

55000

60000

NaN

NaN

IT

NaN

NaN

65000

70000

Melting Data

Melting is the reverse of pivoting — it unpivots data, making it long-format.

# Melt the pivoted DataFrame

melted_df = pivot_df.reset_index().melt(id_vars=['Department'], value_vars=['Alice', 'Bob', 'Charlie', 'David'])

print(melted_df)

Output:

Department

variable

value

HR

Alice

55000.0

IT

Alice

NaN

HR

Bob

60000.0

IT

Bob

NaN

HR

Charlie

NaN

IT

Charlie

65000.0

HR

David

NaN

IT

David

70000.0

 

Stacking and Unstacking Data

You can also stack and unstack the rows and columns in a DataFrame:

stacked_df = df.set_index(['Department', 'Employee']).stack()

print(stacked_df)

Output:

Department

Employee


HR

Alice

55000


Bob

60000

IT

Charlie

65000


David

70000

dtype: int64


🔹 5. Applying Functions to Columns and Rows

Pandas provides apply(), applymap(), and map() methods to apply functions element-wise to rows, columns, or entire DataFrames.

Applying Functions to Columns

You can apply a function across a column using apply():

df['Salary'] = df['Salary'].apply(lambda x: x * 1.1)  # Apply a 10% increase to salary

Applying Functions to Rows

Use axis=1 to apply a function across rows:

df['Salary_after_tax'] = df.apply(lambda row: row['Salary'] * 0.9, axis=1)  # Apply a tax deduction

Applying Functions to Entire DataFrame

You can also apply a function to the entire DataFrame using applymap() (works only on DataFrames with numerical data):

df = df.applymap(lambda x: x * 2)  # Double all values


🔹 6. Summary Table


Operation

Function/Method

Description

Calculate mean, median, etc.

df.describe()

Summarize statistics for numerical columns

Group by category

df.groupby()

Group data and apply aggregation functions

Reshaping with pivoting

df.pivot()

Reshape data based on columns and index

Unpivot with melting

df.melt()

Convert wide data into long format

Apply function across columns

df['column'].apply()

Apply a function to a column

Apply function across rows

df.apply() with axis=1

Apply a function across rows

Apply a function to entire DataFrame

df.applymap()

Apply a function to each element of DataFrame

Back

FAQs


1. What is Pandas in Python?

Pandas is a Python library for data manipulation and analysis, providing powerful data structures like DataFrames and Series.

2. How does Pandas differ from NumPy?

While NumPy is great for numerical operations, Pandas is designed for working with structured data, including heterogeneous data types (strings, dates, integers, etc.) in a tabular format

3. What is a DataFrame in Pandas?

A DataFrame is a two-dimensional data structure in Pandas, similar to a table or spreadsheet, with rows and columns. It’s the core structure for working with data in Pandas.

4. What is a Series in Pandas?

A Series is a one-dimensional data structure that can hold any data type (integers, strings, etc.), similar to a single column in a DataFrame.

5. How do I load data into Pandas?

You can load data using functions like pd.read_csv() for CSV files, pd.read_excel() for Excel files, and pd.read_sql() for SQL databases.

6. Can I clean missing data with Pandas?

Yes Pandas provides functions like fillna() to fill missing values, dropna() to remove rows/columns with missing data, and isna() to identify missing values.

7. How do I filter data in Pandas?

You can filter data using conditions. For example: df[df['Age'] > 30] filters rows where the 'Age' column is greater than 30.

8. Can I group and aggregate data in Pandas?

Yes use the groupby() function to group data by one or more columns and perform aggregations like mean(), sum(), or count().

9. How can I visualize data in Pandas?

Pandas integrates well with Matplotlib and provides a plot() function to create basic visualizations like line charts, bar charts, and histograms