Gathering the Right Data to Power Your Data Science Project
🧠 Introduction
Once you’ve defined your problem clearly (as discussed in
Chapter 1), the next step is to gather the right data — because a great
model built on irrelevant or low-quality data is still a bad solution.
Data collection is where the practical meets the
strategic. You need to know where to find data, how to access it,
and what format it’s in — while also understanding the ethics,
legality, and scalability of your sources.
This chapter walks you through where to find data, how to load it into Python, and what to watch out for along the way.
🔍 1. What is Data Collection?
Data collection is the process of sourcing raw
information that will be used to solve your problem. It could be structured
(CSV files, databases) or unstructured (text, images, videos).
🧩 2. Types of Data Sources

| Source Type | Description | Example |
| --- | --- | --- |
| Internal | Within the organization | CRM, transaction logs, user behavior |
| External Public | Free and open to use | Kaggle datasets, UCI ML repository |
| APIs | External services providing live data | Twitter API, OpenWeatherMap, Yelp API |
| Web Scraping | Extracting content from websites | Scraping job listings, product prices |
| IoT/Streamed | Real-time or time-series devices/systems | Sensor data, mobile app logs |
📦 3. Common Data Formats
| Format | Description | Example Tool to Read |
| --- | --- | --- |
| .csv | Comma-separated values | pd.read_csv() |
| .json | Nested key-value data | pd.read_json() |
| .xlsx | Excel file with sheets | pd.read_excel() |
| .sql | Structured query from a database | pandas.read_sql_query() (with SQLAlchemy) |
| .parquet | Optimized columnar data format | pd.read_parquet() |
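Of the formats in the table, .parquet is the only one not demonstrated later in this chapter. Here is a minimal sketch, assuming a placeholder file path and that a parquet engine such as pyarrow or fastparquet is installed:

```python
import pandas as pd

# Write a small DataFrame to Parquet, then read it back.
# Requires a parquet engine such as pyarrow or fastparquet.
df = pd.DataFrame({"customer_id": [1, 2, 3], "spend": [120.5, 80.0, 42.3]})
df.to_parquet("data/customers.parquet", index=False)

df2 = pd.read_parquet("data/customers.parquet")
print(df2.dtypes)  # column types are preserved, unlike CSV
```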
🧰 4. Reading Local Files in Python

▶ CSV Files:

```python
import pandas as pd

# Load a comma-separated file into a DataFrame and preview the first rows.
df = pd.read_csv('data/customers.csv')
df.head()
```

▶ Excel Files:

```python
# Read a specific worksheet from an Excel workbook (requires openpyxl for .xlsx).
df = pd.read_excel('data/sales.xlsx', sheet_name='2023')
```

▶ JSON Files:

```python
# Read nested key-value data; pandas flattens simple structures into columns.
df = pd.read_json('data/config.json')
```
🖧 5. Collecting Data via APIs

APIs allow you to pull real-time or fresh data from external providers.

▶ Example: OpenWeatherMap API

```python
import requests

# Replace YOUR_API_KEY with a key from openweathermap.org.
url = "http://api.openweathermap.org/data/2.5/weather?q=London&appid=YOUR_API_KEY"
response = requests.get(url)
data = response.json()

print(data['weather'][0]['description'])
```
▶ Tools: requests for calling endpoints from Python, plus Postman or curl for exploring an API before you write code.
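A slightly more defensive version of the call above lets requests build the query string and fails loudly on HTTP errors. This is a sketch; the API key is still a placeholder you would supply yourself:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: obtain a key from openweathermap.org

# Pass query parameters via params and raise on non-2xx responses.
response = requests.get(
    "http://api.openweathermap.org/data/2.5/weather",
    params={"q": "London", "appid": API_KEY},
    timeout=10,
)
response.raise_for_status()

data = response.json()
print(data["weather"][0]["description"])
```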
🌐 6. Web Scraping for Custom Data

Web scraping extracts data directly from websites. Always check the site's terms of use before scraping (a quick robots.txt check is sketched at the end of this section).

▶ Example using BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

# Download the page and parse the HTML.
url = "https://example.com/products"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <h2 class="product-title"> element.
titles = [tag.text for tag in soup.find_all('h2', class_='product-title')]
print(titles)
```
▶ Popular Libraries: BeautifulSoup for parsing HTML, Scrapy for larger crawls, and Selenium or Playwright for JavaScript-heavy pages.
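Before scraping, it is worth checking the site's robots.txt as well as its terms of use. A minimal sketch using only the standard library, with the same placeholder domain as above:

```python
import urllib.robotparser

# Parse the site's robots.txt and ask whether our crawler may fetch a path.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

allowed = rp.can_fetch("*", "https://example.com/products")
print("Allowed to scrape:", allowed)
```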
🗃 7. Extracting from Databases (SQL)

For internal enterprise data, you'll often query a database.

▶ Example using SQLite:

```python
import sqlite3
import pandas as pd

# Open the database file and pull the orders table into a DataFrame.
conn = sqlite3.connect('sales.db')
df = pd.read_sql_query("SELECT * FROM orders", conn)
df.head()
```
▶ Other tools: SQLAlchemy for engine and connection management, psycopg2 for PostgreSQL, and mysql-connector-python for MySQL.
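For client-server databases such as PostgreSQL, the same pandas call works through a SQLAlchemy engine. A sketch, assuming a hypothetical connection string and orders table:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string: user, password, host, and database name
# would come from your own environment or a secrets store.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/sales")

df = pd.read_sql_query("SELECT * FROM orders WHERE order_date >= '2023-01-01'", engine)
print(df.head())
```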
☁️ 8. Cloud-Based Data Sources
Data can also live in the cloud, for example in Amazon S3, Google Cloud Storage, or Azure Blob Storage.
▶ Example: Reading from S3 using boto3

```python
import boto3
import pandas as pd

# Download the object to a local file, then load it with pandas.
s3 = boto3.client('s3')
s3.download_file('bucket-name', 'folder/file.csv', 'local_file.csv')

df = pd.read_csv('local_file.csv')
```
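If the s3fs package is installed, pandas can also read directly from an S3 URI without the explicit download step. A sketch with the same placeholder bucket and key:

```python
import pandas as pd

# Requires the s3fs package; credentials are picked up from the usual
# AWS environment variables or ~/.aws configuration.
df = pd.read_csv("s3://bucket-name/folder/file.csv")
print(df.shape)
```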
📋 9. Considerations When Acquiring Data

| Concern | What to Watch For |
| --- | --- |
| Privacy | Is the data PII? Is consent required? |
| Bias | Is the sample representative? |
| Volume | Can your machine handle the size? |
| Refreshability | Will the data change over time? |
| Legality | Are you allowed to use/scrape this data? |
🧠 10. Best Practices for Data Acquisition

| Best Practice | Why It Helps |
| --- | --- |
| Document data sources | For reproducibility and credibility |
| Check data freshness | Especially important for time-sensitive tasks |
| Validate schema/data types | Prevents downstream bugs |
| Limit scraping frequency | Avoid getting blocked or rate-limited |
| Automate data ingestion (pipelines) | For production-scale work |
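Validating the schema and data types (from the table above) can be as simple as a few assertions right after loading. A minimal sketch, using the customers.csv file from Section 4 with hypothetical expected columns:

```python
import pandas as pd

EXPECTED_DTYPES = {  # hypothetical schema for data/customers.csv
    "customer_id": "int64",
    "signup_date": "object",
    "monthly_spend": "float64",
}

df = pd.read_csv("data/customers.csv")

# Fail fast if columns are missing or have drifted to unexpected types.
missing = set(EXPECTED_DTYPES) - set(df.columns)
assert not missing, f"Missing columns: {missing}"

for col, expected in EXPECTED_DTYPES.items():
    actual = str(df[col].dtype)
    assert actual == expected, f"{col}: expected {expected}, got {actual}"
```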
✅ Case Study Example: Acquiring Customer Data

Scenario:
A retail business wants to predict customer churn. You’ve been given access to internal CRM data and user interaction logs. You also need supplementary demographic data from an open API.

Breakdown:

| Source | Type | Method | Tool |
| --- | --- | --- | --- |
| CRM Database | Internal | SQL Query | psycopg2 |
| Clickstream | Internal | CSV Logs | pandas.read_csv() |
| Demographics | External | API | requests, json |
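A sketch of how the three sources in the breakdown might come together into one modeling table. All connection details, table and column names, and the demographics endpoint are hypothetical placeholders:

```python
import pandas as pd
import psycopg2
import requests

# 1. Internal CRM data via SQL (connection details are placeholders).
conn = psycopg2.connect(host="crm-db.internal", dbname="crm", user="analyst", password="...")
crm = pd.read_sql_query(
    "SELECT customer_id, region, plan, tenure_months, churned FROM customers", conn
)

# 2. Clickstream logs exported as CSV, aggregated per customer.
clicks = pd.read_csv("data/clickstream_2023.csv")
click_counts = clicks.groupby("customer_id").size().rename("click_count").reset_index()

# 3. Supplementary demographics from a (hypothetical) open API returning a list of records.
resp = requests.get("https://api.example.com/demographics", params={"region": "UK"}, timeout=10)
resp.raise_for_status()
demo = pd.DataFrame(resp.json())

# Join everything into a single table keyed on customer and region.
dataset = (
    crm.merge(click_counts, on="customer_id", how="left")
       .merge(demo, on="region", how="left")
)
print(dataset.head())
```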
🔄 Automation for Regular Collection

Set up scheduled scripts to fetch and clean data periodically.

```bash
# Use cron (Linux) or Task Scheduler (Windows)
0 * * * * /usr/bin/python3 /home/project/scripts/fetch_data.py
```
You can also integrate orchestration tools such as Apache Airflow, Prefect, or Luigi for more complex pipelines.
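A sketch of what the fetch_data.py script referenced in the cron line might contain: pull fresh data from an API and write it to a dated CSV. The endpoint, key, cities, and output path are placeholders:

```python
# fetch_data.py - hypothetical sketch of the script scheduled above
import datetime

import pandas as pd
import requests

API_KEY = "YOUR_API_KEY"  # placeholder


def fetch_weather(city: str) -> dict:
    """Fetch current weather for one city and return a flat record."""
    resp = requests.get(
        "http://api.openweathermap.org/data/2.5/weather",
        params={"q": city, "appid": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    return {
        "city": city,
        "description": payload["weather"][0]["description"],
        "fetched_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    records = [fetch_weather(c) for c in ["London", "Paris", "Berlin"]]
    out_path = f"data/weather_{datetime.date.today():%Y%m%d}.csv"
    pd.DataFrame(records).to_csv(out_path, index=False)
```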
📊 Summary Table: Data Acquisition Methods

| Method | Tool/Lib Used | Ideal For |
| --- | --- | --- |
| CSV/Excel | pandas | Local or exported tabular data |
| SQL Database | SQLAlchemy, SQLite | Internal databases |
| API | requests, json | Real-time or external public data |
| Web Scraping | BeautifulSoup | Unstructured webpage content |
| Cloud Buckets | boto3, gcsfs | Production-scale storage |
❓ FAQs: The Data Science Workflow

Q: What is the data science workflow?
A: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.

Q: Do the steps have to be followed in strict order?
A: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.

Q: How is data cleaning different from exploratory data analysis (EDA)?
A: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.

Q: Should I engineer features before building a first model?
A: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.

Q: Which tools are commonly used for modeling and evaluation?
A: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.
Q: How do I choose an evaluation metric?
A: It depends on the problem: classification tasks typically use accuracy, precision, recall, or F1-score, while regression tasks typically use RMSE, MAE, or R².

Q: How do I deploy a model?
A: Start with lightweight options like a Flask or FastAPI service, or a Streamlit app, before moving to heavier production infrastructure.
Q: How do I monitor a model after deployment?
A: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.

Q: Is it okay to stop after model evaluation without deploying?
A: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.

Q: How can I practice the full workflow?
A: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.