Data Science Workflow: From Problem to Solution – A Complete Step-by-Step Journey for Beginners

📗 Chapter 2: Data Collection and Acquisition

Gathering the Right Data to Power Your Data Science Project


🧠 Introduction

Once you’ve defined your problem clearly (as discussed in Chapter 1), the next step is to gather the right data — because a great model built on irrelevant or low-quality data is still a bad solution.

Data collection is where the practical meets the strategic. You need to know where to find data, how to access it, and what format it’s in — while also understanding the ethics, legality, and scalability of your sources.

This chapter will walk you through:

  • Types of data and their sources
  • Methods of acquiring data (APIs, web scraping, databases, etc.)
  • Loading data in Python using pandas
  • Common challenges in data acquisition
  • Real-world examples and best practices

🔍 1. What is Data Collection?

Data collection is the process of sourcing raw information that will be used to solve your problem. It could be structured (CSV files, databases) or unstructured (text, images, videos).


🧩 2. Types of Data Sources

| Source Type | Description | Example |
| --- | --- | --- |
| Internal | Within the organization | CRM, transaction logs, user behavior |
| External Public | Free and open to use | Kaggle datasets, UCI ML Repository |
| APIs | External services providing live data | Twitter API, OpenWeatherMap, Yelp API |
| Web Scraping | Extracting content from websites | Scraping job listings, product prices |
| IoT/Streamed | Real-time or time-series devices/systems | Sensor data, mobile app logs |


📦 3. Common Data Formats

| Format | Description | Example Tool to Read |
| --- | --- | --- |
| .csv | Comma-separated values | pd.read_csv() |
| .json | Nested key-value data | pd.read_json() |
| .xlsx | Excel file with sheets | pd.read_excel() |
| .sql | Structured query from a database | pd.read_sql_query() (with SQLAlchemy) |
| .parquet | Optimized columnar data format | pd.read_parquet() |


🧰 4. Reading Local Files in Python

CSV Files:

```python
import pandas as pd

# Load a CSV file into a DataFrame and preview the first rows
df = pd.read_csv('data/customers.csv')
df.head()
```

Excel Files:

```python
# Read a specific sheet from an Excel workbook
df = pd.read_excel('data/sales.xlsx', sheet_name='2023')
```

JSON Files:

```python
# Read a JSON file into a DataFrame
df = pd.read_json('data/config.json')
```
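The format table in Section 3 also lists Parquet, which deserves a quick example of its own. A minimal sketch, assuming a hypothetical file path and that a Parquet engine such as pyarrow is installed:

```python
import pandas as pd

# Parquet is a columnar format; pandas needs a Parquet engine
# (pyarrow or fastparquet) installed to read it.
df = pd.read_parquet('data/events.parquet')  # hypothetical path
df.head()
```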


🖧 5. Collecting Data via APIs

APIs allow you to pull real-time or fresh data from external providers.

Example: OpenWeatherMap API

```python
import requests

# Request current weather for London (replace YOUR_API_KEY with your own key)
url = "http://api.openweathermap.org/data/2.5/weather?q=London&appid=YOUR_API_KEY"
response = requests.get(url)
data = response.json()

print(data['weather'][0]['description'])
```

Tools:

  • requests, http.client
  • json for parsing
  • API keys and authentication
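Many APIs expect the key as a query parameter or header, and it is safer to check the response status before parsing. Here is a minimal sketch building on the OpenWeatherMap example above, with a placeholder API key:

```python
import requests

# Pass the city and API key as query parameters instead of hard-coding the URL
params = {"q": "London", "appid": "YOUR_API_KEY"}  # placeholder key
response = requests.get("http://api.openweathermap.org/data/2.5/weather", params=params)

# Raise an error for 4xx/5xx responses rather than parsing a failed reply
response.raise_for_status()
data = response.json()
print(data["weather"][0]["description"])
```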

🌐 6. Web Scraping for Custom Data

Web scraping extracts data directly from websites. Always check site terms of use before scraping.

Example using BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
url = "https://example.com/products"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <h2 class="product-title"> element
titles = [tag.text for tag in soup.find_all('h2', class_='product-title')]
print(titles)
```

Popular Libraries:

  • BeautifulSoup
  • Scrapy
  • Selenium (for dynamic pages)
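To respect the "limit scraping frequency" best practice covered later in this chapter, it helps to pause between requests and identify your client. A minimal sketch, assuming a hypothetical paginated listing:

```python
import time

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "data-collection-tutorial/0.1"}  # identify your scraper
all_titles = []

# Hypothetical paginated product listing: pages 1 to 3
for page in range(1, 4):
    url = f"https://example.com/products?page={page}"
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, "html.parser")
    all_titles += [tag.text for tag in soup.find_all("h2", class_="product-title")]
    time.sleep(2)  # pause between requests to avoid overloading the site

print(all_titles)
```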

🗃 7. Extracting from Databases (SQL)

For internal enterprise data, you’ll often query a database.

Example using SQLite:

```python
import sqlite3

import pandas as pd

# Connect to a local SQLite database and load a table into a DataFrame
conn = sqlite3.connect('sales.db')
df = pd.read_sql_query("SELECT * FROM orders", conn)
df.head()
```

Other tools:

  • PostgreSQL: psycopg2
  • MySQL: mysql.connector
  • SQLAlchemy: works with most relational databases
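For databases other than SQLite, SQLAlchemy gives pandas a uniform connection layer. A minimal sketch, assuming a placeholder PostgreSQL connection string and that the psycopg2 driver is installed:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials; replace with your own connection string
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/sales")

df = pd.read_sql_query("SELECT * FROM orders", engine)
df.head()
```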

☁️ 8. Cloud-Based Data Sources

Data can also live in the cloud:

  • Google BigQuery
  • Amazon S3
  • Firebase
  • Snowflake

Example: Reading from S3 using boto3

```python
import boto3
import pandas as pd

# Download the object from S3 to a local file, then read it with pandas
s3 = boto3.client('s3')
s3.download_file('bucket-name', 'folder/file.csv', 'local_file.csv')
df = pd.read_csv('local_file.csv')
```
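If the optional s3fs package is installed and AWS credentials are configured, pandas can also read straight from an S3 path without a separate download step; a minimal sketch using the same hypothetical bucket:

```python
import pandas as pd

# Requires the s3fs package and AWS credentials available in the environment
df = pd.read_csv("s3://bucket-name/folder/file.csv")
df.head()
```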


📋 9. Considerations When Acquiring Data

| Concern | What to Watch For |
| --- | --- |
| Privacy | Is the data PII? Is consent required? |
| Bias | Is the sample representative? |
| Volume | Can your machine handle the size? |
| Refreshability | Will the data change over time? |
| Legality | Are you allowed to use/scrape this data? |


🧠 10. Best Practices for Data Acquisition

| Best Practice | Why It Helps |
| --- | --- |
| Document data sources | For reproducibility and credibility |
| Check data freshness | Especially important for time-sensitive tasks |
| Validate schema/data types | Prevents downstream bugs |
| Limit scraping frequency | Avoid getting blocked or rate-limited |
| Automate data ingestion (pipelines) | For production-scale work |
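For the "validate schema/data types" practice, a quick check right after loading catches surprises before they cause downstream bugs. A minimal sketch, assuming hypothetical expected columns for the customers file from Section 4:

```python
import pandas as pd

df = pd.read_csv("data/customers.csv")

# Hypothetical expected schema: column name -> expected dtype
expected = {"customer_id": "int64", "signup_date": "object", "monthly_spend": "float64"}

# Fail fast if any expected column is missing
missing = set(expected) - set(df.columns)
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Warn if a column arrived with an unexpected dtype
for col, dtype in expected.items():
    if str(df[col].dtype) != dtype:
        print(f"Warning: {col} has dtype {df[col].dtype}, expected {dtype}")
```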


Case Study Example: Acquiring Customer Data

Scenario:

A retail business wants to predict customer churn. You’ve been given access to internal CRM data and user interaction logs. You also need supplementary demographic data from an open API.

Breakdown:

| Source | Type | Method | Tool |
| --- | --- | --- | --- |
| CRM Database | Internal | SQL Query | psycopg2 |
| Clickstream | Internal | CSV Logs | pandas.read_csv() |
| Demographics | External | API | requests, json |
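Putting the three sources together might look like the sketch below; the connection details, file path, API URL, and the shared customer_id column are all hypothetical:

```python
import pandas as pd
import psycopg2
import requests

# 1. CRM data via SQL (placeholder credentials and table name)
conn = psycopg2.connect(host="localhost", dbname="crm", user="user", password="password")
crm = pd.read_sql_query("SELECT * FROM customers", conn)

# 2. Clickstream logs exported as CSV (hypothetical path)
clicks = pd.read_csv("data/clickstream.csv")

# 3. Demographic data from a hypothetical open API
demo = pd.DataFrame(requests.get("https://example.com/api/demographics").json())

# Join everything on the shared customer_id key (assumed to exist in all three)
dataset = crm.merge(clicks, on="customer_id", how="left").merge(demo, on="customer_id", how="left")
dataset.head()
```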


🔄 Automation for Regular Collection

Set up scheduled scripts to fetch and clean data periodically.

```bash
# Use cron (Linux) or Task Scheduler (Windows)
# Run the fetch script at the start of every hour
0 * * * * /usr/bin/python3 /home/project/scripts/fetch_data.py
```

You can also integrate tools like:

  • Apache Airflow
  • Prefect
  • Luigi
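Beyond cron, an orchestration tool adds retries, logging, and scheduling in one place. A minimal sketch in the style of Prefect 2.x, where the task names and file path are hypothetical:

```python
import pandas as pd
from prefect import flow, task


@task(retries=2)
def fetch_data() -> pd.DataFrame:
    # Hypothetical source; in practice this could be an API call or SQL query
    return pd.read_csv("data/customers.csv")


@task
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates()


@flow
def ingestion_pipeline():
    df = fetch_data()
    clean_data(df)


if __name__ == "__main__":
    ingestion_pipeline()  # can also be scheduled via a Prefect deployment
```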

📊 Summary Table: Data Acquisition Methods


| Method | Tool/Lib Used | Ideal For |
| --- | --- | --- |
| CSV/Excel | pandas | Local or exported tabular data |
| SQL Database | SQLAlchemy, SQLite | Internal databases |
| API | requests, json | Real-time or external public data |
| Web Scraping | BeautifulSoup | Unstructured webpage content |
| Cloud Buckets | boto3, gcsfs | Production-scale storage |


FAQs


1. What is the data science workflow, and why is it important?

Answer: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.

2. Do I need to follow the workflow in a strict order?

Answer: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.

3. What’s the difference between EDA and data cleaning?

Answer: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.

4. Is it okay to start modeling before completing feature engineering?

Answer: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.

5. What tools are best for building and evaluating models?

Answer: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.

6. How do I choose the right evaluation metric?

Answer: It depends on the problem:

  • For classification: accuracy, precision, recall, F1-score
  • For regression: MAE, RMSE, R²
  • Use domain knowledge to choose the metric that aligns with business goals.
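As a minimal sketch tying FAQs 5 and 6 together, assuming a generic labeled dataset with a binary target column named churn:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/customers_labeled.csv")  # hypothetical file
X = df.drop(columns=["churn"]).select_dtypes("number")  # keep numeric features for simplicity
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Precision, recall, and F1-score per class in one report
print(classification_report(y_test, model.predict(X_test)))
```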

7. What are some good deployment options for beginners?

Answer: Start with lightweight options like:

  • Streamlit or Gradio for dashboards
  • Flask or FastAPI for web APIs
  • Hosting on platforms such as Render or Heroku keeps deployment simple for small projects (see the sketch below).
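As an illustration of the web-API option, here is a minimal FastAPI sketch; the endpoint and the stand-in rule are hypothetical, and a real service would load a trained model instead:

```python
from fastapi import FastAPI

app = FastAPI()


@app.get("/predict")
def predict(monthly_spend: float):
    # Hypothetical rule standing in for a real model prediction
    churn_risk = "high" if monthly_spend < 20 else "low"
    return {"churn_risk": churn_risk}

# Run locally with: uvicorn main:app --reload  (assuming this file is main.py)
```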

8. How do I monitor a deployed model in production?

Answer: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.

9. Can I skip deployment if my goal is just learning?

Answer: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.

10. What’s the best way to practice the entire workflow?

Answer: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.