Data Science Workflow: From Problem to Solution – A Complete Step-by-Step Journey for Beginners


📗 Chapter 6: Model Building and Training

From Clean Data to Predictive Power — Training Machine Learning Models with Confidence


🧠 Introduction

After cleaning your data, engineering meaningful features, and selecting the most important ones, it’s finally time to build your model.

Model building is where your prepared data becomes a system that can make predictions on new inputs.
But it's not just about running .fit(): it's about choosing the right model, training it properly, and validating it rigorously.

In this chapter, you’ll learn:

  • How to split your dataset for training and testing
  • Which algorithms to use for different tasks
  • How to build, train, and interpret models in Python
  • Best practices for cross-validation and performance tracking
  • Hands-on examples using scikit-learn

Let’s dive into one of the most rewarding stages in the data science workflow.


🧩 1. Understanding Supervised Learning

Model training typically refers to supervised learning — where we provide both features (X) and a target (y), and the model learns to map inputs to outputs.

🔹 Two Main Types:

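Type            | Goal                             | Examples
----------------|----------------------------------|------------------------------------
Classification  | Predict class labels (discrete)  | Spam vs. not spam, churn prediction
Regression      | Predict continuous values        | House prices, salary estimation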


🧪 2. Train-Test Split

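Before training, hold out part of your data. Scoring the model on rows it never saw during training shows how well it generalizes, and keeping the test set out of every fitting step prevents data leakage.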

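python

from sklearn.model_selection import train_test_split

# Separate the features (X) from the target column (y)
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)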

  • 80% of data is used for training
  • 20% is held back for testing
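
When the classes are imbalanced, a purely random split can leave the test set with too few examples of the rare class. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The synthetic data from `make_classification` is an assumption standing in for your own `df`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the chapter's df (roughly 90% class 0, 10% class 1)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# stratify=y preserves the class ratio in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train size:", len(X_train), "Test size:", len(X_test))
print("Positive rate (train):", y_train.mean())
print("Positive rate (test): ", y_test.mean())
```

The two positive rates should be nearly identical. Without `stratify`, they can drift apart, especially on small datasets.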

3. Choosing the Right Model

Use this table as a quick reference:

Problem Type               | Common Models
---------------------------|--------------------------------------------
Binary Classification      | Logistic Regression, Random Forest, SVM
Multiclass Classification  | Decision Tree, XGBoost, KNN
Regression                 | Linear Regression, Random Forest Regressor
High Dimensional           | Lasso, Ridge, Gradient Boosting

📌 Tip:

Start simple. Use a baseline model (like LogisticRegression) before trying complex models like XGBoost or Neural Networks.
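
To make "start simple" concrete, here is a minimal sketch that compares a trivial majority-class predictor (`DummyClassifier`) with a logistic regression baseline. The synthetic dataset and its parameters are assumptions chosen only for illustration. Any more complex model you try later should beat both numbers.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, easily separable data (an assumption, not the chapter's dataset)
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=2,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Floor: always predict the most frequent class seen in training
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Simple baseline: logistic regression with default regularization
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

dummy_acc = dummy.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)
print("Majority-class accuracy:", dummy_acc)
print("Logistic regression accuracy:", baseline_acc)
```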


4. Training a Classification Model: Logistic Regression

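python

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# max_iter is raised from the default of 100 so the solver can converge on unscaled features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))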
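
Logistic regression can also report class probabilities and per-feature coefficients, which helps with interpretation. The sketch below scales the features first so the coefficients are comparable with one another. The synthetic data and the variable names `clf`, `proba`, and `coefs` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# The scaler is fitted on the training data only, then applied to the test data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test)
print("First three probability rows:\n", proba[:3])

# Positive coefficients push predictions toward class 1, negative toward class 0
coefs = clf.named_steps["logisticregression"].coef_[0]
print("Coefficients:", coefs.round(2))
```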


🌲 5. Training a Tree-Based Model: Random Forest

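python

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 100 trees; random_state fixes the seed so results are reproducible
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))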

Advantages:

  • Captures non-linear relationships between features and the target
  • No need for feature scaling
  • Works with both numerical and categorical data (categorical columns must be encoded as numbers first)
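
A fitted random forest also exposes `feature_importances_`, a quick way to see which inputs the trees relied on most. A minimal sketch on synthetic data follows. The `feature_0` to `feature_5` names are hypothetical placeholders for your own column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names (assumptions)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances: non-negative values that sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

These importances are relative scores, not causal effects, so treat them as a starting point for investigation.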

📈 6. Training a Regression Model

Linear Regression

python

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Here y must be a continuous target (e.g. a price), not class labels
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
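
MSE is expressed in squared units of the target, which makes it hard to read. The sketch below also computes RMSE (same units as the target), MAE, and R² on a synthetic regression problem. The data from `make_regression` is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear data with some noise (an assumption)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                        # back in the units of y
mae = mean_absolute_error(y_test, y_pred)  # less sensitive to large errors
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  MAE: {mae:.2f}  R2: {r2:.3f}")
```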


🔄 7. Cross-Validation

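Instead of relying on a single train-test split, k-fold cross-validation splits the data into k folds, trains on k-1 of them, validates on the remaining fold, and repeats until every fold has served once as the validation set.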

python

from sklearn.model_selection import cross_val_score

# cv=5 trains and scores the model five times, once per held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Cross-validated accuracy:", scores.mean())

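  • Helps you detect overfitting, because the score no longer depends on one lucky or unlucky split
  • Provides a more stable performance estimate than a single train-test split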
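
If your model needs preprocessing such as scaling, put the preprocessing inside a pipeline so it is refitted on each training fold and never sees the validation fold. A minimal sketch, assuming synthetic data and a shuffled `StratifiedKFold`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Scaling happens inside each fold, so no information leaks from validation rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print("Fold scores:", scores.round(3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold.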

🧠 8. Hyperparameter Tuning (Preview)

Every model has settings that control its behavior — called hyperparameters.

You can optimize them using GridSearchCV:

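python

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Every combination of these values is tried (3 x 2 = 6 candidates)
params = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5]}

# cv=3 scores each candidate with 3-fold cross-validation on the training set
grid = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=3)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)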
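
After the search, `GridSearchCV` refits the best configuration on the whole training set (its default `refit=True` behavior) and exposes it as `best_estimator_`. The sketch below, on synthetic data with a smaller grid to keep it fast (both assumptions), shows how to read the cross-validated score and then check the tuned model once on the untouched test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

params = {"max_depth": [3, 5], "min_samples_split": [2, 5]}
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    params, cv=3, scoring="accuracy",
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best cross-validated accuracy:", round(grid.best_score_, 3))

# best_estimator_ was refitted on all of X_train with the best parameters
best_model = grid.best_estimator_
test_acc = best_model.score(X_test, y_test)
print("Test accuracy:", round(test_acc, 3))
```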


📊 9. Comparing Models with Metrics

Use different metrics based on the task.

Problem Type    | Metrics
----------------|----------------------------------------
Classification  | Accuracy, Precision, Recall, F1-score
Regression      | MSE, RMSE, R², MAE

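python

from sklearn.metrics import classification_report

# y_pred must hold class predictions, e.g. from model.predict(X_test)
# Prints per-class precision, recall, F1-score, and support
print(classification_report(y_test, y_pred))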
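
The numbers in a classification report all derive from four counts: true negatives, false positives, false negatives, and true positives. Here is a small worked example with hand-made labels (the `y_true` and `y_pred` lists are made up for illustration) so you can verify the arithmetic yourself.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Ten hand-made labels: 4 actual positives, 6 actual negatives
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 3 / 4
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 3 / 4
print(precision, recall)  # 0.75 0.75
```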


🧪 10. Model Evaluation Table (Example)

Model                | Accuracy | Precision | Recall | F1-score
---------------------|----------|-----------|--------|---------
Logistic Regression  | 0.82     | 0.81      | 0.79   | 0.80
Random Forest        | 0.85     | 0.84      | 0.83   | 0.83
SVM                  | 0.83     | 0.82      | 0.80   | 0.81
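
A table like this can be produced programmatically. The sketch below is one possible approach, not the method used to generate the example numbers above: it runs `cross_validate` with several metrics for each model on synthetic data and collects the mean scores in a DataFrame. The dataset and the model settings are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

# Scale-sensitive models get a StandardScaler in front
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}
metrics = ["accuracy", "precision", "recall", "f1"]

rows = {}
for name, model in models.items():
    # cross_validate returns one array per metric under the key "test_<metric>"
    cv_results = cross_validate(model, X, y, cv=5, scoring=metrics)
    rows[name] = {m: cv_results[f"test_{m}"].mean() for m in metrics}

# One row per model, one column per metric
results = pd.DataFrame(rows).T[metrics].round(2)
print(results)
```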


💾 11. Saving and Reusing Models

Once you’ve trained your model, you can save it for future predictions.

python

import joblib

# Save the trained model to disk
joblib.dump(model, 'model.pkl')

# Later (for example in another script), load it back
model = joblib.load('model.pkl')
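
A quick sanity check after saving is to reload the file and confirm that the loaded model makes the same predictions. A minimal sketch, assuming synthetic data and a temporary directory so nothing is left behind on disk:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save and reload inside a temporary directory that is cleaned up automatically
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# The reloaded model should behave exactly like the original
same_predictions = np.array_equal(model.predict(X), loaded.predict(X))
print("Predictions match after reload:", same_predictions)
```

Because joblib files are pickle-based, only load model files from sources you trust, and load them with the same scikit-learn version that saved them where possible.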


Full Code Workflow Example

python

# Full workflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load data (assumes the file is already cleaned and every column is numeric)
df = pd.read_csv('titanic_clean.csv')

X = df.drop('Survived', axis=1)
y = df['Survived']

# Split: stratify keeps the survival rate the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


📋 Summary Table: Model Building Steps


Step              | Tool / Function
------------------|------------------------------------------------
Train/test split  | train_test_split()
Choose model      | LogisticRegression(), RandomForestClassifier()
Train model       | .fit()
Predict           | .predict()
Evaluate          | accuracy_score(), classification_report()
Cross-validation  | cross_val_score()
Save model        | joblib.dump()


FAQs


1. What is the data science workflow, and why is it important?

Answer: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.

2. Do I need to follow the workflow in a strict order?

Answer: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.

3. What’s the difference between EDA and data cleaning?

Answer: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.

4. Is it okay to start modeling before completing feature engineering?

Answer: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.

5. What tools are best for building and evaluating models?

Answer: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.

6. How do I choose the right evaluation metric?

Answer: It depends on the problem:

  • For classification: accuracy, precision, recall, F1-score
  • For regression: MAE, RMSE, R²
  • Use domain knowledge to choose the metric that aligns with business goals.
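
A quick illustration of why the metric matters: on imbalanced data, a model that never predicts the rare class can still score high accuracy. The labels below are made up for this example.

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 negatives and 5 positives; the "model" predicts negative every time
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print("Accuracy:", accuracy)  # 0.95
print("Recall:", recall)      # 0.0, because none of the 5 positives was found
```

If missing a positive case is costly (fraud, disease, churn), recall or F1-score is a better guide than accuracy here.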

7. What are some good deployment options for beginners?

Answer: Start with lightweight options like:

  • Streamlit or Gradio for dashboards
  • Flask or FastAPI for web APIs
  • Hosting platforms such as Render or Heroku are straightforward for small projects; check their current pricing, since free tiers change over time.

8. How do I monitor a deployed model in production?

Answer: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.

9. Can I skip deployment if my goal is just learning?

Answer: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.

10. What’s the best way to practice the entire workflow?

Answer: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.