From Clean Data to Predictive Power: Training Machine Learning Models with Confidence
🧠 Introduction
After cleaning your data, engineering meaningful features,
and selecting the most important ones, it’s finally time to build your model.
Model building is where data science transforms into
intelligent automation.
But it's not just about running .fit() — it’s about choosing the right model,
training it properly, and validating it rigorously.
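In this chapter, you’ll learn how to split your data, choose and train classification and regression models, cross-validate them, tune hyperparameters, compare models with metrics, and save a trained model for reuse.
Let’s dive into one of the most rewarding stages in the data science workflow.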
🧩 1. Understanding Supervised Learning
Model training typically refers to supervised learning
— where we provide both features (X) and a target (y), and the
model learns to map inputs to outputs.
🔹 Two Main Types:
Type | Goal | Examples
Classification | Predict class labels (discrete) | Spam vs. not spam, churn prediction
Regression | Predict continuous values | House prices, salary estimation
🧪 2. Train-Test Split
Before training, split your data to avoid data leakage and
to measure how your model generalizes.
python
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
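Splitting alone does not prevent leakage if preprocessing is fitted on the full dataset first. The sketch below is one way to keep the test set untouched; it assumes synthetic data from make_classification as a stand-in for df, and adds a StandardScaler step and stratify=y that are not part of the original snippet.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df (assumption: replace with your own features and target)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# stratify=y keeps the class ratio roughly the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training rows only,
# so no statistics from the test rows leak into training
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Test accuracy:", round(test_accuracy, 3))
```

Wrapping preprocessing and model in one pipeline also means the same object can later be passed to cross_val_score or GridSearchCV without refitting the scaler by hand.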
⚙️ 3. Choosing the Right Model
Use this table as a quick reference:
Problem Type | Common Models
Binary Classification | Logistic Regression, Random Forest, SVM
Multiclass Classification | Decision Tree, XGBoost, KNN
Regression | Linear Regression, Random Forest Regressor
High Dimensional | Lasso, Ridge, Gradient Boosting
📌 Tip:
Start simple. Use a baseline model (like LogisticRegression)
before trying complex models like XGBoost or Neural Networks.
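To make "start simple" concrete, here is a minimal sketch that compares a trivial majority-class predictor with LogisticRegression. The synthetic dataset and the use of DummyClassifier are assumptions added for illustration; they are not part of the original chapter.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced binary problem (assumption: stands in for real data)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Floor: always predict the most frequent class seen in training
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Simple baseline model
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

dummy_acc = dummy.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)
print("Majority-class accuracy:", round(dummy_acc, 3))
print("Logistic regression accuracy:", round(baseline_acc, 3))
```

A more complex model is worth its extra cost only if it clearly beats both numbers.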
✅ 4. Training a Classification Model: Logistic Regression
python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
🌲 5. Training a Tree-Based Model: Random Forest
python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train an ensemble of 100 decision trees
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
▶ Advantages:
📈 6. Training a Regression Model
▶ Linear Regression
python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
🔄 7. Cross-Validation
Instead of just a train-test split, you can split into k
folds and validate across all of them.
python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Cross-validated accuracy:", scores.mean())
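To show what "split into k folds" means in practice, the sketch below writes the fold loop by hand and checks it against cross_val_score. The synthetic data and the explicit StratifiedKFold splitter are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data (assumption: stands in for X and y from the chapter)
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
model = LogisticRegression(max_iter=1000)

# Fixed random_state so every call to cv.split yields the same 5 folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_model = clone(model)                   # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on 4 folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # validate on the 5th

auto_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Manual fold accuracies:", np.round(manual_scores, 3))
print("cross_val_score result:", np.round(auto_scores, 3))
```

Each row is used for validation exactly once, which is why the mean over folds is usually a steadier estimate than a single train-test split.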
🧠 8. Hyperparameter Tuning (Preview)
Every model has settings that control its behavior — called hyperparameters.
You can optimize them using GridSearchCV:
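python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values to try for each hyperparameter
params = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5]}

# Evaluate every combination with 3-fold cross-validation
grid = GridSearchCV(RandomForestClassifier(), params, cv=3)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)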
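After the search finishes, the fitted GridSearchCV object holds more than best_params_. The following sketch, run on synthetic data with a smaller forest (both assumptions chosen to keep it fast), shows how the winning model might be checked on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption: stands in for the chapter's X_train / X_test)
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

params = {"max_depth": [3, 5, 10], "min_samples_split": [2, 5]}
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=7),
    params,
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# 3 values of max_depth x 2 values of min_samples_split = 6 candidates
n_candidates = len(grid.cv_results_["params"])

# With the default refit=True, best_estimator_ is retrained on all of X_train
best_model = grid.best_estimator_
test_acc = best_model.score(X_test, y_test)

print("Candidates tried:", n_candidates)
print("Best CV accuracy:", round(grid.best_score_, 3))
print("Test accuracy of best model:", round(test_acc, 3))
```

The test set plays no part in the search itself, so test_acc remains an honest estimate of how the tuned model generalizes.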
📊 9. Comparing Models with Metrics
Use different metrics based on the task.
Problem Type | Metrics
Classification | Accuracy, Precision, Recall, F1-score
Regression | MSE, RMSE, R², MAE
python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
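The regression metrics in the table can be computed in a similar way. This is a small sketch on four made-up values (the numbers are hypothetical and chosen so the arithmetic is easy to check by hand).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions; errors are -1, 0, 1, 2
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(y_true, y_hat)   # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                       # square root of MSE, same units as the target
mae = mean_absolute_error(y_true, y_hat)  # (1 + 0 + 1 + 2) / 4 = 1.0
r2 = r2_score(y_true, y_hat)              # 1 - 6 / 20 = 0.7

print("MSE:", mse)
print("RMSE:", round(rmse, 4))
print("MAE:", mae)
print("R²:", round(r2, 4))
```

RMSE and MAE are in the target's own units, which often makes them easier to explain than MSE.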
🧪 10. Model Evaluation Table (Example)
Model | Accuracy | Precision | Recall | F1-score
Logistic Regression | 0.82 | 0.81 | 0.79 | 0.80
Random Forest | 0.85 | 0.84 | 0.83 | 0.83
SVM | 0.83 | 0.82 | 0.80 | 0.81
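A table like the one above could be produced with a loop over candidate models. The sketch below assumes synthetic data and default-ish settings, so its numbers will not match the example values shown in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem (assumption: replace with your own split)
X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=3),
    "SVM": SVC(),
}

rows = []
for name, clf in models.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, pred),
        "Precision": precision_score(y_test, pred),
        "Recall": recall_score(y_test, pred),
        "F1-score": f1_score(y_test, pred),
    })

# Print in the same pipe-separated layout as the table
print("Model | Accuracy | Precision | Recall | F1-score")
for row in rows:
    print(f"{row['Model']} | {row['Accuracy']:.2f} | {row['Precision']:.2f} | "
          f"{row['Recall']:.2f} | {row['F1-score']:.2f}")
```

Evaluating every model on the same split keeps the comparison fair; for small datasets, cross-validated scores would be a steadier basis.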
💾 11. Saving and Reusing Models
Once you’ve trained your model, you can save it for future predictions.
python
import joblib

joblib.dump(model, 'model.pkl')
model = joblib.load('model.pkl')
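A quick way to confirm that a saved model behaves like the original is a round trip check. This sketch assumes synthetic data and writes to a temporary directory rather than the working folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small model (assumption: stands in for your trained model)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)      # serialize the fitted model to disk
    restored = joblib.load(path)  # load it back as a new object

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```

Only load model files from sources you trust, since joblib files are pickle-based, and load them with the same scikit-learn version that saved them where possible.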
✅ Full Code Workflow Example
python
# Full workflow
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import pandas as pd

# Load data
df = pd.read_csv('titanic_clean.csv')
X = df.drop('Survived', axis=1)
y = df['Survived']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Train
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
📋 Summary Table: Model Building Steps
Step | Tool / Function
Train/test split | train_test_split()
Choose model | LogisticRegression(), RandomForestClassifier()
Train model | .fit()
Predict | .predict()
Evaluate | accuracy_score(), classification_report()
Cross-validation | cross_val_score()
Save model | joblib.dump()
Answer: The data science workflow is a structured step-by-step process used to turn raw data into actionable insights or solutions. It ensures clarity, efficiency, and reproducibility from problem definition to deployment.
Answer: Not necessarily. While there is a general order, data science is iterative. You may go back and forth between stages (like EDA and feature engineering) as new insights emerge.
Answer: Data cleaning prepares the dataset by fixing errors and inconsistencies, while EDA explores the data to find patterns, trends, and relationships to inform modeling decisions.
Answer: You can build a baseline model early, but robust feature engineering often improves performance significantly. It's best to iterate and refine after EDA and feature transformations.
Answer: Popular tools include Python libraries like scikit-learn, XGBoost, LightGBM, and TensorFlow for building models, and metrics functions within sklearn.metrics for evaluation.
Answer: It depends on the problem:
Answer: Start with lightweight options like:
Answer: Use logging for predictions, track performance metrics over time, and set alerts for significant drops. Tools like MLflow, Prometheus, and AWS CloudWatch are commonly used.
Answer: Yes. For learning or portfolio-building, it's okay to stop after model evaluation. But deploying at least one model enhances your understanding of real-world applications.
Answer: Choose a simple dataset (like Titanic or housing prices), go through every workflow step end-to-end, and document your process. Repeat with different types of problems to build experience.