Train, Test, and Evaluate Your First Real Machine Learning Model in Python
🧠 Introduction
You’ve explored your data, cleaned it, engineered meaningful features, and selected the best ones. Now it’s time for the exciting part: building your first predictive model!
In this chapter, you’ll walk through preparing your dataset, splitting it into train and test sets, training a first model, evaluating it, cross-validating, tuning hyperparameters, and saving the result.
Whether you’re building a classification model to predict survival on the Titanic or a regression model to estimate house prices, this is where your dataset starts providing answers.
🔮 1. What Is a Predictive Model?
A predictive model learns from historical data and makes predictions on new, unseen data.
🔸 Two Most Common Types:
Task | Goal | Example |
Classification | Predict discrete class labels | Spam detection, disease diagnosis |
Regression | Predict continuous numeric values | House price, temperature forecast |
📦 2. Preparing Your Dataset
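Make sure the features are numeric, missing values are handled, and the target column is separated from the features.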
▶ Example Dataset Setup (Titanic-style):
python
X = df.drop('Survived', axis=1)  # Features
y = df['Survived']               # Target
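Scikit-learn estimators such as LogisticRegression expect numeric features without missing values, so a quick check before splitting can save a confusing error later. Here is a minimal sketch, assuming a tiny made-up DataFrame in place of the real Titanic data; the helper name `check_ready` and the column values are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a cleaned Titanic-style dataset.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 26.0],
    "Survived": [0, 1, 1, 0],
})


def check_ready(frame: pd.DataFrame, target: str) -> dict:
    """Report common problems that stop an estimator from fitting."""
    features = frame.drop(target, axis=1)
    return {
        # Total count of NaN cells across all feature columns
        "missing_values": int(features.isna().sum().sum()),
        # Columns that still need encoding (strings, categories, ...)
        "non_numeric_columns": list(
            features.select_dtypes(exclude="number").columns
        ),
        "n_features": features.shape[1],
    }


report = check_ready(df, "Survived")
print(report)
```

If `missing_values` is above zero or `non_numeric_columns` is not empty, go back to the cleaning and encoding steps before training.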
🔀 3. Splitting into Train and Test Sets
Use 80% of the data to train and the remaining 20% to test performance. Setting random_state makes the split reproducible.
python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
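When one class is much rarer than the other, a purely random split can leave the test set with a different class balance than the training set. The sketch below uses the `stratify` argument of `train_test_split` to keep the proportions aligned. The stratified variant is not part of the original example, and the synthetic data (100 rows, 30% positives) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([1] * 30 + [0] * 70)

# stratify=y asks for the same class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Positive rate (train):", y_train.mean())
print("Positive rate (test):", y_test.mean())
```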
🔍 4. Choosing Your First Algorithm
Start simple. We'll focus on classification using Logistic Regression and a Decision Tree.
✅ 5. Logistic Regression (Classification Example)
python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train, y_train)      # Learn from the training data
preds = model.predict(X_test)    # Predict labels for unseen data
print("Accuracy:", accuracy_score(y_test, preds))
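Logistic regression can also report a probability for each class, and its solver tends to converge more reliably when features are on a similar scale. The following is a sketch under assumptions, not the tutorial's own setup: it uses synthetic data from `make_classification`, adds a `StandardScaler` step, and raises `max_iter`, none of which appear in the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption, stands in for Titanic).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale first, then fit; max_iter raises the solver's iteration cap.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # columns: P(class 0), P(class 1)
preds = clf.predict(X_test)

print("First five P(class 1):", proba[:5, 1].round(3))
print("Test accuracy:", clf.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into training.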
🌳 6. Decision Tree Classifier
python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X_train, y_train)
tree_preds = tree.predict(X_test)
print("Tree Accuracy:", accuracy_score(y_test, tree_preds))
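One reason to try a decision tree early is that you can read the rules it learned. A small sketch using scikit-learn's `export_text`, assuming synthetic data and hypothetical feature names (`f0` to `f3`); the shallower `max_depth=2` is chosen here only to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with four features (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical column names

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Render the fitted tree as indented if/else style text rules.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```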
📈 7. Evaluation Metrics for Classification
▶ Accuracy Score
python
from sklearn.metrics import accuracy_score

print("Accuracy:", accuracy_score(y_test, preds))
▶ Confusion Matrix
python
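from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, preds)  # rows: actual class, columns: predicted class
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()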
Prediction Type | Meaning |
True Positive | Correctly predicted 1 |
False Positive | Predicted 1 but actual is 0 |
False Negative | Predicted 0 but actual is 1 |
True Negative | Correctly predicted 0 |
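To connect the four prediction types to actual numbers, you can unpack the confusion matrix directly. For binary labels 0 and 1, scikit-learn orders the flattened matrix as TN, FP, FN, TP. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```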
▶ Precision, Recall, F1-Score
python
from sklearn.metrics import classification_report

print(classification_report(y_test, preds))
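The numbers in the classification report come straight from the confusion matrix counts: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them by hand on made-up labels and checks the results against scikit-learn's own functions.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels (assumption): 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of everything predicted 1, how much was right
recall = tp / (tp + fn)     # of everything actually 1, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f} (sklearn: {precision_score(y_true, y_hat):.3f})")
print(f"Recall:    {recall:.3f} (sklearn: {recall_score(y_true, y_hat):.3f})")
print(f"F1:        {f1:.3f} (sklearn: {f1_score(y_true, y_hat):.3f})")
```

Here accuracy is 0.7, while precision (0.6) and recall (0.75) tell you which kind of mistake the model makes more often.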
📊 8. Evaluation Metrics for Regression
If you’re predicting a numeric value (e.g. house price):
python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# X_train/y_train here come from a regression dataset with a numeric target
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
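MSE is expressed in squared units of the target, which makes it hard to read. Taking its square root gives RMSE, which is in the same units as the value you predict. Here is a self-contained sketch on synthetic data; the linear relationship `y = 3x + 5` plus noise is an assumption made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption): y = 3x + 5 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # back in the target's own units
r2 = r2_score(y_test, y_pred)

print(f"Learned slope: {lr.coef_[0]:.2f}, intercept: {lr.intercept_:.2f}")
print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```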
🧪 9. Cross-Validation (Optional but Useful)
Cross-validation averages the score over several train/test splits, which gives a more stable estimate of model performance than a single split.
python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Cross-validated accuracy:", scores.mean())
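The mean alone hides how much the score moves between folds. The sketch below also prints each fold's accuracy and the standard deviation, and it passes an explicit `StratifiedKFold` with shuffling so the folds are reproducible. The synthetic data and the shuffled splitter are assumptions, not part of the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

model = LogisticRegression(max_iter=1000)

# Five folds, shuffled once with a fixed seed, class balance kept per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

A large standard deviation relative to the mean suggests the estimate depends heavily on which rows land in which fold.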
⚙️ 10. Hyperparameter Tuning
Improve your model by searching over candidate hyperparameter values and keeping the combination with the best cross-validated score.
python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {'max_depth': [3, 5, 7, 10]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid=params, cv=5)
grid.fit(X_train, y_train)
print("Best depth:", grid.best_params_['max_depth'])
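A grid search can cover more than one hyperparameter, and the winning model should still be checked on the held-out test set, since the cross-validated score was used to pick it. A sketch under assumptions: synthetic data, and an extra `min_samples_leaf` grid that is not in the example above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every combination of these values is tried: 4 x 3 = 12 candidates.
params = {"max_depth": [3, 5, 7, 10], "min_samples_leaf": [1, 5, 10]}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=params,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

test_accuracy = grid.best_estimator_.score(X_test, y_test)

print("Best params:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")
print(f"Test accuracy of best model: {test_accuracy:.3f}")
```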
🔁 11. Save and Reload Your Model
After training, you can save your model to disk and reload it later:
python
import joblib

joblib.dump(model, 'my_model.pkl')    # Save the trained model to disk
model = joblib.load('my_model.pkl')   # Reload it later
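A quick way to trust a saved model is to confirm that the reloaded copy makes the same predictions as the original. This sketch assumes synthetic data and writes to a temporary directory so it leaves no files behind; in a real project you would use a path of your own.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a freshly trained model (assumption).
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_model.pkl")
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back as a new object

same = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Only load model files you trust: these files are pickle-based, and unpickling can run arbitrary code.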
📋 12. Common Classification Models Overview
Model | When to Use | Scikit-learn Class |
Logistic Regression | Binary classification | LogisticRegression |
Decision Tree | Interpretable rules | DecisionTreeClassifier |
Random Forest | Strong performance with less tuning | RandomForestClassifier |
K-Nearest Neighbors | Simple, distance-based | KNeighborsClassifier |
SVM | High-dimensional datasets | SVC |
XGBoost | Competitive, boosting-based | xgboost.XGBClassifier (external) |
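Because these scikit-learn classes share the same `fit`/`predict` interface, you can compare several of them in a single loop. A rough sketch on synthetic data: the dataset, the hyperparameter values, and the choice of 5-fold accuracy are assumptions, and XGBoost is left out because it is a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM": SVC(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
results = {
    name: cross_val_score(est, X, y, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {acc:.3f}")
```

The ranking on your own dataset may look quite different, so treat a comparison like this as a starting point rather than a verdict.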
🧠 13. Model Selection Tips
Scenario | Suggested Model |
Predict yes/no outcome | Logistic Regression |
Dataset has lots of noise | Random Forest, or a Decision Tree with limited depth |
Very few features, linearly separable | Logistic Regression or SVM |
Need explainable predictions | Decision Tree |
📦 14. Summary Table: Model Workflow
Step | Tool/Method |
Split data | train_test_split() |
Train model | model.fit() |
Predict outcomes | model.predict() |
Evaluate accuracy | accuracy_score() |
Visualize confusion matrix | confusion_matrix(), heatmap |
Score regression | mean_squared_error(), r2_score() |
Tune model | GridSearchCV |
Save model | joblib.dump() |
✅ Final Code Snippet: Titanic Logistic Regression Example
python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('titanic_clean.csv')
X = df.drop('Survived', axis=1)
y = df['Survived']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
preds = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, preds))
print(classification_report(y_test, preds))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, preds), annot=True, fmt='d')
plt.title("Confusion Matrix")
plt.show()
Answer: Not at all. Basic knowledge of statistics is helpful, but you can start your first project with a beginner-friendly dataset and learn concepts like mean, median, correlation, and regression as you go.
Answer: Python is the most popular and beginner-friendly choice, thanks to its simplicity and powerful libraries like Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.
Answer: Keep it small and manageable — one target variable, 3–6 features, and under 10,000 rows of data. Focus more on understanding the process than building a complex model.
Answer: Yes, but keep it simple. Start with linear regression, logistic regression, or decision trees. Avoid deep learning or complex models until you're more confident.
Answer: Absolutely! A well-documented project with clear insights, code, and visualizations is a great way to show employers that you understand the end-to-end data science process.